Learning to Classify from Impure Samples with High-Dimensional Data

A persistent challenge in practical classification tasks is that labeled training sets are not always available. In particle physics, this challenge is surmounted by the use of simulations. These simulations accurately reproduce most features of data, but cannot be trusted to capture all of the complex correlations exploitable by modern machine learning methods. Recent work in weakly supervised learning has shown that simple, low-dimensional classifiers can be trained using only the impure mixtures present in data. Here, we demonstrate that complex, high-dimensional classifiers can also be trained on impure mixtures using weak supervision techniques, with performance comparable to what could be achieved with pure samples. Using weak supervision will therefore allow us to avoid relying exclusively on simulations for high-dimensional classification. This work opens the door to a new regime whereby complex models are trained directly on data, providing direct access to probe the underlying physics.

Data analysis methods at the Large Hadron Collider (LHC) rely heavily on simulations. These simulations are generally excellent and allow us to explore the mapping between truth information (particles from collisions) and observables (tracks or calorimeter deposits). In particular, simulations let us train complex algorithms to extract the truth information from the observables. Machine learning methods trained on low-level inputs have been developed for collider physics [1] to identify boosted W/Z/H bosons [2][3][4][5][6][7][8], top quarks [9][10][11][12][13], b-quarks [14][15][16], and light quarks [17][18][19][20], for removing pileup [21], and for generating fragmentation and calorimeter showers [22][23][24]. These new methods achieve excellent performance by exploiting subtle features of the simulations, which are presumed to be similar to the features in the data. Unfortunately, the simulations are known to be imperfect. For instance, the particle multiplicity and radiation distributions within quark- and gluon-initiated jets are known to be quite variable among different simulations and between simulations and data [25][26][27]. In addition, non-negligible corrections ("scale factors") are required experimentally (see e.g. Refs. [26][28][29][30][31][32][33][34]). Thus it is natural to question the performance of machine learning algorithms trained on simulations: how do we know they are not just learning unphysical artifacts of the simulation? This objection certainly has merit, as the power of these methods for physics applications stems precisely from their ability to find features we do not fully understand and cannot easily interpret.
Data-driven approaches avoid the pitfalls of relying on simulations in experimental analyses. For simple observables, such as the invariant mass of a photon pair, a traditional experimental approach has been to perform sideband fits directly to the data. This avoids relying on the simulation altogether. Unfortunately, most of the sophisticated discrimination techniques developed in recent years use full supervision, where truth information is needed in order to train the classifier. However, real data generally consist only of mixed samples without truth information, arising from underlying statistical or quantum mixtures of signal and background. Occasionally one can find a small region of phase space where the signal or background is pure, but these regions are generally sparsely populated and may not produce representative distributions. Recent work on weak supervision [35] allows classifiers to be trained using only the information available from mixed samples. Two weakly supervised paradigms tailored to physics applications are Learning from Label Proportions (LLP) [36] and Classification Without Labels (CWoLa) [37]. Ref. [36] considered the problem of quark/gluon (q/g) jet discrimination using three standard jet observables and showed how to achieve fully supervised discrimination power by using LLP with two samples of different but known quark jet fractions. In Ref. [37], it was shown that the proportions are not necessary for training since the likelihood ratio of the mixed samples is monotonically related to the signal/background likelihood ratio, the optimal binary classifier for signal vs. background.
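The monotonic relation quoted from Ref. [37] can be sketched in a few lines. The notation here is an assumption made for illustration: two mixtures $M_1$ and $M_2$ with signal fractions $f_1 > f_2$, each a weighted sum of the same pure distributions $p_S$ and $p_B$, and $L = p_S/p_B$ the signal/background likelihood ratio.

```latex
\begin{align}
L_{M_1/M_2}(x) &= \frac{p_{M_1}(x)}{p_{M_2}(x)}
  = \frac{f_1\, p_S(x) + (1 - f_1)\, p_B(x)}{f_2\, p_S(x) + (1 - f_2)\, p_B(x)}
  = \frac{f_1\, L(x) + (1 - f_1)}{f_2\, L(x) + (1 - f_2)}, \\
\frac{\partial L_{M_1/M_2}}{\partial L} &= \frac{f_1 (1 - f_2) - f_2 (1 - f_1)}{\bigl(f_2 L + 1 - f_2\bigr)^2}
  = \frac{f_1 - f_2}{\bigl(f_2 L + 1 - f_2\bigr)^2} > 0 .
\end{align}
```

Since the derivative is positive whenever $f_1 > f_2$, thresholding the mixed-sample likelihood ratio selects the same regions of feature space as thresholding $L$ itself, which is why the fractions are not needed for training.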
One potential objection to the weak-learning demonstrations in Refs. [36][37][38] is that the dimensionality of the inputs used is small. Indeed, for a one-dimensional discriminant, such as the jet mass, one can extract the exact pure distributions from mixed samples using the fractions. It is not obvious that weak supervision will succeed when trained on high-dimensional inputs where the feature space may be sparsely populated. Indeed, the most powerful modern methods are trained on high-dimensional, low-level inputs, where numerically approximating and weighting the probability distribution is completely intractable.
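For the one-dimensional case mentioned above, the extraction of the pure distributions amounts to solving a two-by-two linear system in each histogram bin. The following is a minimal sketch, assuming two mixtures with known and distinct signal fractions; the function name `unmix_histograms` and the toy histogram values are hypothetical and chosen only for illustration.

```python
import numpy as np

def unmix_histograms(hist_m1, hist_m2, f1, f2):
    """Recover pure signal/background histograms from two mixed histograms.

    Assumes each mixed histogram is f * p_S + (1 - f) * p_B in every bin,
    with known signal fractions f1 != f2.
    """
    hist_m1 = np.asarray(hist_m1, dtype=float)
    hist_m2 = np.asarray(hist_m2, dtype=float)
    if np.isclose(f1, f2):
        raise ValueError("the two signal fractions must differ")
    # Linear combinations that cancel the background (signal) term.
    p_s = ((1.0 - f2) * hist_m1 - (1.0 - f1) * hist_m2) / (f1 - f2)
    p_b = (f1 * hist_m2 - f2 * hist_m1) / (f1 - f2)
    return p_s, p_b

# Toy normalized histograms (hypothetical numbers, not from the study).
p_s_true = np.array([0.1, 0.2, 0.3, 0.4])
p_b_true = np.array([0.4, 0.3, 0.2, 0.1])
f1, f2 = 0.8, 0.2
m1 = f1 * p_s_true + (1 - f1) * p_b_true
m2 = f2 * p_s_true + (1 - f2) * p_b_true
p_s_rec, p_b_rec = unmix_histograms(m1, m2, f1, f2)
```

The number of bins needed to populate a histogram grows exponentially with the input dimension, so this bin-by-bin inversion does not carry over to inputs such as images.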
In this paper, we demonstrate that weak supervision can approach the effectiveness of full supervision on complex models with high-dimensional inputs. As a concrete illustration, we use jet images [2] with convolutional neural networks (CNNs) applied to quark versus gluon jet tagging, where the dimensionality of the inputs is $\mathcal{O}(1000)$ and simulation mis-modeling issues are a challenge [26][39][40][41][42][43]. We find that CWoLa more robustly generalizes to learning with high-dimensional inputs than LLP, with the latter requiring careful engineering choices to achieve comparable performance. Though we use a particle physics problem as an example, the lessons about learning from data using mixtures of signal and background are applicable more broadly.

arXiv:1801.10158v1 [hep-ph] 30 Jan 2018
We begin by establishing some notation and formulating the problem. Let x represent a vector of observables (features) useful for discriminating two classes we call signal (S) and background (B). For example, x might be the momenta of observed particles, calorimeter energy deposits, or a complete set of observables [7,8]. In fully supervised learning, each training sample is assigned a truth label such as 1 for signal and 0 for background. Then the fully supervised model is trained to predict the correct labels for each training example by minimizing a loss function. For a sufficiently large training set, an appropriate model parameterization, and a suitable minimization procedure, the learned model should approach the optimal classifier defined by thresholding the likelihood ratio.
Data collected from a real detector do not come with signal/background labels. Instead, one typically has two or more mixtures $M_a$ of signal and background with different signal fractions $f_a$, such that the distribution of the features, $p_{M_a}(x)$, is given by:

$$p_{M_a}(x) = f_a\, p_S(x) + (1 - f_a)\, p_B(x), \tag{1}$$

where $p_S$ and $p_B$ are the signal and background distributions, respectively. Weak supervision assumes sample independence, i.e., that Eq. 1 holds with the same distributions $p_S(x)$ and $p_B(x)$ for all mixtures. Although in most situations sample independence does not hold perfectly (see e.g. Ref. [44]), it is often a very good approximation (cf. Table II below).

LLP uses any fully supervised classification method and modifies the loss function to globally match the signal fraction predicted by the model on a batch of training samples to the known truth fractions $f_a$. Breaking the training set into batches, normally done to parallelize training, takes on a new significance with LLP since the loss function is evaluated globally on each batch. The batch size, which for LLP we define as the number of samples drawn from each mixture during one update of the model, is a critical hyperparameter of LLP.
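To make the mixture picture concrete, the following is a minimal sketch of how a sample with a chosen signal fraction could be assembled from pure pools, as is possible in simulation. The function name `make_mixture` and the Gaussian toy features are hypothetical; the truth flags are returned only so that a study can evaluate performance afterwards, and would not be available in real data.

```python
import numpy as np

def make_mixture(signal, background, f, n, rng):
    """Draw n events with signal fraction f from pure signal/background pools."""
    signal = np.asarray(signal)
    background = np.asarray(background)
    n_sig = int(round(f * n))
    n_bkg = n - n_sig
    # Sample without replacement so no event is used twice within a mixture.
    sig_idx = rng.choice(len(signal), size=n_sig, replace=False)
    bkg_idx = rng.choice(len(background), size=n_bkg, replace=False)
    x = np.concatenate([signal[sig_idx], background[bkg_idx]])
    is_signal = np.concatenate([np.ones(n_sig, dtype=bool),
                                np.zeros(n_bkg, dtype=bool)])
    # Shuffle so that the ordering carries no class information.
    perm = rng.permutation(n)
    return x[perm], is_signal[perm]

# Toy pools with three features per event (hypothetical distributions).
rng = np.random.default_rng(0)
signal_pool = rng.normal(+1.0, 1.0, size=(5000, 3))
background_pool = rng.normal(-1.0, 1.0, size=(5000, 3))
x1, truth1 = make_mixture(signal_pool, background_pool, 0.2, 1000, rng)
x2, truth2 = make_mixture(signal_pool, background_pool, 0.8, 1000, rng)
```

Because both mixtures draw from the same pools, this construction satisfies sample independence by design.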
The loss functions we use for LLP are slightly different from those in Ref. [36]. Analogous to the mean squared error (MSE) loss function for fully supervised (or CWoLa) training, we introduce the weak MSE (WMSE) loss for the LLP framework:

$$\ell_{\rm WMSE} = \sum_a \left( f_a - \frac{1}{N} \sum_{i=1}^{N} h(x_i) \right)^2, \tag{2}$$

where $N$ is the batch size, $a$ indexes the mixed samples (with the $x_i$ drawn from mixture $M_a$), and $h$ is the model. Analogous to the cross-entropy, we also introduce the weak cross-entropy (WCE) loss:

$$\ell_{\rm WCE} = \sum_a \mathrm{CE}\!\left( f_a,\ \frac{1}{N} \sum_{i=1}^{N} h(x_i) \right), \tag{3}$$

where $\mathrm{CE}(p, q) = -p \log q - (1 - p) \log(1 - q)$. One caveat we discovered while exploring LLP is that the range of $h(x)$ must be restricted to $[0, 1]$; otherwise the model falls into trivial minima of the loss function. We also observe that the model outputs become effectively binary at 0 and 1, necessitating additional care to avoid numerical precision issues.

CWoLa works without requiring the fractions $f_a$ to be known for training (the fractions on smaller test sets can be used to calibrate the classifier operating points). It acts on two mixtures, treating one as signal and the other as background. CWoLa uses any fully supervised classification method to distinguish the "signal" mixture from the "background" mixture. Remarkably, a classifier trained in this way asymptotically (as the number of training samples goes to infinity) results in the same classifier as if the samples were pure [37,45,46]. The CWoLa framework has the nice property that as the samples approach complete purity ($f_1 \to 0$, $f_2 \to 1$) it smoothly approaches the fully supervised paradigm. CWoLa presently only works with two mixtures; if more than two are available they can be pooled in some way at the cost of diluting their purity. Table I summarizes some differences between CWoLa and LLP.
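Returning to the LLP losses described above, both can be written in a few lines once the model outputs on a batch are available. The following is a minimal NumPy sketch rather than the training code used for the study: it assumes each mixture contributes one array of model outputs already restricted to $[0, 1]$, and the function names and the clipping constant `eps` (added to keep the logarithms finite) are illustrative choices.

```python
import numpy as np

def weak_mse(outputs_per_mixture, fractions):
    """WMSE: squared gap between each known fraction and the mean model output."""
    return float(sum((f - np.mean(h)) ** 2
                     for h, f in zip(outputs_per_mixture, fractions)))

def weak_ce(outputs_per_mixture, fractions, eps=1e-7):
    """WCE: cross-entropy between each known fraction and the mean model output."""
    total = 0.0
    for h, f in zip(outputs_per_mixture, fractions):
        # Clip the batch-averaged prediction (eps is an assumption of this sketch).
        mean_h = np.clip(np.mean(h), eps, 1.0 - eps)
        total += -f * np.log(mean_h) - (1.0 - f) * np.log(1.0 - mean_h)
    return float(total)

# Toy model outputs for one batch drawn from each of two mixtures.
batch_outputs = [np.array([0.9, 0.7, 0.8]), np.array([0.1, 0.3, 0.2])]
matched = weak_mse(batch_outputs, [0.8, 0.2])   # predicted fractions agree
swapped = weak_mse(batch_outputs, [0.2, 0.8])   # predicted fractions disagree
```

Both losses depend on the model only through the batch-averaged output, which is why the batch size enters LLP training in a way it does not for CWoLa.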
To explore weak supervision methods with high-dimensional inputs, we simulate $Z + q/g$ events at $\sqrt{s} = 13$ TeV using Pythia 8.226 [47] and create artificially mixed samples with various quark (signal) fractions. Jets with transverse momentum $p_T^{\rm jet} \in [250, 275]$ GeV and rapidity $|y| \le 2.0$ are obtained from final-state, non-neutrino particles clustered using the anti-$k_t$ algorithm [48] with radius $R = 0.4$ implemented in FastJet 3.3.0 [49]. Single-channel, 33 × 33 jet images [2,3,17] are constructed from a patch of the pseudorapidity-azimuth plane of size 0.8 × 0.8 centered on the jet, treating the particle $p_T$ values as pixel intensities. The images are normalized so the sum of the pixels is 1 and standardized by subtracting the mean and dividing by the standard deviation of each pixel as calculated from the training set.
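The image construction just described can be sketched as follows. This is an illustrative NumPy version, not the code used for the study: the function names are hypothetical, the jet axis is assumed to be supplied as a pseudorapidity and azimuth, particles outside the patch are simply dropped, and the small constant `eps` in the standardization is an assumption added to avoid dividing by zero in pixels that are always empty.

```python
import numpy as np

def jet_image(eta, phi, pt, jet_eta, jet_phi, npix=33, width=0.8):
    """Pixelate particle pT into an npix x npix grid centered on the jet axis."""
    deta = np.asarray(eta, dtype=float) - jet_eta
    # Wrap the azimuthal difference into [-pi, pi).
    dphi = (np.asarray(phi, dtype=float) - jet_phi + np.pi) % (2 * np.pi) - np.pi
    edges = np.linspace(-width / 2, width / 2, npix + 1)
    image, _, _ = np.histogram2d(deta, dphi, bins=(edges, edges), weights=pt)
    total = image.sum()
    # Normalize so that the pixel intensities sum to 1.
    return image / total if total > 0 else image

def standardize(images, eps=1e-5):
    """Subtract the per-pixel mean and divide by the per-pixel standard deviation.

    The mean and std should be computed on the training set and reused
    unchanged for the validation and test sets.
    """
    mean = images.mean(axis=0)
    std = images.std(axis=0)
    return (images - mean) / (std + eps), mean, std

# Two-particle toy jet (hypothetical kinematics).
img = jet_image(eta=[0.0, 0.1], phi=[0.0, -0.1], pt=[30.0, 10.0],
                jet_eta=0.0, jet_phi=0.0)
```

With 33 × 33 pixels each image has 1089 inputs, consistent with the O(1000) dimensionality quoted earlier.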
All instantiations and trainings of neural networks were performed with the Python deep learning library Keras [50] with the TensorFlow [51] backend. A CNN architecture similar to that employed in Ref. [17] was used: three 32-filter convolutional layers with filter sizes of 8×8, 4×4, and 4×4 followed by a 128-unit dense layer. Max-pooling of size 2 × 2 was performed after each convolutional layer with a stride length of 2. The dropout rate was taken to be 0.1 for all layers. Keras VarianceScaling initialization was used to initialize the weights of the convolutional layers. Due to numerical precision issues caused by the tendency of LLP to push outputs to 0 or 1, a softmax activation function was included as part of the loss function rather than the model output layer. The validation and test sets each consisted of 50k equally mixed quark and gluon jet images. Training was performed with the Adam algorithm [52] with a learning rate of 0.001 and a validation performance patience of 10 epochs. Each network was trained 10 times and the variation of the performance was used as a measure of the uncertainty. Unless otherwise specified, the following are used by default: Exponential Linear Unit (ELU) [53] activation functions for all non-output layers, the CE loss function for CWoLa, and the WCE loss function for LLP.
The performance of a binary classifier can be captured by its receiver operating characteristic (ROC) curve. To condense the classifier performance into a single number, we use the area under the ROC curve (AUC). The AUC is also the probability that the classifier correctly sorts a randomly drawn signal and background event. Random classifiers have AUC = 0.5 and perfect classifiers have AUC = 1.0. We also confirmed that our conclusions are unchanged when using the background mistag rate at 50% signal efficiency as a performance metric instead.
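The probabilistic reading of the AUC can be checked directly by counting correctly ordered signal/background pairs. The sketch below is a plain-Python illustration with a hypothetical function name; counting ties as one half is a common convention assumed here, and the quadratic loop is only practical for small samples.

```python
def pairwise_auc(signal_scores, background_scores):
    """Fraction of (signal, background) pairs ranked correctly by the classifier."""
    wins = 0.0
    for s in signal_scores:
        for b in background_scores:
            if s > b:
                wins += 1.0
            elif s == b:
                wins += 0.5  # ties count as half (assumed convention)
    return wins / (len(signal_scores) * len(background_scores))

# Toy classifier outputs (hypothetical numbers).
perfect = pairwise_auc([0.9, 0.8], [0.1, 0.2])   # every pair ordered correctly
partial = pairwise_auc([0.9, 0.3], [0.5, 0.1])   # three of four pairs correct
```

Because only the ordering of the scores matters, any monotonic rescaling of the classifier output leaves the AUC unchanged.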
As previously noted, the LLP paradigm works by matching the predicted fraction of signal events to the known fraction for multiple mixed samples. In Ref. [36], the averaging took place over the entire mixed sample. Averaging over the entire training set at once is effectively impossible for high-dimensional inputs such as jet images because the graphics processing units (GPUs) that are needed to train the CNNs in a reasonable amount of time typically do not have enough memory to hold the entire training set at one time. Hence, the ability to train with batches is highly desirable for using LLP with high-dimensional inputs.
There are many tradeoffs inherent in choosing the LLP batch size. Smaller batch sizes are susceptible to shot noise in the sense that the actual signal fraction on that batch may differ significantly from the fraction for the entire mixed sample, an effect which decreases as the batch size increases. Smaller batch sizes also have longer training times per epoch (since the parallelization capabilities of the GPU cannot be used) but often require fewer epochs to train. Larger batch sizes have shorter training times per epoch but typically require more epochs to train. For CWoLa, the batch size plays the same role as in full supervision, with the performance being largely insensitive to it but the total training time varying slightly. These tradeoffs are captured in Fig. 1, which shows both the performance and training time for CWoLa and LLP models as the batch size is swept in powers of two from 64 to 16384, trained on two mixtures with $f_1 = 0.2$ and $f_2 = 0.8$. The expected independence of CWoLa performance and the degradation of LLP performance for low batch sizes can clearly be seen. The training time curves are concave with optimum batch sizes toward the middle of the swept region. Based on this figure, we choose default batch sizes of 4000 for LLP and 400 for CWoLa.

In order to explore a slightly more realistic scenario than artificially mixing samples from the same distribution of quarks and gluons, we generate Z + jet and dijet events with the same generation parameters and cuts as described previously. These "naturally" mixed samples have quark fractions $f_{Z+{\rm jet}} = 0.88$ and $f_{\rm dijets} = 0.37$. The signal and background fractions have been systematically explored for these and many other processes in Ref. [54]. As indicated by Table II, there is no significant difference in performance on the naturally mixed or artificially mixed samples. Hence, artificially mixed samples are used in the rest of this study in order to evaluate weak supervision performance at different quark purities.

Table II: There is no significant difference in classifier performance between the naturally mixed (Z+jet vs. dijets) samples and the artificially mixed (Z + q/g) samples with the same signal fractions.

Fig. 2 compares CWoLa and LLP performance for various quark/gluon purities as a function of the number of training samples. Each network is trained using two samples, one with quark fraction $f_1$ and the other with quark fraction $f_2 = 1 - f_1$. Each point in the figure is the median of 10 independent network trainings and the error bars show the 25th and 75th percentiles. Full supervision performance corresponds to CWoLa with $f_1 = 0$. The most important takeaway from Fig. 2 is that we have achieved good performance with both weak supervision methods over a large variety of sample purities and training sample sizes. We also see that CWoLa consistently outperforms LLP and continues to get better as additional training samples are used, likely a result of the increasingly populated feature space, whereas LLP performance tends to level off. It should be noted that, given the binary output nature of LLP models, classifiers trained in this way effectively come with a working point, and sweeping the threshold to produce a ROC curve may not be ideal. The purity/data tradeoff analysis of Fig. 2 can provide valuable information for practical applications of weak supervision methods in physics, particularly in cases where more data can be acquired at the expense of worsening sample purity.
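The shot noise mentioned in the batch-size discussion can be estimated with a short calculation. As an idealization not stated in the text, assume each batch consists of N independent draws from a mixture with signal fraction f; the number of signal jets is then binomially distributed and the per-batch signal fraction fluctuates with standard deviation sqrt(f(1 - f)/N). The helper name below is hypothetical.

```python
import math

def batch_fraction_std(f, batch_size):
    """Standard deviation of the per-batch signal fraction under binomial sampling."""
    return math.sqrt(f * (1.0 - f) / batch_size)

# Spread of the batch-level fraction at the ends and middle of the swept range,
# for a mixture with signal fraction 0.2.
spread = {n: batch_fraction_std(0.2, n) for n in (64, 1024, 16384)}
```

Under this assumption a batch of 64 jets from the f = 0.2 mixture has a fraction that fluctuates by about 0.05, a quarter of the target value, while at 16384 jets the fluctuation is about 0.003.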
The sensitivity of LLP to different choices of loss function and activation function was examined. We studied the choices of the symmetric squared loss of Eq. (2) and the weak cross-entropy loss of Eq. (3) with Rectified Linear Unit (ReLU) [55] and ELU activation functions. We found a significant improvement in LLP classification performance when using ELU activations instead of ReLU activations, particularly at high signal efficiencies. The choice of loss function was found to be less important than the choice of activation function, but minor improvements in AUC were observed with the WCE loss function over WMSE. We also studied the dependence of CWoLa on the choice of activation function and found consistent performance between ELU and ReLU activations. These results justify our default choices of ELU activation and WCE loss functions. With the choice of ELU activation, LLP achieves almost the same performance as our CWoLa-trained network near the operating point with equal signal and background efficiencies. We suspect this is a result of the tendency of LLP to output binary predictions (near 0 or 1) rather than a continuous output that can be easily thresholded.
Lastly, LLP has the potential advantage over the present implementation of CWoLa that it can naturally encompass multiple mixed samples with different purities. While in principle adding more samples should help, it is not obvious whether the network will effectively take advantage of them. Indeed, we did not find significant improvement to LLP when adding additional samples with intermediate purities, even after significant, dedicated architecture engineering.
In conclusion, we have shown that machine learning approaches using very high-dimensional inputs can be trained directly on mixtures of signal and background, and therefore on data. This addresses one of the main objections to the use of modern machine learning in jet tagging: sensitivity to untrustworthy simulations. We have implemented and tested weakly supervised learning with both LLP and CWoLa, finding that for the quark/gluon discrimination problem considered here CWoLa outperforms LLP and is less sensitive to particular hyperparameter choices. We have developed a method for training LLP with high-dimensional inputs in batches and demonstrated that the batch size is a critical hyperparameter for both performance and training time. Given any fully supervised classifier, CWoLa works "out-of-the-box" whereas LLP requires additional engineering to achieve good performance and is generally harder to train. Nonetheless, the success in using both of these weak supervision approaches on high-dimensional data is encouraging for the future of modern machine learning techniques in particle physics and beyond.
The authors would like to thank Lucio Dery and Francesco Rubbo for collaboration in the initial stages of this work. We are grateful to Jesse Thaler for helpful discussions. PTK and EMM would like to thank the MIT Physics Department for its support. Computations for this paper were performed on the Odyssey cluster