Uncertainty Aware Learning for High Energy Physics

Machine learning techniques are becoming an integral component of data analysis in High Energy Physics (HEP). These tools provide a significant improvement in sensitivity over traditional analyses by exploiting subtle patterns in high-dimensional feature spaces. These subtle patterns may not be well-modeled by the simulations used for training machine learning methods, resulting in an enhanced sensitivity to systematic uncertainties. Contrary to the traditional wisdom of constructing an analysis strategy that is invariant to systematic uncertainties, we study the use of a classifier that is fully aware of uncertainties and their corresponding nuisance parameters. We show that this dependence can actually enhance the sensitivity to parameters of interest. Studies are performed using a synthetic Gaussian dataset as well as a more realistic HEP dataset based on Higgs boson decays to tau leptons. For both cases, we show that the uncertainty aware approach can achieve a better sensitivity than alternative machine learning strategies.


I. INTRODUCTION
The usefulness of physical measurements is tied to the magnitude and reliability of their estimated uncertainties. Whether one is measuring the mass of the Higgs boson [1] or the value of the Hubble constant [2], it is the size of the uncertainty that communicates the quality of the information and allows measurements to be contrasted or accurately combined. While statistical uncertainties can be reduced with the collection of additional data, the most troublesome type of uncertainty is systematic uncertainty. These uncertainties can arise from many sources and are often modeled as the dependence of a parameter of interest on other degrees of freedom known as nuisance parameters.
The challenging task of quantifying the systematic uncertainty on a measured physical parameter has become even more important due to the growing use of complex statistical procedures based on modern machine learning [3][4][5][6][7][8][9]. While more powerful techniques can extract more information from higher-dimensional datasets, they may also introduce or enhance the dependence of those measurements on nuisance parameters. This is because, despite their importance, systematic uncertainties are often not part of the learning procedure for a typical machine learning model. Instead, models are typically trained on synthetic datasets generated with assumed values of the nuisance parameters; the impact of uncertainties is typically quantified post-hoc by varying those nuisance parameters. In cases where systematic uncertainties are dominant, such machine learning models may improve the statistical uncertainty, but increase the systematic uncertainty by exacerbating the dependence on nuisance parameters. To minimize the total uncertainty, it is essential to incorporate uncertainties directly into the learning procedure.
To date, several approaches have been considered. Data augmentation trains a model on a concoction of synthetic data with different values of the nuisance pa-rameters. This exposes the classifiers to the range of possible nuisance parameter values, so that at inference time, the result may be less sensitive to their precise value. The disadvantage of augmentation is that the model is unaware of the values of the nuisance parameters, and so learns the average, rather than the optimal response. Another possibility is to train a model to explicitly be insensitive to nuisance parameters or other parameters [10][11][12][13][14][15][16][17][18][19][20][21][22][23][24][25][26]. One implementation of this approach involves training two machine learning models at the same time: one that achieves the primary learning task and a second model that tries to learn information about the nuisance parameter/feature from the output of the classifier. When this second model (adversary) is unable to perform its task, then the primary classifier is insensitive, as desired. The data augmentation and adversarial learning approaches will serve as important baselines in this paper.
Rendering a classifier insensitive to nuisance parameters may increase analysis precision and decrease analysis complexity, but it may also have undesirable consequences. First, systematic uncertainties often involve guesswork; in many cases, the corresponding nuisance parameters do not have a strict statistical origin. Reducing the sensitivity to a particular nuisance parameter may not eliminate the underlying uncertainty; instead, it may eliminate the only existing handle to probe the source of uncertainty. For example, a measurement which used a model constructed to be insensitive to differences between two example parton shower algorithms, such as Pythia [27] and Herwig [28], may still have significant systematic uncertainty due to our lack of a complete understanding of fragmentation. Second, some values of the nuisance parameter achieve a better sensitivity to the parameters of interest than others and this could be exploited to improve the performance of a classifier. A classifier that is insensitive to a nuisance parameter will not be able to exploit features that are highly sensitive to that parameter. The first example in this paper demonstrates an extreme case where a classifier insensitive to a nuisance parameter will result in no separation power.
We advocate for the opposite of decorrelation. Classifiers are constructed to be explicitly dependent on nuisance parameters as if they are parameters of interest. As nuisance parameters are profiled, the classifier will change and the best classifier will be used for each value of the nuisance parameter. Parameterized classifiers have been studied in the context of parameters or features of interest [29,30], and full dependence on nuisance parameters for inference has been advocated in Ref. [31][32][33][34][35].
In this paper, we provide specific examples of profiled classifiers and show explicitly that they can enhance analysis sensitivity over strategies that render networks insensitive to nuisance parameters. We focus on only the construction of classifiers as useful statistics for downstream analysis and not on full likelihood (ratio) estimation. In this way, our uncertainty-aware classifier approach is a straightforward extension of existing analyses performed at the Large Hadron Collider (LHC) and therefore may result in immediate improvements in sensitivity. In addition, this prescription allows for easy post-hoc histogrambased diagnostics. These may include quantification of the impact of additional sources of systematic uncertainties that are not used for training, and checks for whether the measurement over-constrains the nuisance parameter.
While we focus on the profiling aspect of uncertainty awareness, there is a complementary line of research on the use of uncertainty-aware loss functions [36][37][38][39][40][41][42]. These approaches construct classifiers that are optimized using the final test statistic, including uncertainties. We leave the combination of profile-aware training and uncertainty-aware loss to future work. There have also been recent proposals to use Bayesian neural networks for estimating uncertainties [43][44][45][46]. Additional information about the interplay between uncertainties and machine learning can be found in recent reviews [35,47].
This paper is organized as follows. The uncertaintyaware methods are introduced in Sec. II. Evaluation criteria for assessing the performance of the methods is described in Sec. III. To build intuition, a Gaussian example is presented in Sec. IV. A physics example based on Higgs boson decays to τ leptons using a benchmark dataset is provided in Sec. V. The paper ends with conclusions and outlook in Sec. VI.

II. UNCERTAINTY-AWARE METHODS
This section describes the four methods of training classifiers studied in this paper. All neural networks were trained using Keras 2.2.4 [48] with a TensorFlow 1.12.0 [49] backend on a single Nvidia GeForce GTX GPU.

A. Notation
Features used for classification are denoted by X ∈ R n , where lower case x is a realization of the random variable X. A dataset of many examples will be written {x i }. The nominal value of the nuisance parameter z ∈ R m is z 0 and the true value is z T . In some cases, we will promote the nuisance parameters to random variables, in which case they will be represented by the capital letter Z.

B. Baseline Classifier
The baseline method is a classifier trained to distinguish signal and background using data simulated at the nominal value of the nuisance parameter, as is done routinely in LHC analyses. A network trained optimally to minimise a Binary Cross-Entropy (BCE) loss learns to output a score (see e.g. [50]), where p(·) denotes a probability density, S represents the signal class and B represents the background class. The score of the network is used as an observable with high sensitivity to the parameter of interest for the final measurement.

C. Data Augmentation
An alternative method is to augment the training data to include signal and background samples with several values of the nuisance parameters. A network trained optimally to minimise a BCE loss learns the score, where p Z is the probability density over the nuisance parameter Z, treated as a random variable with some probability density chosen by the experimenter. Typically, Z is discrete and has a nonzero probability mass at only a few values. The score s(x) is then treated in the same way as in the baseline case (Eq. 1).

D. Adversarial Training
An orthogonal strategy is to train a classifier with the explicit objective of being insensitive to the effects of the nuisance parameter. Our implementation follows the adversarial training prescription of Ref. [12]. However, to improve the training stability and speed, the classifier and adversary are concatenated together through a gradient reversal layer [51] and trained simultaneously. The classifier is trained with the objective to minimize the classification loss and maximise the adversarial loss and the second loss has a relative weight of λ, a tunable hyperparameter.
While training for exact invariance in this adversarial setup can be tricky [52], maximizing overall sensitivity requires a compromise between the level of invariance to nuisance parameters and the classification power. The Gaussian case described in Sec. IV is an extreme example where exact invariance to the nuisance parameter requires zero discriminating power for the classifier.
In the end, the score of the classifier on observed data is used as an observable in the final measurement, in the same way as for the baseline classifier.

E. Uncertainty-Aware Classifier
The concept explored in this paper is to parameterize the network in the nuisance parameters; see Fig. 1. Specifically, the network is trained with the true value of the nuisance parameter z as an input to the network in additional to the observables x. A network trained optimally to minimise a BCE loss learns the score, .
The score of this classifier is not used as a single observable for the final fit as in the previous methods. At evaluation time, while the x values remain fixed as inputs to the network, the unknown z is left as a parameter, allowing for later profiling over the nuisance parameters in the final measurement.
Importantly, note that Eq. 3 depends on z. This means that the calculation of analysis observable(s) depends on z and change as the nuisance parameter is varied, during the evaluation of uncertainties and/or during nuisance parameter profiling. This is in contrast to the standard search paradigm in which the calculation of the analysis observables are fixed and the sensitivity to z is evaluated post-hoc. Allowing the calculation of the analysis observables to depend explicitly on the value of z is not the traditional approach, but it does not require that the experimenter have any special knowledge of z. Formation of a confidence interval in the space of model parameters (either parameters of interest or nuisance parameters) naturally requires calculating the likelihood ratio of the model as those parameters vary, relative to the best-fit parameters. It is natural for the calculation of the analysis observable, a proxy for the likelihood ratio, to vary with those parameters. One can later profile over the nuisance parameters to capture the impact of our lack of knowledge of its true value. The traditional approach of fixing the analysis observable calculation can be thought of as an ad-hoc approximation of the full method.

FIG. 1:
The architecture of an uncertainty-aware network, in which the nuisance parameter z is treated as a feature alongside the observed data x, learning a decision function which varies with the nuisance parameter.

III. EVALUATION METHODOLOGY
To evaluate the power of each approach above, we apply them to a common use case, fitting a signal hypothesis in the presence of background, where both signal and background depend on nuisance parameters. Relevant to many measurements of Standard Model (SM) processes as well as searches for physics beyond the SM, the parameter of interest is the signal strength µ, the cross section of the signal relative to the reference value. In the Gaussian example below, we use low-dimensional datasets for simpler visualization, but the results generalize. Similarly, for ease of calculations we perform a binned likelihood fit, although the unbinned nature of neural networks should allow application to unbinned cases; we leave that investigation to future work.
For each of the strategies described, template histograms of the classifier score are constructed from simulated signal and background events for several values of the nuisance parameter z. These templates are the basis of the binned likelihood calculation L(µ, z|{x i }) over the parameters µ, z, where {x i } is the full observed dataset. The likelihood is a product of a Poisson term for each histogram bin and a Gaussian constraint on the nuisance parameter. The Gaussian constraint can readily be replaced with any other prior or a Poisson term from an auxiliary measurement if z is directly constrained with control region data (demonstrated in Appendix B). If no additional prior or constraint on the nuisance parameter is used then only information from the primary measurement constrains z. The Negative Log-Likelihood (NLL) is (up to an irrelevant constant), where s j , b j are the expected number of signal and background events in bin j, respectively, and N j is the number of events observed in data for that bin. The Γ function is the generalized factorial function which can handle decimal values in the simulated test dataset. Although usually irrelevant, the log(Γ(N i )) term is not a constant while using an uncertainty-aware network and cannot be ignored. For this approach, the decision function changes with z and therefore the bin counts in simulation and observed data also change with z.
In practice, samples at various values of z can often be produced cheaply from a single simulated MC sample by shifting the value of z and recomputing all the relevant physics variables, and this approach will be used for the studies in Sec. V. Care must be taken to apply any kinematic selection on these variables only after the shift. In these studies, the templates and the 'observed dataset' are built using the same test dataset because the dataset used in Sec. V is not large enough to split into three representative datasets.
The fitted value of µ is obtained by minimizing Eq. 4. Uncertainties are accounted for by studying the dependence of the likelihood near the fitted valueμ while optimizing over z. The power of each approach is determined by their relative uncertainties in µ. As a diagnostic, the parameter of interest may be profiled over instead to check if the measurement over-constrains the nuisance parameter.

IV. GAUSSIAN EXAMPLE
To illustrate the different approaches in a simple setting with complete analytic control, we begin with a Gaussian example with a two-dimensional feature space and a single nuisance parameter. Signal events are drawn from Gaussian distributions in the two features, with means at cos (z) and sin (z), respectively; the width of each is set to 0.7. Background events are generated in same fashion, but with means for the two features at − cos (z) and − sin (z) respectively. An example of the signal and background distributions for z = π 4 is shown in Fig. 2.
A set of 4.2 × 10 7 events are generated at 21 values of z equally spaced between 0 and π/2. The dataset is split into training and test sets with a ratio of 3:1. All signal events in the test set have a weight of 10 −3 and all background events have a weight of 10 −1 to mimic a rare signal typical of LHC analyses. Ten bins are used to construct the template and observed histograms. The parameter of interest is the signal strength µ with a true value of 1.
FIG. 2: Contour of probability densities for signal and background hypotheses in the two-dimensional feature space for the simple Gaussian demonstration case, with the nuisance parameter fixed to z = π 4 .

A. Models
In a simple case where the signal and background probabilities are well known, it is possible to derive the classifier analytically for the baseline and uncertainty-aware approaches. The results below use the analytical expressions, but as a cross check, neural networks were also trained for the same objective and produced nearly identical results.

Baseline and Uncertainty-Aware Classifiers
The baseline classifier computes the score using the the probability density functions for the Gaussian distributions used to generate the two features for signal and background at an assumed fixed value of z = π 4 . The uncertainty-aware classifier, on the other hand, does not make assumptions about the value of the nuisance parameter and instead calculates a score as a function of the nuisance parameter .
The score s, for the each of the two classifiers are shown in Fig. 3 as a function of the input features, for datasets generated with z = π 4 or z = π 2 . The uncertainty-aware classifier is parameterized as a function of z, and given the correct value of the nuisance parameter, it can provide the appropriate classifier. Examples of histogram templates of the classifier outputs are shown in Fig. 4. The separation power of the baseline classifier is clearly reduced for cases where the data are generated with values of the nuisance parameter which do not match its assumed value of z = π 4 . Using the Area Under the Receiver Operating Characteristic Curve as a metric to quantify separation power of a model, the separation power for the baseline classifier falls from 0.978 for data generated at z = π 4 to 0.924 for data generated at z = π 2 , while it remains 0.978 on both datasets for the uncertainty-aware classifier.

Data Augmentation
A Linear Discriminant Analysis (LDA) classifier from Scikit-Learn [53] is trained on a training dataset which includes samples with all 21 values 1 of z. As a cross 1 The data augmentation classifier was also trained on a dataset with a continuous distribution of z sampled from the Gaussian prior of z and found to provide near identical results. Also shown are score functions for the uncertainty-aware classifier on the same datasets, evaluated at the correct value of z for each dataset. check, a neural network was trained on the same data and produced a nearly identical score function.

Adversarial Training
The adversarial architecture was trained using samples from all 21 values of z. The classifier and the adversarial network each consist of 10 hidden layers with 64 nodes and a rectified linear unit (ReLU) activation and a single node output layer with sigmoid and linear activations respectively. An L2 kernel regularizer [54] was applied to all but the first and final layer of each network. The two networks were attached with a gradient reversal layer which scales the gradient by −0.2 and trained with the RMSProp [55] optimizer and a batch size of 4096. BCE is used as the classification loss while Mean Squared Error (MSE) is used for the loss of the adversary. An adversarial loss weight of λ = 1 was used. For this dataset, a classifier exactly invariant to z would have zero separation power between signal and background. Therefore, a compromise between invariance and classification power was made in model selection, finding the largest value of λ which did not deteriorate performance. Minimal hyperparameter tuning was performed beyond tuning λ.

B. Results
The negative log-likelihood (Eq. 4) is calculated as a function of the parameter of interest µ and the nuisance parameter z. Examples are shown in Fig. 5 using templates from the baseline and uncertainty-aware classifiers. Due to its assumption that z = π 4 in the calculation of the classifier score, the likelihood from the baseline classifier can strongly exclude z = π 4 when evaluated on a dataset generated with z = π 2 , but finds z = 0, π 2 equally likely. The uncertainty-aware classifier, on the other hand, is also able to exclude the low z region.
The measurement of the nuisance parameter is not the final objective and it can be profiled away. The most relevant metric for determining the relative power of the various approaches is the profile likelihood, max z L(µ, z).
The profile likelihood for each method is shown in Fig. 6 for data generated with z = π 4 and z = π 2 . In the case of z = π 4 , which matches the assumption of the baseline classifier, the uncertainty-aware and baseline classifiers both achieve ideal performance. The adversarial and data-augmentation approaches are somewhat weaker due to the inherent compromises of their methods.
When evaluated on data generated with z = π 2 , in conflict with the assumption of the baseline classifier, the performance of all approaches other than the uncertainty-aware classifier deteriorate significantly. The data-augmented classifier has been trained on 21 values of z in the first quadrant centred around the nominal value which makes it perform worse at extreme values of , and loses separation power for data generated with z = {0, π 2 }, manifested by the lower heights of the signal and background histograms near 1 and 0, respectively. The uncertainty-aware classifier score is evaluated for the correct value of z, providing the optimal score in each case.
z. No setting of the adversarially-trained classifier was found to perform well for datasets with both values of z.

V. REALISTIC EXAMPLE
A more realistic application of the uncertainty-aware classifier in the presence of nuisance parameters can be performed using the datasets [56] produced for the Hig-gsML Kaggle challenge [57] by the ATLAS Collaboration. This dataset was originally simulated by the ATLAS collaboration to measure the decay of the Higgs boson to a pair of τ leptons [58]. This dataset was chosen for our study because it has been used as a benchmark for uncertainty aware learning in the past [52,59].
The signal process is the production of Higgs bosons through gluon-gluon fusion (ggF), vector boson fusion (VBF), and associated production with a vector boson (VH), which decay to pairs of τ leptons. The ggF and VBF production processes were simulated with Powheg [60][61][62][63] interfaced to Pythia8 [27] while the VH production is simulated with Pythia8. Further details on corrections applied can be found in Sec 3. of Ref. [58]. The detector response is simulated with GEANT4 [64] and object reconstruction performed with The negative log-likelihood (Eq. 4) as a function of the parameter of interest µ and the nuisance parameter z for two example datasets, using templates from the baseline (top) and uncertainty-aware classifier (bottom). In the left column, the data are generated with z = π 4 , which matches the assumption made by the baseline classifier. In the right column, the data are generated with = π 2 . The red dot indicates the maximum likelihood estimate which coincides with the true value of µ, z in each case. Note the different z-axis scales for the two classifiers in the bottom row.
the official ATLAS software [65]. The three largest backgrounds from Z/γ * → τ τ , tt and W + jets are simulated with the same chain and mixed in proportions determined by their relative cross sections. Different aspects of the Z/γ * → τ τ background are simulated with Alpgen, Pythia8, Herwig, and Sherpa [66]; the details can be found in Table 1 of Ref. [58]. The tt background is simulated with Powheg and Pythia8 and the W + jets background is simulated with Alpgen FIG. 6: The profile likelihood max z L(µ, z) as a function of the parameter of interest, µ for likelihoods calculated with templates built from the various classifiers. Narrower curves indicate more precise measurements having accounted for systematic and statistical uncertainties. The baseline classifier assumes z = π 4 , and matches the performance of the uncertainty-aware classifier in data generated with z = π 4 (top). In data generated with z = π 2 , the power of all classifiers other than the uncertainty-aware classifier become significantly weaker.
Each event is characterized by 29 features 2 , including the lepton momenta and angles, the magnitude and direction of missing transverse momentum, the energy and angles of leading and sub-leading jets, and several other primary and derived variables. See Ref. [56] for details. The most important nuisance parameter is the unknown absolute energy scale of the hadronically decaying τ leptons. We follow prior studies [52,59] and model this using a skewing function [69] which is applied to the τ lepton E T , for signal and background alike. The minimum E T threshold of 22 GeV is applied after skewing.
At the nominal value of the nuisance parameter, z = 1, the τ lepton energies are left unchanged. The impact of z = 0.9 or 1.1, on several features is shown in Fig. 7. The (unweighted) total number of events that pass the E T threshold for the z = 0.9, z = 1 and z = 1.1 datasets are 618906, 719349 and 818201 respectively. The data are split into training and test set in the ratio 2:1. Since the data at various values of z are generated from the nominal sample, the samples are to a large extent correlated. The train-test split therefore is determined before the skewing function and E T threshold are applied, ensuring complete independence between training and test sets.
Thirty bins are used to construct the template and observed histograms.

A. Description of Trained Models
All methods were implemented using neural networks. The baseline classifier was trained only on data at z = 1, while the data augmentation classifier, uncertainty-aware classifier and the adversarial classifier are all trained at 24 values spaced between z = 0.7 and z = 1.4. Two additional classifiers were also trained on data at z = 0.8 and z = 1.1 to estimate the best possible performance for an unparameterized classifier at these values of the nuisance parameter.
Technical details about the training procedure and architectures of the models are given below.

Baseline Classifier
The neural network comprises 10 hidden layers with 512 nodes each, ReLU activations and L2 kernel regularizers for all but the first hidden layer and a final layer with a single node and sigmoid activation. It was trained with an RMSProp optimizer, BCE loss and a batch size of 4096.

Data Augmentation
The network comprises 10 hidden layers, each with 64 nodes, a ReLU activation, and L2 kernel regularizers for all but the first hidden layer and a final layer with sigmoid activation. The network was trained with an Adam optimizer [70], BCE loss and a batch size of 4096.

Adversarial Training
Both the classifier and the adversary consist of 10 hidden layers with 64 nodes and ReLU activations and L2 kernel regularizers for all but the first hidden layer and a final layer with a single node with a sigmoid and linear activation for the classifier and adversary respectively. BCE is used as the classification loss and MSE as the adversarial loss. The two networks are attached with a gradient reversal layer which scales the gradient by −0.2 and trained with an RMSProp optimizer with λ = 1 and a batch size of 4096.

Uncertainty-Aware Classifier
The uncertainty-aware classifier is comprised of two sub-networks combined with a custom 'if-else' layer which outputs the result of the first sub-network if the input z is less than 1 and the result of the second subnetwork otherwise. This approach allows training of one sub-network longer than another if the performance in the z ≥ 1 and z < 1 regions converge at different rates.
Each sub-network consists of 10 hidden layers with 64 nodes, ReLU activations and L2 kernel regularizers for all but the first hidden layer and a final layer with a single node and sigmoid activation. They were trained using RMSProp optimizer, BCE loss and a batch size of 4096.

Classifiers Trained On Data From True z
Two neural networks were trained on data generated at z = 0.8 and z = 1.1 respectively. The network trained on data at z = 0.8 comprises 10 hidden layers with 64 nodes each, ReLU activations and L2 kernel regularizers for all but the first hidden layer and a final layer with a single node with a sigmoid activation. The network trained on data at z = 1.1 has an identical structure except that the hidden layers are 512 nodes wide. Both networks were trained using an RMSProp optimizer, BCE loss and a batch size of 4096.

B. Results
We evaluate the power of each method by examining the width of the profile likelihood curves in the parameter of interest, µ, see Fig. 9. The true value of µ was set to 1 (for comparisons when the true value of µ is set to 2 instead, refer to Appendix C). When the data are generated with z = 1, matching the assumption of the baseline classifier, we see that both the baseline and uncertaintyaware classifiers achieve the ideal performance, while the adversarial and data-augmented classifiers are slightly weaker. However, when the data are generated with a shift in the nuisance parameter (z = 0.8, 1.1) relative to the value used to build the baseline classifier, the uncertainty aware classifier maintains its ideal performance while the baseline classifier becomes less powerful.

VI. CONCLUSIONS
In this paper 3 , we have advocated for uncertaintyaware classifiers where the dependence on nuisance parameter is maximized during training by exploiting parameterized classifiers [29,30]. Using a Gaussian example and a realistic H → τ τ example, we have shown that the uncertainty-aware approach outperforms alternative methods that either are unaware of uncertainties or try to reduce the dependence on them during training. Our approach is successful because it provides the most effective classifier for all values of the nuisance parameter. This is useful when uncertainties are evaluated and when the nuisance parameter is profiled. It should be straightforward to apply this approach to multiple nuisance parameters although it was demonstrated on a single nuisance parameter in this paper.
While we advocate for maximally depending on the nuisance parameters, there could be cases where reducing the dependence could be beneficial. For example, eliminating the dependence on a particular nuisance parameter reduces the analysis complexity. If the classifier is used to make a simple selection (cut-and-count), then reducing the dependence could also improve analysis sensitivity when the uncertainty on the nuisance parameters is large [35]. For example, consider the case of an analysis that is considering adding a cut on a feature that is independent from all other features and that the cut has a signal efficiency S , a background efficiency B , and a relative uncertainty on B that is δ. Clearly, one needs S / √ B 1 for this cut to be useful. If δ is small, then the cut will certainly be useful. If δ is large, it may be beneficial to not make the cut on the feature, which is the analog of rendering the analysis insensitive to δ. Caution must be taken in such cases because reducing the sensitivity to a nuisance parameter may hide the size of the true uncertainty. This is the case for many two-point uncertainties such as hadronization modeling. Ultimately, the decision to use the additional feature or not depends on how the test statistic will be used in the analysis.
A related topic to uncertainty-awareness is inferenceawareness, where classifiers are trained using the final analysis objective as the loss. The ultimate sensitivity will be achieved when both of these approaches are combined, which will be the topic of future investigation.
Uncertainty awareness is a relatively straightforward extension of existing analyses performed at the LHC. Many nuisance parameters can be varied without significant computational overhead and with the minimal  nominal data (z T = 1.0) and c systematic up data (z T = 1.1) where the true value of µ is 1. Narrower curves indicate more precise measurements having accounted for systematic and statistical uncertainties.
changes we propose, analyses may be able to improve their sensitivity. The biggest improvements are expected for analyses that are limited by experimental systematic uncertainties. A growing number of analyses will fall into this category with the high-statistics of the High-Luminosity LHC and other HEP experiments across frontiers. The negative log-likelihood (Eq. 4) as a function of the parameter of interest µ and the nuisance parameter z for two example datasets in the realistic example, using templates from the baseline (first row), systematic aware (second row), data augmentation (third row) and adversarial classifier (fourth row). On the left column, the data are generated with z = 1, while on the right column, the data are generated with z = 1.1. The red dot indicates the maximum likelihood estimate which coincides with the true value of µ, z in each case. Note that the z-axis scale is not uniform in all figures. FIG. 13: The profile likelihood max z L(µ, z) as a function of the parameter of interest, µ for likelihoods calculated with templates built from the various classifiers in the auxiliary measurement study. The baseline classifier assumes z = π 4 , and matches the performance of the uncertainty-aware classifier in data generated with z = π 4 (top). In data generated with z = π 2 , the power of all classifiers other than the uncertainty-aware classifier become significantly weaker despite a better constraint on z compared to Sec. IV.  nominal data (z T = 1.0) and c systematic up data (z T = 1.1) where the true value of µ is 2. Narrower curves indicate more precise measurements having accounted for systematic and statistical uncertainties.