Learning Uncertainties the Frequentist Way: Calibration and Correlation in High Energy Physics

Calibration is a common experimental physics problem, whose goal is to infer the value and uncertainty of an unobservable quantity Z given a measured quantity X. Additionally, one would like to quantify the extent to which X and Z are correlated. In this paper, we present a machine learning framework for performing frequentist maximum likelihood inference with Gaussian uncertainty estimation, which also quantifies the mutual information between the unobservable and measured quantities. This framework uses the Donsker-Varadhan representation of the Kullback-Leibler divergence -- parametrized with a novel Gaussian Ansatz -- to enable a simultaneous extraction of the maximum likelihood values, uncertainties, and mutual information in a single training. We demonstrate our framework by extracting jet energy corrections and resolution factors from a simulation of the CMS detector at the Large Hadron Collider. By leveraging the high-dimensional feature space inside jets, we improve upon the nominal CMS jet resolution by upwards of 15%.

Abstractly, the calibration task can be described as quantifying the relationship between two random variables X ∈ R^M and Z ∈ R^N. Here, X is the measured quantity and Z is the unobservable ("latent") quantity. A reconstruction technique produces a function ẑ : R^M → R^N, which is determined by minimizing a loss functional over sample data (real or synthetic). While ML methods are effective even when M and N are large, most existing methods have the undesirable property of being prior dependent [30]. This means that ẑ depends on the probability density p(z) used during training. As a result, the calibration is not universal, and caution must be taken when applying it to different event samples.
Furthermore, some calibration methods simply produce a point estimate, with no estimation of the corresponding uncertainty. In the HEP context, this uncertainty is usually called the resolution. Quantifying the reconstruction resolution is relevant for a variety of purposes, including the computation of significance variables [31, 32] and background estimation [33, 34]. Various ML approaches for resolution determination have been recently studied for HEP [35-41], but they typically require additional training or model complexity. See Ref. [42] for a complementary approach to frequentist inference.
In this paper, we introduce a simple ML framework for calibration that simultaneously estimates the following quantities:
1. A prior-independent maximum-likelihood calibration, ẑ(x) = argmax_z p(x|z);
2. A Gaussian resolution around ẑ(x), σ_ẑ(x);
3. The log-likelihood ratio, log [p(x|z)/p(x)]; and
4. The mutual information between X and Z, I(X; Z).
To extract ẑ(x) and σ_ẑ(x) in a single training, we use a novel Gaussian Ansatz to parametrize the log-likelihood ratio with an interpretable network architecture. Mutual information is a powerful statistic for quantifying the (non-linear) correlation between two random variables, and it appears due to our choice of loss function.
After describing the Gaussian Ansatz construction, we illustrate the above features in a case study involving jet reconstruction at the Large Hadron Collider (LHC).
Our calibration method builds upon the Mutual Information Neural Estimator (MINE) introduced in Ref. [43]. With MINE, the Donsker-Varadhan representation [44] of the Kullback-Leibler divergence [45] is used to estimate I(X; Z) by training a network to minimize a particular loss functional. We first show that a network minimizing this loss functional yields the likelihood p(x|z), which in principle contains all the information necessary for frequentist inference. Performing this inference in practice, though, involves difficult optimization tasks, which are even more difficult if one wants to extract the resolution. With the Gaussian Ansatz, we parametrize the MINE network such that the inferred value, and especially the resolution, are easy to extract after ML training.
The starting point for our calibration method is the concept of mutual information (MI), defined as:

  I(X; Z) = \int dx \, dz \; p(x, z) \, \log \frac{p(x, z)}{p(x)\, p(z)},   (1)

where p denotes the probability density of the respective random variable. This equation has the property that I(X; Z) = 0 if and only if X and Z are independent, which is equivalent to p(x, z) = p(x) p(z). Therefore, the MI quantifies the interdependence between X and Z.
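As an illustrative sanity check (not taken from the paper), Eq. (1) can be evaluated directly for a pair of correlated Gaussian variables, for which the MI is known in closed form; any MI estimator built below should reproduce this value. The sketch assumes only numpy, and the binning choices are arbitrary.

```python
# Minimal sanity check: for jointly Gaussian (X, Z) with correlation rho,
# the mutual information is analytically I(X; Z) = -0.5 * log(1 - rho^2).
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8
n = 500_000

# Sample correlated Gaussians.
x = rng.standard_normal(n)
z = rho * x + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)

# Crude binned estimate of Eq. (1).
counts, _, _ = np.histogram2d(x, z, bins=60)
p_xz = counts / counts.sum()
p_x = p_xz.sum(axis=1, keepdims=True)
p_z = p_xz.sum(axis=0, keepdims=True)
mask = p_xz > 0
mi_binned = np.sum(p_xz[mask] * np.log(p_xz[mask] / (p_x @ p_z)[mask]))

print(f"binned MI ~ {mi_binned:.3f}, analytic MI = {-0.5 * np.log(1 - rho**2):.3f}")
```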
The MI is a special case of the Kullback-Leibler (KL) divergence, D_KL(P||Q), when P = P_XZ is the joint probability distribution of X and Z (i.e. p(x, z)) and Q = P_X ⊗ P_Z is the product of the marginals (i.e. p(x) p(z)). It is well known that the KL divergence can be cast in the Donsker-Varadhan (DV) representation [46]:

  D_{\rm KL}(P \| Q) = \sup_T \left( E_P[T] - \log E_Q[e^T] \right),   (2)

where E_• represents the expectation value over the probability density •, and the supremum is over the space of functions T such that both expectations are finite. Following the MINE construction in Ref. [43], we use the DV representation to build an estimator for the mutual information from a finite dataset. For functions T : R^M × R^N → R, we can place a lower bound on I(X; Z) by minimizing a loss functional L_DVR over T ∈ 𝒯:

  I(X; Z) \ge - \inf_{T \in \mathcal{T}} L_{\rm DVR}[T],   (3)

where the DV representation (DVR) loss is:

  L_{\rm DVR}[T] = - E_{p(x,z)}[T(x, z)] + \log E_{p(x)\,p(z)}[e^{T(x,z)}].   (4)

Given a finite dataset of (x, z) pairs, the expectations in Eq. (4) can be estimated from sample averages. To estimate the second term, one can simply shuffle the x's and z's, as done in Ref. [43]. The space of functions 𝒯 can be parametrized by neural networks, in which case the DVR loss functional can be minimized using standard gradient descent. As long as 𝒯 is sufficiently expressive, the bound in Eq. (3) will be saturated, so the minimum loss is an estimate of −I(X; Z). (Other loss functionals are also capable of providing lower bounds on the MI. For example, starting from the f-divergence representation of the KL divergence [47, 48], the corresponding loss functional is a variation of the maximum likelihood classifier (MLC) loss [49-51], L_{\rm MLC}[T] = - E_{p(x,z)}[T(x,z)] + E_{p(x)\,p(z)}[e^{T(x,z)} - 1]. Numerical and analytic studies in Refs. [43, 52], as well as our own empirical studies, show that the DVR loss has better numerical convergence properties than the MLC loss.)

Taking the functional derivative of the DVR loss functional with respect to T, we see that the optimum is obtained when:

  T(x, z) = \log \frac{p(x, z)}{p(x)\, p(z)} + c   (5)
          = \log \frac{p(x|z)}{p(x)} + c,   (6)

where c is any constant that we can set to zero without loss of generality. Therefore, if the MINE is well trained, we can use T as an estimate of the log-likelihood ratio. As with most machine learning applications, this requires that the space of neural networks 𝒯 is sufficiently expressive, that there is enough training data, and that the gradient descent algorithm successfully finds the minimum of Eq. (4). Given this, we can then perform maximum likelihood inference given x:

  \hat{z}(x) = \underset{z}{\arg\max}\; p(x|z) = \underset{z}{\arg\max}\; T(x, z).   (7)

Crucially, this inference strategy for z is independent of the prior p(z), which is a desirable property for calibration tasks. Unlike for standard regression [30], the learned estimate ẑ does not depend on the distribution of z samples in the training set. If X does not contain complete information about Z, then there will be uncertainty in our inference of z. Assuming the likelihood p(x|z) is well approximated by a Gaussian density, the uncertainty in the inference is given by the covariance matrix:

  \left[\Sigma^{-1}_{\hat z}(x)\right]_{ij} = - \left.\frac{\partial^2 T(x, z)}{\partial z_i \, \partial z_j}\right|_{z = \hat z(x)},   (8)

which is again prior independent.

So far, we have shown that the MINE network can be used to perform frequentist inference. While T itself depends on the prior p(z), the inference ẑ(x) and resolution σ_ẑ(x) do not. However, both the maximum likelihood estimate in Eq. (7) and the local resolution in Eq. (8) are difficult to evaluate numerically. In the case of maximization, the learned T may be highly non-convex and the true maxima difficult to find using gradient descent. In the case of the second derivative, its evaluation is numerically sensitive to the choice of activation function in the MINE network. For example, if one uses the common Rectified Linear Unit (ReLU) activation function or its variants, then all analytic second derivatives of the network are zero.
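To make the sample-based estimation of Eq. (4) concrete, the following is a minimal PyTorch sketch of the DVR loss on a batch of (x, z) pairs, using the shuffling trick of Ref. [43] to sample from the product of marginals. It is an illustration under assumed network sizes, not the code used in the paper; `TNet` and `dvr_loss` are placeholder names.

```python
import math
import torch
import torch.nn as nn

class TNet(nn.Module):
    """Generic scalar network T(x, z) for the DV representation."""
    def __init__(self, x_dim: int, z_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1)).squeeze(-1)

def dvr_loss(T: nn.Module, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """L_DVR[T] = -E_{p(x,z)}[T] + log E_{p(x)p(z)}[exp T], estimated on a batch."""
    joint = T(x, z).mean()                        # first term: paired (x, z)
    z_shuffled = z[torch.randperm(z.shape[0])]    # break the pairing -> marginals
    # log-mean-exp of T over unpaired samples, computed stably with logsumexp
    marginal = torch.logsumexp(T(x, z_shuffled), dim=0) - math.log(x.shape[0])
    return -joint + marginal

# After training, -dvr_loss(...) evaluated on held-out data estimates I(X; Z).
```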
In order to facilitate a numerical estimate of the maximum likelihood and local resolution, we introduce the following Gaussian Ansatz parametrization for T:

  T(x, z) = A(x) + (z - B(x)) \cdot D(x) + (z - B(x))^T \, C(x, z) \, (z - B(x)),   (9)

where A : R^M → R, B : R^M → R^N, C : R^M × R^N → R^{N×N}, and D : R^M → R^N are each neural networks. We call this the Gaussian Ansatz, since it resembles the logarithm of a Gaussian likelihood density. Unlike a Gaussian likelihood, though, the Gaussian Ansatz is highly expressive, and is in fact a universal function approximator. Specifically, any function f(x, z) that admits a Taylor expansion in z around B(x) can be expanded in this form. The functions A(x) and D(x) capture the zeroth- and first-order dependence of f on z, respectively. The function C(x, z) captures any quadratic or higher dependence of the Taylor expansion of f on z.
The Gaussian Ansatz enables an elegant strategy to extract Eqs. (7) and (8). Since the optimal T(x, z) is bounded from above, we can take D(x) to be everywhere zero without loss of expressivity. In this case, T will achieve critical values at z = B(x). Moreover, if C(x, B(x)) < 0, then these critical values will be (local) likelihood maxima:

  \hat z(x) = \underset{z}{\arg\max}\; T(x, z) = B(x).   (10)

While the Gaussian Ansatz does not necessarily protect against local maxima, it does yield a numerical estimate of the local resolution:

  \Sigma_{\hat z}(x) = -\frac{1}{2}\, C\big(x, \hat z(x)\big)^{-1}.   (11)

Moreover, the (negative) loss of the Gaussian Ansatz with respect to the functional in Eq. (4) is a lower bound on the mutual information I(X; Z), which is saturated in the asymptotic limit of an infinitely large network with infinite data. The Gaussian Ansatz is therefore capable of estimating, from a single dataset of (x, z) pairs and a single training, the maximum likelihood inferred value of z given x, the local resolution on that inference, and the mutual information between X and Z. This can be achieved without having to perform any additional optimization, derivative estimation, or postprocessing beyond the single matrix inversion in Eq. (11). In practice, we find it convenient to start the training with non-zero D(x) to aid the convergence of the model, and then numerically force D → 0 through an increasing L_1 regularization. This helps the model achieve a global, rather than local, minimum.
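The following is a minimal sketch of how Eq. (9) and the extractions of Eqs. (10) and (11) could be implemented. It assumes PyTorch, the one-dimensional case N = 1 relevant for the jet p_T calibration below, and placeholder sub-network sizes; it is not the paper's implementation. Such a model can be trained directly with the `dvr_loss` sketch above.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    # Placeholder fully connected block with three hidden layers.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

class GaussianAnsatz(nn.Module):
    def __init__(self, x_dim: int):
        super().__init__()
        self.A = mlp(x_dim, 1)          # zeroth-order term A(x)
        self.B = mlp(x_dim, 1)          # location B(x) -> maximum likelihood z-hat
        self.D = mlp(x_dim, 1)          # linear term, regularized toward zero
        self.C = mlp(x_dim + 1, 1)      # curvature C(x, z) -> resolution (N = 1)

    def forward(self, x, z):
        dz = z - self.B(x)
        c = self.C(torch.cat([x, z], dim=-1))
        # T(x, z) = A(x) + dz * D(x) + dz * C(x, z) * dz   (Eq. (9) with N = 1)
        return (self.A(x) + dz * self.D(x) + dz * c * dz).squeeze(-1)

    @torch.no_grad()
    def infer(self, x):
        """Return z-hat(x) = B(x) and sigma(x) = sqrt(-1 / (2 C(x, B(x))))."""
        z_hat = self.B(x)
        c = self.C(torch.cat([x, z_hat], dim=-1))
        sigma = torch.sqrt(-0.5 / c)    # valid when C(x, B(x)) < 0
        return z_hat.squeeze(-1), sigma.squeeze(-1)
```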
We now demonstrate the Gaussian Ansatz on an experimental collider physics task: determining jet energy corrections (JECs) and resolutions (JERs) [53]. (At the LHC, one typically calibrates transverse momenta p_T instead of energies, but the terms JECs and JERs are still used.) Jets are collimated sprays of particles that are produced ubiquitously in high-energy collisions. One does not have access to the "true" jet energy, however, because its constituent particles are filtered through a complicated and nonlinear detector response.
Assuming one has a good detector model, though, one can generate truth-level quantities (GEN, corresponding to Z) and then simulate the detector response (SIM, corresponding to X). Performing a simulation-based calibration, one can infer the "true" jet energy from a set of measured particle momenta in a jet. The multiplicative JEC factor is then defined such that the inferred jet momentum is:

  \hat{p}_T = {\rm JEC}(x) \times p_{T,{\rm SIM}}.   (12)

JEC factors are often further refined through a data-based calibration using well-understood control samples, though this is separate from the procedure considered here. The JER factor arises because the inferred and generated p_T values in Eq. (12) are not identical. The JER is typically expressed as a fractional quantity:

  {\rm JER} = \frac{\sigma_{\hat{p}_T}}{p_{T,{\rm GEN}}}.   (13)

In the language of statistics, the JER is a type of "uncertainty", since it represents the limited information about Z contained in X. In the HEP context, though, this quantity is instead called a "resolution"; see Ref. [30] for further discussion.
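Continuing the hypothetical sketch from above, the trained Ansatz outputs map onto the JEC and fractional JER as follows; `x_sim`, `pt_sim`, and `pt_gen` are placeholder tensors of SIM features, SIM jet p_T, and GEN jet p_T, all in the same units used during training.

```python
# Hypothetical usage of a trained GaussianAnsatz (see sketch above):
# convert its outputs into a JEC factor (Eq. (12)) and fractional JER (Eq. (13)).
pt_hat, sigma_pt = model.infer(x_sim)   # model is a trained GaussianAnsatz
jec = pt_hat / pt_sim                   # multiplicative correction factor
jer = sigma_pt / pt_gen                 # fractional resolution
```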
The JEC factor is a function of the measured quantities, primarily the detector-level jet p_T and pseudorapidity η. The JEC can be obtained from fits to simulation [54-58] using a technique called numerical inversion [59]. The JER can also be determined in simulation by fitting the peak region of the detector response p̂_T / p_{T,GEN} to a Gaussian distribution. Here, we consider an alternate (and arguably simpler) approach to JEC and JER extraction.
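For orientation, this traditional Gaussian-fit extraction can be sketched as follows; it illustrates the general idea rather than the detailed CMS procedure, and `response` is a placeholder array of p̂_T / p_{T,GEN} values.

```python
# Illustration: extract a JER by fitting a Gaussian to the core of the
# response distribution, iteratively restricting the fit window around the peak.
import numpy as np
from scipy.optimize import curve_fit

def gaussian(r, norm, mu, sigma):
    return norm * np.exp(-0.5 * ((r - mu) / sigma) ** 2)

def fit_jer(response, window=1.5):
    counts, edges = np.histogram(response, bins=100, range=(0.5, 1.5))
    centers = 0.5 * (edges[:-1] + edges[1:])
    mu, sigma = np.mean(response), np.std(response)
    for _ in range(3):  # refit within mu +/- window * sigma
        mask = np.abs(centers - mu) < window * sigma
        (norm, mu, sigma), _ = curve_fit(
            gaussian, centers[mask], counts[mask], p0=(counts.max(), mu, sigma))
    return mu, abs(sigma)   # peak position and fractional JER
```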
For our case study, we use the Gaussian Ansatz to improve upon the JEC factors provided by the CMS experiment in their 2011 public data release [60]. We use the same 2011 CMS Open Simulation [61] samples as in Ref. [62], which are based on dijets generated in Pythia 6 [63] with a Geant4-based [64] simulation of the CMS detector. This dataset was translated from the original CMS AOD (analysis object data) ROOT-based format into an easier-to-use MIT Open Data (MOD) HDF5 format [65]. Each SIM event consists of a list of particle flow candidates (PFCs), which are the reconstructed four-momentum and particle identification (PID) for each measured particle. The PFCs are clustered into jets using the anti-k_t jet algorithm with R = 0.5 [66-68]. For each jet, truth-level GEN jet information is also provided, as well as the CMS-prescribed JEC. CMS-prescribed JERs are estimated using Ref. [53].
We select jets whose GEN transverse momentum is in the range p_T ∈ [500, 1000] GeV. The lower bound of 500 GeV is chosen to avoid any turn-on effects due to the p_{T,SIM} > 375 GeV cut applied to the dataset as a whole. We require that the GEN pseudorapidity satisfies |η| < 2.4, and that jets are at least "medium" jet quality [69]. The latent variable of interest is Z = p_{T,GEN}, and the measured quantity X = X_SIM depends on the choice of ML architecture. All momenta are divided by a fixed scale of 1000 GeV, so that the data values are roughly O(1). In total, 5 × 10^6 jets are used for training, across the whole p_T ∈ [500, 1000] GeV range.
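A hedged sketch of this selection is shown below; the file and field names are hypothetical placeholders, not the actual MOD HDF5 schema, and only the cuts described above are assumed.

```python
# Hypothetical event selection sketch (placeholder dataset keys, NOT the MOD schema).
import h5py
import numpy as np

with h5py.File("jets.h5", "r") as f:
    pt_gen = f["pt_gen"][:]
    eta_gen = f["eta_gen"][:]
    quality = f["quality"][:]        # assumed encoding: 2 = "medium" or better

mask = (pt_gen > 500.0) & (pt_gen < 1000.0) & (np.abs(eta_gen) < 2.4) & (quality >= 2)
scale = 1000.0                       # normalize momenta to be roughly O(1)
z_train = (pt_gen[mask] / scale).astype(np.float32)
```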
We consider four different ML models, of increasing sophistication:
1. DNN: The input X consists only of the overall jet kinematic properties, with X = (p_T, η, φ)_SIM, which is the same information used in the CMS calibration procedure in Ref. [53]. Each of the functions A, B, C, and D is constructed as a fully connected neural network, with three hidden layers of size 64 and ReLU activations.
2. EFN: The input X consists of the entire set of PFC three-momenta from the jet. Each of the functions A, B, C, and D is constructed as an Energy Flow Network (EFN) [70]. EFNs are permutation-invariant functions of point clouds, inspired by the Deep Sets formalism [71]. They take the form f({p⃗_i}) = F(Σ_i p_{T,i} Φ(η_i, φ_i)), which exhibits manifest infrared and collinear (IRC) safety; a minimal sketch of this structure is given after this list. For each EFN, the Φ and F functions consist of three hidden layers of respective sizes (50, 50, 64) with ReLU activations. Since C is a function of both X and Z, the Z is appended as an input to the F function.
3. PFN: The same input features as the EFN, but inserted into a Particle Flow Network (PFN) [70, 71], which does not impose IRC safety. PFNs take the form f({p⃗_i}) = F(Σ_i Φ(p⃗_i)), where Φ acts on the full per-particle feature vector rather than being weighted by p_T.
4. PFN-PID: The same as the PFN model, but in addition to the three-momenta of each PFC, the reconstructed PID is included as an input feature. We follow the PID labeling scheme of Ref. [70] for photons, charged hadrons, etc.
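As referenced in item 2, the following is a minimal Deep-Sets-style sketch of the EFN structure f({p⃗_i}) = F(Σ_i p_{T,i} Φ(η_i, φ_i)). It illustrates the permutation-invariant weighted sum with the layer sizes quoted in the text, but it is not the EFN/PFN implementation of Ref. [70].

```python
import torch
import torch.nn as nn

class EFNLike(nn.Module):
    def __init__(self, latent_dim: int = 64, out_dim: int = 1):
        super().__init__()
        self.Phi = nn.Sequential(            # per-particle map Phi(eta, phi)
            nn.Linear(2, 50), nn.ReLU(),
            nn.Linear(50, 50), nn.ReLU(),
            nn.Linear(50, latent_dim), nn.ReLU(),
        )
        self.F = nn.Sequential(              # jet-level map F
            nn.Linear(latent_dim, 50), nn.ReLU(),
            nn.Linear(50, 50), nn.ReLU(),
            nn.Linear(50, 64), nn.ReLU(),
            nn.Linear(64, out_dim),
        )

    def forward(self, pt, angles):
        # pt: (batch, n_particles); angles: (batch, n_particles, 2) = (eta, phi).
        # Zero-padded particles should have pt = 0 so they drop out of the sum.
        per_particle = self.Phi(angles)                       # (batch, n, latent)
        summed = (pt.unsqueeze(-1) * per_particle).sum(dim=1) # pT-weighted sum
        return self.F(summed)
```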
Each of these models is trained for 200 epochs using the Adam optimizer [72], with a learning rate of α = 10^{-4} and a batch size of 2048. All model parameters are given an L_2 regularization loss with weight λ_2 = 10^{-6}. The D network is given an overall L_1 regularization loss of λ_D = 10^{-3} to slowly force it to zero by the end of the training. Every 50 epochs, α is reduced by a factor of 5 and λ_D is increased by a factor of 10.
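This schedule can be sketched as follows, reusing the hypothetical `GaussianAnsatz` and `dvr_loss` from the earlier sketches; `train_loader` is assumed to yield (x, z) batches, and the Adam weight decay is used as a stand-in for the L_2 regularization.

```python
# Sketch of the training schedule described above (placeholder names, not the
# paper's code): DVR loss plus an increasing L1 penalty on the D network.
import torch

model = GaussianAnsatz(x_dim=3)   # e.g. the DNN model: X = (pT, eta, phi)_SIM
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-6)
lambda_D = 1e-3                   # L1 weight on the D network

for epoch in range(200):
    for x, z in train_loader:     # train_loader assumed to yield (x, z) batches
        loss = dvr_loss(model, x, z)
        # Increasing L1 penalty drives D -> 0, so that z-hat(x) = B(x) at the end.
        loss = loss + lambda_D * sum(p.abs().sum() for p in model.D.parameters())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    if (epoch + 1) % 50 == 0:     # every 50 epochs
        for group in optimizer.param_groups:
            group["lr"] /= 5.0    # reduce learning rate by a factor of 5
        lambda_D *= 10.0          # increase the L1 weight by a factor of 10
```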
To aid the numerical convergence, each model is pretrained with a mean squared error loss, E_{p(x,z)}[(B(x) - z)^2], so that B(x) starts near the average z given x. In Table I, we show the results of the training in a narrow bin of p_{T,GEN} ∈ [695, 705] GeV. If our models yield unbiased estimators of the GEN p_T, then the inferred p̂_T distribution should be centered near 700 GeV, which it is for all models. Adding more information to the model should not decrease the mutual information, and if useful, that information should improve the resolution. We see indeed that the resolution improves with increasing model sophistication, as does the mutual information I(X; Z). The resolution from the DNN, which uses the same information as the CMS procedure, is marginally better than the nominal CMS 2011 jet resolution from Ref. [53]. The PFN-PID model exhibits the best resolution, which is roughly 15% better on average than the CMS baseline.
In Fig. 1, we show the distribution of σ_p̂T in the same p_{T,GEN} ∈ [695, 705] GeV bin. As the model sophistication increases, the resolution improves (i.e. the σ_p̂T values shift downward). The non-Gaussian behavior of the ML models is expected, since these models are exploiting additional information beyond the p_T. In principle, the resolution should never degrade by adding more information, but we do find a long right tail for the PFN-PID model due to incomplete ML convergence. We conclude that the measured PFC momenta, along with the PIDs, contain useful information for jet energy calibration that is lost when only considering the total jet momentum.
In this paper, we presented an extension of the MINE framework, the Gaussian Ansatz, capable of simultaneously performing frequentist inference, extracting Gaussian uncertainties, and quantifying the mutual information between random variables. All of these tasks are performed in a single training, with no additional postprocessing. Using this ML framework, we were able to take advantage of the full jet particle information in the CMS Open Simulation to improve the measured jet resolution by approximately 15%. Studies by the ATLAS collaboration have used sequential calibration on a handful of observables to improve their resolution [56-58], and the Gaussian Ansatz may allow for further improvements by enabling simultaneous calibration of any number of input features. We look forward to further developments in ML-based calibration and correlation methods in HEP and beyond.

FIG. 1. Learned JER distribution for the four models, compared to the CMS 2011 baseline. The dataset is the same as in Table I. On average, the PFN-PID exhibits 15% better resolution (i.e. smaller values) than the CMS default.

TABLE I. Gaussian Ansatz results for the four ML models, compared to the CMS 2011 baseline [53]. On a test dataset of GEN jets with p_{T,GEN} ∈ [695, 705] GeV, we show the inferred p̂_T, its resolution σ_p̂T, and the learned mutual information between X = X_SIM and Z = p_{T,GEN}. The ± values correspond to the standard deviation of the p̂_T and σ_p̂T distributions across the test set, and bold face indicates the best resolution and highest mutual information.