Bias and Priors in Machine Learning Calibrations for High Energy Physics

Machine learning offers an exciting opportunity to improve the calibration of nearly all reconstructed objects in high-energy physics detectors. However, machine learning approaches often depend on the spectra of examples used during training, an issue known as prior dependence. This is an undesirable property of a calibration, which needs to be applicable in a variety of environments. The purpose of this paper is to explicitly highlight the prior dependence of some machine learning-based calibration strategies. We demonstrate how some recent proposals for both simulation-based and data-based calibrations inherit properties of the sample used for training, which can result in biases for downstream analyses. In the case of simulation-based calibration, we argue that our recently proposed Gaussian Ansatz approach can avoid some of the pitfalls of prior dependence, whereas prior-independent data-based calibration remains an open problem.

Calibration is the task of removing bias from an inference; that is, of ensuring the inference is "correct on average". There are two major classes of calibration: simulation-based calibration, where the goal is to infer a truth reference object, and data-based calibration, where the goal is to match simulation and data distributions.
Both simulation-based calibrations and data-based calibrations are essential components of the experimental program in high-energy physics (HEP), and a significant amount of time is spent deriving these results to enable downstream analyses. We focus on the ATLAS and CMS experiments at the Large Hadron Collider (LHC) for our examples, but this discussion is relevant for all of HEP (and really any experiment). ATLAS and CMS have performed many recent calibrations, including the energy calibration of single hadrons [1, 2], jets [3,4], muons [5, 6], electrons/photons [7][8][9], and τ leptons [10,11]. The reconstruction efficiencies of all of these objects are also calibrated and include the classification efficiency of jets from heavy flavor [12,13] and even more massive particles [14,15].
Caution is needed to ensure that calibrations resulting from a machine learning approach satisfy certain important properties. One critical property of a calibration is that it should be universal: a calibration derived in one place should be applicable elsewhere. A non-universal calibration would have rather limited utility, and can produce undesirable results if applied to a dataset that does not exactly match the calibration dataset. Statistically, universality is synonymous with prior independence. Most of the existing machine-learning-based calibration proposals, though, are inherently prior dependent, as we will explain below.
A second critical property of a calibration is closure, which means that on average, the calibration produces the correct answer. To quantify closure, one often computes the bias of a calibration, which is the average deviation of the calibrated result from the target value. A calibration can be biased due to the choice of estimator or fitting procedure used, even if the usual pitfalls of dataset-induced biases are taken care of. As explained below, universality and closure are related, and a prior-dependent calibration will necessarily have irreducible bias. In this paper, we explain the origin of prior dependence for common calibration techniques, with explicit illustrative examples, and demonstrate the associated bias that these procedures incur. For simulation-based calibrations, we advocate for our Gaussian Ansatz [43] as a machine-learning-based strategy that is prior independent and bias-free. For data-based calibrations, we are unaware of any prior-independent methods in the literature. We hope that by highlighting these issues, we can inspire the development of prior-independent calibration methods.
The remainder of this paper is organized as follows. In Sec. II, we review the statistical properties of machinelearning-based calibration. In Sec. III, we clarify the meaning of resolution and uncertainty in the HEP context. To demonstrate the issue of prior dependence, we present Gaussian examples in Sec. IV. In Sec. V, we study an HEP application of calibration in the context of jet energy measurements at the LHC. The paper ends in Sec. VI with our conclusions and outlook.

II. THE STATISTICS OF CALIBRATION
In this section, we review some of the basic features of simulation-based and data-based calibration, and discuss the issues of prior dependence and bias.

A. Simulation-based Calibration
In simulation-based calibration, the goal is to infer target (or true) features z T ∈ R N from detector-level features x D ∈ R M; that is, to construct an estimator or calibration function f : R M → R N , where ẑ T = f (x D ) is the inferred estimate. To carry out simulation-based calibration, one starts with a set of (x D , z T ) pairs, which typically come from an in-depth numerical simulation of an experiment. For the case study in Sec. V, x D will be the experimentally measurable features of hadronic jets and z T will be the true jet energy. For concreteness, one can think of the calibration function f as being parametrized by a universal function approximator such as a neural network, whose weights and biases are learned. This is often done by minimizing the mean squared error (MSE) loss:

L[f ] = E[ (f (X D ) − Z T )² ],   (2)

where capital letters correspond to random variables and E represents the expectation value over the training sample used to derive the calibration. The calibration function is then deployed on the testing sample, which could be the dataset of interest or a hold-out control region.
Using the calculus of variations, one can show that with enough training data, a flexible enough functional parametrization, and a sufficiently exhaustive training procedure, the asymptotic solution to Eq. (2) is:

f (x D ) = E train [ Z T | X D = x D ],   (3)

where lowercase letters correspond to an instance of a random variable. In this way, f learns the mean value of z T for a given x D in the training set. Alternative loss functions result in statistics other than the mean. See e.g. Ref. [44] for alternative approaches, including mode learning, which is a standard target for many traditional calibrations (usually in the form of truncated Gaussian fits; see e.g. Ref. [9]).
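This asymptotic result can be checked numerically. Below is a minimal toy sketch (our own illustration, not the paper's dataset), assuming a Gaussian truth spectrum and additive Gaussian smearing; a least-squares linear fit stands in for the neural network, since the conditional mean is linear in this case:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: truth Z_T ~ N(0, sigma^2), detector X_D = Z_T + noise of width eps.
sigma, eps = 1.0, 2.0
z = rng.normal(0.0, sigma, 500_000)
x = z + rng.normal(0.0, eps, z.size)

# Minimizing the MSE over the linear family f(x) = a*x + b converges to the
# conditional mean E[Z_T | X_D = x], which here is (sigma^2/(sigma^2+eps^2)) * x.
a, b = np.polyfit(x, z, 1)

assert abs(a - sigma**2 / (sigma**2 + eps**2)) < 0.01  # slope -> 0.2
assert abs(b) < 0.01                                   # intercept -> 0
```

The fitted slope is noticeably less than one: the MSE solution "shrinks" toward the prior mean, which is exactly the prior dependence discussed next.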

B. Prior Dependence and Bias
A key assumption of simulation-based calibration is that the detector response is universal:

p test (x D | z T ) = p train (x D | z T ).   (4)

This equation says that for a given truth input z T , the detector response is the same between the training data used for deriving the calibration and the testing data used for deploying the calibration. In some cases, the detector response might depend on more features than z T , and if these hidden features are mismodeled, then Eq. (4) may not hold. For our analysis of simulation-based calibration, we assume Eq. (4) throughout.
Calibrations of the form of Eq. (3) are not universal, even if the detector response is. Writing out the MSE-based calibration in integral form, we have:

f (x D ) = ∫ dz T z T p train (z T | x D ) = (1 / p train (x D )) ∫ dz T z T p train (x D | z T ) p train (z T ).   (5)

Here, we have used Bayes' theorem to make explicit the dependence of f on p train (z T ), the prior of true values used for the training. Thus, even if p train (x D | z T ) is universal via Eq. (4), the truth distribution need not be: in general, p test (z T ) ≠ p train (z T ). The non-universality of the calibration function leads to bias, as we now explain. The bias b(z T ) of a calibration quantifies the degree of non-closure. Specifically, bias is the average difference between the reconstructed value and the truth reference value. It is evaluated over the test sample, conditioned on the truth values:

b(z T ) = E test [ f (X D ) − z T | Z T = z T ].   (7)

A bias of zero means that, on average, the reconstructed and truth values agree. For MSE regression, the bias is:

b(z T ) = E test [ E train [ Z T | X D ] | Z T = z T ] − z T .   (8)

This bias depends on the training prior through p train (z T | x D ). Thus, a prior-dependent calibration is necessarily biased, since it depends on the choice of p train (z T ). 3 Note that even if the training dataset is statistically identical to the testing dataset (i.e. p test (x D , z T ) = p train (x D , z T )), it is not guaranteed that the calibration will be unbiased. One way to reduce the bias is to make the prior "wide and flat enough", such that the prior asymptotically approaches a uniform sampling over the real line relative to the detector response. For example, one can show using Eq. (8) that if the prior p(z T ) is Gaussian with standard deviation σ, the detector response p(x D | z T ) is a Gaussian noise model with standard deviation ε, and the test set is statistically identical to the training set, then the bias scales as:

b(z T ) ≈ (µ − z T ) ε² / σ²,   (9)

in the limit σ ≫ ε. In cases with steeply falling spectra, as is common in HEP, prior dependence usually leads to large biases in calibration, even if the testing and training sets follow the same distribution.
3 Note that the bias does not depend on the choice of testing prior, ptest(z T ), but rather only on ptest(x D |z T ). Depending on the choice of ptest(x D |z T ), it is possible for the bias to be zero, but this does not imply the inference is prior independent. For example, if ptest(x D |z T ) = δ(x D − z T ), and E train [x D |Z T = z T ] = z T , then one can show that b(z T ) = 0.
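The suppression of the bias by a wide prior can be illustrated numerically. The following toy sketch (our own, assuming a Gaussian prior with mean zero and Gaussian detector noise) evaluates the asymptotic MSE solution at a fixed truth value and compares two prior widths:

```python
import numpy as np

rng = np.random.default_rng(1)
eps, z0 = 1.0, 2.0  # detector noise width and the truth slice we probe

def mse_bias(sigma, n=400_000):
    # Asymptotic MSE solution for a Gaussian prior N(0, sigma^2):
    # the posterior mean f(x) = sigma^2 * x / (sigma^2 + eps^2).
    f = lambda x: sigma**2 * x / (sigma**2 + eps**2)
    # Bias at truth value z0: average of f(X_D) - z0 with X_D ~ N(z0, eps^2).
    x = z0 + rng.normal(0.0, eps, n)
    return f(x).mean() - z0

# The exact bias is -z0 * eps^2 / (sigma^2 + eps^2); widening the prior
# (larger sigma) suppresses it, in line with the scaling above.
b_narrow, b_wide = mse_bias(1.0), mse_bias(10.0)

assert abs(b_narrow - (-z0 * eps**2 / (1.0**2 + eps**2))) < 0.02
assert abs(b_wide) < abs(b_narrow)
```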

C. Mitigating Prior Dependence
A majority of simulation-based calibrations (with or without machine learning) are set up using the MSE loss as described above, which means that they are biased. That said, there are alternative methods to mitigate the prior dependence and thereby reduce the bias. For example, simulation-based jet calibrations at the LHC use a technique called numerical inversion (see e.g. Ref. [45]). The idea of numerical inversion is to regress x D from z T with a function g(z T ) and then define the calibration function through the inverse:

f (x D ) = g −1 (x D ).

Traditionally, x D is one dimensional and g is parametrized with functions that can easily be inverted numerically, hence the name. The function g is given by:

g(z T ) = E train [ X D | Z T = z T ].

Since the detector response p(x D | z T ) is universal, g is universal, and thus the derived f is also universal. Under certain assumptions, the f from numerical inversion is also unbiased [45]. Numerical inversion has been extended to work with neural networks [23, 24], where the inversion step is accomplished with a second neural network. Alternatively, it may be possible to achieve this with a natively invertible neural network such as a normalizing flow [46, 47].
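A minimal sketch of numerical inversion, under assumed toy conditions (a quadratic average detector response of our own choosing, fitted with a polynomial and inverted with a root finder):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy detector with a nonlinear average response g(z_T) = 0.9*z + 0.05*z^2
# (an assumed form, for illustration only).
z = rng.uniform(1.0, 5.0, 300_000)
x = 0.9 * z + 0.05 * z**2 + rng.normal(0.0, 0.1, z.size)

# Step 1: regress x_D on z_T, i.e. fit g(z_T) = E[X_D | Z_T = z_T].
c2, c1, c0 = np.polyfit(z, x, 2)

# Step 2: define the calibration f = g^{-1} by numerically inverting g.
def f(x_val):
    roots = np.roots([c2, c1, c0 - x_val])
    return max(r.real for r in roots if abs(r.imag) < 1e-9)

# Closure check: calibrating the average response at z_T = 3 recovers ~3.
assert abs(f(0.9 * 3 + 0.05 * 3**2) - 3.0) < 0.05
```

Because g is a conditional expectation at fixed z T , the fit does not depend on the z T spectrum used to produce the training pairs, which is the source of the universality claimed above.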
A key challenge with numerical inversion and its neural network generalizations is that they do not scale well to high dimensions.
In Ref. [43], we propose an alternative way to achieve a prior-independent calibration that scales well to high- and variable-dimensional settings. This approach is based on finding the local maximum likelihood, such that the learned calibration function becomes:

f MLC (x D ) = argmax z T p train (x D | z T ),   (12)

where MLC stands for maximum likelihood classifier; see Ref. [48]. Again, because the detector response p(x D | z T ) is universal, maximum likelihood calibrations are universal, 4 and in certain configurations, are provably unbiased. In particular, if the detector response p(x D | z T ) is a Gaussian noise model centered on z T , then one can show that the bias is zero using Eq. (7):

b(z T ) = E test [ X D | Z T = z T ] − z T = 0.   (13)

4 One important caveat is that universality here means prior independence over the space of priors that share the same support as the training set. One cannot get away with training a model on a single z T instance and expecting it to work everywhere!

Here, we have made use of the fact that for a Gaussian noise model, p(x D | z T ) is maximized (as a function of z T ) at z T = x D , so that f MLC (x D ) = x D , and that the average of this Gaussian is simply z T . This conclusion holds even if the detector response includes offsets, or if the noise depends on z T . 5 The strategy in Ref. [43] is to estimate the (local) likelihood density by extremizing the Donsker-Varadhan representation (DVR) [49, 50] of the Kullback-Leibler divergence [51]:

KL[p ∥ q] = sup f { E p [ f ] − log E q [ e^f ] },   (14)

where the supremum runs over functions f (x D , z T ). By parametrizing f (x D , z T ) via a specially chosen Gaussian Ansatz (see Ref. [43] for details), one can extract the local maximum likelihood estimate and resolution with a single neural network training. We focused on regression in the above discussion, but prior dependence also appears in classification calibration. A classifier trained with the MSE loss function or the binary cross entropy (BCE) will learn the probability of the signal given an observed x D .
If the fraction of signal is different in the training set and the test set, that is, p test (z T ) ≠ p train (z T ), then the output can no longer be interpreted as the probability of the signal. Luckily, classifiers are almost never used this way in HEP, since the classification score is not interpreted directly as a probability. 6 In this case, simulation-based calibrations may not be required, 7 though data-based calibrations are still essential, as described next.
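The prior dependence of a classifier output can be made explicit with a toy example (our own, using unit-width Gaussian signal and background models): the asymptotic MSE/BCE output is f s p(x|sig) / (f s p(x|sig) + (1 − f s ) p(x|bkg)), which shifts with the training signal fraction f s .

```python
from math import exp, pi, sqrt

def gauss(x, mu):
    # Unit-width Gaussian density (toy signal/background models).
    return exp(-0.5 * (x - mu) ** 2) / sqrt(2 * pi)

def classifier_output(x, f_sig):
    # Asymptotic output of an MSE/BCE classifier trained with signal
    # fraction f_sig: the posterior probability under that prior.
    num = f_sig * gauss(x, 1.0)                        # signal centered at 1
    return num / (num + (1 - f_sig) * gauss(x, 0.0))   # background at 0

# The same event x = 0.5 gets a different "signal probability" under
# different training priors:
p_balanced = classifier_output(0.5, 0.5)
p_skewed = classifier_output(0.5, 0.1)

assert abs(p_balanced - 0.5) < 1e-12
assert p_skewed < p_balanced
```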

D. Data-based Calibration
In data-based calibration, the goal is to account for possible differences between the true detector response, p data (x D ), and a simulated detector model, p sim (x D ). That is, the goal is to match detector-level features x D between data and simulation at the distribution level, in contrast to simulation-based calibration, where the goal is to match x D and a target feature z T at the object level. Usually, p data (x D ) is estimated from a control dataset.

5 It is not always true that a maximum likelihood calibration is unbiased. For instance, if X D is drawn from a uniform distribution U (0, z T ), then the maximum likelihood estimate from a single x D sample is ẑ T = x D , whereas an unbiased estimate would be ẑ T = 2x D . 6 See Ref. [52] for a review in the machine learning literature and Ref. [53] for related studies in the context of HEP likelihood ratios. 7 There may be practical issues associated with prior dependence: e.g., if there is an extreme class imbalance, the classifier may not learn well. In the extreme limit of only one class present in the training, the result is also prior dependent.
The authors of Ref. [42] propose to use tools from the field of optimal transport (OT) to perform data-based calibration using machine learning. The central idea is to learn a map h : R N → R N that "moves" x D as little as possible, but still achieves p sim (x D ) → p data (x D ). In this case, the OT-based calibration is:

p̂(h(x D )) = p sim (x D ) / |h′(x D )|,   (15)

where |h′(x D )| is the Jacobian factor. The precise transportation map depends on the choice of OT metric. Eq. (15) can be interpreted as shifting simulated samples x D → h(x D ), with the Jacobian factor accounting for the change of measure. One can also write a corresponding expression for the OT-calibrated detector model, conditioned on z T :

p̂(h(x D ) | z T ) = p sim (x D | z T ) / |h′(x D )|.   (16)

Eq. (16) can be thought of as a "corrected simulated response" function that accounts for mismodeling in the original simulation, p sim (x D | z T ). At first glance, Eq. (16) might seem prior independent, since it is conditioned on the truth-level z T . As we will see, though, there is implicit prior dependence in h. For simplicity, consider the special case of one dimension. Here, for any OT metric, the OT map h : R → R is simply given by:

h(x D ) = P data −1 ( P sim (x D ) ),   (17)

where P λ is the cumulative distribution function of λ, i.e. P λ (x) = ∫ x −∞ dx′ p λ (x′). This function maps quantiles of the simulated distribution to quantiles of the data distribution. The Jacobian of this transformation is:

h′(x D ) = p sim (x D ) / p data (h(x D )) = ∫ dz T p sim (x D | z T ) p train (z T ) / p data (h(x D )).   (18)

Thus, since the prior p train (z T ) explicitly appears, the derived OT-based detector model in Eq. (16) is prior dependent.
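The quantile-matching map can be sketched empirically (our own toy, with assumed Gaussian "simulation" and "data" samples), using the empirical CDF and quantile function in place of P sim and P data −1 :

```python
import numpy as np

rng = np.random.default_rng(3)

# "Simulation" is N(0, 1) and "data" is N(0, 4): toy choices for illustration.
sim = rng.normal(0.0, 1.0, 500_000)
data = rng.normal(0.0, 2.0, 500_000)

def h(x):
    q = (sim < x).mean()          # empirical CDF P_sim(x)
    return np.quantile(data, q)   # empirical quantile P_data^{-1}(q)

# For Gaussians with equal means this map is linear:
# h(x) = (sigma_data / sigma_sim) * x.
assert abs(h(1.0) - 2.0) < 0.1
assert abs(h(-0.5) - (-1.0)) < 0.1
```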
In line with simulation-based calibration, the bias of a data-based calibration is the average difference between the estimator p̂(x D | z T ) and the desired value p data (x D | z T ), conditioned on z T . For OT-based calibration, the bias vanishes if p test (z T ) = p train (z T ); otherwise, the calibration is biased, a consequence of prior dependence. Note that this is in contrast to simulation-based calibration, where non-universality can imply a bias even if p test (z T ) = p train (z T ).

E. Unbiased Data-based Approaches?
As defined above, the goal of a data-based calibration is to match p sim (x D ) to p data (x D ). This is an inherently prior-dependent task, however, since p sim (x D ) = ∫ dz T p sim (x D | z T ) p train (z T ); that is to say, the simulated detector output depends on the simulation input. Instead, one can ask if the corrected response function, p̂(x D | z T ), is universal. If it is, then one can use the same corrected response function to generate p̂(x D ) for a variety of priors p test (z T ). At least in the special case of one-dimensional OT-based calibration, however, we have shown above that the corrected response function is not universal.
To our knowledge, no one has proposed a data-based calibration method that is prior independent, whether using machine learning or not. This implies that all data-based calibration methods in use are biased, though the degree of bias may be small if the testing and training truth-level densities are similar enough. We encourage the community to develop a prior-independent data-based calibration strategy, or prove that it is impossible.

III. RESOLUTION AND UNCERTAINTY IN CALIBRATIONS
The discussion thus far has focused on mitigating bias in calibration. Two related concepts are the resolution and uncertainty of a calibration. In this section, we review calibration resolution and uncertainty, and we clarify important nomenclature in HEP settings.

A. Resolution
As already mentioned, the bias of a calibration refers to the difference in central tendency (such as the mean, median, or mode) between a reconstructed quantity and a reference quantity. By contrast, the resolution of a calibration refers to the spread in the difference between the reconstructed and reference quantities. Using variance as our measure of spread, the resolution Σ²(z T ) can be written as the variance of differences between the reconstructed and truth values, conditioned on the truth values, evaluated over the test sample:

Σ²(z T ) = Var test [ f (X D ) − z T | Z T = z T ].   (20)

Resolutions, like biases, can be prior dependent. When using the MSE-based calibration (Eq. (3)), this becomes:

Σ²(z T ) = Var test [ E train [ Z T | X D ] | Z T = z T ].   (22)

The prior dependence is seen by applying Bayes' theorem to p train (z T | x D ). As before, this prior dependence can be reduced if the prior is wide compared to the detector response. If the prior p(z T ) is Gaussian with standard deviation σ, and the detector response p(x D | z T ) is a Gaussian noise model with standard deviation ε, then by applying Eq. (22), one can show that the resolution scales as:

Σ(z T ) = σ² ε / (σ² + ε²) ≈ ε (1 − ε²/σ²),   (23)

where the approximation holds for σ ≫ ε. On the other hand, for the prior-independent MLC calibration (Eq. (12)), the resolution can be shown to be simply Σ(z T ) = ε. In HEP (and many other) applications, however, it is common to instead refer to the resolution with respect to a measurement x D rather than the true value z T . That is, for an inference ẑ T = f (x D ), we would like a measure of the spread of z T values consistent with this measurement, which we will denote Σ(x D ) (distinguished by the x D argument rather than z T ). Depending on the context and type of calibration, there are a variety of ways to define Σ(x D ), for instance, as the standard deviation from a Gaussian fit to the distribution of reconstructed over true energies (see e.g. Ref. [45]). For our purposes, we can define the point resolution Σ²(x D ) as the variance of z T 's conditioned on x D :

Σ²(x D ) = Var [ Z T | X D = x D ].   (24)

For the MSE-based calibration, this is simply the variance of the posterior, p(z T | x D ).
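The posterior-variance interpretation of the point resolution can be checked with a toy simulation (our own, using an assumed Gaussian prior and Gaussian noise), by slicing events at a fixed measured value and histogramming the truth values in the slice:

```python
import numpy as np

rng = np.random.default_rng(4)

# Gaussian prior N(0, sigma^2) and Gaussian noise of width eps: the point
# resolution Sigma^2(x_D) = Var[Z_T | X_D = x_D] is the posterior variance,
# sigma^2 * eps^2 / (sigma^2 + eps^2), independent of x_D.
sigma, eps = 1.0, 2.0
z = rng.normal(0.0, sigma, 2_000_000)
x = z + rng.normal(0.0, eps, z.size)

# Select a narrow slice around x_D = 1 and measure the spread of z_T in it.
in_slice = np.abs(x - 1.0) < 0.05
measured = z[in_slice].var()
expected = sigma**2 * eps**2 / (sigma**2 + eps**2)

assert abs(measured - expected) < 0.05  # ~0.8 in this toy
```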
However, for frequentist approaches where the posterior is not well defined, such as the maximum likelihood calibration, the resolution cannot be defined this way and care must be taken. For Gaussian noise models p(x D | z T ), the likelihood is symmetric under interchanging the arguments x D and z T , so one can take the resolution to be (applying Eq. (20)):

Σ²(x D ) = Var train [ X D | Z T = f (x D ) ].   (25)

Calibrations do not necessarily improve the resolution and can sometimes make the resolution seem worse. For example, if a calibration requires multiplying the reconstructed quantity by a fixed number greater than one, then the resolution will grow by the same amount. 9 It is therefore important to compare resolutions only after calibration.
If a calibration incorporates the features that determine the resolution of a given quantity, then the resolution can improve from calibration. For example, suppose the reconstructed value x D is some function of observable quantities y D = (y D1 , y D2 , . . . , y Dn ), i.e. x D = g(y D ). For instance, in the context of jet energy calibrations, x D could scale as α η for some constant α and an observable quantity η (e.g. an energy dependence on the pseudorapidity). If any of the y D have a non-trivial probability density, this will be inherited by the reconstructed value x D , and thus x D will have a non-zero resolution. This resolution is completely reducible, however, through a calibration that is y D dependent; that is, a calibration function ẑ T = f (y D ) rather than ẑ T = f (x D ). The ability to incorporate many auxiliary features is why machine-learning-based approaches, such as the Gaussian Ansatz [43], have the potential to improve analyses at HEP experiments.
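A toy sketch of this reducibility argument (our own, with an assumed η-dependent response and no stochastic noise, so that the entire spread is attributable to the hidden feature):

```python
import numpy as np

rng = np.random.default_rng(6)

# Assumed response: x_D = z_T * (1 - 0.1*|eta|), with no stochastic noise.
z = np.full(100_000, 2.0)                 # fixed truth energy
eta = rng.uniform(-2.0, 2.0, z.size)
x = z * (1.0 - 0.1 * np.abs(eta))

# A calibration using x alone can only apply an average scale factor,
# leaving the eta-induced spread in place...
f_x = (z.mean() / x.mean()) * x
# ...while an eta-dependent calibration removes that spread completely.
f_x_eta = x / (1.0 - 0.1 * np.abs(eta))

assert f_x.std() > 0.05      # residual "resolution" from the hidden eta
assert f_x_eta.std() < 1e-9  # truth fully recovered
```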

B. Uncertainty
In the machine learning literature, "resolution" would be referred to as a type of "uncertainty". Uncertainty in the statistical context refers to the limited information about z T contained in x D . In the HEP literature, though, we use uncertainty in a different way, to instead refer to the limited information we have about the bias and resolution of a calibration.
The reason for this difference in nomenclature is that HEP research is based primarily on simulation-based inference, where data are analyzed by comparison to model predictions. (This is the case for the vast majority of analyses at the LHC.) In this context, the word "uncertainty" is reserved to refer to uncertainties on model parameters. A worse resolution can degrade the statistical precision of a measurement, but if it is well modeled by the simulation, then there is no associated systematic uncertainty (though there will still be statistical uncertainties).
Both simulation-based and data-based calibrations can have associated uncertainties. For simulation-based calibrations, even if they are prior independent, there can be uncertainties in the detector models themselves. For data-based calibrations, there are additional uncertainties associated with the truth-level prior; see Sec. II E.
One of the goals of data-based calibration is to improve the modeling of the calibration in simulation to match the data. Typically, data-based calibrations are performed in dedicated event samples with well-understood physics processes. The residual uncertainty following the data-based calibration is dominated by the modeling of the underlying process. For example, data-based jet calibrations (called "in situ" calibrations) compare the jet to a well-measured reference object such as a Z boson. The momentum imbalance between the jet and the Z boson will be due in part to differences in the calibration between data and simulation and in part due to the mismodeling of initial and final state radiation. Uncertainties on the latter are then incorporated into the data-based calibration uncertainty. In nearly all cases, data-based calibrations are performed independent of the uncertainties, which are computed post-hoc. In the future, these uncertainties may be improved with uncertainty/inference-aware machine learning methods [32, ...].

9 This is also true if we had used the relative resolution, ẑ T /z T conditioned on Z T = z T , which is also commonly used in HEP, rather than the absolute resolution.

IV. GAUSSIAN EXAMPLES
In this section, we demonstrate some of the calibration issues related to bias and prior dependence in a simple Gaussian example. We assume that the truth information (the "prior") is distributed according to a Gaussian distribution with mean µ and variance σ²:

p(z T ) = N (z T ; µ, σ²).

The detector response is assumed to induce Gaussian smearing centered on the truth input with variance ε²:

p(x D | z T ) = N (x D ; z T , ε²).

For the simulation-based calibration in Sec. IV A, the goal is to learn Z T given X D , assuming perfect knowledge of the detector response. For the data-based calibration in Sec. IV B, the goal is to map X D in "simulation" to X D in "data". In this latter study, we assume that data and simulation have the same true probability density and differ only in their detector response, ε sim ≠ ε data ; that is, p sim "mismodels" p data .

A. Simulation-based Calibration
If we use the MSE approach in Eq. (3), there is a prior dependence in the calibration, which induces bias. Perhaps counter-intuitively, this bias persists even if the prior used for training is identical to the density from which the test data are drawn, as we now show.
In the Gaussian case, the reconstructed data are distributed according to:

p(x D ) = N (x D ; µ, σ² + ε²),

and it is possible to solve Eq. (5) analytically, in the asymptotic limit:

f MSE (x D ) = (σ² x D + ε² µ) / (σ² + ε²).

For comparison, we can also compute the unbiased maximum likelihood calibration using Eq. (12):

f MLC (x D ) = x D .

It is also possible to analytically compute the point resolutions, Σ(x D ), for both the MSE and MLC fits (Eqs. (24) and (25), respectively):

Σ MSE (x D ) = σ ε / √(σ² + ε²),   Σ MLC (x D ) = ε.

To illustrate this setup, we simulate this scenario numerically for µ = 0, σ = 1, and ε = 2. In Fig. 1a, we show the simulated data, for which both the true and reconstructed values follow a Gaussian distribution. The first step of a typical calibration is to predict the true z T from the reconstructed x D . Since we know that the average dependence of the true z T on the reconstructed x D is linear, we perform a first-order polynomial fit to the data using numpy polyfit, which is represented by the blue dashed line in Fig. 1a. This calibration function is then applied to all reconstructed values. The resulting calibration curve is presented in blue in Fig. 1b, along with the associated resolution Σ MSE (x D ).
For comparison, we perform a maximum likelihood calibration using the Gaussian Ansatz introduced in Ref. [43]:

f (x, z) = A(x) + (z − B(x)) C(x) + ½ (z − B(x))² D(x, z),   (35)

where we have dropped the subscripts (x D → x, z T → z) for compactness of notation. As described in Ref. [43], the calibration function B(x) is obtained by minimizing the DVR loss function from Eq. (14), such that after training:

ẑ(x) = B(x) = argmax z p(x | z).

For Gaussian noise models, this maximum likelihood estimate is unbiased, as confirmed by the numerical results in Fig. 1b. Using the analytic solution f MSE , we can also compute the bias from the MSE calibration approach:

b MSE (z T ) = ε² (µ − z T ) / (σ² + ε²).   (39)

As expected, b(z T ) → 0 as ε → 0. For ε > 0, though, there is a non-zero bias with the MSE approach. The z T -binned resolutions can also be computed using Eqs. (22) and (23):

Σ MSE (z T ) = σ² ε / (σ² + ε²),   Σ MLC (z T ) = ε.

The fitted biases and resolutions are presented in Fig. 2, which exhibits the bias expected from Eq. (39). This illustrates the large bias introduced by the MSE regression procedure.
To further highlight the role of prior dependence, we repeat the MSE calibration procedure, where we test multiple values of the prior parameters µ and σ to confirm the predictions in Eq. (39). As shown in Fig. 2a, changes in µ simply shift the calibration up and down, but do not improve the calibration quality across the true values of z T . As shown in Fig. 2b, changes in σ change the slope of the calibration. In the limit σ → ∞, the calibration curve approaches the unbiased curve, as anticipated from Eq. (9).

B. Data-based Calibration
As discussed in Sec. II E, we are unaware of any prior-independent data-based calibration. To highlight this challenge, we study the OT-based technique introduced in Ref. [42] and mentioned in Sec. II D. In our Gaussian example, the goal is to calibrate a "simulation" sample with (µ sim , σ sim , ε sim ) to match a "data" sample with (µ data , σ data , ε data ).
For simplicity, we assume that the true spectra (determined by (µ, σ)) are the same in data and in simulation, such that there is no systematic uncertainty in the calibration (see Sec. III B). Only ε, the parameter governing the detector response, differs between simulation and data: the simulation mismodels the real detector. To highlight the issue of prior dependence, we consider a "training" set with one value of µ train = 0 and a "testing" set with a different value of µ test , with a shared value of σ. The calibration will be derived on the training set and deployed on the testing set. Again for simplicity, we assume that detector effects (determined by ε) are the same in both the train and test sets.
The one-dimensional OT map h from one Gaussian A to another Gaussian B can be computed analytically:

h(x) = µ B + (σ B / σ A ) (x − µ A ),   (42)

where the mean and standard deviation of sample i are µ i and σ i , respectively. This equation can be derived following Eq. (17). For the training set with µ train = 0, we have:

h train (x) = α x,   α ≡ √( (σ² + ε data ²) / (σ² + ε sim ²) ).   (43)

The test set only differs in the value of µ test , so the correct calibration function should be:

h test (x) = µ test + α (x − µ test ).   (44)

As long as α ≠ 1, then h train ≠ h test , and so the calibration is not universal.
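The disagreement between the train-derived and test-appropriate maps can be sketched directly (our own toy numbers for σ, ε sim , ε data , and µ test ):

```python
import numpy as np

# Analytic 1D OT map between two Gaussians A and B:
# h(x) = mu_B + (sigma_B / sigma_A) * (x - mu_A).
def ot_map(mu_a, sig_a, mu_b, sig_b):
    return lambda x: mu_b + (sig_b / sig_a) * (x - mu_a)

# Smeared widths are sqrt(sigma^2 + eps^2); only eps differs between
# "simulation" and "data" (assumed toy values).
sigma, eps_sim, eps_data = 1.0, 2.0, 1.0
w_sim, w_data = np.hypot(sigma, eps_sim), np.hypot(sigma, eps_data)
alpha = w_data / w_sim

h_train = ot_map(0.0, w_sim, 0.0, w_data)  # truth mean mu_train = 0
h_test = ot_map(3.0, w_sim, 3.0, w_data)   # truth mean mu_test = 3

# Since alpha != 1, the two maps disagree: the calibration is not universal.
assert abs(h_train(4.0) - alpha * 4.0) < 1e-12
assert abs(h_test(4.0) - (3.0 + alpha * 1.0)) < 1e-12
assert h_train(4.0) != h_test(4.0)
```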
A numerical demonstration of this bias is presented in Fig. 3, where histograms of the data and simulation are presented along with the calibrated result. In Fig. 3a, we see the calibration derived in the training sample, where by construction, the calibrated simulation matches the data. Since the truth distribution is different in the test set, however, the training calibration applied to the test set is biased, as shown in Fig. 3b. The actual calibration function is plotted in Fig. 4 and compared to the analytic expectations from Eqs. (44) and (43). Since the calibration derived on the training set differs from the one appropriate for the test set, applying the former to the latter leads to a residual bias.

V. CALIBRATING JET ENERGY RESPONSE
Jets are ubiquitous at the LHC, and their calibration is an essential input to a majority of physics analyses performed by ATLAS and CMS. In this section, we consider a simplified version of simulation-based and data-based jet energy calibrations. To illustrate the impact of the prior dependence, we use a realistic yet extreme example where calibrations are derived in a sample of generic quark and gluon jets and then applied to a test sample of jets from the decay of a heavy new resonance.
To further simplify the problem, we consider a calibration of the invariant mass m jj of the leading two jets. In practice, jet energy calibrations are derived for individual jets, but this requires calibrating at least the jet rapidity in addition to the jet energy. We keep the problem one-dimensional in order to ensure the problem is easy to visualize and to mitigate the dependence on features that are not explicitly modeled. For a high-dimensional study of jet energy calibrations in a prior-independent way, see Ref. [43].

A. Datasets
Our study is based on generic dijet production in quantum chromodynamics (QCD). For these studies, we consider two different datasets to demonstrate simulation-based and data-based jet energy calibrations. The first dataset is made with a full detector simulation. The full simulation sample uses Pythia 6.426 [92] with the Z2 tune [93], interfaced with a Geant4-based [94-96] full simulation of the CMS experiment [97]. In simulation-based calibration, our goal will be to reconstruct the truth-level z T = m jj true from the detector-level x D = m jj reco . The second dataset is constructed with a fast detector simulation. The fast simulation uses Pythia 8.219 [98] interfaced with Delphes 3.4.1 [99-101] using the default CMS detector card. In data-based calibration, our goal will be to match this fast simulation to "data", which will be represented by the full simulation. The full simulation sample comes from the CMS Open Data Portal [102-104] and was processed into the MIT Open Data format [105-108]. The fast simulation sample is available at Refs. [109, 110].
For each dataset, we have access to the parton-level hard-scattering scale p̂ T from Pythia, which is in general different from the jet-level transverse momentum p T we are interested in studying. To avoid any issues related to the trigger, we focus on events where p̂ T > 1 TeV. Particles (at truth level) or particle flow candidates (at reconstructed level) are used as inputs to jet clustering, implemented using FastJet 3.2.1 [111, 112] and the anti-k t algorithm [113] with radius parameter R = 0.5. No calibrations are applied to the reconstructed jets.
To emulate two different physics processes while controlling for all hidden variables, we consider dijet events with two different sets of event weights. This will allow us to study the prior-dependent effects of each calibration.
• QCD. This set of weights {w_i} comes from the original Pythia event generation. The resulting spectrum is steeply falling in the invariant mass of the two jets, m_jj.
• BSM. To emulate a narrow dijet resonance, we consider a second set of weights given by

    w̃_i = w_i exp( -(m_jj^true - µ)² / (2σ²) ),

where µ = 2.8 TeV and σ = 10 GeV. Note that the weighting is applied using the true m_jj. The m_jj distributions described above are shown in
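The BSM weighting just described, a Gaussian reweighting in the true m_jj with µ = 2.8 TeV and σ = 10 GeV, can be sketched in a few lines. This is an illustrative sketch, not the paper's code: the function name and event layout are invented, and the Gaussian form w_i · exp(-(m_jj - µ)²/2σ²) is inferred from the description above.

```python
import math

MU, SIGMA = 2800.0, 10.0  # resonance mean and width, in GeV

def bsm_weight(m_jj_true, w_qcd):
    """Reweight a QCD event toward the narrow-resonance (BSM) sample.

    The Gaussian factor is evaluated at the *true* m_jj, as in the text,
    so the truth-level spectrum becomes a narrow peak at MU.
    """
    return w_qcd * math.exp(-0.5 * ((m_jj_true - MU) / SIGMA) ** 2)

# An event at the resonance peak keeps its full weight;
# one a few widths away is strongly suppressed.
w_peak = bsm_weight(2800.0, 1.0)  # = 1.0
w_tail = bsm_weight(2830.0, 1.0)  # 3 sigma away, heavily suppressed
```

Because the two samples share the same underlying events and differ only in these weights, any difference in calibration behavior between them isolates the prior dependence.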

B. Simulation-based Calibration
The goal for the simulation-based calibration is to learn a function to predict z_T = m_jj^true from x_D = m_jj^reco in the full simulation. In contrast to the Gaussian example in Sec. IV A, we do not know the functional form of the calibration. Therefore, we use a neural network to provide a flexible parametrization of the calibration and numerically minimize the MSE loss. The neural network has three hidden layers with 50 nodes per layer, with the rectified linear unit (ReLU) activation for the intermediate layers and a linear activation for the output. The network is implemented in Keras with the TensorFlow backend and optimized with Adam using a batch size of 1000 and 50 epochs. Training is performed over the QCD sample to obtain the calibration function. The learned calibration function is then applied to both the QCD and BSM test samples.
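The architecture described above can be written down directly in Keras. The following is a minimal sketch consistent with the text (three hidden ReLU layers of 50 nodes, linear output, Adam, MSE); the variable names and the commented-out training call are placeholders, not the paper's actual code.

```python
import numpy as np
from tensorflow import keras

# MSE regression network: x = m_jj^reco (1D input) -> z = m_jj^true.
model = keras.Sequential([
    keras.Input(shape=(1,)),
    keras.layers.Dense(50, activation="relu"),
    keras.layers.Dense(50, activation="relu"),
    keras.layers.Dense(50, activation="relu"),
    keras.layers.Dense(1, activation="linear"),
])
model.compile(optimizer="adam", loss="mse")

# Training on the QCD sample (arrays here stand in for the dataset):
# model.fit(x_qcd_reco, z_qcd_true, batch_size=1000, epochs=50)
```

Note that nothing in this setup knows about the truth spectrum explicitly; the prior dependence enters implicitly because the MSE-optimal function is the conditional mean over the training sample.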
The result of the MSE calibration is shown in Fig. 6a. Prior to any calibration, the detector response is about 5% low in both the QCD and BSM test samples. After calibration, the mean is nearly unity for the QCD sample, albeit with a large width; that is, the average bias is close to zero over the prior, but the average resolution is large. For the BSM sample, though, the calibrated mean is far from unity, demonstrating the bias and prior dependence of the MSE calibration. The MSE-based calibration obtained from the QCD fit is not universal, and gives poor results when applied to the BSM sample.^10 For comparison, in Fig. 6b we show results from a maximum-likelihood-based calibration trained on the QCD sample, using the Gaussian Ansatz in Eq. (35). The A, B, C, and D networks of the Gaussian Ansatz each consist of three hidden layers with 32 nodes per layer, with the same activation functions, batch size, and epochs as in the Gaussian example. The calibration function trained on the QCD sample can also be used for the BSM sample, and as Fig. 6b shows, the calibration is indeed universal and unbiased, as expected.

^10 The converse is also true: attempting to use a calibration fitted on the BSM sample will lead to bias on the QCD sample, or on any other BSM sample for that matter. These non-universal fits lead to mass sculpting, in which a fit depends strongly on the mass point used in training. See e.g. Ref. [114] for discussions of sculpting and mass decorrelation.
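The origin of this contrast can be seen in closed form in a toy Gaussian version of the problem. In the sketch below (all numbers and names are illustrative, assuming smearing x = z + ε with ε ~ N(0, σ²) and a Gaussian truth prior z ~ N(µ₀, σ₀²)), the MSE-optimal calibration is the posterior mean, which is dragged toward the training prior, while the maximum-likelihood estimate ẑ = x never references the prior at all.

```python
SIGMA = 1.0  # detector resolution in x = z + noise

def mse_calibration(x, mu0, sigma0):
    """Posterior mean E[z|x] for prior z ~ N(mu0, sigma0^2):
    the best-possible MSE calibration, which depends on the prior."""
    w = sigma0**2 / (sigma0**2 + SIGMA**2)
    return w * x + (1.0 - w) * mu0

def mle_calibration(x):
    """argmax_z p(x|z) = x: prior independent by construction."""
    return x

x_obs = 3.0
z_broad  = mse_calibration(x_obs, mu0=0.0, sigma0=1.0)  # = 1.5, pulled toward 0
z_narrow = mse_calibration(x_obs, mu0=2.0, sigma0=0.1)  # ~ 2.01, pulled toward 2
z_mle    = mle_calibration(x_obs)                       # = 3.0 for any prior
```

The same measured value x is calibrated to very different truth estimates depending on the training prior under MSE, which is exactly the non-universality seen in Fig. 6a.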

C. Data-based Calibration
The goal for the data-based calibration task is to "correct" p_sim(m_jj^reco), given by the fast simulation (Delphes), to the observed data distribution p_data(m_jj^reco), given by the full simulation (Geant4). We now apply the same procedure described in Sec. IV B to the dijet example.
An OT-based calibration is derived using QCD jets to align the fast simulation (Delphes) sample with the full simulation (Geant4) sample. The calibration function, given by the optimal transport map (Eq. (17)), can be computed numerically by sorting and integrating the weighted data points to build the cumulative distribution functions. On the QCD sample, this calibration closes by construction. In particular, as shown in Fig. 7a, the blue dashed line in the ratio plot fluctuates around unity, with deviations due to statistical fluctuations that differ between the two halves of the event samples.
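In one dimension, this OT map is simply T = F_data^{-1} ∘ F_sim, built from the weighted empirical CDFs. The following is a minimal numerical sketch of that construction (the function name and interface are invented for illustration; this is not the paper's exact implementation):

```python
import numpy as np

def ot_map_1d(sim, data, w_sim=None, w_data=None):
    """Monotone (optimal transport) map T(x) = F_data^{-1}(F_sim(x)),
    built by sorting and accumulating event weights into empirical CDFs."""
    sim, data = np.asarray(sim, float), np.asarray(data, float)
    w_sim = np.ones_like(sim) if w_sim is None else np.asarray(w_sim, float)
    w_data = np.ones_like(data) if w_data is None else np.asarray(w_data, float)
    i, j = np.argsort(sim), np.argsort(data)
    s, d = sim[i], data[j]
    Fs = np.cumsum(w_sim[i]) / w_sim.sum()    # empirical CDF of simulation
    Fd = np.cumsum(w_data[j]) / w_data.sum()  # empirical CDF of "data"

    def T(x):
        q = np.interp(x, s, Fs)   # q = F_sim(x)
        return np.interp(q, Fd, d)  # F_data^{-1}(q)

    return T
```

Because T matches quantiles of the training distributions, it closes by construction on the sample it was derived from; applying it to a sample with a different spectrum, as with the BSM weights, is precisely where the prior dependence appears.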
When this calibration is applied to the BSM events, however, the calibration overshoots, as shown with the red dashed line in the ratio plot in Fig. 7b. While the resulting dashed distribution agrees better with the data histogram in dark red than the fast simulation histogram in light red does, the overall agreement is still rather poor. This again highlights the issue of prior dependence in data-based calibrations: the calibration closes on the QCD sample on which it was derived, but fails when transferred to the BSM sample. Note that for the BSM sample, the ratio plot uses a logarithmic scale, indicating a very large bias.

VI. CONCLUSIONS
In this paper, we explored the prior dependence of machine-learning-based calibration techniques. There is a growing number of machine learning proposals for simulation-based and data-based calibration, and nearly all of them exhibit prior dependence. We highlighted the resulting calibration bias in a synthetic Gaussian example and in a more realistic particle physics example of dijet production at the LHC.
In the simulation-based calibration case, most proposals learn a truth target from detector-level observables using loss functions like the MSE. A neural network trained in this way will learn the average true value given the detector-level inputs, which depends on the spectrum of truth values. As a result, such a calibration lacks the critical properties of universality and closure.
There are fewer proposals for machine learning data-based calibrations, but we studied one recent idea based on OT and showed its prior dependence. While we focused on one-dimensional examples, the prior dependence is a generic feature of these approaches. Going to higher dimensions may even exacerbate the issue, since it is harder to visualize and control prior differences in many dimensions.
New learning approaches are required to ensure that machine learning-based calibrations are universal. For simulation-based calibration, the ATLAS collaboration has proposed a prior-independent method called generalized numerical inversion [23,24]. While prior-independent, this technique is typically biased and does not scale well to many dimensions. In Ref. [43], we proposed a new approach based on maximum likelihood estimation, parametrizing the log-likelihood with a Gaussian Ansatz. Maximum-likelihood-based approaches are prior independent by construction and are well motivated statistically. Parametrizing the maximum likelihood estimator with neural networks requires a different learning paradigm than current approaches, but it extends well to many dimensions. To our knowledge, there are currently no prior-independent data-based calibration approaches.
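To make the idea of numerical inversion concrete, the following is a rough binned sketch of the underlying logic (not the ATLAS implementation, which uses neural networks in its generalized form; the toy response model, 5% low as in the dijet example, and all names are invented here): fit the mean detector response as a function of truth, then apply its inverse as the calibration. The fit conditions on truth rather than on the reconstructed value, which is why the result does not inherit the truth prior.

```python
import numpy as np

def numerical_inversion(z_true, x_reco, n_bins=20):
    """Fit the mean response <x|z> in bins of truth, then return the
    inverse map x -> z as the calibration (interpolated numerically)."""
    edges = np.linspace(z_true.min(), z_true.max(), n_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    idx = np.clip(np.digitize(z_true, edges) - 1, 0, n_bins - 1)
    mean_x = np.array([x_reco[idx == b].mean() for b in range(n_bins)])
    # Invert: interpolate truth as a function of the mean reconstructed value.
    return lambda x: np.interp(x, mean_x, centers)

# Toy detector whose response is 5% low, as in the dijet example:
rng = np.random.default_rng(1)
z = rng.uniform(1000.0, 4000.0, 100000)
x = 0.95 * z + rng.normal(0.0, 20.0, z.size)
calib = numerical_inversion(z, x)  # calib(0.95 * z0) is approximately z0
```

The trade-off mentioned above is visible even in this sketch: conditioning on truth buys prior independence, but inverting a mean response is generally biased for non-linear responses and becomes cumbersome in many dimensions.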
To make the most of the complex data from the LHC and other HEP experiments, it is essential to use all of the available information for object calibration. This will require modern machine learning to account for all of the subtle correlations in high dimensions. It is important, however, that these machine learning calibration functions be constructed in a way that preserves the essential features of classical calibration methods. We highlighted prior independence in this paper as a cornerstone of calibration. In the future, innovations that incorporate knowledge of the detector response or physics symmetries may further enhance the precision and accuracy of machine learning calibrations.

CODE AND DATA
The code for this paper can be found at https://github.com/hep-lbdl/calibrationpriors, which makes use of Jupyter notebooks [115] employing NumPy [116] for data manipulation and Matplotlib [117] to produce figures. All of the machine learning was performed on an Nvidia RTX 6000 Graphics Processing Unit (GPU). The physics datasets are hosted on Zenodo at Refs. [106][107][108][110].