Improving Generative Model-based Unfolding with Schrödinger Bridges

Machine learning-based unfolding has enabled unbinned and high-dimensional differential cross section measurements. Two main approaches have emerged in this research area: one based on discriminative models and one based on generative models. The main advantage of discriminative models is that they learn a small correction to a starting simulation, while generative models scale better to regions of phase space with little data. We propose to use Schrödinger Bridges and diffusion models to create SBUnfold, an unfolding approach that combines the strengths of both discriminative and generative models. The key feature of SBUnfold is that its generative model maps one set of events into another without having to go through a known probability density, as is the case for normalizing flows and standard diffusion models. We show that SBUnfold achieves excellent performance compared to state-of-the-art methods on a synthetic Z+jets dataset.

Let X represent an event at detector-level and Z represent the same event at particle-level. The goal of unfolding is to infer the most likely density p(z) using simulated pairs (Z, X) and observations x from data. In classical approaches, X and Z are discretized and unfolding proceeds via regularized matrix inversion to approximate the maximum likelihood solution. The likelihood is a product of Poisson probability mass functions, although most measurements do not directly maximize this likelihood. For example, one of the most common classical unfolding algorithms is Lucy-Richardson deconvolution [24,25] (also known as Iterative Bayesian Unfolding [26]), which uses an Expectation-Maximization (EM) algorithm to converge to a maximum likelihood estimator.
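The EM iteration behind Lucy-Richardson / Iterative Bayesian Unfolding fits in a few lines of numpy. The response matrix, data counts, and prior below are toy values for illustration, not taken from the paper:

```python
import numpy as np

def ibu(response, data, prior, n_iter=200):
    """Iterative Bayesian (Lucy-Richardson) unfolding.

    response[j, i] = P(reco bin j | true bin i); columns sum to 1 (full efficiency).
    """
    t = prior.astype(float).copy()
    for _ in range(n_iter):
        folded = response @ t                    # fold current estimate to reco level
        t = t * (response.T @ (data / folded))   # multiplicative EM update
    return t

# hypothetical 2-bin example with mild migrations
R = np.array([[0.9, 0.2],
              [0.1, 0.8]])
truth = np.array([100.0, 50.0])
data = R @ truth                                 # fold the truth to build toy 'data'
est = ibu(R, data, prior=np.array([75.0, 75.0]))
```

With exact (noise-free) data the iteration converges to the maximum likelihood solution, which here is the truth itself.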
In the unbinned case, the likelihood is not known. One solution is to use the EM algorithm, which is at the core of two maximum likelihood machine learning approaches. The first is OmniFold, which proceeds as follows: both the Expectation (E) and Maximization (M) steps are achieved in practice by training classifiers and interpreting the resulting score as the target likelihood ratio; e.g. for the E step, take samples from p_data and from p_sim weighted by ν and train a classifier to distinguish them. The final result is the set of simulated events weighted by ν, with probability density p_sim(z) ν(z) in the continuum limit. An alternative approach called IcINN [11] is instead based on generative models: the E step of the IcINN is achieved by training a generative model (a normalizing flow [27]) to emulate p_sim(x|z) ν_i(z), while the M step uses a classifier as in OmniFold. The groundwork for the IcINN was laid in Ref. [7], which focused on the E step, as have other papers using normalizing flows, diffusion models, and generative adversarial networks [2,6,14]. Comparing both the E and M steps of OmniFold and the IcINN, one can see that they are formally the same, and thus the EM proof in Ref. [5] applies to both approaches.
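To make the classifier-based E step concrete, here is a minimal one-dimensional toy sketch; the Gaussian "data" and "simulation", the logistic-regression classifier, and all numbers are illustrative assumptions, not the paper's setup. A classifier is trained to distinguish the two samples and its score s is converted into the weight ν = s/(1 − s), which approximates p_data/p_sim:

```python
import numpy as np

rng = np.random.default_rng(0)
sim = rng.normal(0.0, 1.0, 20000)    # toy 'simulation' at detector level
data = rng.normal(0.5, 1.0, 20000)   # toy 'data' at detector level

# logistic-regression classifier: P(data | x) = sigmoid(w0 + w1 * x)
X = np.concatenate([sim, data])
y = np.concatenate([np.zeros_like(sim), np.ones_like(data)])
feats = np.stack([np.ones_like(X), X], axis=1)
w = np.zeros(2)
for _ in range(2000):                # plain gradient descent on the logistic loss
    s = 1.0 / (1.0 + np.exp(-feats @ w))
    w -= 0.1 * feats.T @ (s - y) / len(y)

# E-step weights on simulation: nu(x) = s/(1-s) ~ p_data(x)/p_sim(x)
s_sim = 1.0 / (1.0 + np.exp(-np.stack([np.ones_like(sim), sim], axis=1) @ w))
nu = s_sim / (1.0 - s_sim)
```

For two unit-width Gaussians the exact log likelihood ratio is linear in x, so this two-parameter model can represent it; the average of ν over the simulation should be close to 1.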
While formally both algorithms achieve the same EM algorithm step by step, they have complementary strengths in practice. One of OmniFold's strengths is that it starts from an existing simulation: if that simulation is close to nature, then the neural networks only have to learn a small correction. In contrast, the normalizing flows in the IcINN have to map a known probability density (e.g. a multidimensional Gaussian) to the data, and these can be quite different. However, the E step of OmniFold trains directly on data, and thus its performance degrades when there are not many events. In contrast, the E step of the IcINN is trained using simulation and so can better handle the case of fewer events.
We propose an approach called SBUnfold that incorporates strengths of both OmniFold and the IcINN. We employ a technique called a Schrödinger Bridge (SB), using a diffusion model to learn the generative model in the IcINN workflow. In contrast to a normalizing flow or standard diffusion model, a Schrödinger Bridge learns to map one dataset into another without needing to know the probability density of one of the datasets. Thus, the Schrödinger Bridge should ideally learn a small correction while also preserving the property that the E step is learned with simulation and not data. The latter property means that SBUnfold should outperform OmniFold when there are few events in data. Since OmniFold and the IcINN differ only in the E step, we focus exclusively on the E step of the first iteration for all methods and will refer to the IcINN as cINN for disambiguation.
This paper is organized as follows. Section II introduces Schrödinger Bridges and Sec. III describes how they can be used for unfolding. Numerical results from the public OmniFold Z+jets dataset are presented in Sec. IV. The paper ends with conclusions and outlook in Sec. V.

II. SCHRÖDINGER BRIDGE AND THE CONNECTION WITH DIFFUSION MODELS
Diffusion models have become popular choices for generative modeling due to their large degree of flexibility, stable training, and often competitive results compared to other approaches. The core idea is to design a time-dependent perturbation process that slowly perturbs data towards a tractable noise distribution. The goal of the generative model is then to reverse this process, starting from a noise sample and denoising towards new data observations. From a fixed choice of stochastic differential equation (SDE),

dx = f(x, t) dt + g(t) dw,    (1)

described by parameters f(x, t) ∈ R^d and g(t) ∈ R, the evolution over time of the data observation x ∈ R^d is determined. The same initial data point can undergo different paths due to the additional stochastic term identified by the Wiener process, or Brownian motion, w(t) ∈ R^d, whose increments are sampled from a normal distribution with the same dimension as the data. The reverse process follows the reverse SDE, which reads

dx = [f(x, t) − g(t)² ∇_x log p(x, t)] dt + g(t) dw̄.    (2)

The only unknown term in Eq. 2 is the score function, ∇ log p(x, t), which can be approximated using denoising score matching [29], minimizing the loss function

L(θ) = E_{t, x, x_t} || s_θ(x_t, t) − ∇_{x_t} log q(x_t|x) ||²,    (3)

where a neural network with trainable parameters θ is optimized to approximate the score function of data that have been perturbed by a Gaussian distribution with time-dependent parameters α_t and σ_t. This Gaussian perturbation, q(x_t|x) = N(x_t; α_t x, σ_t² I), requires f(x, t) to be affine with respect to x and results in Gaussian noise at the end of the diffusion process. For an arbitrary noise distribution, the corresponding choice of f(x, t) cannot be easily identified, often restricting the family of distributions one is allowed to choose for the diffusion process. A more general framework that allows the mapping between general distributions corresponds to the solution of the Schrödinger Bridge (SB) problem. Initially proposed by Erwin Schrödinger [30], the problem concerned the inference of the trajectories of particles undergoing a diffusion process
where experimental observations of the particle's trajectory are available only at specific time values, fixing the boundary conditions of the problem. The connection with diffusion generative models becomes clearer by considering the following forward and backward SDEs:

dx = [f(x, t) + g(t)² ∇ log Ψ(x, t)] dt + g(t) dw,    (4)
dx = [f(x, t) − g(t)² ∇ log Ψ̂(x, t)] dt + g(t) dw̄,    (5)

where the wave functions Ψ and Ψ̂ satisfy the boundary conditions Ψ(x, 0) Ψ̂(x, 0) = p_A(x) and Ψ(x, 1) Ψ̂(x, 1) = p_B(x) for densities p_A(x) and p_B(x). Compared to Eq. 2, the presence of an additional non-linear term in the drift function, g(t)² ∇ log Ψ(x, t), enables diffusion between densities that are not necessarily standard normal distributions at the end of the diffusion process. Additionally, ∇ log Ψ̂(x, t) no longer represents the score function of the perturbed data, but is related to it since Ψ(x, t) Ψ̂(x, t) = q(x, t), hence ∇ log q(x, t) = ∇ log Ψ(x, t) + ∇ log Ψ̂(x, t). While the equations describing the general SB problem have similarities with the standard framework for diffusion generative models, a general strategy to solve the problem is not immediately obvious. The authors of Ref. [31] propose a tractable solution named I²SB by considering the presence of pairs of observations in the dataset, such that p(x_a, x_b) = p_A(x_a) p_B(x_b|x_a). This assumption also holds for the unfolding methodology, where pairs of particle collisions before and after detector interactions are always available in the simulation. With this approximation, the authors of Ref. [31] have shown that, setting the linear drift term f(x, t) := 0, the posterior of Eq. 4, q(x_t|x_a, x_b), has the analytic form

q(x_t|x_a, x_b) = N(x_t; μ_t, Σ_t),  μ_t = (σ̄_t² x_a + σ_t² x_b)/(σ_t² + σ̄_t²),  Σ_t = σ_t² σ̄_t²/(σ_t² + σ̄_t²) I,    (6)

with σ_t² := ∫_0^t g²(τ) dτ and σ̄_t² := ∫_t^1 g²(τ) dτ. From this expression, given pairs (x_a, x_b), we can directly determine x_t = μ_t + Σ_t^{1/2} ε, ε ∼ N(0, 1)^d, for any time step t. The loss function is then identified similarly to Eq.
3, namely

L(θ) = E_{t, (x_a, x_b)} || ε_θ(x_t, t) − (x_t − x_a)/σ_t ||²,    (7)

where the right term of the loss function approximates the score function of the backward drift, ∇ log Ψ̂(x, t), which can then be used during sampling to transport samples from p_B to p_A. During sampling, standard recursive samplers like DDPM [32] can be used with the prediction of the denoised data, x_0^θ = x_{n+1} − σ_{n+1} ε_θ(x_{n+1}, n+1), at time step n < N; each x_n is then drawn from the analytic Gaussian posterior of Eq. 6 with boundary pair (x_0^θ, x_{n+1}). The stochastic solver is able to produce different observations given the same exact inputs, which can then be used for different coverage tests regarding the validity of the generated outputs. A second option is to consider the deterministic case, where the posterior distributions are effectively replaced by their means; this case is described by the solution of an ordinary differential equation (ODE) whose trajectories realize an optimal transport plan [33]. In our studies, we observe similar sample quality between the stochastic and deterministic settings and report the results based on the deterministic case for simplicity. We set f(x, t) := 0 and g(t) = β(t), with β(t) the symmetric triangular function interpolating between β_0 = 10⁻⁵ at the boundaries of the time interval and β_1 = 10⁻⁴ at its midpoint.
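A short numpy sketch can make the schedule and the bridge posterior concrete. The midpoint-peaked shape of the triangular β(t) and the grid resolution are our assumptions (the displayed definition did not survive extraction), and we take g(t) = β(t) literally from the text:

```python
import numpy as np

# triangular beta(t): beta0 at the boundaries, beta1 at t = 1/2 (assumed shape)
beta0, beta1 = 1e-5, 1e-4
n = 1000
t = np.linspace(0.0, 1.0, n)
beta = np.where(t < 0.5,
                beta0 + 2.0 * t * (beta1 - beta0),
                beta1 - 2.0 * (t - 0.5) * (beta1 - beta0))
g2 = beta**2                        # g(t)^2 with g(t) = beta(t)
sigma2 = np.cumsum(g2) / n          # sigma_t^2     = int_0^t g^2(tau) dtau
sigma2_bar = sigma2[-1] - sigma2    # bar sigma_t^2 = int_t^1 g^2(tau) dtau

def posterior(xa, xb, i):
    """Analytic Gaussian posterior q(x_t | x_a, x_b) at time-grid index i."""
    s2, sb2 = sigma2[i], sigma2_bar[i]
    mu = (sb2 * xa + s2 * xb) / (s2 + sb2)
    var = s2 * sb2 / (s2 + sb2)
    return mu, var

# the bridge interpolates: mean near x_a at t ~ 0 and near x_b at t ~ 1
mu_mid, var_mid = posterior(0.0, 1.0, n // 2)
```

Sampling x_t = μ_t + √Σ_t ε with ε ∼ N(0, 1)^d then provides the training pairs for the network ε_θ.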

III. UNFOLDING METHODOLOGY
We test SBUnfold using the public dataset from Ref. [5], available on Zenodo [34] and briefly summarized in the following. Proton-proton collisions producing a Z boson are generated at a center-of-mass energy of √s = 14 TeV. A non-trivial test of unfolding requires at least two datasets: one that acts as the 'data' and one that is the 'simulation'. For the 'data', collisions are simulated with the default tune of Herwig 7.1.5 [35-37]. We only make use of the reconstructed events from Herwig, as would be the case with real data. The Herwig particle-level events are only used when evaluating the performance of the different methods. For the 'simulation', events are simulated with Tune 26 [38] of Pythia 8.243 [39-41]. Detector distortions are simulated with Delphes 3.4.2 [42] and the CMS tune that uses a particle flow reconstruction. In future work, we will investigate the performance of SBUnfold on the full phase space of particles (particle-level) and particle flow objects (detector-level). For this study, we focus on a fixed set of six observables that serve as the benchmark for fixed-dimension unfolding. These observables are computed from the substructure of the leading jet. The jets are clustered using all particle flow objects at detector level and all stable non-neutrino truth particles at particle level. They are defined by the anti-k_T algorithm [43] with radius parameter R = 0.4, as implemented in FastJet 3.3.2 [44,45]. The first four observables are the jet mass m, constituent multiplicity M, the N-subjettiness ratio τ_21 = τ_2^(β=1)/τ_1^(β=1) [46,47], and the jet width w (implemented as τ_1^(β=1)). The remaining two observables are the groomed jet mass ln ρ = ln(m_SD²/p_T²) and the momentum fraction z_g after Soft Drop grooming [48,49] with z_cut = 0.1 and β = 0. Many of the observables are computed with FastJet Contrib 1.042 [50].
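As an illustration of how such observables are built from jet constituents, here is a minimal numpy sketch of the jet mass and ln ρ; note that the paper uses the Soft Drop groomed mass m_SD for ln ρ, while this sketch uses the plain (ungroomed) jet four-vector and a toy input:

```python
import numpy as np

def jet_mass_and_lnrho(constituents):
    """Jet mass m and ln rho = ln(m^2 / pT^2) from an (n, 4) array of (E, px, py, pz)."""
    E, px, py, pz = np.asarray(constituents).sum(axis=0)   # sum the four-vectors
    m2 = max(E**2 - px**2 - py**2 - pz**2, 0.0)            # invariant mass squared
    pt2 = px**2 + py**2                                    # transverse momentum squared
    return np.sqrt(m2), np.log(m2 / pt2)

# two massless constituents at right angles in the transverse plane (toy input)
m, lnrho = jet_mass_and_lnrho([[1.0, 1.0, 0.0, 0.0],
                               [1.0, 0.0, 1.0, 0.0]])
```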
The Z bosons are required to have p_T > 200 GeV in order to render acceptance effects negligible. Each dataset consists of about 1.6 million events.
The noise prediction model of SBUnfold is implemented using a fully-connected architecture incorporating multiple skip connections. Specifically, the model employs four ResNet [51] blocks, where each residual layer is connected to the output of a single-layer network through a skip connection. The activation function is LeakyReLU [52] with a slope of α = 0.01, and all layer sizes are set to 32. The model is implemented using PyTorch [53]. Generation is carried out by starting from data observations at reconstruction level and using the SB to move towards generator-level events, with 1000 time steps used to improve accuracy. The cINN implementation uses the normalizing flow model presented in Ref. [7], while the first step of the first iteration of OmniFold uses a fully-connected model with three hidden layers, ReLU activation functions, and node sizes of 50, 150, and 50, before an output layer with a single node and sigmoid activation. Both the cINN and OmniFold are implemented in TensorFlow [54]. For all models, the number of trainable parameters is kept similar, at around 13k. The initial learning rate is set to 10⁻³ and all models are trained for 30 epochs with a batch size of 128. A total of 1M training events taken from the Pythia simulations are used to train all models. The pseudo-data used to evaluate the performance of the unfolding is taken from Herwig. Both SBUnfold and the cINN do not have access to the data during training, while OmniFold uses a total of 2M training events (1M from Pythia and 1M from Herwig). We investigate the impact of the pseudo-data training size in Sec. IV when determining the unfolded results. All trainings are carried out on the Perlmutter supercomputer using a single NVIDIA A100 GPU.
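The residual structure can be sketched in a few lines of numpy; the real model is a PyTorch network, so the weight scales and the exact wiring of the extra skip connections below are illustrative assumptions:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # LeakyReLU with slope alpha = 0.01, as in the paper
    return np.where(x > 0, x, alpha * x)

def resnet_block(x, W1, b1, W2, b2):
    """Fully-connected residual block: two dense layers plus a skip connection."""
    h = leaky_relu(x @ W1 + b1)
    return x + leaky_relu(h @ W2 + b2)   # the skip connection adds the input back

rng = np.random.default_rng(0)
d = 32                                   # all layer sizes are 32 in the paper
x = rng.normal(size=(4, d))              # a toy batch of 4 events
params = [(rng.normal(scale=0.1, size=(d, d)), np.zeros(d),
           rng.normal(scale=0.1, size=(d, d)), np.zeros(d)) for _ in range(4)]
h = x
for W1, b1, W2, b2 in params:            # four stacked blocks, as in SBUnfold
    h = resnet_block(h, W1, b1, W2, b2)
```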

IV. RESULTS
We evaluate the performance of SBUnfold compared to other unfolding methodologies by first looking at unfolded distributions where both the pseudo-data and the samples used to train the different unfolding methods originate from the Pythia simulation. The unfolded distributions are shown in Fig. 1. In App. A, we also provide a comparison with the results obtained by a standard diffusion model.
We observe good agreement between the expected distributions from Pythia at generator level and the responses of the generative models used for unfolding. We omit the OmniFold results since the weighting function becomes trivial when data and simulation are statistically identical. We also investigate the distribution of the pairwise correlation between each pair of unfolded features to verify that SBUnfold is also capable of learning the correct correlations between distributions. The results are shown in Fig. 2.
We again observe good agreement between the generated distributions and the expected distributions from the simulation for both generative models, with SBUnfold showing an improved description compared to the cINN for a few distributions, such as z_g vs. ln ρ and z_g vs. τ_21. The distribution of z_g shows a sharp cutoff at z_g = 0, which is hard to reproduce with generative models. Since the same sharp feature is present at reconstruction level, SBUnfold can take advantage of the reconstructed prior when starting the diffusion process, contrary to the cINN, which always requires samples from the standard normal distribution for the prior transformation. We also quantify the agreement between unfolded distributions and the expected values derived from Pythia by calculating both the triangular discriminator metric [55-57] over histograms, as presented in Fig. 1, and the Earth mover's distance (EMD) directly on the marginalized one-dimensional features generated by the unfolding methods. The results are listed in Table I. For all distributions we observe significantly lower EMD values for SBUnfold compared to the cINN, with four out of six also showing lower values of the triangular discriminator. Uncertainties on the EMD are derived from the standard deviation of 100 bootstraps with replacement taken from the unfolded data. We observe that the EMD uncertainties from the cINN are often higher than those from SBUnfold, which points to improved stability of SBUnfold due to the more informative prior.
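Both metrics are straightforward to implement for one-dimensional marginals; a minimal sketch (the toy inputs and the normalization convention of the discriminator are ours):

```python
import numpy as np

def emd_1d(u, v):
    """1D earth mover's distance between two samples: integral of |CDF_u - CDF_v|."""
    grid = np.sort(np.concatenate([u, v]))
    cu = np.searchsorted(np.sort(u), grid, side='right') / len(u)
    cv = np.searchsorted(np.sort(v), grid, side='right') / len(v)
    return np.sum(np.abs(cu - cv)[:-1] * np.diff(grid))

def triangular_discriminator(p, q):
    """Triangular discriminator between two histograms (one common normalization)."""
    p, q = p / p.sum(), q / q.sum()
    m = p + q
    valid = m > 0
    return 0.5 * np.sum((p[valid] - q[valid]) ** 2 / m[valid])

# toy check: two 2-point samples shifted by 0.5 have EMD 0.5
d_shift = emd_1d(np.array([0.0, 1.0]), np.array([0.5, 1.5]))
```

Note that the EMD is computed on unbinned samples, while the triangular discriminator takes binned histograms, mirroring how the two metrics are used in Table I.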
Next, we keep the same SBUnfold and cINN models trained on Pythia samples, but instead use reconstructed events from Herwig as the data. We compare the unfolded results for the different distributions in Fig. 3.
We observe improved agreement in all unfolded distributions compared to the distributions of reconstructed events, while systematic shifts are observed in all unfolding methodologies for the jet mass and τ_21 distributions. These features highlight the prior dependence of all models, which cannot be excluded without the M step and additional iterations. In Table II we calculate the EMD and triangular discriminator using the unfolded distributions with Herwig as pseudo-data.
Once again we observe good performance from SBUnfold, which achieves the lowest EMD values for four out of six distributions and the lowest triangular discriminator values for three of the six. OmniFold achieves similar performance for all observables; in particular, OmniFold shows improved performance for the particle multiplicity, which is not a continuous distribution and hence harder for generative models to determine precisely. We also study how SBUnfold corrects physics observables back to generator level by calculating the migration matrix between reconstructed and unfolded observables. Results showing that SBUnfold learns to apply small corrections to reconstructed events are given in App. B.
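Such a migration matrix is a row-normalized 2D histogram between the reconstructed and unfolded values of an observable; a sketch with hypothetical binning:

```python
import numpy as np

def migration_matrix(reco, unfolded, edges):
    """Row-normalized 2D histogram: entry [i, j] = P(unfolded bin j | reco bin i)."""
    H, _, _ = np.histogram2d(reco, unfolded, bins=[edges, edges])
    return H / np.clip(H.sum(axis=1, keepdims=True), 1, None)

# if unfolding changed nothing, the migration matrix would be the identity
reco = np.array([0.1, 0.5, 0.9])
edges = np.array([0.0, 1/3, 2/3, 1.0])
M_identity = migration_matrix(reco, reco, edges)
```

A near-diagonal matrix corresponds to the small corrections that SBUnfold is observed to apply.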
OmniFold is also the only algorithm that uses the statistical power of the data to determine the unfolded distributions. We investigate how the unfolding results change when the available number of data entries is reduced to only 1,000, instead of the 600,000 used previously. For both SBUnfold and the cINN this change only modifies the number of generated samples from the trained model, while OmniFold only has access to the 1,000 data examples during training; the number of Pythia samples is unchanged at 1,000,000. The unfolded results are shown in Fig. 4.
Due to the lower number of data entries we observe larger statistical fluctuations in the unfolded results, which nevertheless still show good agreement with the expected quantities. We evaluate the unfolded results further by calculating the EMD and triangular discriminator, listed in Table III.
The reported results for SBUnfold and the cINN are less affected by the reduced data size, even though the EMD uncertainties have greatly increased due to the limited number of pseudo-data examples. On the other hand, OmniFold shows a stronger degradation compared to the previous results, with EMD values often disagreeing by more than two standard deviations.
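The bootstrap procedure behind these EMD uncertainties, and their growth as the unfolded sample shrinks, can be sketched as follows; the sample sizes and distributions are toy assumptions, and for equal-size samples the 1D EMD reduces to the mean absolute difference of order statistics:

```python
import numpy as np

def emd_equal(u, v):
    # 1D EMD for equal-size samples: mean |difference of order statistics|
    return np.mean(np.abs(np.sort(u) - np.sort(v)))

def bootstrap_emd_std(unfolded, target, n_boot=100, seed=0):
    """Std. dev. of the EMD over bootstrap resamples (with replacement)."""
    rng = np.random.default_rng(seed)
    vals = [emd_equal(rng.choice(unfolded, size=len(unfolded), replace=True), target)
            for _ in range(n_boot)]
    return np.std(vals)

rng = np.random.default_rng(1)
s_small = bootstrap_emd_std(rng.normal(size=1000), rng.normal(size=1000))
s_large = bootstrap_emd_std(rng.normal(size=20000), rng.normal(size=20000))
```

The bootstrap spread shrinks roughly like n^{-1/2}, so the 1,000-event case carries visibly larger EMD uncertainties than the 600,000-event case.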

V. CONCLUSION AND OUTLOOK
In this paper we presented SBUnfold, an unfolding algorithm that uses a Schrödinger bridge (SB) to design a stochastic mapping between distributions. Since the SB allows the transport between arbitrary distributions, SBUnfold does not require a tractable prior, unlike other generative models, but instead uses a diffusion process to directly denoise physics observables from detector effects. (Our Schrödinger bridge is not a universal function approximator; it will map the distributions to each other, but may not preserve the conditional probability density [31]. Our numerical results indicate that this is not an issue, and we expect that monotonic detector distortions should be in the class of functions that SBs are able to accommodate, but we leave further investigations to future work. We thank Jesse Thaler for useful discussions on this point.) We have demonstrated the performance of SBUnfold using synthetic Z+jets samples and compared it with state-of-the-art methods based on generative models (cINN) and reweighting (OmniFold). Using different metrics, we observe excellent performance from SBUnfold. Compared to the cINN, SBUnfold shows improved fidelity for distributions with sharp features and smaller uncertainties overall, likely due to the more informative prior provided by reconstruction-level events. Compared to OmniFold, SBUnfold is more robust against variations in the number of available data observations, showing promising results for cases where the amount of available data is reduced.
We have only investigated the expectation step of the different unfolding algorithms, leaving the maximization step for future study. We note that the classifier-based maximization step used by the IcINN is also applicable to SBUnfold, and it would be interesting to compare how the unfolding results improve with subsequent iterations of both algorithms.
Finally, similar to OmniFold, SBUnfold can be readily adapted to different data structures such as image-like datasets. On the other hand, the diffusion process currently relies on the specific ordering of the features used during unfolding (see Sec. II), making the application of SBUnfold to low-level features, such as unordered sets of particles, dependent on making the model insensitive to the ordering of the inputs.

FIG. 5. Comparison between different unfolding algorithms for six different physics observables. Results are evaluated over 600,000 pseudo-data points. Statistical uncertainties are shown only in the ratio panel. Pseudo-data and simulation are described by Pythia.

TABLE I. Comparison of the earth mover's distance (EMD) and triangular discriminator between different algorithms. EMD is calculated over unbinned distributions while the triangular discriminator uses histograms as inputs. Uncertainties on the EMD are derived using 100 bootstraps with replacement taken from the unfolded data. Results are evaluated using 600,000 pseudo-data points sampled from Pythia. Quantities in bold represent the method with the best performance.

TABLE III. Comparison of the earth mover's distance (EMD) and triangular discriminator between different algorithms. EMD is calculated over unbinned distributions while the triangular discriminator uses histograms as inputs. Uncertainties on the EMD are derived using 100 bootstraps with replacement taken from the unfolded data. Results are evaluated using 1,000 pseudo-data points sampled from Herwig. Quantities in bold represent the method with the best performance.
FIG. 3. Comparison between different unfolding algorithms for six different physics observables. All observables are unfolded simultaneously without binning, with histograms shown only for evaluation. Results are evaluated over 600,000 pseudo-data points. Statistical uncertainties are shown only in the ratio panel. Pseudo-data is taken from Herwig while Pythia is taken as the main simulator.
FIG. 4. Comparison between different unfolding algorithms for six different physics observables. All observables are unfolded simultaneously without binning, with histograms shown only for evaluation. Results are evaluated over 1,000 pseudo-data points, the same number used to train OmniFold, while the other algorithms rely only on simulation. Statistical uncertainties are shown only in the ratio panel. Pseudo-data is taken from Herwig while Pythia is taken as the main simulator.

TABLE IV. Comparison of the earth mover's distance (EMD) and triangular discriminator between different unfolding methodologies. EMD is calculated over unbinned distributions while the triangular discriminator uses histograms as inputs. Uncertainties on the EMD are derived using 100 bootstraps with replacement taken from the unfolded data. Results are evaluated using 600,000 pseudo-data points sampled from Pythia. Quantities in bold represent the method with the best performance. Model EMD(×10)/Triangular Discriminator(×10³).
The FPCD outperforms the cINN in the EMD metric while still not achieving the same level of performance as SBUnfold. These results indicate that standard diffusion models may be more expressive than the normalizing flow model implemented in the original cINN, but still not as precise as SBUnfold, which leverages the more informative prior distribution from reconstruction-level objects rather than the Gaussian prior required by both the cINN and FPCD.