Flows for Flows: Morphing one Dataset into another with Maximum Likelihood Estimation

Many components of data analysis in high energy physics and beyond require morphing one dataset into another. This is commonly solved via reweighting, but there are many advantages to preserving weights and shifting the data points instead. Normalizing flows are machine learning models with impressive precision on a variety of particle physics tasks. Naively, normalizing flows cannot be used for morphing because they require knowledge of the probability density of the starting dataset. In most cases in particle physics, we can generate more examples, but we do not know densities explicitly. We propose a protocol called flows for flows for training normalizing flows to morph one dataset into another even if the underlying probability density of neither dataset is known explicitly. This enables a morphing strategy trained with maximum likelihood estimation, a setup that has been shown to be highly effective in related tasks. We study variations on this protocol to explore how far the data points are moved to statistically match the two datasets. Furthermore, we show how to condition the learned flows on particular features in order to create a morphing function for every value of the conditioning feature. For illustration, we demonstrate flows for flows on toy examples as well as a collider physics example involving dijet events.


I. INTRODUCTION
One common data analysis task in high energy physics and beyond is to take a reference set of examples R and modify them to be statistically identical to a target set of examples T. In this setting, we do not have access to the probability densities of x ∈ R^N responsible for R or T (i.e., p_R and p_T), but we can sample from both by running an experiment or simulator. Examples of this task include shifting simulation to match data for detector calibrations, morphing experimental or simulated calibration data to match backgrounds in signal-sensitive regions of phase space for background estimation or anomaly detection, and tweaking simulated examples with one set of parameters to match another set for parameter inference.
A well-studied way to achieve dataset morphing is to assign importance weights w so that w(x) ≈ p_T(x)/p_R(x). This likelihood ratio can be constructed using machine learning-based classifiers (see e.g. [1,2]) to readily accommodate N ≫ 1 without ever needing to estimate p_T or p_R directly. While highly effective, likelihood-ratio methods also have a number of fundamental challenges. With non-unity weights, the statistical power of a dataset is diluted. Furthermore, even small regions of non-overlapping support between p_T and p_R can cause estimation strategies for w to fail.
A complementary strategy to importance weights is direct feature morphing. In this case, the goal is to find a map f : R^N → R^N from the reference to the target space such that the probability density of f(x ∼ p_R) matches p_T. Unlike the importance sampling scenario, f is not unique. The goal of this paper is to study how to construct f as a normalizing flow [3,4], a type of invertible deep neural network most often used for density estimation or sample generation. Normalizing flows have proven to be highly effective generative models, which motivates their use as morphing functions. Traditionally, normalizing flows are trained in the setting where p_R is known explicitly (e.g. a Gaussian distribution). Here we explore how to use flows when neither p_R nor p_T is known explicitly. We call our method flows for flows. This approach naturally allows for the morphing to be conditional on some feature, such as a mass variable [5-7]. Approaches similar to flows for flows have been developed for variational autoencoders [8] and, recently, diffusion models [9].
In many cases in physics, p_R is close to p_T, and so f should not be far from the identity map. For example, R might be a simulation of data T, or R might be close to T in phase space. In order to assess how well suited normalizing flows are for this case, we also study how much x is moved by the morphing. An effective morphing map need not move the features minimally, but models that include this inductive bias may be more robust than those that do not. There is also a connection with optimal transport, which would be exciting to study in the future.
This paper is organized as follows. Section II briefly reviews normalizing flows and introduces all of the flows for flows variations we study. Next, Sec. III presents a simple application of the flows for flows variations to two-dimensional synthetic datasets. Sec. IV gives a more realistic application of the transport variations to sets of simulated particle collision data. We summarize the results and conclude in Sec. V.

A. Normalizing flows as transfer functions
Normalizing flows are classically defined by a parametric diffeomorphism f_ϕ and a base density p_θ for which the density is known. Using the change of variables formula, the log likelihood (parameterized by both θ and ϕ) of a data point x ∼ p_D under a normalizing flow is given by

log p_{θ,ϕ}(x) = log p_θ(f_ϕ^{-1}(x)) − log |det J(f_ϕ^{-1}(x))|,

where J is the Jacobian of f_ϕ. Training the model to maximize the likelihood of data samples results in a map f_ϕ^{-1} between the data distribution p_D(x) and the base density p_θ. As the base density should have a known distribution, it is usually taken to be a normal distribution of the same dimensionality as the data (which motivates the name "normalizing" flow).
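The change of variables formula above can be made concrete with a minimal sketch. The following illustrative one-dimensional affine "flow" (a stand-in we introduce for exposition, not the spline architecture used in this work) evaluates log p(x) as the base log-density of the pulled-back point plus the log-determinant term:

```python
import math

# Minimal 1D "flow": f_phi(z) = a*z + b maps base samples z to data x.
# Its inverse is f_inv(x) = (x - b)/a, with |det J_{f_inv}| = 1/|a|.
def log_prob_standard_normal(z):
    # log density of the standard normal base p_theta
    return -0.5 * (z * z + math.log(2 * math.pi))

def flow_log_prob(x, a, b):
    """Change-of-variables log-likelihood of x under the affine flow."""
    z = (x - b) / a                # pull x back to the base space
    log_det = -math.log(abs(a))    # log |det J_{f_inv}(x)|
    return log_prob_standard_normal(z) + log_det
```

By construction, `flow_log_prob(x, a, b)` reproduces the log density of a normal distribution with mean b and standard deviation |a|, which is exactly what the change of variables formula promises for an affine map of a standard normal base.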
At this point, we can introduce the first transfer method from a reference distribution p_R to a target distribution p_T, the base transfer. For this method, we train two normalizing flows with two different maps from the same base density. If f_ϕ1 constitutes a map to the reference density p_R and f_ϕ2 is a map to the target density p_T, then the composition f_ϕ2 ∘ f_ϕ1^{-1} is a transfer map f : R → T. In other words, the transfer method routes from reference to target via some base density intermediary.
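The base-transfer composition can be sketched with two analytic affine "flows" sharing a standard normal base (hypothetical one-dimensional Gaussian reference and target, chosen so both maps are exact rather than trained):

```python
# Base-transfer sketch: f1 maps base -> reference N(mu_r, s_r^2),
# f2 maps base -> target N(mu_t, s_t^2), both from the same standard
# normal base density.  The transfer routes through the base space.
def make_affine(mu, sigma):
    fwd = lambda z: sigma * z + mu      # base -> data direction
    inv = lambda x: (x - mu) / sigma    # data -> base direction
    return fwd, inv

f1, f1_inv = make_affine(mu=-2.0, sigma=0.5)   # "reference" flow
f2, f2_inv = make_affine(mu=3.0, sigma=1.5)    # "target" flow

def base_transfer(x):
    """Map a reference point to the target via the shared base density."""
    return f2(f1_inv(x))
```

Note that the two flows are trained entirely independently here; nothing in the construction links a reference point to any particular target point, which is the behavior seen later in the toy studies.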
It is also possible to use a learned base density, such as another normalizing flow, instead of some known base distribution. This is our second method, unidirectional transfer. Given samples from two data distributions p_R and p_T of the same dimensionality, a map f_γ : R → T between these distributions can be found by estimating a density p_{ϕ,R} for R to use as the base density in the construction of another normalizing flow. In practice, this involves first training a normalizing flow to learn the density p_R by constructing a map between a base density p_θ and p_R.
Training of the two normalizing flows (the first for the base density, the second for the transport) is done by maximizing the log likelihood of the data under the densities defined by the change of variables formula:

log p_γ(x) = log p_{ϕ,R}(f_γ^{-1}(x)) − log |det J(f_γ^{-1}(x))|,

where p_{ϕ,R} is the density learned by the first flow and J is the Jacobian of f_γ.
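The two-step logic can be sketched in a toy setting where both steps have closed forms. In this hypothetical example the reference and target are one-dimensional Gaussians, so "learning" the base density p_R by maximum likelihood reduces to fitting a mean and standard deviation, and the maximum-likelihood affine transport also has an exact solution (a real application would train flexible flows by gradient descent for both steps):

```python
import numpy as np

rng = np.random.default_rng(0)
x_ref = rng.normal(1.0, 2.0, 5000)    # samples from p_R (density unknown)
x_tgt = rng.normal(-3.0, 0.5, 5000)   # samples from p_T (density unknown)

# Step 1: "learn" the base density p_R from reference samples.
# For a Gaussian family, the MLE fit is the sample mean and std.
mu_r, s_r = x_ref.mean(), x_ref.std()

# Step 2: train the transport f_gamma(z) = a*z + b by maximizing the
# target likelihood under the learned base.  The optimum requires
# f_gamma^{-1}(x_tgt) to be distributed like the learned base, giving
# the closed-form solution below for this affine family.
mu_t, s_t = x_tgt.mean(), x_tgt.std()
a = s_t / s_r
b = mu_t - a * mu_r

x_moved = a * x_ref + b   # reference samples morphed toward the target
```

After this procedure, the morphed reference samples match the first two moments of the target exactly, illustrating that the transport is trained without ever writing down p_T.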
As a direct extension of the unidirectional training method, defining densities on both the reference and the target distributions, p_{θ1,R} and p_{θ2,T}, allows both f_γ and f_γ^{-1} to be explicitly used by training in both directions, from R to T and T to R. This comprises our third transfer method, flows for flows. A benefit of training in both directions is that the dependence of f_γ on the defined and learned densities p_{θ1,R} and p_{θ2,T} is reduced. A schematic of the flows for flows architecture is shown in Fig. 1.
The invertible network f_γ that is used to map between the two distributions may not have semantic meaning on its own, as some invertible neural networks are known to be universal function approximators. This map can become interpretable if it is subject to additional constraints. In this work, we investigate two physically-motivated modifications to the flow training procedure: movement penalty, where we add an L1 loss term to the flow training loss, and identity initialization, where we initialize the flow architecture to the identity function. The L1 variation directly penalizes the average absolute value of the distance moved, while the idea behind the identity initialization is that the model will converge on the first good solution it finds, which should be close to no movement. All five transfer methods introduced in this section are summarized in Tab. I.
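Both modifications can be sketched for an affine map (an illustrative stand-in; the penalty weight `lam` and the loss shape are our own hypothetical choices, not the exact values used in this work):

```python
import numpy as np

def movement_penalty_loss(x, f_x, nll, lam=1.0):
    """Flow training loss with an L1 movement term (sketch).

    nll : mean negative log-likelihood from the usual flow objective
    x   : input points;  f_x : their images under the flow
    lam : hypothetical penalty weight (a tunable hyperparameter)
    """
    movement = np.mean(np.abs(f_x - x))   # average absolute distance moved
    return nll + lam * movement

# Identity initialization for an affine map f(x) = a*x + b: starting at
# a = 1, b = 0 means the untrained flow moves nothing at all.
a, b = 1.0, 0.0
x = np.array([0.5, -1.2, 3.0])
```

With identity initialization, the first gradient steps only move points as far as the likelihood term demands, which is the inductive bias motivating the method.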
This entire setup can be made conditional by making the parameters of the invertible neural network dependent on some selected parameter (i.e. the "condition"). The log-likelihood for a normalizing flow conditioned on some variables c is defined by

log p(x|c) = log p_θ(f_ϕ^{-1}(x; c) | c) − log |det J(f_ϕ^{-1}(x; c))|,

where the base density can also be conditionally dependent on c. In the case of conditional distributions with continuous conditions, the distributions on data p_D(x|c) will often change smoothly as a function of the condition. For these situations, a flow that is explicitly parameterized by a well-motivated choice of conditioning variable may have a cleaner physical interpretation. We provide an example of such a flow for our application to particle collision datasets in Sec. IV. In particular, conditional flows have been used often in high energy physics to develop "bump hunt" algorithms to search for new particles [5-7,10-12]. In such studies, the resulting flows perform well when interpolated to values of the conditioning variable not used in training.
A schematic of a conditional flows for flows model is shown in Fig. 2, where the conditioning function f_{γ(c_x, c_y)} can also take more restrictive forms, such as f_{γ(c_x − c_y)}, to ensure that the learned map is simple [5,7]. Furthermore, the two conditional base distributions can be identical, such that ϕ1 = ϕ2. Alternatively, the base distributions can be different and instead a shared condition can be used, c = c_x = c_y.
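The conditional construction can be sketched as follows. Here the transport parameters are produced by a function of the conditions; `params_from_condition` is a hand-written placeholder for what would in practice be a small neural network, and the restrictive form depending only on the difference c_x − c_y is the simplification mentioned above:

```python
import numpy as np

def params_from_condition(c_x, c_y):
    """Hypothetical conditioning function: affine parameters depend only
    on the difference of the conditions, keeping the learned map simple."""
    delta = c_x - c_y
    a = np.exp(0.1 * delta)   # illustrative scale (placeholder, not trained)
    b = 0.5 * delta           # illustrative shift (placeholder, not trained)
    return a, b

def conditional_map(x, c_x, c_y):
    """Conditional transport f_{gamma(c_x - c_y)} applied to x."""
    a, b = params_from_condition(c_x, c_y)
    return a * x + b
```

One consequence of the difference-based form: when the two conditions coincide, the parameters reduce to a = 1, b = 0 and the map is the identity, a sensible behavior for morphing within a single dataset.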

B. Network architecture
Throughout this work, we use two different flow architectures: one for the "standard" normalizing flow architecture (i.e. learning transformations from standard normal distributions to arbitrary distributions) and one for the flows for flows architecture (i.e. learning transformations between two nontrivial distributions).
For the former architecture type, the invertible neural networks are constructed from rational quadratic splines with four autoregressive (AR) layers [13]. Each spline transformation has eight bins, and the parameters of the spline are defined using masked AR networks with two blocks and 128 nodes, as defined in the nflows package [14]. For the latter architecture type, we use eight AR layers with splines of eight bins from three masked AR blocks of 128 nodes. This slightly more complex architecture is found to give better performance for the large shifts between the toy distributions that we consider. However, in cases where the reference and the target distributions are similar to each other, the architecture of the flows for flows model could in principle be simplified for faster training time while maintaining good performance.
An initial learning rate of 10^-4 is annealed to zero following a cosine schedule [15] over 60 epochs for the first flow type and 64 epochs for the second flow type. All trainings use a batch size of 128, and the norm of the gradients is clipped to five. For the toy distribution analyses in Sec. III, the training datasets all contain 10^6 samples.
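For reference, a cosine annealing schedule of this kind can be written in a few lines. This is the standard restart-free form of the schedule in [15]; whether the original training code steps it per epoch or per batch is an assumption we do not fix here:

```python
import math

def cosine_lr(step, total_steps, lr0=1e-4):
    """Cosine-annealed learning rate: decays smoothly from lr0 to zero
    as step goes from 0 to total_steps."""
    return 0.5 * lr0 * (1.0 + math.cos(math.pi * step / total_steps))
```

For example, with `total_steps=60` the rate starts at 10^-4, passes through 5 × 10^-5 at the halfway point, and reaches zero at the final step.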

III. TOY EXAMPLE RESULTS
In this section, we explore the performance of the five transfer methods for learning a mapping between nontrivial two-dimensional distributions. In general, we consider both the accuracy of the transform (i.e., does the transfer method learn to successfully morph between the reference and the target distribution) and the efficiency of the transform (i.e., does the method learn a morphing that is logical, and not unnecessarily circuitous).

A. Base transfer vs. flows for flows
In Fig. 3, we show a transport task between two datasets drawn from a toy distribution of four overlapping circles. Here we are in some sense trying to learn the identity mapping. We compare the action of the base transfer, which can be seen as the "default" method of mapping between two nontrivial distributions, against the flows for flows method. Both methods are able to successfully map the overall shape of the reference to the target distribution. However, the base transfer method tends not to keep points in the same circle when mapping them from reference to target, while the flows for flows method is more successful at keeping larger portions of each ring together.
In Fig. 4, we show a transport task between two different distributions, from four overlapping circles to a four-pointed star. As before, both the base transfer and flows for flows methods are able to morph the shape of the reference distribution into the shape of the target distribution. Interestingly, the flows for flows method appears to distribute points from each of the four circles more equally among each point of the star.

B. Evaluating multiple transfer methods
In Fig. 5, we evaluate just the shape-morphing ability of the transfer methods. We consider six reference-target pairings, where the reference and target distributions are different, and show the action of the base transfer, unidirectional transfer, flows for flows, movement penalty, and identity initialization methods on the reference distribution. We consider transports between three toy distribution types: four overlapping circles, a four-pointed star, and a checkerboard pattern. All of the transfer methods considered are able to successfully learn to map from reference to target, except for the unidirectional transfer, which exhibits a large amount of smearing in the final distribution. Overall, the base transfer, movement penalty, and identity initialization methods show the cleanest final-state distributions.

FIG. 3: Transport task between two instantiations of the same distribution. The first column shows the reference distribution; the second column shows the base transfer method acting on the reference distribution; the third column shows the flows for flows method. Individual samples have been color coded so as to make clear their paths assigned by the transport method.

FIG. 4: Transport task between two different distributions. Individual samples have been color coded so as to make clear their paths assigned by the transport method.

Another useful metric is the distance traveled by a sample that is mapped under a flow action. For many physical applications, a map that moves data the least is ideal, but we have only explicitly added an L1 loss term to the movement penalty method. Therefore it is interesting to consider how far, on average, all the methods move the features.
In Fig. 6, we show a histogram of the distances traveled, amassing all of the six transfer tasks shown in the rows of Fig. 5 so as to equalize over many types of starting and target shapes. The movement penalty method performs best, producing the shortest distances traveled from reference to target by a large margin compared with the other methods. Interestingly, the flows for flows and identity initialization methods have larger mean distances traveled than the base transfer method, as well as larger standard deviations. This is somewhat counterintuitive given that the base transfer method does not explicitly link the reference and target distributions during the training procedure, but it may reflect the somewhat contrived nature of the toy examples (especially in light of the more intuitive results for the science datasets in Fig. 9). All methods except the unidirectional transfer perform on par with or better than the expected baseline, which comes from computing the distances between two random, unrelated instantiations of each reference-target distribution pairing.
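The distance metric and the random-pairing baseline described above can be sketched as follows (an illustrative implementation; the actual analysis code may differ in detail):

```python
import numpy as np

def distances_moved(x_ref, x_moved):
    """Per-point Euclidean distance between inputs and their images
    under the transport map."""
    return np.linalg.norm(x_moved - x_ref, axis=1)

def baseline_distances(x_ref, x_tgt, rng):
    """Distances between randomly paired, unrelated points: the
    'baseline' histogram described in the text."""
    perm = rng.permutation(len(x_tgt))
    return np.linalg.norm(x_tgt[perm] - x_ref, axis=1)
```

Histogramming `distances_moved` for each method and comparing against `baseline_distances` gives the comparison shown in Fig. 6.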

IV. APPLICATION: CALIBRATING COLLIDER DATASETS
We now move to a physical example: mapping between distributions of scientific observables. Many analyses of collider data are geared towards finding evidence of new physics processes. One powerful search strategy is to compare a set of detected data with an auxiliary dataset, where the auxiliary dataset is known to contain Standard Model-only physics. Any nontrivial difference between the detected and the auxiliary datasets could then be taken as evidence for the existence of new physical phenomena.
The above analysis procedure is contingent upon the auxiliary dataset being a high-fidelity representation of Standard Model physics. However, such an assumption does not hold for many datasets that would be, at first glance, ideal candidates for the auxiliary dataset, such as simulations of Standard Model processes or detected data from adjacent regions of phase space. Therefore it is necessary to calibrate the auxiliary dataset such that it becomes ideal. Historically, this calibration task has been performed using importance weights estimated from ratios of histograms, either using data-driven approaches like the control region method or fully data-based alternatives. Recently, machine learning has enabled these approaches to be extended to the case of many dimensions and/or no binning; see e.g. Ref. [16] for a review.
With the flows for flows method, we can consider yet another calibration approach: to create an ideal auxiliary dataset (the target) by morphing the features of a less-ideal, imperfect auxiliary dataset (the reference). When the imperfect auxiliary dataset is chosen to be close to the ideal dataset, as would be true of the candidates listed in the previous paragraph, then the flows for flows method should simply be a perturbation on the identity map.

FIG. 6: Distances traveled in parameter space between two nonidentical toy distributions. Each histogram compiles data from six transfer tasks, corresponding to the rows of Fig. 5. The "baseline" method shows the distances between two random, unrelated instantiations of each reference-target distribution pairing. The maximum possible distance travelable in parameter space is 11.31.

A. Analysis procedure and dataset
We focus on the problem of resonant anomaly detection, which assumes that given a resonant feature M, a potential new particle will have |M − M0| ≲ s (which defines the signal region) for some unknown M0 and often knowable s [17]. The value of M0, which corresponds to the mass of the new particle, can be derived from theoretical assumptions on the model of new physics or can be found through a scan. Additional features X ∈ R^N are chosen which can be used to distinguish the signal (the new particle) from background (Standard Model-like collisions); this can be done by comparing detected data with reference data within the signal region.
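The signal-region selection is just a window cut on the resonant feature, which can be sketched as:

```python
import numpy as np

def signal_region_mask(M, M0, s):
    """Boolean mask selecting events with |M - M0| < s."""
    return np.abs(M - M0) < s

# Example with the values used later in the text: M0 = 3.5 TeV and a
# half-width of 0.2 TeV give the band M in [3.3, 3.7] TeV.
M = np.array([3.1, 3.35, 3.5, 3.69, 3.8])   # hypothetical masses in TeV
in_sr = signal_region_mask(M, M0=3.5, s=0.2)
```

Events with `in_sr` false form the sidebands used for training in the blinded analysis described below.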
For our datasets, we use the LHC 2020 Olympics R&D dataset [18,19], which consists of a large number (∼10^6) of Standard Model simulation events. The events naturally live in a high-dimensional space, as each contains hundreds of particles with momenta in the x, y, and z directions. To reduce the dimensionality, the events are clustered into collimated sprays of particles called jets using the FastJet [20,21] package with the anti-k_t algorithm [22] (R = 1). From these jets, we can pull a compressed feature space of only five dimensions; this set of features has been extensively studied in collider analyses. The jet features, along with the resonant feature M, are displayed in Fig. 7. We take the band M ∈ [3.3, 3.7] TeV as our signal region.

A conditional flows for flows map can also be applied to the same dataset but selecting different conditional values. This method of flows for flows is used to train the Constructing Unobserved Regions with Maximum Likelihood Estimation method [7] and is studied for toy datasets in App. B.
The LHC Olympics dataset contains two sets of Standard Model data generated from the different simulation toolkits Pythia 8.219 [23,24] and Herwig++ [25]. We use the former as a stand-in for detected collider data. The latter is used as the reference dataset: the less-than-ideal auxiliary dataset that is calibrated through the flows for flows method to form the ideal auxiliary (target) dataset.
To construct the ideal auxiliary dataset, we train a flow to learn the mapping between the reference dataset and the target data outside of the signal region, so as to keep the signal region blinded. Once trained, the flow can then be applied to the non-ideal auxiliary dataset within the signal region, thus constructing the ideal auxiliary dataset. We use the same architectures as in Sec. II B, with the modification that we condition the transport flows on the mass feature M. This conditioning is motivated by the fact that the flow is trained outside the signal region and applied within the signal region, which is defined exactly by the variable M.
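The blinded workflow (train on sidebands, apply in the signal region) can be sketched as follows; the `flow` callable here is a placeholder for the trained conditional transport, and the helper names are our own illustrative choices:

```python
import numpy as np

def split_by_signal_region(X, M, M0=3.5, s=0.2):
    """Blinding split: return (sideband X, sideband M, signal-region X,
    signal-region M).  Training uses only the sideband part."""
    in_sr = np.abs(M - M0) < s
    return X[~in_sr], M[~in_sr], X[in_sr], M[in_sr]

def apply_transport(flow, X_sr, M_sr):
    """Evaluate a (trained) conditional transport at signal-region masses."""
    return flow(X_sr, M_sr)

# Placeholder for a trained conditional flow; here simply the identity.
identity_flow = lambda X, M: X

X = np.arange(10, dtype=float).reshape(5, 2)   # hypothetical features
M = np.array([3.0, 3.4, 3.5, 3.6, 4.0])        # hypothetical masses (TeV)
X_sb, M_sb, X_sr, M_sr = split_by_signal_region(X, M)
```

In the real analysis, the flow is fit only on `(X_sb, M_sb)` and then evaluated at the signal-region masses, relying on the interpolation behavior of conditional flows noted in Sec. II.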

B. Results
In Fig. 8, we provide the distributions of the flow-moved reference dataset compared to the target dataset, as well as the ratios to the target, outside of the signal region. As is clear from Fig. 7, the reference and target datasets are far more similar in this calibration example than they were in the toy examples. Therefore, for the movement penalty method, it was necessary to scan over the strength of the L1 term added to the training loss in order to achieve good performance; we found that we needed to reduce the strength by a factor of 20 compared with what was used for the toy distributions. In fact, all five transfer methods (base transfer, unidirectional transfer, flows for flows, movement penalty, and identity initialization) perform comparably, and all five are able to successfully transform the reference dataset such that the five marginal feature distributions greatly resemble those of the target.
In Fig. 9, we show a histogram of the distances traveled for each data point due to the flow action. Distributions for distance traveled in each individual dimension of feature space are given in Fig. 10. Since the reference and target distributions are so similar, the base transfer method leads to a highly non-minimal transport path. While the unidirectional method performs well, it shows a longer tail in distance traveled that may represent a less-than-ideal mapping. The flows for flows and identity initialization methods perform comparably with relatively little distance traveled, while movement penalty appears to have found a nearly minimal path.
Based on the closeness of the distributions of the reference and target in Fig. 7, we might hope for a mapping that morphs features m_J1, Δm_JJ, and ΔR_JJ almost not at all, and features τ21_J1 and τ21_J2 very minimally. Indeed, this is exactly the behavior we see in Fig. 10 for the movement penalty method (and, to a lesser extent, for the flows for flows and identity initialization methods).

V. CONCLUSIONS AND FUTURE WORK
In this work, we have explored a number of ways to use normalizing flows to create mappings between nontrivial reference and target datasets of the same dimensionality. Our aim is to consider methods that go above and beyond the "naive" base transfer method, which uses standard normalizing flows that map from reference to target via a base density intermediary. In particular, we have introduced the flows for flows method, which uses two normalizing flows to parameterize the probability densities of both the reference and the target and trains both with exact maximum likelihood.
We have evaluated five transfer methods: base transfer, unidirectional transfer, flows for flows, movement penalty, and identity initialization. We have attempted to evaluate each method on two facets: the accuracy of the transport between reference and target, and the efficiency of the transport (i.e., how far points are moved by the mapping). When the reference and target are fully unrelated (such as for the toy examples in Sec. III), the flows for flows method is comparable with the naive base transfer method in both accuracy and efficiency. When the reference and target sets are similar, or obviously related in some way (such as for the particle physics calibration application in Sec. IV), the flows for flows method is far preferable to the base transfer method. These results imply that the flows for flows method should be used over the base transfer method, as it can always provide both an accurate and efficient transport. However, the highest performing (and thus our recommended) methods of transport are either the movement penalty or identity initialization methods, depending on the specific application.
There are many avenues for further modifications of the flows for flows method, or other ways to construct flow-based mapping functions in general. One interesting avenue involves physically-motivated extensions of normalizing flows: continuous normalizing flows (CNFs) [27], where the flow mappings can be constrained such that they can be assigned velocity vectors, and convex potential (CP) flows [28], where the map is constrained to be the gradient of a convex potential. One can explicitly enforce optimal transport with OT-Flows [29], which add to the CNF loss both an L2 movement penalty and a penalty that encourages the mapping to transport points along the minimum of some potential function. While such modifications may not be necessary when the reference and target distributions are very similar, they could be explored for situations when the reference and target distributions are significantly different.

For the additional toy studies in App. B, the unidirectional transfer performs extremely poorly, the flows for flows method performs next best, the movement penalty and identity initialization methods perform well, and the base transfer method shows the cleanest final state. Again, given the more intuitive performance rankings for the scientific dataset, the superior performance of the base transfer method here is likely more a reflection of the contrived nature of the chosen toy transport tasks.

FIG. 5: Transport tasks between various choices of nonidentical reference and target toy distributions. The colorbar has been set to scale logarithmically, which can emphasize out-of-distribution points.

FIG. 7: Reference and target distributions used in the application of the flows for flows procedure to scientific datasets. The feature space is comprised of the resonant feature M and five other features m_J1, Δm_JJ, τ21_J1, τ21_J2, and ΔR_JJ. A description of these observables can be found in [26]. The signal region is defined by |M − M0| < c for M0 = 3.5 TeV and c = 200 GeV.

FIG. 8: Distributions of and ratios of the flow-transported reference (less-than-ideal auxiliary) dataset to the target (ideal auxiliary) dataset. Ratios are taken over each of the five marginal distributions in the parameter space; error bars represent Poisson uncertainties in bin counts. All data is taken outside of the signal region. All features have been individually min-max scaled to the range [-3, 3] to optimize network training.

FIG. B.1: Transport task between a choice of identical reference and target toy distributions. Individual samples have been color coded so as to make clear their paths assigned by the transport method. The conditioning rotation of each distribution is given in the top right corner.

FIG. B.2: Transport tasks between various choices of identical reference and target toy distributions. The colorbar has been set to scale logarithmically, which can emphasize out-of-distribution points.

FIG. B.3: Transport task between a nonidentical choice of reference and target toy distributions. Individual samples have been color coded so as to make clear their paths assigned by the transport method.

FIG. B.5: Transport tasks between the four-circles and star distributions at a variety of conditioning angles. The colorbar has been set to scale logarithmically, which can emphasize out-of-distribution points.

TABLE I: We consider five transfer methods from a reference dataset to a target dataset, both with unknown densities p_R and p_T.
FIG. 10: Distances traveled in each dimension of the 5-dimensional parameter space between the reference (less-than-ideal auxiliary) dataset and the target (ideal auxiliary) dataset, outside of the signal region. The maximum possible distance travelable in each dimension of parameter space (for the min-max scaled features) is 6.