Preserving New Physics while Simultaneously Unfolding All Observables

Direct searches for new particles at colliders have traditionally been factorized into model proposals by theorists and model testing by experimentalists. With the recent advent of machine learning methods that allow for the simultaneous unfolding of all observables in a given phase space region, there is a new opportunity to blur these traditional boundaries by performing searches on unfolded data. This could facilitate a research program where data are explored in their natural high dimensionality with as little model bias as possible. We study how the information about physics beyond the Standard Model is preserved by full phase space unfolding using an important physics target at the Large Hadron Collider (LHC): exotic Higgs boson decays involving hadronic final states. We find that if the signal cross section is high enough, information about the new physics is visible in the unfolded data. We will show that in some cases, quantifiably all of the information about the new physics is encoded in the unfolded data. Finally, we show that there are still many cases when the unfolding does not work fully or precisely, such as when the signal cross section is small. This study will serve as an important benchmark for enhancing unfolding methods for the LHC and beyond.


I. INTRODUCTION
Analyses at the Large Hadron Collider (LHC) are generally classified as measurements or searches if their goal is to search for indirect or direct signs of physics beyond * pkomiske@mit.edu † wpmccormack@lbl.gov ‡ bpnachman@lbl.gov the Standard Model (SM), respectively. An important reason for this distinction is that measurements assume that deviations to the SM are small. This is required so that the removal of detector distortions ('unfolding') can be based on SM simulations. Traditional unfolding methods [1][2][3][4][5][6][7] are based on low-dimensional and binned observables. The detector response may depend on additional unmeasured features and may vary strongly within a given bin. If these properties are significantly different for new particles, then an unfolding derived with SM simulations is likely to be inaccurate. This feature of current unfolding methods has been studied in [8] and limits the applicability of recasting tools such as Contur [9]. A variety of complementary tools have been developed to fold model predictions with a detector response, including MadAnalysis [10][11][12][13][14], Recast [15][16][17], CheckMATE [18,19], SModelS [20,21], Fastlim [22], and XQCAT [23]. In addition to limitations from recasting approximations, these approaches are limited by the minimal (binned) search results that are usually highly optimized for particular signal models.
One possibility is to perform model-agnostic approaches at detector-level using one of the growing number of anomaly detection methods  (for reviews, see Ref [60,66]). These techniques can achieve broad and deep sensitivity by learning directly from data. However, methods that do not rely on any signal information (unsupervised) are not particularly sensitive [47,65] and methods that use noisy or partial signal information (weakly and semisupervised, respectively) are not recastable after the search is performed [44].
A new solution that has emerged is to perform an unbinned unfolding using all of the available information. If the high frequency and high dimensional aspects of the detector response are part of the unfolding procedure, then differences between signal and background will not be a source of bias. Unbinned and high-dimensional unfolding are now possible with advances in machine learning [67][68][69] (for other machine learning and unbinned proposals, see Ref. [70][71][72][73][74]). Of these, only the OmniFold [67] can currently process the full phase space that includes all observable particles and their properties. Unlike proposals based on generative models, OmniFold is built on neural network classifiers that are used to iteratively reweight simulations to match the data. Classifiers designed to process variable-length, unordered sets of particles allow this technique to access the full phase space [75,76]. While OmniFold has yet to be applied to collider data in the full phase space, it has recently been deployed in a low-dimensional case with the H1 experiment [77].
In this paper, we investigate the ability of OmniFold to preserve information about new particles present in the data. In particular, we will study the direct production of new particles which have spectra that are not similar to the SM background. Our benchmark example will be the exotic decay of a Higgs boson-like particle decaying into a Z boson and a light color singlet that decays into hadrons. The dominant background to this process is the SM production of Z bosons and jets. We will see to what extent the information about new particles are presevered in the unfolding. Recently, the authors of Ref. [69] showed that generative model approaches can preserve new physics with a relatively large cross section. Our first example will be motivated by this example and then we will explore how the sensitivity depends on variations in signal model parameters and unfolding setup.
This paper is organized as follows. Section II briefly reviews full phase space unfolding and introduces the benefits and challenges of the existing approach in the context of physics beyond the Standard Model. The simulation samples and machine learning setup are introduced in Sec. III. Section IV explores a case where a modelindependent new physics search technique, such as bump hunting, could be applied in unfolded data. Section V then studies an example of exotic Higgs boson decays, where simple bump-hunting would be less fruitful. Implications of model-dependent search program in unfolded data are explored in this section as well. The paper ends with conclusions and outlook in Sec. VI.

II. REVIEW OF OMNIFOLD UNFOLDING
The OmniFold method is represented visually in Fig. 1. There are two inputs: natural data from experiment and synthetic data from simulation. The goal is to remove the detector distortions from the observations ("Data") to infer the underlying particle-level distribution ("Truth"). Synthetic particle-level events ("Generation") provide the initial guess for the Truth and we have an eventby-event match between the Generation and detectorlevel synthetic data ("Simulation"). As in all unfolding Step 1:

Reweight Sim. to Data
Step 2: Reweight Gen. ⌫ n 1 !n ! ⌫ n < l a t e x i t s h a 1 _ b a s e 6 4 = " U S I / a H U k K N m e n 4 g w H q y T X Z 7 r T G s = " > A A A C J 3 i c d V D d S h t B G P 1 W + 6 N p r a l 6 1 5 v B U C g F l 1 0 b k n g X 6 I 2 X F h o V s i H M T r 5 N B u d n m Z l t G 5 Z 9 l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " U S I / a H U k K N m e n 4 g w H q y T X Z 7 r T G s = " > A A A C J 3 i c d V D d S h t B G P 1 W + 6 N p r a l 6 1 5 v B U C g F l 1 0 b k n g X 6 I 2 X F h o V s i H M T r 5 N B u d n m Z l t G 5 Z 9 l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " U S I / a H U k K N m e n 4 g w H q y T X Z 7 r T G s = " > A A A C J 3 i c d V D d S h t B G P 1 W + 6 N p r a l 6 1 5 v B U C g F l 1 0 b k n g X 6 I 2 X F h o V s i H M T r 5 N B u d n m Z l t G 5 Z 9 l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " a A N R 4 x q a 5 G 8 N 4 A 3 P G P 5 I 4 W 3 U W F E = " > A A A C J 3 i c d V D d S h t B G J 2 1 t m p s a 6 y X 3 g w G Q Y Q u u x q S e C d 4 4 6 U F 8 w P Z E G Y n 3 y a D 8 7 P M z K r L s u / Q 1 + g L 9 L a + g X e i l 7 3 x O Z x s I t T S H h g 4 n H M + v v l O n H J m b B A 8 e S v v V t 9 / W F v f q G 1 + / P R 5 q 7 7 9 p W d U p i l 0 q e J K D 2 J i g D M J X c s s h 0 X l B e M I 0 U M t z R w j V z P 0 V 0 x n R h F p X 4 5 s t C e R S p G X N F f N 6 P f 4 / 6 R 3 5 Y e C H 3 5 q N 0 8 6 y o n W 0 i / b Q A Q p R G 5 2 i c 3 S B u o i i 7 + g n + o X u v B / e v f f g P S 6 i K 9 5 y Z g e 9 g f f 7 B X 9 h q A I = < / l a t e x i t > ⌫ n 1 Data ! ! n < l a t e x i t s h a 1 _ b a s e 6 4 = " I V 5 R k c z 2 A J e i G q 3 8 y K s p / w T c R w M = " > A A A C e n i c d V F L a 9 w w E J b d V 7 J 9 Z J P 0 1 o v I U m h p a u T a + 7 o t t I c e U + g m g b V Z Z K 3 s F d H D S H L a x e i H 9 l z o r + i h W m 8 K T d o O C D 6 + m W + + m V F R c 2 Y s Q t + C 8 N 7 9 B w 8 f 7 e 3 3 H j 9 5 + u y g f 3 h 0 b l S j C Z 0 T x Z W + L L C h n E k 6 t 8 x y e l l r i k X B 6 U V x 9 X 6 b v 7 i m 2 j A l P 9 t N T X O B K 8 l K R r D 1 1 L L v 2 q x r s t B V k b c o G o + m c T o + R V E y T J J 0 6 A E a T Z I U u U w 2 y 1 a + j R 3 M v m p W r S 3 W W n 1 p M y 3 g n R a o i 9 O / g P u A L X Z e r w S t s G / m 3 L I / Q N F o l K T J B H p L N E Z p D L e W w + k 0 h f G N c D B 7 D r o 4 W / Z / Z C t F G k G l J R w b s 4 h R b f M W a 8 s I p 6 6 X N Y b W m F z h i i 4 8 l F h Q k 7 f d e A 6 + 9 M w K l k r 7 J y 3 s 2 D 8 V L R b G b E T h K w W 2 a 3 M 3 t y X / l V s 0 t p z k L Z N 1 Y 6 k k O 6 O y 4 d A q u L 0 5 X D F N i e U b D z D R z M 8 K y R p r T K z / m V s u J d 1 I U b u e P 8 z v 7 e H / w f m 7 K E Z R / C k d z C a 7 C 4 E 9 8 A K c g F c g B m M w A x / B G Z g D A r 4 H + 8 F R c B z 8 D E / C 1 + G b X W k Y 3 G i O w a 0 I 0 1 8 1 Z 7 s O < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " I V 5 R k c z 2 A J e i G q 3 8 y K s p / w T c R w M = " > A A A C e n i c d V F L a 9 w w E J b d V 7 J 9 Z J P 0 1 o v I U m h p a u T a + 7 o t t I c e U + g m g b V Z Z K 3 s F d H D S H L a x e i H 9 l z o r + i h W m 8 K T d o O C D 6 + m W + + m V F R c 2 Y s Q t + C 8 N 7 9 B w 8 f 7 e 3 3 H j 9 5 + u y g f 3 h 0 b l S j C Z 0 T x Z W + L L C h n E k 6 t 8 x y e l l r i k X B 6 U V x 9 X 6 b v 7 i m 2 j A l P 9 t N T X O B K 8 l K R r D 1 1 L L v 2 q x r s t B V k b c o G o + m c T o + R V E y T J J 0 6 A E a T Z I U u U w 2 y 1 a + j R 3 M v m p W r S 3 W W n 1 p M y 3 g n R a o i 9 O / g P u A L X Z e r w S t s G / m 3 L I / Q N F o l K T J B H p L N E Z p D L e W w + k 0 h f G N c D B 7 D r o 4 W / Z / Z C t F G k G l J R w b s 4 h R b f M W a 8 s I p 6 6 X N Y b W m F z h i i 4 8 l F h Q k 7 f d e A 6 + 9 M w K l k r 7 J y 3 s 2 D 8 V L R b G b E T h K w W 2 a 3 M 3 t y X / l V s 0 t p z k L Z N 1 Y 6 k k O 6 O y 4 d A q u L 0 5 X D F N i e U b D z D R z M 8 K y R p r T K z / m V s u J d 1 I U b u e P 8 z v 7 e H / w f m 7 K E Z R / C k d z C a 7 C 4 E 9 8 A K c g F c g B m M w A x / B G Z g D A r 4 H + 8 F R c B z 8 D E / C 1 + G b X W k Y 3 G i O w a 0 I 0 1 8 1 Z 7 s O < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " I V 5 R k c z 2 A J e i G q 3 8 y K s p / w T c R w M = " > A A A C e n i c d V F L a 9 w w E J b d V 7 J 9 Z J P 0 1 o v I U m h p a u T a + 7 o t t I c e U + g m g b V Z Z K 3 s F d H D S H L a x e i H 9 l z o r + i h W m 8 K T d o O C D 6 + m W + + m V F R c 2 Y s Q t + C 8 N 7 9 B w 8 f 7 e 3 3 H j 9 5 + u y g f 3 h 0 b l S j C Z 0 T x Z W + L L C h n E k 6 t 8 x y e l l r i k X B 6 U V x 9 X 6 b v 7 i m 2 j A l P 9 t N T FIG. 1. Visual representation of the OmniFold method. This method relies on experimental "Data", which was caused by some underlying "Truth" distribution, and synthetic datasets at both particle-level ("Generation") and detector-level ("Simulation"). The Simulation is reweighted to match the Data using a classifer, and these weights, ωi are applied to Generation. The unweighted Generation is reweighted to match the Generation with ωi applied, giving particle-level weights, νi. This process is repeated iteratively, as the νi are applied to the Simulation, and this reweighted Simulation is once again reweighted to Data. Based on the corresponding figure from Ref. [67].
algorithms, we assume that the detector response is wellmodeled.
The first step in the OmniFold method is to train a classifier to distinguish the Data from the Simulation. An optimally trained classifier that minimizes one of the standard loss functions (cross entropy or mean squared error) will predict the probability that an event x det is drawn from Data instead of Simulation: Pr(Data|x det ). By applying the weight to each simulated event, the Simulation will closely resemble the Data.
The detector-level weights, ω 1 are then applied to the corresponding particle-level events, resulting in a reweighted distribution of particle-level events. A second classification step is required because these pulled back weights are not a proper function of the particle-level phase space: the same phase space point can be mapped to two different detector-level points with different weights under the stochastic mapping of the detector. The second classifier distinguishes the nominal Generation from the one using the weights from the first step. As with the first step, an optimally trained classifier will learn The matching between Generation and Simulation can be used to push the weights to detector-level and the entire process can be repeated N times. The final result is the Generation dataset with a set of per-event weights v N . We will parameterize the classifiers as neural networks. When x is the full phase space, i.e. a complete list of reconstructed or true particles with their observable properties, we need a neural network architecture that can process variable length, unordered sets. For this purpose, we use Particle Flow Networks (PFN) [75,76]. A reduced alternative approach will use a fixed number of high-level observables, which will use a standard fully connected neural network. To distinguish these two cases, we will call the full phase space version OmniFold and the reduced version MultiFold.

A. OmniFold in the Presence of New Physics
The main benefit of the OmniFold method comes from its use of low-level observables and the freedom from fixed bins. By using information about each reconstructed particle in an event or jet, the full phase space is exploited. Therefore, as long as the interaction of individual particles with the the detector is modelled well, beyond the Standard Model (BSM) physics should not negatively affect the ability to unfold. If there is BSM physics present in the data, then the most BSM-like events in Simulation will be upweighted as appropriate 1 . In traditional unfolding schemes that use regularized matrix inversion of binned histograms, the presence of beyond the Standard Model (BSM) physics could affect the detector response matrix in ways that are not accounted for in a SM-only simulation. OmniFold is less affected by this, and is not affected by the possibility of sub-optimal binning choices for BSM sensitivity.
However, there is a key assumption in the Omni-Fold method: the initial Simulation and the Data must have overlapping support. For example, if a heavy resonance existed at a mass well beyond the last data point in Simulation, then it would not be possible to upweight events to match the resonance even in a binned histogram case. In the OmniFold case, the initial Simulation should span the data in all dimensions. Empirically, the Simulations often used at the LHC share the same support as Data. In practice, one needs the ratios of probability densities to not be too far from unity because even if the support is overlapping, a very small likelihood ratio will have a large weight and thus poor statistical uncertainty. 1 Events may be downweighted in the case of interference effects

III. SIMULATION AND MACHINE LEARNING SETUP
Many models of new physics predict new heavy particles that decay to Standard Model particles. If the invariant mass of the decay particles is computed, a resonant enhancement in the mass spectrum should occur, centered on the new particle's mass. As an initial exploration of the efficacy of OmniFold in the presence of BSM physics, we consider the case of a new heavy scalar particle. Our study is based on proton-proton collisions generated at √ s = 14 TeV with Tune 26 [78] of Pythia 8.243 [79][80][81]. Signal events are generated as h → Za, a → gg, where m h has been set to 125 or 250 GeV, and various a masses have been used. This final state (with low m a and m h = 125 GeV) was recently studied by the ATLAS Collaboration in Ref. [82]. Detector effects are emulated with Delphes 3.4.2 [83], using the CMS detector card, which uses particle flow reconstruction. For this study, the Data, Truth, Simulation, and Generation sets consist of 200,000 events. Jets with radius parameter R = 0.4 are clustered using either all particle flow objects (detector-level) or stable non-neutrino truth particles (particle-level) with the anti-k T algorithm [84] implemented in FastJet 3.3.2 [85,86]. We consider leptonic decays of the Z boson, which can be precisely reconstructed. The target final state is then Z boson production in association with one jet that has non-trivial substructure. Events are selected if there is at least one truth-level and one detector-level particle within the jet 2 .
For OmniFold, we use all of the particles in the leading jet. Each particle is specified with its p T , rapidity, y, azimuthal angle, φ, and particle identification number. Furthermore, the invariant mass of the Z+jet system, the jet mass, and the jet multiplicity are included as global features. These data are processed using PFNs implemented in the EnergyFlow Python package [87]. The PFN architecture is composed of an encoder followed by a fully connected network. The encoder has two hidden layers of 200 nodes each and outputs a 256-dimensional latent vector. These vectors are summed over all particles and then the subsequent fully connected network is composed of three layers of 100 nodes each.
For MultiFold, we use ten features from each event, based on the Z boson properties and the leading jet. These features include the invariant mass of the Z+jet system, the jet mass, the jet constituent multiplicity, the jet p T , the Z p T , the jet Les Houches Angularity [88,89], the jet width [89][90][91][92], the groomed jet mass with Soft Drop parameters z cut = 0.1 and β = 0 [93], the groomed jet momentum fraction (same Soft Drop parameters), and the jet image activity, which is the minimum number of pixels in a jet image that contain 95% the total p T [94].
The MultiFold neural networks are composed of three hidden layers of 100 nodes each.
For each iteration of OmniFold and MultiFold, the neural network was trained with 120 and 20 epochs, respectively, and included an early stopping condition based on validation loss improvement. The validation sample was constructed from a random 20% of the events. The models are randomly initialized in the first iteration and subsequently warm-started using the model from the previous iteration. All neural networks are implemented using Keras [95] with the Tensorflow backend [96] and optimized with Adam [97].

IV. HEAVY SCALAR DECAY STUDY
First, we study the case of m h = 250 GeV where 10% of the data is BSM physics. This composition and signal model relative to the Z+jets background are qualitatively similar to the example presented in Ref. [69], which used generative models. Figure 2 shows the detector-level and truth-level distributions of Z+jet invariant mass before and after unfolding. At detector-level, which corresponds to the first step in an iteration of the OmniFold method, the distributions exhibit good agreement after unfolding: the height and width of the mass peak are reproduced accurately, especially in the MultiFold case. At truth-level, the peaks are not reproduced as sharply. In the MultiFold case, the height and width of the peak are similar to that seen at detector level, and in the OmniFold case, the peak is considerably broader.
Part of the broadening is an inherent challenge with non-trivial resolutions and limited statistics. The truthlevel peak quality can be recovered by modifying the Generation. We are free to choose whatever Generation we want as OmniFold is a maximum likelihood estimator that is prior-independent. However, the closer the prior is to the data, the more accurate the unfolding will be with finite statistics. To test this idea, the same Truth sample was used as above, with 180,000 SM events and 20,000 h → Za, a → gg, where m h = 250 GeV and m a = 16 GeV. However, now the Generation was taken to include 200,000 SM events, 10,000 h → Za, a → gg events with m h = 125 GeV for each of m a = 0.5, 1, 2, 4, 8, and 16 GeV, and 10,000 h → Za, a → gg events with m h = 250 GeV for each of the same m a values, for a total of 320,000 events. The truth-level results of unfolding with the same OmniFold setup discussed above are shown in Fig. 3. Here, both the height and weight of the truth-level peak are reproduced well by the reweighted sample. The fact that this works well, when an application of OmniFold with SM-only events did not, shows the importance of sufficiently covering the relevant regions of phase space.
Adding  with 320,000 events from the preceding paragraph is used again, but Data and Truth are taken to be 200,000 SMonly events. The OmniFold method is applied in the same way as above, and the resulting Z+jet invariant mass distributions are shown in Fig. 4. The bump in the unfolded distribution has been eliminated despite the fact that almost 40% of events in the Generation sample were drawn from BSM samples. To optimize the unfolding, care must be taken when choosing the BSM models to include in the Generation sample as well as when choosing the number of BSM events to include -manipulating these parameters effectively corresponds to choosing different priors for what is expected in the Data. This section has demonstrated that OmniFold can qualitatively preserve a relatively large 3 and prominent resonant signature from the data. A sideband technique could then be used to perform a search with these data. In the next section, we will explore the ability of Om-niFold to precisely preserve the phase space so that a multivariate classifier could be used for a search with the unfolded data.

V. EXOTIC HIGGS DECAY
The signal in Sec. IV was sufficiently prominent that it could be effectively searched for with a bump hunt in the Z+jet invariant mass. Not all new physics processes can be searched for with such a simple approach. To explore such a scenario, we consider m h = 125 GeV. In this case, the signal bump is near the background peak and so additional features beyond just the Z+jet invariant mass are required. We explore the possibility of performing a model-dependent search that uses dedicated BSM vs. SM discriminating variables. If OmniFold effectively unfolds the full phase space, it should be possible to use any combination of variables in unfolded data.

A. Unfolding with MultiFold
First, we can consider the case of MultiFold with two working points for BSM physics: • 0.1% of Data and Truth events are BSM physics, with m a = 16 GeV.
• 10% of Data and Truth events are BSM physics, with m a = 16 GeV.
For each working point, the Data, Truth, Simulation, and Generation sets will be once again be 200,000 events. In each case, the Simulation and Generation sets are drawn from the SM-only Pythia 8 sample. The SM events in Data and Truth are also drawn from the Pythia 8 sample, but no SM event can be used in both Data and Simulation.
The distributions of Z+jet invariant mass, jet mass, and jet multiplicity for Truth, Generation, and unfolded Generation are shown in Fig. 5. The impact of a 0.1% signal is difficult to detect in these one-dimensional histograms and MultiFold has correspondingly left the phase space mostly untouched. For the 10% signal, Mul-tiFold clearly improves the agreement of the distributions. Similar trends hold for alternative m a values as well (not shown).
The ratio panels in Fig. 5 show that MultiFold is qualitatively able to encode BSM physics in the unfolded data. To probe this in greater detail, we emulate a modeldependent search by training a fully supervised classifier to distinguish Z+jets events from the m a = 16 GeV signal. A sample of 90,000 SM and 90,000 BSM events was used for training, with 30% randomly held out as a validation set. The neural network has the same inputs and architecture as the one used for MultiFold 4 . If MultiFold preserves the complete phase space, then any threshold cut on this classifier should have the same efficiency with the unfolded data as it does with the Truth.
The number of Truth, Generation, and unfolded Generation events passing a cut on the neural network score, as a function of the cut value, is shown in Fig. 6. In both the 0.1% and 10% signal case, the number of events in the reweighted samples more closely matches the Truth than the raw Generation samples. The agreement between the reweighted sample and Truth in the 10% case is an impressive achievement, as the Truth and Unfolded yields after the application of the NN score cut is stable at one. The NN score is a specialized value that was not used in training, so it is clear that in this case, the most BSM-like events are being up-weighted to an appropriate degree. In contrast, the unfolding has not up-weighted the BSM events enough in the 0.1% case, highlighting the difficulty of working with such a small signal.

B. Unfolding with OmniFold
An investigation similar to the MultiFold case can be performed with OmniFold. The same m a and con- Generation events passing a cut on the neural network score, as a function of the cut value, in the case that 0.1% of the data comes from BSM physics (top) and 10% of the data comes from BSM physics (bottom). The unfolding was performed with MultiFold. The neural network was specifically trained to distinguish SM from BSM physics. The middle segment of both plots shows the ratio of the pre-and post-unfolding Generation yields to the Truth yield. The yield from the pre-unfolding Generation is not expected to agree well with the Truth, as there are no BSM events. However, if the most BSM-like events get upweighted by the unfolding, then the weighted sum of passing events increases. In a fully accurate unfolding the unfolded yield would match the yield from Truth. The bottom segment shows the fraction of Truth events that are BSM events at truth-level; as the cut value approaches 1, the events passing the selection must be more BSM-like, which is why the BSM-purity increases as a function of cut value.
tamination values are investigated. The distributions of Z+jet invariant mass, jet mass, and jet multiplicity for Truth, Generation, and unfolded Generation are shown in Fig. 7. The performance in these plots is similar to but slightly worse than that observed in Fig. 5.
It is also possible to train a PFN to distinguish SM from BSM events. This PFN is set up in the same way as the PFN used for OmniFold, but it is trained to discriminate a sample of 90,000 SM from 90,000 BSM events. Here, the area under the ROC curve is 0.94, achieving superior discrimination to the neural network described in Sec. V A (AUC of 0.73). Figure 8 shows the number of Truth, Generation, and unfolded Generation events that pass cuts on the PFN score. If this figure is compared to Fig. 6, it is evident that the post-cut yields in the OmniFold case do not agree with truth as well as in the MultiFold case. This is due to the challenges discussed in Sec. II A. When specifically trained to discriminate SM from BSM events, the PFN is highly accurate. The Generation sample poorly populates the very BSM-like region of this discriminator. Especially in the case of 0.1% contamination, the unfolded dataset would not enable a discovery of new physics, as the weights do not strongly affect the post-cut yields.

C. Including BSM Physics in Generation
Similar to the end of Sec. IV, we explore how the performance in the previous section changes if we add in BSM to the Generation. This can effectively populate the regions of phase space that are under-populated by the SM to enable a more precise post-unfolding search. For this purpose, we take the same 200,000 SM events in Sec. V B and add 10,000 events each from h → Za, a → gg samples with m a = 0.5, 1, 2, 4, 8, and 16 GeV, for a total of 260,000 events in Generation. The OmniFold method is carried out in exactly the same way as in Sec. V B. Distributions for the invariant Z+jet mass, jet mass, and jet multiplicity are shown in Fig. 9. By comparing the triangular discriminator metric between Fig. 9 and Fig. 7, it can be seen that when BSM physics is included in the Generation, the distributions generally agree slightly better after unfolding, particularly in the case of 10% contamination, despite worse initial  Fig. 10. The PFN is also able to discriminate events with different m a values relatively accurately 5 , and it is clear here that the Omni-Fold reweighted sample does not predict post-cut yields well. The average weights found for the different m a components of Generation are given in Table I. While the m a = 16 GeV events are upweighted relative to the lighter m a events, it is clear that in the 0.1% contamination case, the m a = 16 GeV events are not adequately 5 Such that few events with ma = 4 GeV will pass the NN cut, for example downweighted, and in the 10% contamination case, they are not adequately upweighted. Together with Fig. 8, it can be seen that while OmniFold is performed using the full phase space, it has difficulty properly weighting extreme regions of phase space that can be particularly useful to model-dependent searches.

VI. CONCLUSIONS AND OUTLOOK
The OmniFold and MultiFold methods can be used for unbinned, all-variable unfolding in the presence of BSM physics, but there are inherent limitations on its applicability for truth-level searches for new physics.
In general, the distributions of high-level observables are unfolded well, as in Fig. 2, 5, and 7. This would enable model-independent searches or searches that use relatively high-level variables as discriminants, especially if the new physics has a high cross-section. However, it is possible to devise strong BSM vs. SM discriminating variables that are not necessarily unfolded well, such as the neural network scores shown in Fig. 6 and 8. These discriminants, which probe relatively subtle regions of phase space would most likely be applied in a modeldependent search. While OmniFold uses the full phase space, it has difficulty unfolding such specialized variables. The best performance highlighted above is the 10% BSM contamination case with MultiFold, where the post-cut yield in the unfolded sample closely matches that found in Data. In the 0.1% contamination case with MultiFold, there is also an enhancement in the reweighted sample relative to the Generation sample, but the agreement with Truth is not as stable as the 10% case. Together with the lack of agreement in the OmniFold case, this suggests that it would be difficult to make a discovery of BSM physics unless the new physics comprises > 1% of data events. Such high rates of BSM contamination would likely be discovered through conventional means by experimentalists prior to the release of unfolded datasets. A significant issue in any attempt to perform a search with unfolded data is the inverse problem highlighted by Fig. 2. Information is lost as particles pass through the detector, as seen in the smearing of the truth-level peak. We have shown how this can be partially recovered by adding BSM events to the Generation. This also helps to populate the most BSM-like regions of phase space. For example, Fig. 3 shows that this can be a powerful means to accurately reproduce an invariant mass peak even at truth-level. However, this raises the natural question of how to choose the correct events to include in Generation. The study in Sec. V C highlights the fact that even though high-level distributions can be unfolded well when BSM events are included in Generation, specialized variables may not be unfolded well; in particular, the reweighted distributions in Fig. 9 are significantly different from both the Data and from what would be expected in a SM-only case. Because of this, a model-dependent search with the PFN discriminator would be ineffective in the 10% case and return a false positive in the 0.1% case.
Overall, our studies have shown that full phase space unfolding is a promising direction for post-measurement searches for resonant new physics. However, significant work is required to increase the precision of the unfolding and to cope with cases where there are phase space regions with a large likelihood ratio. It is likely that non-resonant new physics, which may be modeled using effective field theory methods, will be more successful because the likelihood ratio is never too far from unity. This is closer to the previously studied case that investigated the impact of different SM simulations [67]. The resonant examples presented in this paper will serve as an important benchmark for the community as existing methods are extended and new techniques are developed to empower a new class of analyses at the LHC and beyond.

CODE AND DATA
The code for this paper can be found at https:// github.com/wpmccormack/OmniFoldBSM.