Finding Dark Matter Faster with Explicit Profile Likelihoods

Liquid xenon time-projection chambers are the world's most sensitive detectors for a wide range of dark matter candidates. We show that the statistical analysis of their data can be improved by replacing detector response Monte Carlo simulations with an equivalent deterministic calculation. This allows the use of high-dimensional undiscretized models, yielding up to $\sim\! 2$ times better discrimination of the dominant backgrounds. In turn, this could significantly extend the physics reach of upcoming experiments such as XENONnT and LZ, and bring forward a potential $5 \sigma$ dark matter discovery by over a year.


A. Motivation
Astrophysical and cosmological measurements have established that about four-fifths of the matter in the universe is dark matter [1,2]. Several experiments attempt to detect dark matter particles on Earth [3][4][5]. Liquid xenon time projection chambers (LXe TPCs) lead the field of direct detection for many dark matter models [6][7][8], including the prominent category of models with dark matter masses around ∼100 GeV/c², known as weakly interacting massive particles (WIMPs).
Along with other rare-event searches [9][10][11], LXe TPCs use (profile) likelihood ratio tests to derive their final physics results, since these tests have (nearly) optimal statistical power [12]. In this work, we suggest an improvement to how upcoming LXe TPC experiments such as XENONnT [13], LZ [14], and PandaX-4T [15] calculate their likelihood. We show this leads to a higher sensitivity through improved signal/background discrimination, and more robust results through supporting more simultaneous correlated nuisance parameters.

B. Background
Particle physics experiments often specify their detector response model implicitly through a Monte Carlo (MC) simulation. However, their inference uses (profile) likelihoods, which need the differential expected event rate: the number of events expected in an infinitesimal volume around the event, in the space of observables such as signal amplitude or reconstructed position. To obtain differential rates from a simulation, experiments use density estimation: usually, they histogram the results of a large simulation run, e.g. O(10⁷) events. Such histogram-derived densities are known as templates.
Parameters of the detector response model are implemented as configuration options of the simulation. During inference, an optimizer explores many parameter-value combinations. To avoid redoing simulations at every point during inference, experiments often precompute templates at strategically chosen points in parameter space and interpolate between them, linearly or using more advanced techniques; this is known as template morphing [16].
To maximize sensitivity, an experiment would like to use all relevant simultaneous observables measured in an event. In particular, unless the detector response is homogeneous and constant in time, reconstructed position and time are relevant. This means templates must be high-dimensional histograms, which require a large number of simulations.
To derive robust results, experiments account for uncertainties as 'nuisance' (i.e. additional) parameters of the likelihood, which are profiled during inference (or marginalized, in Bayesian methods). The more nuisance parameters are used, the more templates must be precomputed. In particular, correlated parameters require templates precomputed on a multidimensional grid, causing an exponential scaling of cost with number of parameters.
These requirements have an escalating price in terms of computation. Experiments therefore often make compromises that decrease their physics potential, such as discretizing the model into a few spatial or temporal 'bins', or that reduce robustness, such as considering only a few key or aggregate uncertainties as nuisance parameters.

C. Concept and outline
In this paper, we show how to compute the differential expected event rate for events in LXe TPCs directly, without MC simulations or templates. The result is not an approximation of an MC. Rather, we directly obtain the result of running an MC simulation with infinite statistics to fill an infinitesimally binned histogram, separately for the detector conditions appropriate for each individual event.
We build on the basic idea that every MC simulation approximates an integral or sum. For example, take a trivial simulation that draws a number of photons N from a Poisson distribution with mean λ, then returns a signal size S drawn from a Gaussian centered at N with width σ. This might represent an experiment observing signals of size S, obtained by Gaussian smearing from a number of photons N. To compute the probability density P(s) at one or more observations, we could run the simulation many times, populate a histogram, and look up the estimated density at each observed s. Alternatively, we could compute the following sum:

$$P(s) = \sum_{n=0}^{\infty} \mathrm{Poisson}(n \mid \lambda)\, \mathrm{Gauss}(s - n, \sigma),$$

where $\mathrm{Poisson}(n \mid \lambda) = \lambda^n e^{-\lambda}/n!$ and $\mathrm{Gauss}(x, \sigma) = \exp\!\left(-x^2/(2\sigma^2)\right)/\sqrt{2\pi\sigma^2}$. This method works even if the mean of the Gaussian distribution is a complicated function of the number of detected photons, e.g. involving an externally determined response map. The sum runs over all non-negative integers, but evaluating only e.g. $n \in \{\lfloor s - 5\sqrt{s} \rfloor,\, \lfloor s - 5\sqrt{s} \rfloor + 1,\, \ldots,\, \lceil s + 5\sqrt{s} \rceil\}$ will give accurate results except for 5σ outliers.
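As a minimal illustration (a sketch, not flamedisx code), both routes can be written in a few lines of Python; the range of n summed over here is deliberately generous rather than the ±5√s window above:

```python
import numpy as np
from scipy import stats

lam, sigma = 20.0, 2.0   # Poisson mean and Gaussian smearing width
s_obs = 23.4             # an observed signal size

# Monte Carlo route: simulate, histogram, look up the estimated density.
rng = np.random.default_rng(0)
n = rng.poisson(lam, size=10**7)
s = rng.normal(n, sigma)
counts, edges = np.histogram(s, bins=1000, density=True)
p_mc = counts[np.searchsorted(edges, s_obs) - 1]

# Direct route: evaluate the sum over a generous range of n.
ns = np.arange(0, 200)
p_direct = np.sum(stats.poisson.pmf(ns, lam)
                  * stats.norm.pdf(s_obs, loc=ns, scale=sigma))

print(p_mc, p_direct)  # agree up to MC and binning noise
```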
For the LXe emission and detector response model, the summation is more complicated to construct, and sensible bounds are more difficult to estimate, but the principle is the same. The summation becomes a large matrix/tensor multiplication, implemented using TensorFlow [17] to take advantage of GPU acceleration and automatic differentiation. The latter gives access to the gradient and Hessian matrix in the space of nuisance parameters, which makes inference with a high number of nuisance parameters practical.
We present an open-source Python package flamedisx [18] which integrates the direct differential rate computation in an inference framework. Instead of specializing to one experiment's choice of LXe model, we provide a framework in which functions such as the parametrizations of yields can be easily customized. Our model skeleton is inspired by, though not exactly equivalent to, NEST [19,20] and models used in XENON1T [21][22][23][24]. flamedisx includes utilities for statistical inference with the MINUIT and scipy optimizers [25][26][27]. Finally, we show that flamedisx's high-dimensional likelihood gives up to a factor ∼2 better discrimination of the dominant electronic recoil background when compared with a classical two-dimensional MC template-morphing approach, and a corresponding increase in sensitivity for dark matter searches.
Section II reviews the principle of LXe TPCs and their likelihoods as they are currently used. Section III discusses flamedisx's method and models in detail. Section IV discusses flamedisx's performance on a test case that aims to resemble future detectors such as XENONnT and LZ. Section V discusses the validity of the model and its impact on real experiments; we end with a short summary and outlook in section VI.

II. LXE TPCS AND THEIR LIKELIHOODS
In this section, we briefly recapitulate the operational principle of LXe TPCs, and introduce notation used later in the paper. A thorough exposition of LXe TPCs can be found in [28][29][30].

A. LXe TPCs
TPCs used in modern dark matter searches are meter-sized cylinders filled with liquid xenon, with a few centimeter thick layer of gaseous xenon at the top. Figure 1 illustrates the working principle of such a LXe TPC. Particles scatter off either the electrons or the nucleus of a xenon atom, resulting in electronic (ER) or nuclear recoils (NR), respectively. Recoils create excited xenon dimers, which decay within O(10 ns), emitting UV light observed as a signal (S1) by the two arrays of photomultipliers (PMTs) at the top and bottom of the detector. Electron/ion pairs are also created in the recoil. The electrons are drifted up towards the gas by a O(100 V/cm) electric field. The electrons then produce a larger secondary signal (S2) in the gaseous xenon under the influence of a O(10 kV/cm) electric field. The light pattern of the S2 signal on the top PMT array indicates the horizontal (x, y) position of the event; the drift time of the electrons (measured by the time delay between S1 and S2) indicates the depth (z) of the interaction. The relative size of the S1 and S2 signals depends on the recoil's energy and type (ER or NR). This allows LXe TPCs to discriminate the NR signals expected for dark matter (and e.g. radiogenic neutron backgrounds) from ERs, which arise primarily from radioactive contaminants in the LXe.

B. Observables and likelihoods
Likelihood tests require two key inputs: reconstructed quantities of observed events, and expected differential rates for signal and background sources.
LXe TPC likelihoods use a vector of observed quantities, s, for each event that passes the selection criteria. The s usually includes the amplitudes S1 and S2 of the main detector signals in photoelectrons (PE), sometimes the reconstructed event position, (x, y, z) or (r, z, φ) depending on whether Cartesian or cylindrical coordinates are used, and sometimes the measured absolute event time t. The likelihood is always expressed in directly observable quantities, both in this work and in all implementations we are aware of in the field. For instance, the event positions are reconstructed positions, not presumed true positions.
Likelihoods also require the differential event rate R_j(s) for different signal or background sources j. For example, if s = [S1, S2, x, y, z, t], R_j relates to μ_j, the total expected number of events after selections of source j, as

$$\mu_j = \rho \int \mathrm{d}S1 \, \mathrm{d}S2 \, \mathrm{d}x \, \mathrm{d}y \, \mathrm{d}z \, \mathrm{d}t \; R_j(S1, S2, x, y, z, t). \quad (3)$$

Here ρ is the liquid xenon density in the detector, so that R_j has units of events/(tonne × year × PE²). During inference, μ_j and R_j depend on several parameters, such as the dark matter cross-section, the rates of different backgrounds, parameters of the LXe charge and light yield response functions, or experimental parameters such as τ, g₁ and g₂.
To compare models against the observed data, LXe TPCs commonly use an extended unbinned likelihood:

$$L = \mathrm{Poisson}(N_{\mathrm{tot}} \mid \mu) \prod_{i=1}^{N_{\mathrm{tot}}} \frac{\rho \sum_j R_j(s_i)}{\mu}. \quad (4)$$

Here, Poisson is the Poisson probability mass function, μ = Σ_j μ_j is the total expected number of events from all sources, and N_tot is the actual observed number of events. Eq. 3 ensures that the sum calculates the probability density function of events in the total model. In practice, the logarithm of eq. 4 is used:

$$\log L = -\mu + \sum_{i=1}^{N_{\mathrm{tot}}} \log \sum_j R_j(s_i) + \mathrm{constant}. \quad (5)$$

The constant depends on the dataset but not the model, and cancels in likelihood ratio inference. It is thus omitted.
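For concreteness, here is a minimal sketch of eq. 5 (an illustration in our notation, not the flamedisx API), assuming the μ_j and the per-event differential rates R_j(s_i) have already been computed:

```python
import numpy as np

def log_likelihood(mu_per_source, rates_at_events):
    """Extended unbinned log-likelihood of eq. 5, up to the
    model-independent constant.

    mu_per_source:   shape (n_sources,), expected event counts mu_j.
    rates_at_events: shape (n_sources, n_events), differential rates
                     R_j evaluated at each observed event s_i.
    """
    mu = np.sum(mu_per_source)
    return -mu + np.sum(np.log(np.sum(rates_at_events, axis=0)))

# Toy usage with two sources and three events:
print(log_likelihood(np.array([10.0, 2.0]),
                     np.array([[0.5, 0.1, 0.3],
                               [0.2, 0.4, 0.1]])))
```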
The likelihood can have many parameters, but one type is worth highlighting: rate multipliers r_j, which are unitless scales of both μ_j and R_j. That is,

$$\mu_j(r_j, \theta) = r_j \, m_j(\theta), \quad (6)$$

and a similar equation holds for R_j, where m_j(θ) is the expected number of events from source j if the multiplying parameter r_j = 1, and θ encapsulates the other (nuisance) parameters in the likelihood, known as shape parameters. Here r_j = 1 represents a reference level, e.g. a cross-section of 10⁻⁴⁷ cm² for a WIMP, or a particular nominal level for the internal ER background.

C. Inhomogeneity and corrections
LXe TPCs do not respond in the same way to signals produced at different positions in the detector or at different times during a science run. The S1 photon detection efficiency is generally highest close to the bottom PMT array, and lowest at the top edge of the liquid volume near the liquid-gas interface (which reflects photons impinging on it at a sufficiently large angle). The survival probability of S2 electrons decreases with depth because impurities in the LXe absorb drifting electrons, as quantified by the mean electron lifetime, τ . The electron lifetime during a science run is often time-dependent. There are also other, usually smaller, inhomogeneities in the response of LXe TPCs, such as variations in the drift and extraction fields, PMT gains, PMT quantum efficiencies, etc.
flamedisx operates on the full (S1, S2, x, y, z, t) space of observables, and can thus take full account of these effects. The field currently uses template-based methods, for which, as mentioned above, using six dimensions is impractical except by coarsely discretizing some dimensions. Instead, collaborations define space- and time-dependent corrections for their signals, so they can use lower-dimensional likelihoods. The corrected signals are called (cS1, cS2), or sometimes (S1_c, S2_c). Using a template in (cS1, cS2) rather than (S1, S2) gives much better signal/background discrimination.
Let G₁(x, y, z, t) and G₂(x, y, z, t) denote the mean expected signal in PE per released photon or electron, respectively, at a given position and time. Then S1/G₁ and S2/G₂ are asymptotically unbiased estimates of the number of photons and electrons released at the interaction site. These are usually multiplied by constant scale factors g₁ and g₂ to obtain the corrected signals:

$$cS1(S1, x, y, z, t) = S1 \cdot \frac{g_1}{G_1(x, y, z, t)}, \qquad cS2(S2, x, y, z, t) = S2 \cdot \frac{g_2}{G_2(x, y, z, t)}. \quad (7)$$

Generally, g₁ and g₂ are defined as averages of the response maps,

$$g_1 = \frac{\int \mathrm{d}V \, \mathrm{d}t \; G_1(x, y, z, t)}{\int \mathrm{d}V \, \mathrm{d}t}, \qquad g_2 = \frac{\int \mathrm{d}x \, \mathrm{d}y \, \mathrm{d}t \; G_2(x, y, 0, t)}{\int \mathrm{d}x \, \mathrm{d}y \, \mathrm{d}t}, \quad (8)$$

with z = 0 denoting the liquid-gas interface,

and with the integrals running over the bounds of the fiducial volume V and the run time δT used in the search. Thus, for a homogeneously distributed source, cS1 = S1 on average inside the fiducial volume, and cS2 = S2 on average at the liquid-gas interface.

Figure 2. Sketch of flamedisx's functionality. Given data, flamedisx estimates bounds for hidden variables such as the number of produced electrons. A TensorFlow representation of the data is used to compute the differential expected event rate for a signal or background source at the observed events. flamedisx can also simulate events, which it uses to estimate the total number of expected events μ after efficiencies. These two ingredients make up the likelihood of equation 5. All functionality uses a common set of model functions, such as the electron lifetime τ or the input energy spectrum R₀(E). The user can customize these to make them depend on arbitrarily many observables, such as time or observed position. Most can also depend on particular unobserved variables, such as energy.
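As an illustrative sketch of the corrections in eq. 7 (with toy response maps and hypothetical numbers, not an experiment's actual corrections):

```python
import numpy as np

def corrected_signals(s1, s2, x, y, z, t, G1, G2, g1, g2):
    """Mean-response corrections of eq. 7. G1 and G2 are callables
    returning the local mean PE yield per photon (electron); g1 and
    g2 are the reference averages of eq. 8."""
    return s1 * g1 / G1(x, y, z, t), s2 * g2 / G2(x, y, z, t)

# Toy maps (all numbers hypothetical): a flat photon detection map and
# an S2 map attenuated with depth by electron absorption during drift.
g1, g2 = 0.12, 30.0
G1 = lambda x, y, z, t: g1                     # homogeneous by construction
G2 = lambda x, y, z, t: g2 * np.exp(-z / 1.0)  # z in units of drift velocity x lifetime

print(corrected_signals(50.0, 2000.0, 0.0, 0.0, 0.5, 0.0, G1, G2, g1, g2))
```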
flamedisx uses the full space of observables, which is preferable for two reasons. First, signal and background intensities and spectra often vary with (x, y, z, t), even if the detector response were homogeneous, and even within the fiducial volume of the detector. Second, signal corrections only compensate for a change in the mean response, not for differences in resolution. For example, the electron detection efficiency decreases with depth in a TPC, worsening the S2 resolution further down the detector. For this reason, the top regions of TPCs often show the best signal/background discrimination. If the analysis does not distinguish these from the bottom regions, it loses physics potential.

III. METHOD
A. Overview

Figure 2 sketches the modeling functionality of flamedisx. Users can specify the desired physics through model functions, e.g. how the electron lifetime and ER charge yield vary with space and time (and, for the latter, energy). Each can depend on an arbitrary number of observables such as (x, y, z, t), and some can additionally depend on hidden variables, such as energy. The model functions are used for three main computations in flamedisx.
First, given data (provided as a Pandas DataFrame [31]), flamedisx estimates reasonable ranges of the hidden variables that are summed over in the computation. A parameter σ max controls the width of the bounds: increasing it yields more accurate results for outlier events at the cost of computational speed. For this paper, we use σ max = 5, to much the same effect as in the example in section I C.
Second, flamedisx uses the model functions inside the differential rate computation itself. We first convert the data to a series of TensorFlow tensors, segmented into batches of configurable size to control the memory required for the computation. Next, we compute a series of tensors that are multiplied together to yield the differential rate, as explained in sections III C to III H.
Finally, flamedisx also contains an ordinary Monte Carlo code for simulating events using the same model functions. Besides its use in creating toy datasets, we will use it in section IV A to confirm that flamedisx's differential rate computation matches the result of an equivalent template-based method.

B. Mean estimation
The simulation code also plays a role in flamedisx's inference. Besides differential rates, the likelihood in eq. 5 uses the total expected number of events μ. Computing this by integrating the differential rate over the space of possible observables would be very inefficient. Instead, we simulate a configurable number of events (by default 10⁵) and count how many yield detectable signals that pass all efficiencies.
Like differential rates, the μ_j depend on shape parameters θ; see eq. 6. flamedisx can use the same method as classical template morphing: precompute m_j at particular θ values, then interpolate between the results as θ varies during inference. This is not a significant computational burden, since a single number is much simpler to estimate and interpolate than a multidimensional template. For example, as discussed in [16], conventional ('vertical') interpolation performs poorly in modeling shifts in distributions, even in one dimension. For a single number such problems do not occur.
Interestingly, template-like interpolation of m_j is not needed at all in many cases, or a very coarse estimate suffices. If the error on the estimate of m_j(θ) is well below the width of the external constraint on r_j, fits or limit constructions will converge to the correct values of the shape parameters $\hat{\theta}$, and the inaccuracy in m_j(θ) will be absorbed by the fitted rate multiplier $\hat{r}_j$ alone. After the fit, a single simulation can then be used to estimate $m_j(\hat{\theta})$ accurately, and we find the correct $\hat{r}_j$ by dividing the fitted $\hat{r}_j$ by the ratio of the accurate (post-hoc) and inaccurate (as used during the fit) estimates of $m_j(\hat{\theta})$. Indeed, LXe TPC likelihoods generally have no or weak external constraints on the rate multipliers r_j of the dark matter signal and the dominant internal ER background, since the dark matter rate is the main parameter of interest, and the most sensitive measurement of the internal ER background usually comes from the data itself. For such sources, pre-fit estimates of m_j(θ) are not needed at all. For some backgrounds, such as radiogenic neutrons, external constraints on r_j are relevant, since their expected rate is too low to be effectively constrained by the data itself. However, for rare backgrounds, small inaccuracies in m_j(θ) will only have a marginal effect.
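A minimal sketch of this post-hoc correction (function and argument names are ours):

```python
def corrected_rate_multiplier(r_fitted, m_used_in_fit, m_posthoc):
    """Divide the fitted multiplier by the ratio of the accurate
    (post-hoc) and inaccurate (used during the fit) estimates of m_j."""
    return r_fitted / (m_posthoc / m_used_in_fit)

# E.g. if the fit used m_j = 100 events but an accurate post-hoc
# simulation gives 110, a fitted r_j of 1.10 corrects to 1.0.
print(corrected_rate_multiplier(1.10, 100.0, 110.0))
```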

C. Model structure
In the remainder of this section, we describe the ER/NR light- and charge-emission and detector response model implemented in flamedisx. More precisely, we describe the structure in which several physics models can be implemented; the exact form of model functions such as the light and charge yields can be chosen by collaborations themselves. As mentioned above, we provide reasonable defaults inspired by NEST [19,20] and our previous work on XENON1T [21][22][23][24]. Figure 3 illustrates the implementation of the differential rate calculation in flamedisx. In outline, we start with an input energy spectrum R₀(E). For each event, this is a vector whose elements range over different energies; see section III D. The computation is vectorized over a batch of events, indicated as the depth dimension in figure 3. Since each event yields a stochastic number of detectable quanta n_q, we transform the expected spectrum at each event to a vector R₁(n_q) over n_q; see section III E. Some quanta manifest as scintillation photons ($n^{\mathrm{ph}}_{\mathrm{prod}}$), others as drifting electrons ($n^{\mathrm{el}}_{\mathrm{prod}}$); see section III F. Accounting for this, the differential rate becomes a matrix $R_2(n^{\mathrm{ph}}_{\mathrm{prod}}, n^{\mathrm{el}}_{\mathrm{prod}})$. Some $n^{\mathrm{ph}}_{\mathrm{det}}$ ($n^{\mathrm{el}}_{\mathrm{det}}$) of the produced photons (electrons) survive to be detected, as quantified by a matrix $P(n^{\mathrm{ph}}_{\mathrm{det}} \mid n^{\mathrm{ph}}_{\mathrm{prod}})$ ($P(n^{\mathrm{el}}_{\mathrm{det}} \mid n^{\mathrm{el}}_{\mathrm{prod}})$); see section III G. Finally, the detected photons (electrons) determine the S1 (S2) amplitude, quantified by $P(S1 \mid n^{\mathrm{ph}}_{\mathrm{det}})$ ($P(S2 \mid n^{\mathrm{el}}_{\mathrm{det}})$); see section III H. This is only a vector in $n^{\mathrm{ph}}_{\mathrm{det}}$ ($n^{\mathrm{el}}_{\mathrm{det}}$), since the S1 (S2) value is already observed. We multiply the P's and R₂ together to get R(S1, S2), the differential expected event rate at the observed event.

D. Input energy spectrum

flamedisx's starting point is an input energy spectrum, or more formally, a differential rate R₀, where

$$\mu_0 = \sum_E R_0(E) \quad (9)$$

is the total number of expected events in a hypothetical detector with perfect efficiency. Thus R₀ has units of e.g. events/(tonne × year). In flamedisx, R₀ is specified by a function returning a vector of energies and an associated R₀ for each. That is, an input energy spectrum is represented by a comb of delta functions. This is a natural input format for spectra specified as a finely (but possibly non-homogeneously) binned histogram. As a result, there is no dE in eq. 9; it is absorbed in R₀. For a homogeneous internal background, R₀ is only a function of energy; for dark matter, R₀ is a function of energy and time (since dark matter signals are expected to have an annual modulation [32]); for external backgrounds, R₀ can be a function of all observables. Like all model functions, the x, y, z, t have no particular special status, and the user can make R₀ depend on an arbitrary number of observables. Dependencies on hidden variables, such as E for R₀, are fixed, since they determine the structure of flamedisx's computation.

E. Detectable quanta
The differential rate as a function of the number of quanta is:

$$R_1(n_q) = \sum_E P(n_q \mid E) \, R_0(E). \quad (10)$$

We omitted the dependence on (x, y, z, t), and will do so below, since all model functions can depend on arbitrarily many observables. The probability to generate n_q quanta (scintillation photons and ionization electrons) given some deposited energy E is given by

$$P(n_q \mid E)_{\mathrm{ER}} = \delta\!\left(n_q, E/W\right), \qquad P(n_q \mid E)_{\mathrm{NR}} = \mathrm{Poisson}\!\left(n_q \,\middle|\, E \, L(E)/W\right) \quad (11)$$

for ER and NR, respectively. Here δ is the Kronecker delta function, Poisson the Poisson probability mass function, W ≈ 13.8 eV the LXe work function, i.e. the energy required to generate one detectable ER quantum, and L(E) the mean fraction of NR energy used for making detectable quanta (rather than lost as heat). This is often called the Lindhard factor, though the user can specify any functional form of L, not just the one proposed by Lindhard [33].
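A sketch of eq. 11 under the stated definitions (the integer rounding of E/W for the ER delta is a choice of this sketch):

```python
import numpy as np
from scipy import stats

W = 13.8e-3  # LXe work function in keV (energies below are in keV)

def p_nq_er(nq, E):
    # ER branch of eq. 11: a Kronecker delta at n_q = E / W.
    return (nq == np.round(E / W)).astype(float)

def p_nq_nr(nq, E, L):
    # NR branch of eq. 11: Poisson with mean E * L(E) / W.
    return stats.poisson.pmf(nq, E * L(E) / W)

nq = np.arange(0, 1000)
print(p_nq_er(nq, 5.0).sum(), p_nq_nr(nq, 5.0, lambda E: 0.15).sum())
```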

F. Splitting over photons and electrons
The differential rate of producing $n^{\mathrm{ph}}_{\mathrm{prod}}$ photons and $n^{\mathrm{el}}_{\mathrm{prod}}$ electrons is given by:

$$R_2(n^{\mathrm{ph}}_{\mathrm{prod}}, n^{\mathrm{el}}_{\mathrm{prod}}) = \sum_{n_q} P(n^{\mathrm{el}}_{\mathrm{prod}} \mid n_q) \, R_1(n_q) \, \delta(n_q = n^{\mathrm{ph}}_{\mathrm{prod}} + n^{\mathrm{el}}_{\mathrm{prod}}). \quad (12)$$
The detectable quanta split binomially over photons and electrons. Thus $n^{\mathrm{ph}}_{\mathrm{prod}} = n_q - n^{\mathrm{el}}_{\mathrm{prod}}$, and $P(n^{\mathrm{el}}_{\mathrm{prod}} \mid n_q) = \mathrm{Binom}(n^{\mathrm{el}}_{\mathrm{prod}} \mid n_q, p_e)$, where Binom is the binomial probability mass function and $p_e$ the fraction of detectable quanta that manifest as electrons. ERs show an overdispersion on top of this, commonly modeled as a Gaussian fluctuation of $p_e$, likely due to variations in recombination [23,34,35]. The Gaussian model cannot be physical, since it allows probabilities outside [0, 1]. Instead, we assume the fluctuation is described by a Beta distribution: $P(p_e \mid n_q) = \mathrm{Beta}(p_e \mid \alpha, \beta)$.
We can express the Beta distribution's formal parameters α and β in terms of the mean $\mu_{p_e}$ and standard deviation $\sigma_{p_e}$ as

$$\alpha = \mu_{p_e} \nu, \qquad \beta = (1 - \mu_{p_e}) \nu, \qquad \nu \equiv \frac{\mu_{p_e}(1 - \mu_{p_e})}{\sigma_{p_e}^2} - 1.$$

Since the Beta distribution is a conjugate distribution of the binomial, the process is equivalent to drawing $n^{\mathrm{el}}_{\mathrm{prod}}$ directly from a beta-binomial distribution. Thus, our model for $n^{\mathrm{el}}_{\mathrm{prod}}$ for ERs and NRs is:

$$P(n^{\mathrm{el}}_{\mathrm{prod}} \mid n_q)_{\mathrm{ER}} = \mathrm{BetaBinom}\!\left(n^{\mathrm{el}}_{\mathrm{prod}} \,\middle|\, n_q, \mu_{p_e}(n_q), \sigma_{p_e}(n_q)\right), \\
P(n^{\mathrm{el}}_{\mathrm{prod}} \mid n_q)_{\mathrm{NR}} = \mathrm{Binom}\!\left(n^{\mathrm{el}}_{\mathrm{prod}} \,\middle|\, n_q, p_e(n_q)\right). \quad (13)$$
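Since SciPy ships a beta-binomial distribution, eq. 13 can be sketched directly; the moment matching follows the α, β relations above (a sketch, not flamedisx code):

```python
from scipy import stats

def beta_params(mu, sigma):
    """Moment-match Beta(alpha, beta) to mean mu and std sigma of p_e."""
    nu = mu * (1 - mu) / sigma**2 - 1
    return mu * nu, (1 - mu) * nu

def p_nel_er(n_el, n_q, mu_pe, sigma_pe):
    # ER branch of eq. 13: beta-binomial, via the conjugacy noted above.
    a, b = beta_params(mu_pe, sigma_pe)
    return stats.betabinom.pmf(n_el, n_q, a, b)

def p_nel_nr(n_el, n_q, p_e):
    # NR branch of eq. 13: plain binomial.
    return stats.binom.pmf(n_el, n_q, p_e)

print(p_nel_er(40, 100, mu_pe=0.4, sigma_pe=0.05),
      p_nel_nr(40, 100, p_e=0.4))
```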
The parameters are functions of n_q rather than E. For ER, E and n_q have an identity relation (see eq. 11), so this is merely a change of notation. For low-energy NRs, it is an open question whether a direct dependence on E instead of n_q would better fit the data. As low-energy NR tracks are much smaller than the electron thermalization length [36,37], at the time of recombination there should be no memory of E other than through the number of surviving excitons and electron/ion pairs n_q. Thus a direct dependency on n_q may well be favored. On the other hand, post-thermalization recombination is not the only process involved. Current models often apply a separate factor for suppression of the NR light yield due to Penning quenching [38,39]. At present, flamedisx implements this by unphysically making the photon detection efficiency a function of the number of produced photons; a more elegant implementation would be to adjust the forms and parameters of $p_e$ and the Lindhard factor L.

G. Signal quanta losses
Some of the electrons and photons produce detectable signals in the PMTs, while others are lost. These efficiencies are usually position-dependent and include a variety of effects, such as absorption by impurities and extraction through the liquid-gas interface for electrons, and collection and quantum efficiencies for photons. Each signal photon and electron undergoes these processes individually, without interacting with other photons or electrons. Thus there is a unique probability per photon to be detected (given a particular position, time, etc.), and likewise for electrons. The fraction of detected photons (and likewise electrons) must therefore follow a binomial distribution.
Separately from the per-quantum efficiencies $\eta_{\mathrm{ph}}$ and $\eta_{\mathrm{el}}$, experiments have per-event detection efficiencies. For example, two-photon S1s are often excluded from analysis, since these are likely to arise from accidental coincidences of PMT dark counts. These efficiencies are usually parametrized as functions of either the raw S1 or S2 signal amplitude (see the next section) or the number of detected photons and electrons. We call the latter $\zeta_{\mathrm{ph}}$ and $\zeta_{\mathrm{el}}$, respectively.
Given $n^{\mathrm{ph}}_{\mathrm{prod}}$ produced photons ($n^{\mathrm{el}}_{\mathrm{prod}}$ electrons), the probability of detecting $n^{\mathrm{ph}}_{\mathrm{det}}$ photons ($n^{\mathrm{el}}_{\mathrm{det}}$ electrons) is thus:

$$P(n^{\mathrm{ph}}_{\mathrm{det}} \mid n^{\mathrm{ph}}_{\mathrm{prod}}) = \zeta_{\mathrm{ph}}(n^{\mathrm{ph}}_{\mathrm{det}}) \, \mathrm{Binom}(n^{\mathrm{ph}}_{\mathrm{det}} \mid n^{\mathrm{ph}}_{\mathrm{prod}}, \eta_{\mathrm{ph}}), \\
P(n^{\mathrm{el}}_{\mathrm{det}} \mid n^{\mathrm{el}}_{\mathrm{prod}}) = \zeta_{\mathrm{el}}(n^{\mathrm{el}}_{\mathrm{det}}) \, \mathrm{Binom}(n^{\mathrm{el}}_{\mathrm{det}} \mid n^{\mathrm{el}}_{\mathrm{prod}}, \eta_{\mathrm{el}}). \quad (14)$$
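A minimal sketch of eq. 14, with a hypothetical 3-fold coincidence requirement standing in for the per-event acceptance:

```python
from scipy import stats

def p_detected(n_det, n_prod, eta, zeta):
    """Eq. 14: binomial survival of individual quanta, multiplied by
    a per-event acceptance zeta evaluated at the detected count."""
    return zeta(n_det) * stats.binom.pmf(n_det, n_prod, eta)

# Hypothetical example: exclude events with fewer than 3 detected
# photons, as in a 3-fold S1 coincidence requirement.
zeta_ph = lambda n: float(n >= 3)
print(p_detected(5, 40, eta=0.12, zeta=zeta_ph))
```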

H. Final signal production

The observed S1 and S2 signal sizes, given a number of detected quanta, are assumed to follow Gaussian distributions. Additionally, per-event detection efficiencies may depend on the S1 and S2 size. We thus have:

$$P(S1 \mid n^{\mathrm{ph}}_{\mathrm{det}}) = \xi_{\mathrm{ph}}(S1) \, \mathrm{Normal}\!\left(S1 \,\middle|\, \mu_{S1}(n^{\mathrm{ph}}_{\mathrm{det}}), \sigma_{S1}(n^{\mathrm{ph}}_{\mathrm{det}})\right), \\
P(S2 \mid n^{\mathrm{el}}_{\mathrm{det}}) = \xi_{\mathrm{el}}(S2) \, \mathrm{Normal}\!\left(S2 \,\middle|\, \mu_{S2}(n^{\mathrm{el}}_{\mathrm{det}}), \sigma_{S2}(n^{\mathrm{el}}_{\mathrm{det}})\right), \quad (15)$$

with $\mu_{S1}$ and $\sigma_{S1}$ the mean and standard deviation of the S1 area in PE given the number of detected photons, $\xi_{\mathrm{ph}}$ the per-event detection efficiency parametrized as a function of S1 (and thus not included in $\zeta_{\mathrm{ph}}$), and similar quantities for electrons and S2.
The secondary scintillation gain of 20-30 photons per electron in most TPCs is large enough that a Gaussian signal production model is adequate. Thus, we simply have:

$$\mu_{S2}(n^{\mathrm{el}}_{\mathrm{det}}) = \mu_1^{S2} \, n^{\mathrm{el}}_{\mathrm{det}}, \qquad \sigma_{S2}(n^{\mathrm{el}}_{\mathrm{det}}) = \sigma_1^{S2} \sqrt{n^{\mathrm{el}}_{\mathrm{det}}}, \quad (16)$$

with $\mu_1^{S2}$ and $\sigma_1^{S2}$ the mean and standard deviation of the S2 area produced per single detected electron, respectively.
For S1s, the non-Gaussian response of PMTs to single photons is relevant. In particular, PMTs in LXe TPCs have an O(20%) probability of producing two rather than one photoelectron per LXe scintillation photon [40]. This effect can be captured adequately for $n^{\mathrm{ph}}_{\mathrm{det}} \gtrsim 5$ by choosing $\mu_{S1}$ and $\sigma_{S1}$ as:

$$\mu_{S1}(n^{\mathrm{ph}}_{\mathrm{det}}) = \mu_1^{S1} \, n^{\mathrm{ph}}_{\mathrm{det}} \, (1 + p_2), \qquad \sigma_{S1}(n^{\mathrm{ph}}_{\mathrm{det}}) = \mu_1^{S1} \sqrt{n^{\mathrm{ph}}_{\mathrm{det}} \left[ p_2 (1 - p_2) + \left(\sigma_1^{S1}/\mu_1^{S1}\right)^2 \right]}. \quad (17)$$

Here $p_2$ is the probability per detected photon of observing double-PE emission, and $\mu_1^{S1}$ and $\sigma_1^{S1}$ the mean and standard deviation of the S1 signal area contributed per photoelectron. The two terms in the square root add the variance of the binomial double-PE emission (scaled by $\mu_1^{S1}$) to that of the per-PE signal area fluctuation in quadrature. We plan to implement the binomial double-PE emission process fully as an extra step in the next version of flamedisx.
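A sketch of eq. 17 as reconstructed above (the exact variance bookkeeping in a real analysis may be chosen differently):

```python
import numpy as np

def s1_moments(n_ph_det, mu1, sigma1, p2):
    """Gaussian S1 model of eq. 17: mean scaled by the double-PE
    probability p2; variance adds the binomial double-PE term and
    the per-PE area fluctuation in quadrature."""
    mean = mu1 * n_ph_det * (1 + p2)
    std = mu1 * np.sqrt(n_ph_det * (p2 * (1 - p2) + (sigma1 / mu1)**2))
    return mean, std

print(s1_moments(10, mu1=1.0, sigma1=0.4, p2=0.2))
```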

IV. PERFORMANCE

A. Model verification
A first and essential check is to verify that flamedisx's direct differential rate computation (figure 3) produces the same result as a lookup in a finely binned template created from a high-statistics simulation. A divergence would point to an error in the flamedisx core code.
Computing even a single six-dimensional (S1, S2, x, y, z, t) histogram with sufficient statistics for this check would be a significant undertaking. It is much more practical to fix the simulated event positions and time, so that only a two-dimensional (S1, S2) histogram is needed for comparison. Figure 4 shows the result of such a comparison, drawing events from either a 0-10 keV ER background or a 30 GeV/c² WIMP, and evaluating differential rates under both models. The histogram has 70 bins in both S1 and S2, with bin edges uniform in log S1 and log S2 respectively, to better capture the quickly-varying region at low energy. We fill the histogram with 10⁸ simulated events; the events used for the test are a separate simulated set. A small fraction of these are not added to the histogram due to the assumed detection efficiencies, or because they fall outside the histogram range.
The results confirm that flamedisx's computation is equivalent to a simulation with high precision. Despite the fine binning, events that fall in individual bins can be seen as vertical stripes on the plots. This is an inaccuracy inherent in using template-based likelihoods. The result of the two methods should be exactly the same only in the limit of an infinitely large simulation, a histogram with infinitesimally small bins, and σ max → ∞.
It is simple to repeat this test, though it should not be necessary often. Users will frequently change only the model functions, which are shared between the simulator and the direct computation. Changes in these cannot by themselves cause a divergence if the core code is correct.

B. Single-event discrimination
flamedisx can boost the physics potential of LXe detectors by making a full, undiscretized (S1, S2, x, y, z, t) likelihood viable, in a robust analysis with several correlated nuisance parameters. In this section, we will show the benefits of this, by comparing flamedisx's likelihood with a two-dimensional (cS1, cS2) likelihood.
A fully specified model is necessary for this comparison. We used the ER and NR models described in [23] (reparametrized to fit the flamedisx structure) and the S1 and S2 efficiency maps from [41]. The exact model functions can be inspected along with the flamedisx source code at [18]. We assume a fiducial volume of 5 tonnes, an electron lifetime of 500 µs, a TPC length of 1.5 m, a homogeneous drift field, and perfectly stable detector conditions.

Figure 5. Background leakage versus signal acceptance for a WIMP signal in flamedisx (solid) and a two-dimensional template-based likelihood (dashed), using exactly the same emission, detector response, and analysis efficiency assumptions. Blue: flat-spectrum homogeneous ER background from 0-10 keV; red: radiogenic neutron background with a spatially varying rate; purple: hypothetical homogeneous NR background with the same energy spectrum as the WIMP signal, but a time-averaged rather than an annually modulating spectrum. Note the switch from a linear to a logarithmic scale below 1% background leakage. Arrows show the background reduction at 50% signal acceptance.
Using this model, we compute two-dimensional (cS1, cS2) ER and NR templates, using the same binning discussed above, except that it is linear in cS1 to match existing practice. We fill the templates with 10⁸ events each, then use the templates as well as flamedisx's direct computation to estimate the differential rate for simulated events. The ratio of the expected differential rates of two sources is the Neyman-Pearson optimal test statistic for discriminating single events. Figure 5 shows several discrimination efficiencies ("ROC plots") using this statistic. Specifically, it shows, for different backgrounds, the fraction of remaining expected WIMP signal events if a given fraction of the background must be removed. We see that, at 50% signal acceptance, flamedisx effectively reduces the dominant homogeneous flat ER background by a factor of two. This is almost entirely due to separating different z regions of the detector, which have different S2 resolutions (and thus different discrimination power) due to the finite electron lifetime. The neutron background is discriminated by an even larger factor, due to its spatial dependence within the assumed fiducial volume. The WIMP signal's annual modulation contributes almost nothing to this discrimination: a hypothetical homogeneous NR background with the same mean energy spectrum as the signal (but without modulation) can barely be distinguished from the signal. We stress that the same results would be obtained with a traditional template-based method using densely binned high-dimensional templates, if that were feasible.
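The curves in figure 5 follow from cutting on this statistic; a minimal sketch (ours, not flamedisx's plotting code) with stand-in random values of the statistic:

```python
import numpy as np

def roc_curve(t_signal, t_background, n_points=101):
    """Signal acceptance vs. background leakage for cuts on the
    Neyman-Pearson statistic t = R_signal / R_background, evaluated
    on simulated signal and background events."""
    cuts = np.quantile(t_signal, np.linspace(0, 1, n_points))
    acceptance = np.mean(t_signal[None, :] >= cuts[:, None], axis=1)
    leakage = np.mean(t_background[None, :] >= cuts[:, None], axis=1)
    return acceptance, leakage

# In practice t_signal and t_background are ratios of differential
# rates at simulated signal and background events, respectively.
rng = np.random.default_rng(1)
acc, leak = roc_curve(rng.lognormal(1, 1, 10000), rng.lognormal(0, 1, 10000))
print(acc[50], leak[50])  # leakage at 50% acceptance, for this toy
```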

C. Speed
flamedisx has a fundamentally different performance profile than classic template-based likelihoods. For templates, filling the histograms with simulated events is usually the most expensive step: the number of events needed grows exponentially with the histogram dimensionality and the number of potentially correlated parameters that shape the distributions. flamedisx does not use templates, and thus requires essentially no precomputation, only O(10 s) of TensorFlow graph construction, which does not need to be repeated when a new (toy) dataset is considered. Figure 6 shows an estimate of the required template computation time for a two-dimensional (cS1, cS2) and a six-dimensional (S1, S2, x, y, z, t) likelihood. For the central model (bands), we assumed a simulator capable of producing an accurate 2D-template model in 100 (50-200) seconds [42] and 7 (5-10) templates per parameter. For the 6D-template model, we additionally assume 7 (5-10) bins each for x, y, z, and t. Clearly, with more than a handful of nontrivial parameters, even two-dimensional templates become unwieldy, and six-dimensional template grids require computation times measured on geological timescales.
After precomputation, template-based likelihoods are relatively fast: computing differential rates only requires a few lookups and simple interpolation. flamedisx must instead do a new differential rate computation for each event, which is generally the rate-limiting step.
On a single Tesla P100 GPU, flamedisx's differential rate computation runs at ∼3 events/ms for events from a 0-10 keV ER model. NR models are nearly three times faster to evaluate, mainly because they lack the beta-binomial in equation 13. Thus, the likelihood of a O(1000)-event dataset under a model with both ER and NR sources can be computed in about half a second. flamedisx's differential rate computation can also run on a CPU, but is then O(10²) times slower.
During inference, the likelihood must be computed at many different points in parameter space. flamedisx exploits TensorFlow's automatic differentiation to compute the gradient and (optionally) the Hessian in parameter space. With this information, high-dimensional optimization needs far fewer iterations and becomes more robust. Figure 6 also shows the mean duration of fitting an ER model to a 1000-event ER calibration dataset. Specifically, we fitted polynomial models of different orders for $\mu_{p_e}(n_q)$, effectively controlling the mean ER charge yield as a function of energy. Figure 6 is clearly not a like-for-like comparison of flamedisx and a template-based inference framework: it shows an estimated precomputation time on a CPU for the template method, and a measured per-fit time on a GPU for flamedisx. We juxtapose the two pieces of information to highlight the fundamentally different scaling with the number of nontrivial parameters. A template-based framework could be ported to GPUs, but the only benefit would be to allow, perhaps, two or three more dimensions or parameters within the same computation time frame.
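The autodiff pattern is standard TensorFlow; a toy sketch (not the flamedisx optimizer interface) of obtaining the gradient and Hessian of a likelihood-like function:

```python
import tensorflow as tf

# Toy fit parameters, e.g. a rate multiplier and a shape parameter.
params = tf.Variable([1.0, 0.5])

def neg_log_likelihood(p):
    # Stand-in for the real likelihood; any differentiable
    # TensorFlow expression of the parameters works the same way.
    return tf.reduce_sum((p - tf.constant([1.2, 0.4])) ** 2)

# Nested tapes: the inner tape gives the gradient, the outer tape
# differentiates the gradient to obtain the Hessian.
with tf.GradientTape() as outer:
    with tf.GradientTape() as inner:
        nll = neg_log_likelihood(params)
    grad = inner.gradient(nll, params)
hessian = outer.jacobian(grad, params)
print(grad.numpy(), hessian.numpy())
```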

D. Physics reach improvement
The improved background discrimination translates into an improved physics reach.
One way to measure the physics reach is the expected discovery significance, i.e. the p-value of the background-only hypothesis, for one particular assumed WIMP signal. Discovery significances can be computed using the profile likelihood test statistic

$$t_0 = -2 \ln \frac{L(\sigma = 0, \hat{\hat{\theta}})}{L(\hat{\sigma}, \hat{\theta})}, \quad (18)$$

where σ is the WIMP cross-section, θ represents nuisance parameters (here only the rate of the internal ER background), $\hat{\sigma}, \hat{\theta}$ denote the global best-fit values, and $\hat{\hat{\theta}}$ the best fit of θ conditional on σ = 0. For sufficiently large exposures, as well as certain other technical conditions [43], $t_0$ is asymptotically distributed as $\frac{1}{2}\delta(0) + \frac{1}{2}\chi^2_{\nu=1}$, with δ the Dirac delta function and $\chi^2_{\nu=1}$ the chi-squared distribution with one degree of freedom. We verified that the distribution of $t_0$ is well described by this approximation for 200 GeV/c² WIMPs and exposures of 10 tonne-years and higher, using fits to background-only toy MCs. Specifically, a 3σ asymptotic significance corresponds to a (3 ± 0.2)σ true significance, for both flamedisx and the (cS1, cS2) template method. Figure 7 shows the expected asymptotic discovery significances for a σ = 2 × 10⁻⁴⁷ cm², 200 GeV/c² WIMP in different exposures of the XENONnT/LZ-like test model discussed above. This model is not excluded by current experiments [6], but is well within range of the next generation of detectors [13,44]. We assumed a flat ER background level of 75 events/(tonne × year) and a radiogenic NR background of ∼0.04 events/(tonne × year). Clearly, flamedisx gives a substantial advantage over a classic (cS1, cS2) likelihood. With a ∼5 tonne fiducial volume, it cuts the time needed to reach a 5σ discovery for this model by around a year. More details on this projection can be found in [45].
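Under this asymptotic distribution, converting $t_0$ to a one-sided significance is simple; a minimal sketch:

```python
import numpy as np
from scipy import stats

def asymptotic_p_value(t0):
    """p-value of the background-only hypothesis under the asymptotic
    1/2 delta(0) + 1/2 chi2(1) distribution of t0."""
    return stats.norm.sf(np.sqrt(t0))

def asymptotic_significance(t0):
    # Equivalent one-sided significance: Z = sqrt(t0).
    return np.sqrt(t0)

print(asymptotic_significance(25.0), asymptotic_p_value(25.0))  # 5 sigma
```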
We stress that our model should be regarded as a test model only. Accurate projections of the performance of future detectors are more complex and rely on information available only to the collaborations themselves. For example, we omitted the accidental coincidence and coherent elastic neutrino-nucleus scattering backgrounds completely, and our neutron background spectral shape and its spatial rate dependence are chosen to roughly approximate published figures [6,46].
The physics reach can also be characterized by the median exclusion sensitivity in case dark matter does not exist. In this case, background discrimination is much less relevant, since the dark matter models probed are by definition at the statistical limit of what the experiment is capable of distinguishing. Only in the extreme background-limited case, where the sensitivity scales with the square root of the exposure, do discrimination improvements (and flamedisx) yield a proportional improvement in the exclusion sensitivity. LXe dark matter searches so far have usually stopped running to build a bigger detector well before reaching this limit.

V. DISCUSSION
A. Validity of the model

As discussed in section III, our model structure is not exactly equivalent to models previously used in the field, but is designed to have ample expressive power to fit real data. Since LXe collaborations have not released their raw data to the public, we cannot report on efforts to verify this here. We do not suggest experiments use flamedisx's defaults out of the box, but rather that they use flamedisx to fit their calibration data first. Indeed, one of flamedisx's key strengths is to offer a single framework capable of both calibration data fitting and the final scientific inference. For future flamedisx versions, we hope to implement defaults that more closely track NEST [20] for a variety of detector conditions.

Using a 'six-dimensional likelihood' might give some readers pause. Would this mean that orders of magnitude more calibration data must be collected to verify that the model fits? This is not the case, for several reasons. First, we must not confuse the number of observables with the number of parameters in the model. flamedisx, like an MC simulation, can use as many or as few parameters in its model functions as the user wants, which determines whether it under- or overfits the data. flamedisx looks at more observables (six) than a two-dimensional template for the same number of events, so it is a more sensitive instrument for deciding between models. If anything, using flamedisx would mean that an experiment needs less calibration data.
Moreover, a two-dimensional template is equivalent to a six-dimensional template that assumes the response is constant over four of the dimensions. A model that actually looks at six observables only has to improve on this rather low bar to be superior, e.g. by accounting for simple and easily verified effects such as the reduction in S2 resolution due to the finite electron lifetime. Finally, since flamedisx obviates computationally expensive template morphing, it becomes simpler to add additional nuisance parameters representing model uncertainties to the inference. Properly used, this will increase the robustness of results.

B. Impact on real experiments

flamedisx's advantage in physics reach is entirely due to allowing a full undiscretized (S1, S2, x, y, z, t) likelihood. We compared flamedisx against a (cS1, cS2) likelihood above (e.g. figure 7), but LXe dark matter search likelihoods often already incorporate space and time dimensions in a limited way. For two reasons, we still believe using flamedisx would significantly improve the physics reach of the imminent generation of experiments.
First, the space and time dimensions are very sparsely discretized in current likelihoods. That is, the histogram templates may cover more than two dimensions, but the number of bins in the spatio-temporal dimensions is low. The likelihood itself is still unbinned, but less accurate than it would be with an undiscretized model. XENON1T initially used a two-dimensional (cS1, cS2) likelihood [21]. In XENON1T's full science run, a sparse spatial discretization was used, mostly in the r dimension [6]. LUX's likelihood was discretized in four z segments and four time segments [47]. PandaX-II used 18 time segments [48]. LZ's sensitivity projection assumes a two-dimensional (cS1, cS2) likelihood [44].
Second, we assumed constant detector conditions and homogeneous drift and extraction fields in our example experiment. Neither of these were realized in the last generation of detectors [6,47]. Either of them would increase the importance of space and time dimensions in the likelihood, and therefore the benefit of using flamedisx.

VI. CONCLUSIONS AND OUTLOOK
We described a new framework for computing LXe TPC likelihoods, flamedisx, which replaces simulation-based templates with a direct differential rate computation implemented in TensorFlow. This enables high-dimensional undiscretized likelihoods, which will increase the physics potential of LXe detectors. It also enables consideration of more correlated nuisance parameters, leading to more robust results.
We hope flamedisx will be useful for experimental collaborations. The flamedisx source code is released at [18] under a permissive open-source license, and will continue to be developed. As members of XENON, we are particularly looking forward to possible application of flamedisx in the upcoming XENONnT experiment [13]. Eventually, we hope the methods used in flamedisx might inspire similar changes in likelihoods used by other types of particle physics experiments.