Likelihood-free inference of experimental Neutrino Oscillations using Neural Spline Flows

We discuss the application of Neural Spline Flows, a neural density estimation algorithm, to the likelihood-free inference problem of measuring neutrino oscillation parameters in long-baseline neutrino experiments. A method adapted to physics parameter inference is developed and applied to the case of the muon neutrino disappearance analysis at the T2K experiment.


I. INTRODUCTION
First indications of neutrino oscillations using atmospheric neutrinos were presented by the Super-Kamiokande experiment in Japan in 1998 [1]. The confirmation with solar neutrinos came in 2002 from the SNO experiment [2], and the oscillations were cross-checked with human-made beams at K2K [3] for atmospheric oscillations in 2006 and at KamLAND [4] for solar neutrino oscillations in 2004. After those remarkable years, the oscillation phenomenon was experimentally well established. Neutrino oscillation is the first, and so far the only, evidence of neutrinos having mass. Following the initial experiments, a successful set of measurements confirmed the oscillation picture and improved the understanding of the parameters of the Pontecorvo-Maki-Nakagawa-Sakata (PMNS) neutrino flavour mixing matrix [5]. The measurement of a non-zero θ_13 angle by T2K [6] and Daya Bay [7] opened the possibility of observing leptonic Charge-Parity (CP) symmetry violation (i.e. particles behaving differently from their antiparticles). As of today, almost all mixing angles have been measured [8], as well as the two mass differences between the three neutrino mass eigenstates. The remaining mixing-matrix parameters to be measured are the imaginary phase responsible for CP violation and the sign of one of the two neutrino mass splittings, the so-called mass hierarchy. Both measurements are within reach of current and near-future oscillation experiments.
Statistical methods used in the oscillation analyses so far comprise both frequentist and Bayesian approaches [9,10]. There are, however, some limitations to these methods. Both approaches rely on a binned likelihood, which might limit the sensitivity of the experiment, as they impose a Gaussian dependency on some of the marginalized parameters, affecting the correctness of the results. Additionally, they require very intensive CPU processing time, which reduces the flexibility of the statistical analysis and its cross-checks, and introduces strong constraints on the delivery of the results. These limitations derive from the intrinsic difficulties of the statistical data analysis that are described in Sec. II, using the T2K experiment as a reference example.
In this paper we propose an alternative statistical method to overcome some of the limitations of the methods currently in use. The proposed procedure is based on un-binned likelihood inference using neural density estimators. This method has the potential to be accurate and fast, and to reduce the possible bias due to the intrinsic binning of other approaches. Neural Spline Flows, the implementation of neural density estimators we chose, also have some advantages, since the Gaussian generator intrinsic to the method facilitates the introduction of experimental errors in the distributions. We discuss in this paper the basic concepts of the method and demonstrate its potential with a simplified example.

II. PROBLEM DEFINITION AND PHYSICAL SIMULATOR
Neutrino oscillation experiments search for a modification of the flavour content of a neutrino beam travelling in vacuum or matter over a certain distance. Beams are normally characterized at a near site, where the flux and neutrino flavour composition are not yet altered by oscillations. The same beam is sampled after a certain flight distance L. The change in the flavour composition can be determined in two different ways: (i) Neutrino flavour disappearance (P(ν_α → ν_α)): experiments search for the disappearance of a certain neutrino flavour as a function of the neutrino energy. The disappearance produces both a reduction in the flux of neutrinos of a given flavour and a distortion of the neutrino flux spectrum, observed in the distribution of the measured quantities, normally the muon momentum (p_µ) and the angle with respect to the neutrino direction (θ_µ).
(ii) Neutrino flavour appearance (P(ν_α → ν_β), α ≠ β): experiments search for the appearance of a neutrino flavour that is normally suppressed in the original neutrino flux. In T2K, this new flavour is electron neutrinos. The dependency of the oscillation on the neutrino energy is inferred from the momentum and the angle with respect to the neutrino direction of the electron ejected in the interaction of neutrinos with matter.
The neutrino flavour is determined by the flavour of the charged lepton (muon, electron or tauon) produced in charged-current interactions of neutrinos with nuclei. For the current analysis, we concentrate on the disappearance phenomenon (i).
In a synthetic way, the rate of experimentally observed neutrinos with observed properties (θ_ν^reco) can be described by

N_near(θ_ν^reco) = ∫ φ_near(E_ν) σ(E_ν) P_near(θ_ν^reco|E_ν) dE_ν + Back_near(θ_ν^reco)

for the near detector and

N_far(θ_ν^reco) = ∫ φ_far(E_ν) σ(E_ν) P_osc(E_ν) P_far(θ_ν^reco|E_ν) dE_ν + Back_far(θ_ν^reco)

for the far detector. The number of observed neutrinos depends on the cross-section (σ(E_ν)), the neutrino flux (φ_far,near(E_ν)), the probability of observing the experimentally accessible quantities (θ_ν^reco) given a neutrino energy (P_far,near(θ_ν^reco|E_ν)), the oscillation probability (P_osc(E_ν)) and the backgrounds observed in the detectors (Back_near,far(θ_ν^reco)).
The experimental challenge comes from inferring the neutrino energy, E_ν, given the experimental observables (θ_ν^reco). The term P_far,near(θ_ν^reco|E_ν) encapsulates not only the detector resolution, but also the neutrino-nucleus cross-section model predictions and uncertainties. Other difficulties arise from the limited knowledge (≈9% in the latest T2K results [11]) of the neutrino flux (φ_far,near(E_ν)) and of the neutrino cross-sections as a function of the energy (σ(E_ν)). The background terms (Back_far,near(θ_ν^reco)) are normally relevant (≈20% in T2K) [12], and they in turn depend on the neutrino-nucleus cross-sections in a non-trivial manner.
Both the frequentist and the Bayesian statistical approaches [9,10] use the near detector data to predict the probability density function at the far detector in the absence of oscillations. Once determined, the conditional probability density function f(θ_ν^reco|E_ν) can be used to determine the oscillation parameters entering P_osc(E_ν) by comparison with the far detector events. Most of the experimental effort is actually devoted to the determination of this conditional probability, which also depends on a large number of hidden and correlated parameters describing uncertainties in the detector performance, the cross-section model and the neutrino flux. Hidden parameters are marginalized in the analysis, and the experimental result for the oscillation parameters is provided as probability maps (the posterior in the case of Bayesian approaches).
In the particular case of the T2K experiment, the experimentally accessible observables (θ_ν^reco) are the momentum and direction of the µ lepton (p_µ, θ_µ), produced in the interaction of the neutrino with the target nucleus. In this case the probability density function f(θ_ν^reco|E_ν) simplifies to p(p_µ, θ_µ|E_ν). The near detector of the experiment measures the neutrino flux and constrains the neutrino-nucleus interaction, providing the estimation of the probability density function p(p_µ, θ_µ, E_ν) = p(p_µ, θ_µ|E_ν)p(E_ν) together with the expected number of interactions in the far detector in the absence of oscillations. The near detector also provides a dependency on the free parameters of the model and a full error covariance matrix relating all of them. To simplify the exercise, we ignore the error covariance matrix in this study and assume that the probability p(p_µ, θ_µ, E_ν) is implicitly known to the experiment through simulations.
The neutrino oscillation disappearance probability can be approximated by the simplified two-flavour [8] oscillation formula. The disappearance probability as a function of the initial energy of the neutrino E_ν is

P_osc(E_ν) = sin²(2θ_mix) sin²(1.27 Δm² · 295 / E_ν),   (1)

where 295 is the distance in kilometres between the near and far sites in the T2K experiment and 1.27 is the scaling factor that adjusts the oscillation phase when the distance is given in kilometres. E_ν is the neutrino energy in GeV, θ_mix the mixing angle of the two flavours and Δm² the difference of the squared masses of these two flavours in eV². θ_mix and Δm² are the parameters governing the oscillations. Hence, determining their values is the goal of neutrino disappearance oscillation experiments, after measuring the momentum and direction of the µ lepton (p_µ, θ_µ) at the far detector.
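As a minimal illustration of Eq. (1), the disappearance probability can be coded directly; the function name and the example parameter values below are our own illustrative choices, not part of the analysis:

```python
import math

L_KM = 295.0  # T2K near-to-far baseline in km

def p_osc(e_nu, theta_mix, dm2):
    """Two-flavour disappearance probability of Eq. (1).

    e_nu: neutrino energy in GeV, theta_mix in radians, dm2 in eV^2.
    """
    return math.sin(2.0 * theta_mix) ** 2 * math.sin(1.27 * dm2 * L_KM / e_nu) ** 2
```

The first oscillation maximum occurs when the phase 1.27 Δm² L / E_ν equals π/2, i.e. around E_ν ≈ 0.6 GeV for Δm² ≈ 2.5 × 10⁻³ eV², which is why the T2K flux peaks in that region.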

A. Physical Simulator
We have simplified the problem to demonstrate the viability of the proposed method to determine the oscillation parameters using Neural Spline Flows. Event samples are generated with the NEUT [13] Monte Carlo event generator, which models the interactions of neutrinos with nuclei. We also use a realistic neutrino flux energy spectrum provided by the T2K collaboration [14]. With both inputs, we generate ≈ 20M charged-current quasi-elastic (CCQE) events. CCQE is the most probable reaction at T2K energies, in which the neutrino transforms into a muon while converting a neutron into a proton, and it dominates the statistical sensitivity of the experiment. To simplify, we ignore other reaction channels and potential backgrounds, as well as detector effects. The generator provides n-tuples of events weighted according to their appearance probability as a function of the neutrino energy and of the angle and momentum of the muon. As explained above, the neutrino energy is not directly accessible experimentally, but it is required to generate samples of oscillated events, as explained in Sec. IV A, so it is important that this relation is kept in the generated sample.

III. METHODOLOGY
Evaluating the density of high-dimensional data has become an important task for unsupervised machine learning in recent years. Neural density estimation addresses it using neural networks, learning an estimate q_φ(x) of the exact target density p(x) from samples x ∼ p(x). In this work, this task is performed with Neural Spline Flows, a specific implementation of the more general normalizing-flows methods, presented in Sec. III A. In Sec. III B we explain how the density estimated at the near detector is combined with the analytical formula for neutrino oscillations at the far detector to obtain the likelihood used to perform Bayesian inference for the T2K experiment. Additionally, an alternative way of computing the posterior is explained in Sec. III C, based on the standard histogram approach but tweaked to approach the limit of an un-binned likelihood, closely related to what is obtained with Neural Spline Flows.

A. Neural density estimation using Neural Spline Flows
Consider a simulator which can produce samples x ∼ p(x) over R^D from the target density. Consider also q_φ(x), a flexible family of density functions parametrized by φ, i.e., q_φ(x) ≥ 0 for all x and φ, and ∫ q_φ(x) dx = 1 for all φ. Then, if we want to approximate p(x) through q_φ(x), we optimize the parameters φ by maximizing the log-likelihood of the approximated density under simulated data from the target distribution,

φ* = argmax_φ (1/N) Σ_{n=1}^{N} log q_φ(x_n),   x_n ∼ p(x).   (2)

This objective function is equivalent to minimizing the Kullback-Leibler divergence between the target and approximated densities, which is always non-negative and equal to 0 only if both densities are equal. Let us prove this statement quickly:

D_KL(p ∥ q_φ) = E_{x∼p}[log p(x) − log q_φ(x)] = E_{x∼p}[log p(x)] − E_{x∼p}[log q_φ(x)] ≈ const − (1/N) Σ_{n=1}^{N} log q_φ(x_n),

where in the last step we have approximated the expected value of log q_φ(x) with respect to p(x) by the finite-sample average. Hence, maximizing the objective function Eq. (2) is equivalent to minimizing the Kullback-Leibler divergence.
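The equivalence between maximizing the average log-likelihood and minimizing the KL divergence can be sketched numerically with a toy one-parameter family (unit-variance Gaussians q_µ); the target distribution, grid and sample size below are illustrative assumptions:

```python
import numpy as np

# Draw samples from the "target" p(x) = N(2, 1), standing in for a simulator
rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.0, size=50_000)

def avg_loglik(mu):
    """Average log-likelihood of a unit-variance Gaussian model q_mu."""
    return np.mean(-0.5 * (samples - mu) ** 2 - 0.5 * np.log(2 * np.pi))

# Maximizing the Eq.-(2)-style objective over a grid of candidate models
mus = np.linspace(0.0, 4.0, 401)
best_mu = max(mus, key=avg_loglik)
```

The maximizer of the finite-sample objective is the sample mean, which converges to the true mean 2.0 as N grows, exactly as the KL argument predicts.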
Normalizing flows are a mechanism for constructing such flexible probability density families for continuous random variables. A comprehensive review of the topic can be found in [15]; this Section gives a brief summary of how they are defined and how the parameters of the transformation are obtained, together with a specific implementation, the Neural Spline Flows (NSF) [16].
Consider a random variable u defined over R^D with known probability density p_u(u). A normalizing flow is characterized by a transformation T, defined on the same space R^D, from the target density p(x) of a random variable x to this known density, via

u = T(x).

The density p_u(u) is known as the base density and has to be easy to evaluate (e.g., a multivariate D-dimensional normal, as chosen throughout this work, or a uniform distribution in dimension D). The transformation T has to be invertible, and both T and T^{-1} have to be differentiable, i.e., T defines a diffeomorphism over R^D.
This allows us to evaluate the target density by evaluating the base density using the change-of-variables formula for density functions,

p(x) = p_u(T(x)) |det J_T(x)|,   (3)

where the Jacobian J_T(x) is the D × D matrix of the partial derivatives of the transformation T, (J_T(x))_{ij} = ∂T_i(x)/∂x_j. The transformation T in a normalizing flow is defined partially through a neural network with parameters φ, as described below, defining a density

q_φ(x) = p_u(T_φ(x)) |det J_{T_φ}(x)|.   (4)

The goal is to find the parameters φ that maximize Eq. (2) for x ∼ p(x). The subindex of T_φ will be omitted in the following, simply denoting the transformation of the neural network by T. If the transformation is flexible enough, the flow can be used to evaluate any continuous density in R^D. In practice, however, one exploits the property that the composition of diffeomorphisms is a diffeomorphism, which allows constructing a complex transformation as a composition of simpler ones. Consider the transformation T as a composition of simpler transformations T_k,

T = T_K ∘ ⋯ ∘ T_2 ∘ T_1.

Assuming z_0 = x and z_K = u, the forward evaluation and Jacobian determinant are

z_k = T_k(z_{k-1}),  k = 1, …, K,   and   det J_T(x) = ∏_{k=1}^{K} det J_{T_k}(z_{k-1}).

These two computations (plus their inverses) are the building blocks of a normalizing flow [17]. Hence, to make a transformation efficient, both operations have to be efficient. From now on, we will focus on a single transformation u = T(x), since constructing a flow from it simply amounts to applying compositions.
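A minimal sketch of the change-of-variables formula, using a single affine transformation in D = 1 instead of a neural network (the values of a and b are arbitrary illustrative choices): the flow density must match the analytic density of a shifted, scaled normal.

```python
import numpy as np
from scipy.stats import norm

# Base density p_u = N(0, 1); one affine "flow" step u = T(x) = (x - b) / a
a, b = 2.0, -1.0

def flow_density(x):
    u = (x - b) / a                    # forward transform T
    abs_det_jac = 1.0 / a              # |dT/dx| = 1/a
    return norm.pdf(u) * abs_det_jac   # change-of-variables formula

xs = np.linspace(-8.0, 6.0, 200)
```

Here the flow density reproduces N(b, a²) exactly; a trained flow approximates the analogous identity for an arbitrary target.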
To define a transformation satisfying both efficiency properties, the transformation is broken down into autoregressive one-dimensional transformations, one for each dimension of R^D:

u_i = τ(x_i; h_i),   h_i = c_i(x_{<i}; φ),

where u_i is the i-th component of u and x_i the i-th component of x. τ is the transformer, a one-dimensional diffeomorphism with respect to x_i with parameters h_i, and c_i is the i-th conditioner, a neural network that takes as input x_{<i}, i.e., the previous components of x, with φ the parameters of the neural network. The conditioner provides the parameters h_i of the i-th transformer of x_i depending on the previous components x_{<i}, implicitly defining a conditional density of x_i given x_{<i}. The transformer is chosen to be a differentiable monotonic function, since it then satisfies the requirements of a diffeomorphism. This construction makes the transformation easy to compute in parallel, and decomposing the transformation into one-dimensional autoregressive transformers makes the computation of the Jacobian determinant trivial, because of its triangular shape. To compute the parameters h_i of each transformer, one would naively need to evaluate a neural network with input x_{<i} for each component, a total of D times.
Masked autoregressive neural networks [18] enable computing all the conditioner functions simultaneously in a single forward pass of the neural network. This is done by masking out, with a binary matrix, the connections of the h_i output with respect to all components with index greater than or equal to i, making it a function of the components with index smaller than i.
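The triangular Jacobian structure of an autoregressive transformation can be checked numerically with a toy D = 3 affine transform whose conditioners are hand-written functions standing in for a masked network; everything below is illustrative:

```python
import numpy as np

def autoregressive_affine(x):
    """Toy autoregressive affine transform u_i = x_i * s_i(x_<i) + t_i(x_<i).

    The conditioners s_i, t_i are fixed toy functions standing in for a
    masked neural network; any such network yields the same triangular
    Jacobian structure.
    """
    s = np.array([1.5,
                  np.exp(0.3 * x[0]),
                  np.exp(-0.1 * x[1])])       # positive scales
    t = np.array([0.2, x[0], x[0] - x[1]])    # shifts
    u = x * s + t
    log_det = np.sum(np.log(s))  # triangular Jacobian: product of diagonal
    return u, log_det
```

Because u_i depends only on x_{<=i}, the Jacobian is lower triangular and its log-determinant is just the sum of the log scales, with no matrix determinant ever computed.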
In their work on Neural Spline Flows [16], Durkan et al. advocate the use of monotonic rational-quadratic splines as transformers τ: they are easily differentiable, analytically invertible, and more flexible than previous attempts using polynomials for these transformers, since their Taylor-series expansion is infinite.
The monotonic rational-quadratic transformer is defined as a quotient of two quadratic polynomials. By stacking many of these transformations, a highly flexible neural density estimator can be built; this is the estimator utilized throughout this work.
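A hedged sketch of a single rational-quadratic bin in this quotient-of-quadratics form, following the construction used in Neural Spline Flows (the knot positions and derivatives below are arbitrary test values): the interpolant should match the knots exactly and be monotone when the knot derivatives are positive.

```python
import numpy as np

def rq_bin(x, xk, xk1, yk, yk1, dk, dk1):
    """Monotonic rational-quadratic interpolant on one spline bin.

    (xk, yk)-(xk1, yk1) are the bin's knots and dk, dk1 the (positive)
    derivatives at the knots; the result is a quotient of two quadratics.
    """
    s = (yk1 - yk) / (xk1 - xk)      # bin slope
    xi = (x - xk) / (xk1 - xk)       # position inside the bin, in [0, 1]
    num = (yk1 - yk) * (s * xi**2 + dk * xi * (1 - xi))
    den = s + (dk1 + dk - 2 * s) * xi * (1 - xi)
    return yk + num / den

xs = np.linspace(0.0, 1.0, 101)
ys = rq_bin(xs, 0.0, 1.0, -2.0, 3.0, 0.5, 4.0)
```

A full NSF transformer stitches many such bins together (with learned widths, heights and derivatives from the conditioner) and adds linear tails outside the spline interval.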

B. Neural Spline Flows applied to the T2K Oscillation problem
We start by estimating the energy spectrum E_ν of the neutrinos, together with the momentum p_µ and angle θ_µ of the measured muon, without oscillations, obtaining the density p(p_µ, θ_µ, E_ν) as measured by the near detector. This is done by learning the density with an NSF from the Monte Carlo data, generated as presented in Sec. II A. The base density p_u(u) used in Eq. (4) is a three-dimensional standard normal distribution.
Having estimated the joint probability p(p_µ, θ_µ, E_ν) of the initial distribution at the near detector, we need to construct the conditional density p(p_µ, θ_µ | θ_mix, Δm²) of the observed magnitudes given the oscillation parameters at the far detector in order to perform Bayesian inference. For this, we simply integrate the probability of not oscillating, 1 − P_osc from Eq. (1), over the energy spectrum of the joint distribution:

p(p_µ, θ_µ | θ_mix, Δm²) = C(θ_mix, Δm²) ∫ [1 − P_osc(E_ν; θ_mix, Δm²)] p(p_µ, θ_µ, E_ν) dE_ν,

where C(θ_mix, Δm²) is a normalization constant computed after performing the integral. With this we have the probability of observing a single muon with momentum p_µ and angle θ_µ after oscillations, given the parameters θ_mix and Δm².
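The integral over the energy spectrum can be sketched with a simple Riemann sum, using a placeholder Gaussian in place of the learned joint density and assumed oscillation parameters; the survival weighting can only reduce the unnormalized density:

```python
import numpy as np

# Assumed toy oscillation parameters; 295 km is the T2K baseline
THETA, DM2 = np.pi / 4, 2.5e-3

def p_osc(e_nu):
    return np.sin(2 * THETA) ** 2 * np.sin(1.27 * DM2 * 295.0 / e_nu) ** 2

# Energy slice of a toy stand-in for the learned joint density, evaluated
# at one fixed (p_mu, theta_mu) point
e_grid = np.linspace(0.2, 3.0, 1401)
de = e_grid[1] - e_grid[0]
joint_slice = np.exp(-0.5 * ((e_grid - 0.8) / 0.3) ** 2)  # placeholder

# Unnormalized oscillated density: integrate (1 - P_osc) against the joint
osc_val = np.sum((1.0 - p_osc(e_grid)) * joint_slice) * de
no_osc_val = np.sum(joint_slice) * de
```

In the actual method the NSF supplies p(p_µ, θ_µ, E_ν) on the energy grid, and the normalization C is obtained by repeating this integral over the (p_µ, θ_µ) plane.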
In order to take into account the number of observed samples, the extended likelihood [27,28] is used, modifying the likelihood with a Poisson count term that accounts for the expected number of events for a given set of parameters and the actual observed number:

L(θ) = [μ(θ)^{N_obs} e^{−μ(θ)} / N_obs!] ∏_{i=1}^{N_obs} p(x_i | θ).

In our case, the Poisson parameter μ(θ) is obtained by integrating the possible oscillations over the whole energy spectrum, scaled by the initial number of particles N_ini:

μ(θ_mix, Δm²) = N_ini ∫ [1 − P_osc(E_ν; θ_mix, Δm²)] p(E_ν) dE_ν.

Hence, the extended likelihood we apply in the analysis is

L(θ_mix, Δm²) = [μ^{N_obs} e^{−μ} / N_obs!] ∏_{i=1}^{N_obs} p(p_µ^i, θ_µ^i | θ_mix, Δm²),   (5)

and the posterior for the parameters takes the form

p(θ_mix, Δm² | {(p_µ^i, θ_µ^i)}) ∝ L(θ_mix, Δm²) p(θ_mix, Δm²),   (6)

with p(θ_mix, Δm²) the prior information before observing the events and {(p_µ^i, θ_µ^i)} the set of observed events.
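In log space, an extended likelihood of this Poisson-times-product form can be sketched as follows (the function name and inputs are illustrative; µ and the per-event log densities would come from the expressions above):

```python
import math

def log_extended_likelihood(n_obs, mu, log_probs):
    """Log of the Eq.-(5)-style extended likelihood (sketch).

    mu: expected event count mu(theta_mix, dm2) for the tested parameters;
    log_probs: log p(p_mu_i, theta_mu_i | theta_mix, dm2) per observed event.
    """
    poisson_term = n_obs * math.log(mu) - mu - math.lgamma(n_obs + 1)
    return poisson_term + sum(log_probs)
```

Working with logs avoids underflow when multiplying hundreds or thousands of small per-event densities; the lgamma term is constant in the parameters and could be dropped for inference.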

C. Reference analysis using an approximate un-binned likelihood
The results of the experiments are validated using an approximate un-binned likelihood, computed following Eq. (5). To do so, event histograms M(p_µ^i, θ_µ^j | E^k), binned in muon momentum and angle given a neutrino energy, are generated from the simulated data of Sec. II A. Then, the oscillated histogram is computed by reweighting the histogram content, using Eq. (1), as

M_osc(p_µ^i, θ_µ^j) = Σ_k [1 − P_osc(E^k)] M(p_µ^i, θ_µ^j | E^k),

and is interpolated linearly to reduce the effects due to binning, yielding M̃_osc(p_µ, θ_µ). The number of expected events, μ(θ_mix, Δm²), is computed by summing the binned probability:

μ(θ_mix, Δm²) = Σ_{i,j} M_osc(p_µ^i, θ_µ^j).

The probability, obtained by normalizing the oscillated maps, takes the form

p(p_µ, θ_µ | θ_mix, Δm²) = M̃_osc(p_µ, θ_µ) / μ(θ_mix, Δm²).

The likelihood L(θ_mix, Δm²) is finally computed for discrete values of θ_mix and Δm², distributed in a grid identical to the one used for the NSF method, for a proper comparison between both methods. For Sec. IV C, we tested the results with different numbers of initial bins in energy, momentum and angle, to find a good compromise between stability of the result and speed. In what follows we call this the Hist method; it is used as a reference for the NSF calculations.
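The reweight-sum-interpolate pipeline of the Hist method can be sketched with NumPy histograms and SciPy interpolation; the toy event sample, binning and oscillation parameters below are assumptions for illustration:

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

rng = np.random.default_rng(1)

# Toy unoscillated events (p_mu, theta_mu, E_nu) standing in for NEUT output
p = rng.gamma(2.0, 0.3, 100_000)
th = rng.uniform(0.0, 1.0, 100_000)
e = rng.gamma(3.0, 0.25, 100_000)

def p_osc(e_nu, theta_mix=np.pi / 4, dm2=2.5e-3):
    return np.sin(2 * theta_mix) ** 2 * np.sin(1.27 * dm2 * 295.0 / e_nu) ** 2

# 3D histogram binned in (p_mu, theta_mu, E_nu)
edges = (np.linspace(0, 3, 31), np.linspace(0, 1, 21), np.linspace(0, 4, 41))
M, _ = np.histogramdd(np.stack([p, th, e], axis=1), bins=edges)

# Reweight each energy slice by the survival probability, then sum over E
e_centers = 0.5 * (edges[2][:-1] + edges[2][1:])
M_osc = (M * (1.0 - p_osc(e_centers))).sum(axis=2)

mu_expected = M_osc.sum()  # expected number of surviving events
centers = [0.5 * (ed[:-1] + ed[1:]) for ed in edges[:2]]
interp = RegularGridInterpolator(centers, M_osc / mu_expected,
                                 bounds_error=False, fill_value=0.0)
```

The linear interpolation over bin centers is what pushes this approach towards an un-binned behaviour; the residual dependence on the initial binning is discussed in Sec. IV C.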

IV. EXPERIMENTS
The methodology is tested on simulated neutrino oscillation experiments modelled after the T2K experiment. For this, ten different sets of observed events are constructed following Sec. IV A, with additional data used both to fit the NSF and to construct the un-binned likelihood. The training of the NSF is shown and verified in Sec. IV B, where different statistics of the learned density are computed to verify its integrity. Finally, in Sec. IV C, the inference on the ten experiments is performed using both the NSF and the un-binned likelihood, and the findings of the two methodologies are discussed.

A. Data samples and experiment simulation
As defined in Sec. II A, ≈ 20M CCQE events are generated, using the neutrino flux energy spectrum from the T2K collaboration [14] to produce the energy of the incoming neutrino E_ν, and the NEUT event generator [13] to compute the momentum p_µ and angle θ_µ of the resulting muon, forming triplets (p_µ, θ_µ, E_ν) that describe an implicit density of these three magnitudes.
These events were split into three subsets:
• Training set: ≈ 15M events, used to optimize the NSF during training by computing the gradient of the loss function Eq. (2) and updating its parameters. This set is also used to construct the histograms of the un-binned likelihood of Sec. III C.
• Validation set: ≈ 4M events, used to validate the NSF during training and select the model that generalizes best outside the training samples.
• Test set: ≈ 1M events, never seen by the NSF during the training and model selection, used to generate the observations of the different experiments, and perform the additional validation of the trained NSF in Sec. IV B.
As just mentioned, the experimental observations are generated from the test set. Two kinds of experiments were performed for five different sets of parameters: one of low statistics (around 500 observations) and one of high statistics (around 50k observations). To generate the observations for a fixed set of parameters (θ_mix, Δm²), the procedure was the following:
1. Choose an approximate number of observed events (500 or 50k), N_approx.
2. Sample events from the test set, oscillate each event according to its energy E_ν and accept it with the probability given by Eq. (1), until a total of N_approx events is accepted. This takes a number N_init of tries.
3. Sample a different set of N_init events from the test set, and accept each with the probability given by Eq. (1). The accepted samples form the observation set, containing N_obs samples, with N_obs similar to N_approx.
Note that N_init corresponds to the initial number of neutrinos produced at the source before oscillating, and N_obs to the number of neutrinos that have not oscillated. These numbers are hard to determine and also lead to a stochastic number of observed events, hence this heuristic procedure to produce N_init and the variable number N_obs. The parameters of the initial experiments 1 and 2 in Sec. IV C are chosen following the best-fit values of sin²θ_23 and Δm² from [29]. The parameter θ_mix is computed as

θ_mix = arcsin(√(sin²θ_23)),

folded into the interval [0, π/4] using the fact that Eq. (1) is symmetric about θ_mix = π/4. For the best-fit value, maximal mixing is almost achieved (sin 2θ_mix = 0.9986). Experiments 3 and 4 (5 and 6) follow the same procedure, but choosing sin²θ_23 within 1 (2) σ of the best fit. In experiments 7-10, the mixing angle takes the value of experiments 3 and 4, but Δm² is changed in order to explore the behaviour across the parameter space.
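Step 2 of the procedure above can be sketched as rejection sampling, accepting each event with its survival probability 1 − P_osc (since N_obs counts the neutrinos that have not oscillated); the toy energy spectrum and parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

def survival_prob(e_nu, theta_mix, dm2, l_km=295.0):
    return 1.0 - np.sin(2 * theta_mix) ** 2 * np.sin(1.27 * dm2 * l_km / e_nu) ** 2

# Toy "test set" energies standing in for the NEUT sample
test_e = rng.gamma(3.0, 0.25, 1_000_000)

def make_observation(n_approx, theta_mix, dm2):
    """Accept events with their survival probability until n_approx survive."""
    n_init = 0
    accepted = []
    while len(accepted) < n_approx:
        e = test_e[rng.integers(len(test_e))]
        n_init += 1
        if rng.random() < survival_prob(e, theta_mix, dm2):
            accepted.append(e)
    return np.array(accepted), n_init

obs, n_init = make_observation(500, np.pi / 4 * 0.95, 2.5e-3)
```

The returned n_init plays the role of the initial number of neutrinos before oscillation, while the accepted events form the observation set of step 3.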

B. Training and validation of the NSF
In order to apply the methodology described in Sec. III B, we start by estimating the density p(p µ , θ µ , E ν ) using a Neural Spline Flow on ≈ 15M samples, with ≈ 4M for validation.
Because of the similarities in data dimension and number of samples, but with a simpler structure, we chose hyperparameters almost identical to those of the Power dataset in Appendix B of [16], i.e., Adam optimizer with learning rate 0.0005, batch size 512, 400k training steps, 5 flow steps, 5 transform blocks, 128 hidden features and 8 bins. Additionally, the learning rate was decreased during training using a cosine scheduler to ensure stabilization at the end of the training procedure. As shown in Fig. 1, the validation loss stabilizes at the end of the training and appears to converge for the selected network architecture. To ensure that the transformation T of Eq. (4) was properly found by the NSF, consider the transformed data u for the test set, i.e., the events that were not used during the training of the algorithm. If T is correctly approximated, u should follow a three-dimensional standard normal distribution, as shown qualitatively in Fig. 2. To verify this, two tests were performed.
The first test was to take 10 disjoint subsamples of the transformed u of size ≈ 100k and compute their covariance matrices. Then, the covariance matrix was also computed for samples of the same size drawn from a true three-dimensional standard normal distribution, 10k times, in order to find the statistical distribution one would expect for the covariance matrix entries. In Fig. 3, the blue histograms show this distribution of the true covariance matrix entries, together with the values from the disjoint subsamples of the transformed data u as dashed black vertical lines, which agree with the expected distribution. Secondly, the Kolmogorov-Smirnov test was performed on each of the components of u = (u_1, u_2, u_3), with respect to a one-dimensional standard normal distribution. As shown in Table I, the p-values give no grounds to reject the null hypothesis that the data u_i were effectively drawn from a standard normal distribution. With these two tests satisfied, each component following a standard normal distribution and the covariance matrix of the transformed data agreeing with the distribution of covariance matrices of the base distribution (a three-dimensional standard normal), we can assume that the transformed data u correspond to samples from that base distribution, justifying that the transformation T was properly found by the NSF and hence allowing us to evaluate p(p_µ, θ_µ, E_ν) to a good enough approximation.
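The component-wise Kolmogorov-Smirnov check can be sketched with scipy.stats.kstest; here true standard-normal samples stand in for the NSF-transformed test data u:

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(3)
u = rng.standard_normal((100_000, 3))  # stand-in for transformed test data

# Component-wise one-sample Kolmogorov-Smirnov test against N(0, 1)
results = [kstest(u[:, i], "norm") for i in range(3)]
```

For a well-trained flow, the transformed test data should behave like these reference samples: small KS statistics and p-values that do not reject the standard-normal null hypothesis.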

C. Inference results
To test the performance of obtaining the posterior according to Sec. III B, ten different observation sets were constructed following Sec. IV A, with five different combinations of mixing angle θ_mix and mass-squared difference Δm².
For each of the five combinations of parameters, low- and high-statistics experiments were performed. The low statistics are of the order of the samples actually observed at T2K, and are used to assess the performance when dealing with a small number of events. The high statistics help ensure that, in the limiting case where the traditional binning methodology can be very refined, it agrees with the un-binned NSF posterior. Table II shows the ten experiments, with the number of observed samples, the true parameters and the results using the NSF (Sec. III B). Additionally, the result using an approximate un-binned likelihood, denoted by Hist (Sec. III C), is also displayed, which corresponds to the limiting case when the binning can be refined enough to behave like an un-binned estimation. Since Bayesian inference is used, the inference on the parameters yields a density function according to Eq. (6). The central value shown for each parameter is the one that maximizes the joint posterior density. The uncertainty is then computed by marginalizing one of the parameters in the 2-dimensional density to obtain the 1-dimensional density of the other, and finding the interval such that, for a 1 − α confidence level (CL), α/2 of the density lies on each side. This is done for both the NSF posterior and the Hist posterior. In general, the results of both methodologies agree, with slight fluctuations in the confidence levels. The difference could come from the NSF not learning the density of the points perfectly, or from the interpolation performed by the Hist method introducing approximation errors.
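The equal-tailed interval construction (α/2 of the density on each side) can be sketched on a 1D marginal posterior; the Gaussian toy posterior below is only used to check that the 68% interval lands near ±1σ:

```python
import numpy as np

def central_interval(grid, marginal_pdf, alpha=0.32):
    """Equal-tailed 1 - alpha credible interval from a 1D marginal posterior.

    grid: uniformly spaced parameter values; marginal_pdf: (unnormalized)
    posterior values evaluated on that grid.
    """
    cdf = np.cumsum(marginal_pdf, dtype=float)
    cdf /= cdf[-1]
    lo = grid[np.searchsorted(cdf, alpha / 2)]
    hi = grid[np.searchsorted(cdf, 1 - alpha / 2)]
    return lo, hi

# Toy Gaussian posterior: the 68% interval should be close to +/- 1 sigma
g = np.linspace(-5, 5, 2001)
pdf = np.exp(-0.5 * g**2)
lo, hi = central_interval(g, pdf)
```

In the analysis the marginal posterior comes from summing the 2D posterior of Eq. (6) over one of the two parameters on the inference grid.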
Additionally, in order to visualize the agreement in two dimensions, Fig. 4 shows the 68% and 95% Highest Posterior Density (HPD) curves [30], together with the best fit (highest posterior value), for experiments 1 (top left), 2 (top right), 7 (bottom left) and 8 (bottom right). The HPD regions show a clear overlap for low statistics and slight fluctuations for larger statistics. When comparing the area sizes, the relative difference of the Hist method with respect to the NSF across the ten experiments is 3.1 ± 1.9%, showing that the areas agree within 2σ on average. In Fig. 4, one observes that the areas, even being similar in size, are slightly shifted with respect to one another. Empirical checks computing the posterior with actual discrete binning show that, as the binning is refined, the posterior shifts towards the NSF result. The displacement of the Hist method could thus be a remnant of the initial binning used in the method before interpolating.
Both the quantitative (Table II) and qualitative (Fig. 4) results show that NSFs indeed provide a tool to perform likelihood-free inference on physical simulators such as that of the T2K experiment, on par with the un-binned likelihood approach we compared against.

V. SUMMARY
In this work we have demonstrated the viability of a likelihood-free inference methodology based on Neural Spline Flows on a simplified neutrino oscillation problem at the T2K experiment. We presented the importance of estimating the initial flux of neutrinos at the near detector, together with the properties of the interacting muons, which enables an analysis of the oscillation properties at the far detector. A brief introduction to normalizing flows as density estimators from data samples was given, with emphasis on the particular implementation of Neural Spline Flows. We developed a framework that uses this estimation of the density from data taken at the near detector to perform inference of the oscillation parameters at the far detector for a simplified two-flavour neutrino problem, allowing exact inference if the density is properly estimated. This gives an edge over traditional binned histogram methods, especially when the statistics are low, as is the case in the T2K experiment, since the binning cannot be as refined as one would like. An un-binned alternative, formulated by interpolating the histogram method, was constructed to check the results. Additionally, the integrity of the learned density was thoroughly verified through different statistical and empirical tests. The results obtained using the Neural Spline Flow methodology and the un-binned likelihood methodology are in agreement with each other and open new possibilities for using similar likelihood-free neural-network inference in more complex analyses.