Bayesian parameter estimation in $\chi$EFT using Hamiltonian Monte Carlo

The number of low-energy constants (LECs) in chiral effective field theory ($\chi$EFT) grows rapidly with increasing chiral order, necessitating the use of Markov chain Monte Carlo techniques for sampling their posterior probability density function. For this we introduce a Hamiltonian Monte Carlo (HMC) algorithm and sample the LEC posterior up to next-to-next-to-leading order (NNLO) in the two-nucleon sector of $\chi$EFT. We find that the sampling efficiency of HMC is three to six times higher compared to an affine-invariant sampling algorithm. We analyze the empirical coverage probability and validate that the NNLO model yields predictions for two-nucleon scattering data with largely reliable credible intervals, provided that one ignores the leading order EFT expansion parameter when inferring the variance of the truncation error. We also find that the NNLO truncation error dominates the error budget.


I. INTRODUCTION
Chiral effective field theory (χEFT) descriptions [1][2][3][4] of the strong nuclear interaction depend on low-energy constants (LECs) that govern the strength of the various interaction terms.Their numerical values must be inferred from data and are best described by a posterior probability density function (pdf).Clearly, this parametric uncertainty will combine with the inherent discrepancy of χEFT, i.e., the epistemic gap between model predictions and real world observations.Thus, their joint uncertainty must be estimated and propagated forward before the credibility of the underlying theory and its predictions for nuclear observables can be assessed.Drawing samples from this posterior predictive distribution (ppd) is key to analyze the implications of physical and probabilistic modeling choices in the ab initio description of nuclear systems.Fortunately, operating with a χEFT endowed with a power counting offers a principal handle to estimate the relevant model discrepancy in terms of Bayesian credible intervals for the EFT truncation error [5].Still, drawing samples from relevant posterior pdfs presents a formidable challenge, particularly for high-dimensional parameter volumes.
Here, we introduce the Hamiltonian Monte Carlo (HMC) method [6] to sample LEC posteriors and ppds in χEFT.HMC is a Markov chain Monte Carlo (MCMC) [7,8] method that exploits the equations of Hamiltonian dynamics for drawing uncorrelated pdf samples with a high acceptance probability.Crucially, HMC performs well also in cases of high-dimensional probability distributions where most other MCMC algorithms fail to converge within a reasonable time frame.In this paper we demonstrate how to use HMC to efficiently sample the LEC posteriors at leading order (LO), nextto-leading order (NLO), and next-to-next-to-leading order (NNLO) in deltaless χEFT.We also perform model checking and model validation by drawing samples from * isak.svensson@chalmers.se the ppd for elastic nucleon-nucleon (N N ) scattering in the neutron-proton (np) and proton-proton (pp) channels.All pdfs are conditioned on data from the recent Granada database [9,10] of measured N N scattering cross sections.We employ a statistical model for the EFT truncation error, proposed by Wesolowski et al. [11], to account for the model discrepancy due to excluded contributions from higher chiral orders.
In a Bayesian data analysis it is straightforward to condition on existing results.We make use of a previous Roy-Steiner analysis [12,13] to incorporate prior knowledge about the LECs that govern the strengths of subleading pion-nucleon (πN ) interactions.We also place a prior in accordance with naturalness expectations on the LECs that govern contact interaction strengths.Throughout this work, we define the necessary one-and two-pion exchanges and contact interactions according to Ref. [3] and follow the same conventions for the potential, scattering amplitudes, and N N scattering observables as in Ref. [14].We use a non-local super-Gaussian momentumspace regulator with a fixed cutoff Λ = 450 MeV.

II. BAYESIAN PARAMETER ESTIMATION IN χEFT
Bayes' theorem pr( α|D, I) = pr(D| α, I) • pr( α|I) pr(D|I) , provides a straightforward way to express the posterior pdf pr( α|D, I) for the LECs α in terms of a likelihood pr(D| α, I), prior pr( α|I), and marginal likelihood (or evidence) pr(D|I).In this work, the LEC posterior is conditioned on N N scattering data D and additional prior information I.The marginal likelihood, which does not depend on α, plays no role in parameter estimation and we therefore have pr( α|D, I) ∝ pr(D| α, I) • pr( α|I).
A hallmark of the Bayesian approach is the transparent and straightforward inclusion of prior information I.A main goal of this work is to incorporate the probabilistic model for the EFT truncation error from Ref. [11] and introduce an HMC algorithm to efficiently sample the posterior pdf for all LECs up to NNLO.
In the following subsections we present the N N data we use, our prior and likelihood, and specify the hyperparameters and details of the model for relating experimental N N scattering data to a χEFT prediction at a given chiral order.

A. Prior
Our full prior for the LECs is written as a product of two independent priors: pr( α N N |I) for the contact LECs and pr( α πN |I) for the πN LECs.This form, pr( α|I) = pr( α N N |I) • pr( α πN |I), implies no prior assumption of correlation between LECs from these sectors.Furthermore, we adopt independent and identical Gaussian pdfs for all contact LECs with zero-mean and standard deviation ᾱ = 5.This is a rather weak prior which again makes no assumption of correlations.However, it mildly encodes the naturalness expectation of the LECs by penalizing LEC values 1, thus safeguarding somewhat against overfitting [15].The exact value of ᾱ does not have a major impact on the outcome, as the large N N data set used in the likelihood strongly dominates over the prior.The prior at LO and NLO in χEFT-where no πN LECs appear-is thus given by pr( α|I) = N 0, Σ prior (4) with For the πN LECs c 1 , c 3 , c 4 , entering at NNLO, we have chosen a much more restrictive prior based on mean values and covariance matrices extracted from the maximum-likelihood fit of a Roy-Steiner analysis of the πN scattering amplitudes by Siemens et al. [13].That analysis proceeds in a kinematical region of chiral perturbation theory that exhibits a stronger curvature with respect to the subthreshold parameters of the πN scattering amplitudes.Once matched to the πN LECs in χEFT, we obtain a rather informative prior for the corresponding part of the potential.We will return to a more detailed discussion of the πN LECs when analyzing the MCMC posteriors in Sec.IV B.
To be specific, the LECs that we consider at each order are: We employ a conventional notation linked to the momentum partial-wave basis, see, e.g., Ref. [3].Note also that isospin-breaking effects enter at NLO and only in the 1 S 0 partial-wave.

B. Elastic nucleon-nucleon scattering observables
We condition our LEC posterior on experimental data.We use roughly two thirds of the Granada 2013 database [9,10] to define the training data set D. We hold out all scattering data in the range 80 ≤ T lab ≤ 100 MeV of laboratory scattering energies and assign it to a validation data set D. We further assign to D all data in the energy range 290 < T lab ≤ 350 MeV, i.e., just above the pion-production threshold, plus the set of integrated np scattering cross sections from Ref. [16].Overall, this choice of data split enables detailed model checking while leaving ample information for estimating the LECs at each chiral order up to NNLO.In all, D consists of 4366 experimental data points to be used as input in the parameter estimation process, while D contains 2018 validation data points.The details of the N N scattering data used for parameter estimation and validation are presented in Table I.Given a set of numerical values for the LECs, we compute scattering amplitudes by solving the Lippmann-Schwinger equation for all partial waves with maximum total angular-momentum quantum number J max ≤ 30.This is more than enough to converge the physical model predictions for the resulting scattering observables.Thus, we neglect all sources of numerical or computational method uncertainties going forward.
In pp scattering we include all relevant electromagnetic effects, as outlined in Ref. [14]: the static Coulomb interaction and its relativistic correction, the first-order approximation to the vacuum polarization, and relevant magnetic moment interactions.This set of long-ranged electromagnetic interaction has been demonstrated by the Nijmegen group to be sufficient for explaining the observed low-energy pp scattering data [18,19].For this reason, we also neglect any theoretical model discrepancy due to neglected higher-order contributions to the electromagnetic interaction.

C. Likelihood and EFT truncation error
In this work we relate an experimental measurement y exp of some observable y to the true value y true via a statistical model This also introduces the experimental uncertainty, δy exp , as a random variable for which we employ the standard deviations provided in the Granada database.We also relate the true value and our theory prediction y th via where δy th is the model discrepancy term.We will model δy th as coming from the truncation of the chiral expansion in χEFT at some finite chiral order k.When doing so we tacitly assume that the entire epistemic uncertainty of the theory can be systematically reduced by going to higher orders in χEFT.
To model the truncation error we follow Furnstahl et al. [5] and Wesolowski et al. [11] and formally write the χEFT expansion for some observable prediction y th up to chiral order k as where y ref is a reference value, c ν are dimensionless EFT expansion coefficients, and Q is a dimensionless expansion parameter that we assign as where m π is the pion mass, p is a soft scale associated with the observable, and Λ b is a hard scale.In this work, we will set Λ b = 600 MeV and p is given by the N N scattering momentum.All contributions to the potential at chiral order ν = 1 vanish, i.e., c 1 = 0, since we employ Weinberg power counting.We refer to the different orders as LO (ν = 0), NLO (ν = 2), and NNLO (ν = 3).
Truncating the χEFT expansion at order k induces a truncation error given by δy where K → ∞.Assuming that all c ν coefficients, including those for which ν ≤ k, are independent and identically distributed, we can use known lower-order coefficients c 0 , . . ., c k to learn about the single pdf from which the unknown higher-order coefficients should also be sampled.This, combined with a prior assumption about the form of the pdf for the c ν coefficients, provides us with a prescription for quantitatively estimating δy th .We assume a Gaussian prior with variance c2 for c ν , All physical scales reside in Q and y ref and it is reasonable to assume that the c ν coefficients are of order 1.This is an assumption we will test explicitly in Sec.II C 2 when estimating c from order-by-order shifts in the prediction of different N N scattering observables y.
Placing a normal prior for c ν , with known c2 , leads to a normal pdf with variance σ th for δy (k) th in the limit K → ∞ given by [11] pr(δy where we have also conditioned on Q, and In this work we will assume that the theory errors for different observables (y i , y j ) are completely uncorrelated.
This leads to a diagonal covariance matrix Σ th for the truncation error at order k that is given by Note that the LO truncation error is proportional to Q 4 /(1 − Q 2 ) since the ν = 1 chiral order vanishes such that the first non-zero term in Eq. ( 13) corresponds to ν = 2.The likelihood function for the entire data set D canassuming that experimental and theoretical errors are independent-be expressed as with the residual vector r defined as i.e., the difference between the experimental data and the corresponding set of χEFT predictions at order k given a vector α of LEC values.We employ a diagonal covariance matrix Σ exp also for the experimental data and used the normalization factors from the Granada database [9,10] for the joint systematic uncertainty of a group of data originating from the same experiment [20].Note also that the likelihood ( 18) is implicitly conditional on c and the breakdown scale Λ b .Let us also mention two possible extensions of the statistical model used in this work.Firstly, instead of placing an implicit delta prior on c2 , one could follow standard Bayesian practice and exploit the class of conjugate priors, assuming, e.g., an inverse-χ 2 pdf such that pr(c 2 |I) = χ −2 (ν 0 , τ 2 0 )1 .The conjugacy of this prior with respect to the normal pdf leads to a Student's t pdf for the expansion parameters c ν and the truncation error δy (k) th , see Ref. [21].Secondly, one could relax the assumption of completely uncorrelated truncation errors and model the covariance structure of the expansion coefficients c ν using, e.g., a Gaussian process where the correlation length in Q is one of its hyperparameters [21].

Choice of reference values
The reference value, y ref , in Eq. ( 13) for each observable can be assigned in different ways.Options include using an experimental value, picking some suitable theoretical prediction, or a value motivated from an order of magnitude estimate.We have chosen to let LO predictions set the scale for observables.But since LO predictions depend on the LECs α-which at this point in the th ( α ν=0 ), be the reference value for SGT and DSG observables.For spin polarization and correlation cross sections we use y ref = 0.15 as motivated by the average of the LO maximum-likelihood prediction.

Estimating c
We proceed to estimate c from the finite set of expansion coefficients c ν obtained by rearranging Eq. ( 11) and setting α = α ν .In general, we obtain a vector, c ν , of expansion coefficients at order ν c ν,i = y (20) for a representative set of N o = 54 observables at different scattering energies and angles.Specifically, we used a combination of total cross sections, differential cross sections, and spin observables in np and pp at an energy grid T lab = 20, 70, 120, 170, 220, 270 MeV, and scattering angles θ = 50, 150 degrees, such that we cover the relevant kinematical regions while being sufficiently well separated in the energy-angle variables to reduce the influence of finite correlations in the expansion coefficients.We account for the vanishing of chiral order ν = 1 when computing the order-by-order differences.
At each order we then compute a root-mean-square (RMS) value, i.e., Our choice to set y ref according to the LO MLE effectively constrains the expansion coefficient c 0 , leading to c 0 = 1.17, accounting for the averaged spin polarizations.See the third column in Table II for results at the other two orders.We also define an estimate of c as the total RMS value of the c ν,i coefficients up to, and including, the order k at which we truncate the EFT expansion (see the fourth column in Table II).The highest-order estimate for c is given by c0...3 and therefore includes information from LO, NLO, and NNLO.Note that outlier values of c ν,i were removed before computing all RMS values since they would otherwise influence our estimate disproportionally.We used the following procedure to determine outliers: 1. Compute the lower and upper quartiles C 25% and C 75% of c ν,i .
2. Determine the distance ∆C between the quartiles.
3. Discard any c ν,i that falls outside the interval The numbers of removed outliers at each order are provided in Table II.
Since we express the observable predictions as an EFT expansion in Eq. ( 11) we expect to observe expansion coefficients of natural size.On average we find relatively natural values that characterize the truncation error.However, the larger differences in the χEFT predictions when going from LO to NLO is a signature of an irregular convergence pattern and incurs an unexpectedly large value for cν=2 .We analyze the consequences of this in Sec.V.

III. HAMILTONIAN MONTE CARLO
Hamiltonian Monte Carlo shares some key features with the canonical Metropolis-Hastings algorithm [7,8].These MCMC methods draw samples from a pdf pr( α) by producing an ergodic Markov chain of states whose unique stationary distribution is pr( α).Given a current state α n of the MCMC chain, the Metropolis-Hastings algorithm proposes a new state α n from a proposal distribution q( α n | α n ).The new state is accepted with a probability a given by where r is the Hastings ratio The next state of the chain, α n+1 , will be set to α n if the update is accepted, or to a copy of α n if it is rejected.The proposal distribution is typically a normal distribution but many other options exist.The Metropolis-Hastings algorithm is a single-particle (or single-walker) algorithm, i.e., it is only aware of a single location in the parameter space at any given time.The primary drawback of the random walk Metropolis-Hastings algorithm, and most of its derivatives, is that the elements of the resulting MCMC chain are strongly correlated with each other in most practical applications.Such correlations decrease the amount of information contained in the MCMC chain, requiring longer chains than would otherwise be necessary.The problem is exacerbated when the dimensionality of the parameter space for the sampled distribution pr( α) increases.
In 1987, lattice field theorists devised an ingenious MCMC algorithm based on Hamiltonian dynamics to combat the problem of correlated samples.They dubbed the algorithm Hybrid Monte Carlo (HMC) [6], but it has subsequently become known as Hamiltonian Monte Carlo [22], retaining the acronym.HMC is based on the Metropolis-Hastings algorithm, but the method for proposing a new state is radically modified.The pdf to be sampled is treated as a potential energy surface and the current state of the chain is regarded as a particle with position coordinates given by the parameters of the pdf.By endowing the HMC "particle" at iteration n with a randomly drawn momentum p n we can associate the joint state ( α n , p n ) with a total energy and, by extension, a probability to find the system in that state.Simulating the particle's trajectory for a finite period of time subsequently yields a proposed state ( α n , p n ).The proposed state retains the previous total energy since this is a conserved quantity in Hamiltonian dynamics and the new state will therefore be accepted.The momentum p n is then discarded.In practice, the total energy will only be approximately conserved due to numerical simulation errors, and we vet the proposed state in a manner similar to Eq. (22).
The result is a completely uncorrelated sample of the target pdf, assuming that the length of the trajectory is appropriately chosen.Clearly, the length of the MCMC chain produced by HMC can then be drastically reduced compared to chains with a high degree of correlation.We will quantify this statement by extracting an effective number of samples in Sec.IV D. The strongest advantage of using the HMC algorithm is its potential for producing uncorrelated samples even when the target pdf is high-dimensional.Unfortunately, simulating Hamiltonian dynamics for each proposed state is computationally expensive compared to the method of proposing new states by applying small random perturbations.It is therefore apparent that the reduction in overall chain length and inter-sample correlations must be sufficiently large to warrant the increased cost per sample.
In the following subsections we present the HMC algorithm in detail.We have opted to write a custom implementation2 using Python [23] and NumPy [24] in lieu of using a standard package such as Stan [25].Specific implementation choices will be presented as appropriate.Our implementation and the mathematical details presented here are largely based on Ref. [22].A conceptual introduction to HMC can be found in Ref. [26].

A. From potential energy to posterior probability
In HMC we generate samples from the d-dimensional posterior pdf pr( α|D, I) by simulating classical Hamiltonian dynamics for a particle moving under the influence of a potential proportional to the posterior itself.To see this, we let the phase space of states-specified by position and momentum coordinates ( α, p ) with total energy H-be described by a Boltzmann distribution For a classical Hamiltonian function with kinetic energy K and potential energy U , the Boltzmann distribution factorizes and the marginal Boltzmann distribution for the position vector α is independent of the distribution for p.For our purposes, the temperature T is some function that makes the exponent dimensionless, and we set T = 1.The factor Z is independent of ( α, p ) and ensures proper normalization of the distribution in Eq. ( 24).From the exponential Boltzmann factor we identify the potential energy term as the negative log posterior (NLP).Disregarding the constant marginal likelihood we define In the last step we also used Bayes' theorem (1) to link the potential directly to the likelihood and the prior.Following standard practice, and the classical dynamics analogy, we employ a quadratic form for K( p): Here M is a positive-definite, symmetric matrix, called the mass matrix.With The HMC algorithm thus samples the joint pdf pr( α, p ). Marginalizing over the auxiliary momentum p-by simply discarding these coordinates in the Markov chainleaves us with MCMC samples of pr( α|D, I).
A particle governed by Hamiltonian dynamics moves through 2d-dimensional phase space on a hypersurface of constant energy.The time evolution of the system is described by Hamilton's equations where i = 1 . . .d.The form of the Hamiltonian, Eq. ( 25), allows us to rewrite Hamilton's equations as Note that ∂U/∂α i is a partial derivative of the NLP that we must evaluate in order to simulate Hamiltonian dynamics and, by extension, use HMC sampling.Realistically, this must be done through automatic differentiation (AD) except in special cases where analytic expressions for these partial derivatives are available (our implementation readily allows for both possibilities).AD generally incurs a factor of two overhead compared to just evaluating the target pdf [27].In this work, we exploit an external AD library [28] for computing the necessary gradients and we measure the computational overhead of AD to less than 50%, see Sec.IV C 2.

B. Advancing the HMC sampler
Starting from some current state ( α n , p n ), the total energy of a particle trajectory traversing the HMC phase space is conserved.We should thus always accept the proposed state ( α n , p n ) at the end of the Hamiltonian trajectory, discard the auxiliary momentum p n , and store α n as the new parameter sample α n+1 in the HMC chain.In practice, however, numerical errors in the solution of Eqns.( 31)-( 32) break energy conservation, which in turn breaks detailed balance.If ignored, we can no longer guarantee that the Markov chain converges to the sought stationary distribution.The solution is to vet the proposed state, i.e., the state at the end of the Hamiltonian trajectory, with the accept/reject step of the Metropolis-Hastings algorithm.We can also ensure a symmetric proposal distribution by negating the momentum variable p n at the end of the particle trajectory.However, such a negation is not necessary for a quadratic momentum distribution as in Eq. (28).The new state is thus accepted with the probability a given by Eq. ( 22) and a Hastings ratio for the joint probabilities given by The HMC chain will typically be ergodic since we draw a new momentum before integrating Hamilton's equations and thereby drastically altering the total energy.

C. Leapfrogging Hamiltonian dynamics
Upon imposing the Metropolis-Hasting accept/reject criterion we implicitly require reversibility of our chain.Fortunately, Hamiltonian dynamics is time-reversible and it is necessary to integrate Hamilton's equations in a way that preserves this property.Standard methods like Euler integration or explicit Runge-Kutta methods are disqualified as they do not preserve time-reversibility.From a strictly practical point of view, integration methods that do not conserve the Hamiltonian also limit the length of the particle trajectory since a large accumulated error would result in an unacceptably low mean acceptance rate ā.To ensure time-reversibility, one should use a symplectic integrator that goes hand-in-hand with the volume preservation of phase space that follows from Liouville's theorem.The local discretization error of a symplectic integrator is equally likely to be positive or negative in each step of the integration as long as the step size is below some threshold value.The result is that the total energy is approximately conserved for an arbitrarily long trajectory.A nice property of HMC is that ā drops precipitously if is greater than the threshold value.The absence of quiet failures makes it trivial to diagnose a too large choice for .The upper bound for is generally imposed by the most constrained parameter in the posterior.
The number of leapfrog iterations L can drastically influence the performance of the HMC sampler; it is imperative that L is set neither too high nor too low.A too small number partly defeats the purpose of using HMC in the first place, as it would result in a random-walk-like behavior with highly correlated samples.In contrast, a too large value of L would waste valuable CPU cycles without improving (and possibly even curtailing) performance.Naturally, choosing a very small step size needs to be compensated for by increasing L in order to avoid random walks.
Neither nor L are fixed in our implementation.Rather, random values are drawn from predefined probability distributions prior to each invocation of the leapfrog solver according to where the user specifies the nominal values and L .The reasons for randomizing these leapfrog parameters are threefold.First, variations in the trajectory length L may decrease correlations between samples.Second, a fixed trajectory length can result in oscillatory behavior if L happens to approximately match some periodicity of the target distribution.This type of (nearly) non-ergodic behavior can severly limit the efficiency of HMC.Third, the target pdf may have regions where its gradient is very steep so that the nominal is too large to resolve features in that section.
There are extensions of HMC whose purpose is to relieve the user from the burden of tuning the hyperparameters and L. For instance, the choice of may be automated based on acceptance rates of small trial runs.The state-of-the-art No-U-Turn Sampler (NUTS) termi-nates trajectories based on heuristic rules for when continued simulation no longer increases the performance of the sampler.Both of these improvements are described in Ref. [29].

D. Tuning in to the target distribution
An array of tunable HMC parameters have been introduced in the previous subsections: the step size , the number of leapfrog iterations L, and the mass matrix M. A drawback of HMC is the need to carefully tune these hyperparameters to each target distribution, or risk poor performance.However, as mentioned in Sec.III C, the tuning of and L may be automated.For now, we use a manual tuning procedure that will be outlined below.It is designed to achieve efficient sampling and also involves the important mass matrix.
The keen reader may have noticed that , L, and M are interlinked and changing one may force us to change one (or both) of the others.The tuning procedure is therefore, to an extent, iterative.We start with .

Leapfrog step size and number of iterations
For tuning we exploit that the acceptance rate ā is largely independent of L (a consequence of using a symplectic integrator) and produce a very short HMC chain using a small number of leapfrog steps (L = 3).If ā is low (in practice 0%) we decrease by an order of magnitude and try again.If ā is 100% we instead increase by an order of magnitude and try again.Once the appropriate magnitude is found we make fine-grained adjustments as necessary.Using this method we quickly achieved HMC acceptance rates of 99% at all three chiral orders.This value is likely too high for optimal efficiency, but we find it adequate for our purposes.
The number of required leapfrog iterations L is intimately linked to the choice of the mass matrix M. Careful tuning of L is rather pointless until M is settled.We have found that L ≈ 10-20 yields excellent performance for approximately Gaussian distributions with around a dozen or so parameters, assuming that the mass matrix (and ) is well chosen.To rapidly assess the choice of L-and the overall performance-it is useful to inspect trace plots of each individual parameter.The trace plots will reveal no apparent structures if the HMC algorithm is performing well.Any remaining structures in the trace plots may be quenched by increasing L, at the obvious expense of increased computational effort, or by improving the mass matrix.The latter alternative should always take precedence if possible.Note that it is usually necessary to revisit the tuning of after the mass matrix has been updated.

Mass matrix
Both and L are scalar values with no distinction for each individual parameter α i and thus cannot be used to compensate for differences in parameter scales.Like Metropolis-Hastings, and unlike, e.g., affine-invariant ensemble samplers (AIES) [30] such as the emcee package [31], HMC is sensitive to such differences of scale and we need a way to account for them.This is the purpose of the mass matrix M.
Deploying a mass matrix that captures the most important features of the target distribution is absolutely critical to the performance of HMC.An improper choice of M can degrade the performance by several orders of magnitude.Letting M = 1, i.e., an identity matrix, does not fare well with the χEFT models analyzed in this work.To improve, we exploit published LEC uncertainties from a previous analysis [32] and construct a diagonal mass matrix.We then draw ∼ 1000 samples, using L = 8 at LO and L = 20 at NLO and NNLO, to estimate a parameter covariance matrix Σ α and construct a mass matrix according to M = Σ −1 α .We find that this approach to learn about M yields high HMC performance in practice.Note that one does not have to use HMC for this tuning, as we do; indeed, it may be preferable to use a more expedient method for extracting an approximate parameter covariance matrix, e.g. the more tuning-agnostic MCMC sampler emcee.
Local estimates of the target covariance based on, e.g., optimization and second derivatives have been used in previous studies of LEC uncertainties, see, e.g., Ref. [14].This method was recently used to construct a Bayesian prior for the χEFT contact LECs at NNLO when estimating the c D and c E LECs in the three-nucleon force sector [33].It was found that the prior and marginal posterior were largely the same.This finding reinforces the observation that the inverse of a point-estimated covariance matrix yields a performant mass matrix.

IV. SAMPLING LEC POSTERIORS USING HAMILTONIAN MONTE CARLO
In Sec.IV A we outline the sampling strategy and in Sec.IV B we present our posterior pdfs pr( α|D, I) for the LECs α at LO, NLO, and NNLO in χEFT sampled using HMC.These posteriors enable all subsequent inference in this paper and constitute the main result of our work.In Sec.IV C we discuss the convergence of the MCMC chains.In Sec.IV D we highlight some of the unique aspects of the HMC algorithm by comparing with posterior samples obtained using emcee.In Sec.IV E we comment on multimodality and the challenge it brings.

A. Sampling strategy
We employ the same HMC sampling strategy at all chiral orders considered in this work.
1. To identify a ballpark region where we expect to find the posterior mode we first optimize the data likelihood in Eq. (18).At this stage we employ c = 1 and y ref = y exp to parameterize the covariance matrix for the EFT truncation error.Every subsequent HMC sampling is then randomly initiated within an overdispersed region around the MLE.
2. To tune the mass matrix M, we use previously published uncertainties for the LECs [32] to define its diagonal entries and draw n tune ∼ 1000 samples using L ≈ 10-20.The resulting sample covariance matrix is inverted to yield the final mass matrix.
3. We determine the optimal step size from a small set of very short HMC chains consisting of 10-20 samples and easily find a step size that yields an acceptance rate ā of 99%.
4. Equipped with a well-tuned HMC algorithm, we collect M ≥ 3 independent chains at different starting values for the LECs.It is important to have two or more chains to enable canonical convergence tests based on within-and between-chain variances, e.g., the Gelman-Rubin test.We also employ a convergence criterion based on the integrated autocorrelation time as recommended in Ref. [31,34], see Sec.IV C 2.
5. The trace plots of the HMC chains indicate that the length of the burn-in phase is very short.This is corroborated by the autocorrelation analysis presented in Sec.IV C 2. We discard the initial ∼ 10 samples from each chain, except for three chains at NLO which require us to discard ∼ 100 samples.The situation is drastically different for the emcee chains, where we find it necessary to discard the initial ∼ 1000 samples at LO and ∼ 30,000-40,000 samples at the higher orders.These lengthy burn-ins have to be repeated for each emcee chain.
In contrast, tuning the HMC hyperparameters is a one-time cost.
Detailed information about the tuning and sampling phases at LO-NNLO are summarized in Table III.Note that the number of tuning samples n tune is larger than necessary at LO and NLO.The step sizes and acceptance rates ā, which are closely linked, are remarkably similar across the three chiral orders.The reason is that the step size is generally limited by the most constrained parameter(s) which in all three cases are the LO LECs C.

B. LEC posteriors
The posterior pdfs for the LECs are multivariate but will be presented using univariate and bivariate projections.These so-called corner plots of the LO, NLO, and NNLO posteriors are shown in Figs. 1, 2, and 3, respectively.
At all orders, the locations of the maximum a posteriori (MAP) probability and widths of the posterior pdfs are similar to the corresponding measures we obtained using frequentist parameter estimation in Ref. [14].This is largely due to the fact that we are using nearly the same database of thousands of N N scattering cross sections in both analyses and that the inference is likelihood dominated.The main modification to the database comes from setting aside part of the data for validation in this work.We also employ identically regulated χEFT interactions, and closely related diagonal covariance matrices for estimating uncorrelated EFT truncation errors.Despite several apparent similarities it is very important to realize that we are comparing results from two fundamentally different approaches.The use of Bayesian inference methods allows us to assign a probability (density) measure to LEC values themselves.In the frequentist approach we are estimating covariances from the gradients at the maximum likelihood estimator of the data.

LO
At LO we consider the two N N contact LECs present at this order: C 1S0 and C 3S1 , acting in the S-waves.The corner plot in Fig. 1 reveals that they are both very well constrained by the N N scattering data D and appear to be uncorrelated with each other.We note that C 1S0 is considerably more constrained than C 3S1 .This is likely due to: (i) the isovector (isoscalar) character of C 1S0 ( C 3S1 ), and( ii) that pp data is more abundant and more precise than np data at low scattering energies where the truncation error is relatively small.

NLO
Several contact LECs are introduced at NLO and we find that the LEC posterior exhibits noticeable corre-lations in certain directions, see Fig. 2. The presence of such correlations indicate a level of parameter redundancy in the model.From a statistical perspective there exists methods, e.g., singular value decomposition, to identify and retain only the most important parameters (or linear combinations of parameters) of a model to explain data, so-called stiff directions in the parameter space.However, before doing so it is worthwhile to inspect the model structure from a physics perspective.In the present case we identify a strong correlation between the LECs C pp 1S0 and C np 1S0 .Following conventional counting of the isospin-breaking effects in χEFT we encounter the leading isospin-dependent 1 S 0 contacts at NLO.This is also in line with the results from a high-precision data analysis by the Nijmegen group [18] demonstrating that  strong and electromagnetic interactions break charge independence and most prominently in the 1 S 0 channel.However, since isospin-breaking is a comparatively small effect, a non-negligible EFT truncation error at NLO is likely to dilute isospin sensitivity with respect to the N N data being used.
We also detect correlations between S-wave LECs act-ing within the same spin channel.In general, the spinsinglet and spin-triplet partial-wave contact LECs do not exhibit any significant correlation with each other at any of the chiral orders we examine.This is somewhat different from the frequentist analysis in Ref. [14] where correlations (|ρ| 0.7) where found between all S-wave LECs and C 3P 2 .We speculate that this difference in correla-tion structure could be rooted in the difference between the models of the truncation error used in the frequentist and Bayesian analyses.Upon inspection, we find that the truncation errors at NLO and NNLO employed in this work are more than twice as large compared to the corresponding error magnitudes used in Ref. [14].We have not performed a systematic analysis to compare the two error models, but a smaller truncation error can certainly shift weights of, e.g., spin-polarization and spin-averaged data at different energies which in turn could induce stronger correlations between the LECs of the interaction model.

NNLO
At NNLO we also find strong correlations between certain LECs, see Fig. 3. Similarly to the NLO posterior, there is a very strong correlation between the chargeindependence breaking contacts.However, we do not identify any unexpected correlation structures, like the ones found at N3LO in Ref. [11], to reveal a physics parameter redundancy in the model.
At this order we encounter the sub-leading πN LECs c 1 , c 3 , c 4 for the first time.As opposed to contact LECs they act in all angular momentum channels.Certain combination of πN and N N LECs also show significant correlations; in particular, c 1 appears positively correlated with the C 1S0 LECs, as does c 3 with, e.g., C 3P 2 .
Recall that we assign a multivariate normal prior pdf for c 1 , c 3 , c 4 using maximum-likelihood results from a Roy-Steiner analysis of the ∆-less πN scattering amplitudes [13].This prior strongly regulates the values of c 1 , c 3 , c 4 compared to the N N LECs which are assigned a less informative prior based on naturalness.
In Fig. 4 we show a corner plot where we focus on the differences between the prior and posterior in the πNsector.Note that we are comparing two pdfs that have very different origin from both a statistical and methodological perspective.We can still draw the following conclusions: • The N N data induces an overall 5 − 10% shift, in the positive direction, of the πN LECs, i.e., the N N data appears to reduce the πN sub-leading attraction slightly.However the overall effect of this shift on the binding of atomic nuclei remains to be analyzed.
• The difference in MAP values of the πN prior and posterior is significant compared to the extent of the credible intervals.For c 1 we observe a slight overlap of the marginal pdfs.It is interesting to note that the πN vertex corresponding to c 1 does not contain any contribution from the ∆-isobar.
• The link between low-energy πN and N N scattering processes is a hallmark of χEFT.If all uncertainties are accurately modeled, and we are operating with an EFT, then we expect to find an overlap between the prior and the posterior.However, we do not observe this.One possible explanation is that we estimate the truncation error in the N N sector via the uncorrelated theory covariance matrix in Eq. ( 17).The Roy-Steiner analysis is based on an MLE with uncorrelated data and method uncertainties [12,13] and it is not clear how to propagate an EFT error to the πN LECs matched at this order.

C. Convergence towards a stationary distribution
How many samples do we need to reach an accurate representation of the stationary target distribution with small sampling error?In connection with this, one should note that the N samples collected during some finite time period will not be independent.
It is unfortunately not possible to determine the level of convergence of a finite chain; we can only attempt to detect convergence failures.As such, all convergence diagnostics in the MCMC literature merely provide necessary but not sufficient conditions.Multiple diagnostics have been devised, of which we employ two of the most common ones: the standard Gelman-Rubin statistic ( R) [35,36], discussed in Sec.IV C 1, and the integrated autocorrelation time (τ ) discussed in Sec.IV C 2. Both diagnostics are applied to each LEC α i individually.
Although MCMC algorithms are ergodic and eventually explores the entire state space, we clearly face the challenge of pseudo-convergence due to multimodality when working with finite chains.This problem is discussed further in Sec.IV E.

The Gelman-Rubin statistic R
With the Gelman-Rubin statistic R we compare the variance of the samples within a single chain to the variance between M ≥ 3 chains initialized at different starting positions.Following Ref. [35], we assume that each chain contains N samples after we have discarded initial samples to reduce the memory of the starting position.We discussed the removal of the burn-in samples in Sec.IV A. Based on the N × M samples one defines a joint mean for each LEC α i based on all chains where ᾱi (m) denotes the within-chain mean for the mth chain and is given by In this notation, α the between-chain and within-chain variances for the ith LEC in terms of the corresponding means as and respectively.A weighted average of the above variances can be used to estimate the variance of the marginal posterior for the

LEC α i
Var where the +-sign indicates that this quantity overestimates the posterior variance provided that the M chains are initialized at locations with greater variability compared to the true posterior.Indeed, for finite N we have that W will underestimate the marginal variance since the chains have not explored the posterior while B will overestimate the marginal variance if the initial sampling distribution is overdispersed.In the limit of N → ∞ we will have that W approaches the variance of the marginal posterior.Incorporating these finite-N corrections leads to a Student's t distribution for α i with variance (scale) estimated by The Gelman-Rubin measure expresses the potential scale reduction by forming the ratio which approaches 1 as N → ∞.A widely used threshold for declaring convergence is R < 1.01 [37].This, somewhat arbitrary, threshold simply states that the sample variance is 2% larger than the within-chain variance, and that one should expect a corresponding potential scalereduction if continuing the sampling process.In Fig. 5 we show the evolution of R with the number of HMC samples at LO, NLO, and NNLO.Clearly, our HMC chains fulfill R < 1.01 as well as an even stricter threshold R < 1.001.The chains obtained with emcee also pass the same R thresholds after a similar amount of MCMC samples.
There exists updated Gelman-Rubin measures.One can employ so-called split-R [37] and rank-normalized R [38] to better handle non-stationary chains and chains distributed with a heavier tail that conspire to yield a good R.We have not detected the need for using such updated R measures to analyze the χEFT posteriors in this work.

The integrated autocorrelation time τ
We use the MCMC chains to compute (statistical) expectation values and we should therefore also analyze the sampling variance of such estimates.For example, one can straightforwardly estimate the mean value ᾱi of the ith LEC within a single MCMC chain3 , see Eq. ( 37).The sampling variance of the estimated mean value for a particular LEC, based on N uncorrelated samples, scales with 1/N according to where Var[α i ] is the variance of the samples with respect to the posterior pr(α i |D, I).This estimate only holds for uncorrelated samples, and we will demonstrate that such samples can be obtained with the HMC algorithm.
In contrast, most random-walk based MCMC algorithms generate highly correlated samples.When the samples are correlated, the variance of the mean is modified according to where τ i is referred to as the integrated autocorrelation time for the chain of sample values of the ith LEC.It is given by The autocorrelation function ρ i (h) measures the correlation between (stationary) samples separated by h MCMC steps.In the literature, h is referred to as the lag.Note that the integrated autocorrelation time will be different for each expectation value.Here, we limit ourselves to inspect τ i for the mean values of the LECs and use this below to assess the convergence of an MCMC chain.In Fig. 6 we present the estimated autocorrelation functions ρi (h) of all LECs at LO, NLO, and NNLO, and their averages, as obtained by HMC and emcee.As expected, a well-tuned HMC algorithm generates virtually uncorre-lated samples whereas the emcee chains exhibit a correlation structure that is typical for most MCMC algorithms.The HMC algorithm generates uncorrelated samples even as the dimensionality of the parameter space is increased.We see this advantageous dimensionality-scaling of HMC when going from LO to NNLO.The corresponding correlation length of the emcee chains markedly increase.IV.This is one of the primary advantages of using HMC.Some care is needed in the numerical computation of the integrated autocorrelation time τ i .For large values of N in Eq. ( 45), the estimated autocorrelation ρ suffers from a signal-to-noise problem.Indeed, although the correlation decreases towards zero with increasing lag, the variance of the correlation does not.Following Ref. [34] we therefore truncate the sum at the smallest integer N such that N ≥ cτ i (N ) for c = 5.With this in place we can monitor the evolution and convergence of τ as a function of the number of collected samples N in the chain.
As mentioned, the computation of τ is not only useful for quantifying the sampling variance but also provides a handle on the convergence of the MCMC chain.While the Gelman-Rubin statistic compares several identically prepared chains, diagnosing convergence based on the evolution of τ can be applied to a single chain.Using this method, convergence is declared when the estimation of τ has stabilized and N τ , where N is the length of the chain.In this work we apply the condition N ≥ 50τ , and in Fig. 7 we show the evolution of τ as N increases.We have also indicated the non-convergence zone N < 50τ .At LO and NLO, we fulfill τ -convergence using both HMC and emcee.At NNLO, we fulfill τconvergence only with HMC while emcee falls just short of the imposed tolerance.

D. Effective sample size and efficiency of HMC
The reduction in sample quality due to correlation is often quantified with the effective sample size where we also introduce τ defined as the average of all τ i for each LEC α i .We use the ESS value to quantitatively compare the efficiencies of HMC and emcee.It is obvious from Eqns. ( 44) and ( 46) that the correlation structure directly impacts the ESS and hence the total computational effort required to reach a tolerable variance.Since there is a significant computational overhead involved in integrating Hamilton's equations, it is key to tune the HMC algorithm to reach a very small τ .The intrinsic benefit of using HMC also increases with the dimensionality of the parameter space.
In Table IV we present τ values and relevant related quantities for emcee and HMC at LO, NLO, and NNLO.The results presented in this table are based on a single chain for each order and choice of algorithm.We have, however, verified that the results are similar regardless which one of the parallel chains that is used.We note that we have τ ≈ 1 for the HMC sampled chains, and even τ < 1 in two cases (LO and NLO).With emcee, τ is roughly two orders of magnitude greater than the HMC equivalents.This is solely due to the correlation structure of the respective MCMC chains.Also shown in Table IV (as ESS/N ) is the average number of effective samples that one MCMC sample provides.By introducing N Lthe total number of likelihood calls during sampling, tuning, and burn-in-we also show (as N L /N ) the average number of calls to the likelihood function required to generate one MCMC sample.We obtain N L /N > 1 also for emcee since the quoted results also include burn-in.We can use the number of likelihood calls per effective sample to compare the average efficiency of HMC and emcee, as evaluating the likelihood constitutes nearly all of the computational effort.We therefore define an HMC speedup factor S with respect to emcee as where we also account for the computational overhead induced by the use of AD which we employ to generate the gradients necessary for HMC.In our implementation we measure the AD overheads to 10% at LO, 24% at NLO, and 43% at NNLO, and we use these figures when quantifying the HMC speedup factors in Table IV.In summary we find that the real-world speedup factor is more than 6 at LO and NLO, and 3.6 at NNLO.The smaller speedup at NNLO is primarily due to less ideal tuning, and could be improved further.Another contributing factor is that the estimate of τ for emcee at NNLO has not stabilized, and the value reported in Table IV is therefore a lower bound (cf.Fig. 7).We obtain τ < 1 at LO and NLO when using HMC, a result which comes from drawing anticorrelated samples from the posterior pdf, as can be seen clearly in the corresponding autocorrelation functions in Fig. 6.This so-called antithetic sampling [39] leads to ESS > N , i.e., an effective number of samples greater than the number of MCMC samples with a corresponding reduction of the variance in Eq. (44).Therefore, drawing completely independent MCMC samples is not necessarily the optimal strategy.In this work we encounter antithetic sampling for sufficiently large values of the HMC step length L once we have constructed a mass matrix M that we believe suits the target distribution.This is obviously an advantageous sampling strategy and antithetic sampling is one of several known variance reduction techniques in Monte Carlo sampling [22].While we do not achieve antithetic sampling at NNLO in this work, we believe that it can be reached with further tuning.

Consequences of improper tuning of the HMC hyperparameters
HMC sampling is challenging in practice, primarily due to the need for careful tuning of the mass matrix M to achieve high performance.It is therefore instructive to show what a failure looks like.Fig. 8 demonstrates the strong autocorrelation that results when M is not well-tuned for two chains at NLO; one where M is set up using a χEFT naturalness argument, and the other where it is based on previously published LEC uncertainties [32].To construct the mass matrix using the naturalness argument we employed ratios of the contact LECs at LO-NNLO according to where F π 92 MeV is the pion decay constant.In both cases, the integrated autocorrelation time is very large which, combined with the high per-sample computational cost of HMC, results in very poor performance.Autocorrelation structures like those shown in Fig. 8 are unacceptable of course, but typically seen before tuning the hyperparameters.

E. Multimodality
Although the Markov chain is ergodic, one would have to run the MCMC algorithm for an infinitely long time to visit all states.Clearly, multimodal distributions makes the sampling process considerably more complicated.The Markov chain has to cross a valley of low probability to explore more than one mode.Such a crossing is naturally a low probability event and will generally occur infrequently.Different modes may also have different shapes, causing poor performance if the tuning of the MCMC algorithm is unsuitable for the other mode even if the algorithm successfully moves between modes.Standard HMC has no advantage over alternative sampling algorithms in this regard due to its single-walker nature combined with the necessity of careful problem-specific tuning.Although the situation can be improved with, e.g., tempering methods [22] it is probably better to explore other MCMC algorithms, such as MultiNest [40], if multimodality is expected.
In our analysis we encountered one clear case of multimodality: the LEC posterior at LO.This pdf is straightforward to explore in detail because it is only twodimensional.We performed a scan on a 500 × 500 grid of the LO pdf.The result is shown in Fig. 9.The main mode is located around ( C 1S0 , C 3S1 ) ≈ (−0.11, −0.07) • 10 4 GeV −2 , and marked with a cross in the lower panel of Fig. 9.A second mode is found at ( C 1S0 , C 3S1 ) ≈ (−0.11, −0.03) • 10 4 GeV −2 .We computed the marginal likelihood pr(D|I) of both modes using the Laplace approximation, and found that the second mode contains a negligible probability mass.A deep valley-many orders of magnitude lower in probability-separates the two modes and presents an effectively impenetrable barrier to the HMC sampler.The valley is caused by the breakup of the deuteron bound state, at C 3S1 ≈ −0.05 • 10 4 GeV −2 , that suppresses the likelihood for np SGT data at very low energies.As far as we can tell the vast majority of the probability mass for the LEC posteriors at NLO and NNLO is located in a single dominant mode.In the current work we therefore proceed under the assumption that all pdfs are unimodal.

V. MODEL CHECKING
A probabilistic model of a physical system can only be upheld if it provides an acceptable representation of data.Therefore, we should always check to what extent the model fits the data.To that end we sample the ppds and inspect the empirical coverages.

A. The posterior predictive distribution
As presented in Sec.II B, we have reserved roughly one third of the Granada database of experimentally measured scattering cross sections for model validation and refer to this as data set D, see Table I.
Following Eq. ( 10) we can model the true value of an N N scattering observable as the independent sum of the predicted value up to chiral order k and the truncation error, i.e., Neither term on the right hand side is known with certainty.Indeed, they are stochastic variables described by pdfs.We consider the prediction y  th , can be accounted for as well [33].In this work, however, we focus on quantifying the LEC uncertainty and combining this with the EFT truncation error.
Equipped with an HMC chain of samples from the LEC posterior pr( α|D, I) at chiral order k, it is straightforward to evaluate the ppd pr(y  th for each of the N samples ( a (1) , a (2) , . . ., a (N ) ) in the HMC chain.Fortunately, the HMC chains are comparatively short and the necessary computation of N N scattering observables does not pose any significant challenge.Should this become an issue one could try emulating the observable response [41][42][43] or use hardware acceleration [44].Opting for emulation or acceleration will add an error term to Eq. ( 49) quantifying the corresponding additional uncertainty.
Next, we sample the pdf for the truncation error in Eq. ( 15).This is trivial for a normally distributed EFT error where the parameters c2 and y ref characterize this pdf entirely.
Since we can evaluate the two terms in Eq. ( 49) we can also draw samples from pr(y true |D, I).Panels (a) and (b) in Fig. 10 show such predictions for the true value of the total np cross section (SGT) for 0 < T lab ≤ 350 MeV, while panels (c) and (d) show predictions of the true value for the pp spin correlation parameter (AYY) at T lab = 294.4MeV.These observables were not included in the training data set D. Also shown in the figure are experimental measurements of the same observables, gathered from Refs.[16] for SGT and [45] for AYY.The panels in the left column show 100 individual predictions at each order, while the panels in the right column show 68% and 95% highest density intervals (HDIs) 4 computed from 1,000 predictions.We only find unimodal ppds and they appear to be rather symmetric.Indeed, as will be discussed below, the ppds are dominated by the normally distributed EFT truncation error.The individual predictions were generated in an uncorrelated fashion to reflect our model for the EFT truncation error.See the supplemental material [46] for ppds at LO, NLO, and NNLO of all observables present in the Granada database, i.e., a model check with respect to both the training data D and the validation data The predictions in Fig. 10 appear to converge toward the experimental results with increasing chiral order, and a visual inspection of the 68% and 95% HDIs of the ppds indicates that they work as advertised for these observables.We note that the employed model for the truncation error does not incorporate the known symmetry constraints of the spin-scattering matrix [47], e.g., that the vector analyzing power P goes to zero at extreme scattering angles.This type of information can be straightforwardly incorporated as a boundary condition on a Gaussian process model for the EFT truncation error [48].
We find that the truncation error δy (k) th is the dominating source of uncertainty in Eq. (49).Indeed, the propagated error due to the variability in the LECs is quite small in comparison since the LEC posteriors are conditioned on a very large and informative data set.In fact, we find that the truncation error dominates at all orders up to NNLO for all energies in the Granada database.Even at low energies (T lab ≈ 1 MeV) the truncation error is still 5-10 times larger than the uncertainty stemming from the LECs.Of course, at low energies both errors are very small on an absolute scale.In accordance with EFT principles we would expect the HDIs to narrow with each order and widen with increasing scattering energy.Instead, we find that the widths of the HDIs at all orders are comparable, and particularly so at the larger T lab -values.This is a consequence of underestimating c at LO and NLO, see Table II.We remind the reader that we estimate the EFT truncation at order k by exploiting information only up to this order since this reflects what one can do in a real situation without any higher orders available.Of course, at LO the available information is particularly scarce.In Sec.V B we will explore other choices for estimating c.

B. Empirical coverage probability
We compute the empirical coverage probability to assess whether our HDIs are accurate as advertised.The empirical coverage quantifies how well we meet the expectation that if we compute a p • 100 % HDI for the ppd of a true value for an observable, we should find that this HDI covers the measurement of said observable with probability p on average.To that end, we perform a binary test for each datum in our validation data set D and count the number of times the validation data falls within the specified HDI, i.e., we count the number of 'hits', and compare this with the total number of data in D. In this procedure we neglect the (generally) very small uncertainties in the experimental data.Repeating the binary coverage test for a range of values of p, i.e., for different HDI intervals, yields a summary for the empirical coverage probability that is also convenient to inspect graphically.
Figure 11 shows the resulting empirical coverage plots for the validation data set D using three different strategies for evaluating an RMS value for c.In coverage plots of this kind, a p • 100 % HDI should yield a coverage such that it ends up on the 45°positive diagonal if it is working as expected.The diagonal line is indicated with a dash-dotted line in all our coverage plots.A coverage probability larger than this nominal value, i.e., a coverage probability above the diagonal, corresponds to an overly wide HDI, implying a too conservative error.The opposite situation corresponds to an overly narrow (liberal) HDI, implying a too small assigned error.Following a precautionary principle, a conservative error is preferable to the underestimated one.
The observed number of hits should also follow a binomial distribution under the assumption that the observables are uncorrelated, see, e.g., Ref. [5].Our validation set, like the training set, contains hundreds of largely independent data groups that contain data recorded at different experimental facilities over several decades.The continuous version of the binomial distribution is the β distribution, so we use this to compute 95% confidence intervals, indicated as a gray filled region along the diagonal in all coverage plots.
The truncation error is intimately linked to the inferred value for c.In Fig. 11 value ck according to Table II.This method entails that we use information from all orders up to the one we are working at, i.e., ck = RMS( c 0...k ).Clearly, this yields liberal HDIs for the ppds on average, in particular at LO and NNLO.Note that the LO HDI is based on an EFT truncation error essentially governed by the choice of reference values.At NLO and NNLO, the HDIs represent credible intervals conditioned on information from one and two order-by-order differences, respectively, and the corresponding coverage probabilities perform slightly better.However, we clearly underestimate c.As a second approach to determine c, which we denote RMS( c 2...k ), we ignore the cν=0 contribution.In this case the empirical coverage probability improves drastically, see Fig. 11(b).Motivated by the dominance of the EFT truncation error compared to the uncertainty originat-ing from LEC variability, we do not resample the LEC posteriors when varying the hyperparameters of the truncation error.The LO predictions are absent in Fig. 11(b) to emphasize that this method for computing ck is not relevant at that order.
In Fig. 11(c) we compute c as the RMS value of the expansion coefficients extracted with respect to the first omitted order only, which we denote RMS( c k+1 ).At NNLO we exploit an MLE at N3LO.As expected, this method improves the performance of our error model, in particular at LO and NLO.There is not much difference between the coverage probabilities for the NNLO HDI in panels (b) and (c).
In Fig. 12 we inspect the coverage probability with respect to three different subsets of the validation data set D. In panel (a), we only look at integrated np cross section data (np SGT).In (b), we look at all data with T lab ≤ 100 MeV, and in (c) we look at all data with T lab > 290 MeV.Here we use ck = RMS( c 2...k ).The coverage with respect to SGT data is consistently too high.The data points in this set are very correlated with each other, which makes this comparison less meaningful.
In panel (b), where we retain validation data with T lab ≤ 100 MeV, the error model at NLO performs rather well, whereas the NNLO error is too liberal.The coverage with respect to data with T lab > 290 MeV, panel (c), exhibits a similar pattern.Although the HDIs improve with information from higher orders, as expected, this is not an entirely satisfactory situation when one is making predictions at low orders of χEFT.
Finally, we estimate the consequences of modifying the employed value for the χEFT breakdown scale Λ b .Specifically, we change Λ b from 600 MeV to 500 MeV in the expression for the EFT truncation error (15) without resampling the LEC posteriors.We find that decreasing Λ b results in more natural c values while simultaneously yielding improved coverage probabilities for the EFT error.The model also becomes somewhat less sensitive to the method of computing c.Conversely, raising Λ b increases the model's sensitivity to c as well as resulting in less naturally sized expansion coefficients.The results presented here are in line with the findings in Ref. [48].Further studies of the χEFT breakdown scale, along the lines of Ref. [33], are clearly warranted.

VI. CONCLUSIONS AND OUTLOOK
In this work we implemented the HMC MCMC algorithm for sampling the LEC posterior pdf pr( α|D, I) at LO, NLO, and NNLO in the N N sector of χEFT.We accounted for uncorrelated EFT truncation errors [5] in the sampling.Our prior was based on both reasonable assumptions and information from previous studies.For example, we assume natural EFT expansion coefficients and N N contact LECs while the πN sector was further informed by the results from a Roy-Steiner analysis [13] of πN scattering data.
We conditioned the LEC posteriors on thousands of scattering data, leading to a likelihood-dominated posterior and consequently a probability mass of the LEC posterior that is concentrated to a very small region in parameter space.At all orders, the MAP and typical widths of the posteriors are very close to the corresponding measures found using frequentist parameter estimation in, e.g., Refs.[14,49].This is largely due to the fact that we employed nearly the same database of N N scattering cross sections in both analyses.
An analysis of the coverage probability of the HDIs for predictions indicates that the credible intervals perform largely as advertised if one excludes the LO expansion parameter when estimating c and if one has access to predictions at NNLO or beyond.Indeed, the large shift in predictions when going from LO to NLO does not provide representative information about the EFT convergence pattern.This disturbance in the order-by-order description of the nuclear interaction was also identified in a recent Bayesian analysis of three-and four-nucleon states [33].It is hence desirable and timely to develop improved models for the EFT truncation error that explicitly account for such irregularities in the convergence pattern.At the same time it is equally important to address deficiencies in the LO description of the nucleonnucleon interaction, in particular for making reliable loworder EFT predictions [50].
Apart from a second mode in the LO posterior, for which the deuteron is unbound, we found no clear evidence of multimodality in the LEC posteriors at NLO and NNLO.We also found that the posterior for the πN LECs at NNLO does not overlap with any significance with the narrow prior inferred from the Roy-Steiner analysis in Ref. [13].Provided that we are operating with a lowenergy EFT for the nuclear interaction, it is reasonable to expect this discrepancy to vanish if all uncertainties are accurately modeled.
We found that the HMC algorithm provides virtually uncorrelated samples of the LEC posterior at all three chiral orders.We confirm a very low level of autocorrelation which is a hallmark of the HMC algorithm.When analyzing the correlation structure in detail we found evidence of an antithetic sampling pattern in the HMC chains at chiral orders LO and NLO.This yields an integrated autocorrelation time τ < 1 and an ESS greater than the number of gathered MCMC samples.If harnessed, antithetic sampling could serve as a valuable method for reducing the sample variance of LEC posteriors at higher chiral orders.
We also compared the HMC chains at the three considered orders to corresponding chains obtained using the emcee algorithm and found a three-to six-fold increase of the sampling efficiency.With HMC, we achieved nearinstant convergence, as measured by τ , at LO, NLO, and NNLO.Using emcee, in contrast, we just managed to cross the convergence threshold at NLO and failed to reach a converged NNLO result.Although HMC sampling relies on access to gradients of the posterior with respect to the LECs, and requires more tuning than other MCMC algorithms, we find that the rewards justify the extra effort.It is certainly possible to devise algorithms for self-tuning of the hyperparameters with NUTS-HMC [29] being one such strategy.Our results provide a promising outlook for parameter estimation at higher chiral orders, where the increasing number of LECs most likely presents a formidable challenge to other MCMC algorithms.

FIG. 1 .
FIG. 1. LO posterior sampled with HMC.The LECs are shown in units of 10 4 GeV −2 .The inner (outer) gray contour line encloses 39% (86%) of the probability mass.The dot-dashed vertical lines indicate a 68% credibility interval in the univariate marginals.White areas indicate zero counts of samples.

FIG. 2 .
FIG. 2. NLO posterior sampled with HMC.The LECs are shown in units of 10 4 GeV −2 for the LO LECs and 10 4 GeV −4 for the NLO LECs.The inner (outer) gray contour line encloses 39% (86%) of the probability mass.The dot-dashed vertical lines indicate a 68% credibility interval in the univariate marginals.

FIG. 3 .
FIG. 3. NNLO posterior sampled with HMC.The LECs are shown in units of 10 4 GeV −2 for the LO contact LECs, 10 4 GeV −4 for the NLO contact LECs, and GeV −1 for the πN LECs.The inner (outer) gray contour line encloses 39% (86%) of the probability mass.The dot-dashed vertical lines indicate a 68% credibility interval in the univariate marginals.

c 4 FIG. 4 .
FIG.4.Prior and posterior pdfs for the πN LECs c1, c3, c4, in units of GeV −1 , indicated with black-line ellipses and colored (purple jagged) regions, respectively.The inner (outer) black ellipses enclose 39% (86%) of the prior probability mass, and the jagged gray lines do the same for the posterior probability mass.The posteriors were obtained using HMC.See text and Fig.3for details.

FIG. 5 .FIG. 6 .
FIG.5.The Gelman-Rubin convergence diagnostics R for the HMC sampled chains at LO, NLO, and NNLO as a function of the number of samples N .R for each individual parameter is shown in gray, while the mean R is shown in green, blue, and purple, respectively.

FIG. 7 .
FIG.7.Integrated autocorrelation time τ vs number of samples.The gray area indicates the zone of non-convergence, i.e., N < 50τ .The emcee results are averaged over all walkers, hence the seemingly low number of samples compared to TableIV.
to the uncertainty of the LECs, and we consider δy certain due to the unknown EFT expansion coefficients c ν for ν > k.Additional uncertainties regarding, e.g., the breakdown scale Λ b , the expansion parameter Q, and the correlation structure of the model predictions y (k) |D, I) by observing that pr(y (k) th |D, I) = pr(y (k) th , α|D, I) d α = pr(y (k) th | α, D, I)pr( α|D, I) d α = pr(y (k) th | α, I)pr( α|D, I) d α.

( 50 )
The last step is a consequence of the conditional independence between y (k) th and D given α.Drawing random samples from this ppd amounts to evaluating y (k)

FIG. 10 .
FIG. 10. Posterior predictive distributions for the true value of two N N scattering observables at LO (green), NLO (blue), and NNLO (purple).(a) Uncorrelated samples of pr(ytrue|D, I) for the np total cross section.Empirical data with error bars from Ref. [16] is shown in orange.(b) 68% (dark shaded regions bounded by solid lines) and 95% (light shaded regions bounded by dashed lines) HDIs of the ppds in (a).(c) Uncorrelated samples of pr(ytrue|D, I) for pp AYY at T lab = 294.4MeV.Empirical data with error bars from Ref. [45].(d) 68 and 95% HDIs of the ppds in (c).

TABLE I .
[17]distribution of observables in the training and validation data sets denoted D and D, respectively.D is composed of all available data in the 80 ≤ T lab ≤ 100 and 290 < T lab ≤ 350 MeV ranges, plus one set of np SGT data[16]covering the 33 ≤ T lab ≤ 350 MeV range.All other available data constitutes D. We denote scattering observables using the SAID nomenclature[17], e.g.

TABLE II .
Results from the c analysis.cν denotes the RMS value of the EFT expansion coefficients at a particular order ν, while c0...ν denotes the RMS value up to, and including, order k.

TABLE III .
Detailed statistics of the HMC chains during the tuning and sampling phases.The number of chains at each order is M and the total number of samples is counted across all M chains.The HMC parameters and L denote the nominal step length and total number of steps taken with the leapfrog algorithm to integrate Hamiltons equations for each HMC step.ā is the average acceptance rate.

TABLE IV .
(47)arison between the performance of emcee and HMC applied to sample the LEC posteriors at LO, NLO, NNLO.τ is the integrated autocorrelation time.ESS is the effective sample size and N is the number of collected MCMC samples.Consequently, the ESS/N column shows how many effective samples one nominal sample is worth.The NL/N column, where NL is the total number of likelihood calls, shows the average number of likelihood evaluations (including tuning and burn-in) necessary to collect one nominal sample.S is the average real-world speedup of HMC compared to emcee as defined in Eq.(47).