General framework for cosmological dark matter bounds using $N$-body simulations

We present a general framework for obtaining robust bounds on the nature of dark matter using cosmological $N$-body simulations and Lyman-alpha forest data. We construct an emulator of hydrodynamical simulations, which is a flexible, accurate and computationally-cheap model for predicting the response of the Lyman-alpha forest flux power spectrum to different dark matter models, the state of the intergalactic medium (IGM) and the primordial power spectrum. The emulator combines a flexible parameterization for the small-scale suppression in the matter power spectrum arising in "non-cold" dark matter models, with an improved IGM model. We then demonstrate how to optimize the emulator for the case of ultra-light axion dark matter, presenting tests of convergence. We also carry out cross-validation tests of the accuracy of flux power spectrum prediction. This framework can be optimized for the analysis of many other dark matter candidates, e.g., warm or interacting dark matter. Our work demonstrates that a combination of an optimized emulator and cosmological "effective theories," where many models are described by a single set of equations, is a powerful approach for robust and computationally-efficient inference from the cosmic large-scale structure.


I. INTRODUCTION
Determining the particle physics of the dark matter would not only explain an integral part of the standard cosmological model but could also point to new physics that will deepen and unify our understanding of nature. Following recent null results at high-energy colliders [1] and by direct detection [e.g., 2], it is timely to consider new theoretically well-motivated avenues that do not rely exclusively on the "WIMP miracle" [weakly interacting massive particles; 3], as well as alternative experimental approaches beyond colliders and direct/indirect detection. Cosmological observations allow constraints on dark matter properties in physical regimes otherwise inaccessible in the laboratory [e.g., 4,5].
The Lyman-alpha forest (as a tracer of the small-scale, high-redshift linear matter power spectrum [26]), often in combination with larger-scale probes like the cosmic microwave background (CMB) and baryon acoustic oscillations, provides competitive bounds on these models [19,20,27–35], to which we will refer in general as "non-cold" dark matter, following Ref. [36]. The challenge in getting robust bounds using the Lyman-alpha forest and other probes of the cosmic large-scale structure (e.g., galaxies, weak lensing, 21 cm) is the computational cost of modeling the data with sufficient accuracy. This often requires $N$-body simulations [37,38], sometimes with hydrodynamics [39], which adds even further computational cost. This makes "brute-force" sampling of the parameter space (as required by, e.g., Markov chain Monte Carlo methods; MCMC) computationally infeasible. Further, the many different dark matter and cosmological models that can be tested each require, in principle, a new set of simulations for each analysis, which can make inefficient use of computational resources. Much effort has gone into developing cosmological "effective theories," where many different physical models are described by a single set of equations [e.g., 23,40]. These are designed to address the issue of testing many physical theories in an efficient way.
In this work, we present a framework for cosmological dark matter bounds, specifically using the Lyman-alpha forest, which makes the testing of new models robust, accurate and computationally efficient. This combines a general model for the effect of non-cold dark matter (nCDM) on the matter power spectrum presented in Ref. [36] with the Bayesian emulator optimization we presented in Refs. [41,42]. An "emulator" is a flexible model that can characterize the data accurately and is computationally cheap to evaluate [43]. This means it can be easily called within MCMC, making parameter inference feasible. It is optimized using a small set of "training" simulations, the selection of which is critical (see below). Emulators have found wide use in cosmological studies using the large-scale structure [41,42,44–52].
We build an emulator for the Lyman-alpha forest flux power spectrum as a function of the nCDM model and other nuisance parameters. This initial base emulator can then be optimized for the analysis of specific dark matter models. Here, we consider the example of ultra-light axion (ULA) dark matter, but the procedure will be equivalent for any of the dark matter candidates which can be characterized by the nCDM model (see above). We optimize the construction of the emulator training set using Bayesian optimization, which iteratively adds training data in order to determine the peak of the posterior distribution accurately while exploring the parameter space fully [53]. This allows tests of the robustness of dark matter bounds with respect to the accuracy of the emulator model. Previous dark matter and cosmological bounds from the Lyman-alpha forest [e.g., 19,20,27–35,54–58] have relied on linear interpolation of simulated flux power spectra around a fiducial point. The emulator model we use makes fewer assumptions and is more robust in its statistical modeling, which can remove bias in power spectrum estimation and strengthen parameter constraints [42]. Even for cosmological datasets where emulators are already used, the combination of convergence tests using Bayesian optimization and effective theories as described above can provide significant benefits.
In § II, we present the general emulator for nCDM models and then in § III, we present the example of optimizing the emulator for the study of ULA dark matter. We conduct tests of emulator convergence and cross-validation in § IV, discuss in § V and conclude in § VI.

II. NON-COLD DARK MATTER EMULATOR
In this section, we present a general emulator of hydrodynamical simulations for non-cold dark matter models, using the Lyman-alpha forest flux power spectrum as the constraining data. This initial emulator is not tied to a specific dark matter model, but rather forms the base for Bayesian optimization for a particular model (see § III for the example of ultra-light axions). This framework ensures computational efficiency, theoretical accuracy and maximum precision for dark matter bounds using the cosmic large-scale structure. In § II A, we explain the non-cold dark matter model we employ, first introduced by Ref. [36]. We present the details of our cosmological hydrodynamical simulations in § II B and the intergalactic medium and background cosmology model over which we must marginalize in § II C. In § II D, we discuss the method of emulation by Gaussian processes and how this connects to Bayesian inference.
A. Non-cold dark matter model

In order to build an emulator-inference framework which is as general and powerful as possible, we employ the non-cold dark matter (nCDM) model first introduced by Ref. [36]. The model is defined by a "transfer function"

T(k) ≡ [P_nCDM(k) / P_CDM(k)]^{1/2} = [1 + (αk)^β]^γ,   (1)

where P_CDM(k) and P_nCDM(k) are the linear matter power spectra respectively for CDM and nCDM models; k is the comoving wavenumber in h Mpc^−1; and [α, β, γ] are the free parameters of the model, where α is in units of h^−1 Mpc. In our analysis, we define T(k) at the initial conditions of our cosmological simulations (see § II B), i.e., at z = 99. Eq. (1) is a generalization of the fitting function for a thermal warm dark matter (WDM) transfer function [59], to which it reduces when β = 2ν, γ = −5/ν and ν = 1.12. By having two further degrees of freedom, the general model in Eq. (1) can characterize the matter power spectra for many models beyond the thermal WDM case, e.g., non-thermal WDM (like sterile neutrinos resonantly produced or from particle decays), mixed models of cold and warm dark matter, axion-like particles, light (sub-GeV) dark matter coupling to baryons, dark matter coupling to dark radiation or self-interacting dark matter [17–25]. Many of these models are explored in Ref. [36]; we will consider the example of ultra-light axions in § III.
We consider α ∈ [0, 0.1], β ∈ [1, 10], γ ∈ [−10, 0]. Fig. 1 shows some example nCDM transfer functions. In particular, we show the cold dark matter limit with α = 0; nCDM model fits for the transfer function of a 2 keV WDM particle, [α, β, γ] = [0.02, 2.2, −4.5], and a 10^−22 eV ULA particle, [α, β, γ] = [0.1, 5.2, −3.4] (more details in § III); and some extremal values of the nCDM parameters within our prior volume. In general, the nCDM model can characterize dark matter candidates which suppress the growth of structure on small scales and hence form a cut-off in the matter power spectrum. By spanning the three free parameters of the model, it is possible to characterize different scales and shapes of this cut-off, which can map to different particle models, as shown by the WDM and ULA examples in Fig. 1. α is the main parameter determining the scale of the suppression and as it increases, the cut-off moves to larger scales (smaller k). β and γ are the main parameters determining the shape of the suppression. As β increases, the cut-off becomes sharper, in particular flatter before T^2 = 0.5; as γ increases, the cut-off also becomes sharper, in particular flatter after T^2 = 0.5.
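As a concrete check of this behaviour, Eq. (1) can be evaluated directly; a minimal NumPy sketch using the parameter values quoted above (the wavenumber grid is illustrative):

```python
import numpy as np

def ncdm_transfer_sq(k, alpha, beta, gamma):
    """Squared nCDM transfer function, T^2(k) = [1 + (alpha*k)^beta]^(2*gamma).

    k in h/Mpc, alpha in Mpc/h; alpha = 0 recovers the CDM limit (T^2 = 1)."""
    return (1.0 + (alpha * k) ** beta) ** (2.0 * gamma)

k = np.logspace(-1.0, 2.5, 200)  # h/Mpc

t2_cdm = ncdm_transfer_sq(k, 0.0, 2.0, -1.0)    # CDM limit
t2_wdm = ncdm_transfer_sq(k, 0.02, 2.2, -4.5)   # quoted 2 keV WDM fit
t2_ula = ncdm_transfer_sq(k, 0.1, 5.2, -3.4)    # quoted 1e-22 eV ULA fit
```

Increasing α moves the half-suppression point T^2 = 0.5 to smaller k, while β and γ control the sharpness of the cut-off on either side of that point.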

B. Cosmological hydrodynamical simulations
Our observable is the Lyman-alpha forest flux power spectrum at z = [4.2, 4.6, 5.0]. In order to model this, we run cosmological hydrodynamical simulations of the intergalactic medium (IGM; from which the forest is sourced) using the publicly-available code MP-Gadget (https://github.com/MP-Gadget/MP-Gadget) [39,60,61]. We evolve 512^3 particles each of dark matter and gas in a (10 h^−1 Mpc)^3 box (see tests of convergence with respect to particle number and box size in Appendix A) from z = 99 to z = 4.2, saving snapshots of particle data in particular at z = [4.2, 4.6, 5.0]. We include the effect of the nCDM models we study by generating the initial conditions of each simulation using a suppressed matter power spectrum as modified by the transfer function in Eq. (1); we discuss the implications for the nCDM models we can characterize in § V. We focus computational resources on the low-density gas that forms the Lyman-alpha forest by enabling the quick-lya flag, which converts gas particles at overdensities > 1000 and with temperatures < 10^5 K into collisionless particles [62]. This yields significant computational efficiency gains with a negligible impact on the accuracy of Lyman-alpha forest statistics.
From each particle snapshot, we generate 32,000 mock spectra (with pixel widths Δv = 1 km s^−1) with only the Lyman-alpha absorption line and calculate the 1D flux power spectrum using fake_spectra [63]. The 1D Lyman-alpha forest flux power spectrum measures correlations in Lyman-alpha absorption lines along the line-of-sight only (i.e., integrated over transverse directions). It is related to the 3D flux power spectrum by

P_f(k_∥f) = ∫ [d^2 k_⊥f / (2π)^2] P_f^3D(k_f).

Here, k_∥f and k_⊥f are respectively the line-of-sight and transverse components of the full velocity wavevector k_f and have inverse-velocity units (e.g., s km^−1); P_f^3D(k_f) is the 3D flux power spectrum.
We estimate P_f(k_∥f) first in each mock spectrum using a fast Fourier transform (FFT) of the flux contrast

δ_F(v) = F(v)/F̄ − 1,

where F = e^−τ is the fraction of emitted quasar flux transmitted, F̄ is the mean flux across all spectra in a given box, τ is the optical depth and the effective optical depth τ_eff = − ln F̄; v is the line-of-sight velocity. In Ref. [64], the flux is normalized to a "rolling mean flux" (averaged at each pixel using a finite window). This attempts to correct for redshift evolution within individual spectra and mis-modeling of the quasar emission continuum; we test the impact on our results in Appendix C. Finally, we average the flux power spectra over all mock spectra and correct for the pixel window function [65]:

P_f(k_f) → P_f(k_f) / W^2(k_f, Δv),  with  W(k_f, Δv) = sin(k_f Δv/2) / (k_f Δv/2).

Note that we do not simulate the finite resolution of the spectrograph as this does not commute with the rescaling of mean fluxes (see below) and we compare only to data where the spectrographic resolution has been corrected (we test the impact of imperfect resolution modeling in Appendix B).
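The estimator described above can be sketched as follows. This is an illustrative NumPy implementation, not the fake_spectra code; the FFT normalization convention and the lognormal mock optical depths are stand-ins for demonstration:

```python
import numpy as np

def flux_power_1d(tau, dv=1.0):
    """Estimate the 1D flux power spectrum from mock optical-depth sightlines.

    tau : (n_spectra, n_pix) optical depths; dv : pixel width in km/s.
    Returns velocity wavenumbers k_f (s/km) and P_f(k_f) (km/s), averaged over
    sightlines and corrected for the pixel window function."""
    flux = np.exp(-tau)
    fbar = flux.mean()               # mean flux across all spectra in the box
    delta = flux / fbar - 1.0        # flux contrast delta_F
    n_pix = tau.shape[1]
    length = n_pix * dv              # sightline length in km/s
    dft = np.fft.rfft(delta, axis=1)
    power = length * np.abs(dft) ** 2 / n_pix ** 2   # per-sightline P_f(k_f)
    pf = power.mean(axis=0)          # average over all mock spectra
    kf = 2.0 * np.pi * np.fft.rfftfreq(n_pix, d=dv)
    # np.sinc(x) = sin(pi x)/(pi x), so this is sin(kf dv/2)/(kf dv/2)
    window = np.sinc(kf * dv / (2.0 * np.pi))
    return kf, pf / window ** 2

rng = np.random.default_rng(0)
tau_mock = rng.lognormal(mean=0.0, sigma=1.0, size=(200, 512))
kf, pf = flux_power_1d(tau_mock)
```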
We calculate (and emulate; see § II D) the flux power spectrum at the (real-valued) discrete Fourier transform sample frequencies of the simulation box with wavenumbers in inverse-distance units (h Mpc −1 ). This is because the conversion to velocity units depends on redshift and the energy density by the Hubble law. Since the matter energy density will be varied in our base emulator (see below), this would mean that the velocity wavenumbers of each simulation would vary across the emulator and the modes would not be consistently distributed. In the likelihood function (see § II D), we linearly interpolate to velocity wavenumbers.
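The wavenumber conversion used in the likelihood can be sketched as follows, assuming a flat matter-plus-Λ expansion history; the fiducial Ω_m and h quoted elsewhere in the paper are used here for illustration:

```python
import numpy as np

def k_dist_to_vel(k_hmpc, z, omega_m=0.3209, h=0.6686):
    """Convert comoving wavenumbers (h/Mpc) to velocity units (s/km).

    Velocity separations obey dv = H(z)/(1+z) * dx_comoving (Hubble law),
    so k_v = k_x * (1+z) / H(z), with k_x in Mpc^-1 and H(z) in km/s/Mpc."""
    hz = 100.0 * h * np.sqrt(omega_m * (1.0 + z) ** 3 + 1.0 - omega_m)
    return k_hmpc * h * (1.0 + z) / hz

k_hmpc = np.logspace(0.0, 2.0, 45)      # illustrative emulated wavenumber grid
k_skm = k_dist_to_vel(k_hmpc, z=5.0)    # velocity wavenumbers at z = 5
```

Because H(z) depends on Ω_m, this mapping differs between training simulations with different matter densities, which is why the emulation is done in distance units and the interpolation to velocity bins is deferred to the likelihood.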

C. Intergalactic medium and cosmological model
In order to study nCDM models using the flux power spectrum, we must marginalize over uncertainty in the thermal and ionization history of the IGM (which partly derives from the uncertain nature of cosmic reionization), as well as relevant aspects of the cosmological model.

Pressure smoothing
The energy deposited in the IGM by the background of ionizing ultra-violet (UV) photons and the finite temperature of the IGM (∼ 10^4 K at z ∼ 5) lead to a few observable effects in the flux power spectrum. The heat in the IGM smooths the spatial distribution of the gas below a characteristic scale, where the gas pressure balances gravitational clustering. A simple, classical interpretation defines a Jeans pressure smoothing scale, which is a function of the instantaneous gas temperature [e.g., 66]. However, in an expanding universe with an evolving thermal state, the relevant pressure smoothing scale depends on the integrated history of the gas heating, as spatial fluctuations from earlier times expand or fail to collapse depending on the temperature at that epoch. This "filtering length" in general deviates from the Jeans length, being typically greater before reionization and less afterwards (∼ 100 comoving kpc at z ∼ 5).
In order to track the heat deposited in our simulations and the resulting filtering scale, we use the integrated energy deposited into the gas per unit mass at the mean density,

u_0 = ∫_{t(z_ini)}^{t(z_end)} dt [Σ_i n_i ε_i] / ρ̄_b,

where ε_i are the photo-heating rates and n_i the number densities of the primordial ion species i ∈ [HI, HeI, HeII], and ρ̄_b is the mean baryon mass density; the integration limits z_ini and z_end are chosen to maximize the sensitivity of the flux power spectrum to u_0 [6,13]. In general, a delay between heat injection and a change in the gas distribution is expected, but this delay can be reduced in the flux power spectrum owing to non-linear redshift-space distortions arising from gas peculiar velocities; Ref. [64] posits this as an explanation for the different z_end. Otherwise, the values of z_ini are consistent with the gas gradually losing sensitivity to heating at earlier times.

Temperature-density relation
The small-scale flux power spectrum is sensitive to the instantaneous thermal state of the IGM, in particular owing to the thermal broadening of absorption lines. After the gas is heated by the (re-)ionization front, it cools (for Δz ∼ 1 − 2) towards a thermal asymptote set largely by the balance between UV background (UVB) photo-heating and adiabatic cooling owing to cosmological expansion [e.g., 68]. Further, the overwhelming majority of the low-density gas (Δ ≲ 10) to which the forest is sensitive follows a tight power-law temperature-density relation [69], where the gas temperature

T(Δ) = T_0(z) Δ^{γ(z) − 1}.

This relation has two free parameters: the temperature at mean density T_0(z) and a slope γ(z). The latter is generally expected to be ∼ 1 (in particular immediately after an ionization event) and to asymptote towards 1.6; however, γ(z) is generally poorly constrained by data, with sometimes inconsistent measurements [e.g., 70]. We discuss in § V where this model breaks down, e.g., a double power law during reionization or temperature fluctuations arising from an inhomogeneous reionization [71,72], and the resulting implications for dark matter bounds.
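The temperature-density relation is a one-line power law; a minimal sketch (the T_0 and γ values are illustrative, within the ranges discussed in the text):

```python
import numpy as np

def igm_temperature(delta, t0, gamma):
    """Power-law temperature-density relation T(Delta) = T0 * Delta^(gamma - 1).

    delta : gas overdensity; t0 : temperature at mean density (K);
    gamma : slope of the relation (~1 just after ionization, -> ~1.6 later)."""
    return t0 * delta ** (gamma - 1.0)

delta = np.logspace(-1.0, 1.0, 50)   # low-density gas probed by the forest
temp = igm_temperature(delta, t0=1.0e4, gamma=1.4)
```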

Ionization state
The ionization state of each species is determined by the balance between the UVB ionization rates and recombination effects. Uncertainty in the ionization rates is degenerate in the flux power spectrum with the mean amount of absorption in the transmitted flux, or the effective optical depth τ_eff. We therefore marginalize over the uncertainty in the ionization state by varying τ_eff; we parameterize τ_eff at each redshift by a multiplicative correction τ_0(z) to a fiducial model [64]:

τ_eff(z) = τ_0(z) τ_eff^fid(z).

Simulation input and output
The thermal and ionization history in our hydrodynamical simulations (see § II B) is determined by a set of input spatially-uniform UV ionization and heating rates for each primordial species [HI, HeI, HeII] as a function of redshift; our default set of rates comes from Ref. [73]. In order to generate a wide range of IGM histories, we vary the default photo-heating rates ε_{0,i} by an amplitude H_A and an overdensity (Δ)-dependent rescaling H_S, such that the heating rates ε_i = H_A ε_{0,i} Δ^{H_S} are used. This uniformly scales the heating rates with redshift. In order to vary the redshift dependence, we employ the reionization model of Ref. [75]. We vary two free parameters of the model: the redshift of hydrogen reionization z_rei, defined as the redshift at which the volume-averaged ionization fraction of HII is 0.5; and, for the first time in studying nCDM models using the flux power spectrum, the total heat injection during hydrogen reionization T_rei. The reionization model of Ref. [75] self-consistently generates heating and ionization rates for chosen reionization parameters, only transitioning to the default rates once ionization is complete. The model, in particular, removes spurious heating at high redshift before and during reionization, which has been present in previous work [e.g., 73,76–78].
We vary H_A ∈ [0.05, 3.5], H_S ∈ [−1.3, 0.7], z_rei ∈ [6, 15], T_rei ∈ [1.5 × 10^4, 4 × 10^4] K. In order to be further agnostic about the IGM states over which we marginalize and to have as robust nCDM bounds as possible, in our emulator and likelihood function (see § II D), we allow IGM parameters to vary independently with redshift. To achieve this, we label each simulation at each redshift with the parameter set [τ_0, T_0, γ, u_0]. This no longer restricts the redshift evolution of our IGM model to that determined by the input model alone, which may be incomplete [see, e.g., other estimations of UVB heating and ionization rates; 79]. In particular, this allows us to account for uncertainty in HeII reionization, which may start heating the IGM as early as z ∼ 4.5, without explicitly varying helium parameters in the input model. This would otherwise further extend the dimensionality of the model, which has important implications regarding the construction of the emulator (see § II D).
At each redshift, we restrict our sample values of [T_0, γ, u_0] to those generated in our simulations; i.e., we do not employ the practice of rescaling the temperature-density relation after a simulation snapshot has been saved [e.g., 64,80]. This practice may bias inference when using the smallest scales in the flux power spectrum, e.g., because rescaling the temperatures of simulation particles ignores their peculiar velocities, the effect of which on the flux power spectrum may not be fully captured by other IGM parameters (see above). However, we do employ the common practice of rescaling mock spectrum optical depths in order to increase our sampling of τ_eff; Ref. [81] tested the accuracy of this procedure. This process is relatively computationally cheap and means that, for each simulation snapshot, we can have an arbitrary number of effective optical depth values; we discuss this further in § II D.
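The optical-depth rescaling reduces to a one-parameter root find: scale all τ by a constant until the sightlines reproduce a target τ_eff. A minimal sketch by bisection (the lognormal optical depths are synthetic stand-ins for simulated sightlines):

```python
import numpy as np

def rescale_tau(tau, tau_eff_target, tol=1e-8):
    """Find the scaling A such that the rescaled optical depths A*tau give the
    target effective optical depth: <exp(-A*tau)> = exp(-tau_eff_target).

    The mean flux <exp(-A*tau)> decreases monotonically with A, so bisection
    on A is guaranteed to converge once the root is bracketed."""
    target_flux = np.exp(-tau_eff_target)
    lo, hi = 0.0, 1.0
    while np.exp(-hi * tau).mean() > target_flux:   # expand bracket upwards
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.exp(-mid * tau).mean() > target_flux:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(1)
tau = rng.lognormal(mean=0.0, sigma=1.0, size=(100, 256))
scale = rescale_tau(tau, tau_eff_target=1.5)
```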

Cosmology
The flux power spectrum at these redshifts (z ∼ 5) and scales (k_f ∼ 10^−2 − 10^−1 s km^−1) is sensitive to aspects of the cosmological model, over which we must marginalize for robust nCDM bounds. In particular, to search for a cut-off in the power spectrum arising from an nCDM model, we marginalize over the amplitude A_s ∈ [1.2 × 10^−9, 2.5 × 10^−9] and slope n_s ∈ [0.9, 0.995] of the primordial power spectrum. We note that these differ from the standard cosmological parameters [82], as we define them at a pivot scale k_p = 2 Mpc^−1, which lies in the range of scales we are considering. This requires translating CMB priors from the larger pivot scale used by the Planck Collaboration. We also vary, in the initial set of simulations, the fractional matter energy density Ω_m ∈ [0.26, 0.33], although in our initial optimization analysis presented below, we fix it to a fiducial value from Planck [82,83], Ω_m = 0.3209. All other cosmological parameters are fixed to fiducial values from the Planck 2018 results [83], in particular, Ω_b h^2 = 0.022126 and Ω_c h^2 = 0.12068.
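Translating the primordial amplitude between pivot scales is a power-law rescaling; a sketch assuming a pure power-law primordial spectrum and the standard Planck pivot k_* = 0.05 Mpc^−1 (n_s itself is pivot-independent in this case):

```python
def translate_amplitude(a_s_star, n_s, k_star=0.05, k_pivot=2.0):
    """Translate the primordial amplitude A_s from pivot k_star to k_pivot
    (both in Mpc^-1), assuming a pure power-law spectrum:
    A_s(k_pivot) = A_s(k_star) * (k_pivot / k_star)^(n_s - 1)."""
    return a_s_star * (k_pivot / k_star) ** (n_s - 1.0)

# illustrative Planck-like values translated to the k_p = 2 Mpc^-1 pivot
a_s_pivot = translate_amplitude(2.1e-9, 0.965)
```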

D. Gaussian process emulation and likelihood function
In order to infer bounds on nCDM parameters and to marginalize over the nuisance parameters identified in § II C, we want to be able to model data as a function of theory parameters. A challenge arises because each evaluation of our theory vector naively requires the calculation of a hydrodynamical simulation (see § II B) and numerical methods of inference (e. g., Markov-chain Monte Carlo sampling; MCMC) may require millions of these evaluations. This would be computationally infeasible given the numerical requirements for sufficiently converged simulations (see Appendix A). We follow the solution we presented in Refs. [41,42], which is to construct a "surrogate model" or "emulator" for the simulations using a Gaussian process [e. g., 43] and to ensure convergence and sufficient accuracy in its construction by the method of Bayesian emulator optimization. A Gaussian process emulator is sufficiently computationally-cheap to evaluate that standard inference methods like MCMC can be used.
We emulate separately, at each redshift z = [4.2, 4.6, 5.0], the flux power spectrum as a function of the nCDM parameters (§ II A) and the nuisance IGM and cosmological parameters (§ II C):

P_f = P_f(θ),  with  θ = [α, β, γ, τ_0(z), T_0(z), γ(z), u_0(z), n_s, A_s, Ω_m].

As discussed in § II B, we emulate the flux power spectrum at 45 sample wavenumbers in inverse-distance units (h Mpc^−1). This accounts for the fact that the velocity wavenumbers of a given simulation depend on Ω_m (by the Hubble law) and so emulating in velocity space would be complicated by having simulation modes binned differently at each training point. In the likelihood function, we then linearly interpolate the flux power spectrum to velocity wavenumber bins.

Gaussian process
Our emulator model is a Gaussian process; this is a stochastic process such that a finite number of simulation outputs (i.e., the flux power spectrum) forms a multivariate Gaussian distribution: P_f(θ) ∼ N(0, K(θ, θ′; ψ)).
Here, for clarity, we have dropped in the notation the dependence of P_f on k_f and z. The free hyper-parameters ψ of the emulator model for P_f(θ) will be optimized (or "trained") using a set of "training simulations" P_f(θ′) evaluated at a set of points θ′. The zero-mean condition is approximated by normalizing the flux power spectra by the median value in the training set. Having chosen the training set of simulations (see below), the final part of the model to determine is the covariance kernel K(θ, θ′; ψ). We use a linear combination of a squared exponential (or radial basis function) kernel, a linear kernel and a noise kernel:

K(θ, θ′; ψ) = σ_RBF^2 exp(−|θ − θ′|^2 / 2ℓ^2) + σ_lin^2 θ · θ′ + σ_noise^2 δ_{θθ′},

where ψ = [σ_RBF, ℓ, σ_lin, σ_noise]. The Gaussian process model constitutes a wide prior in function space and so allows emulation of the flux power spectrum without any strong prior knowledge of its parameter dependence. Moreover, the model is probabilistic and so any prediction of the flux power spectrum away from the training points will have an associated uncertainty which can be propagated to our statistical model. The choice of covariance kernel is motivated by wanting a flexible model (hence the squared exponential), recognizing that linear interpolation has been a reasonable approach in previous efforts [e.g., 27] (hence the linear kernel) and having a term which can account for noise in the training data (although we find that this is always negligible).
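A minimal sketch of such a composite kernel (the hyperparameter values below are illustrative; in the analysis they are trained by maximizing the marginal likelihood, § II D):

```python
import numpy as np

def composite_kernel(x1, x2, sig_rbf=1.0, length=1.0, sig_lin=0.1,
                     sig_noise=1e-6):
    """Covariance kernel: squared exponential (RBF) + linear + white noise.

    x1, x2 : (n, d) arrays of parameter vectors. The noise term is added
    on the diagonal only when the two point sets coincide."""
    sq_dist = np.sum((x1[:, None, :] - x2[None, :, :]) ** 2, axis=-1)
    k = sig_rbf ** 2 * np.exp(-0.5 * sq_dist / length ** 2)  # RBF term
    k = k + sig_lin ** 2 * (x1 @ x2.T)                       # linear term
    if x1.shape == x2.shape and np.allclose(x1, x2):
        k = k + sig_noise ** 2 * np.eye(x1.shape[0])         # noise term
    return k

theta = np.random.default_rng(2).uniform(size=(10, 4))
cov = composite_kernel(theta, theta)
```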

Training simulations
In the construction of the training data, we adapt the methods we presented in Refs. [41,42] to this more general parameterization. We run an initial set of fifty simulations spanning the wide prior range on the nCDM, IGM and cosmological parameters set out above ( § II A and II C). To achieve this, we choose the parameter values of these simulations from a Latin hypercube sampling scheme [84]. This sampling scheme is chosen because it has good space-filling properties with a low number of sampling points. It distributes N training points such that along each parameter axis, the full space is sampled in N equally-divided subspaces. We generate many different hypercubes and use the one that maximizes the minimum Euclidean distance between samples. This Latin hypercube is constructed for our input simulation parameters, i. e., [α, β, γ, H A , H S , z rei , T rei , n s , A s , Ω m ], since we do not know a priori the mapping from input to output IGM parameters. This means that in our initial set of training simulations, the IGM parameters are not sampled by a Latin hypercube. We find, however, that these parameters are still evenly sampled (see Fig. 4 for the distribution of training points) and the subsequent application of Bayesian emulator optimization will correct for any loss of precision. A partial degeneracy appears between T 0 (z = z i ) and u 0 (z = z i ) such that snapshots with high T 0 and low u 0 (and vice versa) are not produced. This constitutes a physical prior on these parameters as very hot IGMs with little previous heating (and vice versa) would be contrived. We therefore set a prior probability distribution that excludes parameter space outside the convex hull of the training points in this plane (at each redshift).
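The maximin Latin hypercube construction described above can be sketched in a few lines (plain NumPy; 50 points in 10 dimensions mirrors the initial training set, while the number of candidate hypercubes is an illustrative choice):

```python
import numpy as np

def latin_hypercube(n_samples, n_dim, n_trials=50, seed=0):
    """Maximin Latin hypercube in [0, 1)^n_dim: each axis is divided into
    n_samples equal bins with exactly one sample per bin; among n_trials
    random hypercubes, keep the one maximizing the minimum pairwise
    Euclidean distance between samples."""
    rng = np.random.default_rng(seed)
    best, best_score = None, -np.inf
    for _ in range(n_trials):
        # independently permute the bin order along each dimension
        perm = np.column_stack([rng.permutation(n_samples)
                                for _ in range(n_dim)])
        samples = (perm + rng.uniform(size=(n_samples, n_dim))) / n_samples
        dists = np.linalg.norm(samples[:, None] - samples[None, :], axis=-1)
        score = dists[np.triu_indices(n_samples, k=1)].min()
        if score > best_score:
            best, best_score = samples, score
    return best

cube = latin_hypercube(50, 10)  # 50 training simulations, 10 input parameters
```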
We also note that τ_0(z = z_i) does not form part of the Latin hypercube since (as discussed in § II C) we can have an arbitrary number of τ_0 samples by rescaling mock spectrum optical depths. For each training simulation, we have ten values of τ_0(z = z_i) ∈ [0.75, 1.25] (i.e., allowing for 25% corrections to the fiducial model). These are evenly distributed such that, across all N = 50 training points, we have 10N = 500 different values of τ_0, for maximal support in this dimension.
The framework we propose is as follows: the initial base nCDM emulator can be used for parameter inference on the many particular nCDM models it can characterize (see § II A). This means that the process of Bayesian emulator optimization (the iterative addition of training simulations until parameter inference has converged) should be applied for each particular set of dark matter model parameters (and nuisance IGM/cosmological parameters). In § III, we demonstrate the example of optimizing the emulator for inferring a bound on the mass of the ultra-light axion. Other dark matter model parameters can be inferred by equivalent analyses. We therefore defer a discussion of the specifics of the Bayesian emulator optimization to § III B.

Emulator training
We now discuss how a given emulator (whether optimized or not) propagates to the final parameter inference. With a given training set, we optimize (or "train") the hyperparameters ψ of the Gaussian process by maximizing the marginal likelihood of the training data. We train each redshift separately but use the same set of optimized hyperparameters for each wavenumber within each redshift bin. We do not model correlations between different wavenumbers; Ref. [80] found this to be a sufficient approximation. We can then make predictions for the flux power spectrum at arbitrary values of θ* by evaluating the posterior predictive distribution (PPD) for the flux power spectrum conditional on the training data. This is a Gaussian distribution with mean and variance

μ(θ*) = K(θ*, θ′; ψ) K(θ′, θ′; ψ)^−1 P_f(θ′),
σ^2(θ*) = K(θ*, θ*; ψ) − K(θ*, θ′; ψ) K(θ′, θ′; ψ)^−1 K(θ′, θ*; ψ).

The generic behaviour of the Gaussian process PPD is that its uncertainty increases as a function of distance to the nearest training points. It follows that more precise theoretical estimates of the flux power spectrum will in general be made in regions of parameter space which are more densely sampled by training data; the Bayesian emulator optimization exploits this feature of Gaussian processes (see § III B).
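The PPD above is standard Gaussian process regression; a minimal NumPy sketch (the RBF kernel and one-dimensional test function are illustrative stand-ins for the trained kernel and the flux power spectrum):

```python
import numpy as np

def rbf(a, b, length=0.1):
    """Squared exponential kernel between two point sets (n, d) and (m, d)."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist / length ** 2)

def gp_predict(x_star, x_train, y_train, kern, jitter=1e-10):
    """Posterior predictive mean and variance of a zero-mean GP:
    mu  = K(x*, X) K(X, X)^-1 y
    var = diag[ K(x*, x*) - K(x*, X) K(X, X)^-1 K(X, x*) ]."""
    k_xx = kern(x_train, x_train) + jitter * np.eye(len(x_train))
    k_sx = kern(x_star, x_train)
    mu = k_sx @ np.linalg.solve(k_xx, y_train)
    cov = kern(x_star, x_star) - k_sx @ np.linalg.solve(k_xx, k_sx.T)
    return mu, np.diag(cov)

x_train = np.linspace(0.0, 1.0, 8)[:, None]
y_train = np.sin(2.0 * np.pi * x_train[:, 0])
# at the training points themselves, the PPD reproduces the training data
mu, var = gp_predict(x_train, x_train, y_train, rbf)
```

The predictive variance grows with distance from the training points, which is the property the Bayesian optimization exploits.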

Likelihood function
In this analysis, we use a Gaussian likelihood function, although this initial emulator could be used with any likelihood function that can robustly propagate the theoretical uncertainty arising from the Gaussian process emulation. The theory vector is taken to be the PPD mean µ(θ) and the uncertainty in the flux power spectrum emulation is propagated by adding in quadrature the diagonal PPD covariance σ 2 (θ) to the data covariance.
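A sketch of this likelihood: the emulator's diagonal PPD variance is added to the data covariance before evaluating the Gaussian log-likelihood (all inputs below are synthetic placeholders):

```python
import numpy as np

def log_likelihood(data, mu_emu, var_emu, cov_data):
    """Gaussian log-likelihood with the diagonal emulator (PPD) variance
    added in quadrature to the data covariance."""
    cov = cov_data + np.diag(var_emu)      # propagate emulator uncertainty
    resid = data - mu_emu
    _, logdet = np.linalg.slogdet(2.0 * np.pi * cov)
    return -0.5 * (resid @ np.linalg.solve(cov, resid) + logdet)

n_k = 45  # number of emulated wavenumber bins
rng = np.random.default_rng(3)
data = rng.normal(size=n_k)
ll = log_likelihood(data, np.zeros(n_k), 0.01 * np.ones(n_k), np.eye(n_k))
```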

III. ULTRA-LIGHT AXION EMULATOR
In this section, we present the construction of the Bayesian-optimized emulator for inferring a bound on the mass of ultra-light axions as the dark matter using the Lyman-alpha forest flux power spectrum. This emulator is built using the base nCDM emulator presented in § II. It constitutes an example of how to construct an optimized emulator for a particular dark matter model characterized by the nCDM parameters. Other dark matter candidates could be studied in an equivalent way. In § III A, we explain the ultra-light axion model we employ; and in § III B, we discuss the method of Bayesian emulator optimization and how this connects to estimation of the posterior probability distribution.
A. Ultra-light axion model

Our ground truth for the ULA transfer functions is given by axionCAMB, a modified version of the Einstein-Boltzmann solver CAMB [88]. It solves the coupled Friedmann/Klein-Gordon system of equations for a homogeneous ULA field in an expanding universe. At early times, when m_a a ≲ H (a is the cosmological scale factor; H is the conformal Hubble parameter), the evolving axion equation of state w(a) and adiabatic sound speed are used to evolve axion energy and pressure perturbations along with the other cosmological energy components. At late times, when m_a a ≫ H, the equations become stiff; the axion field is then well-approximated as a pressureless fluid (w = 0) and the perturbations are evolved in the WKB approximation for ULAs with scale-dependent sound speed [e.g., 89]. This approximation captures the suppression of small-scale axion perturbations, observed as the sharp cut-off in the transfer functions in Fig. 2, without requiring the resolution of the short ULA oscillation timescale (∼ m_a^−1). We find that the nCDM model (Eq. (1)) can characterize well the power spectrum cut-off produced by axionCAMB (Fig. 2). It is not able to characterize the suppressed small-scale oscillations in the transfer function. However, even in the initial conditions of our simulations (§ II B), these oscillations will be further suppressed in the matter power spectrum, since P(k) ∝ T^2(k). They will be erased further in propagation to our observable, the flux power spectrum at z ∼ 5, by non-linearities in the density field and the projection from a 3D to a 1D line-of-sight power spectrum. Considering that current data measure the flux power spectrum at ∼ 10% − 25% precision [e.g., 64], we conclude that our bounds on axion mass will be insensitive to this approximation and driven by the non-detection of the initial sharp cut-off in the power spectrum. We will discuss this further and compare to the approximations in previous efforts in § V.
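The step of fitting the nCDM form to a tabulated transfer function can be illustrated with a brute-force least-squares search. The target below is a synthetic nCDM curve (the quoted 10^−22 eV ULA fit) standing in for axionCAMB output, so exact recovery is expected; the parameter grids are illustrative:

```python
import numpy as np

def ncdm_t2(k, alpha, beta, gamma):
    """Squared nCDM transfer function, Eq. (1): T^2 = [1 + (alpha*k)^beta]^(2*gamma)."""
    return (1.0 + (alpha * k) ** beta) ** (2.0 * gamma)

def fit_ncdm(k, t2_target, alphas, betas, gammas):
    """Brute-force least-squares fit of [alpha, beta, gamma] to a tabulated
    T^2(k), scanning a grid of candidate parameter values."""
    best, best_err = None, np.inf
    for a in alphas:
        for b in betas:
            for g in gammas:
                err = np.sum((ncdm_t2(k, a, b, g) - t2_target) ** 2)
                if err < best_err:
                    best, best_err = (a, b, g), err
    return best

k = np.logspace(0.0, 2.0, 100)                 # h/Mpc
target = ncdm_t2(k, 0.1, 5.2, -3.4)            # synthetic ULA-like target
fit = fit_ncdm(k, target,
               np.linspace(0.05, 0.15, 11),
               np.linspace(4.0, 6.0, 11),
               np.linspace(-4.0, -3.0, 11))
```

In practice a continuous optimizer would replace the grid scan, but the structure of the fit is the same.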
We consider only the case where ULAs form the entirety of the dark matter; future analyses can extend our parameter space to consider sub-dominant contributions to the dark matter energy density (see § V).

B. Bayesian-optimized emulator and posterior distribution
In order to infer a bound on the mass of ultra-light axions as the entirety of the dark matter (and marginalize over nuisance IGM/cosmological parameters; see § II C), we must numerically sample the posterior probability distribution for the full set of inference parameters φ = [log(m_a [eV]), τ_0(z = z_i), T_0(z = z_i), γ(z = z_i), u_0(z = z_i), n_s, A_s], for z_i = [4.2, 4.6, 5.0]. We use the Gaussian likelihood function presented in § II D.
Although Ω_m is varied in our emulator (§ II C), the relevant data do not constrain this parameter, so here we fix Ω_m = 0.3209, which is equivalent to h = 0.6686 [83]. We estimate the posterior probability distribution (under a given prior distribution) using the affine-invariant MCMC ensemble sampler emcee [90]. In order to evaluate the theory flux power spectrum, we use the nCDM emulator described in § II. We map from log(m_a [eV]) to [α, β, γ] using the model constructed in § III A. However, as mentioned in § II D, the initial base emulator with fifty training simulations is un-optimized and not guaranteed to give converged estimates of the posterior distribution. A lack of convergence can arise when the uncertainty in the emulator prediction (the variance of its posterior predictive distribution) dominates over the data error (covariance) at the peak of the true posterior distribution (which is not known a priori). This can weaken parameter constraints (as the emulator error broadens the peak of the likelihood) or even bias the determination of the maximum posterior point, owing to the non-stationarity of the emulator variance. We account for this problem by optimizing the emulator training set (iteratively adding to it) such that the true posterior peak is determined; this is the Bayesian emulator optimization we introduced in Refs. [41, 42].
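The structure of such a sampled log-posterior can be sketched as follows. This is a minimal illustration, not the paper's pipeline: `emulate_flux_power` is a hypothetical stand-in for the trained emulator, and the flat prior bounds are illustrative.

```python
import numpy as np

def log_likelihood(phi, data, inv_cov, emulate_flux_power):
    """Gaussian log-likelihood of the observed flux power spectrum (cf. Sec. II D)."""
    residual = data - emulate_flux_power(phi)
    return -0.5 * residual @ inv_cov @ residual

def log_posterior(phi, data, inv_cov, emulate_flux_power, bounds):
    """Log-posterior with flat priors restricted to the emulator's valid range."""
    if np.any(phi < bounds[:, 0]) or np.any(phi > bounds[:, 1]):
        return -np.inf  # outside the prior volume
    return log_likelihood(phi, data, inv_cov, emulate_flux_power)
```

A callable of this shape is what would be handed to `emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=(...))` for the affine-invariant ensemble sampling described above.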
At each iteration of the Bayesian emulator optimization, after training the emulator on the existing training set and evaluating the posterior distribution by MCMC (see above), an acquisition function is maximized in order to determine the position in parameter space of the next training simulation. Following Ref. [41], we use the modified GP-UCB (Gaussian process upper confidence bound) acquisition function [91–94]

A(φ) = P(φ|d) + α σ^T(φ) Σ^{-1} σ(φ),

where P(φ|d) is the natural logarithm of the posterior probability given data d; Σ is the data covariance matrix; and σ(φ) is the emulator error vector (the standard deviation of the Gaussian process PPD; see § II D). Following Refs. [95, 96], we set the acquisition hyperparameter α = 0.85; this is roughly equivalent to having ∼ 1σ confidence in the estimation of the posterior distribution during optimization. Since, as discussed in § II, we can have an arbitrary number of training points in the τ_0(z = z_i) dimensions thanks to computationally-cheap simulation post-processing, we do not optimize the training set in these dimensions; in the acquisition function, φ is the truncated set of inference parameters without τ_0(z = z_i). The acquisition function has two terms, which balance exploitation of approximate knowledge of our objective function (the posterior distribution) against exploration of the full prior volume. The exploitation term (the first) prefers adding training simulations at the peak of the posterior, which is the most important region to characterize accurately. The exploration term (the second) prefers adding simulations where the uncertainty in the estimation of the true (zero emulator error) posterior is large, which is equivalent to regions where the ratio of emulator to data covariance is large. The linear combination balances these two components.
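The acquisition function above translates directly into code. In this sketch, `log_posterior` and `emulator_std` are hypothetical callables standing in for the MCMC-estimated posterior and the Gaussian process error vector.

```python
import numpy as np

def acquisition(phi, log_posterior, emulator_std, inv_data_cov, alpha=0.85):
    """Modified GP-UCB: A(phi) = ln P(phi|d) + alpha * sigma(phi)^T Sigma^-1 sigma(phi)."""
    sigma = emulator_std(phi)           # emulator predictive standard deviation at phi
    exploitation = log_posterior(phi)   # prefers the posterior peak
    exploration = alpha * sigma @ inv_data_cov @ sigma  # prefers uncertain regions
    return exploitation + exploration
```

Note that the exploration term is non-negative by construction (Σ is a covariance matrix), so it can only raise the acquisition above the log-posterior, and it vanishes exactly where the emulator is certain.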
We find the maximum of the acquisition function using a truncated Newton algorithm [97]. In order to ensure that we explore the full peak of the posterior distribution (i.e., out to the 95% to 99% credible region), we then choose the position of the next training simulation by adding a random displacement (drawn from a multivariate Gaussian distribution) to the maximum acquisition point [98]. The size of this displacement is tuned to the size of the credible region that we want to characterize accurately. A further subtlety arises because the acquisition function is defined on the space of the (truncated) inference parameters, which include the output IGM parameters [T_0, γ, u_0]. In order to determine which input IGM parameters to use for the next training simulation, we map back to [H_A, H_S, z_rei, T_rei] using radial basis function interpolation across the existing training set [99]. This method is related to, but less general than, Gaussian process emulation, and serves as a simple yet robust method for this mapping. Once the next training simulation is run, the τ_0(z = z_i) samples are redistributed such that the 10N points (N is the total number of training simulations, including the optimization set) evenly sample the full prior range. The emulator is then re-trained (the emulator hyperparameters ψ are re-optimized) and the posterior distribution is re-estimated by MCMC using the new iteration of the emulator.
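The two steps above, maximizing the acquisition and mapping output IGM parameters back to simulation inputs, might be sketched with standard scipy tools. All names, the jitter covariance and the RBF settings here are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def propose_next_simulation(neg_acquisition, x0, bounds, jitter_cov):
    """Maximize the acquisition with a truncated Newton method (by minimizing
    its negative), then add a Gaussian displacement tuned to the credible
    region we want to characterize."""
    result = minimize(neg_acquisition, x0, method="TNC", bounds=bounds)
    jitter = rng.multivariate_normal(np.zeros(len(result.x)), jitter_cov)
    return result.x + jitter

def outputs_to_inputs(output_train, input_train, output_query):
    """Map emulated 'output' IGM parameters (e.g., [T0, gamma, u0]) back to
    simulation 'input' parameters (e.g., [HA, HS, z_rei, T_rei]) by radial
    basis function interpolation across the existing training set."""
    rbf = RBFInterpolator(output_train, input_train)
    return rbf(np.atleast_2d(output_query))[0]
```

Because the RBF interpolant passes exactly through the training set, an existing training simulation maps back to its own input parameters, which is the minimal consistency requirement for this step.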
The process of Bayesian emulator optimization continues, and training simulations are added to the emulator, until convergence is observed. We use improved "double-lock" convergence criteria. As in Ref. [41], one criterion is that successive estimations of the posterior distribution (e.g., as summarized by marginalized statistics like the mean and the 1σ and 2σ constraints) do not change within a given tolerance; we show the results of this convergence test in § IV and Fig. 7. The second criterion is that the exploration term of the acquisition function, evaluated at the parameter positions of optimization simulations, converges towards zero. This ensures that the acquisition of new training data is then dominated by exploitation, i.e., that training points are being added at the peak of the posterior and that the ratio of emulator to data covariance is tending towards zero in this region. The results of this convergence test are shown in § IV and Fig. 5.
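A minimal sketch of the double-lock check described above follows; the tolerance values are hypothetical placeholders, not the values used in this analysis.

```python
import numpy as np

def double_lock_converged(mean_prev, mean_new, sigma_new, exploration_terms,
                          shift_tol=0.1, exploration_tol=1e-2):
    """Criterion 1: shifts in marginalized posterior summaries between
    successive iterations are small in units of the current 1-sigma
    constraint. Criterion 2: the exploration term of the acquisition
    function at the optimization simulations is negligible."""
    shifts = np.abs(np.asarray(mean_new) - np.asarray(mean_prev)) / np.asarray(sigma_new)
    lock_1 = np.all(shifts < shift_tol)
    lock_2 = np.all(np.asarray(exploration_terms) < exploration_tol)
    return bool(lock_1 and lock_2)
```

Requiring both locks simultaneously guards against the failure mode where posterior summaries appear stable only because the emulator is confidently wrong in an unexplored region.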
In this analysis, we use the batch version of Bayesian emulator optimization [41]. This involves, at each iteration, proposing a batch of optimization simulations at once, running these in parallel and re-training the emulator only after the entire batch has been added to the training set. The first simulation in a batch of a given size is chosen as detailed above. Subsequent members of a batch are chosen in a similar way, except that the output (flux power spectra) from previous members is not yet known. However, the positions in parameter space of these members are known, and this information can at least update the emulator error distribution, since the PPD variance depends only on the positions of training data and not on the data themselves (see § II D). This partial information updates the exploration term of the acquisition function for iterations within a batch. Batch optimization trades some optimality in the acquisition of new training data for efficient parallel use of computational resources. In this study, we start with three batches of five and then switch to batches of two, although we note that the third and sixth batches are each short one simulation, owing to hardware issues that led to an incomplete calculation.

IV. TESTS
In this section, we conduct tests of convergence (§ IV A) and cross-validation (§ IV B) on the Bayesian-optimized ULA emulator presented in § III.
A. Tests of convergence

Figure 4 indicates the distribution of emulator training points, projected onto each parameter plane. It can be seen that the initial Latin hypercube set of fifty (see § II; most lightly shaded in Fig. 4) fully spans the parameter space. The optimization simulations (43 in total; see § III; more darkly shaded in Fig. 4) concentrate in a smaller sub-space, which corresponds to the peak of the posterior distribution (see Fig. 6 and below). There is, however, a degeneracy between T_0(z = z_i) and u_0(z = z_i), since it is unphysical to generate simulations with high T_0 and low u_0, and vice versa. We reiterate that these are not samples from the posterior distribution and that the posterior is always estimated with fully converged MCMC chains.

Figure 5 shows the first of the two convergence tests we use. It indicates how the exploration term of the Bayesian optimization acquisition function (see § III B) tends towards zero as more training simulations are added to the emulator. The exploration term is effectively the ratio of emulator to data covariance. Its convergence to zero reflects how the optimization becomes more dominated by exploitation, i.e., training simulations are proposed at the peak of the posterior, where the ratio of emulator to data covariance is negligible. This means that the estimation of the posterior peak (by MCMC) tends towards the true posterior, with negligible propagated emulator error.

Figure 6 indicates the convergence of the posterior distribution (see § III B), as summarized by the set of marginalized 1D and 2D distributions. For clarity, we show only the projections in the IGM parameters at z = 4.2; we observe similar convergence at the other redshifts that we consider. It can be seen in each projection that the posterior estimation converges as more optimization simulations are added to the emulator training set.
In most projections, there is a significant shift from using the initial emulator (§ II) to having added 19 optimization points. In particular, the axion mass m_a goes from being (spuriously) detected to setting a lower bound. However, after this point, the shifts in the posterior are much less significant; and following the iteration after which 35 optimization points have been added, the shifts in the posterior are statistically negligible. We track this more quantitatively using summaries of the marginalized posterior (the marginalized mean and the 1σ and 2σ constraints). Figure 7 shows, for each model parameter, the shift in these quantities between iterations (in units of σ) as a function of the number of optimization simulations. It can be seen that by the end of the optimization, the shifts are small, indicating the desired robustness in the estimation of the posterior distribution. In Fig. 6, in the planes of parameters that are emulated together (remembering that different redshifts are emulated separately), we indicate by colored crosses the projected positions of optimization simulations. It can be seen that, by Bayesian optimization, the optimization simulations are more concentrated in the posterior peak than the initial set, which is evenly distributed in the full prior volume. This reduces the emulator error within the posterior peak, leading to accurate determination of the flux power spectrum (see § IV B), and hence of the likelihood function, in this region.

[FIG. 4 caption fragment: only the planes of parameters that are emulated together are shown, not the planes with parameters from different redshift bins, nor the τ_0(z = z_i) planes, which are densely sampled by simulation post-processing (see § II B). On the x axis, z_i corresponds to the redshift indicated on the y axis. For log(m_a [eV]), the initial set fully spans the [α, β, γ] volume. The darker-shaded optimization points concentrate in a smaller sub-volume, which corresponds to the peak of the posterior distribution (see Fig. 6).]

B. Tests of cross-validation
We carry out leave-one-out cross-validation of the Bayesian-optimized emulator. This involves, in turn, removing one of the training simulations, re-training the reduced emulator and then comparing its prediction to the truth of the validation simulation that was removed. We simultaneously remove the training simulation from each redshift bin, since these are emulated separately. Figure 8 shows the distribution of the ratio of empirical-to-predicted emulator error in the flux power spectrum at the validation simulations, as a function of wavenumber k and redshift z. The empirical error is the difference between the emulator PPD mean (see § II D) and the true flux power spectrum as measured from the validation simulation. The predicted emulator error is the standard deviation of the emulator PPD. If the emulator model is fitting well, the distribution of the error ratio should be a unit Gaussian. A distribution wider than this indicates under-fitting, i.e., that the emulator error is being underestimated and that the Gaussian process model does not have sufficient flexibility to describe accurately the training or validation data. A distribution narrower than a unit Gaussian indicates over-fitting, i.e., that the emulator error is being overestimated and that the Gaussian process has too much flexibility and is fit too closely to the training data. This has negative implications for the generalizability of the emulator model to unseen (e.g., validation) data.
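The leave-one-out procedure can be sketched as follows. `train_emulator` is a hypothetical stand-in for Gaussian process training that returns a predictor yielding a (mean, std) pair; it is not the paper's actual emulator code.

```python
import numpy as np

def loo_error_ratios(X, Y, train_emulator):
    """For each training simulation i, retrain on the rest and form the ratio
    of empirical error (prediction minus truth) to predicted error (the
    emulator's own standard deviation)."""
    n = len(X)
    ratios = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        predict = train_emulator(X[mask], Y[mask])
        mean, std = predict(X[i])
        ratios[i] = (mean - Y[i]) / std  # empirical / predicted error
    # A well-calibrated error model gives ratios ~ unit Gaussian; a wider
    # spread indicates under-fitting, a narrower one over-fitting.
    return ratios
```

Binning these ratios by wavenumber and redshift gives the violin distributions of Fig. 8.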
In Fig. 8, the distributions are split into two sets: the left-hand halves of each violin consider the validation test only for the simulations added by Bayesian optimization, while the right-hand halves consider all the simulations, including the initial Latin hypercube. In each case, we consider only the final, converged emulator and, in cross-validation, optimize the emulator on all 92 other training simulations. Splitting the distributions in this way separates the two populations that are formed. When the initial set of simulations is included, the overall distribution tends towards overestimating the emulator error, in particular on the largest scales (smallest k). However, when considering only the optimization simulations (and hence the training points at the peak of the posterior distribution), the Gaussian process fit is better, although the error is still overestimated on the largest scales; there is some mild underestimation on intermediate scales. Overestimating the emulator error will tend to weaken parameter constraints, as large errors broaden the peak of the posterior.

Figure 9 compares the empirical and predicted emulator errors (respectively, the numerator and denominator of the ratio illustrated in Fig. 8) to the data error (the square root of the diagonal elements of the data covariance used in our data analysis presented in Ref. [100]). The distributions of (the logarithm of) the ratio of emulator-to-data error are in general bi-modal; the lower-valued modes come from the optimization simulations, while the other modes come from the initial Latin hypercube. This reflects how the ratio of emulator-to-data error is significantly lower for the optimization simulations (and therefore at the peak of the posterior distribution) than for the initial Latin hypercube.
This is a direct consequence (and desired outcome) of the Bayesian optimization, which builds up training information at the posterior peak and locally reduces emulator uncertainty in the flux power spectrum, leading to more accurate determination of the peak of the posterior distribution. The left-hand and right-hand distributions in each violin show the ratio of emulator-to-data error respectively for the predicted emulator error (the PPD standard deviation) and the empirical emulator error (the difference between the PPD mean and the truth). Consistent with the test in Fig. 8, the empirical ratio is consistently shifted to lower values than the predicted ratio, reflecting how the emulator tends towards over-fitting the training data and overestimating the emulator uncertainty.
Nonetheless, for the optimization simulations, the mode of the predicted emulator-to-data error ratio is always less than one. This means that in the posterior peak (∼ 95% credible region), the uncertainty in the flux power spectrum (i.e., the diagonal of the covariance matrix in the likelihood function; see § II D) is dominated by the uncertainty in the data. As discussed in Ref. [41], this is a key requirement for converged estimation of the posterior distribution, as it means that any residual uncertainty in the emulator model is not statistically significant. Considering the tendency to overestimate the emulator error discussed above (in particular on the largest scales), the ratio of empirical emulator-to-data error is often < 0.1. This shows that there is no statistically significant bias (relative to the data uncertainty) in the emulation of the flux power spectrum in the posterior peak.

[FIG. 9 caption fragment: For each element, the left PDF shows the ratio of emulator error to data error; for the majority of training data in the optimization set (which constitutes the lower-valued mode in each case), this is < 1. Since Fig. 8 shows through cross-validation a tendency to overestimate the prediction uncertainty, the right PDFs show the ratio of the empirical error (for validation simulations) to the data error; this ratio is often ≪ 1, indicating no statistically-significant bias in emulation of the flux power spectrum (in the optimization region). See more details in the text.]

V. DISCUSSION
A. Dark matter model

The model we employ to describe the effect of non-cold dark matter on the matter power spectrum [first introduced in 36] can characterize many different dark matter candidates with different physical mechanisms suppressing the small-scale power spectrum. These include, e.g., warm dark matter with a free streaming scale [e.g., 101] or dark matter interacting with itself or with baryons [e.g., 20, 25]; this is discussed further in Ref. [36]. For the nCDM model to apply, the detectable effect on the power spectrum relative to the cold dark matter limit must be fully captured by modified initial conditions at z ∼ 100. We consider here, and in an accompanying data analysis [100], the example of ultra-light axion dark matter (§ III). We fit the nCDM model to ULA matter power spectra calculated by the modified Boltzmann code axionCAMB [85, see § III A]. We find the fit to be good aside from the suppressed small-scale power spectrum oscillations, which cannot be captured by the nCDM model. We argue, however, that given the precision of flux power spectrum measurements on the relevant scales [∼ 10% to 25%; 64], these suppressed oscillations are currently undetectable, and so the nCDM model captures the relevant phenomenology. This compares to the approximate ULA power spectrum transfer function employed in previous Lyman-alpha forest studies [33], which is a semi-analytical approximation presented in Ref. [16]. Although this model (imperfectly) captures small-scale oscillations (which do not persist to the flux power spectrum at z ∼ 5), its initial cut-off in the power spectrum is sharper than in the full calculation made by axionCAMB. We find that the nCDM model better captures the relevant cut-off in the power spectrum (see Fig. 2).

B. Intergalactic medium model
Robust dark matter bounds using the Lyman-alpha forest must marginalize over uncertainty in the thermal and ionization state of the intergalactic medium, which leads to a similar, but distinguishable, suppression in the small-scale flux power spectrum (see § II C). We model the IGM as obeying a single power-law temperature-density relation (TDR; Eq. (3)). We also argue that pressure smoothing can be tracked by the cumulative energy deposited per unit mass at the mean density, u_0(z); and we exploit the fact that the strength of the photo-ionizing UV background is degenerate in the flux power spectrum with the effective optical depth of Lyman-alpha absorption. This compares to previous Lyman-alpha forest studies of dark matter [e.g., 27, 33], which have usually tracked the pressure smoothing by varying the redshift of hydrogen reionization. Varying this parameter simultaneously with the TDR at each redshift can lead to marginalizing over unphysical IGMs, where the energy deposited by (re)ionization is inconsistent with the instantaneous gas temperature. This may spuriously weaken dark matter bounds. In order to avoid this, we model the IGM using only "output" parameters that describe the state of the IGM at a fixed time-slice. We also improve the input for our simulations by employing the reionization model of Ref. [75]. Compared to previous prescriptions [e.g., 73, 76–78], it removes spurious high-redshift heating and provides additional physical flexibility by varying both the timing of, and total heat injection during, reionization.
However, an underlying assumption in our (and previous) IGM models is that reionization is a spatially homogeneous process. This is an approximation, as ionization and heating are expected to expand in overlapping bubbles around the (uncertain) sources of ionizing radiation [e.g., 102]. The resultant spatial fluctuations in temperature and in the ionizing background, which can persist in the IGM to z ∼ 5 [e.g., 71, 103], have been suggested as a possible source of systematic error in Lyman-alpha forest dark matter bounds [e.g., 104]. Ref. [72] found, using simulations which couple gas hydrodynamics with radiative transfer [105], that the relevant effect on the flux power spectrum is below the sensitivity of current data [e.g., 64]. Nonetheless, as modeling and data improve, future versions of the emulator can benefit from explicitly marginalizing over the effect of inhomogeneity in both hydrogen and helium reionization [e.g., 71, 103, 106].

VI. CONCLUSIONS
We have presented a general framework for testing the nature of dark matter using cosmological data and N-body simulations, specifically implemented to search for deviations from cold dark matter using the Lyman-alpha forest. It consists of a Gaussian process emulator, trained using hydrodynamical simulations to predict how the Lyman-alpha forest flux power spectrum responds to different dark matter models, the state of the intergalactic medium and the spectrum of primordial fluctuations. This is necessary since MCMC methods for inference would otherwise be impractical, owing to the computational cost of N-body simulations. The emulator exploits a flexible parameterization of the effect of non-cold dark matter on the matter power spectrum [introduced by 36]. This can accurately capture the suppression in the power spectrum arising from many different dark matter models (relative to the cold dark matter limit), e.g., warm [e.g., 101] or interacting [e.g., 20, 25] dark matter or ultra-light axion dark matter [16], among others [see 36]. The emulator can be optimized for the analysis of particular dark matter models by the Bayesian emulator optimization we presented in Ref. [41]. This tests for convergence in the fidelity of the emulator model by adaptively building up the emulator's training information.
We demonstrate the example of constructing a Bayesian-optimized emulator for the mass of ultra-light axion dark matter. We carry out tests of convergence with respect to the number of hydrodynamical simulations input to the emulator, and cross-validation of the accuracy of flux power spectrum prediction. These tests confirm the robustness of the data analysis we carry out in Ref. [100]. The emulator framework we present here will facilitate the future analysis of other dark matter candidates by an equivalent procedure. We argue that the application of Bayesian optimization to emulators of cosmological "effective theories," where many different models are described by a single set of equations, can be a powerful approach for robust and computationally-efficient parameter inference from the cosmic large-scale structure.

Appendix A: Numerical convergence

In this analysis, we use a simulation mass resolution and box volume that at least match (and often improve upon) the numerics of previous Lyman-alpha forest studies of the small-scale matter power spectrum [k_f ∼ 10^{-1} s km^{-1}; e.g., 64]; here, we test convergence in these properties. The top panel of Fig. 10 shows a test of convergence in the simulated flux power spectrum with respect to the number of simulation particles, while fixing the comoving volume of the simulation box to (10 h^{-1} Mpc)^3. This is equivalently a test of convergence with respect to the mass resolution of the simulation. In each case, we rescale mock spectrum optical depths such that each simulation has the same mean transmitted flux, in order to remove this parameter dependence. We observe convergence as the number of particles is increased (and hence the mass resolution is improved). We find the correction from having more particles than our nominal value (2 × 512^3) never to be greater than 10% and always less than the typical uncertainty in flux power spectrum data [e.g., 64].
We carry out this test for fiducial model parameters corresponding to a 2 keV warm dark matter particle, since it is known that convergence criteria are more stringent in the presence of suppressed initial conditions [e. g., considering the effect of numerical fragmentation; 107]. We therefore consider this a robust test of numerical convergence.

The second panel from the top of Fig. 10 shows a test of convergence with respect to the comoving volume of the simulation box, while fixing the mass resolution of the simulation. We observe convergence as the box volume is increased. We find the correction from having a larger box than our nominal value of (10 h^{-1} Mpc)^3 to be ∼ 10% or less, which is less than the typical data uncertainty. We find no statistically significant change in the convergence in the observationally-relevant redshift window z ∼ 4 to 5.

Appendix B: Spectral resolution

The third panel from the top of Fig. 10 shows a test of the systematic error arising from mis-estimating the spectral resolution of the data. When the spectral resolution is overestimated, spurious power is added up to the highest flux power spectrum mode that we consider. This arises since the smoothing in the transmitted flux that comes from a finite spectral resolution is over-corrected in the flux power spectrum (see Ref. [65] for the full model) and too much power is added on small scales. We find the opposite effect when underestimating the spectral resolution. The typical estimated uncertainty on the spectral resolution is ∼ 10%, which we find would have no statistically significant effect on the flux power spectrum (< 5%) compared to typical data uncertainty [∼ 10% to 25%; e.g., 64]. This would remain true even if the resolution uncertainty were underestimated by a factor of two.

Appendix C: Mean flux evolution
The bottom panel of Fig. 10 shows a test of the systematic error in the flux power spectrum at z = 5 arising from ignoring the redshift evolution of the mean transmitted flux in individual sections of Lyman-alpha forest. We model this (without resorting to radiative transfer) by injecting a redshift dependence into the effective optical depth of mock spectra generated from a simulation snapshot at z = 5, using the fiducial model presented in Ref. [64]. We then calculate the flux power spectrum from this snapshot ignoring the redshift dependence, i.e., following the default procedure of normalizing the flux by the mean in the whole box. We compare this to the flux power spectrum measured from the same snapshot without the injected redshift dependence, but with the optical depths uniformly rescaled to match the global mean flux in the two settings. We find an effect on all scales, since the 1D flux power spectrum is a non-linear mapping of the optical depths that we modify. However, as expected, the largest effect is on large scales, since we are effectively modifying a long-wavelength mode in the optical depth. Nonetheless, the effect on the power spectrum is always < 10% and less than the typical data error [e.g., 64]. Further, we expect the residual error in our analysis to be much less than this, since the effect is corrected in the data by using a "rolling mean" estimator [64]. Rather than normalizing the transmitted flux by a globally-estimated mean, the mean flux is estimated locally at each pixel, using a finite window centered on that point. This corrects for the redshift evolution in the mean flux within sections of Lyman-alpha forest. We therefore expect only a residual error when comparing to simulations at a fixed time-slice. We expect the effect to weaken at lower redshifts, because the effective optical depth evolves more slowly at later times and the redshift path length for a given comoving interval also decreases with redshift.

FIG. 10. From top to bottom: a test of convergence in the simulated flux power spectrum at z = 5 with respect to the number of simulation particles (for fixed comoving box volume); a test of convergence with respect to the comoving volume of the simulation box (for fixed particle mass resolution; the emulator is built with simulations with a comoving volume of (10 h^{-1} Mpc)^3 and 2 × 512^3 particles); a test of the systematic error arising from mis-estimating the spectral resolution of the data [up to ±20%; the uncertainty on the spectral resolution is generally estimated to be ∼ 10%; e.g., 64]; and a test of the systematic error arising from ignoring the redshift evolution of the mean flux in sections of Lyman-alpha forest [the residual error in our analysis will be much less, since data are corrected for this effect by the "rolling mean" estimator; e.g., 64]. For comparison, the uncertainty in flux power spectrum data is ∼ 10% to 25%, depending on wavenumber [e.g., 64].
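The "rolling mean" normalization described in the text can be sketched as follows; the window size is illustrative, not the value used in Ref. [64].

```python
import numpy as np

def rolling_mean_contrast(flux, window=101):
    """Estimate the mean transmitted flux locally, in a finite window centered
    on each pixel, rather than from the whole box, so that redshift evolution
    of the mean flux along the sightline is divided out."""
    kernel = np.ones(window)
    local_sum = np.convolve(flux, kernel, mode="same")
    counts = np.convolve(np.ones_like(flux), kernel, mode="same")  # edge correction
    local_mean = local_sum / counts
    return flux / local_mean - 1.0  # flux contrast entering the power spectrum
```

A slow, long-wavelength trend in the mean flux is absorbed into the local mean and so removed from the contrast, which is the sense in which this estimator corrects the long-wavelength optical depth mode discussed above.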