Fully Bayesian Forecasts with Evidence Networks

Sensitivity forecasts inform the design of experiments and the direction of theoretical efforts. To arrive at representative results, Bayesian forecasts should marginalize their conclusions over uncertain parameters and noise realizations rather than picking fiducial values. However, this is typically computationally infeasible with current methods for forecasts of an experiment's ability to distinguish between competing models. We thus propose a novel simulation-based methodology capable of providing expedient and rigorous Bayesian model comparison forecasts without relying on restrictive assumptions.

Introduction.-Jeffrey and Wandelt [1] recently proposed Evidence Networks. These are classifier artificial neural networks trained on data generated from two competing models using specific loss functions, so that the output of the network converges to an invertible function of the Bayes ratio [2] between the two models. Consequently, these networks can be used for simulation-based Bayesian model comparison, as illustrated in various examples by those authors. Here, we highlight a promising use case of these networks not mentioned in that original paper: forecasts of Bayesian model comparison analyses for upcoming experiments.
Forecasts are an essential part of science. Estimates of required sensitivities guide the design of experiments, and whether these experiments can probe different theoretical models informs which models time and resources are spent developing. Furthermore, funding agencies use anticipated scientific conclusions as part of their decision-making. As a result, fast, accurate, and reliable forecasting techniques play a vital role in the modern scientific method.
Due to their importance, many techniques exist to perform forecasts [e.g. 3-8]. Within a Bayesian paradigm, arguably the most accurate forecasting technique is to generate mock data and perform a Bayesian analysis on it as if it were experimental data [e.g. 9, 10]. However, such an analysis is typically computationally costly, limiting its application to a few mock data realizations. Hence, these analyses often break a central tenet of Bayesian statistics by drawing conclusions based on a small number of fiducial parameter values, which may lead to erroneous conclusions if the mock data used is not representative of reality.
Current analysis methodologies for Bayesian model comparison are thus either too slow for fully Bayesian forecasts to be feasible (e.g., nested sampling) or only applicable in specific cases (e.g., Savage-Dickey forecasts). In this letter, we propose utilizing simulation-based inference in the form of Evidence Networks [1] to overcome the constraints of current methods [4,11], enabling expedient, fully Bayesian forecasts on any scientific question formulated as a condition on the Bayes ratio between two models. Such questions include: whether a signal can be detected from within noise; whether two competing theories can be distinguished by expected data; or at what sensitivity level additional physics will be required within a model. We then demonstrate the technique by finding the a priori chance of detection of the global 21-cm signal by a REACH-like experiment [32], an analysis that would have been computationally impracticable using traditional methods.
Bayesian Forecasting.-A traditional Bayesian analysis [see 2, for a more detailed discussion] starts with a model M that takes some parameters θ. From the model and experimental considerations, a likelihood L(D|θ, M) of observing data D is constructed. In addition, an initial measure, the prior π(θ|M), is put over the parameter space to quantify the a priori knowledge of the parameter values. Then, given some (mock) observed data, the measure over the parameter space is updated via an application of Bayes' theorem into the a posteriori knowledge of the parameter values, the posterior

P(θ|D, M) = L(D|θ, M) π(θ|M) / Z. (1)

Here Z is the Bayesian evidence,

Z = ∫ L(D|θ, M) π(θ|M) dθ, (2)

which serves both as a normalization constant and as a natural goodness-of-fit statistic for comparing competing models of the same data. Now suppose there are competing models, M_0 and M_1, with corresponding Bayesian evidences Z_i. If we have an a priori belief P(M_i) in each model, then for each model the a posteriori belief P(M_i|D) in the model is found using a further application of Bayes' theorem,

P(M_i|D) = Z_i P(M_i) / P(D). (3)

Cancelling P(D) between the above with i = 0 and i = 1 then gives the Bayesian model comparison equation

P(M_1|D) / P(M_0|D) = (Z_1 / Z_0) [P(M_1) / P(M_0)]. (4)

This equation shows that after observing some data D, the relative belief in the two models is updated by the ratio of their evidences: the belief in the higher-evidence model increases, and the belief in the lower-evidence model decreases. It is common to assume initially that the two models are equally likely, P(M_1) = P(M_0), in which case equation (4) simplifies to

P(M_1|D) / P(M_0|D) = Z_1 / Z_0 ≡ K, (5)

so that the Bayes ratio K of the model evidences is simply the relative belief in model 1 compared to model 0 after observing some data. The above can also be rearranged to find the posterior belief in a model,

P(M_1|D) = K / (1 + K), (6)

which can be compared to a statistical threshold to determine if one model is significantly preferred over the other.
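For reference, the relationship between K, the posterior belief, and an n σ detection threshold can be sketched in a few lines. This is an illustration in our own notation: the function names are ours, and we assume equal prior odds and a two-sided Gaussian tail convention for converting σ levels to probabilities.

```python
import math

def posterior_belief(K):
    """Posterior probability of model 1, P(M1|D) = K / (1 + K),
    assuming equal prior odds P(M1) = P(M0)."""
    return K / (1.0 + K)

def log_K_crit(n_sigma):
    """Bayes-ratio threshold equivalent to an n-sigma (two-sided)
    Gaussian detection: require P(M1|D) > 1 - p, where p is the
    two-sided tail probability, i.e. K > (1 - p) / p."""
    p = 1.0 - math.erf(n_sigma / math.sqrt(2.0))  # two-sided tail probability
    return math.log((1.0 - p) / p)
```

Under this convention, log_K_crit(3) evaluates to approximately 5.91 and log_K_crit(5) to approximately 14.4, matching the thresholds quoted later in the letter.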
With the mathematical notation of Bayesian analysis established, we can return to the central topic of this letter: forecasts. For many scientific applications [11,14], the question of interest is best posed as whether a proposed experiment could distinguish between two competing models. A traditional Bayesian forecast of such a question would go through the following steps. First, the models and corresponding likelihoods are established.
Then, mock data is simulated from one of the models by choosing some fiducial parameter values θ, adding noise η, and including experimental effects. Z is evaluated for each model, treating the mock data as actual observations, from which K (or equivalently P(M_i|D)) is calculated. Finally, K is compared to some condition to determine whether the favouring of a model is statistically significant and so the models are distinguishable.
The conclusion of the outlined forecast is hence based on a condition on K of the form K > K_crit, with K_crit a statistical significance threshold. However, K is conditional on the fiducial parameters chosen to generate the mock data and the noise added, so it is more accurate to state the condition as

K(θ, η) > K_crit. (7)

Using such a criterion presents two major issues (previously identified and discussed in Mukherjee et al. [11] and Trotta [4]). Firstly, the result is dependent on the random noise η, leading to differences in conclusion purely based on chance. Secondly, the conclusions stem from a single point θ in the parameter space, leading to differences based on the choice of fiducial model (anathema to Bayesian statistics). A central tenet of Bayesian statistics is that particular sets of parameter values, even those that maximize the likelihood or posterior, have zero measure and thus are vanishingly unlikely and should not be used to draw conclusions. Instead, in Bayesian statistics, conclusions should be reached by marginalizing over the parameter space weighted by some measure, typically the prior or posterior. This issue is further compounded in the case of forecasts, as there is often limited advance knowledge of the parameter values, leading to the conclusion drawn potentially varying widely between choices of θ.
A rigorously Bayesian approach would be to marginalize the condition in equation (7) over θ and η. Such a procedure would give the expected probability of the condition holding, and thus of drawing some scientific conclusion, rather than a binary yes-or-no answer. Since the prior represents our knowledge of the parameters before any data is measured, it is the natural measure for a forecasting analysis. So, for our example, we should calculate

P(K > K_crit) = ∫∫ Θ[K(θ, η) − K_crit] π(θ|M) P(η) dθ dη, (8)

where Θ is the Heaviside step function, to find a fully Bayesian forecast of our expected chances of distinguishing between the models at the statistical significance set by K_crit. More generally, the condition in equation (8) could be replaced with any other condition on K to perform a fully Bayesian forecast on drawing the corresponding scientific conclusion. In addition, equation (8) can be modified to produce fully Bayesian forecasts for a broader range of scientific questions. For example, when appropriate to the question being posed, the marginalization should be performed over the model prior as well (this is equivalent to marginalizing over the predictive posterior odds distribution introduced in Trotta [4]), or over conditional priors. Thus, fully Bayesian forecasts can in theory be used to give principled and interpretable answers to a range of forecasting questions.
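As a concrete, deliberately toy illustration of the marginalization in equation (8), consider a nested linear-Gaussian pair of models for which K is analytic; the marginalization over the prior and noise then reduces to Monte Carlo counting of threshold crossings. All specifics below (signal template, noise level, prior width) are our own invented stand-ins, not the letter's 21-cm models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model pair:
#   M0: d = eta,             white noise eta ~ N(0, sigma^2 I)
#   M1: d = theta * s + eta, amplitude prior theta ~ N(0, sigma_theta^2)
# For this nested linear-Gaussian pair the Bayes ratio K is analytic.
N, sigma, sigma_theta = 50, 1.0, 1.0
s = np.ones(N)  # known signal template
a = 1.0 + sigma_theta**2 * (s @ s) / sigma**2

def log_K(d):
    """log(Z1/Z0), via the matrix determinant lemma / Sherman-Morrison."""
    return -0.5 * np.log(a) + 0.5 * sigma_theta**2 * (s @ d)**2 / (sigma**4 * a)

# Monte Carlo marginalization over the prior pi(theta) and noise P(eta):
# the fraction of mock data sets with K above K_crit estimates the
# a priori probability of a detection.
n_mc = 20000
theta = rng.normal(0.0, sigma_theta, n_mc)
eta = rng.normal(0.0, sigma, (n_mc, N))
log_Ks = np.array([log_K(t * s + e) for t, e in zip(theta, eta)])
p_detect = np.mean(log_Ks > 5.91)  # 3-sigma threshold, as in the letter
```

The bottleneck in a realistic problem is exactly the log_K call, which is why the letter replaces it with an evidence network evaluation.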
However, performing the above average over the parameters and noise (and potentially models) in practice would require thousands or millions of mock data sets and, in turn, thousands or millions of Bayes ratio calculations. Mukherjee et al. [11] proposed using Nested Sampling [33,34] to calculate the K values for fully Bayesian model comparison forecasts, but in practice this is typically prohibitively computationally expensive, even for simple problems. To circumvent these computational limitations, Trotta [14] proposed using the Savage-Dickey density ratio [17,35] to rapidly evaluate K values between nested models, facilitating a fully Bayesian forecast of whether the Planck satellite could detect a deviation of the scalar spectral index of primordial perturbations n_s from 1. The use of the Savage-Dickey density ratio for model comparison forecasts is analogous to the widespread usage of Fisher forecasts to perform parameter constraint forecasts over uncertain data spaces. Both techniques utilize the analytic results available for linear models and Gaussian likelihoods to calculate approximate analytic parameter posteriors [e.g. 18] or the Bayes ratio between models. However, the reliability of these results is conditional on the accuracy of the implicit linearization and Gaussianity assumptions [see 30,31]. Furthermore, usage of the Savage-Dickey density ratio requires the models being compared to be nested. Thus, if we had nested models, an explicit likelihood, and knew the above assumptions to hold well, the Savage-Dickey density ratio would suffice to perform fully Bayesian model comparison forecasts. However, these requirements significantly limit the usage of such a methodology. Reliable and widely applicable fully Bayesian forecasts will hence require a novel methodology that maintains the expedience of Savage-Dickey forecasts but is applicable to non-nested models and avoids the same assumptions.
Evidence Networks.-Evidence networks are a type of classifier artificial neural network introduced recently in Jeffrey and Wandelt [1]. They take in simulated data D and output a single value f_EN(D), with training performed on data generated from two models (labels m = 1 and m = 0). The principal insight behind these networks is that, for a broad category of loss functions, the evidence network's output converges toward an invertible function of the Bayes ratio K(D) between model 1 and model 0. Hence, if such a network is converged, the Bayes ratio between the two models for any set of 'observed' data can be directly calculated from the network's output.
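This convergence property can be checked analytically for the plain exponential loss l(s, m) = exp((1/2 − m) s), a member of the family considered by Jeffrey and Wandelt [1] (the letter's network uses the related α = 2 l-POP exponential loss). For data where model 1 has posterior probability p1, the expected loss g(s) = p1 exp(−s/2) + (1 − p1) exp(s/2) is minimized at s = log[p1/(1 − p1)] = log K, assuming equal model priors in the training data. A minimal numerical check:

```python
import numpy as np

# For a fixed data point D, let p1 = P(m=1 | D).  Under the exponential
# loss l(s, m) = exp((1/2 - m) * s), the expected loss is
#   g(s) = p1 * exp(-s/2) + (1 - p1) * exp(s/2),
# whose minimum sits at s* = log(p1 / (1 - p1)) = log K
# (for equal model priors in the training data).
p1 = 0.8
s_grid = np.linspace(-5.0, 5.0, 100001)
g = p1 * np.exp(-s_grid / 2) + (1 - p1) * np.exp(s_grid / 2)
s_star = s_grid[np.argmin(g)]          # numerical minimizer
analytic = np.log(p1 / (1 - p1))       # log K for this p1
```

Setting the pointwise derivative of g to zero gives exp(s) = p1/(1 − p1) directly, which is why a converged network's score can be read off as log K.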
As established in the previous section, efficient evaluation of K for a wide range of mock data is the main obstacle to performing practicable fully Bayesian forecasts. Since evaluating a neural network on a GPU is orders of magnitude faster than direct Bayesian evidence calculation techniques, utilizing evidence networks may circumvent this computational limitation. We thus propose an alternative evidence-network-based methodology for performing fully Bayesian forecasts:
• First, create simulators of mock data from the two competing models (including experimental considerations such as noise or selection effects).
• Then, using the two simulators, generate a training set and validation set of labelled mock data and train the evidence network.
• Validate the evidence network using a blind coverage test [1] to verify the network's accuracy and, if possible, also compare its output to a sample of K values calculated using traditional Bayesian techniques. This step is essential: for data manifolds too complex to be classified by the chosen network architecture, or for insufficient training data sets, the converged network will not correspond to an accurate calculation of K. If validation indicates the network has not accurately converged, the network architecture or training will need to be refined and the previous steps repeated.
• Finally, evaluate equation (8), or equivalent, using the network and the previously developed simulators to efficiently evaluate K over π and η (and potentially M_i).
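The steps above can be miniaturized into a self-contained toy, with the caveat that everything here is a stand-in: a logistic regression trained with cross-entropy replaces the letter's deep network and l-POP loss, and the two "models" are unit-variance Gaussians with means 0 and 1, for which log K(x) = x − 1/2 exactly. With cross-entropy, the converged output is sigmoid(log K) [1], so the pre-sigmoid score s(x) = w x + b should recover log K, i.e. w ≈ 1 and b ≈ −1/2.

```python
import numpy as np

rng = np.random.default_rng(1)

# Step 1-2: simulate labelled training data from the two toy models.
#   m = 0: x ~ N(0, 1)   and   m = 1: x ~ N(1, 1)
n = 20000
x = np.concatenate([rng.normal(0.0, 1.0, n), rng.normal(1.0, 1.0, n)])
m = np.concatenate([np.zeros(n), np.ones(n)])

# "Network": logistic regression s(x) = w*x + b trained with
# cross-entropy by full-batch gradient descent.
w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    s = w * x + b
    p = 1.0 / (1.0 + np.exp(-s))   # sigmoid(s) -> P(m=1|x) at convergence
    grad_s = p - m                 # d(cross-entropy)/ds
    w -= lr * np.mean(grad_s * x)
    b -= lr * np.mean(grad_s)

# Step 3 (validation): the analytic Bayes ratio for this model pair is
# log K(x) = x - 0.5, so a converged classifier should find w ~ 1,
# b ~ -0.5, and s(x) can then be used for the step-4 marginalization.
```

A full evidence network replaces this linear score with a deep network, but the logic of the four steps is unchanged.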
We anticipate this approach to be highly performant relative to the traditional Bayesian approach, since each K evaluation no longer requires an exploration of the parameter space. Furthermore, our methodology does not require the models used to be nested or to satisfy the restrictive assumptions of the Savage-Dickey forecasting method, and only needs efficient mock data simulators rather than explicit likelihoods. As a result, it can even be applied in simulation-based inference contexts where no closed-form likelihood exists. We thus also anticipate it to be widely applicable.
An Application to Cosmology.-We have motivated fully Bayesian forecasts and argued that evidence networks may make performing one feasible. Here, we demonstrate this feasibility while also illustrating some of the insights gained from such an analysis.
The problem we tackle is determining the expected chances of a 21-cm global signal experiment making a definitive detection. Through the redshift evolution of the 21-cm global signal, the thermal evolution of the Universe from recombination to reionization can be traced, giving insights into cosmology and structure formation [see 36, for a more detailed introduction to the field]. However, measurement of the signal is challenging, as its expected magnitude is five orders of magnitude below that of galactic foregrounds. As a result, there are currently no definitive detections of the global 21-cm signal. The EDGES 2 experiment claimed a signal detection [37], but this is disputed [e.g. 38] due to the signal not matching theoretical expectations [e.g. 39], being better fit by the presence of a systematic [e.g. 40], and the null detection from the SARAS 3 experiment [41]. Because of this lack of a definitive detection, there are several ongoing and proposed experiments to try to measure the sky-averaged 21-cm signal, e.g. REACH [32], PRIZM [42], and EDGES 3. The question then naturally arises: what is the a priori expectation of a global signal experiment with a given sensitivity making a definitive detection of the uncertain 21-cm signal? Since a significant signal detection can be determined to occur when K between a model with a signal and a model without a signal exceeds some threshold K_crit, this question can be rigorously answered through a fully Bayesian forecast of the form given in equation (8).
Hence, following our fully Bayesian forecasting methodology outlined above, we began by constructing the two mock data simulators, one that modelled a global 21-cm signal and one without. We imagined a REACH-like global 21-cm signal experiment with a frequency band covering redshift 7.5 to 28.0 and an analysis spectral resolution of 1 MHz. For both simulators, we used the physically motivated galactic and ionospheric foreground model from Hills et al. [38], and assumed Gaussian white noise with magnitude σ_noise = 0.015 K. For our global 21-cm signal model, we used globalemu [43], an emulator of a more computationally costly semi-numerical simulation code [e.g. 44-46]. This model has seven parameters: star formation efficiency f_*, minimum circular virial velocity V_c, X-ray emission efficiency f_X, optical depth to the cosmic microwave background τ, exponent α and lower cutoff E_min of the X-ray spectral energy distribution, and the maximum mean free path of ionizing photons R_mfp. We specified our a priori knowledge of these parameters and the foreground parameters through our priors listed in table I. The astrophysical priors used are uniform or log-uniform distributions centred on theoretically expected values, except for τ, for which we used a truncated Gaussian prior based on the Planck 2018 posterior on τ [47]. For the foreground parameter priors, we used physically restricted priors following Hills et al. [38]. Our two competing models thus have moderate dimensionalities (5 and 12) and depend nonlinearly on their parameters, leading to complex data manifolds. In addition, this particular classification task is made more challenging by the large shared foreground component of the two models, which is 10^4 to 10^5 times larger than the signal.
To train our evidence network, we generated 32,000,000 (12,800,000) mock data sets from each simulator to form our training (validation) set. Our evidence network was implemented in tensorflow [48], and consisted of an initial Cholesky whitening transform [49] followed by dense hidden layers of size 256-256-64-64-64-64-64-64-1, with batch normalization and ReLU activation functions on all layers except for the output node, and an additive skip connection between the third and sixth layers to ease training. We used the α = 2 l-POP exponential loss function recommended for evidence networks by Jeffrey and Wandelt [1] and the Adam optimizer with an initial learning rate of 10^-3, decay steps of 10^5, and a decay rate of 0.95.
The network was trained with early stopping for 900 epochs using a batch size of 32,768.
To validate that the network had converged to an accurate prediction of K, we performed a blind coverage test as outlined in Jeffrey and Wandelt [1]. This test consists of using the network to predict the posterior model probabilities for a range of test data sets, and then binning the data sets by these probabilities. If the network has correctly converged, the model probability corresponding to each bin should equal the proportion of the data sets in the bin generated from said model, which we indeed find to be the case for our testing set composed of 1,000,000 mock data sets generated from each model. Furthermore, as in this case we can construct an explicit likelihood, we additionally validated our network and methodology by comparing whether a significant detection would be concluded based on the network K values and on K computed using polychord [50,51]. For a signal detection threshold of 3 σ (5 σ), corresponding to log(K_crit) = 5.91 (log(K_crit) = 14.4), we found the two methods came to the same conclusion on whether a signal was detected for 96.6% (95.1%) of 1000 mock data sets from our noisy-signal model, showing good agreement.
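The blind coverage test described above can be sketched as follows. Here, a synthetic, perfectly calibrated set of predicted probabilities stands in for the network output P(M_1|D) = K/(1 + K); for a converged network, each bin's mean predicted probability should match the empirical fraction of model-1 data sets in that bin.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in: draw "predicted" probabilities q = P(m=1|D), then
# labels m ~ Bernoulli(q), mimicking a perfectly calibrated network.
n = 200000
q = rng.uniform(0.0, 1.0, n)
m = rng.uniform(0.0, 1.0, n) < q

# Blind coverage test: bin the data sets by predicted probability and
# compare each bin's mean prediction to the empirical fraction of
# label-1 data sets in that bin.  Large deviations would indicate the
# network output is not an accurate model posterior.
bins = np.linspace(0.0, 1.0, 11)
idx = np.digitize(q, bins) - 1  # bin index 0..9 for each data set
max_dev = max(abs(q[idx == k].mean() - m[idx == k].mean())
              for k in range(10))
```

In the real test, q would come from evaluating the trained evidence network on a held-out set of labelled simulations.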
Finally, we evaluated equation (8) using our evidence network and 1,000,000 sets of noisy-signal model mock data. Ultimately, we found the expected chance of the experiment detecting the global 21-cm signal at a statistical significance of 3 σ (5 σ) was 46.0% (32.4%). This approximately 50% chance of a 21-cm signal detection at > 3 σ confidence suggests the 0.015 K sensitivity at 1 MHz resolution considered here is indicative of the minimum sensitivity global signal experiments should target.
Additionally, our 1,000,000 K evaluations allow for a broad range of further analyses at little to no additional computational cost. Instead of performing the average over π in equation (8), we can marginalize over conditional priors with one or two parameters fixed, giving insight into which early Universe astrophysical scenarios we would expect to detect the 21-cm signal in. These conditional detection probabilities are depicted in figure 1 for the parameters with which strong variation in detection probability was seen: f_*, f_X, and τ. We find the detection of the 21-cm signal is more likely for high star formation efficiencies, X-ray efficiencies around the theoretically expected value of 1, and higher optical depths to reionization, with an almost 100% chance of a 3 σ detection for f_* > 0.01 and f_X = 1, and an effectively 0% chance of a 3 σ detection for f_* < 0.001 or f_X > 30. This strong variation in definitive detection chances retroactively provides further motivation for fully Bayesian forecasts, as we see the conclusion drawn would be highly sensitive to the fiducial global 21-cm signal parameters chosen for a traditional Bayesian forecast.
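The conditional analysis described above amounts to binning the already-computed prior samples and their detection flags by parameter value, so no further K evaluations are required. A minimal sketch with synthetic samples; the quadratic detection law below is purely illustrative, not a model of the 21-cm problem:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-in for the forecast outputs: prior samples of one
# parameter and their detection flags (K > K_crit), as would be
# produced by evaluating the evidence network on mock data sets.
# Detection is made (artificially) more likely at high theta.
n = 100000
theta = rng.uniform(0.0, 1.0, n)
detected = rng.uniform(0.0, 1.0, n) < theta**2

# Conditional detection probability P(detect | theta in bin): bin the
# existing samples -- no new K evaluations are needed.
bins = np.linspace(0.0, 1.0, 11)
idx = np.digitize(theta, bins) - 1
p_cond = np.array([detected[idx == k].mean() for k in range(10)])
```

Repeating this over pairs of parameters yields the two-parameter conditional probabilities shown in the below-diagonal panels of a triangle plot like figure 1.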
There are a myriad of further analyses we could perform with our methodology. For example, we could study the evolution of the detection probability with the experiment's noise level or bandwidth, which could, in turn, help inform experimental design. Additionally, we could consider the functional distribution of the detected and non-detected limits [52,53], or the SARAS 3 null detection [41,54]. Furthermore, if this were a problem where particular parameter values were of special interest (e.g., the minimum sum of the neutrino masses permitted by particle physics), this methodology also allows for determining the detection chance with those parameters fixed to that value while still marginalizing over noise and nuisance parameters. However, we shall leave such analyses to future work, since the focus of this letter is on the method and its feasibility rather than particular scientific problems.
Let us return now to the computational performance of our method. The training data generation, network training, forecast data generation, network K evaluations, and plotting of figure 1 took a combined total of 5.54 GPU hours. Conversely, the 1000 polychord evaluations of K performed as part of our validation process took a total of 45,000 CPU hours, from which we can estimate that the 1,000,000 K evaluations used in our fully Bayesian forecast would have required 45,000,000 CPU hours using traditional methods. While it is not meaningful to directly compare GPU to CPU hours, we can compare the costs of those hours. On the cluster we utilized, GPU hours are charged at 50 times the rate of CPU hours. Thus, our method gives a cost-weighted performance improvement of 10^5.2. Since this performance gain was for a single problem, and our implementation was not optimized, this level of performance gain cannot be assumed to apply universally. However, it is indicative that our methodology is highly performant compared to traditional techniques and, as we have directly demonstrated, facilitates analyses that were previously computationally prohibitively expensive.
Conclusions.-We have argued, like Mukherjee et al. [11] and Trotta [4] before us, that to arrive at accurate and interpretable predictions, the conclusions of scientific forecasts should be marginalized over any uncertain model parameters and noise realizations. However, such fully Bayesian forecasts are computationally infeasible with traditional methods for model comparison forecasts. We thus propose a novel methodology for performing fully Bayesian forecasts based on Evidence Networks.
To illustrate our method and the insights that can be gained from fully Bayesian forecasts, we applied it to determine the chances of a REACH-like experiment detecting the global 21-cm signal from beneath foregrounds and noise. For a frequency resolution of 1 MHz and a noise level of σ_noise = 0.015 K, we find a 46.0% (32.4%) chance of detection at 3 σ (5 σ), suggesting this noise level is indicative of the minimum sensitivity global 21-cm signal experiments should target. Additionally, our methodology allows us to produce triangle plots of how this chance of detection varies when one or two model parameters are fixed, at no extra computational cost. For this example problem, we find a cost-weighted speed-up of 10^5.2 using our approach compared to a traditional nested-sampling-based method that would have taken 45,000,000 CPU hours.
The method we propose can be applied to any forecasting question that can be formulated as a condition on the Bayes ratio between two models. This includes: whether a signal can be detected from within noise (e.g. gravitational waves [55], or the 21-cm signal [36]); whether two competing theories can be distinguished by anticipated data (e.g. MOND [56] or General Relativity [57]); or whether the inclusion of novel physics in a model will be necessary (e.g. neutrinos in CMB experiment analysis [26]). Additionally, the method only requires simulators of mock data, and thus can still be used in cases where closed-form likelihoods or explicit priors are not available. As a result, this methodology should allow for reliable and efficient fully Bayesian forecasts on a wide range of forecasting problems.
Since the proposed methodology is simulation-based and has a low computational cost, we anticipate it will be particularly suited to performing forecasts for a range of potential experimental configurations, thus allowing for the optimization of experimental configurations to minimize cost or maximize the chance of detecting new physics. To facilitate the application of this methodology by others, we make public on GitHub all codes and data used in the writing of this letter.

FIG. 1. Triangle plot depicting the probability of a global 21-cm signal detection at ≥ 3 σ statistical significance, by an experiment covering redshift 7.5 to 28.0, with frequency resolution ∆ν = 1 MHz, assuming the Hills et al. [38] physical foreground model and white noise of σ_noise = 0.015 K. The diagonal shows the total probability of a detection marginalized over noise realizations and the parameter space except for one fixed parameter, and the below-diagonal panels show the equivalent with two fixed parameters. The total detection probability with no parameters fixed was 46.0%. V_c, α, E_min, and R_mfp are not shown because the probability of detection was found to vary only weakly with them. We find high f_*, f_X ≈ 1, and high τ values increase the chance of a 21-cm signal detection, with f_* the most impactful. For different parameter combinations, we find detection probabilities varying from ∼ 0% to ∼ 100%, illustrating the need for fully Bayesian forecasts, as the conclusion of detectability varies strongly over the a priori uncertain parameter space.

TABLE I. Priors on our foreground and 21-cm global signal parameters. To keep the parameter values within the training parameter ranges of globalemu, we use a truncated Gaussian prior on τ derived from the Planck 2018 measurements rather than a Gaussian prior.
Thus, by combining noise generators, our analytic foreground model [38], globalemu, and samplers over our priors, we constructed simulators of mock global 21-cm signal data from a model with only noise and foreground (no-signal) and a model with noise, foreground, and signal (with-signal).