Real-time gravitational-wave science with neural posterior estimation

We demonstrate unprecedented accuracy for rapid gravitational-wave parameter estimation with deep learning. Using neural networks as surrogates for Bayesian posterior distributions, we analyze eight gravitational-wave events from the first LIGO-Virgo Gravitational-Wave Transient Catalog and find very close quantitative agreement with standard inference codes, but with inference times reduced from O(day) to a minute per event. Our networks are trained using simulated data, including an estimate of the detector-noise characteristics near the event. This encodes the signal and noise models within millions of neural-network parameters, and enables inference for any observed data consistent with the training distribution, accounting for noise nonstationarity from event to event. Our algorithm -- called"DINGO"-- sets a new standard in fast-and-accurate inference of physical parameters of detected gravitational-wave events, which should enable real-time data analysis without sacrificing accuracy.

Introduction.-Since the first detection of a signal from a pair of merging black holes [1], gravitational waves have quickly emerged as an important new probe of gravitational theory [2], neutron-star physics [3], cosmology [4], and black-hole astrophysics [5]. These scientific successes were made possible by a growing rate of detections by the LIGO [6] and Virgo [7] observatories, and their subsequent analysis and characterization as signals from merging compact binary systems. The LIGO and Virgo Collaborations (LVC) have now published results from over 50 such systems [8,9], and this number promises to grow ever-faster as detectors are made more sensitive in the future [10].
Given a detection, Bayesian inference is used to characterize the originating source [11]. This is based on having models for the signals and the detector noise. For gravitational waves, signal models take the form of waveform predictions h(θ) depending on the source parameters θ (masses, location, etc.). Waveform models are based on solutions to Einstein's equations (and any relevant matter equations) for the two-body dynamics and gravitational radiation, using a combination of numerical-relativity and perturbative calculations [12][13][14] and phenomenological fitting [14][15][16]. Detector noise is typically modeled as stationary and Gaussian, with some spectrum which can be estimated empirically. Together, these "forward" models give rise to the likelihood p(d|θ) for the observed strain data d, which is assumed to consist of a signal plus noise. With the choice of a prior p(θ) over parameters, the posterior distribution is given via Bayes' theorem, where p(d) is a normalizing factor called the evidence. The posterior gives our belief about the source parameters, given the observed data. The task of inference is to characterize the posterior by drawing samples from it. This can be accomplished with stochastic algorithms like Markov chain Monte Carlo (MCMC). The LVC have developed software tools such as LALInference [17] and Bilby [18][19][20] to carry this out. However, these algorithms are computationally expensive as they require many likelihood evaluations for each independent posterior sample θ ∼ p(θ|d), and each likelihood requires a waveform simulation. An analysis producing ∼ 10 4 independent samples typically requires millions of waveform evaluations and a total inference time of hours to months, depending on the signal duration and waveform model. More physically-realistic waveform models [21] are also more costly, so carrying out inference for all events with the best models is an enormous computational effort. When rapid results are desired-for alerts to trigger electromagnetic follow-up of transient phenomena [22], or when processing large numbers of events-accuracy usually has to be traded off for speed, by restricting to a limited set of fast models [23,24] or specialized inference algorithms [25][26][27].
In this Letter, we describe an alternative approach to gravitational-wave inference which delivers both dramatically reduced analysis time and high accuracy, in stark contrast to the trade-off intrinsic to standard algorithms. The basic idea is to produce a large number of simulated data sets (with associated parameters), and use these to train a type of neural network known as a normalizing flow to approximate the posterior. The trained network can then generate new posterior samples extremely quickly once a detection is made. This bypasses the need to generate waveforms at inference time, thereby amortizing the expensive training costs over all future detections. The general approach of building such "surrogate" inverse models is called neural posterior estimation (NPE) [28][29][30], and is beginning to see application in several scientific domains [31]. When applied to gravitational waves, with all of the optimizations we describe, we call the method Deep INference for Gravitational-wave Observations, or DINGO.
NPE and conventional methods both involve the same inputs: a prior and a likelihood. A key difference, however, is the way in which the likelihood is used: for conventional methods, its density is evaluated, whereas for NPE it is used to simulate data, i.e., d ∼ p(d|θ). This distinction is important when dealing with nonstationary or non-Gaussian detector noise, for which an analytic likelihood is either expensive or unavailable. In this case, one could nevertheless simulate data, in a noise-model-independent way, by injecting simulated signals into real noise. Our present focus is on speed and on validating DINGO on real data with the common assumption of stationary-Gaussian noise, but the ultimate aim of more accurate inference using real noise should be kept in mind.
There have been several previous studies that applied NPE or related approaches to gravitational waves [32][33][34][35][36][37][38][39]; see also [40]. However, most of these are limited in some way: they either restrict the number of parameters or the distributional form of the posterior, or they do not analyze real data, or there are clear deviations from results obtained using standard algorithms. The best performance to-date was achieved in the study [36] by some of us. This was the only study to infer all 15 parameters 1 of a binary black hole (BBH) system in real data and demonstrate close agreement to standard samplers. However, even that study did not achieve full amortization, as it did not address the fact that detector noise varies from event to event. Rather, the neural network of [36] was tuned to the noise power spectral densities (PSDs) of the detectors at the time of the event analyzed, and it would require retraining for each new event.
We now present for the first time completely amortized inference for BBHs using DINGO. This is achieved by conditioning the neural network not only on the event strain data, but also on the detector noise PSD, which can be estimated using nearby data [17]. We also achieve unprecedented accuracy thanks to a new iterative algorithm for time-shifting the coalescence times, as well as various architecture improvements. We use our trained networks to analyze all events in the first Gravitational-Wave Transient Catalog (GWTC-1) [8] with component masses greater than 10 M ⊙ (our prior bound) and find close (sometimes indistinguishable) quantitative agreement with standard algorithms. This Letter sets a new standard for rapid gravitational-wave inference, which 1 Parameters consist of detector-frame component masses (m 1 , m 2 ), time of coalescence at geocenter tc, reference phase ϕc, sky position (α, δ), luminosity distance d L , inclination angle θ JN , spin magnitudes (a 1 , a 2 ), spin angles (θ 1 , θ 2 , ϕ 12 , ϕ JL ) [41], and polarization angle ψ. should enable real-time gravitational-wave science in the near future. It shows that NPE has moved beyond toy models and is competitive with conventional algorithms. More broadly, it provides a demonstration of these new methods in a realistic use case, which we hope will inspire wider adoption in experimental science.
Method.-The central object of DINGO is the densityestimation neural network, which defines a conditional probability distribution q(θ|d). This should be distinguished from the posterior p(θ|d), which q(θ|d) learns to approximate through training. We use so-called normalizing flows [42][43][44] to define a sufficiently flexible q(θ|d) via a d-dependent mapping f d : u → θ from a simple "base" distribution π(u), If π(u) can be rapidly evaluated and sampled from, and if f d is invertible and has simple Jacobian determinant, then q(θ|d) can also be rapidly evaluated and sampled from. Following [36], we take π(u) to be multivariate standard normal, and f d a composition of spline coupling flows [45], each of which is defined with a neural network. The overall structure of DINGO is illustrated in Fig. 1. This contains three key enhancements compared to the study [36]. First, since the data generation process depends on the detector noise PSD S n , we include this as additional context to the neural network, i.e., q(θ|d, S n ). This allows us to tune the network at inference time to the PSD estimated just prior to the event, corresponding to standard "off-source" noise estimation [17]. An alternative would be to estimate the noise "on-source" [46], but since we consider only short-duration BBH events here, the off-source approach is sufficient.
The second enhancement addresses the problem of highdimensional observed data by using an additional neural network to first compress to a small number of features. This network (called an "embedding network") is trained alongside the flow network. Our data is in the frequency domain, between 20 Hz and 1024 Hz, with 0.125 Hz resolution, so combined with the PSDs, this gives 24,096 input dimensions for each of the two or three interferometers. The first stage of the embedding network maps this linearly to 400 components per detector. To provide an inductive bias to extract signal information, we seed this layer with the principal components of clean waveforms from our training set, and then allow these parameters to float during training. Following this, a fully-connected residual network [47] compresses to 128 features, which are provided to the flow.
Finally, we developed a new method to treat time translations of the strain data. For standard algorithms, inference of (α, δ, t c ) requires sampling over waveforms with varying coalescence times t I in each detector I. Likewise for NPE, the network must learn to interpret strain data with different t I . For frequency-domain data, however, time translations correspond to local phase shifts, which, although explicitly known, are challenging for neural networks to learn. Indeed, this occupied much of the network capacity in Ref. [36]. Our new approach-called group equivariant neural posterior estimation (GNPE)leverages explicit knowledge of the time-translation symmetry along with approximate knowledge of t I to simplify the data representation and allow the network to focus on more nontrivial parameters. For further details see [48].
For GNPE, we train the network to infer θ given perturbed coalescence times τ I and manually-time-shifted strain data d −τ I . Using maximum likelihood estimation [49], this means we minimize the loss function with respect to the network parameters [50]. Here, E refers to the expected value over the specified distribution, which is evaluated stochastically using Monte Carlo draws. κ(δt I ) is a uniform kernel used to perturb t I . For inference, even though we do not have direct access to t I , all parameters can be inferred using Gibbs sampling starting with an approximate t I (obtained, e.g., using standard NPE): first, convolve t I with κ(δt I ) to obtain τ I , then use the network to infer a new estimate for t I ; then convolve again and repeat. We find that this converges after O(10) iterations. Evaluating (3) requires sampling θ (i) ∼ p(θ) and S  closely. In particular we use the same prior over parameters, with m 1 , m 2 ∈ [10, 80] M ⊙ . We train separate networks for the noise distributions in the first (O1) and second (O2) observing runs of LIGO and Virgo, with PSD samples estimated empirically from stretches of interferometer noise data [51]. Mpc. In addition to these two-detector networks, we train a three-detector network with distance prior [100, 1000] Mpc to analyze GW170814. With future enhancements of network architecture we expect to cover the entire distance range with a single network. Finally, as in Ref. [36], training data are generated from a fixed set of spin-precessing frequency-domain waveforms, described by the IMRPhe-nomPv2 [16,52,53] model, but with extrinsic parameters and noise realizations drawn randomly during training. With training sets of 5 × 10 6 waveforms, there is no indication of overfitting. Training takes roughly 10 days on a single NVIDIA A100. Further details on the networks and training are provided in the Supplementary Material.
Results.-As a first test, we evaluate DINGO on data entirely consistent with the training distribution, i.e., simulated waveforms in stationary-Gaussian noise. This is an easier task than using observational data, which includes real signals in noise that is neither strictly stationary nor Gaussian, and therefore lies outside the training distribution. We sample posteriors from 1000 simulated data sets and construct a P-P plot (see Fig. 2). For each param-  eter, we compute the percentile score of the true value within its marginalized posterior, and then we plot the cumulative distribution function (CDF) of these scores. For true posteriors, the percentiles should be uniformly distributed, so the CDF should be diagonal. Kolmogorov-Smirnov test p-values are indicated in the legend, with combined p-value of 0.46. This shows that DINGO is performing properly on simulated data. We now proceed to our main result, which is a demonstration of performance on real events. We perform inference on the eight GWTC-1 BBH events compatible with our prior, using both DINGO and LALInference MCMC. For DINGO, generation of 50,000 sample points with 30 GNPE iterations takes roughly 20 seconds. Comparisons of inferred component masses and sky position for all events show good agreement (see Fig. 3), including multimodality for the sky position. The one exception is GW170104, where the mass posterior is slightly flatter. Nevertheless, 90% credible intervals are in good agreement.
For quantitative comparisons, we compute the Jensen-Shannon divergence (JSD) [54] between DINGO and LALInference one-dimensional marginalized posteriors (see Fig. 4). This is a symmetric divergence that measures the difference between two probability distributions, with values ranging from 0 to ln(2) ≈ 0.69 nat. We find a mean JSD across all events and parameters of 0.0009 nat, which is slightly higher than the variation (0.0007 nat) found between LALInference runs with identical settings but different random seeds [19]. By comparing such LAL-Inference runs, Ref. [19] also established a maximum JSD of 0.002 nat for indistinguishability; our results are approaching this threshold, with two events below for all parameters, and the others with one to three parameters above. The slight visible disagreement between mass posteriors for GW170104 is also reflected in larger JSDs. For comparison, we note that PSD variations (see Supplemental Material) and the choice of waveform model [19] both impact the JSD at a much higher level (0.02 nat). Additional comparisons between samplers, including posteriors for all events, are provided in the Supplemental Material.
Conclusions.-In this Letter, we introduced DINGO and applied it to perform extremely fast Bayesian parameter inference for gravitational waves observed by the LIGO and Virgo detectors. We analyzed eight GWTC-1 events, and showed excellent agreement with standard algorithms, with inference times reduced by factors of 10 3 -10 4 . This was achieved by conditioning on the detector noise characteristics and making a number of architecture and algorithm improvements. The DINGO code is available at https://github.com/dingo-gw/dingo.
A critical component of DINGO is a new iterative algorithm-GNPE-to partially off-load the modeling of time translations from the neural network. Although convergence of GNPE may take 20 seconds, initial samples with slightly reduced accuracy can, however, be produced in just a few seconds by taking fewer iterations.
Going forward, the next steps are to extend the prior to include longer-duration binary neutron star signals [55] (for which rapid results are especially important to identify electromagnetic counterparts) and to extend to more physically-realistic waveform models, which include higher multipole modes and more accurate spin-precession effects [21]. Long and complex waveforms are much more expensive for standard algorithms, so the relative improvement in performance should be even more significant. If successful, this would also enable the routine use of the most physically-realistic waveforms, resulting in consistently reduced systematic errors. These extensions will likely require somewhat larger networks and improved  data representation or compression. 2 Another natural extension would be to study signals without making the common stationary-Gaussian idealization for the detector noise during the training stage. For DINGO, performing inference with realistic noise is simply a matter of training with simulated signals injected into real noise realizations taken from detectors. Using real noise should lead to improved accuracy that is not possible using standard likelihood-based methods, and would serve as an excellent demonstration of the advantages of NPE. For real-time analysis, it will also be necessary to develop approaches to progressively retrain networks to keep pace with changing data distributions during an observing run, e.g., as detector sensitivity is improved. All of these enhancements, particularly the treatment of nonstationary noise, will be critical for extensions to future observatories such as LISA.
Deep-learning tools are now ready to analyze the vast majority of LIGO/Virgo events. In the past, the primary challenge has been in obtaining sufficiently accurate results, but with DINGO, we have now achieved this in a realistic context. Through planned future extensions, we expect that DINGO could become one of the leading approaches to gravitational-wave inference.
Acknowledgments.-We thank S. Ossokine, M. Pürrer, C. Simpson and P. Züge for helpful discussions. This research has made use of data, software and/or web tools ob-  We perform inference over the full 15D parameter space for quasicircular binary black holes, which includes detector-frame component masses m 1 , m 2 , time of coalescence at geocenter t c , reference phase ϕ c , sky position (right ascension α and declination δ), luminosity distance d L , inclination angle θ JN , spin magnitudes a 1 , a 2 , tilt angles θ 1 , θ 2 , other spin angles ϕ 12 , ϕ JL [41], and polarization angle ψ. Priors are taken to be the same as in Ref. [36]: standard over all angles, and uniform in all other parameters, with m 1 ≥ m 2 , m 1 , m 2 ∈ [10, 80] M ⊙ , a 1 , a 2 ∈ [0, 0.88], and t c ∈ [−0.1, 0.1] s. We found that it was difficult to train a neural network for accurate inference over the entire relevant range of luminosity distance, so we partition the prior as shown in Tab. I. To perturb the signal coalescence times t I for GNPE ,we use a uniform kernel κ(δt I ) in the range [−1, 1] ms.
Training data consist of labeled strain data sets (θ, d) and associated noise power spectral densities S n . To construct the data sets, we first draw samples from the prior, θ ∼ p(θ). This is done in two stages as in Ref. [36]: first, intrinsic parameters are sampled ahead of training, and waveforms are generated and saved based on these; second, extrinsic parameters are sampled during training and applied to the waveforms, since this involves simple transformations. For our purposes, intrinsic parameters consist of θ intrinsic = (m 1 , m 2 , ϕ c , θ JN , a 1 , a 2 , θ 1 , θ 2 , ϕ 12 , ϕ JL ) and extrinsic parameters are θ extrinsic = (t c , α, δ, d L , ψ).
Each strain data set i consists of a waveform with additive stationary Gaussian noise, d (i) = h(θ (i) ) + n (i) . This is represented in frequency domain, with f min = 20 Hz, f max = 1024 Hz, and ∆f = 0.125 Hz, corresponding to a duration of 8 s. Waveforms are generated using the IMR-PhenomPv2 frequency-domain model [16,52,53], which is fast (so that comparisons against standard samplers are feasible) and also includes spin-precession effects. We save intrinsic waveforms to disk in an SVD representation, which is accurate to a mismatch of 2 · 10 −5 for the 99.9th percentile of the data. Noise realizations are generated during training, after first sampling an associated PSD, i.e., S (i) n ∼ p(S n ), n (i) ∼ p(S (i) n ). We construct training sets based on 5 × 10 6 sets of intrinsic parameters, but by sampling extrinsic parameters and noise realizations during training, the effective size of the training set is infinite in these dimensions.
To construct the empirical PSD distributions p(S n ), we estimate PSDs using noise data from each observing run. Using BURST_CAT2 data from GWOSC [51], we identify stretches of at least 1024 s in duration that do not overlap with events. Each PSD is estimated by taking a 1024 s stretch of data, dividing this into non-overlapping 8 s subintervals, and using the Welch "median-average" method to average PSDs estimated on each of these [17]. We use a Tukey window with a roll-off of 0.4 s. For inference, we use the same construction to estimate the PSD from detector data just prior to an event. The number of PSDs obtained for each observing run is given in Tab. II.

NEURAL NETWORK
There are two main components to our conditional density-estimation neural network, the embedding network for compressing data to a sufficiently small number of features, and the normalizing flow, which produces the Bayesian posterior from these features.

Embedding network
For each of the two or three detectors, the embedding network takes as input the real and imaginary parts of the whitened frequency-domain strain data, as well as the inverse amplitude spectral density (ASD). This results in 24,096 inputs per detector. We provide the inverse ASD rather than the PSD because of the more numerically stable behavior of spectral lines. We scale the ASD with a constant factor of 10 23 .
The first embedding layer is a linear mapping that serves to drastically reduce the number of dimensions. This is initialized based on a singular value decomposition (SVD) to provide an inductive bias to facilitate training. The strain data are initially projected onto the first n SVD = 200 (complex) singular vectors, defined based on a set of 50,000 signal waveforms drawn from the prior. The inverse ASDs are likewise mapped to n SVD complex numbers, which are added to the projected strain; this projection is initialized to 0. After this initial layer, the data has therefore been projected onto 2n detectors n SVD (real) features (i.e., either 800 or 1200 components, depending on the number of detectors). There is no nonlinear activation following this layer.
Following the SVD layer is a fully-connected residual network. This consists of a sequence of two-layer residual blocks; prior to each linear mapping are batch normalization layers and ELU activation functions. We take six blocks each of 1024, 512, 256, and 128 hidden dimensions, resulting in a total of 48 hidden layers in the residual network.

Normalizing flow
Our normalizing flow is very similar to that of [36]. This consists of a sequence of coupling transforms, each of which transforms half of the components of a sample element-wise based on the context information and the values of the untransformed components. We use a rational-quadratic spline coupling transform [45], with the same parameters as in [36], except for the number of residual blocks, which is reduced from 10 to 5. Between the coupling transforms, the ordering of the sample components is randomized to ensure all components are sufficiently transformed by the sequence of transforms. We also increase the total number of coupling transforms to 30 from 15 in [36]. The flow therefore consists of 300 hidden layers. In total, the embedding network and the flow combined have 1.31 · 10 8 learnable parameters for n detectors = 2 and 1.42 · 10 8 for n detectors = 3.
The context information for the flow consists of the 128 features output from the embedding network, as well as the two or three perturbed detector coalescence times τ I .
We also train a neural network to provide an initial estimate for t I , which is used as a starting point for the iterative GNPE algorithm. This does not require τ I as context, but otherwise has the same form as the main network.

Training
Training consists of two stages, an initial pretraining stage and a fine-tuning stage. During the pretraining stage, the SVD layer is frozen, and all noise realizations are drawn from an average PSD for the observing run and detector. This is designed to simplify early-stage training. During the fine-tuning stage, the SVD layer is unfrozen, and PSDs are randomly drawn from the empirical distribution, and noise realizations are drawn from these. Our networks are trained for 300 epochs of pretraining and 150 epochs of fine tuning with a batch size of 4096. We use the Adam optimizer [50]. We begin training with learning rates of 3·10 −4 and 3·10 −5 in the pretraining and finetuning stages, respectively, and decrease the learning rate by a factor of 2 when the validation loss has not improved in the previous 10 epochs. For three detectors, we reduce the batch size to 2048 and the initial learning rates to 2 · 10 −4 and 2 · 10 −5 due to memory limitations. As shown in Fig. 5, the loss jumps at the beginning of the fine-tuning stage. This occurs because the distribution of training data becomes much broader with the inclusion of varying noise PSDs. The final loss at the end of fine tuning is just above the pretraining loss, which indicates that the network has learned to process the varying PSDs to the same performance level of the fixed pretraining PSD. During training, we reserve 2% of the training set for validation to check for overfitting. Since the training and validation loss are in close agreement in Fig. 5 we conclude that overfitting is minimal. Training 450 epochs with a batch size of 4096 takes roughly 10 days on a single NVIDIA A100 GPU. 3

EFFECT OF PSD
To give a sense of the size of the effect of the PSD on the posterior, we perform inference on GW150914, where we whiten the strain data with the correct PSD, but provide instead the PSD for GW151012 as context to the neural network. This gives a mean JSD across all parameters of 0.005 nat and a maximum JSD of 0.030 nat when compared against the DINGO result using the correct PSD. Both of these numbers significantly exceed GW151012 PSD GW150914 PSD  Figure 6. Comparison between DINGO evaluated on GW150914 using correct PSD as context, and using GW151012 PSD. These four parameters have a mean JSD of 0.020 nat.
the JSDs between DINGO and LALInference runs for all events [largest mean JSD: 0.001 nat (GW170729); largest maximum JSD: 0.006 nat (GW170104, m 1 )]. This demonstrates that conditioning the neural network on the PSD is required at the level of accuracy we achieve in this study.
The worst performing parameters with the incorrect PSD are t c , d L , α, and δ, which have a mean JSD of 0.020 nat (see Fig. 6). These results are consistent with the expectation that the main effect due to using the wrong PSD is to cause a change in the inferred amplitude of the signal in each detector: d L becomes biased due to the overall amplitude being off, whereas sky position and t c become biased from relative incorrect amplitudes in the two detectors.

COMPARISONS AGAINST STANDARD SAMPLERS
We compare our results against LALInference [17] with MCMC and nested sampling algorithms, and with Bilby [18,19] with the dynesty [20] nested sampling algorithm. To compare as closely as possible with DINGO, we use the same data conditioning for strain data and PSDs. However, we sample in chirp mass M = (m 1 m 2 ) 3/5 /(m 1 +m 2 ) 1/5 and mass ratio q = m 2 /m 1 instead of component masses, since this simplifies the form of the posterior and improves convergence. We also sample sky position in the detector-based azimuth/zenith reference frame rather than the (α, δ) sky frame for Bilby. This also simplifies the form of the posterior to improve sampling [19]. With standard samplers it is possible to analytically marginalize over some parameters to reduce the dimensionality of the space. This improves sampling performance, with the marginalized parameters reconstructed in postprocessing. For LALInference, we marginalize over time of coalescence, and with Bilby we marginalize over time, distance, and phase. Phase marginalization is only valid in the absence of precession, but we had difficulty obtaining converged results without it. For IMRPhenomPv2 waveforms, this should not lead to a significant difference, but it should be kept in mind. For Bilby, we use nlive=4000 and nact=50.
We find closest agreement with LALInference MCMC, which is what we report in the main text, and is displayed on all posterior plots. For completeness we include in Fig. 7 comparisons against LALInference nested sampling and Bilby with phase marginalization. In general, when using phase marginalization, spin and sky position are less well recovered, as were some events. To give a sense of how JSD values translate into parameter estimates, we provide 90% credible intervals in Tab. III, which are all in extremely close agreement.
Deviations between posteriors obtained from DINGO and standard samplers are associated with multiple sources of error. Firstly, imperfect training of DINGO can lead to inaccurate results. Indeed, training the neural networks is challenging due to the high dimensionality of the input data to this inference problem, which inspired us to adopt the GNPE method [48] to improve  , respectively. As mentioned in the main text, the average JSD between LALInference runs with identical settings but different random seeds is 0.0007 nat, which sets a practical lower bound to the achievable values. Moreover, the DINGO framework allows us to obtain an arbitrary number of independent samples from the same distribution. This enables us to compute the JSDs between two sets of samples, that are, by construction, sampled from the same distribution. This value, 0.0002 nat, provides a fundamental lower bound for perfectly optimized samplers.
convergence. While the P-P plot in Fig. 2 of the main text suggests that our networks are well converged, small deviations between posteriors for real events could arise due to the networks not being fully converged. Secondly, even the well-established standard samplers do not produce perfect posterior samples. The mean JSD between LALInference runs with identical settings but different random seeds is a factor 3.5 higher than that expected for samples from identical distributions (see Fig. 7 for details). Thirdly, perfect agreement between (ideal implementations of) DINGO and standard samplers is only expected for data consistent with the training distribution.
In reality, however, where the noise is neither perfectly stationary nor Gaussian, the measured data is slightly out-of-distribution. In those cases, there is no theoretical guarantee that DINGO extrapolates to this data in the same way as standard samplers. However, as stated in the conclusion, DINGO is not limited to stationary Gaussian noise, and we plan to lift this assumption in future work.
In the remainder of this section, we include 1D and 2D marginalized posteriors for all nontrivial parameters for all events. All DINGO posteriors are produced with 50,000 samples, except for the skymaps, which use 10,000. LALInference posteriors use all samples produced with the given sampler settings, typically of order 30,000-50,000.