Neural Importance Sampling for Rapid and Reliable Gravitational-Wave Inference

We combine amortized neural posterior estimation with importance sampling for fast and accurate gravitational-wave inference. We first generate a rapid proposal for the Bayesian posterior using neural networks, and then attach importance weights based on the underlying likelihood and prior. This provides (1) a corrected posterior free from network inaccuracies, (2) a performance diagnostic (the sample efficiency) for assessing the proposal and identifying failure cases, and (3) an unbiased estimate of the Bayesian evidence. By establishing this independent verification and correction mechanism we address some of the most frequent criticisms against deep learning for scientific inference. We carry out a large study analyzing 42 binary black hole mergers observed by LIGO and Virgo with the SEOBNRv4PHM and IMRPhenomXPHM waveform models. This shows a median sample efficiency of $\approx 10\%$ (two orders-of-magnitude better than standard samplers) as well as a ten-fold reduction in the statistical uncertainty in the log evidence. Given these advantages, we expect a significant impact on gravitational-wave inference, and for this approach to serve as a paradigm for harnessing deep learning methods in scientific applications.

Introduction.-Bayesian inference is a key paradigm for scientific discovery. In the context of gravitational waves (GWs), it underlies analyses including individual-event parameter estimation [1], tests of gravity [2], neutron-star physics [3], populations [4], and cosmology [5]. Given a prior p(θ) and a model likelihood p(d|θ), the Bayesian posterior summarises, as a probability distribution, our knowledge of the model parameters θ after observing data d. When p(d|θ) is tractable (as in the case of GWs) likelihood-based samplers such as Markov chain Monte Carlo (MCMC) [6,7] or nested sampling [8] are typically used to draw samples from the posterior. If it is possible to sample d ∼ p(d|θ) (i.e., simulate data) one can alternatively use amortized simulation-based (or likelihood-free) inference methods [9]. These approaches are based on deep neural networks and can be several orders-of-magnitude faster at inference time. For GW inference, they have also been shown to achieve similar accuracy to MCMC [10]. In general, however, it is not clear how well such networks generalize to out-of-distribution data, and they lack diagnostics that would justify confidence in their results [11]. These powerful approaches are therefore rarely used in applications where accuracy is important and likelihoods are tractable.
In this Letter, we achieve the best of both worlds by combining likelihood-free and likelihood-based methods for GW parameter estimation. We take samples from Dingo [10]-a fast and accurate likelihood-free method using normalizing flows [12][13][14][15]-and treat these as a proposal for importance sampling [16]. The combined method ("Dingo-IS") generates samples from the exact posterior and now provides an estimate of the Bayesian evidence p(d). Moreover, the importance sampling efficiency arises as a powerful and objective performance metric, which flags potential failure cases. Importance sampling is also fully parallelizable.
After describing the method more fully in the following section, we verify on two real events that Dingo-IS produces results consistent with standard inference codes [17][18][19][20]. Our main result is an analysis of 42 events from the Second and Third Gravitational-Wave Transient Catalogs (GWTC-2 and GWTC-3) [1,21], using two waveform models, IMRPhenomXPHM [22] and SEOBNRv4PHM [23]. Due to the long waveform simulation times, SEOBNRv4PHM inference would take several months per event with stochastic samplers. However, Dingo-IS with 64 CPU cores takes just 10 hours for these waveforms. (Initial Dingo samples are typically available in under a minute.) Our results indicate that Dingo(-IS) performs well for the majority of events, and that failure cases are indeed flagged by low sample efficiency. We also find that the log evidence is recovered with statistical uncertainty reduced by a factor of 10 compared to standard samplers.
Machine learning methods have seen numerous applications in GW astronomy, including to detection and parameter estimation [24]. For parameter estimation, these methods have included variational inference [25,26], likelihood ratio estimation [27], and posterior estimation with normalizing flows [10,26,28,29]. Aside from directly estimating parameters, normalizing flows have also been used to accelerate classical samplers, with significant efficiency improvements [30].
Neural density estimation and importance sampling have previously been combined [31] under the guise of "neural importance sampling" [32], and similar approaches have been applied in several contexts [33][34][35][36]. Our contributions are to (1) extend this to amortized simulation-based inference, (2) use it to improve results generated with classical inference methods such as MCMC, and (3) highlight how the use of a forward Kullback-Leibler (KL) loss improves reliability. We also apply it to the challenging real-world problem of GW inference (see footnote 2). We demonstrate results that far outperform classical methods in terms of sample efficiency and parallelizability, while maintaining accuracy and including simple diagnostics. We therefore expect this work to accelerate the development and verification of probabilistic deep learning approaches across science.
Method.-Dingo trains a conditional density-estimation neural network q(θ|d) to approximate p(θ|d) based on simulated data sets (θ, d) with θ ∼ p(θ), d ∼ p(d|θ)-an approach called neural posterior estimation (NPE) [38]. Once trained, Dingo can rapidly produce (approximate) posterior samples for any measured data d. In practice, results may deviate from the true posterior due to insufficient training, lack of network expressivity, or out-of-distribution (OOD) data (i.e., data inconsistent with the training distribution). Although it was shown in [10] that these deviations are often negligible, verification of results requires comparing against expensive standard samplers.
Here, we describe an efficient method to verify and correct Dingo results using importance sampling (IS) [16]. Starting from a collection of n samples θ_i ∼ q(θ|d) (the "proposal") we assign to each one an importance weight $w_i = p(d|\theta_i)\,p(\theta_i)/q(\theta_i|d)$. For a perfect proposal, w_i = constant; more generally, the number of effective samples is related to the variance of the weights, $n_\mathrm{eff} = (\sum_i w_i)^2 / \sum_i w_i^2$ [39]. The sample efficiency ϵ = n_eff/n ∈ (0, 1] arises naturally as a quality measure of the proposal. Importance sampling requires evaluation of p(d|θ)p(θ) rather than the normalized posterior. The Bayesian evidence can then be estimated from the normalization of the weights as $\hat p(d) = \frac{1}{n}\sum_i w_i$. The standard deviation of the log evidence, $\sigma_{\log p(d)} = \sqrt{(1-\epsilon)/(n\,\epsilon)}$ (see Supplemental Material), scales as $1/\sqrt{n}$, enabling very precise estimates. The evidence estimate is furthermore unbiased if the support of the posterior is fully covered by the proposal distribution [40]. The log evidence does have a bias, but this scales as 1/n and is completely negligible in all cases considered here (see Supplemental Material). If q(θ|d) fails to cover the entire posterior, the evidence would be biased toward lower values.

NPE is particularly well suited for IS because of two key properties. First, by construction the proposal has a tractable density, so we can not only sample from q(θ|d) but also evaluate it. Second, the NPE proposal is expected to always cover the entire posterior support. This is because, during training, NPE minimizes the forward KL divergence $D_\mathrm{KL}(p(\theta|d)\,\|\,q(\theta|d))$, which diverges unless supp(p(θ|d)) ⊆ supp(q(θ|d)), making the loss "probability-mass covering". Probability mass coverage is not guaranteed for finite sets of samples generated with stochastic samplers like MCMC (which can miss distributional modes), or for machine learning methods with other training objectives such as variational inference [12,41,42].

(Footnote 2: A similar approach using convolutional networks to parametrize Gaussian and von Mises proposals was used to estimate the sky position alone [37]. Using a normalizing flow proposal, as we do here, significantly improves the flexibility of the conditional density estimator and enables inference of all parameters.)
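As a minimal sketch of the reweighting step (our own illustration, not the Dingo implementation), the weights, sample efficiency, and evidence estimate can be computed in a few lines of numpy; `log_likelihood`, `log_prior`, and `log_q` are placeholders for the problem-specific log densities:

```python
import numpy as np

def importance_sample(log_likelihood, log_prior, log_q, theta):
    """Attach importance weights to proposal samples theta_i ~ q(theta|d).

    Works in log space to avoid overflow for sharply peaked likelihoods.
    Returns normalized weights, the sample efficiency epsilon, the log
    evidence estimate, and its uncertainty sqrt((1 - eps) / (n * eps)).
    """
    log_w = log_likelihood(theta) + log_prior(theta) - log_q(theta)
    shift = log_w.max()                       # stabilize the exponentiation
    w = np.exp(log_w - shift)
    n = len(w)
    n_eff = w.sum() ** 2 / (w ** 2).sum()     # effective sample size [39]
    eps = n_eff / n                           # sample efficiency
    log_evidence = shift + np.log(w.mean())   # log of (1/n) sum_i w_i
    sigma = np.sqrt((1.0 - eps) / (n * eps))
    return w / w.sum(), eps, log_evidence, sigma
```

For a quick sanity check, one can take a unit-Gaussian "posterior" with a known normalization constant and a slightly broader Gaussian proposal; the recovered log evidence then matches the injected normalization to within a few σ.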
Neural importance sampling can in fact be used to improve posterior samples from any inference method provided the likelihood is tractable. If the method provides only samples (without density) then one must first train an (unconditional) density estimator q(θ) (e.g., a normalizing flow [12,13,43]) to use as proposal. This is generally fast for an unconditional flow, and using the forward KL loss guarantees that the proposal will cover the samples. Success, however, relies on the quality of the initial samples: if they are light-tailed, sample efficiency will be poor, and if they are not mass-covering, the evidence will be biased. Nevertheless, for initial samples that well represent the posterior, this technique can provide quick verification and improvement.
In the context of GWs, we refer to neural importance sampling with Dingo as Dingo-IS. Although this technique requires likelihood evaluations at inference time, in practice it is much faster than other likelihood-based methods because of its high sample efficiency and parallelizability. Indeed, Dingo samples are independent and identically distributed, trivially enabling full parallelization of likelihood evaluations. This is a crucial advantage compared to inherently sequential methods such as MCMC.
Results.-For our experiments, we prepare Dingo networks as described in [10], with several modifications. First, we extend the priors over component masses to m₁, m₂ ∈ [10, 120] M⊙ and dimensionless spin magnitudes to a₁, a₂ ∈ [0, 0.99]. We also use the waveform models IMRPhenomXPHM [22] and SEOBNRv4PHM [23], which include higher radiative multipoles and more realistic precession. Finally, our networks cover the frequency range [20, 1024] Hz with ∆f = 0.125 Hz, with identical data conditioning as in [10]. The network architecture, hyperparameters, and training algorithm are also unchanged. We consider the two LIGO [44] detectors for all analyses, and leave the inclusion of Virgo [45] data to a future publication of a complete catalog.
In our experiments, we found that Dingo often has difficulty resolving the phase parameter ϕ c . Although ϕ c itself is of little physical interest, it is nevertheless needed to evaluate the likelihood for importance sampling. We therefore sample ϕ c synthetically, by first evaluating the likelihood across a ϕ c grid and caching the waveform modes for efficiency (see Supplemental Material). This approach is similar to standard phase marginalization [17,46,47], but it is valid even with higher modes; it can therefore be adapted also to stochastic samplers.
For Dingo-IS, with 10⁵ proposal samples per event, the total time for inference using one NVIDIA A100 GPU and 64 CPU cores is typically less than 1 hour for IMRPhenomXPHM and ≈ 10 hours for SEOBNRv4PHM. In both cases, the computation time is dominated by waveform simulations, which could be further reduced using more CPUs. The rest of the time is taken up generating the initial Dingo proposal samples.

We first validate Dingo-IS against standard inference codes for two real events, GW150914 and GW151012, using IMRPhenomXPHM. (For SEOBNRv4PHM it is not feasible to run classical samplers, and one would instead need to use faster methods such as RIFT [49,50].) We generate reference posteriors using LALInference-MCMC [17], and compare one-dimensional marginalized posteriors for each parameter using the Jensen-Shannon divergence (Tab. I). For both events, the initial small deviations of Dingo samples from the reference are made negligible by importance sampling.

Table II. 42 BBH events from GWTC-3 analyzed with Dingo-IS. We report the log evidence log p(d) and the sample efficiency ϵ for the two waveform models IMRPhenomXPHM (upper rows) and SEOBNRv4PHM (lower rows). Highlighting colors indicate the sample efficiency (green: high; yellow: medium; orange/red: low); Dingo-IS results can be trusted for medium and high ϵ (see Supplemental Material). Events in gray suffer from data quality issues [1,21]. ‡See remarks on these events in text.
For the evidence, we compare against Bilby-dynesty [18][19][20], since nested sampling generally provides a more accurate estimate than MCMC. In Tab. I we see that Dingo-IS is more precise by a factor of ≈ 10, but the Bilby evidence is larger for both events by roughly one standard deviation. This deviation could be statistical, but it could also indicate a bias in one of the methods. (Recall that IS requires the proposal to be mass-covering for an unbiased evidence.) To investigate further for GW151012, we perform neural importance sampling starting from 10⁶ Bilby samples (see Supplemental Material). This achieves a slightly lower ϵ = 8.3% than Dingo-IS, but log p(d) = −16412.89 ± 0.01, in close agreement. While this does not fully rule out a bias in Dingo-IS samples (since the test is not fully independent), we take this as an indication that Dingo-IS indeed infers an unbiased evidence. More generally, it showcases how our method can be extended to improve the output of stochastic samplers.

(Footnote 4: Initial deviations are larger than those reported in [10] since we use a more complicated waveform model and a larger prior, while keeping the size of the neural network and training time the same. Any remaining deviations after importance sampling can in principle also be due to sampling inaccuracies of LALInference-MCMC. Note that a direct comparison to published LIGO-Virgo-KAGRA results is impeded by different data settings.)

We now perform a large study analyzing all 42 events in GWTC-2 [21] and GWTC-3 [1] that are consistent with our mass prior (footnote 5). We stress that a study of this scope would be infeasible with standard codes, since SEOBNRv4PHM inference for a single event would take several months. Across all events we achieve a median sample efficiency of ϵ = 10.9% for IMRPhenomXPHM and ϵ = 4.4% for SEOBNRv4PHM (Tab. II).
For most events, the initial Dingo results are already accurate and deviate only slightly from Dingo-IS; furthermore, Dingo-IS shows excellent agreement between the two waveform models (see the Supplemental Material for more detailed comparisons). Note that these results are based on highly complex precessing higher-mode waveform models, and do not include any mitigation of noise transients (see below). With the simpler IMRPhenomPv2 [53][54][55] model and a smaller mass prior (in a study on drifting detector noise distributions [56]), Dingo-IS achieves an even larger median sample efficiency of ϵ = 36.8% on 37 events. Importance sampling guarantees robust results by marking failure cases with a low sample efficiency. By this metric, Dingo struggles slightly with chirp masses near the lower prior boundary (GW191204 110529 and GW200322 091133). For such systems, efficiency may be improved by increasing the prior range used for training. Events with known data quality issues also often have low sample efficiency (see Tab. II): several low-ϵ events are contaminated by glitch artifacts (which would be mitigated in a more complete analysis [1,21]); GW200129 065458, in addition to having a glitch [57], may not be well modeled by either of our waveform models due to strong precession [58]; and GW200322 091133 may simply be a Gaussian noise fluctuation [59]. In these cases, Dingo-IS marks events for additional investigation.
Data quality issues such as non-Gaussian noise or observed signals that do not match models correspond to OOD data, i.e., data not consistent with the training distribution. Since OOD data are not seen during training, Dingo cannot be expected to return their true posterior, which results in a low sample efficiency. As an additional test, running Dingo-IS on signal-free data with a blip glitch [60] in the LIGO Hanford detector (GPS time 1238613687.5) results in ϵ ≈ 0.001%. Likewise, we find that Dingo-IS successfully flags adversarial examples [61,62] that are intentionally corrupted to mislead the inference network (ϵ ≈ 0.01%; see Supplemental Material)-addressing a common failure mode of neural networks. Our general view, therefore, is that although there can be various reasons for low-ϵ results, it often serves as a useful heuristic to identify OOD events.
Conclusions.-We have described the use of importance sampling to improve the results of NPE in amortized inference problems, and we applied it to the case of GWs. Neural importance sampling provides rapid verification of results and corrects any inaccuracies in deep learning output; it provides an evidence estimate with precision far exceeding that of classical samplers; and it marks potentially OOD data for further investigation. With high sample efficiency and rapid initial results, Dingo-IS becomes a comprehensive inference tool for accurately analyzing the large numbers of BBH events expected soon.
High sample efficiencies are predicated on a high-quality proposal, which Dingo provides. A key element is the probability-mass covering property, which is guaranteed by the forward KL training loss. This tends to produce broad tails, which are downweighted in importance sampling. Overly broad proposals would nevertheless result in low sample efficiency, so highly expressive density estimators such as normalizing flows are essential, along with Dingo innovations such as GNPE and GW training-data augmentation. Dingo posteriors are rarely light-tailed, but when they are, this can lead to an underestimated evidence for small n.
With the inclusion of importance sampling, the Dingo pipeline can now be used in several different ways. When low latency is desired, complete posteriors are still available without importance sampling in a matter of seconds. Results include sky position and mass parameters and could therefore play an important role in directing electromagnetic followup observations once we extend Dingo to mergers involving neutron stars (see footnote 5). By comparing against Dingo-IS, we have shown that in the majority of cases, initial results are already very reliable, with only minor deviations in marginal distributions. Indeed, validation of Dingo results was a major motivation in exploring importance sampling.
When high accuracy is desired, Dingo-IS reweights results to the true posterior and includes an estimate of the evidence. Results are verified and include probability-mass-covering guarantees that ensure secondary modes are not missed. Sample efficiencies are often two orders-of-magnitude higher than MCMC or nested sampling, and importance sampling is fully parallelizable. As a consequence, results are typically available within an hour for IMRPhenomXPHM, or 10 hours for SEOBNRv4PHM. This represents a significant advantage when considering the event rates likely to be reached with advanced detectors (three per week or higher in the upcoming LIGO-Virgo-KAGRA observing run O4).
Dingo-IS opens several new possibilities for GW analysis: (1) rapid inference means that the most accurate waveform models, which include all physical effects, could be used for all events; (2) high-precision evidences enable detailed model comparison; and (3) low sample efficiencies can identify data that do not fit the noise or waveform model. We believe that these results have highlighted clear benefits of combining likelihood-free and likelihood-based methods in Bayesian inference. Going forward, as Dingo-IS validates and builds trust in Dingo, it will help to set the stage for noise-model-free inference, which is truly likelihood-free.
The code for Dingo and Dingo-IS is available at https://github.com/dingo-gw/dingo.
Acknowledgments.-We thank V. Raymond for encouraging us to pursue importance sampling in the early stages of the project, and C. García Quirós, N. Gupte, S. Ossokine, A. Ramos-Buades and R. Smith for useful discussions. This material is based upon work supported by NSF's LIGO Laboratory which is a major facility fully funded by the National Science Foundation. This research has made use of data or software obtained from the Gravitational Wave Open Science Center (gw-openscience.org),

IMPORTANCE-SAMPLED BAYESIAN EVIDENCE
The Bayesian evidence is given by
$$ p(d) = \int \mathrm{d}\theta \, p(d|\theta)\, p(\theta), $$
which can be estimated using n samples θ_i ∼ q(θ|d) in the Monte Carlo approximation as $\hat p(d) = \hat\mu_w$ with
$$ \hat\mu_w = \frac{1}{n} \sum_{i=1}^{n} w_i, $$
where $w_i = p(d|\theta_i)\, p(\theta_i)/q(\theta_i|d)$ are the weights used for importance sampling. The variance of this Monte Carlo estimate is given by
$$ \sigma_w^2 = \frac{\hat\mu_w^2}{n} \sum_{i=1}^{n} (\bar w_i - 1)^2 = \hat\mu_w^2\, \frac{1-\epsilon}{\epsilon}, $$
where we denote normalized weights by $\bar w_i = w_i/\hat\mu_w$ and the sample efficiency by $\epsilon = n_\mathrm{eff}/n$, with $n_\mathrm{eff} = (\sum_i w_i)^2/\sum_i w_i^2$. Since we use n samples to estimate $p(d) = \hat\mu_w$, the standard deviation of the evidence is given by
$$ \sigma_{\hat\mu_w} = \frac{\sigma_w}{\sqrt{n}} = \hat\mu_w \sqrt{\frac{1-\epsilon}{n\,\epsilon}}. $$
In practice, we are interested in the log evidence, for which the uncertainty is
$$ \sigma_{\log \hat\mu_w} \approx \frac{\sigma_{\hat\mu_w}}{\hat\mu_w} = \sqrt{\frac{1-\epsilon}{n\,\epsilon}}. $$
The Monte Carlo estimator $\hat\mu_w$ provides an unbiased estimate for p(d) [40], $\mathbb{E}[\hat\mu_w] = p(d)$. The logarithm of the evidence, however, has a bias. Defining $Y = \hat\mu_w - p(d)$, we find
$$ \mathbb{E}[\log \hat\mu_w] = \mathbb{E}\left[ \log p(d) + \frac{Y}{p(d)} - \frac{Y^2}{2\,p(d)^2} + \dots \right] \approx \log p(d) - \frac{1-\epsilon}{2\,n\,\epsilon}, $$
where we used $\mathbb{E}[Y] = 0$ and $\mathrm{Var}[Y] = \sigma_w^2/n$ and neglected terms of order $O((Y/p(d))^3)$. The bias of the log evidence thus depends on the sample efficiency ϵ = n_eff/n and scales as 1/n. Given that the uncertainty of $\log \hat\mu_w$ scales as $1/\sqrt{n}$, this bias is completely negligible in practice.
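These scalings are easy to check numerically. The toy experiment below (our own illustration, not from the paper) importance-samples a unit-Gaussian target from a broader Gaussian proposal, for which the true evidence is 1, and compares the scatter and mean of $\log\hat\mu_w$ across trials with the predicted $\sqrt{(1-\epsilon)/(n\epsilon)}$ and $-(1-\epsilon)/(2n\epsilon)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, sigma_q = 1_000, 2_000, 1.5   # proposal N(0, sigma_q^2), target N(0, 1)

def log_w(theta):
    # log of w = p(theta) / q(theta); the target is normalized, so p(d) = 1
    return -0.5 * theta**2 + 0.5 * (theta / sigma_q)**2 + np.log(sigma_q)

def log_mu_w(theta):
    lw = log_w(theta)
    m = lw.max()
    return m + np.log(np.mean(np.exp(lw - m)))

# log-evidence estimates from independent importance-sampling trials
estimates = np.array([log_mu_w(rng.normal(0, sigma_q, n)) for _ in range(trials)])

# sample efficiency epsilon from a large pooled draw
w = np.exp(log_w(rng.normal(0, sigma_q, 200_000)))
eps = w.sum() ** 2 / ((w ** 2).sum() * len(w))

predicted_sigma = np.sqrt((1 - eps) / (n * eps))   # std of the log evidence
predicted_bias = -(1 - eps) / (2 * n * eps)        # O(1/n) bias of the log evidence
```

With these settings the observed scatter of `estimates` should track `predicted_sigma` closely, while the bias is far below the per-trial scatter, illustrating why it is negligible in practice.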

ANALYTIC ESTIMATE OF THE PHASE PARAMETER
The parameter ϕ_c describes the phase of the gravitational wave at a fixed reference frequency. It provides no physical insight, but it is necessary to define a complete likelihood [46]. While the marginal p(ϕ_c|d) usually has a simple structure, the conditional distribution p(ϕ_c|d, θ̃), where θ̃ denotes the 14 remaining parameters, is typically very tightly constrained. Furthermore, ϕ_c is strongly correlated with θ̃. We observed that Dingo has difficulties learning the phase parameter, and often infers the prior instead, q(ϕ_c|d, θ̃) = p(ϕ_c). While we did not find this to have a negative impact on the remaining parameters, it leads to a substantially reduced sample efficiency.
Inspired by phase marginalization [46,47], a technique commonly used to increase the efficiency of stochastic samplers, we analytically estimate ϕ c . The approach outlined below differs in two ways from typical phase marginalization-(1) we retrieve ϕ c instead of marginalizing over it, and (2) this technique is exact even in the presence of higher modes, where phase marginalization is an approximation.
We decompose our posterior estimate as
$$ q(\phi_c, \tilde\theta \,|\, d) = p(\phi_c \,|\, d, \tilde\theta)\, q(\tilde\theta \,|\, d), $$
where q(θ̃|d) is estimated with Dingo. For each Dingo sample θ̃ ∼ q(θ̃|d), we then synthetically sample ϕ_c using the analytic likelihood. This is done by evaluating p(ϕ_c|d, θ̃) on a uniform grid over ϕ_c with 5001 points in the range [0, 2π] and interpolating in between. Each likelihood evaluation requires a waveform simulation, which accounts for the bulk of the computational cost. As we outline below, by caching suitable combinations of the waveform modes, we can cheaply evaluate waveform polarizations for arbitrary ϕ_c. Hence sampling the synthetic ϕ_c is barely more expensive than a single likelihood evaluation.
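The grid-and-interpolate step can be sketched as follows (a simplified illustration; `log_likelihood_on_grid` is a hypothetical callable wrapping the cached-mode likelihood described in the next subsection):

```python
import numpy as np

def sample_phase(log_likelihood_on_grid, n_grid=5001, rng=None):
    """Draw phi_c ~ p(phi_c | d, theta_rest) by gridding the likelihood.

    The prior on phi_c is flat, so the conditional is proportional to the
    likelihood; we build the CDF on the grid and invert it by linear
    interpolation between grid points.
    """
    rng = rng or np.random.default_rng()
    grid = np.linspace(0.0, 2.0 * np.pi, n_grid)
    log_l = log_likelihood_on_grid(grid)
    prob = np.exp(log_l - log_l.max())    # unnormalized; stable in log space
    cdf = np.cumsum(prob)
    cdf /= cdf[-1]
    return np.interp(rng.uniform(), cdf, grid)
```

As a sanity check, feeding in a von Mises-like log likelihood concentrated around some phase recovers samples whose circular mean sits at that phase.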

Phase transformations
We work in the L 0 frame, which aligns the z axis with the orbital angular momentum of the binary at the reference frequency, and takes ϕ c as the azimuthal angle of the observer relative to the axis connecting the two bodies. In these coordinates, the observer is located at (θ, ϕ) = (ι, π/2 − ϕ c ), where ι is the inclination of the binary. This is convenient for caching the modes, since ϕ c enters the waveform entirely via the spin-weighted spherical harmonics (as opposed to the modes themselves).
Waveform modes h_{ℓm} combine into polarizations h_{+,×} as
$$ h_+ - i h_\times = \sum_{\ell, m} {}_{-2}Y_{\ell m}(\theta, \phi)\, h_{\ell m}. $$
In the frequency domain, writing $h = h_+ - i h_\times$,
$$ \tilde h_+(f) = \frac{1}{2}\left[\tilde h(f) + \tilde h^*(-f)\right]. $$
Considering just the plus polarization and substituting the mode expansion,
$$ \tilde h_+(f) = \frac{1}{2} \sum_{\ell, m} \left[ {}_{-2}Y_{\ell m}(\theta, \phi)\, \tilde h_{\ell m}(f) + {}_{-2}Y^*_{\ell m}(\theta, \phi)\, \tilde h^*_{\ell m}(-f) \right]. $$
Now we use the fact that the ϕ-dependence enters the spin-weighted spherical harmonics as ${}_{-2}Y_{\ell m}(\theta, \phi) = {}_{-2}Y_{\ell m}(\theta, 0)\, e^{i m \phi}$. Since $h_+$ is real, we only need to consider f > 0. In the L0 frame, we can then write
$$ \tilde h_+(f) = \sum_m e^{i m (\pi/2 - \phi_c)}\, \tilde h_{+,m}(f), $$
where we have grouped the terms according to their m-dependence,
$$ \tilde h_{+,m}(f) = \frac{1}{2} \sum_\ell \left[ {}_{-2}Y_{\ell m}(\iota, 0)\, \tilde h_{\ell m}(f) + {}_{-2}Y^*_{\ell, -m}(\iota, 0)\, \tilde h^*_{\ell, -m}(-f) \right]. $$
Notice that we have combined the positive-frequency parts of modes with azimuthal number m together with the negative-frequency parts of modes with azimuthal number −m. With this decomposition, we only need to cache the $\tilde h_{+,m}$. Likewise, for the cross polarization we have
$$ \tilde h_\times(f) = \sum_m e^{i m (\pi/2 - \phi_c)}\, \tilde h_{\times,m}(f), $$
where
$$ \tilde h_{\times,m}(f) = \frac{i}{2} \sum_\ell \left[ {}_{-2}Y_{\ell m}(\iota, 0)\, \tilde h_{\ell m}(f) - {}_{-2}Y^*_{\ell, -m}(\iota, 0)\, \tilde h^*_{\ell, -m}(-f) \right]. $$
One additional complication arises because waveform models are usually given in terms of Cartesian spin components, and ϕ_c also enters their definition in terms of the spin parameters used for parameter estimation. Consequently, the modes retain a dependence on ϕ_c. We overcome this by fixing the phase parameter used in effecting this transformation. This results in a slightly different definition of the spin parameters θ_JN and ϕ_JL, which we undo in post-processing. Since the standard priors are invariant under this transformation, other parameters are not affected.
This approach enables likelihood evaluations on a ϕ_c grid at the computational cost of a single likelihood evaluation, plus a small additional cost for the inner products. The implementation is fully contained in the Dingo package, which uses low-level LALSimulation [65] functions to compute frequency-domain modes in the L0 frame and combines them into the h̃_{+/×,m}. For SEOBNRv4PHM, this requires Fourier transforming the time-domain modes provided by LALSimulation in the L0 frame. For IMRPhenomXPHM, it requires transforming from the J frame to the L0 frame, such that the ϕ_c dependence enters via the spherical harmonics, not via the modes themselves.
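Once the m-grouped arrays are cached, recombining them for a new phase is trivial. A minimal sketch of that recombination (illustrative only; the actual implementation lives in the Dingo package):

```python
import numpy as np

def polarizations_at_phase(hp_m, hc_m, phi_c):
    """Evaluate h_+(f), h_x(f) for arbitrary phi_c from cached m-grouped modes.

    hp_m / hc_m map the azimuthal number m to the complex arrays
    h_{+,m}(f) / h_{x,m}(f), computed once per sample theta_rest.
    The phase only enters through the factors exp(i m (pi/2 - phi_c)).
    """
    hp = sum(np.exp(1j * m * (np.pi / 2 - phi_c)) * h for m, h in hp_m.items())
    hc = sum(np.exp(1j * m * (np.pi / 2 - phi_c)) * h for m, h in hc_m.items())
    return hp, hc
```

Evaluating a 5001-point ϕ_c grid then costs 5001 such cheap recombinations rather than 5001 waveform simulations.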

DENSITY RECOVERY
IS requires access to the density of the inferred samples. While for NPE, this density is tractable, this is not necessarily the case for other inference methods. Below, we describe how we use neural density estimation to recover the density in these cases.

Group equivariant neural posterior estimation
Dingo uses an iterative algorithm called group equivariant NPE (GNPE) [10,48] to integrate physical symmetries and thereby improve the accuracy of inference. With GNPE, we train a density-estimation network q(θ|d, t̂_I(θ)) that is also conditional on a set of GNPE proxy parameters t̂_I. These are defined as blurred versions of the coalescence times t_I in the individual interferometers (which can be computed as a function of θ),
$$ \hat t_I = t_I + \kappa_I, \qquad \kappa_I \sim U[-1\,\mathrm{ms}, 1\,\mathrm{ms}]. $$
With GNPE, we iteratively infer the posterior p(θ, t̂_I|d) in the joint parameter space with Gibbs sampling, and obtain the posterior over θ by marginalizing over t̂_I. We use a parallelized Gibbs sampler that typically converges after 30 iterations, but some events require up to 500 iterations. Each iteration corresponds to a forward pass through the density estimator q(θ|d, t̂_I(θ)). 500 GNPE iterations for a batch of 5·10⁴ samples take about 6 minutes on an A100 GPU.

In contrast to NPE, GNPE does not have a tractable density. To recover the density, we first generate 4·10⁵ GNPE samples (48 minutes on one GPU, or 6 minutes on eight GPUs, for 500 iterations). We then train an unconditional normalizing flow q(t̂_I) to estimate the distribution over the inferred proxy parameters with a maximum likelihood objective. We use a neural spline flow with rational-quadratic spline coupling transforms [14] with the hyperparameters from Tab. III. Once trained, we can sample without additional GNPE iterations via t̂_I ∼ q(t̂_I), θ ∼ q(θ|d, t̂_I).

Table III. Settings for the neural spline flow [14] architecture (upper part) and training (lower part) used for density recovery. For Dingo-IS with GNPE, we need to estimate a two-dimensional distribution over the proxy parameters, which requires a smaller network than the distribution over the 14-dimensional parameter space used for Bilby-IS.
The proposal density is now tractable: log q(θ, t̂_I|d) = log q(θ|d, t̂_I) + log q(t̂_I).
We then perform IS in the joint parameter space (θ, t̂_I), where the target density is given by
$$ \log p(\theta, \hat t_I \,|\, d) = -\log p(d) + \log p(d|\theta) + \log p(\theta) + \log p(\hat t_I \,|\, \theta). $$
The last term accounts for the proxy distribution p(t̂_I|θ). As described in the main part, we omit log p(d) and estimate it from the normalization of the weights. Alternatively, we could train an unconditional density estimator for the converged θ samples directly, but this is less sample efficient and more costly to train.

Stochastic samplers
We apply IS to Bilby-dynesty [18][19][20], which is based on nested sampling. To recover the density, we first generate ≈ 10⁶ posterior samples with 50 Bilby runs with identical settings. With nlive=1000 and nact=5, this takes about one day per run on 10 CPUs when using the IMRPhenomXPHM model. One typically uses a larger nact for production results, but this substantially increases the computational cost. For reference, the runs for GW150914 and GW151012 reported in the main paper, with nlive=4000 and nact=50, took about a week. We then estimate the distribution over the Bilby samples by training an unconditional normalizing flow q(θ); see Tab. III. To ensure a fair comparison with Dingo, we also use the analytic estimate for the phase parameter, such that we only need to estimate the distribution over the remaining 14 parameters. Due to the higher-dimensional parameter space compared to Dingo (for which we only need to recover the two-dimensional density over t̂_I), we need more samples and a larger normalizing flow for the density estimate.
For GW151012, Bilby-IS achieves a sample efficiency ϵ = 8.3%, compared to ϵ = 12.5% for Dingo-IS, and estimates an evidence of log p(d) = −16412.89 ± 0.01. Since Bilby-IS is computationally very expensive, we do not expect it to be routinely used, but rather view it as an insightful diagnostic.

IMPORTANCE SAMPLING CONVERGENCE
Due to the probability-mass covering training objective, Dingo inaccuracies tend to show up as overly broad posteriors (Fig. 2). When the tails of the posterior are overestimated by Dingo, a low sample efficiency may be encountered due to many low-weight samples. These cases are straightforward to handle with Dingo-IS. The sample efficiency is approximately constant, and the statistical uncertainty of the evidence fully captures the error, even for low n_eff. To get smooth marginals one simply needs to generate more samples, which is cheap with Dingo.
In contrast, Dingo posteriors should rarely be light-tailed. For real data, however, parts of the parameter space are occasionally strongly undersampled, which is problematic for IS. Indeed, for small n, the light tails may not be sampled at all, resulting in an underestimate of the evidence and of the magnitude of its statistical error. Deviations between Dingo and the true posterior are thus primarily in the direction of overly broad, rather than overly narrow, proposals. This is a manifestation of the probability-mass covering behavior, making Dingo particularly well-suited for importance sampling.
Moreover, when a sample from the tail is encountered it has very large importance weight, which greatly decreases the sample efficiency. In order to assess the validity of IS results with low sample efficiency it is therefore useful to check whether log p(d) has converged as a function of n (Fig. 3). If Dingo is not truly mass covering, the IS weights are not upper-bounded, and the sample efficiency approaches zero with increasing n. This happens for the OOD event GW200129 065458.
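A convergence check of this kind can be implemented as a running estimate over the weight sequence (a generic diagnostic sketch, not Dingo-specific code):

```python
import numpy as np

def log_evidence_trace(log_w):
    """Running log-evidence estimate after 1, 2, ..., n samples.

    If the trace still drifts, or jumps whenever a rare high-weight
    sample arrives, importance sampling has not converged and the
    evidence (and its error bar) should not be trusted.
    """
    log_w = np.asarray(log_w, dtype=float)
    shift = log_w.max()                       # numerical stability
    running_mean = np.cumsum(np.exp(log_w - shift)) / np.arange(1, len(log_w) + 1)
    return shift + np.log(running_mean)
```

For a perfect proposal the trace is exactly flat; for a mass-covering proposal it settles onto the true log evidence as n grows.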
Fortunately, non-convergence is rare, and for the majority of events, Dingo posteriors are indeed mass covering and heavy tailed. Even when this is not the case and the sample efficiency is very low, the Dingo marginals are often still accurate. This is because the light-tailed parts of the parameter space often carry negligible probability mass and are distributed unsystematically throughout the parameter space. In such cases one can apply batched self-normalized IS: instead of normalizing the weights of all n samples simultaneously, one normalizes batches of size k < n. This regularizes IS by decreasing the largest possible weight from n to k. This should be done with caution, as it introduces a bias which is only small if the undersampled regions carry an overall low probability mass, or are distributed unsystematically throughout the parameter space.
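The batched variant simply replaces the single global normalization with per-batch normalizations. A sketch (our own illustration of the idea):

```python
import numpy as np

def batched_self_normalized_weights(w, k):
    """Self-normalize importance weights in batches of size k (k < n).

    This caps the largest normalized weight at roughly k/n instead of
    close to 1, regularizing the estimator. It introduces a bias that is
    small only if undersampled regions carry little probability mass.
    """
    w = np.asarray(w, dtype=float)
    n = (len(w) // k) * k                     # drop any ragged final batch
    batches = w[:n].reshape(-1, k)
    per_batch = batches / batches.sum(axis=1, keepdims=True)
    return (per_batch / (n // k)).ravel()     # total weight sums to one
```

With one pathological sample dominating the weights, global normalization assigns it nearly all the probability mass, while the batched version confines its influence to a single batch.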

ROBUSTNESS TO ADVERSARIAL EXAMPLES
An adversarial example [61,62] refers to data d_adv. that is specifically designed to mislead a neural network. Such examples can be generated by following gradients of the network output (or some function thereof), starting from some real data d_true and sequentially adding small perturbations to maximally change the output. Although the resulting adversarial example d_adv. is often barely distinguishable from d_true, the neural network output can change dramatically.
In the context of posterior estimation, the output is a high-dimensional distribution which one can alter in multiple ways. We tried to shift or truncate the predicted Dingo distribution q(θ|d_adv.) by applying only minimal modifications to the data. We found that Dingo is remarkably robust to such attacks: its output could barely be changed without significantly changing the input data d. We attribute this unusual robustness to two factors. First, the training data itself is very noisy, which regularizes Dingo models. Second, the first layer of Dingo networks is seeded with principal components of clean GW signals [10], so adversarial perturbations are projected onto the manifold of GW signals.
We thus explore a slightly different notion of adversarial attack. Starting from strain data d initialized with random Gaussian noise, we aim to modify d such that Dingo estimates identical posteriors for d_adv. and the real strain data d_true for GW150914. Specifically, we minimize the KL divergence
$$ D_\mathrm{KL}\big(q(\theta|d_\mathrm{true})\,\|\,q(\theta|d_\mathrm{adv.})\big) \qquad (22) $$
via gradient descent on d_adv.. In contrast to the technique mentioned above, we here do not constrain the difference between d_adv. and d_true to be small. To optimize (22) we need to take gradients of the Dingo density with respect to d, which is intractable with the iterative GNPE [10,48] method. Instead, we use a Dingo network trained with standard NPE. We use the Adam [66] optimizer with a learning rate of 0.03 to optimize Eq. (22) with 400 gradient steps (batch size 1024). The resulting strain d_adv. is visibly different from the true GW150914 strain d_true, but the estimated Dingo posteriors are almost identical (Fig. 4).
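To make the construction concrete, here is a self-contained toy version of the attack (entirely our illustration: a linear "network" with a fixed-variance Gaussian posterior stands in for Dingo, so the KL divergence and its gradient are closed-form):

```python
import numpy as np

rng = np.random.default_rng(0)
dim_d, sigma, lr, steps = 64, 0.5, 0.3, 400

# toy "inference network": data d -> Gaussian posterior N(mu(d), sigma^2)
v = rng.normal(size=dim_d)
W = v / np.linalg.norm(v)                # unit-norm weights keep the descent stable
mu_true = 1.3                            # posterior mean assigned to the "real" data

d_adv = rng.normal(size=dim_d)           # start from pure Gaussian noise
for _ in range(steps):
    mu = W @ d_adv
    # KL(N(mu_true, s^2) || N(mu, s^2)) = (mu_true - mu)^2 / (2 s^2)
    grad_d = -(mu_true - mu) / sigma**2 * W   # gradient of the KL w.r.t. d
    d_adv -= lr * grad_d

kl = (mu_true - W @ d_adv) ** 2 / (2 * sigma**2)
```

After optimization the toy network assigns d_adv essentially the same posterior as the "real" data, even though d_adv is just reshaped noise; in the paper the analogous mismatch is exposed by the collapse of the Dingo-IS sample efficiency.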
With Dingo-IS, we find a sample efficiency of ϵ = 1.48% for the real GW150914 strain d_true. This is substantially smaller than the sample efficiency achieved with GNPE (ϵ = 28.8%), since standard NPE does not use the physical symmetries and is hence less accurate. However, the Dingo-IS posterior is still accurate, and the evidence estimate (log p(d) = −15831.88 ± 0.03) is in good agreement with the result reported in the main paper. For d_adv., on the other hand, Dingo-IS achieves a sample efficiency of ϵ = 0.006%, clearly identifying the adversarial example as a Dingo failure case.

Fig. 5 shows posteriors comparing the two waveform models IMRPhenomXPHM and SEOBNRv4PHM; the models give results in good agreement. Fig. 6 shows posterior marginals for several GW events from O3. A large sample efficiency often corresponds to good agreement between the Dingo and Dingo-IS marginals. The sample efficiency is sensitive to deviations in the full 15-dimensional parameter space, so a small sample efficiency does not necessarily imply inaccurate marginal distributions.