Gravitational-wave parameter estimation with autoregressive neural network flows

We introduce the use of autoregressive normalizing flows for rapid likelihood-free inference of binary black hole system parameters from gravitational-wave data with deep neural networks. A normalizing flow is an invertible mapping on a sample space that can be used to induce a transformation from a simple probability distribution to a more complex one: if the simple distribution can be rapidly sampled and its density evaluated, then so can the complex distribution. Our first application to gravitational waves uses an autoregressive flow, conditioned on detector strain data, to map a multivariate standard normal distribution into the posterior distribution over system parameters. We train the model on artificial strain data consisting of IMRPhenomPv2 waveforms drawn from a five-parameter $(m_1, m_2, \phi_0, t_c, d_L)$ prior and stationary Gaussian noise realizations with a fixed power spectral density. This gives performance comparable to current best deep-learning approaches to gravitational-wave parameter estimation. We then build a more powerful latent variable model by incorporating autoregressive flows within the variational autoencoder framework. This model has performance comparable to Markov chain Monte Carlo and, in particular, successfully models the multimodal $\phi_0$ posterior. Finally, we train the autoregressive latent variable model on an expanded parameter space, including also aligned spins $(\chi_{1z}, \chi_{2z})$ and binary inclination $\theta_{JN}$, and show that all parameters and degeneracies are well-recovered. In all cases, sampling is extremely fast, requiring less than two seconds to draw $10^4$ posterior samples.


I. INTRODUCTION
The task of gravitational-wave parameter estimation is to determine the system parameters that gave rise to observed detector strain data. This is accomplished using Bayesian inference. Assuming a likelihood model p(y|x) for strain data y conditioned on system parameters x, and a prior distribution p(x), one obtains through Bayes' theorem the posterior distribution, where the normalization p(y) is called the model evidence.
Generally, one can evaluate the likelihood explicitly (although this may be computationally expensive) and the prior is also known, so this allows for calculation of the posterior up to normalization. One can then use an algorithm such as Markov chain Monte Carlo (MCMC) to obtain samples from the posterior. Standard sampling algorithms are effective, but they can be computationally costly. Indeed, for binary black holes, obtaining a sufficient number of posterior samples can take on the order of days, whereas for binary neutron stars, this extends to weeks. Especially for binary neutron stars, which might have multimessenger counterparts, it is critical to reduce this time to provide accurate information to electromagnetic observers. * stephen.green@aei.mpg.de † christine.simpson@ed.ac.uk ‡ jonathan.gair@aei.mpg. de There have recently been several efforts to speed up parameter estimation by using deep learning [1][2][3]. The main tool of deep learning is the neural network, which is a trainable and very flexible function approximator. Neural networks can have millions of parameters, which are tuned through stochastic gradient descent to optimize a loss function. In this way, very complex functions can be approximated. Since conditional distributions can be parametrized by functions, neural networks can be used to model gravitational-wave posteriors.
A key observation of [1,2] is that Bayes' rule can be applied in such a way that training only requires samples from the prior and the gravitational-wave likelihood, i.e., strain realizations y ∼ p(y|x). Although the network learns the posterior distribution, posterior samples are not needed for training. Training also does not require any likelihood evaluations, and for this reason this approach is known as likelihood-free inference. Since obtaining posterior samples and evaluating the likelihood are expensive, the likelihood-free approach is very fast in comparison. It is also applicable in contexts where a simulation of the data generative process is available, but the likelihood function may be unknown.
In [1], this approach was applied successfully with a multivariate Gaussian posterior model, i.e., p(x|y) = N (µ(y), Σ(y)), where the mean µ(y) and covariance matrix Σ(y) are given by neural networks. Once trained on simulated waveforms and noise realizations, the Gaussian posterior model is trivial to sample.
The challenge, then, is to define a sufficiently flexible model distribution for the posterior. Gaussians are good approximations for very high signal-to-noise ratio, but posteriors can in general have higher moments and multimodality. Ref. [1] also experimented with Gaussian mixture models and posterior histogram outputs. The performance of the Gaussian mixture, however, was not a significant improvement over the single Gaussian, and although histogram was effective, it is limited to describing one or two parameters.
A promising approach to increase the flexibility of a posterior model is to introduce latent variables z and perform variational Bayesian inference. The posterior is then given by marginalization over the latent variables, i.e., p(x|y) = p(x|z, y)p(z|y)dz. With p(x|z, y) and p(z|y) both taken to be Gaussian, this nevertheless gives non-Gaussian p(x|y). This can be implemented using the variational autoencoder framework [4,5], and recent results [2] for gravitational-wave parameter estimation with a conditional variational autoencoder (CVAE) have demonstrated the recovery of non-Gaussian posteriors (although not multimodality).
In this work, we use the method of normalizing flows [6] (specifically, masked autoregressive flows [7,8]) to increase the flexibility of the gravitational-wave posterior. A normalizing flow f : X → X is an invertible mapping on a sample space X, with simple Jacobian determinant. Using the change of variables rule for probabilities, this induces a transformation, from a base distribution π(u) to a new distribution p(x).
Here n = dim X. Starting from a simple standard normal distribution for π(u), one can obtain a more complex distribution by applying a normalizing flow. To describe conditional distributions, such as gravitational-wave posteriors, a normalizing flow may also be conditioned on additional variables; for our application, we condition f on detector strain data y.
The fact that f is invertible means that the normalizing flow enables both sampling and density evaluation of p(x), provided this is possible for π(u). To sample, one first samples u ∼ π(u), and then x = f (u) is a sample from p(x). For given x, to evaluate the density, the inverse mapping u = f −1 (x) is used in the right hand side of (2).
Normalizing flows may be used to model gravitationalwave posterior distributions directly, and this is the first application that we describe. We fix the base distribution to be multivariate standard normal of the same dimension as the system parameter space. We then take the basic flow unit to be a three-hidden-layer MADE neural network [9] conditioned on y, and we stack several of these to obtain a sufficiently flexible posterior model. This is called a masked autoregressive flow (MAF) [8]. Since the density can be evaluated directly through (2), the network can be trained using stochastic gradient descent to maximize the likelihood over the network parameters that the training data ((x, y) pairs) came from the model.
Normalizing flows can also be incorporated into the CVAE: they can be used to enhance the flexibility of the encoder, decoder, and prior component networks, thereby increasing the overall flexibility of the model [7,10]. In section II of this work, we describe all of the above networks in further detail.
In section III we present the results of our experiments in using these models to describe gravitational-wave posteriors over a five-dimensional space of system parameters. Following [2], we study binary black hole waveforms over the parameters (m 1 , m 2 , φ 0 , t c , d L ), with added noise drawn from a stationary Gaussian distribution with fixed power spectral density. We work in the time domain. We find that the basic MAF network achieves results comparable to [2], but with the advantage of not having any latent variables to marginalize over. In our experiments, however, neither of these networks successully models the posterior over φ 0 , which is multimodal. We next test the more powerful autoregressive CVAE, which does succeed in modeling the multimodality in φ 0 . To validate our recovered posteriors, we present p-p plots consistent with uniformly distributed percentile scores in each parameter, as well as comparisons to MCMC sampling.
We then expand the parameter space to include aligned spins (χ 1z , χ 2z ) and binary inclination θ JN in section IV. We train the CVAE network with autoregressive flows to model the posterior over this eight-dimensional space. We find that the network once again successfully models all parameters. This includes the various degeneracies, e.g., between χ 1z and χ 2z , and between θ JN and d L . We validate our results again with a p-p plot.
This work is organized as follows. In the following section we describe in more detail the various types of neural networks that we use for parameter estimation. In section III we describe our experiments with the fivedimensional parameter space, and in section IV the enlarged eight-dimensional space. Finally, in section V we conclude with a discussion of potential further improvements.
Software: All of our neural networks are implemented in PyTorch [11], with the autoregressive network implemented using Pyro [12]. We used emcee [13] to generate MCMC comparisons, and ChainConsumer [14] to produce corner plots.
Notation: The various vector spaces, along with their dimensionalities are given in table I.

II. PRELIMINARIES
In this section we review important deep learning concepts, and we discuss them in the context of gravitational-wave parameter estimation. The first two subsections describe ideas that have already been implemented for parameter estimation, in [1] and [2], respectively, and the last two describe the autoregressive flows that we explore in this work.
A. Neural network models of gravitational-wave posteriors Suppose we have a posterior distribution p true (x|y). Our aim is to train a neural network to give an approximation p(x|y) to p true (x|y). The "true" posterior is itself a model for the physical system. For gravitational waves, it is defined through Bayes' theorem in terms of a prior p true (x) over the system parameters x and a likelihood p true (y|x). The likelihood depends on a choice of waveform model h and a measured noise power spectral density S n (f ): it is the probability density that the residual n = y − h(x) is drawn from a stationary Gaussian noise distribution with PSD S n (f ). For further details on the noise model and likelihood, see, e.g., [15].
For stationary Gaussian noise and known h(x), it is trivial to sample from the likelihood. In contrast, the "inverse problem" of sampling from the posterior-sampling over parameters x-requires an algorithm such as MCMC and many evaluations of the waveform model and the likelihood. This repeated comparison between model waveforms and data is computationally expensive, and for this reason we wish to develop the neural network model, p(x|y).
The first step in developing the model is to parametrize the posterior in terms of a neural network. For now, we take as our model a multivariate normal distribution [1], although our main interest later will be in defining more flexible models. That is, we take where the mean µ(y) and covariance matrix Σ(y) are functions of data y defined by a feedforward neural network.
(To ensure Σ(y) is positive definite and symmetric, it is in practice better to take the Cholesky decomposition, Σ(y) = A(y)A(y) , where A(y) is lower-triangular with positive diagonal entries. A(y) is then modeled with the neural network.) Feedforward neural networks can have a variety of configurations, but the simplest consists of a sequence of fully-connected layers. The output of the first hidden layer (of dimension d 1 ) is where W 1 is a d 1 × m matrix, and b 1 is a d 1 -dimensional vector. σ is an element-wise nonlinearity, typically taken to be a Rectified Linear Unit (ReLU), The output h 1 is then passed through a second hidden layer, and so on, for as many hidden layers as desired. A final affine transformation W p h p−1 +b p is then applied, and the outputs repackaged into a vector µ and lower-triangular matrix A. A suitable nonlinearity should be applied to ensure positive diagonal components of A, but otherwise the components of µ and A should be unconstrained. The weight matrices W i and bias vectors b i are initialized randomly (see, e.g., [16]) and then trained through stochastic gradient descent to optimize a loss function.
With the posterior defined, the loss function should be chosen so that after training, p(x|y) is a good approximation to p true (x|y). We therefore take it to be the expectation value (over y) of the cross-entropy between the two distributions, i.e., we maximize the likelihood over the network parameters (the weights and biases) that the training data came from the model. Minimizing L is equivalent to minimizing the expectation value of the Kullback-Leibler (KL) divergence between the true and model posteriors, since p true (x|y) is fixed. The loss function in the form (7) is actually not ideal for our purposes. The reason is that it requires sampling from p true (x|y)-a very costly operation. Instead, as pointed out by [1,2], we can use Bayes' theorem in a very advantageous way, to write To train the network now requires sampling from the likelihood, not the posterior. On a minibatch of training data of size N , we approximate where x (i) ∼ p true (x) and y (i) ∼ p true (y|x (i) ). The loss function can be explicitly evaluated using the expression (3). The gradient of L with respect to the network parameters can be calculated using backpropagation (i.e., the chain rule), and the network optimized with stochastic gradient descent. The training set consists of parameter samples x and waveforms h(x); random noise realizations are added at train time to obtain data samples y.

B. Variational autoencoders
One way to increase the flexibility of the model is to introduce latent variables z, and define the gravitationalwave posterior by first sampling from a prior over z, and then from a distribution over x, conditional on z. In other words, p(x|y) = dz p(x|z, y)p(z|y).
If one takes p(x|z, y) and p(z|y) to both be multivariate Gaussians, then p(x|y) is a Gaussian mixture of Gaussians.
In this way one can describe a more flexible posterior. At first glance it is not clear how to train such a model: the posterior (10) is intractable, since evaluation involves marginalizing over z. If one knew the posterior 1 p(z|x, y), then could be evaluated directly, but p(z|x, y) is also intractable. A (conditional on y) variational autoencoder [4,5] is a deep learning tool for treating such a latent variable model. It introduces a recognition (or encoder ) model q(z|x, y), which is an approximation to the posterior p(z|x, y). As with the first two networks, the recognition network should have tractable density and be easy to sample, e.g., a multivariate Gaussian. One can then take the expectation (over z) of the logarithm of the posterior, log p(x|y) = E q(z|x,y) log p(x|y) where the last term is the KL divergence, Since this is nonnegative, L is known as the variational lower bound on log p(x|y). If q(z|x, y) is identical to p(z|x, y), then L = log p(x|y). The variational autoencoder maximizes the expectation of L over the true distribution. The associated loss function can be written 1 The variational posterior p(z|x, y) should not be confused with the gravitational-wave posterior p(x|y). It should be clear from context to which posterior distribution we are referring.
The three networks, p(z|y), p(x|z, y), and q(z|x, y), are trained simultaneously. To evaluate the loss, the outer two expectation values are treated the same as in the previous subsection. The inner expectation value is evaluated using a Monte Carlo approximation, typically with a single sample from q(z|x, y). For Gaussian q(z|x, y) and p(z|y) the KL divergence term may be calculated analytically; otherwise, a single Monte Carlo sample suffices. For training, it is necessary to take the gradient of the loss with respect to network parameters. The stochasticity of the Monte Carlo integral estimates must, therefore, be carefully treated. This can be done by using the reparametrization trick [4,5], namely by treating the random variable as an additional input to the network drawn from a fixed distribution. For example, if q(z|x, y) is multivariate Gaussian with mean µ(x, y) and Cholesky matrix A(x, y), then is a sample from q(z|x, y). With this trick, one can now take the gradient of z with respect to network parameters. This setup is called a variational autoencoder because the first term in the loss function has the form of an autoencoder. The recognition network q(z|x, y) is known as the encoder, and p(x|z, y) as the decoder. This first term (the reconstruction loss) is minimized if x is recovered after being encoded into z and then decoded. The other term in the loss function (the KL loss) pushes the encoder to match the prior p(z|y) and acts as a regulator. When the variational autoencoder works as an autoencoder, the latent variables z can give a useful low-dimensional representation of x.
The recent work [2] used a CVAE with diagonal Gaussian networks to model gravitational-wave posterior distributions, achieving excellent results over the four parameters (m 1 , m 2 , t c , d L ). In the following two subsections we describe the use of masked autoregressive flows to build even more general distributions.

C. Masked autoregressive flows
In this subsection we review the concept of a masked autoregressive flow, a type of normalizing flow that we use in our work to map simple distributions into more complex ones. We refer the reader to [8] for a much more in-depth discussion.
Consider a probability density p(x). Without any loss of generality, this may be written using the chain rule as An autoregressive model restricts each conditional distribution in the product to have a particular parametrized form. We will take this to be univariate normal [8], for i = 1, . . . , n. In [7] it was observed that an autoregressive model defines a normalizing flow. In other words, suppose u ∼ N (0, 1) n . Then gives a sample from p(x). This mapping f : u → x is defined recursively in (18), but the inverse mapping, is nonrecursive. Because f is autoregressive, the Jacobian determinant is very simple, Hence, f defines a normalizing flow. Starting from a simple base distribution, the change of variables rule (2) can be used to evaluate the density p(x). An autoregressive flow may be modeled by a neural network with masking [9]. That is, one starts with a fully connected network with n inputs, n outputs, and several hidden layers to define f , but then carefully sets certain connections to zero such that the autoregressive property holds. This defines a MADE network [9]. Because the inverse direction is nonrecursive, this requires just a single pass through the network [7]. To map in the forward direction requires n passes. During training it is, therefore, preferable to only require inverse passes. For gravitational waves, the parameter space has reasonably low dimensionality, so even the forward pass is not too expensive.
With a single MADE network, the first component x 1 is independent of all other components, and follows a fixed Gaussian distribution. To achieve sufficient generality, it is necessary to stack several MADE networks in sequence, permuting the order of the components between each pair [7]. This is called a masked autoregressive flow (MAF) [8]. For stability during training it is also useful to insert batch normalization layers [17] between the MADE layers, and between the base distribution and the first MADE layer; in the context of a flow, these also contribute to the Jacobian [8]. MAFs and related networks have been used to model very complex distributions over high-dimensional spaces, including audio [18] and images [19].
In the context of gravitational-wave parameter estimation, the sample space is relatively low-dimensional. To model the posterior distribution, however, each MADE flow unit should be made conditional on the (highdimensional) strain data y, while maintaining the autoregressive property over x. We can then take a standard normal base distribution, and flow it through all the MADE and batch normalization layers, to obtain the complex posterior. The loss function is the same as in section II A, but with the change of variables rule used to evaluate the density, i.e., where now f denotes the entire sequence of flows in the network. Notice that this involves only the inverse flow f −1 , which is fast to evaluate. This network is fast to train, but somewhat slower to sample.

D. Combined models
More powerful models can be obtained by combining autoregressive flows with the variational autoencoder. Each of the three networks comprising the CVAE-the encoder q(z|x, y), the decoder p(x|z, y), and the prior p(z|y)-can be made more flexible by applying autoregressive flows to their sample spaces. We discuss each of these possibilities in turn. In our experiments, we found that the best performance was achieved when combining all three. In all cases, the CVAE loss function (14) is optimized, with the change of variables rule used to evaluate the component densities.

Encoder with inverse autoregressive flow
Normalizing and autoregressive flows were first proposed as a way to increase the flexibility of the encoder network [6,7]. This is important because the CVAE loss function (14) differs from the desired cross-entropy loss (8) by the expectation of the KL divergence between q(z|x, y) and the intractable p(z|x, y). If this can be made smaller, then the two losses converge, hence q(z|x, y) should be as flexible as possible.
A flexible encoder is also desired to avoid a situation called "posterior collapse." The reconstruction and KL loss terms are in competition during training, and if the KL loss term collapses to zero, the network can fail to autoencode. In this situation, the encoder matches the prior, so it ignores its x input; the latent variables z contain no information about x beyond what is contained in y. This can happen either because the loss gets stuck in an undesired local minimum during training, or the configuration with vanishing KL loss is actually the global minimum [10]. The former can be alleviated by careful training strategies such as annealing the KL loss term [20]. The latter can occur because the use of latent variables incurs a cost related to D KL (q(z|x, y) p(z|x, y)) [10]. If the decoder is sufficiently powerful such that it can model p(x|y) on its own, then p(z|x, y) → p(z|y) and L CVAE → L is the global minimum. The network simply decides it is not worthwhile to use the latent variables.
Thus, a flexible encoder is important for performance and to make full use of latent variables. Since evaluation of L CVAE requires sampling from q(z|x, y), to map the Gaussian into a more complex distribution an inverse autoregressive flow (IAF) should be used [7]. Sampling is fast since the inverse flow (19) does not involve recursion. The density evaluation needed for the KL loss term only needs to take place for z sampled from q(z|x, y), so caching may be used.
Each MADE layer comprising the encoder IAF may be conditioned on x, y, and an optional additional output of the Gaussian encoder, h. (The initial work [7] used only h.) In our experiments we found it most effective to condition only on h and x. It is possible that the high dimensionality of y meant that it overpowered x.

Decoder with masked autoregressive flow
Since the reconstruction loss requires density evaluation of the decoder distribution, one should apply a forward MAF after the Gaussian distribution to increase flexibility of p(x|z, y). The MADE layers may be conditioned on y (as in the basic MAF of section III B 2) and the latent variable z. This powerful decoder increases the risk of posterior collapse, so it is useful to use also the IAF encoder.

Prior with masked autoregressive flow
As described in [10], an autoregressive flow at the end of the prior network p(z|y) effectively adds flexibility to the encoder network and the decoder network. For fast training, one should again use a forward MAF, and its MADE layers should be conditioned on y.
Although a prior MAF and an encoder IAF are very closely related, they differ in that the IAF can be conditioned on x. We found that this had a significant impact on performance, and therefore included both autoregressive flows in our models.

III. NONSPINNING BINARIES
In this section we describe our experiments in using deep learning models to describe nonspinning coalescing black hole binaries. Binaries are described by five parameters: the masses m 1 and m 2 , the luminosity distance d L , the time of coalescence t c , and the phase of coalescence φ 0 . These parameters and their ranges are chosen to facilitate comparison with [2].

A. Training data
Training data for our models consist of pairs of parameters x and strain time series y. Parameters are sampled from a prior distribution p(x), which is uniform over each parameter except for the volumetric prior over d L .
Parameter ranges are taken from [2], We also take m 1 ≥ m 2 . Strain realizations are in the time domain and 1 s long (0 ≤ t ≤ 1 s) with a sampling rate of 1024 Hz. (We found that the sampling rate of 256 Hz of [2] was not sufficient to eliminate Gibbs ringing.) Strain data consist of a waveform h(x), deterministic from parameters x, and random noise n, sampled from the Advanced LIGO [21] Zero Detuned High Power PSD [22]. We took h(x) to be the "+" polarization of IMRPhenomPv2 waveforms [23], which, following [2], we whitened in the frequency domain using the PSD before taking the inverse Fourier transform to the time domain. Finally, we rescaled our waveforms by dividing by √ 2∆t so that in the end, time domain noise is described by a standard normal distribution in each time sample, i.e., n ∼ N (0, 1) m .
The whitening and rescaling procedure serves two purposes: it means that we can easily draw noise realizations at train time by sampling from a standard normal distribution, and it ensures that the input data to the neural network has approximately zero mean and unit variance. The latter condition ("standardization" of data) has been shown to improve training performance. Similarly, we rescale x to have zero mean and unit variance in each component across the training set.
Our dataset consists of 10 6 (x, h) pairs, which we split into 90% training data and 10% validation data. Noise realizations n are sampled at train time to give strain data y = h + n; a sample time series is given in figure 1. The median signal-to-noise ratio (SNR) of our training set is 25.8, and the complete SNR distribution is given in given in figure 2. Our dataset is the same size as that of [2] and a factor of 10 3 smaller than the effective dataset of [1]. Larger datasets are generally preferred to prevent overfitting, but they are also more costly to prepare and store. We would like to build a system that can generalize well with as small a dataset as possible in order to minimize these costs when moving to more complicated and longer waveforms in the future. By keeping track of performance on the validation set when training our models, we found that our training set was sufficiently large to avoid overfitting. We also experimented with a 10 5 -element dataset; this had slightly reduced, but still acceptable, performance.

B. Experiments
We modeled our gravitational-wave posteriors with three types of neural network described in section II: a CVAE (similar to [2]), a MAF, and a CVAE with autoregressive flows appended to the three subnetworks (denoted CVAE+). We selected hyperparameters based on choices in the literature [2,8] and on sweeps through hyperparameter space. Approximate network sizes and training times for each architecture are listed in table II, and final loss functions after training in table III. All of our networks use ReLU nonlinearities. They were trained for 250 epochs with a batch size of 512, using the Adam optimizer [24]. The learning rate was reduced by a factor of two every 80 epochs. Sampling performance results for the three network architectures are collected in figures 3 and 4.

CVAE
Our CVAE network is designed to be similar to that of [2]. The encoder, decoder, and prior distributions are all taken to be diagonal Gaussians, each parametrized by fully connected neural networks with three hidden layers of dimension 2048. The latent space is of dimension l = 8. We used the same initial training rate as [2], of 0.0001.
To train the CVAE, we optimize the variational lower bound loss L CVAE of (14), with a single Monte Carlo sample to estimate the expectation value over q(z|x, y). Since L CVAE is only an upper bound on the cross-entropy loss L, the value of the loss function alone is not entirely indicative of performance. Indeed, when posterior collapse occurs (see section II D 1), the gap between L CVAE and L can vanish; in this case, the value of L is larger than that of a network with the same L CVAE and no posterior collapse. For this reason, it is important to also use other metrics to evaluate performance. For CVAE models we always quote the KL loss as an indication that the latent space is being used.
To encourage use of the latent space, we found it to be beneficial to use KL annealing during training, i.e., we multiplied the KL loss part of L CVAE by a factor between zero and one during the early stages. This reduces the importance of the KL loss term compared to the reconstruction loss. For all CVAE-based moels, we adopted a cyclic KL annealing strategy [25]: for the first 6 epochs we used annealing factors of (10 −5 , 1/3, 2/3, 1, 1, 1), and we repeated this cycle 3 more times; see figure 5. We also ignored the KL loss term whenever it was less than 0.2.
In figure 3a, we show a posterior distribution corresponding to the strain data y of figure 1. This is constructed from N = 5 × 10 4 posterior samples, obtained using formula (10) with the prior and decoder networks as follows: (1) pass y through the prior network to obtain the distribution p(z|y), (2) draw N latent-space samples z (i) from the prior, (3) pass these through the decoder network to obtain N distributions p(x|z (i) , y), and finally (4) draw one sample x (i) from each of these distributions.
By inspection of the posterior, it is clear that the latent space is being used in the model, since the distribution is not a diagonal Gaussian. The distributions for most of the parameters look reasonable, and they cover the true values of the parameters. The phase of coalescence, φ 0 , is, however, not being captured at all. Indeed, φ 0 should be precisely π-periodic because our training set waveforms only contain the (l, m) = (2, 2) mode of the signal. Moreover, since we are taking a single polarization, φ 0 should be resolvable. We can evaluate the performance of the network with a p-p plot [26]. To do this, we compute posterior distributions, each comprised of 10 4 samples, for 10 3 different strain realizations (i.e., y = h(x) + n for x drawn from the prior over parameters and n a noise realization). For each one-dimensional posterior, we then compute the percentile value of the true parameter. For each parameter, the p-p curve is then the cumulative density function of the percentile values. This is shown in figure 4a for the CVAE network. If the CDF is diagonal, then the percentile val-ues are distributed according to a uniform distribution, as one would expect for the true one-dimensional posteriors. We can see that the CVAE appears to capture all of the one-dimensional distributions except for φ 0 .
To confirm that the percentile scores are welldistributed, we also performed a Kolmogorov-Smirnov test. We calculated the KS statistic to compare the distribution of percentile scores to a uniform distribution. We found that φ 0 had miniscule p-value, as expected, the p-value of t c was 0.15, and all other p-values were greater than 0.29. This indicates that all parameters are well-recovered, except for φ 0 .

MAF
Next, we modeled the gravitational-wave posterior p(x|y) directly using a masked autoregressive flow. As described in section III B 2, the MAF network describes the posterior in terms of an invertible mapping f from a five-dimensional standard normal distribution N (0, 1) 5 (u) into the gravitational-wave posterior p(x|y). The flow f (u, y) is autoregressive over u and has arbitrary dependence on strain data y. The MAF network does not involve latent variables, and optimizes the cross-entropy with the data distribution. Thus the loss function L [given in (21)] can be used to directly compare performance for different models.
Our best performing MAF network consists of five MADE units, each with three hidden layers of dimension 512, and conditioned on y. We also inserted batch normalization layers between each pair of MADE units, and between the first MADE unit and the base space. We found this to be important for training stability. We were able to train the MAF successfully at a higher initial learning rate than the CVAE, of 0.0004. Following [8], at the end of each training epoch, before computing the validation loss, we set the stored mean and variance vectors of the batch normalization layers to the mean and variance over the entire training set. (We did this also for the CVAE+ network below.) In figure 3b, we show a corner plot for the same strain data as before. Samples are obtained by first sampling from the standard normal base space, u ∼ N (0, 1) 5 and then applying the the flow f (u, y). All quantities appear to be well-modeled except, again, for φ 0 . 2 This is consistent with the p-p plot shown in figure 4b. The KS statistic p-values were 0.14 for t c , and they were greater than 0.5 for all other parameters (except for φ 0 ). (The p-p plots for all networks are computed from the same strain realizations, so it is consistent to see the same KS statistic p-values for the different networks.) Performance of the MAF network is comparable to that of the CVAE, with (in our case) one eighth the number of trainable parameters. In both cases, all parameters except for φ 0 are well-modeled by the networks. In addition, the final loss values given in table III are very close for both networks, with a slight edge in validation loss for the MAF. However, since L CVAE is an upper bound on L, the cross-entropy loss for the CVAE network may actually be lower than that of the MAF. Indeed, the fact that the CVAE made use of the latent space suggests that this gap is nonzero. Comparison of the training and validation loss functions shows slight overfitting for the CVAE, but none for the MAF.

CVAE+
Finally, we experimented with combinations of CVAE and MAF networks. As described in section II D, all three component distributions of the CVAE can be made more flexible by applying MAF transformations to the diagonal Gaussian distributions, thereby increasing the total modeling power of the network [7,10]. Indeed, appendix A of [7] shows that a single linear autoregressive flow is capable of generalizing a Gaussian distribution from diagonal to generic covariance.
For the combined models, the initial Gaussian distributions (the base spaces for the autoregressive flows) are modeled in the same way as in section III B 1, as fully connected three-hidden-layer networks. However, we reduced the size of hidden layers to 1024 dimensions. We also kept the l = 8 dimensional latent space. We found that best performance was achieved by applying autoregressive flows to all three component distributions as follows: q(z|x, y): We added an IAF after the initial Gaussian encoder. This was made conditional on x and extra 8-dimensional context output h from the initial encoder. We also experimented with conditioning on y, but found this to reduce performance.

p(z|y):
We added a MAF after the initial Gaussian prior network. This was made conditional on y.
p(x|z, y): We added a MAF after the initial Gaussian decoder network. This was conditional on y and z.
In all cases the MAF/IAF parts consist of 5 MADE units (three hidden layers of 512 dimensions each) and batch normalization layers. Although the CVAE+ network contains far more layers than the basic CVAE, it has roughly half the number of free parameters since we reduced the width of the hidden layers in the Gaussian components. Training was performed by optimizing L CVAE , with the change of variables rule used to evaluate the component densities. Sampling was similar to the basic CVAE, but now with the appropriate flows applied after sampling from the Gaussians. We found that optimization was best when we trained the Gaussian distributions at a learning rate of 0.0001, and the autoregressive networks at a higher rate of 0.0004, combining the rates of the previous two subsections. As always, the learning rate was reduced by half every 80 epochs.
It is especially important for the CVAE+ network to apply some sort of KL annealing to encourage use of the latent space. An important difference compared to the basic CVAE of section III B 1 is that now the decoder network is sufficiently powerful to model much of the posterior on its own. Indeed, it is just as flexible as the MAF network of the previous subsection, which produced the posterior in figure 3b, and it is well known that a CVAE will often ignore latent variables if the decoder is sufficiently powerful [10,20]. This strategy combined with the flexible encoder distribution resulted in higher KL validation loss for the CVAE+ network (6.77) than the basic CVAE (4.49) by the end of training; see table III. A plot showing the KL and total loss terms as training proceeds is given in figure 5.
A sample gravitational-wave posterior distribution for CVAE+ is given in figure 3c. In contrast to the simpler models, this captures the periodicity in φ 0 . Sampling was very fast, requiring ≈ 3.3 s to obtain the 5 × 10 4 samples that make up this figure. The p-p plot in figure 4c shows excellent statistical performance, with all KS statistic p-values greater than 0.35.
For validation of our network against standard methods, figure 3c also includes samples obtained using MCMC. We used the emcee ensemble sampler [13] with the same prior and likelihood as defined by our training set construction. There is clear agreement between MCMC sampling and neural network sampling, with slight deviation in the mass posteriors. We expect that this could be reduced with hyperparameter improvements or further training. The KL divergence between the two distributions shown is estimated [27] at  ter χ = (m 1 χ 1z + m 2 χ 2z )/(m 1 + m 2 ). These are derived quantities, with sampling done instead over the parameters listed above. Although posteriors are simpler over the derived parameters, to test our method we chose to sample over parameters with more nontrivial posteriors. Sampling from the larger CVAE+ model used for the extended parameter space is slightly slower than the smaller model of the previous section, now requiring ≈ 1.6 s to obtain 10 4 posterior samples. This is partially due to the larger 16-dimensional latent space: the forward pass through a MAF is recursive, so twice as many passes are required to sample from the variational prior and decoder. Both MAFs also have twice as many layers.

V. CONCLUSIONS
In this work we introduced the use of masked autoregressive flows for improving deep learning of gravitationalwave posterior distributions. These learnable mappings on sample spaces induce transformations on probability distributions, and we used them to increase the flexibility of neural network posterior models. The models that we built can all be rapidly sampled, requiring < 2 s to obtain 10 4 samples.
For nonspinning, quasi-circular binary black holes, and a single gravitational-wave detector (a five-dimensional parameter space) we compared models involving a single MAF, a CVAE, and a CVAE with autoregressive flows applied to the encoder, decoder, and prior networks (CVAE+). We found that the performance of the single MAF and CVAE models were comparable, and that best performance was achieved by the CVAE+ model. The CVAE+ model was able to capture the bimodality in the phase φ 0 , which eluded the other models.
We then considered a larger eight-dimensional parameter space, with aligned spins and nonzero binary inclination angle. With a higher-capacity CVAE+ network, we successfully recovered the posterior distribution over all parameters. This demonstrates that our approach extends to higher-dimensional parameter spaces. Modest increase in network capacity may, however, be required.
Although best performance was achieved with the CVAE+ model, an advantage of models without latent variables (such as the MAF alone) is that it is possible to directly evaluate the probability density function. Since the posterior distribution that the MAF models is normalized, one could then calculate the Bayesian evidence by separately evaluating the likelihood and prior. (In CVAE+, this would require marginalization over z.) Moreover, since the MAF loss function is just the crossentropy loss with the true distribution, this means that loss functions of competing models can be compared directly. It would therefore be worthwhile to also try to improve performance of the basic MAF model.
In contrast to typical applications of these deep learning tools, the space Y on which all of our models are conditioned is of much higher dimensionality than the space X that we are modeling. One way to improve performance further may be to introduce convolutional neural networks to pre-process the strain data y and compress it to lower dimensionality. In the future, when we extend our models to treat longer waveforms, higher sampling rates, and additional detectors, this will become even more important.
Going forward, it will also be important to understand better the uncertainty associated to the neural network itself, particularly in regions of parameter space that are not well covered by training data. Rather than taking the maximum likelihood estimate for neural network parameters, as we did in this work, one can model them as random variables with some probability distribution. This distribution can be learned through variational inference, and ultimately marginalized over, resulting in a slightly broader posterior over system parameters [3,28]. In our work, overfitting was not a problem, but such approaches to neural network uncertainty could be useful in situations where the binary system parameter space is not well covered due to high waveform generation costs.
As the rate of detected gravitational-wave events grows with improved detector sensitivity, methods for rapid parameter estimation will become increasingly desirable. For deep learning approaches to become viable alternatives to standard inference methods, they must be extended to cover the full space of binary system parameters, and to treat longer duration waveforms from multiple detectors with detector PSDs that vary from event to event. The methods discussed in this work bring us closer to these goals.