Variational inference with a quantum computer

Inference is the task of drawing conclusions about unobserved variables given observations of related variables. Applications range from identifying diseases from symptoms to classifying economic regimes from price movements. Unfortunately, performing exact inference is intractable in general. One alternative is variational inference, where a candidate probability distribution is optimized to approximate the posterior distribution over unobserved variables. For good approximations, a flexible and highly expressive candidate distribution is desirable. In this work, we use quantum Born machines as variational distributions over discrete variables. We apply the framework of operator variational inference to achieve this goal. In particular, we adopt two specific realizations: one with an adversarial objective and one based on the kernelized Stein discrepancy. We demonstrate the approach numerically using examples of Bayesian networks, and implement an experiment on an IBM quantum computer. Our techniques enable efficient variational inference with distributions beyond those that are efficiently representable on a classical computer.


I. INTRODUCTION
Probabilistic graphical models describe the dependencies of random variables in complex systems [1]. This framework enables two important tasks: learning and inference. Learning yields a model that approximates the observed data distribution. Inference uses the model to answer queries about unobserved variables given observations of other variables. In general, exact inference is intractable and so producing good approximate solutions becomes a desirable goal. This article introduces approximate inference solutions using a hybrid quantum-classical framework.
Prominent examples of probabilistic graphical models are Bayesian and Markov networks. Applications across many domains employ inference on those models, including in health care and medicine [2,3], biology, genetics, and forensics [4,5], finance [6], and fault diagnosis [7]. These are applications where qualifying and quantifying the uncertainty of conclusions is crucial. The posterior distribution can be used to quantify this uncertainty. It can also be used in other downstream tasks such as determining the likeliest configuration of unobserved variables that best explains observed data.
Approximate inference methods broadly fall into two categories: Markov chain Monte Carlo (MCMC) and variational inference (VI). MCMC methods produce samples from the true posterior distribution in an asymptotic limit [8,9]. VI is a machine learning technique that casts inference as an optimization problem over a parameterized family of probability distributions [10]. If quantum computers can deliver even a small improvement to these methods, the impact across science and engineering could be large.

* marcello.benedetti@cambridgequantum.com
† matthias.rosenkranz@cambridgequantum.com

Initial progress in combining MCMC with quantum computing was achieved by replacing standard MCMC with quantum annealing hardware in the training of some types of Bayesian and Markov networks [11][12][13][14][15][16][17]. Despite promising empirical results, it proved difficult to show if and when a quantum advantage could be delivered. An arguably more promising path towards quantum advantage is through gate-based quantum computers. In this context, there exist algorithms for MCMC with proven asymptotic advantage [18][19][20][21], but they require error correction and other features that go beyond the capability of existing machines. Researchers have recently shifted their focus to near-term machines, proposing new quantum algorithms for sampling thermal distributions [22][23][24][25][26].
The advantage of combining quantum computing with VI has not been explored to the same extent as MCMC. Previous work includes classical VI algorithms using ideas from quantum annealing [27][28][29]. In this work, we turn our attention to performing VI on a quantum computer. We adopt a distinct approach that focuses on improving inference in classical probabilistic models by using quantum resources. We use Born machines, which are quantum machine learning models that exhibit high expressivity. We show how to employ gradient-based methods and amortization in the training phase. These choices are inspired by recent advances in classical VI [30].
In Sec. II we describe VI and its applications. In Sec. III we describe using Born machines to approximate posterior distributions. In Sec. IV we use the framework of operator VI to derive two suitable objective functions, and we employ classical techniques to deal with the problematic terms in the objectives. In Sec. V we demonstrate the methods on Bayesian networks. We conclude in Sec. VI with a discussion of possible generalizations and future work.

II. VARIATIONAL INFERENCE AND APPLICATIONS
It is important to clarify what type of inference we are referring to. Consider a probabilistic model p over some set of random variables, Y. The variables in Y can be continuous or discrete. Furthermore, assume that we are given evidence for some variables in the model. This set of variables, denoted X ⊆ Y, is then observed (fixed to the values of the evidence) and we use the vector notation x to denote a realization of these observed variables. We now want to infer the posterior distribution of the unobserved variables, those in the set Z := Y\X . Denoting these by a vector z, our target is that of computing the posterior distribution p(z|x), the conditional probability of z given x. By definition, the conditional can be expressed in terms of the joint divided by the marginal: p(z|x) = p(x, z)/p(x). Also, recall that the joint can be written as p(x, z) = p(x|z)p(z). Bayes' theorem combines the two identities and yields p(z|x) = p(x|z)p(z)/p(x).
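As a minimal numerical illustration of the identities above, the posterior over a single unobserved variable can be obtained from a joint table by normalizing the row selected by the evidence. The table entries below are illustrative placeholders, not taken from any model in this work:

```python
import numpy as np

# Posterior from a joint table via Bayes' theorem: p(z | x) = p(x, z) / p(x).
# Single binary x and z; the probability values are illustrative placeholders.
p_joint = np.array([[0.30, 0.20],    # row x = 0: p(x=0, z=0), p(x=0, z=1)
                    [0.10, 0.40]])   # row x = 1

x_obs = 1
p_x = p_joint[x_obs].sum()           # marginal p(x = 1) = 0.5
p_post = p_joint[x_obs] / p_x        # posterior p(z | x = 1) = [0.2, 0.8]
```

The normalization constant p(x) is exactly the quantity that becomes intractable when the joint cannot be enumerated, which is the setting VI addresses.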
As described above, the inference problem is rather general. To set the scene, let us discuss some concrete examples in the context of Bayesian networks. These are commonly targeted in inference problems and as such shall be our test bed for this work. A Bayesian network describes a set of random variables with a clear conditional probability structure. This structure is a directed acyclic graph where the conditional probabilities are modeled by tables, by explicit distributions, or even by neural networks. Figure 1a is the textbook example of a Bayesian network for the distribution of binary variables: cloudy (C), sprinkler (S), rain (R), and the grass being wet (W). According to the graph this distribution factorizes as P(C, S, R, W) = P(C)P(S|C)P(R|C)P(W|S, R). A possible inferential question is: what is the probability distribution of C, S, and R given that W = tr? This can be estimated by "inverting" the probabilities using Bayes' theorem: p(C, S, R|W = tr) = p(W = tr|C, S, R)p(C, S, R)/p(W = tr). Figure 1b shows a hidden Markov model relating a time series of asset returns and an unobserved "market regime" (e.g., a booming versus a recessive economic regime). A typical inference task is to detect regime switches by observing asset returns [54]. Figure 1c illustrates a modified version of the "lung cancer" Bayesian network, an example from medical diagnosis (see, e.g., Ref. [3] and the references therein). This network encodes expert knowledge about the relationship between risk factors, diseases, and symptoms. In health care, careful design of the network and algorithm are critical in order to reduce biases, e.g., in relation to health care access [55]. Note that inference in medical diagnosis is often causal instead of associative [3]. Bayesian networks can be interpreted causally and help answer causal queries using Pearl's do-calculus [56]. Finally, Fig. 1d shows a Bayesian network representation of turbo codes, which are error correction schemes used in 3G and 4G mobile communications.
The inference task for the receiver is to recover the original information bits from the information bits and codewords received over a noisy channel [57].

Inference is a computationally hard task in all but the simplest probabilistic models. Roth [58] extended results of Cooper [59] and showed that exact inference in Bayesian networks with discrete variables is #P-complete. Dagum and Luby [60] showed that even approximate inference is NP-hard. Therefore, unless some particular constraints are in place, these calculations are intractable. In many cases one is able to perform a "forward pass" and obtain unbiased samples from the joint, (x, z) ∼ p(x, z) = p(x|z)p(z). However, obtaining unbiased samples from the posterior, z ∼ p(z|x) = p(x, z)/p(x), is intractable due to the unknown normalization constant. One can perform MCMC sampling by constructing an ergodic Markov chain whose stationary distribution is the desired posterior. MCMC methods have nice theoretical guarantees, but they may converge slowly in practice [9]. In contrast, VI is often faster in high dimensions but does not come with such guarantees. The idea in VI is to optimize a variational distribution q by minimizing its "distance" from the true posterior p (see Ref. [10] for an introduction to the topic and Ref. [30] for a review of recent advances).
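For a network as small as the "sprinkler" example of Fig. 1a, the posterior can still be computed by brute-force enumeration of the joint, which is precisely the computation that the hardness results above rule out at scale. A sketch with illustrative conditional probability tables (not the randomly drawn tables used later in the experiments):

```python
import itertools

# Brute-force posterior for the "sprinkler" network of Fig. 1a.
# CPT entries below are illustrative placeholders.
pC = {1: 0.5, 0: 0.5}                                   # p(C)
pS_C = {1: {1: 0.1, 0: 0.9}, 0: {1: 0.5, 0: 0.5}}       # p(S | C): pS_C[c][s]
pR_C = {1: {1: 0.8, 0: 0.2}, 0: {1: 0.2, 0: 0.8}}       # p(R | C): pR_C[c][r]
pW1_SR = {(1, 1): 0.99, (1, 0): 0.90, (0, 1): 0.90, (0, 0): 0.0}  # p(W=1 | S, R)

def joint(c, s, r, w):
    pw = pW1_SR[(s, r)] if w == 1 else 1.0 - pW1_SR[(s, r)]
    return pC[c] * pS_C[c][s] * pR_C[c][r] * pw

# p(C, S, R | W = true) by enumerating all 2^3 latent configurations.
unnorm = {csr: joint(*csr, 1) for csr in itertools.product([0, 1], repeat=3)}
evidence = sum(unnorm.values())                          # p(W = true)
posterior = {csr: v / evidence for csr, v in unnorm.items()}
```

The enumeration has cost exponential in the number of unobserved variables, which is why approximate methods are needed for anything beyond toy networks.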
VI has experienced a resurgence in recent years due to substantial developments. First, generic methods have reduced the amount of analytical calculations required and have made VI much more user friendly (e.g., black-box VI [61]). Second, machine learning has enabled the use of highly expressive variational distributions implemented via neural networks [62], probabilistic programs [63], nonparametric models [64], normalizing flows [65], and others. In contrast, the original VI methods were mostly limited to analytically tractable distributions such as those that could be factorized. Third, amortization methods have reduced the costs by optimizing q θ (z|x) where the vector of parameters θ is "shared" among all possible observations x instead of optimizing individual parameters for each observation. This approach can also generalize across inferences.
Casting the inference problem as an optimization problem comes itself with challenges, in particular when the unobserved variables are discrete. REINFORCE [66] is a generic method that requires calculation of the score ∂ θ log q θ (z|x) and may suffer from high variance. Gumbel-softmax reparameterizations [67,68] use a continuous relaxation that does not follow the exact distribution.
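For concreteness, here is a minimal sketch of the REINFORCE estimator for a single Bernoulli variable; the cost function f is a stand-in, not one of this paper's objectives:

```python
import numpy as np

rng = np.random.default_rng(0)

# Score-function (REINFORCE) estimator for a single Bernoulli variable z with
# parameter theta: d/dtheta E_z[f(z)] = E_z[f(z) * d/dtheta log q_theta(z)].
# The cost f is a stand-in; in VI it would come from the variational objective.
def reinforce_grad(theta, f, n_samples=100_000):
    z = (rng.random(n_samples) < theta).astype(float)     # z ~ Bernoulli(theta)
    # score: d log q / d theta is 1/theta for z = 1 and -1/(1 - theta) for z = 0
    score = z / theta - (1 - z) / (1 - theta)
    return np.mean(f(z) * score)

# Sanity check: E[z] = theta, so the gradient of E[z] w.r.t. theta is 1.
grad = reinforce_grad(0.3, lambda z: z)
```

The estimator is unbiased but, as noted above, its variance can be large; large sample sizes or variance-reduction baselines are typically needed in practice.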
We now show that near-term quantum computers provide an alternative tool for VI. We use the quantum Born machine as a candidate for variational distributions. These models are highly expressive and naturally represent discrete distributions as a result of quantum measurement. Furthermore, the Born machine can be trained by gradient-based optimization.

Figure 2: Probabilistic models used in our methods. The classical model comprises a prior over unobserved discrete variables and a likelihood over observed variables. The quantum model approximates the posterior distribution of the unobserved variables given observed data. All distributions can be sampled efficiently as long as one follows the arrows and uses a suitable computer.

III. BORN MACHINES AS IMPLICIT VARIATIONAL DISTRIBUTIONS
By exploiting the inherent probabilistic nature of quantum mechanics, one can model the probability distribution of classical data using a pure quantum state. This model, based on the Born rule of quantum mechanics, is referred to as the Born machine [69]. Let us consider binary vectors z ∈ {0, 1}^n, where n is the number of variables. The Born machine is a normalized quantum state |ψ(θ)⟩ parameterized by θ that outputs n-bit strings with probabilities q_θ(z) = |⟨z|ψ(θ)⟩|². Here |z⟩ are computational basis states; thus sampling from the above probability distribution boils down to a simple measurement. Other forms of discrete variables can be dealt with by a suitable encoding. When using amortization, the variational distribution requires conditioning on observed variables. To extend the Born machine with this feature, we let x play the role of additional parameters. This yields a pure state with output probabilities q_θ(z|x) = |⟨z|ψ(θ, x)⟩|². Figure 2 shows the relation between the classical model and the quantum model for approximate inference.
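A Born machine over a few qubits can be sketched classically with a statevector simulation. Here a simple product ansatz of RY rotations stands in for the hardware-efficient circuits with entangling layers used later in this work (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)

# Minimal classical statevector sketch of a 2-qubit Born machine:
# a product ansatz of RY rotations applied to |00>, with q(z) = |<z|psi>|^2.
def ry_on_zero(theta):
    # RY(theta)|0> = [cos(theta/2), sin(theta/2)]
    return np.array([np.cos(theta / 2), np.sin(theta / 2)])

def born_probs(thetas):
    psi = np.array([1.0])
    for t in thetas:                       # tensor product over qubits
        psi = np.kron(psi, ry_on_zero(t))
    return np.abs(psi) ** 2                # q(z) over bitstrings 00, 01, 10, 11

probs = born_probs([0.7, 1.3])
samples = rng.choice(len(probs), size=1000, p=probs)  # measurement = sampling q(z)
```

On hardware the probabilities are never written down; one only ever draws samples, which is what makes the model implicit (Sec. below on implicit distributions).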
Born machines have been applied to benchmarking hybrid quantum-classical systems [70][71][72][73], generative modeling [74][75][76], finance [47,48,77], and anomaly detection [78], and they have been proposed for demonstrating quantum advantage [43,79]. These models can be realized in a variety of ways, on both classical and quantum computers. When realized via certain classes of quantum circuits, they are classically intractable to simulate [80]. For example, instantaneous quantum polytime (IQP) circuits are Born machines with O(poly(n)) parameters that yield classically intractable distributions in the average case, under widely accepted complexity-theoretic assumptions [81]. Additional examples of such classically hard circuits are boson sampling [82] and random circuits [83,84]. Thus, quantum Born machines have expressive power beyond that of classical models, including neural networks [85] and, partially, matrix product states [86]. It can be shown that the model remains classically intractable throughout training [43]. We return to this discussion in Appendix A.
A useful way to classify probabilistic models is the following: prescribed models provide an explicit parametric specification of the distribution, whereas implicit models define only the data generation process [87]. Born machines can be effectively regarded as implicit models. It is easy to obtain an unbiased sample as long as we can execute the corresponding circuit on a quantum computer and measure in the computational basis. However, estimating the probability of a given sample to within multiplicative error requires exponential resources.
Implicit models are challenging to train with standard methods. The major challenge is the design of the objective function precisely because likelihoods are "prohibited". Valid objectives involve only statistical quantities (such as expectation values) that can be efficiently estimated from samples [90]. For generative modeling with Born machines, progress has been made towards practical objectives such as moment matching [70], maximum mean discrepancy [91], Stein and Sinkhorn divergences [43], adversarial objectives [92,93], as well as incorporating Bayesian priors on the parameters [85].
In the next section we make some progress towards practical objectives for VI with Born machines. First, we mention some related work. The approaches in Ref. [23] use VI ideas to deal with thermal states and quantum data. In contrast, our inference methods apply to classical graphical models and classical data. The approaches of Refs. [18,94,95] aim to encode Bayesian networks directly in a quantum state, and subsequently perform exact inference. In contrast, our inference methods are approximate, efficient, and apply also to graphical models that are not Bayesian networks.

IV. OPERATOR VARIATIONAL INFERENCE
Operator variational inference (OPVI) [96] is a rather general method that uses mathematical operators to design objectives for the approximate posterior. Suitable operators are those for which (i) the minimum of the variational objective is attained at the true posterior, and (ii) it is possible to estimate the objective without computing the true posterior. In general, the amortized OPVI objective is

L(θ) = sup_{f∈F} E_{x∼p_D(x)} h( E_{z∼q_θ(z|x)} [ (O^{p,q} f)(z) ] ),   (1)

where f(·) ∈ R^d is a test function within the chosen family F, O^{p,q} is an operator that depends on p(z|x) and q(z|x), and h(·) ∈ [0, ∞] yields a non-negative objective. Note that we have an expectation over the data distribution p_D(x). This indicates the average over a dataset D = {x^(i)}_i of observations to condition on. We present two methods that follow directly from two operator choices. These operator choices result in objectives based on the Kullback-Leibler (KL) divergence and the Stein discrepancy. The former utilizes an adversary to make the computation tractable, whereas in the latter the tractability arises from the use of a kernel function.

[Figure 3 caption: Step 1 optimizes a classifier d_φ to output the probabilities that the observed samples come from the Born machine rather than from the prior. Step 2 optimizes the Born machine q_θ to better match the true posterior. The updated Born machine is fed back into step 1 and the process repeats until convergence of the Born machine.]

The KL divergence is an example of an f-divergence, while the Stein discrepancy is in the class of integral probability metrics; these are two fundamentally different families of probability distance measures, and OPVI can yield methods from both under a suitable choice of operator. As shown in Ref. [97], the two families intersect nontrivially only at the total variation distance (TVD). It is for this reason that we choose the TVD as our benchmark in later numerical results. Integral probability metrics [43,91] and f-divergences [70,73] have both been used to train Born machines for the task of generative modeling, and we leave a more thorough incorporation of these methods to future work. Furthermore, these works demonstrate that the choice of training metric has a significant effect on the trainability of the model. It is for this reason that we adopt the OPVI framework in generality, rather than focusing on any specific instance of it. This allows switching between different metrics depending on need.

A. The adversarial method
One possible objective function for VI is the KL divergence of the true posterior relative to the approximate one. This is obtained from Eq. (1) by choosing h to be the identity function and by choosing the operator (O^{p,q} f)(z) = log[q(z|x)/p(z|x)] for all f:

L_KL(θ) = E_{x∼p_D(x)} E_{z∼q_θ(z|x)} [ log( q_θ(z|x) / p(z|x) ) ].   (2)

Here q_θ is the variational distribution parameterized by θ. To avoid singularities we assume that p(z|x) > 0 for all x and all z. The objective's minimum is zero and is attained when q_θ(z|x) = p(z|x) for all x.
There are many ways to rewrite this objective. Here we focus on the form known as prior contrastive. Substituting Bayes' formula p(z|x) = p(x|z)p(z)/p(x) in the objective above, we obtain

L_KL(θ) = E_{x∼p_D(x)} E_{z∼q_θ(z|x)} [ log( q_θ(z|x) / p(z) ) − log p(x|z) ],   (3)

where we ignore the term E_{x∼p_D(x)}[log p(x)] since it is constant with respect to q. This objective has been used in many generative models, including the celebrated variational autoencoder (VAE) [62,98]. While the original VAE relies on tractable variational posteriors, one can also use more powerful, implicit distributions, as demonstrated in adversarial variational Bayes [99] and in prior-contrastive adversarial VI [100]. Let us use a Born machine to model the variational distribution q_θ(z|x) = |⟨z|ψ(θ, x)⟩|². Since this model is implicit, the ratio q_θ(z|x)/p(z) in Eq. (3) cannot be computed efficiently. Therefore, we introduce an adversarial method for approximately estimating this ratio. The key insight is that the ratio can be estimated from the output of a binary classifier [87]. Suppose that we ascribe samples (z, x) ∼ q_θ(z|x)p_D(x) to the first class, and samples (z, x) ∼ p(z)p_D(x) to the second class. A binary classifier d_φ parameterized by φ outputs the probability d_φ(z, x) that the pair (z, x) belongs to the first class; hence 1 − d_φ(z, x) is the probability that it belongs to the second. There exist many possible choices of objective function for the classifier [87]. In this work we consider the cross entropy

max_φ E_{x∼p_D(x)} { E_{z∼q_θ(z|x)} [ log d_φ(z, x) ] + E_{z∼p(z)} [ log(1 − d_φ(z, x)) ] }.   (4)

The optimal classifier that maximizes this objective is [99]

d_{φ*}(z, x) = q_θ(z|x) / ( q_θ(z|x) + p(z) ).   (5)

Since the probabilities in Eq. (5) are unknown, the classifier must be trained on a dataset of samples. This does not pose a problem because samples from the Born machine q_θ(z|x) and the prior p(z) are easy to obtain by assumption.
Once the classifier is trained, the logit transformation provides the log odds of a data point coming from the Born machine joint q_θ(z|x)p_D(x) versus the prior joint p(z)p_D(x). The log odds approximate the log ratio of the two distributions,

log[ d_φ(z, x) / (1 − d_φ(z, x)) ] ≈ log[ q_θ(z|x) / p(z) ],   (6)

which is exact if d_φ is the optimal classifier in Eq. (5). This allows us to avoid computing the problematic term in the KL divergence. Substituting this result into Eq. (3) and ignoring the constant term, the final objective for the Born machine is

L(θ) = E_{x∼p_D(x)} E_{z∼q_θ(z|x)} [ log( d_φ(z, x) / (1 − d_φ(z, x)) ) − log p(x|z) ].   (7)

The optimization is performed in tandem: the classifier's cross-entropy objective is maximized with gradient ascent while Eq. (7) is minimized with gradient descent. It can be shown [99,100] that the gradient of log q_θ(z|x) with respect to θ vanishes in expectation. Thus, under the assumption of an optimal classifier d_φ, the gradient of Eq. (7) simplifies significantly. This gradient is derived in Appendix B.
A more intuitive interpretation of the procedure just described is as follows. The log likelihood in Eq. (3) can be expanded as log p(x|z) = log[p(z|x)/p(z)] + log p(x).
Then Eq. (3) can be rewritten as

L_KL(θ) = E_{x∼p_D(x)} E_{z∼q_θ(z|x)} [ log( q_θ(z|x) / p(z) ) − log( p(z|x) / p(z) ) ],

revealing the difference between two log odds. The first is given by the optimal classifier in Eq. (6) for the approximate posterior and prior. The second is given by a hypothetical classifier for the true posterior and prior. The adversarial method is illustrated in Fig. 3.
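The classifier-based ratio estimation at the heart of this method can be illustrated on a single binary variable, where the optimal classifier of Eq. (5) can be estimated directly from sample counts rather than by training an MLP (an illustrative simplification):

```python
import numpy as np

rng = np.random.default_rng(2)

# Density-ratio trick on a single binary variable (illustrative; no MLP).
# The optimal classifier between q and p is d*(z) = q(z) / (q(z) + p(z)),
# and its logit recovers log q(z)/p(z). Here d* is estimated from counts.
q1, p1 = 0.8, 0.3                                  # q(z=1) and p(z=1)
zq = (rng.random(200_000) < q1).astype(int)        # samples from q (class 1)
zp = (rng.random(200_000) < p1).astype(int)        # samples from p (class 2)

d = np.zeros(2)
for z in (0, 1):
    n_q = np.sum(zq == z)
    n_p = np.sum(zp == z)
    d[z] = n_q / (n_q + n_p)                       # empirical optimal classifier
log_ratio = np.log(d / (1 - d))                    # logit approximates log q(z)/p(z)
```

With many samples the logit converges to log q(z)/p(z), which is exactly the term that cannot be computed directly for an implicit model.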

B. The kernelized method
Another possible objective function for VI is the Stein discrepancy (SD) of the true posterior from the approximate one. This is obtained from Eq. (1) by assuming that the image of f has the same dimension as z, choosing h to be the absolute value, and choosing O^{p,q} to be a Stein operator. A Stein operator is independent of q and is characterized by having zero expectation under the true posterior for all functions f in the chosen family F. For binary variables, a possible Stein operator is

(O^p f)(z) = s_p(z|x)^⊤ f(z) − tr[∆f(z)],   (9)

where s_p(z|x), with components (s_p(z|x))_i = ∆_{z_i} p(z|x) / p(z|x), is the difference score function. The partial difference operator is defined for any scalar function g on binary vectors as

∆_{z_i} g(z) = g(z) − g(¬_i z),   (10)

where ¬_i z flips the ith bit in binary vector z. We have also defined ∆f(z) as the matrix with entries (∆f(z))_{ij} = ∆_{z_i} f_j(z). Under the assumption that p(z|x) > 0 for all x and all z, we show in Appendix C that the resulting objective is the Stein discrepancy

SD(q_θ, p) = sup_{f∈F} | E_{z∼q_θ(z|x)} [ s_p(z|x)^⊤ f(z) − tr ∆f(z) ] |.   (11)

At this point one can parameterize the test function f and obtain an adversarial objective similar in spirit to that presented in Sec. IV A. Here, however, we take a different route. If we take F to be a reproducing kernel Hilbert space H of vector-valued functions and restrict the Hilbert space norm of f to be at most one (‖f‖_H ≤ 1), the supremum in Eq. (11) can be calculated in closed form using a kernel [101]. This is achieved using the "reproducing" property of kernels, f(z) = ⟨k(z, ·), f(·)⟩_H, substituting into Eq. (11), and explicitly solving for the supremum (the explicit form of this supremal f* is not illuminating for our discussion; see Ref. [101] for details). Inserting this f* into the SD results in the kernelized Stein discrepancy (KSD)

KSD(q_θ, p) = sqrt( E_{z,z'∼q_θ(z|x)} [ κ_p(z, z'|x) ] ),   (12)

where κ_p is a so-called Stein kernel. For binary variables, this can be written as

κ_p(z, z'|x) = s_p(z|x)^⊤ k(z, z') s_p(z'|x) − s_p(z|x)^⊤ ∆_{z'} k(z, z') − [∆_z k(z, z')]^⊤ s_p(z'|x) + tr[∆_{z,z'} k(z, z')].   (13)

Here ∆_z k(z, ·) is the vector of partial differences with components ∆_{z_i} k(z, ·), ∆_{z,z'} k(z, z') is the matrix produced by applying ∆_z to each component of the vector ∆_{z'} k(·, z'), and tr(·) is the usual trace operation on this matrix. The expression κ_p(z, z'|x) results in a scalar value.
Equation (12) shows that the discrepancy between the approximate and true posteriors can be calculated from samples of the approximate posterior alone. Note that the Stein kernel κ_p depends on another kernel k. For n unobserved Bernoulli variables, one possibility is the generic Hamming kernel k(z, z') = exp(−‖z − z'‖_1 / n). The KSD is a valid discrepancy measure if this "internal" kernel k is positive definite, which is the case for the Hamming kernel [101].
In summary, constraining ‖f‖_H ≤ 1 in Eq. (11) and substituting the KSD, we obtain

L_KSD(θ) = E_{x∼p_D(x)} [ KSD(q_θ, p) ],   (14)

and the problem consists of finding min_θ L_KSD(θ). The gradient of Eq. (14) is derived in Appendix D. The kernelized method is illustrated in Fig. 4.
The KSD was used in Ref. [43] for generative modeling with Born machines. In that context the distribution p is unknown; thus, the authors derived methods to approximate the score s p from available data. VI is a suitable application of the KSD because in this context we do know the joint p(z, x). Moreover the score in Eq. (9) can be computed efficiently even if we have the joint in unnormalized form.
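The kernelized quantities above can be sketched numerically. The following computes the difference score, the Hamming kernel, and the Stein kernel κ_p for an illustrative distribution over two bits, and checks the defining property that the squared KSD vanishes exactly when q = p:

```python
import itertools
import numpy as np

# Sketch of the kernelized Stein discrepancy for binary vectors, using the
# difference score s_p(z)_i = (p(z) - p(flip_i z)) / p(z) and a Hamming base
# kernel. The distribution p below is an illustrative table over n = 2 bits,
# not a posterior from this work.
n = 2
states = [np.array(s) for s in itertools.product([0, 1], repeat=n)]
p = {tuple(z): w for z, w in zip(states, [0.1, 0.2, 0.3, 0.4])}

def flip(z, i):
    z2 = z.copy()
    z2[i] ^= 1
    return z2

def k(z, zp):  # Hamming kernel k(z, z') = exp(-|z - z'|_1 / n)
    return np.exp(-np.abs(z - zp).sum() / n)

def score(z):  # difference score function of p
    return np.array([(p[tuple(z)] - p[tuple(flip(z, i))]) / p[tuple(z)]
                     for i in range(n)])

def stein_kernel(z, zp):  # kappa_p(z, z') built from the Hamming base kernel
    dz = np.array([k(z, zp) - k(flip(z, i), zp) for i in range(n)])
    dzp = np.array([k(z, zp) - k(z, flip(zp, j)) for j in range(n)])
    tr = sum(k(z, zp) - k(flip(z, i), zp) - k(z, flip(zp, i))
             + k(flip(z, i), flip(zp, i)) for i in range(n))
    return score(z) @ score(zp) * k(z, zp) - score(z) @ dzp - dz @ score(zp) + tr

def ksd_squared(q):  # exact E_{z,z' ~ q}[kappa_p]; zero iff q = p
    return sum(q[tuple(z)] * q[tuple(zp)] * stein_kernel(z, zp)
               for z in states for zp in states)
```

By construction the Stein kernel has zero expectation under p, so ksd_squared(p) vanishes to machine precision, while any q ≠ p yields a strictly positive value; in practice the expectation is replaced by an average over samples from the Born machine.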

V. EXPERIMENTS
To demonstrate our approach, we present three experiments in the following sections. First, we validate both methods using the canonical "sprinkler" Bayesian network. Second, we consider continuous observed variables in a hidden Markov model (HMM) and illustrate how multiple observations can be incorporated effectively via amortization. Finally, we consider a larger model, the "lung cancer" Bayesian network, and demonstrate our methods using an IBM quantum computer.
A. Proof of principle with the "sprinkler" Bayesian network

As a first experiment, we classically simulate the methods on the "sprinkler" network in Fig. 1a, one of the simplest possible examples. The purpose is to show that both methods are expected to work well even when using limited resources and without much fine tuning. First, we randomly generate the entries of each probability table from the uniform distribution U([0.01, 0.99]) and produce a total of 30 instances of this network. For each instance, we condition on "Grass wet" being true, which means that our data distribution is p_D(W) = δ(W = tr). We infer the posterior of the remaining three variables using the three-qubit version of the hardware-efficient ansatz shown in Fig. 5. We use a layer of Hadamard gates as state preparation, S(x) = H ⊗ H ⊗ H, and initialize all parameters to approximately 0. Such hardware-efficient ansätze, while simple, have been shown to be vulnerable to barren plateaus [103][104][105][106], i.e., regions of exponentially vanishing gradient magnitudes that make training untenable. Alternatively, one could consider ansätze that have been shown to be somewhat "immune" to this phenomenon [107,108], but we leave such investigation to future work. For the KL objective, we use a multilayer perceptron (MLP) classifier made of three input units, six hidden rectified linear units, and one sigmoid output unit. The classifier is trained with a dataset of 100 samples from the prior p(C, R, S) and 100 samples from the Born machine q(C, R, S|W = tr). Here we use stochastic gradient descent with batches of size ten and a learning rate of 0.03. For the KSD objective, we use the Hamming kernel as above. For both methods, the Born machine is trained using 100 samples to estimate each expectation value for the gradients, using vanilla gradient descent with a small learning rate of 0.003. We compute the TVD between the true and approximate posteriors at each epoch.
We emphasize that the TVD cannot be computed efficiently in general; we report it only for small examples. Figure 6a (Fig. 6b) shows the median TVD over the 30 instances for 1000 epochs of training with the KL (KSD) objective. A shaded area indicates a 68% confidence interval obtained by bootstrapping. Increasing the number of layers leads to better approximations of the posterior in all cases.
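The TVD benchmark is straightforward to evaluate for distributions small enough to enumerate; a minimal sketch:

```python
import numpy as np

# Total variation distance TVD(p, q) = (1/2) * sum_z |p(z) - q(z)| over a
# common finite support; tractable only when the distributions can be fully
# enumerated, as in the small examples above.
def tvd(p, q):
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

d = tvd([0.5, 0.5], [0.8, 0.2])   # -> 0.3
```

The factor 1/2 normalizes the distance to lie in [0, 1], with 0 for identical distributions and 1 for distributions with disjoint support.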
We compare the representation power of Born machines to that of fully factorized posteriors (e.g., those used in the naïve mean-field approximation). We perform an exhaustive search for the best fully factorized distribution q(z) = ∏_i q_i(z_i).

Optimizing the KL objective is faster than the KSD objective. This is because training the MLP classifier in Eq. (4) has cost linear in the dimension of the input, while calculating the Stein kernel in Eq. (13) has quadratic cost. However, training the MLP classifier may still be nontrivial in some instances, as we show in Appendix E. Furthermore, at large scale the performance of the classifier may suffer in quality, if kept efficient, due to the intractability of computing the KL divergence [109]. In contrast, the kernelized method is sample efficient at all scales. However, for the small problem sizes used here, the adversarial method gives better results on average, so we choose it as the primary method in all subsequent experiments.

B. Continuous observed variables and amortization with a hidden Markov model
In our second experiment, we simulate the adversarial VI method on the HMM in Fig. 1b. The purpose is to demonstrate two interesting features: continuous observed variables and amortization. We set up the HMM for T = 8 time steps (white circles in Fig. 1b), each represented by a Bernoulli latent variable whose distribution is conditioned on the latent variable at the previous time step. These represent the unknown "regime" at time t. The regime affects how the observable data are generated. We

[Figure 6 caption (fragment): ... the ansatz of Fig. 5. Each instance is a "sprinkler" Bayesian network as in Fig. 1a where the conditional probabilities are chosen at random.]
use Gaussian observed variables (green circles in Fig. 1b) whose mean and standard deviation depend on the latent variables. We sample two time series observations x^(1), x^(2) from the HMM and take these to be our data distribution. The time series are shown by green dots in Fig. 7. Instead of fitting two approximate posteriors separately, we use a single Born machine with amortization, |ψ(θ, x)⟩. We use the ansatz in Fig. 5 for eight qubits with a state preparation layer S(x) = ⊗_{t=1}^{8} R_x(x_t). In practice, this encoding step can be done under more careful considerations [110][111][112]. The parameters θ are initialized to small random values. We optimize the KL objective with an MLP classifier with 16 inputs, 24 rectified linear units, and one sigmoid unit, and the system is trained for 3000 epochs. The learning rates are set to 0.006 for the Born machine and 0.03 for the MLP. The Born machine uses 100 samples to estimate each expectation value; the MLP uses 100 samples from each distribution and minibatches of size ten.
The histograms in Fig. 7 show the ten most probable configurations of latent variables under the true posterior along with the probabilities assigned by the Born machine. Conditioning on data point x^(1), the inferred most likely explanation is |01100011⟩ (i.e., the mode of the Born machine). This coincides with the true posterior mode. For x^(2), the inferred mode is |10001000⟩, which differs from the true posterior mode |10000000⟩ by a single bit. The regime switching has therefore been modeled with high accuracy.
Rather than focusing on the mode, one can use the whole distribution to estimate some quantity of interest. This is done by taking samples from the Born machine and using them in a Monte Carlo estimate of that quantity. For example, we could predict the expected value of the next latent variable z_{T+1} given the available observations; for data point x^(1), this entails estimating E[z_{T+1} | x^(1)].

To conclude, we mention that inference in this HMM can be performed exactly and efficiently with classical algorithms. The most likely explanation for the observed data can be found with the Viterbi algorithm in time O(T|S|²), where T is the length of the sequence and |S| is the size of the state space of the unobserved variables (e.g., |S| = 2 for Bernoulli variables). This is not the case for more general models. For example, a factorial HMM has multiple independent chains of unobserved variables [113]. Exact inference costs O(T M |S|^{M+1}), where M is the number of chains, and is typically replaced by approximate inference [113,114]. Our VI methods are generic and apply to factorial and other HMMs without changes.
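A minimal Viterbi sketch for a binary-state HMM, matching the O(T|S|²) cost quoted above (transition and emission probabilities are illustrative placeholders, not the parameters of the experiment):

```python
import numpy as np

# Minimal Viterbi decoder for a discrete-state HMM; O(T * S^2) time.
def viterbi(log_init, log_trans, log_emit):
    """log_init: (S,); log_trans[i, j] = log p(s_j | s_i); log_emit: (T, S)."""
    T, S = log_emit.shape
    delta = log_init + log_emit[0]                 # best log-score ending in each state
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans        # (from-state, to-state)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                  # backtrack the best path
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Illustrative run: emissions strongly identify the latent state at each step.
log_emit = np.log(np.array([[0.9, 0.1], [0.1, 0.9], [0.9, 0.1]]))
path = viterbi(np.log([0.5, 0.5]), np.log(np.full((2, 2), 0.5)), log_emit)  # -> [0, 1, 0]
```

For a factorial HMM the state space grows as |S|^M, which is what makes exact decoding impractical and motivates approximate inference instead.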
C. Demonstration on IBMQ with the "lung cancer" Bayesian network

As a final experiment, we test the performance on real quantum hardware using the slightly more complex "lung cancer" network [53], specifically on the five-qubit ibmq_rome quantum processor. We access this device using PyQuil [115] and tket [116].
The lung cancer network (also called the "Asia" network) is an example of a medical diagnosis Bayesian network, illustrated in Fig. 1c. This network was chosen since it is small enough to fit onto the quantum device without adaptation, while coming closer to a "real-world" example. A more complicated version with extra variables can be found in Ref. [117]. We note that the version we use is slightly modified from that of Ref. [53] for numerical reasons.
The network has eight variables: two "symptoms" for whether the patient presented with dyspnoea (D) (shortness of breath), or had a positive x ray (X); four possible "diseases" causing the symptoms, bronchitis (B), tuberculosis (T), lung cancer (L), or an "illness" (I) (which could be either tuberculosis or lung cancer or something other than bronchitis); and two possible "risk factors" for whether the patient had traveled to Asia (A), or whether they had a history of smoking (S). Based on the graph structure in Fig. 1c, the distribution over the variables factorizes as p(A, T, S, L, B, I, X, D) = p(A) p(T|A) p(S) p(L|S) p(B|S) p(I|T, L) p(X|I) p(D|I, B). In Appendix F we show the explicit probability table we use (Table I) for completeness. Modifying an illustrative example of a potential "real-world" use case in Ref. [53], a patient may present in a clinic with an "illness" (I = tr) but no shortness of breath (D = fa) as a symptom. Furthermore, an x ray has revealed a negative result (X = fa). However, we have no patient history and we do not know which underlying disease is actually present. As such, in the experiment, we condition on having observed the "evidence" variables, x: X, D, and I. The remaining five are the latent variables, z, so we require five qubits. In Fig. 8, we show the results when using the five-qubit ibmq_rome quantum processor, which has a linear topology. The topology is convenient since it introduces no major overheads when compiling from our ansatz in Fig. 5. We plot the true posterior versus the one learned by the Born machine, both simulated and on the quantum processor, using the best parameters found (after about 400 epochs simulated and about 50 epochs on the processor). To train in both cases, we use the same classifier as in Sec. V B with five inputs and ten hidden rectified linear units using the KL objective, which we observed to give the best results. We employ the two-layer ansatz of Fig. 5 (L = 2), and 1024 shots are taken from the Born machine in all cases.
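The "true posterior" against which the Born machine is compared can be computed by brute-force enumeration for networks of this size. The sketch below does this on a toy three-node chain S → L → X; the CPT values are illustrative placeholders, not those of the paper's Table I.

```python
import itertools

# Illustrative CPTs (NOT the paper's Table I): a 3-node chain S -> L -> X.
p_S = {0: 0.7, 1: 0.3}                                # p(S): smoking history
p_L = {0: {0: 0.99, 1: 0.01}, 1: {0: 0.9, 1: 0.1}}    # p(L | S): lung cancer
p_X = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.2, 1: 0.8}}    # p(X | L): positive x ray

def joint(s, l, x):
    """Joint probability from the chain factorization p(S) p(L|S) p(X|L)."""
    return p_S[s] * p_L[s][l] * p_X[l][x]

def posterior(x_obs):
    """Exact p(S, L | X = x_obs) by enumerating all latent configurations."""
    unnorm = {(s, l): joint(s, l, x_obs)
              for s, l in itertools.product((0, 1), repeat=2)}
    z = sum(unnorm.values())
    return {k: v / z for k, v in unnorm.items()}

post = posterior(1)  # condition on a positive x ray
```

For the eight-variable network the same enumeration runs over 2^5 latent configurations, which is why the exact posterior in Fig. 8 is available as a reference.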
We observe that the simulated Born machine is able to learn the true posterior very well with these parameters, but the performance suffers on the hardware. That being said, the trained hardware model is still able to pick three of the four highest-probability configurations of the network (shown on the x axis of Fig. 8). The hardware results could likely be improved with error mitigation techniques.

Figure 8: Histograms of the true versus the learned posteriors with a simulated and a hardware-trained Born machine for the "lung cancer" network in Fig. 1c. Here we use the ibmq_rome quantum processor (whose connectivity is shown in the inset). We condition on X = fa, D = fa, I = tr. The x axis shows the configuration of observed (gray represents X, D, I) and unobserved variables (blue represents A, S, T, L, B) corresponding to each probability (filled circles = tr, empty circles = fa). For hardware results, the histogram is generated using 1024 samples.

VI. DISCUSSION
We present two variational inference methods that can exploit highly expressive approximate posteriors given by quantum models. The first method is based on minimizing the Kullback-Leibler divergence to the true posterior and relies on a classifier that estimates probability ratios. The resulting adversarial training may be challenging due to the large number of hyperparameters and well-known stability issues [118,119]. On the other hand, it only requires the ability to (i) sample from the prior p(z), (ii) sample from the Born machine q θ (z|x), and (iii) calculate the likelihood p(x|z). We can apply the method as is when the prior is implicit. If the likelihood is also implicit, we can reformulate Eq. (3) in joint contrastive form [100] and apply the method with minor changes. This opens the possibility to train a model where both the generative and inference processes are described by Born machines (i.e., replacing the classical model in Fig. 2 with a quantum one). This model can be thought of as an alternative way to implement the quantum-assisted Helmholtz machine [15] and the quantum variational autoencoder [16].
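The probability-ratio estimation underlying requirements (i)-(iii) rests on the identity logit(d*(z, x)) = log q_θ(z|x) − log p(z) for the Bayes-optimal classifier. The following minimal numpy check illustrates this identity on two explicit toy distributions (illustrative values; balanced classes assumed):

```python
import numpy as np

# Two explicit distributions over 2-bit strings, standing in for
# q_theta(z|x) (samples labeled 1) and the prior p(z) (samples labeled 0).
q = np.array([0.4, 0.1, 0.2, 0.3])
p = np.array([0.25, 0.25, 0.25, 0.25])

# Bayes-optimal discriminator with balanced classes: d*(z) = q(z) / (q(z) + p(z)).
d_star = q / (q + p)

# Its logit recovers the log density ratio log q(z) - log p(z) exactly.
logit = np.log(d_star) - np.log(1 - d_star)
```

In practice d* is unknown and is approximated by a trained classifier (the MLP in the experiments), whose logit then serves as a noisy estimate of the log ratio.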
We also present a second VI method based on the kernelized Stein discrepancy. A limitation of this approach is that it requires explicit priors and likelihoods. On the other hand, it provides plenty of flexibility in the choice of kernel. We use a generic Hamming kernel to compute similarities between bit strings. For VI on graphical models, a graph-based kernel [120] that takes into account the structure could yield improved results. Moreover, one could attempt a demonstration of quantum advantage in VI using classically intractable kernels arising from quantum circuits [39][40][41].
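A Hamming kernel of the kind mentioned can be sketched as follows; the exact functional form used in the paper may differ, so treat the exponential-of-normalized-distance form below as an illustrative assumption.

```python
import numpy as np

def hamming_kernel(z1, z2, n):
    """Kernel on n-bit strings: k(z, z') = exp(-d_H(z, z') / n).

    d_H is the Hamming distance; identical strings get similarity 1.
    """
    d = np.sum(z1 != z2)
    return np.exp(-d / n)

def gram(samples):
    """Kernel (Gram) matrix for a batch of bit-string samples, shape (m, n)."""
    m, n = samples.shape
    return np.array([[hamming_kernel(samples[i], samples[j], n)
                      for j in range(m)] for i in range(m)])
```

The resulting Gram matrix is symmetric with unit diagonal, and feeds into the kernelized Stein discrepancy as the base kernel from which the Stein kernel κ_p is built.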
Another interesting extension could be approximating the posterior with a Born machine with traced out additional qubits, i.e., locally purified states. This may allow trading a larger number of qubits for reduced circuit depth at constant expressivity [86].
We present a few examples with direct application to financial, medical, and other domains. Another area of potential application of our methods is natural language processing. Quantum circuits have been proposed as a way to encode and compose words to form meanings [121][122][123]. Question answering with these models could be phrased as variational inference on quantum computers where our methods fit naturally.

This research was funded by Cambridge Quantum Computing. We gratefully acknowledge the cloud computing resources received from the "Microsoft for Startups" program. We acknowledge the use of IBM Quantum Services for this work. We thank Bert Kappen, Konstantinos Meichanetzidis, Robin Lorenz, and David Amaro for helpful discussions.

Appendix A: Quantum advantage in inference
The intuitive reason why approximate inference may be a suitable candidate for a quantum advantage is similar to that already studied for the related problem of generative modeling using quantum computers [42][43][44][45][85]. Here, the primary argument for quantum advantage reduces to complexity-theoretic arguments for the hardness of sampling from particular probability distributions by any efficient classical means. At one end of the spectrum, the result of Ref. [44] is based on the classical hardness of the discrete logarithm problem and is not suitable for near-term quantum computers. At the other end, the authors of Refs. [43,85] leveraged arguments from "quantum computational supremacy" to show how Born machines are more expressive than their classical counterparts. Specifically, assuming the noncollapse of the polynomial hierarchy, the Born machine is able to represent distributions arising from instantaneous quantum polynomial-time (IQP) computations, which cannot be sampled efficiently classically [81].
A similar argument can be made for approximate inference. The key distinction from the previous arguments for generative modeling is that we must now construct a conditional distribution, the posterior p(z|x), which at minimum requires a Born machine to represent. For an IQP circuit, probability amplitudes are given by partition functions of complex Ising models [88,89]. Thus, for a demonstration of quantum advantage in inference, we may seek posterior distributions of the form p(z|x) ∝ |tr e^{−iH(z,x)}|^2, where the Ising Hamiltonian H(z, x) is diagonal and its coefficients depend on both z and x. Alternative "quantum computational supremacy" posteriors could arise from boson sampling [82] and random circuits [83,84]. In the latter case, experimental validations have been performed [124,125], demonstrating the physical existence of candidate distributions. The first experiment [124] used 53 qubits of the Sycamore quantum chip with 20 "cycles", where a cycle is a layer of random single-qubit gates interleaved with alternating two-qubit gates. Notably, a cycle is not conceptually very different from a layer in our hardware-efficient ansatz in Fig. 5 in the main text. The average (isolated) single-qubit (two-qubit) gate error was estimated to be 0.13% (0.36%) in this experiment, which was sufficient to achieve a cross-entropy benchmarking fidelity (a measure of distribution "quality" related to the KL divergence) of 0.1%. The second experiment [125] repeated this with 56 qubits of the Zuchongzhi processor, again with 20 cycles; here the single-qubit (two-qubit) gates carried an average error of 0.14% (0.59%), achieving a fidelity of approximately 0.07%. Clearly, quantum processors exist that can express classically intractable distributions, and so are likely also capable of representing intractable posterior distributions for suitably constructed examples.
In all cases, the posterior must be constructed nontrivially from the prior p(z) and likelihood p(x|z) via Bayes' theorem. We leave this construction for future work. As a final comment, it is important to note that we are discussing only representational power and not learnability, which is a separate issue. It is still unknown whether there exists a distribution family (a posterior or otherwise) that is learnable by (near-term) quantum means but not classically. See Refs. [43,44] for discussions of this point.

Appendix B: Gradients for the adversarial method
In the adversarial VI method we optimize the objective functions given in Eqs. (4) and (7) via gradient ascent and descent, respectively. Assuming that the optimal discriminator d_φ = d* has been found, the objective in Eq. (7) becomes L(θ) = E_{z∼q_θ(z|x)}[logit(d*(z, x)) − log p(x|z)]. While d* implicitly depends on θ, the gradient of its expectation vanishes.
To see this, recall that logit(d*(z, x)) = log q_θ(z|x) − log p(z). Fixing x and taking the partial derivative with respect to the jth parameter,

∂_{θ_j} E_{z∼q_θ(z|x)}[logit(d*(z, x))] = Σ_z [∂_{θ_j} q_θ(z|x)] logit(d*(z, x)) + Σ_z q_θ(z|x) ∂_{θ_j} log q_θ(z|x). (B2)

The second term is zero since

Σ_z q_θ(z|x) ∂_{θ_j} log q_θ(z|x) = Σ_z ∂_{θ_j} q_θ(z|x) = ∂_{θ_j} Σ_z q_θ(z|x) = ∂_{θ_j} 1 = 0. (B3)

The approximate posterior in Eq. (B2) is given by the expectations of observables O_z = |z⟩⟨z|, i.e., q_θ(z|x) = ⟨O_z⟩_{θ,x}. For a circuit parameterized by gates U(θ_j) = exp(−i (θ_j/2) H_j), where the Hermitian generator H_j has eigenvalues ±1 (e.g., single-qubit rotations), the derivative of the first term in Eq. (B2) can be evaluated with the parameter shift rule [126][127][128]:

∂_{θ_j} ⟨O_z⟩_{θ,x} = (⟨O_z⟩_{θ_j^+, x} − ⟨O_z⟩_{θ_j^−, x}) / 2, (B4)

with θ_j^± = θ ± (π/2) e_j and e_j a unit vector in the jth direction. In summary, the partial derivative of the objective with respect to θ_j given the optimal discriminator is

∂_{θ_j} L(θ) = Σ_z [(⟨O_z⟩_{θ_j^+, x} − ⟨O_z⟩_{θ_j^−, x}) / 2] [logit(d*(z, x)) − log p(x|z)]. (B5)
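The parameter shift rule in Eq. (B4) is exact, not a finite-difference approximation. A minimal single-qubit check (illustrative function names): for an R_x(θ) rotation the Born probability is p(z=0) = cos^2(θ/2), and the shifted difference reproduces the analytic derivative −sin(θ)/2 to machine precision.

```python
import numpy as np

def born_prob(theta):
    """p(z=0) for a single qubit prepared by R_x(theta): |<0|R_x(theta)|0>|^2 = cos^2(theta/2)."""
    return np.cos(theta / 2) ** 2

def shift_gradient(theta):
    """Parameter shift rule: d<O>/dtheta = (<O>_{theta + pi/2} - <O>_{theta - pi/2}) / 2."""
    return (born_prob(theta + np.pi / 2) - born_prob(theta - np.pi / 2)) / 2
```

On hardware, each of the two shifted expectations is estimated from measurement shots, so the rule trades one gradient for two extra circuit evaluations per parameter.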

Appendix C: The Stein operator
Here we prove that (O_p f)(z) = s_p(x, z)^T f(z) − tr(Δf(z)) is a Stein operator for any function f : {0,1}^n → R^n and probability mass function p(z|x) > 0 on {0,1}^n. To do so, we need to show that its expectation under the true posterior vanishes. Taking the expectation with respect to the true posterior and expanding the two summations arising from the score term and the trace term, the first and third terms cancel, while the second and fourth terms are identical and also cancel. The latter can be seen by substituting y = ¬_i z into one of the summations and noting that z = ¬_i y. Thus, the operator O_p satisfies Stein's identity E_{z∼p(z|x)}[(O_p f)(z)] = 0 for any function f as defined above, and is a valid Stein operator.

Appendix D: Gradients for the kernelized method
We wish to optimize the kernelized Stein objective, Eq. (14), via gradient descent. This gradient was derived in Ref. [43] and is given in Eq. (D1). The final line of Eq. (D1) (suppressing the dependence on the variables z, x: q_θ := q_θ(z|x)) follows from the parameter shift rule, Eq. (B4), for the gradient of the output probabilities of the Born machine circuit, under the same assumptions as in Appendix B. The relatively simple form of the gradient is due to the fact that the Stein kernel, κ_p, does not depend on the variational distribution q_θ(z|x). The gradient can be estimated in the same way as the KSD itself: by sampling from the original Born machine, q_θ(z|x), along with its parameter-shifted versions, q_{θ_j^±}(z|x), for each parameter j.
Appendix E: Learning curves for the adversarial method

In this section we provide examples of successful and unsuccessful VI using the KL objective and adversarial methods. We inspect the random instances used in the "sprinkler" network experiment in Sec. V A and cherry-pick two of the 30 instances where a one-layer Born machine is employed. Figure 9a shows a successful experiment. Here the MLP classifier closely tracks the ideal classifier, providing a good signal for the Born machine. The Born machine is able to minimize its loss, which in turn leads to a total variation distance of approximately zero. Figure 9b shows an unsuccessful experiment. Here, the MLP tracks the ideal classifier, but the Born machine is not able to find a "direction" that minimizes the loss (see epochs 400 to 1000). We verified that a more powerful two-layer Born machine performs much better on this instance (not shown).
In these figures, gray lines correspond to quantities that can be estimated from samples, while blue and magenta lines represent exact quantities that are not practical to compute. It is challenging to assess the training by looking at the gray lines alone. This discussion emphasizes the need for new techniques to assess such models; these would provide an early stopping criterion for unsuccessful experiments as well as a validation criterion for successful ones. This is a common problem among adversarial approaches.
Appendix F: Probability table for the "lung cancer" network

We derived our VI methods under the assumption of nonzero posterior probabilities. In tabulated Bayesian networks, however, zero posterior probabilities may occur. The original "lung cancer" network [53] contains a variable "either" (meaning that either lung cancer or tuberculosis is present). This variable is a deterministic or function of its parent variables and yields zero probabilities in some posterior distributions. In order to avoid numerical issues arising from zero probabilities, we replace "either" by "illness", which may be true even if both parent variables "lung cancer" and "tuberculosis" are false. In practice, this is done by adding a small ε > 0 to the entries of the "either" table and then renormalizing. This simple technique is known as additive smoothing, or Laplace smoothing. Table I shows the modified probability table for the network in Fig. 1c.
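The additive smoothing step can be sketched in a few lines; the function name `smooth` and the value ε = 10⁻³ are illustrative choices, not the paper's exact values.

```python
import numpy as np

def smooth(table, eps=1e-3):
    """Additive (Laplace) smoothing: add eps to every entry, then renormalize rows.

    Each row of `table` is a conditional distribution that must sum to one.
    """
    t = np.asarray(table, dtype=float) + eps
    return t / t.sum(axis=1, keepdims=True)

# A deterministic OR table p(I | T, L), with one row per parent configuration ...
det_or = [[1.0, 0.0],   # T=fa, L=fa -> I=fa
          [0.0, 1.0],   # T=fa, L=tr -> I=tr
          [0.0, 1.0],   # T=tr, L=fa -> I=tr
          [0.0, 1.0]]   # T=tr, L=tr -> I=tr
# ... becomes strictly positive after smoothing, eliminating zero posteriors.
smoothed = smooth(det_or)
```

After smoothing, every conditional entry is strictly positive while each row still sums to one, which is exactly the property the VI derivations require.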
An alternative approach to dealing with a deterministic variable that implements the or function is to marginalize its parent variables. It can be verified that the entries of the resulting probability table are nonzero (unless the parent variables are also deterministic, in which case one marginalizes these as well). One can then perform variational inference on this modified network with fewer variables.