Quantum self-learning Monte Carlo with quantum Fourier transform sampler

The self-learning Metropolis-Hastings algorithm is a powerful Monte Carlo method that, with the help of machine learning, adaptively generates an easy-to-sample probability distribution for approximating a given hard-to-sample distribution. This paper provides a new self-learning Monte Carlo method that utilizes a quantum computer to output a proposal distribution. In particular, we show a novel subclass of this general scheme based on the quantum Fourier transform circuit; this sampler is classically simulable while having a certain advantage over conventional methods. The performance of this "quantum-inspired" algorithm is demonstrated by some numerical simulations.


I. INTRODUCTION
Monte Carlo (MC) simulation is a powerful statistical method that is generically applicable to computing statistical quantities of a given system by sampling random variables from its underlying probability distribution. A particularly efficient method is the Metropolis-Hastings (MH) algorithm [1]; this realizes fast sampling from a target distribution via an appropriate acceptance/rejection filtering of samples generated from an alternative proposal distribution that is easier to sample than the target one. Therefore, the most important task in this algorithm is to specify an appropriate proposal distribution satisfying the following three conditions: (i) it must be easy to sample, (ii) the corresponding probability can be efficiently computed for judging acceptance/rejection of a sample, and (iii) it is rich in representation, meaning that it may lie near the target distribution. This is a long-standing challenge, but the recent rapid progress of machine learning enables us to take a circumventing pathway to the issue, the self-learning MC [2][3][4][5], which introduces a parametrized proposal distribution and updates the parameters during the sampling so that it comes to mimic the target. Despite its potential power thanks to the aid of machine learning, this approach has been demonstrated with only a few physical models; for example in Ref. [2], for a target hard-to-sample Ising distribution, a parametrized proposal Ising distribution was applied to demonstrate the effectiveness of the self-learning MC approach. To expand the scope of self-learning MC, we need a systematic method to design a parametrized proposal distribution that is generically applicable to a wide class of target distributions. We therefore turn our attention to the quantum regime, with the hope that quantum computing might provide an effective means to attack the above-mentioned problem.
In fact, the so-called quantum supremacy holds for sampling problems; that is, a (non-universal) quantum computer can generate a probability distribution that is hard to sample via any classical (i.e., non-quantum) computer. Boson sampling [6] and Instantaneous Quantum Polynomial time computations [7,8] are particularly well known, the former of which is now even within reach of experimental demonstration [9,10]. Moreover, a recent trend is to extend this idea to quantum learning supremacy [11], meaning that a quantum circuit is trained to learn a given target distribution faster than any classical computer.
With the above background in mind, in this paper we study a new type of self-learning MC that uses quantum computing to generate a proposal distribution. In fact, this scheme satisfies the above-described three conditions. First, (i) is already fulfilled as an intrinsic feature of quantum computers. Second, it is well known that the task (ii) can be efficiently executed using the amplitude estimation algorithm [12,13], which is quadratically faster than its classical counterpart. Lastly, the above-described expressive power of quantum computers for generating complex probability distributions might enable us to satisfy (iii) and mimic a target distribution that is essentially hard to sample via any classical means. To realize the learning scheme in a quantum system, we take the variational approach, meaning that a parametrized quantum circuit is trained so that its output probability distribution approaches the target distribution. This scheme itself is also employed in quantum generative modeling [14,15], but the application to MH might be of more practical use for the following reason: unlike the generative modeling problem, MH need not generate a proposal distribution that is very close to the target; it requires only a relatively high acceptance ratio, and accordingly places less demand on the quantum computer.
Of course, the most difficult part is to design a parametrized quantum circuit that successfully generates a suitable proposal distribution. Therefore, for the purpose of demonstrating the proof of concept, in this work we consider a special type of quantum circuit composed of the quantum Fourier transform (QFT), where the parameters to be learned are assigned to only the low-frequency components. In fact, thanks to the rich expressive power of the Fourier transform in representing or approximating various functions, the proposed QFT sampler is expected to satisfy the condition (iii), in addition to (i) and (ii). We also emphasize that this QFT sampler or its variants (e.g., with a different parametrization covering the high-frequency components) is a new type of circuit ansatz for the quantum variational method and might be applicable to other problems such as quantum generative modeling. A further notable result is that the proposed QFT sampler can be efficiently simulated by classical means, using the iterative measurement technique [16]. This is exactly the direction of exploring classical algorithms that fully make use of quantum features, i.e., quantum-inspired algorithms such as [17]. We will show that this quantum-inspired sampler has a certain advantage over some conventional methods, in addition to the clear merit that systems with, e.g., hundreds of qubits are simulable.
We use the following notation: for a complex column vector a, the symbols a^† and a^⊤ represent its complex conjugate transpose and its transpose, respectively. Also, a^* denotes the elementwise complex conjugate of a. Hence a^† = (a^*)^⊤.

II. SELF-LEARNING MH ALGORITHM WITH A QUANTUM SAMPLER

A. Classical MH algorithm
Let p(x) and q(x) be the target and proposal probability distributions, respectively. To get a sample from p(x), the MH algorithm instead samples from q(x) and accepts the result with a valid probability determined by the detailed balance condition. More specifically, assume that we have last accepted a sample r and now obtain a sample x generated from the proposal distribution q(x). Then, this sample is accepted with probability

A(x, r) = min{ 1, p(x)q(r) / [p(r)q(x)] },    (1)

which is called the acceptance ratio. Note that the value p(r) is assumed to be easily computable for a given r, while its sampling is hard. This procedure is repeated until the number of samples becomes large enough; the accepted samples are then governed by the target distribution p(x) due to the detailed balance condition. Note that, if q(x) = p(x), the acceptance ratio is always exactly 1, which is maximally efficient; but of course this does not happen, because p(x) is hard to sample while q(x) is assumed to be relatively easy to sample.
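To make the procedure concrete, the acceptance rule of Eq. (1) for an independence-type proposal can be sketched as follows (a minimal NumPy sketch; the function and variable names are ours, and the Gaussian target in the illustration is our own choice, not one of the paper's examples):

```python
import numpy as np

def metropolis_hastings(p, q_sample, q_prob, r0, n_steps, rng):
    """Independence-proposal MH: propose x ~ q, accept with
    A(x, r) = min(1, p(x) q(r) / (p(r) q(x))), as in Eq. (1)."""
    r, chain = r0, []
    for _ in range(n_steps):
        x = q_sample(rng)
        a = min(1.0, p(x) * q_prob(r) / (p(r) * q_prob(x)))
        if rng.random() < a:
            r = x                 # accept the proposal
        chain.append(r)           # a rejected step repeats the last sample
    return np.array(chain)

# Illustration: standard-normal target, uniform proposal on [-5, 5].
rng = np.random.default_rng(0)
p = lambda x: np.exp(-0.5 * x ** 2)   # unnormalized target density
chain = metropolis_hastings(p, lambda g: g.uniform(-5.0, 5.0),
                            lambda x: 0.1, 0.0, 30000, rng)
```

Note that p only needs to be evaluable up to a normalization constant, since it enters Eq. (1) only through the ratio p(x)/p(r).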
In the context of self-learning MC, a parametric model of the proposal distribution q(x; θ) is considered, with θ the vector of parameters. The self-learning MC aims to learn the parameters so that q(x; θ) well approximates the target p(x).
B. General form of the quantum self-learning MH algorithm

Here we describe the quantum sampler executing the self-learning MH algorithm in the general setting. First, for the initial state |g⟩_N = |g⟩^⊗N with |g⟩ = [1, 0]^⊤ a qubit state, we apply the parametric unitary gate U(θ):

|Ψ(θ)⟩ = U(θ)|g⟩_N,

where θ ∈ C^m are the parameters to be tuned. The reason for taking complex-valued parameters will be made clear in the next section, when specializing to the QFT circuit. This state is measured in the computational basis, defining the probability distribution

q(x; θ) = |⟨x|Ψ(θ)⟩|²,    (2)

where x is the multi-dimensional random variable represented in binary (an example is given in Appendix A). Equation (2) is the proposal distribution of our MH algorithm. Hence, our goal is to update θ so that q(x; θ) comes to well approximate the target distribution p(x). The learning process for updating θ is executed via the standard gradient descent of a loss function, as described below. First, for a fixed θ, we obtain samples r_1, ..., r_B from q(x; θ) by the computational-basis measurement. These samples are filtered according to the acceptance probability (1), thus producing samples governed by p(x). Note now that, for a given r, the value q(r; θ) = |⟨r|Ψ(θ)⟩|² must be efficiently computable to carry out this filtering (recall that the value of p(r) is assumed to be easily obtained); this computability depends on the structure of U(θ), but in general we can apply the quantum amplitude estimation algorithm [12,13] to reduce the cost of this task. The samples r_1, ..., r_B are also used to calculate the gradient descent vector of the loss function L(θ) for updating θ. Noting that L(θ) is real while θ is complex, its infinitesimal change with respect to θ is given by

δL = (∂L/∂θ)^⊤ δθ + (∂L/∂θ^*)^⊤ δθ^* = 2Re[(∂L/∂θ)^⊤ δθ].

The gradient descent vector for updating the parameter from θ to θ′ = θ + δθ is thus given by

δθ = −α (∂L/∂θ)^*,    (3)

where α > 0 is the learning coefficient; in fact, then δL = −2α‖∂L/∂θ‖² ≤ 0.
In this work, the loss function is set to the following cross entropy between q(x; θ) and p(x):

L(θ) = −Σ_x p(x) log q(x; θ),    (4)

which is a standard measure for quantifying the similarity of two distributions. The gradient vector of L(θ) can be computed as follows:

∂L/∂θ = −Σ_x p(x) [∂q(x; θ)/∂θ] / q(x; θ),    (5)

where the derivative ∂q(x; θ)/∂θ can be directly estimated when U(θ) is composed of Pauli operators with the parameters {θ_i} corresponding to the rotation angles [18].
Here we discuss the notable features and the possible quantum advantage of the quantum sampler for the MH algorithm, referring to the three conditions mentioned in Sec. I. First, the condition (i) is indeed satisfied, because the sampler is a quantum device that physically produces each measurement result r according to the proposal probability distribution q(x; θ), in only a few microseconds in the case of superconducting devices. As for the condition (ii), it is in principle possible to efficiently compute the probability q(r; θ) for a given r, as discussed above in this subsection. Lastly, for the condition (iii), q(x; θ) might be able to represent a wide class of probability distributions, some of which are even hard to sample via any classical means, as mentioned in Sec. I. Realizing this possible quantum advantage of course requires a clever design of the ansatz U(θ).

III. THE QUANTUM FOURIER TRANSFORM SAMPLER
To show a proof of principle of the quantum sampler for the self-learning MH algorithm, here we consider a special class of circuits composed of the QFT, called the QFT sampler. Importantly, as will be shown, the QFT sampler is classically simulable, while it might still have an advantage over classical algorithms.

A. 1-dimensional QFT sampler
We begin with the 1-dimensional QFT sampler; extension to the multi-dimensional case is discussed in the next subsection. The input state is set to

|in⟩ = |ψ(θ)⟩ ⊗ |g⟩^⊗(N−M),  |ψ(θ)⟩ = Σ_{j=0}^{2^M−1} θ_j |j⟩,    (6)

where θ = [θ_0, ..., θ_{2^M−1}]^⊤ is the normalized parameter vector to be learned. Note that, if θ were restricted to real numbers, the proposal distribution (7) below would be limited to an even function, and thus θ must take complex numbers. The output of the QFT is given by |Ψ(θ)⟩ = U_QFT |in⟩, where U_QFT is the QFT unitary operator whose matrix representation is given by

(U_QFT)_{x,j} = 2^{−N/2} e^{2πi xj/2^N},  x, j ∈ {0, 1, ..., 2^N − 1}.

Now the measurement on |Ψ(θ)⟩ in the computational basis yields the probability distribution

q_QFT(x; θ) = |⟨x|U_QFT|in⟩|²,    (7)

where the random variable x is represented in binary, i.e., x ∈ {0, 1, ..., 2^N − 1}. Again, the task of self-learning MH is to update θ so that the proposal distribution q_QFT(x; θ) approaches the target distribution p(x). To compute the gradient vector (5), let us express ⟨x|U_QFT|in⟩ as

⟨x|U_QFT|in⟩ = u_x^⊤ θ,    (8)

where u_x is the vector composed of the first 2^M elements of the x-th row vector of U_QFT; hence the j-th element of u_x is given by (u_x)_j = 2^{−N/2} e^{2πi xj/2^N}. Then the partial derivative of q_QFT(x; θ) is

∂q_QFT(x; θ)/∂θ = (u_x^⊤ θ)^* u_x.

Hence, from Eq. (5), the gradient vector of the loss function L(θ) can be calculated as follows:

∂L/∂θ = −(1/B) Σ_{i=1}^{B} (u_{r_i}^⊤ θ)^* u_{r_i} / q_QFT(r_i; θ),    (9)

where {r_1, r_2, ..., r_B} are samples taken from the proposal distribution q_QFT(x; θ). Recall that we filter these samples via the rule (1) and thereby obtain samples that are subject to the target distribution p(x). In this work, instead of the standard gradient update (3), we take the following momentum gradient descent [19]:

δθ_new = µ δθ_old − α (∂L/∂θ)^*,  θ′ = θ + δθ_new,    (10)

where α and µ are the learning coefficients. The updated vector θ′ is normalized and substituted into Eq. (6) for the next learning stage. Note that Eq. (10) is identical to the standard gradient descent when µ = 0.
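For classically tractable 2^M, the quantities in Eqs. (7)-(9) are straightforward to evaluate; the sketch below (function names are ours) computes q_QFT and the sample-based gradient estimate:

```python
import numpy as np

def u_row(x, N, M):
    # u_x: first 2^M elements of the x-th row of U_QFT, Eq. (8).
    j = np.arange(2 ** M)
    return np.exp(2j * np.pi * x * j / 2 ** N) / np.sqrt(2 ** N)

def q_qft(x, theta, N):
    # q_QFT(x; theta) = |u_x^T theta|^2, Eq. (7).
    M = int(np.log2(len(theta)))
    return abs(u_row(x, N, M) @ theta) ** 2

def grad_L(theta, samples, N):
    """Sample estimate of dL/dtheta, Eq. (9):
    -(1/B) sum_i (u_{r_i}^T theta)^* u_{r_i} / q_QFT(r_i; theta)."""
    M = int(np.log2(len(theta)))
    g = np.zeros(len(theta), dtype=complex)
    for r in samples:
        u = u_row(r, N, M)
        amp = u @ theta
        g += -np.conj(amp) * u / abs(amp) ** 2
    return g / len(samples)
```

Since the columns of U_QFT are orthonormal, the 2^N values q_qft(x, theta, N) sum to 1 whenever theta is normalized, which is a convenient sanity check.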
Here let us discuss how the basic conditions (i)-(iii) are reasonably fulfilled by the QFT sampler. First, as mentioned before, the QFT sampler enables us to obtain samples fast enough, thanks to the intrinsic features of quantum devices, and thus it satisfies the condition (i). Also, the probability q_QFT(r; θ) for a given r is obtained by just calculating the inner product (8), which costs O(2^M) operations; hence the condition (ii) is satisfied when 2^M is a classically tractable number. As for the condition (iii), noting that the parametrized ket vector |ψ(θ)⟩ serves as the low-frequency components of the shape of the proposal distribution q_QFT(x; θ), the QFT sampler may be able to well approximate the shape of the target p(x), in view of the rich expressive power of the Fourier decomposition. Now, one might think that this QFT sampler is still out of reach, even for medium-size quantum devices with, e.g., N = 100 qubits. Remarkably, however, the QFT sampler can be realized on a classical digital computer as long as N² and 2^M are classically tractable numbers; that is, in this regime, this is a "quantum-inspired" algorithm that can deal with a random variable on even 2^N = 2^100 discrete elements. The trick relies on the use of the adaptive measurement technique [16], which enables us to sample by applying only O(2^M + N) operations on a classical computer; see Appendix B for a detailed explanation. Therefore, the QFT sampler fulfills the condition (i) even as a classical algorithm.
Finally, we discuss a possible advantage of our QFT sampler within the regime of classical sampling methods. As a classical Fourier-based proposal distribution, one might think to employ the fast Fourier transform (FFT). However, to deal with a variable with 2^N discretized elements, the FFT needs O(N 2^N) operations, while the QFT can realize the same operation with only O(N²) gates. As is well known, this does not imply a quantum advantage in typical application scenes such as signal processing, because all the amplitudes of the QFT-transformed state cannot be efficiently determined [20]. On the other hand, the presented scheme only requires sampling, and thus determining {q_QFT(r_i; θ)}_{i=1,...,B} rather than all the elements {q_QFT(x; θ)}_{x=0,...,2^N−1}. Hence we could say that the developed quantum-inspired algorithm has a solid computational advantage over the known classical algorithm in the problem of determining a Fourier-based proposal distribution for the MH algorithm.

B. Multistage QFT sampler

The QFT sampler discussed above is able to sample only from a 1-dimensional distribution. To extend the scheme to the case of multi-dimensional distributions, we here develop a multistage QFT sampler, which is composed of single QFT samplers with their parameters updated via machine learning, as illustrated in Fig. 2.
First, to see the idea, let us consider the case where the D random variables x_1, ..., x_D are independent of each other and subject to the D-dimensional independent target distribution p(x). In this case, the following proposal distribution might work:

q(x; θ) = Π_{k=1}^{D} q_QFT(x_k; θ_k).    (11)

Here q_QFT(x_k; θ_k) is a 1-dimensional QFT sampler parametrized by the 2^M-dimensional complex vector θ_k; we summarize these vectors as θ = [θ_1^⊤, ..., θ_D^⊤]^⊤. Now, based on Eq. (11), we construct the multistage QFT sampler that can deal with a non-independent multi-dimensional proposal distribution. The point is that, as illustrated in Fig. 2, the parameter vector θ_k specifying the k-th 1-dimensional QFT sampler in Eq. (11) is replaced by a vector of parametrized functions of the random variables up to the (k−1)-th stage, i.e., f_k(x_1, ..., x_{k−1}; θ_k), whose control parameter θ_k is repeatedly modified through the learning process. Hence the proposal distribution is given by

q(x; θ) = Π_{k=1}^{D} q_QFT(x_k; f_k(x_1, ..., x_{k−1}; θ_k)).    (12)

Note that this is simply a representation of the joint probability distribution over the multi-dimensional random variable x = [x_1, ..., x_D]^⊤ via a series of conditional probabilities. The gradient of the cross entropy (4) is derived in the same way as in the 1-dimensional case:

∂L/∂θ_k = −(1/B) Σ_{i=1}^{B} [∂q_QFT(r_{i,k}; f_k)/∂θ_k] / q_QFT(r_{i,k}; f_k),    (13)

where {r_1, ..., r_B} are samples produced from the proposal distribution q(x; θ), and f_k is evaluated at the components of each sample up to the (k−1)-th stage. Note that each sample r = [r_1, ..., r_D]^⊤ is formed from the r_k produced by the k-th 1-dimensional QFT sampler, as shown in Fig. 2. Also, as in the 1-dimensional case (8), the vector u_{x_k} is defined accordingly. This gradient vector (13) is used to update each parameter vector θ_k via the momentum gradient descent (10), which is now of the form

δθ_{k,new} = µ δθ_{k,old} − α (∂L/∂θ_k)^*,  θ_k′ = θ_k + δθ_{k,new},

where the learning coefficients (α, µ) do not depend on k for simplicity. This eventually constitutes the total gradient descent vector minimizing the loss function (4), and accordingly moves the proposal distribution (12) toward the target p(x).
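The chained sampling of Eq. (12) can be sketched classically as follows (names are ours; each fs[k] stands in for a conditioning model f_k of the kind introduced below, mapping the already-drawn values to a normalized 2^M-dimensional parameter vector):

```python
import numpy as np

def qft_dist(theta, N):
    # Full 1-D QFT proposal distribution q_QFT(.; theta), Eq. (7).
    x = np.arange(2 ** N)[:, None]
    j = np.arange(len(theta))[None, :]
    u = np.exp(2j * np.pi * x * j / 2 ** N) / np.sqrt(2 ** N)
    return np.abs(u @ theta) ** 2

def multistage_sample(fs, N, rng):
    """One draw x = (x_1, ..., x_D) from the chained proposal, Eq. (12).
    fs[k] receives the list of values drawn at the earlier stages."""
    xs = []
    for f in fs:
        theta_k = f(xs)                    # condition on earlier stages
        p = qft_dist(theta_k, N)
        xs.append(int(rng.choice(2 ** N, p=p / p.sum())))
    return xs
```

Drawing stage by stage in this way is exactly the decomposition of the joint distribution into conditionals that Eq. (12) expresses.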
Also recall that we use Eq. (1) to filter {r 1 , · · · , r B } to obtain samples that are subjected to p(x).
In this work we examine the following four models of the parametrized function f_k(x_1, ..., x_{k−1}; θ_k).
• Identity (Id) model: f_k(θ_k) = Norm(θ_k), where the Norm function ensures the normalization of |ψ(θ)⟩. This leads to the independent proposal distribution (11).
• Linear Basis Linear Regression (LBLR) model: f_k(x_1, ..., x_{k−1}; θ_k) = Norm(θ_{k,0} + Σ_{i=1}^{k−1} x_i θ_{k,i}), where in this case the parameters to be learned are the collection of 2^M-dimensional complex vectors {θ_{k,0}, θ_{k,1}, ..., θ_{k,k−1}}. This model outputs a linear combination of its input arguments.
• Nonlinear Basis Linear Regression (NBLR) model: f_k(x_1, ..., x_{k−1}; θ_k) = Norm(Σ_j φ_j(x_1, ..., x_{k−1}) θ_{k,j}), where {φ_j} is a set of fixed nonlinear basis functions (specified for each case study in Sec. IV) and the parameters to be learned are again a collection of 2^M-dimensional complex vectors {θ_{k,j}}.

• Neural Network (NN) model: f_k(x_1, ..., x_{k−1}; θ_k) = Norm(h(y)), with y the output of a fully connected 4-layer neural network with input x = (x_1, ..., x_{k−1}), shown in Fig. 3. Here g^(k) is a 2^{M+1}-dimensional real parameter vector, and the weight matrices and bias vectors of the network are the parameters to be optimized. Sg(x) is the sigmoid activation function whose ℓ-th component is defined as 1/(1 + e^{−x_ℓ}). Finally, for the output vector y = [y_1^⊤, y_2^⊤]^⊤, where y_1 and y_2 are 2^M-dimensional real vectors, the function h acts on it and produces the complex vector h(y) = y_1 + i y_2.
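The three non-neural conditioning models can be sketched as follows (our illustrative implementations; the exact parametrization used in the paper may differ in detail):

```python
import numpy as np

def norm(v):
    # Norm(.): rescale so that |psi(theta)> is a unit vector.
    return v / np.linalg.norm(v)

def f_id(theta_k, xs):
    # Id model: ignores the conditioning variables entirely.
    return norm(theta_k)

def f_lblr(thetas, xs):
    # LBLR model: linear combination of the inputs, with one
    # 2^M-dimensional complex coefficient vector per term.
    out = thetas[0] + sum(x * t for x, t in zip(xs, thetas[1:]))
    return norm(out)

def f_nblr(thetas, xs, bases):
    # NBLR model: linear regression over fixed nonlinear basis
    # functions phi_j evaluated at the conditioning variables.
    out = sum(t * phi(*xs) for t, phi in zip(thetas, bases))
    return norm(out)
```

In every model, the final Norm step is what guarantees that the resulting vector is a valid parameter for Eq. (6).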

IV. SIMULATION RESULTS
In this section we numerically examine the performance of the QFT sampler for several target distributions. For this purpose, the following criteria are used for the evaluation. First, the cross entropy (4) is employed to see if the proposal distribution approaches the target. Note that in general the cross entropy does not decrease to zero even when the proposal distribution coincides with the target. Hence, to evaluate the distance between the two probability distributions, we use the following Wasserstein distance:

W(p, q) = sup_f Σ_x f(x) [p(x) − q(x)],

where the supremum is taken over the set of functions f satisfying the 1-Lipschitz constraint. Note that the Kullback-Leibler divergence is not used here, because it is applicable only when the supports of the two distributions coincide.
We also use the average acceptance ratio to evaluate the efficiency of the sampler. Mathematically it is defined as the expectation value of Eq. (1), but in the simulation we simply take the ratio of the number of accepted samples to the number of all samples generated.
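For the 1-dimensional experiments, the supremum in the Wasserstein distance has a closed form: on an ordered grid with unit spacing, the 1-Lipschitz dual reduces to the L1 distance between the two cumulative distribution functions (a standard fact, used here as a sketch):

```python
import numpy as np

def wasserstein_1d(p, q):
    # W(p, q) for pmfs on the grid {0, 1, ..., K-1}:
    # sum over grid points of |CDF_p - CDF_q|.
    return float(np.abs(np.cumsum(np.asarray(p) - np.asarray(q))).sum())
```

For instance, moving a unit of mass two grid cells costs 2, in contrast to the total-variation distance, which is insensitive to how far the mass moves.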

A. 1-dimensional case
First we discuss the performance of the 1-dimensional QFT sampler. The QFT sampler is composed of N = 10 qubits, of which M = 4 qubits are used for the parametrization; see Fig. 1. The total number of learning steps is 40,000, where in each step B = 32 samples are used to compute the gradient vector (9). The learning coefficients in the momentum update rule (10) are chosen as α = 0.01 and µ = 0.9. In the upper panels of Fig. 4, five types of target distributions (red broken lines) and snapshots of the proposal distribution every 1000 steps (black solid lines), as well as the convergent distributions (blue solid lines), are plotted. In the lower three panels of each subfigure, CE (cross entropy), WA (Wasserstein distance), and AC (average acceptance ratio) are plotted, demonstrating that in each case the proposal distribution q_QFT(x; θ) approaches the target p(x), and accordingly AC increases. Note that in case D a wave-like structure still remains in the convergent proposal distribution, because of the absence of qubits corresponding to the high-frequency components. However, as mentioned in Sec. I, this is not a serious issue in the framework of MH, which does not require a precise approximation of the target distribution by the proposal one, but only a proposal distribution realizing a relatively high acceptance ratio. In fact, the lower panel of D in Fig. 4 demonstrates about a 30% improvement in the acceptance ratio.
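The 1-dimensional learning loop can be reproduced in miniature as follows (a reduced sketch of our own: exact full-sum gradient instead of B = 32 samples, smaller N and M than in the experiments, and a Gaussian-like target of our choosing):

```python
import numpy as np

def train_qft(p, N, M, alpha=0.01, mu=0.9, steps=300):
    # Momentum gradient descent, Eq. (10), on the cross entropy, Eq. (4),
    # with the gradient of Eq. (9) summed exactly over all x.
    x = np.arange(2 ** N)[:, None]
    j = np.arange(2 ** M)[None, :]
    U = np.exp(2j * np.pi * x * j / 2 ** N) / np.sqrt(2 ** N)  # rows u_x
    theta = np.zeros(2 ** M, dtype=complex)
    theta[0] = 1.0                     # start from the uniform proposal
    v = np.zeros_like(theta)
    losses = []
    for _ in range(steps):
        amps = U @ theta
        q = np.abs(amps) ** 2 + 1e-12  # epsilon guards against exact zeros
        losses.append(float(-(p * np.log(q)).sum()))
        # dL/dtheta = -sum_x p(x) (u_x^T theta)^* u_x / q(x)
        grad = -(U * (p * amps.conj() / q)[:, None]).sum(axis=0)
        v = mu * v - alpha * grad.conj()     # Eq. (10); note the conjugate
        theta = theta + v
        theta = theta / np.linalg.norm(theta)  # renormalize, as after Eq. (10)
    return theta, losses

# Gaussian-like target on {0, ..., 63}
xs = np.arange(64)
p = np.exp(-0.5 * ((xs - 32) / 8.0) ** 2)
p /= p.sum()
theta, losses = train_qft(p, N=6, M=3)
```

Starting from θ = e_0 makes the initial proposal exactly uniform, so the initial cross entropy equals log 2^N; the loss then decreases as the low-frequency parameters pick up the shape of the target.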

B. 2-dimensional case
Next we study several 2-dimensional (i.e., D = 2) target distributions, which are shown in the top right panels from A to G in Fig. 5. In this case, the proposal distribution (12) is of the form

q(x; θ) = q_QFT(x_2; f_2(x_1; θ_2)) q_QFT(x_1; f_1(θ_1)),    (14)

where f_1(θ_1) is set to f_1(θ_1) = Norm(θ_1). As for f_2(x_1; θ_2), we study the four models described in Sec. III B, i.e., the Id, LBLR, NBLR, and NN models. The learning coefficients of the momentum gradient method are set to α = 0.01 and µ = 0.9 for all the cases, except that α = 0.001 is chosen in the NN model for the cases from A to F. Also, the two 1-dimensional QFT samplers in Eq. (14) are both composed of N = 10 and M = 4 qubits for the cases from A to F, while N = 10 and M = 5 for the case G. The basis functions in the NBLR model f_2(x_1; θ_2) are chosen as functions of x̄_1, where x̄ is defined as x̄ = 2x/(2^N − 1) − 1. The total number of learning steps is 40,000 for the cases from A to F and 400,000 for the case G. Finally, for all cases, B = 32 samples are taken to compute the gradient in each step. With this setting, the proposal distributions at the final learning step corresponding to the four models (Id, LBLR, NBLR, and NN, from left to right) are shown in the top panels of Fig. 5, and the evolution of the figures of merit (CE, WA, and AC) is also provided in the bottom panels.
First, as expected, the Id model can only approximate a distribution having no correlation between the two dimensions, i.e., the case C. On the other hand, the LBLR model acquires a distribution close to the target for the cases B, D, E, and F, in addition to C. The figures of merit, CE, WA, and AC, also reflect this fact. (The blue and orange lines in the cases A and C almost coincide.) It is notable that the simple LBLR model greatly improves upon the performance of the Id model.
To further handle the cases A and G, some nonlinearities need to be introduced, as demonstrated by the NBLR and NN models. In particular, for all the cases from A to G, the NBLR model shows almost the same level of performance as the NN model, which is also supported by the figures of merit. Considering that there are some jumps in WA and CE for the NN model, and that the NN model is costly in the learning process, our conclusion is that the NBLR model is the most efficient in our case study.

C. Application to a molecular simulation
The last case study focuses on the stochastic dynamics of two atoms in the Lennard-Jones (LJ) potential field, which is often employed in the field of molecular simulation. This problem requires sampling from the Boltzmann distribution

p(r_1, r_2) ∝ exp[−β LJ(|r_1 − r_2|)],

where LJ(a) = a^{−12} − a^{−6} and β = 0.1 is the inverse temperature. r_1 = [r_{1x}, r_{1y}, r_{1z}]^⊤ and r_2 = [r_{2x}, r_{2y}, r_{2z}]^⊤ are the vectors of position variables of the two atoms. We apply the 6-stage QFT sampler to generate a proposal distribution approximating this target distribution.
The QFT sampler is configured by conditioning the samples in the order r_{1x} → r_{2x} → r_{1y} → r_{2y} → r_{1z} → r_{2z}; for instance, q_QFT(r_{1y}; f_{1y}(r_{1x}, r_{2x}; θ_{1y})) is conditioned on (r_{1x}, r_{2x}). The output of the QFT sampler, r_i ∈ {0, 1, ..., 2^N − 1}, is rescaled to r_i/[0.7(2^N − 1)]. We again employ the four models (Id, LBLR, NBLR, and NN) as the conditioning functions, where f_1(θ_1) is set to f_1(θ_1) = Norm(θ_1). Each QFT sampler is composed of N = 10 and M = 4 qubits, the total number of learning steps is 10,000, and B = 1024 samples are used to compute the gradient in each step. The learning coefficient is α = 10 for the Id, LBLR, and NBLR models, while α = 0.001 for the NN model; in each case the momentum method with µ = 0.9 is employed to update the parameters. The basis functions {φ_j(x_1, x_2, ..., x_{k−1})} for constructing the NBLR model are set to all the terms (except for the constant) of the expanded third-order polynomial (1 + x̄_1 + x̄_2 + ... + x̄_{k−1})^3, with x̄ = 2x/(2^N − 1) − 1. Figure 6 shows the evolution of CE and AC; WA was not calculated due to its heavy computational cost. We find that, similarly to the 2-dimensional case, the NBLR and NN models succeed in improving both CE and AC, although more learning steps are necessary compared to the previous cases. Note also that the Id and LBLR models are hardly updated at all, which supports the introduction of nonlinearities into the model of the QFT sampler.
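The target log-density used in this case study is simple to write down (a sketch; p is unnormalized, and the minimum of the potential at a = 2^{1/6} serves as a sanity check):

```python
import numpy as np

def lj(a):
    # Lennard-Jones potential LJ(a) = a^(-12) - a^(-6).
    return a ** -12 - a ** -6

def log_boltzmann(r1, r2, beta=0.1):
    # Unnormalized log-density of the target: log p = -beta * LJ(|r1 - r2|).
    a = np.linalg.norm(np.asarray(r1, float) - np.asarray(r2, float))
    return -beta * lj(a)
```

The potential attains its minimum LJ = −1/4 at a = 2^{1/6}, so the density peaks when the interatomic distance equals 2^{1/6}; only this log-density (not its normalization) is needed in the acceptance rule (1).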

V. CONCLUSION
This paper provided a new self-learning Metropolis-Hastings algorithm based on quantum computing, together with an important subclass of this sampler that uses the QFT. This QFT sampler was shown to be classically simulable, and the effectiveness of this quantum-inspired method was supported by several numerical simulations. There remain many directions to investigate, such as the choice of the optimizer and of the conditioning functions for constructing the multistage QFT sampler. In particular, in extending to the case of multi-dimensional distributions, a completely different scheme than the proposed multistage QFT sampler could be considered.
Although the QFT sampler offers a certain advantage over some classical sampling schemes, as discussed at the end of Sec. III A, this quantum-inspired algorithm of course does not fully exploit the true power of quantum computation. The real goal is to establish a genuine quantum sampler, which may provide faster sampling with a higher acceptance ratio than any conventional classical method. Such a direction is in fact found in the literature [21,22], though these proposals are far beyond the reach of currently available devices. We hope that the present work might be a first step toward bridging this gap.
Appendix B: Classical simulation of the QFT sampler
Here we explain the detailed classical algorithm for simulating our circuit in the case of N = 4 and M = 2; hence |in⟩ = |ψ(θ)⟩ ⊗ |g⟩ ⊗ |g⟩, with |ψ(θ)⟩ a two-qubit state. In this case, the original quantum circuit implementing the QFT sampler is shown in Fig. 7A. The main body of this circuit is the QFT, which consists of the Hadamard gates denoted by H and the controlled-R_n gates, which rotate the target qubit by acting R_n = exp[−i2πσ_z 2^{−(n+1)}] on it iff the control qubit is |e⟩ = [0, 1]^⊤. The output of the QFT is measured in the computational basis, producing the binary output sequence b_1 b_2 b_3 b_4 with b_i ∈ {0, 1}. To classically simulate this circuit, we utilize the fact that, if a controlled rotation gate is immediately followed by a measurement on its control qubit, this process can be replaced with the following feedforward-type operation: the control qubit is first measured, and, if the measurement result is "1", the rotation acts on the target qubit. Repeating this replacement one by one from right to left in the circuit shown in Fig. 7A, we end up with Fig. 7B. That is, all controlled-rotation gates are replaced with 1-qubit rotation gates that depend on the antecedent measurement results.
Using this classically controlled implementation, the QFT sampler can be simulated on a classical computer as described below. In the example considered here, the input |in⟩ is divided into three unentangled parts {|ψ(θ)⟩, |g⟩, |g⟩}. Hence, these three parts can be tracked separately and adaptively, as shown in Fig. 7C. In general, the evolution of the QFT sampler can be computed by repeatedly multiplying 2^M-dimensional matrices onto |ψ(θ)⟩ and 2-dimensional matrices onto each |g⟩ in this adaptive manner, so that the total computational cost is of order O(2^M + N). Therefore, the QFT sampler can be classically simulated as long as 2^M is a classically tractable number.
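The adaptive scheme admits a compact classical sketch (our own derivation and naming, under the convention that |ψ(θ)⟩ occupies the computational-basis indices 0, ..., 2^M − 1): the N − M qubits prepared in |g⟩ yield unbiased random low-order output bits, which then classically control phase rotations on the 2^M-dimensional register before it is Fourier transformed and measured.

```python
import numpy as np

def qft_sampler_draw(theta, N, rng):
    """Draw one sample from q_QFT(x; theta) without ever forming the
    full 2^N-dimensional state (adaptive-measurement trick, sketched)."""
    M = int(np.log2(len(theta)))
    # The N - M qubits prepared in |g> give unbiased random low bits l.
    l = int(rng.integers(0, 2 ** (N - M)))
    j = np.arange(2 ** M)
    # Classically controlled phases accumulated from the measured bits.
    psi = theta * np.exp(2j * np.pi * l * j / 2 ** N)
    # Remaining M-qubit QFT on the small register, then measure the
    # high bits h; the cost is dominated by this 2^M-dimensional step.
    h = np.arange(2 ** M)[:, None]
    F = np.exp(2j * np.pi * h * j[None, :] / 2 ** M) / np.sqrt(2 ** M)
    probs = np.abs(F @ psi) ** 2
    probs /= probs.sum()
    high = int(rng.choice(2 ** M, p=probs))
    return (high << (N - M)) | l
```

Marginalizing this two-step procedure reproduces Eq. (7): the low bits are exactly uniform because Σ_h |Σ_j ψ_j e^{2πi hj/2^M}|² = 2^M ‖ψ‖² by Parseval's identity, and the conditional of the high bits is precisely the phased-and-transformed small register.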