Quantum Boltzmann Machine

Inspired by the success of Boltzmann Machines based on classical Boltzmann distribution, we propose a new machine learning approach based on quantum Boltzmann distribution of a transverse-field Ising Hamiltonian. Due to the non-commutative nature of quantum mechanics, the training process of the Quantum Boltzmann Machine (QBM) can become nontrivial. We circumvent the problem by introducing bounds on the quantum probabilities. This allows us to train the QBM efficiently by sampling. We show examples of QBM training with and without the bound, using exact diagonalization, and compare the results with classical Boltzmann training. We also discuss the possibility of using quantum annealing processors like D-Wave for QBM training and application.


I. INTRODUCTION
Machine learning is a rapidly growing field in computer science with applications in computer vision, voice recognition, medical diagnosis, spam filtering, search engines, etc. [1] Machine learning algorithms operate by constructing a model with parameters that can be determined (learned) from a large number of example inputs, called the training set. The trained model can then make predictions about unseen data; the ability to do so is called generalization. This could be, for example, detecting an object, like a cat, in an image, or recognizing a command from a voice input. One approach to machine learning is probabilistic modeling, in which the probability distribution of the data (P^data_v for a given state v) is approximated based upon a finite set of samples. If the training is successful, the learned distribution P_v resembles the actual distribution of the data, P^data_v, closely enough that the model can make correct predictions about unseen situations. Depending upon the details of the distributions and the approximation technique, machine learning can be used to perform classification, clustering, collaborative filtering, compression, denoising, inpainting, or a variety of other algorithmic tasks [2].
The possibility of using quantum computation for machine learning has been considered theoretically for both gate-model [3][4][5] and quantum annealing [6][7][8][9][10][11][12][13] schemes. With the development of quantum annealing processors [14], it has become possible to test machine learning ideas on actual quantum hardware [15,16]. In all of the above works, however, the quantum processor is considered as a means to provide fast solutions to an otherwise classical problem. In other words, the model stays classical and quantum mechanics is only used to facilitate the training. In this work, we propose a quantum probabilistic model for machine learning based on the Boltzmann distribution of a quantum Hamiltonian, hence a Quantum Boltzmann Machine (QBM). As we shall see, in our approach the quantum nature of the processor is exploited both in the model and in the training process.
The Boltzmann machine (BM) is a classic machine learning technique, and serves as the basis of powerful deep learning models such as deep belief networks and deep Boltzmann machines [17,18,20]. It comprises a probabilistic network of binary units with a quadratic energy function. In principle, one could consider more general energy functions to bring in more flexibility [19,21,22], but training can become impractical and generalization suffers as the number of parameters grows. A BM commonly consists of visible and hidden binary units, which we jointly denote by z_a, a = 1, ..., N, where N is the total number of units. To maintain consistency with the standard notation in quantum mechanics, we use z_a ∈ {−1, +1}, rather than z_a ∈ {0, 1}; the corresponding probability distributions are identical up to a linear transformation of their parameters. To distinguish the visible and hidden variables, we use the notation z_a = (z_ν, z_i), with index ν for visible variables and i for hidden ones. We also use vector notations v, h, and z = (v, h) to represent states of visible, hidden, and combined units, respectively. In physics language, the quadratic energy function over binary units z_a is referred to as the Ising model, with the energy function

E_z = −∑_a b_a z_a − ∑_{a,b} w_ab z_a z_b.    (1)

The dimensionless parameters b_a and w_ab are tuned during the training [34]. In equilibrium, the probability of observing a state v of the visible variables is given by the Boltzmann distribution summed over the hidden variables,

P_v = Z^{−1} ∑_h e^{−E_z},   Z = ∑_z e^{−E_z},    (2)

called the marginal distribution. Our goal is to determine the Hamiltonian parameters, θ ∈ {b_a, w_ab}, such that P_v becomes as close as possible to P^data_v defined by the training set. To achieve this, we need to maximize the average log-likelihood, or equivalently minimize the average negative log-likelihood, defined by

L = −∑_v P^data_v log P_v,    (3)

which for the probability distribution (2) is

L = −∑_v P^data_v log [ ∑_h e^{−E_z} / ∑_z e^{−E_z} ].    (4)

The minimization can be done using the gradient descent technique.

[arXiv:1601.02036v1 [quant-ph] 8 Jan 2016]
In each iteration, the parameter θ is changed by a small step in the direction opposite to the gradient,

δθ = −η ∂_θ L,    (5)

where the learning rate, η, controls the step sizes. An important requirement for applicability of the gradient descent technique is the ability to calculate the gradients ∂_θ L efficiently. Using (4), we have

∂_θ L = ∑_v P^data_v ( ⟨∂_θ E_z⟩_v − ⟨∂_θ E_z⟩ ) = ⟨⟨∂_θ E_z⟩_v⟩_data − ⟨∂_θ E_z⟩,    (6)

where ⟨⋯⟩ and ⟨⋯⟩_v are Boltzmann averages with free and fixed visible variables, respectively, and ⟨⟨⋯⟩_v⟩_data ≡ ∑_v P^data_v ⟨⋯⟩_v denotes the double average. Fixing the visible variables to the data is usually called clamping. Using (1) for E_z, we obtain the gradient steps

δb_a = η ( ⟨⟨z_a⟩_v⟩_data − ⟨z_a⟩ ),    (7)
δw_ab = η ( ⟨⟨z_a z_b⟩_v⟩_data − ⟨z_a z_b⟩ ).    (8)

The gradient steps are expressed in terms of differences between the clamped (i.e., fixed v) and unclamped averages. These two terms are sometimes called the positive and negative phases. Since the averages can be estimated using sampling, the gradient estimation can be done efficiently, provided that we have an efficient way of performing the sampling.
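As a toy illustration of this training loop (our own sketch, not code from the paper): for a fully visible BM the clamped average reduces to the data average, and for a handful of units the Boltzmann averages can be computed by exact enumeration of all 2^N states in place of sampling. All names and parameter values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4  # toy size: all units visible, so the clamped average is just the data average

# enumerate all 2^N spin states z in {-1, +1}^N
states = np.array([[1 - 2 * ((i >> a) & 1) for a in range(N)]
                   for i in range(2 ** N)], dtype=float)

def boltzmann(b, W):
    """Exact Boltzmann distribution p_z = e^{-E_z}/Z for E_z = -b.z - z.W.z/2."""
    E = -states @ b - 0.5 * np.einsum('sa,ab,sb->s', states, W, states)
    p = np.exp(-E)
    return p / p.sum()

def nll(b, W, p_data):
    """Average negative log-likelihood, eq. (3)-style."""
    return -np.sum(p_data * np.log(boltzmann(b, W)))

# target distribution generated by "true" parameters we then try to recover
b_true = rng.normal(0, 0.5, N)
W_true = np.triu(rng.normal(0, 0.5, (N, N)), 1)
W_true = W_true + W_true.T
p_data = boltzmann(b_true, W_true)

# gradient descent: delta_b = eta(<z>_data - <z>), delta_W = eta(<zz>_data - <zz>)
b, W = np.zeros(N), np.zeros((N, N))
pos_z = p_data @ states                                  # positive phase (clamped)
pos_zz = np.einsum('s,sa,sb->ab', p_data, states, states)
before = nll(b, W, p_data)
for _ in range(200):
    p = boltzmann(b, W)                                  # negative phase (free)
    b += 0.1 * (pos_z - p @ states)
    W += 0.1 * (pos_zz - np.einsum('s,sa,sb->ab', p, states, states))
after = nll(b, W, p_data)
assert after < before  # the negative log-likelihood decreases
```

For realistic sizes the exact enumeration is replaced by Monte Carlo (e.g. Gibbs) sampling of the positive and negative phases; the update rule itself is unchanged.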

II. QUANTUM BOLTZMANN MACHINE
We now replace the classical spins or bits in (1) with quantum bits (qubits). The mathematics of quantum mechanics is based on matrices (operators) with dimensionality equal to the number of possible states (2^N). This is in contrast to vectors with dimensionality equal to the number of variables (N) used in common machine learning techniques. For instance, instead of the energy function (1), one considers a 2^N × 2^N diagonal matrix, called the Hamiltonian:

⟨z|H|z′⟩ = E_z δ_{zz′}.    (9)

This Hamiltonian is constructed in such a way that its diagonal elements are the energy values (1) corresponding to all 2^N binary states z, ordered lexicographically. To generate such a Hamiltonian, we replace z_a in (1) with

σ^z_a ≡ I ⊗ ⋯ ⊗ I ⊗ σ^z ⊗ I ⊗ ⋯ ⊗ I,    (10)

where ⊗ means tensor product (sometimes called Kronecker or outer product) and σ^z = [[1, 0], [0, −1]] is a Pauli matrix. Every element in (10) is an identity matrix (I) except the a-th element, which is the Pauli matrix σ^z. Equation (1) will therefore be replaced by the diagonal Hamiltonian

H = −∑_a b_a σ^z_a − ∑_{a,b} w_ab σ^z_a σ^z_b,    (11)

where b_a and w_ab are still scalars. Fig. 1a shows an example of such a model with visible and hidden qubits depicted as blue and red circles, respectively. We represent eigenstates of this Hamiltonian by |v, h⟩, where again v and h denote visible and hidden variables, respectively. We can now define matrix exponentiation through the Taylor expansion, e^{−H} = ∑_{k=0}^∞ (−H)^k / k!. For a diagonal Hamiltonian, e^{−H} is a diagonal matrix with its 2^N diagonal elements being e^{−E_z}, corresponding to all the 2^N states. With the partition function given by Z = Tr[e^{−H}] (cf. (2)), we define the density matrix as

ρ = Z^{−1} e^{−H}.    (12)

The diagonal elements of ρ are therefore the Boltzmann probabilities of all the 2^N states. For a given state |v⟩ of the visible variables, we can obtain the marginal Boltzmann probability P_v by tracing over the hidden variables,

P_v = Tr[Λ_v ρ],    (13)

where Λ_v limits the trace to only the diagonal terms that correspond to the visible variables being in state v. Thus, Λ_v is a diagonal matrix with diagonal elements being either 1, when the visibles are in state v, or 0 otherwise.
In operator notation, we write

Λ_v = |v⟩⟨v| ⊗ I_h,    (14)

where I_h is the identity matrix acting on the hidden variables, and |v⟩⟨v| is a projection operator in the subspace of the visible variables. Equations (2) and (13) are equivalent when the Hamiltonian, and therefore the density matrix, is diagonal, but (13) also holds for non-diagonal matrices.
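As a concrete check of this construction (an illustrative sketch, not code from the paper; helper names such as `op_at` are ours), the following builds σ^z_a by Kronecker products, verifies that the diagonal of H reproduces the classical Ising energies in lexicographic order, and evaluates P_v with a projector Λ_v:

```python
import numpy as np
from functools import reduce

I2 = np.eye(2)
sz = np.diag([1.0, -1.0])  # Pauli sigma^z

def op_at(op, a, N):
    """I (x) ... (x) op at position a (x) ... (x) I, as in eq. (10)."""
    return reduce(np.kron, [op if k == a else I2 for k in range(N)])

N = 3
rng = np.random.default_rng(1)
b = rng.normal(size=N)
W = np.triu(rng.normal(size=(N, N)), 1)

# diagonal Hamiltonian H = -sum_a b_a sz_a - sum_{a<c} w_ac sz_a sz_c
H = sum(-b[a] * op_at(sz, a, N) for a in range(N))
H = H + sum(-W[a, c] * op_at(sz, a, N) @ op_at(sz, c, N)
            for a in range(N) for c in range(a + 1, N))

# lexicographic states: basis index i has qubit 0 as its most significant bit
def state(i):
    return np.array([1 - 2 * ((i >> (N - 1 - k)) & 1) for k in range(N)])

def E(z):
    return -z @ b - sum(W[a, c] * z[a] * z[c]
                        for a in range(N) for c in range(a + 1, N))

# diagonal elements of H are the classical energies E_z
assert np.allclose(np.diag(H), [E(state(i)) for i in range(2 ** N)])

# density matrix rho = e^{-H}/Z; for diagonal H these are Boltzmann weights
rho = np.diag(np.exp(-np.diag(H)))
rho /= np.trace(rho)

# P_v via Lambda_v = |v><v| (x) I_h (qubit 0 visible, the rest hidden)
ket_up = np.array([1.0, 0.0])
Lam = np.kron(np.outer(ket_up, ket_up), np.eye(2 ** (N - 1)))
P_up = np.trace(Lam @ rho)
assert 0 < P_up < 1
```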
We can now add a transverse field to the Ising Hamiltonian by introducing the non-diagonal matrices

σ^x_a ≡ I ⊗ ⋯ ⊗ I ⊗ σ^x ⊗ I ⊗ ⋯ ⊗ I,  σ^x = [[0, 1], [1, 0]],    (15)

with the Pauli matrix σ^x in the a-th position. The transverse-field Ising Hamiltonian is then

H = −∑_a ( Γ_a σ^x_a + b_a σ^z_a ) − ∑_{a,b} w_ab σ^z_a σ^z_b.    (16)

Every eigenstate of H is now a superposition in the computational basis made of the classical states |v, h⟩. As the probabilistic model for the QBM, we use the quantum Boltzmann distribution with the density matrix (12), which now has off-diagonal elements. In each measurement, the states of the qubits are read out in the σ^z basis and the outcome is a classical value ±1. Because of the statistical nature of quantum mechanics, after each measurement a classical output v appears for the visible variables with the probability P_v given by (13).
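The effect of the transverse field can be made concrete with a minimal two-qubit sketch (ours, not from the paper; parameter values are arbitrary illustrative choices): ρ acquires off-diagonal elements, yet the P_v obtained from (13) still form a valid probability distribution over measurement outcomes.

```python
import numpy as np
from functools import reduce

I2 = np.eye(2)
sz = np.diag([1.0, -1.0])
sx = np.array([[0.0, 1.0], [1.0, 0.0]])  # non-diagonal Pauli sigma^x

def op_at(op, a, N):
    return reduce(np.kron, [op if k == a else I2 for k in range(N)])

def gibbs(H):
    """rho = e^{-H}/Tr[e^{-H}] via eigendecomposition (H is Hermitian)."""
    w, V = np.linalg.eigh(H)
    r = (V * np.exp(-w)) @ V.T
    return r / np.trace(r)

N, Gamma, b, w = 2, 1.0, 0.3, 0.5  # one visible (qubit 0), one hidden
H = -Gamma * (op_at(sx, 0, N) + op_at(sx, 1, N)) \
    - b * (op_at(sz, 0, N) + op_at(sz, 1, N)) \
    - w * op_at(sz, 0, N) @ op_at(sz, 1, N)

rho = gibbs(H)
# the density matrix now has off-diagonal elements
assert np.abs(rho - np.diag(np.diag(rho))).max() > 1e-6

# P_v = Tr[Lambda_v rho] for the visible qubit in state v = +/-1
def P_v(v):
    ket = np.array([1.0, 0.0]) if v == +1 else np.array([0.0, 1.0])
    Lam = np.kron(np.outer(ket, ket), I2)
    return np.trace(Lam @ rho)

assert abs(P_v(+1) + P_v(-1) - 1) < 1e-9  # still a proper distribution
```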
To train a QBM, we change the parameters θ such that the probability distribution P_v becomes close to P^data_v of the input data. This is achieved by minimizing the negative log-likelihood, which from (3), (12), and (13) is

L = −∑_v P^data_v log ( Tr[Λ_v e^{−H}] / Tr[e^{−H}] ).    (17)

The gradient of L is given by

∂_θ L = ∑_v P^data_v ( Tr[∂_θ e^{−H}] / Tr[e^{−H}] − Tr[Λ_v ∂_θ e^{−H}] / Tr[Λ_v e^{−H}] ).    (18)

Once again, we hope to be able to estimate the gradients efficiently using sampling. However, since H and ∂_θ H are now matrices that do not commute, ∂_θ e^{−H} ≠ −e^{−H} ∂_θ H, and therefore we do not trivially obtain expectations of ∂_θ H as in the classical case. Writing

e^{−H} = (e^{−δτ H})^n,  δτ ≡ 1/n,    (19)

and introducing the imaginary time τ ≡ m δτ, in the limit n → ∞ we obtain

∂_θ e^{−H} = −∫_0^1 dτ e^{−τH} (∂_θ H) e^{−(1−τ)H}.    (20)

Tracing over both sides and using the permutation property of the trace, we find

Tr[∂_θ e^{−H}] = −Tr[e^{−H} ∂_θ H],    (21)

which is the same as the classical relation. Plugging this into the first term of (18) gives

∑_v P^data_v Tr[∂_θ e^{−H}] / Tr[e^{−H}] = −⟨∂_θ H⟩,    (22)

where ⟨⋯⟩ ≡ Tr[ρ ⋯] denotes Boltzmann averaging; this term can be estimated by sampling from the distribution (12). The second term in (18), however, involves Tr[Λ_v ∂_θ e^{−H}], in which Λ_v does not commute with the factors e^{−τH}, and it cannot be estimated using sampling. This renders the training of a QBM inefficient and basically impractical for large systems. A workaround for this problem is to introduce a properly defined upper bound for L and minimize it, as we shall discuss below. We call this approach bound-based QBM (bQBM). Minimizing a bound on the negative log-likelihood is a common approach in machine learning.

A. Bound-based QBM
One can define a lower bound for the probabilities using the Golden–Thompson inequality [23,24],

Tr[e^A e^B] ≥ Tr[e^{A+B}],    (23)

which holds for any Hermitian matrices A and B. We can therefore write

P_v = Tr[Λ_v e^{−H}] / Tr[e^{−H}] = Tr[e^{ln Λ_v} e^{−H}] / Tr[e^{−H}],    (24)

where Λ_v = e^{ln Λ_v} is understood as a limit (since ln 0 = −∞), and by (23),

Tr[e^{ln Λ_v} e^{−H}] ≥ Tr[e^{−H + ln Λ_v}].    (25)

Introducing the clamped Hamiltonian

H_v ≡ H − ln Λ_v,    (26)

we can write

P_v ≥ Tr[e^{−H_v}] / Tr[e^{−H}].    (27)

Notice that H_v has an infinite energy penalty for any state of the visible qubits that is different from |v⟩. Therefore, for any practical purposes, the visible qubits can be replaced by their classical values:

H_v = −∑_i ( Γ_i σ^x_i + b_i σ^z_i ) − ∑_{i,j} w_ij σ^z_i σ^z_j − ∑_ν ( b_ν + ∑_i w_νi σ^z_i ) v_ν − ∑_{ν,ν′} w_νν′ v_ν v_ν′.    (28)

This is a clamped Hamiltonian, because every visible qubit σ^z_ν is clamped to its corresponding classical data value v_ν.
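The Golden–Thompson inequality is easy to verify numerically. The following sketch (ours, not from the paper) checks Tr[e^A e^B] ≥ Tr[e^{A+B}] on random Hermitian matrices, with equality when the matrices commute:

```python
import numpy as np

rng = np.random.default_rng(2)

def expm_h(A):
    """Matrix exponential of a Hermitian matrix via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * np.exp(w)) @ V.T

def rand_herm(n):
    A = rng.normal(size=(n, n))
    return (A + A.T) / 2

# Golden-Thompson: Tr[e^A e^B] >= Tr[e^{A+B}] for Hermitian A, B
for _ in range(100):
    A, B = rand_herm(4), rand_herm(4)
    assert np.trace(expm_h(A) @ expm_h(B)) >= np.trace(expm_h(A + B)) - 1e-9

# equality for commuting (e.g. diagonal) matrices
D1, D2 = np.diag(rng.normal(size=4)), np.diag(rng.normal(size=4))
assert np.isclose(np.trace(expm_h(D1) @ expm_h(D2)), np.trace(expm_h(D1 + D2)))
```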
From (17) and (27) it follows that

L ≤ L̃ ≡ −∑_v P^data_v log ( Tr[e^{−H_v}] / Tr[e^{−H}] ).    (29)

Instead of minimizing L, we now minimize its upper bound L̃ using the gradient

∂_θ L̃ = ⟨⟨∂_θ H_v⟩_v⟩_data − ⟨∂_θ H⟩,    (30)

where

⟨⋯⟩_v ≡ Tr[e^{−H_v} ⋯] / Tr[e^{−H_v}],  ⟨⟨⋯⟩_v⟩_data ≡ ∑_v P^data_v ⟨⋯⟩_v.    (31)

Taking θ to be b_a and w_ab, and using δθ = −η ∂_θ L̃, we obtain

δb_a = η ( ⟨⟨σ^z_a⟩_v⟩_data − ⟨σ^z_a⟩ ),    (32)
δw_ab = η ( ⟨⟨σ^z_a σ^z_b⟩_v⟩_data − ⟨σ^z_a σ^z_b⟩ ).    (33)

Again, the gradient steps are expressed in terms of differences between the clamped and unclamped averages, ⟨⋯⟩_v and ⟨⋯⟩, which can be obtained by sampling from Boltzmann distributions with the Hamiltonians H_v and H, respectively. In Section IV, we give examples of QBM training and compare the results of minimizing L using (18) with minimizing its upper bound L̃ using (30). One may also attempt to train Γ_a using the upper bound L̃. From (30) we obtain

δΓ_a = η ( ⟨⟨σ^x_a⟩_v⟩_data − ⟨σ^x_a⟩ ).    (34)

There are a few problems with using (34) to train Γ_a.
First of all, one cannot calculate ⟨σ^x_a⟩ by sampling in the σ^z basis; a measurement in the σ^x basis is needed to estimate ⟨σ^x_a⟩. Moreover, the first term in (34) is always zero for the visible variables, i.e., ⟨σ^x_ν⟩_v = 0 for all ν, because the visible qubits are clamped to classical values. Since ⟨σ^x_ν⟩ > 0 for positive Γ_ν, δΓ_ν will always be negative, which means Γ_ν → 0 for all visible variables. This is inconsistent with what we obtain when we train Γ_ν using the exact gradient (18); the vanishing Γ_ν is an artifact of the upper-bound minimization. In other words, we cannot learn the transverse field using the upper bound. One may still train the transverse field using the exact log-likelihood, but this quickly becomes inefficient as the size of the QBM grows.

B. Restricted QBM
So far we haven't imposed any restrictions on the connectivity between visible and hidden qubits, or on the lateral connectivity among visible or hidden qubits. We note that the calculation of the first terms in (32) and (33), sometimes called the positive phase, requires sampling from distributions with the clamped Hamiltonians (28). This sampling can become computationally expensive for a large data set, because it has to be done for every input data element. If we restrict our QBM to have no lateral connectivity in the hidden layer (see Fig. 1b), the hidden qubits become uncoupled in the positive phase and the calculations can be carried out exactly. We can write the clamped Hamiltonian (28) as

H_v = −∑_i ( Γ_i σ^x_i + b^eff_i(v) σ^z_i ) + const,  b^eff_i(v) ≡ b_i + ∑_ν w_νi v_ν,    (35)

where the constant collects the terms that depend only on v. Since the hidden qubits are now decoupled, the clamped averages entering (32) and (33) can be computed exactly:

⟨σ^z_i⟩_v = (b^eff_i / D_i) tanh D_i,  ⟨σ^z_ν σ^z_i⟩_v = v_ν ⟨σ^z_i⟩_v,    (36)

where

D_i ≡ √( Γ_i² + (b^eff_i)² ).    (37)

Notice that (36) reduces to the classical RBM expression, ⟨σ^z_i⟩_v = tanh b^eff_i, in the limit Γ_i → 0. We emphasize that unlike the classical RBM, in which there are no lateral connections in either the hidden or the visible layer (for contrastive divergence techniques to work), we only require their absence in the hidden layer; such a model is usually called a semi-restricted Boltzmann machine [26]. In Section IV we give an example of training an RQBM and illustrate the importance of using (36) instead of its classical limit.
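The closed-form clamped expectation ⟨σ^z_i⟩_v = (b^eff_i / D_i) tanh D_i, with D_i = √(Γ_i² + (b^eff_i)²), can be checked against exact diagonalization of a single clamped hidden qubit. The sketch below is ours (not from the paper); parameter values are arbitrary.

```python
import numpy as np

def exact_sz(Gamma, b_eff):
    """<sigma^z> for a single qubit with H = -Gamma*sx - b_eff*sz, by diagonalization."""
    sx = np.array([[0.0, 1.0], [1.0, 0.0]])
    sz = np.diag([1.0, -1.0])
    H = -Gamma * sx - b_eff * sz
    w, V = np.linalg.eigh(H)
    rho = (V * np.exp(-w)) @ V.T
    rho /= np.trace(rho)
    return np.trace(sz @ rho)

def closed_form_sz(Gamma, b_eff):
    """The RQBM positive-phase formula: (b_eff/D) tanh D, D = sqrt(Gamma^2 + b_eff^2)."""
    D = np.hypot(Gamma, b_eff)
    return (b_eff / D) * np.tanh(D)

for G, b in [(0.5, 1.2), (2.0, -0.7), (1e-9, 0.9)]:
    assert abs(exact_sz(G, b) - closed_form_sz(G, b)) < 1e-7

# Gamma -> 0 recovers the classical RBM expression tanh(b_eff)
assert abs(closed_form_sz(1e-9, 0.9) - np.tanh(0.9)) < 1e-6
```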

III. SUPERVISED LEARNING
One important application of machine learning is classification in which a category (label) is assigned to each data point. For example, in spam detection the goal is to determine which of the two labels, "spam" or "not spam", should be assigned to a given text. The process of inferring a functional relation between input and label from a set of labeled data is called supervised learning.
Denoting the feature vector (input) by x and the label (output) by y, the problem is to infer a function g(x): x → y from a set of labeled data (x_i, y_i). In probabilistic approaches to this problem, which are of our main interest here, the output y that is most probable given the input x is chosen as the label. Therefore, the function g(x) is determined by the conditional probability P_{y|x} of output given input,

g(x) = argmax_y P_{y|x}.    (38)

The end goal of training is to make P_{y|x} as close as possible to the conditional distribution of the data, P^data_{y|x}. Assuming that the data comes with a joint probability distribution P^data_{x,y}, we can write P^data_{y|x} = P^data_{x,y} / P^data_x, where P^data_x = ∑_y P^data_{x,y} is the marginal distribution. Two possible approaches to supervised learning are discriminative and generative learning [35]. In the discriminative approach, for each x we try to learn the conditional distribution P^data_{y|x}. If an input x appears in the training set with probability P^data_x, the loss function can be written as

L_discr = −∑_x P^data_x ∑_y P^data_{y|x} log P_{y|x}.    (39)

In the generative approach, on the other hand, we learn the joint probability distribution without separating input from output. The loss function is therefore

L_gen = −∑_{x,y} P^data_{x,y} log P_{x,y} = L_discr − ∑_x P^data_x log P_x,    (40)

where we have used P_{x,y} = P_{y|x} P_x. Notice that the first term is just L_discr, while the second term measures the difference between the probability distribution of the training-set inputs and the marginal distribution P_x. This second term is a cross-entropy, and it is equal to the KL divergence, Eq. (58), up to a constant. We now explore the possibility of applying QBM to both cases.
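The decomposition of the generative loss into the discriminative loss plus a cross-entropy over inputs holds for any pair of joint distributions, and can be verified directly (our illustrative sketch, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
nx, ny = 4, 3

# random joint distributions over (x, y) standing in for data and model
P_data = rng.random((nx, ny)); P_data /= P_data.sum()
P_model = rng.random((nx, ny)); P_model /= P_model.sum()

Px_data = P_data.sum(axis=1)                  # marginal over inputs, data
Px_model = P_model.sum(axis=1)                # marginal over inputs, model
Pyx_model = P_model / Px_model[:, None]       # model conditional P(y|x)

L_gen = -np.sum(P_data * np.log(P_model))     # generative loss
L_discr = -np.sum(P_data * np.log(Pyx_model)) # discriminative loss
cross_entropy = -np.sum(Px_data * np.log(Px_model))

# L_gen = L_discr + cross-entropy of the input marginals
assert abs(L_gen - (L_discr + cross_entropy)) < 1e-9
```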

A. Generative learning
Generative learning with the loss (40) can be done with the methods of Section II by treating input and output (x, y) jointly as the visible data, v = [x, y], in a QBM. At the end of training, the QBM provides samples with a joint probability P_{x,y} that is close to P^data_{x,y}. Therefore, the conditional probability

P_{y|x} = P_{x,y} / P_x    (41)

should also match P^data_{y|x}, as desired for supervised training. However, there is a problem when it comes to sampling from this conditional for a given x. If the input x appears with a very small probability (P_x ≪ 1), a large number of samples from P_{x,y} and P_x is required to reliably calculate P_{y|x} using (41).
In a classical BM, one can sample from the conditional distribution by clamping the input variables x to the data and sampling the output y. To understand how that strategy would work for a QBM, let us introduce a clamped Hamiltonian

H_x ≡ H − ln Λ_x,    (42)

where Λ_x projects the input qubits onto state x, and the corresponding clamped distribution

P^clamped_{y|x} = Tr[Λ_y e^{−H_x}] / Tr[e^{−H_x}].    (43)

When [H, Λ_x] = 0, we have e^{−H} e^{ln Λ_x} = e^{−H_x}, and it follows that

P^clamped_{y|x} = P_{y|x}.    (44)

This means that, for any x, we can sample P_{y|x} from H_x in an efficient way, regardless of how small P_x is. For quantum Hamiltonians with a transverse field, however, [H, Λ_x] ≠ 0, so e^{−H} e^{ln Λ_x} ≠ e^{−H_x}. Therefore, P^clamped_{y|x} is not necessarily equal to P_{y|x}, and there is no easy way to draw samples from P_{y|x}.
One might still hope that the clamped distribution is not too far off from (41), and can be used as an approximation, P^clamped_{y|x} ≈ P_{y|x}. As we shall see in an example in Sec. IV C, this is not true in general.
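The discrepancy already shows up for two qubits. The sketch below (ours, not from the paper; the parameter values are arbitrary illustrative choices) compares the true conditional P(y = +1 | x = +1) with its clamped proxy: without a transverse field clamping is exact, while with a transverse field the two distributions differ.

```python
import numpy as np

I2 = np.eye(2)
sz = np.diag([1.0, -1.0])
sx = np.array([[0.0, 1.0], [1.0, 0.0]])
up, dn = np.array([1.0, 0.0]), np.array([0.0, 1.0])

def gibbs(H):
    w, V = np.linalg.eigh(H)
    r = (V * np.exp(-w)) @ V.T
    return r / np.trace(r)

def cond_vs_clamped(Gamma, b1=0.2, b2=-0.4, w12=0.8):
    """Qubit 0 = input x, qubit 1 = output y; compare P(y=+1|x=+1) with clamping."""
    H = -Gamma * (np.kron(sx, I2) + np.kron(I2, sx)) \
        - b1 * np.kron(sz, I2) - b2 * np.kron(I2, sz) - w12 * np.kron(sz, sz)
    rho = gibbs(H)
    P = lambda k0, k1: np.trace(np.kron(np.outer(k0, k0), np.outer(k1, k1)) @ rho)
    cond = P(up, up) / (P(up, up) + P(up, dn))  # true conditional
    # clamp x = +1 classically: single-qubit Hamiltonian on y with shifted bias
    H_x = -Gamma * sx - (b2 + w12) * sz
    clamped = np.trace(np.outer(up, up) @ gibbs(H_x))
    return cond, clamped

c0, cl0 = cond_vs_clamped(0.0)
assert abs(c0 - cl0) < 1e-9   # no transverse field: clamping is exact

c1, cl1 = cond_vs_clamped(1.5)
assert abs(c1 - cl1) > 1e-4   # transverse field: the two distributions differ
```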

B. Discriminative learning
In discriminative learning, one distinguishes input from output during the training [2] and learns the conditional probability distribution using (39). This can be done by clamping the input x in both the positive and negative phases. Since the input is always clamped, its role is just to apply biases to the other variables, and therefore we don't need to assign any qubits to the input (see Fig. 1c). The Hamiltonian of the system for a particular state of the input, x, is given by

H_x = −∑_a [ Γ_a σ^x_a + b^eff_a(x) σ^z_a ] − ∑_{a,b} w_ab σ^z_a σ^z_b,    (45)

where the indices a and b range over both hidden and visible (output-only) variables. Here, the input x provides a bias to the a-th qubit through the effective field

b^eff_a(x) ≡ b_a + ∑_μ w_aμ x_μ,    (46)

where b_a and w_aμ are tunable parameters. Notice that x_μ does not need to be restricted to binary numbers, which can bring more flexibility to the supervised learning. The probability of measuring an output state y once the input is set to state x is given by

P_{y|x} = Tr[Λ_y e^{−H_x}] / Tr[e^{−H_x}],    (47)

where H_x is given by (45) and Λ_y = |y⟩⟨y| ⊗ I_h, with I_h an identity matrix acting on the hidden variables. The negative log-likelihood is given by (39). Using the same tricks as discussed in the previous section, we can define a clamped Hamiltonian

H_{x,y} ≡ H_x − ln Λ_y,    (48)

in which the output qubits are also clamped, to y. The derivative of the upper bound L̃ with respect to a Hamiltonian parameter θ is given by

∂_θ L̃ = ∑_{x,y} P^data_{x,y} ( ⟨∂_θ H_{x,y}⟩_{x,y} − ⟨∂_θ H_x⟩_x ),    (49)

where ⟨⋯⟩_{x,y} and ⟨⋯⟩_x denote Boltzmann averages with respect to the Hamiltonians H_{x,y} and H_x, respectively. The gradient descent steps in the parameter space are given by

δb_a = η ∑_{x,y} P^data_{x,y} ( ⟨σ^z_a⟩_{x,y} − ⟨σ^z_a⟩_x ),    (50)
δw_ab = η ∑_{x,y} P^data_{x,y} ( ⟨σ^z_a σ^z_b⟩_{x,y} − ⟨σ^z_a σ^z_b⟩_x ),    (51)
δw_aμ = η ∑_{x,y} P^data_{x,y} x_μ ( ⟨σ^z_a⟩_{x,y} − ⟨σ^z_a⟩_x ).    (52)

Notice that x is clamped not only in the positive phase (the first expectations), but also in the negative phase (the second expectations). The positive phase can still be computed efficiently if we use an RQBM (Fig. 1c, with no lateral connections among the hidden variables). The negative phase needs a more involved sampling method. This can make the calculation of the gradient steps computationally expensive for large data sets, unless a very fast sampling method is available.

IV. EXAMPLES
In this Section we describe a few toy examples illustrating the ideas described in the previous sections.
In all examples studied, the training data was generated as a mixture of M factorized distributions (modes), each peaked around a random point. Every mode k is constructed by randomly selecting a center point s^k = [s^k_1, s^k_2, ..., s^k_N], with s^k_ν ∈ {±1}, and using a Bernoulli distribution:

P^k_v = p^{N − d^k_v} (1 − p)^{d^k_v},    (56)

where p is the probability of qubit ν being aligned with s^k_ν, and d^k_v is the Hamming distance between v and s^k. The average probability distribution over M such modes gives our data distribution

P^data_v = M^{−1} ∑_{k=1}^M P^k_v.    (57)

In all our examples, we choose p = 0.9 and M = 8.
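This multi-modal data distribution can be generated in a few lines (our sketch, not code from the paper; we use smaller N and M than the paper's N = 10, M = 8, purely for speed):

```python
import numpy as np

rng = np.random.default_rng(4)
N, M, p = 6, 4, 0.9

# random mode centers s^k in {-1, +1}^N
modes = rng.choice([-1, 1], size=(M, N))

# all 2^N visible states
states = np.array([[1 - 2 * ((i >> k) & 1) for k in range(N)]
                   for i in range(2 ** N)])

def mode_prob(v, s, p):
    """Bernoulli mode: p^{N-d} (1-p)^d, d = Hamming distance between v and s."""
    d = int(np.sum(v != s))
    return p ** (N - d) * (1 - p) ** d

# mixture of the M modes
P_data = np.array([np.mean([mode_prob(v, s, p) for s in modes]) for v in states])

assert abs(P_data.sum() - 1) < 1e-9   # a proper distribution
assert P_data.max() >= p ** N / M     # each center carries weight at least p^N / M
```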
To have a measure of the quality of learning, we subtract from L its minimum, L_min = −∑_v P^data_v log P^data_v, attained when P_v = P^data_v. The difference, commonly called the Kullback–Leibler (KL) divergence,

KL = ∑_v P^data_v log ( P^data_v / P_v ),    (58)

is a non-negative number measuring the difference between the two distributions; KL = 0 if and only if the two distributions are identical.
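A minimal implementation of this measure (ours, not from the paper; the small epsilon is an assumption to guard against zero probabilities):

```python
import numpy as np

def kl(p_data, p_model, eps=1e-12):
    """KL divergence sum_v p_data(v) log(p_data(v)/p_model(v))."""
    p_data = np.asarray(p_data, dtype=float)
    p_model = np.asarray(p_model, dtype=float)
    return float(np.sum(p_data * np.log((p_data + eps) / (p_model + eps))))

p = np.array([0.5, 0.3, 0.2])
assert abs(kl(p, p)) < 1e-9                            # zero iff identical
assert kl(p, np.array([1 / 3, 1 / 3, 1 / 3])) > 0      # positive otherwise
```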

A. Fully visible model
We start with a fully visible example to compare BM with QBM and to evaluate the quality of the bound (29) by training a bQBM. We consider a fully connected model with N = 10 qubits. The classical BM has N + N(N−1)/2 = N(N+1)/2 trainable parameters (b_a, w_ab). The Hamiltonian of the QBM has the form (16), where we restrict all Γ_a to be the same (= Γ). In order to understand the efficiency of QBM in representing data, we train the exact log-likelihood using (18), treating b_a, w_ab, and Γ as trainable parameters. This can be done using exact diagonalization for small systems. We also perform training of the bound L̃ using (30), treating (b_a, w_ab) as trainable parameters but fixing Γ to an ad hoc nonzero value, Γ = 2. Comparing the training results of QBM with bQBM gives us some idea of the efficiency of training the bound L̃.
Since all expectations entering the gradients of the log-likelihood are computed exactly, we use the second-order optimization routine BFGS [25]. The results of training BM, QBM, and bQBM are given in Fig. 2a. The x-axis in the figure corresponds to iterations of BFGS, which performs a line search along the gradient. QBM is able to learn the data noticeably better than BM, and bQBM approaches a value close to that of QBM. In order to visualize the training process, we keep track of the average magnitudes of the classical and quantum parts of the Hamiltonian during the training,

E_cl ≡ ⟨ ∑_a b_a σ^z_a + ∑_{a,b} w_ab σ^z_a σ^z_b ⟩,  E_q ≡ ⟨ ∑_a Γ σ^x_a ⟩.    (53)

Fig. 2b shows the learning trajectories in the (|E_cl|, |E_q|) plane. BM learns a model with average energy ≈ 3.5 and KL ≈ 0.62. One can see that QBM, which starts off with Γ = 0.1, initially lowers Γ and learns (b_a, w_ab) that are close to the best classical result (see the inset). Soon after, QBM increases Γ and (b_a, w_ab) until it converges to a point with Γ = 2.5 and KL ≈ 0.42, which is better than the classical BM value. Having a fixed transverse field, Γ = 2, bQBM starts with a large E_q and approaches the parameters learned by QBM (although it does not reach the best value, at Γ = 2.5, learned by QBM).

B. Restricted QBM
We now consider a (semi-) restricted BM discussed in Section II B. Our toy model has 8 visible units and 2 hidden units. We allow full connectivity within the visible layer and all-to-all connectivity between the layers. The data is again generated using Eq. (57) for the visible variables, with p = 0.9 and M = 8. We present the results of training in Fig 3. Similarly to the fully visible model, QBM outperforms BM, and bQBM represents a good proxy for learning quantum distribution.
In order to illustrate the significance of consistently using the quantum distribution in evaluating the gradients (32) and (33), we also train a bQBM using the classical expressions (the Γ_i → 0 limit) instead of (36) for the expectations of the hidden units in the positive phase. The resulting machine (bQBM-CE in Fig. 3) learns worse than the purely classical BM, because the two terms in the gradient expressions are evaluated inconsistently.

C. Generative supervised learning
We consider a supervised learning example with 8 inputs and 3 outputs, with full connectivity between all units. For the training set we again use the multi-modal distribution (57) over x, with M = 8 and p = 0.9, and set the label y for each mode to be a 3-bit binary number from 0 to 7 [36]. Both BM and QBM are trained on the generative loss function (40). Our goal is to check whether P^clamped_{y|x} ≈ P_{y|x} when training a QBM in this generative setup. In Fig. 4a, we plot the KL-divergence based on the generative log-likelihood (40) for both the classical BM and the QBM. It is clear that QBM trains to a better KL-divergence than the classical BM. In Fig. 4b we plot the KL-divergence based on the discriminative log-likelihood (39), evaluated with the conditional probabilities P_{y|x} and the clamped probabilities P^clamped_{y|x}. One can see that although QBM is trained with the joint probability distribution, the conditional distribution is also learned better than with BM. The clamped probability distribution, on the other hand, starts out very close to the conditional distribution at the beginning of the iterations, when QBM and BM are close to each other. But as the transverse field in QBM starts to grow, the clamped distribution deviates from the conditional one, and its KL-divergence grows to a value much worse than the classical BM value. This shows that even for such a small example, the clamped distribution can be very different from the true conditional distribution.

V. QBM WITH A QUANTUM ANNEALING PROCESSOR
Recent developments in manufacturing quantum annealing processors have made it possible to experimentally test some of the quantum machine learning ideas. Up to now, many experiments have confirmed the existence of quantum phenomena in such processors [14,[27][28][29][30], including entanglement [31]. A quantum annealing processor implements the time-dependent Hamiltonian

H(s) = −A(s) ∑_a σ^x_a + B(s) ( ∑_a h_a σ^z_a + ∑_{a,b} J_ab σ^z_a σ^z_b ),    (59)

where s is the annealing parameter, A(s) and B(s) are the annealing schedules, and h_a and J_ab are tunable dimensionless parameters.

[Fig. 4 caption: As our training data we use artificial data from the Bernoulli mixture model (57) for inputs and 3-bit binary labels (0 to 7) for outputs. Training is done using the second-order optimization routine BFGS. (a) KL-divergence of the joint distribution (40) for the BM and QBM models during the training process; once again, QBM learns the distribution better than BM. (b) KL-divergence of the conditional distribution during the same training, for the BM and QBM models using (39), and for the clamped QBM (QBM-clamped) using (43). The conditional distribution is also learned better by QBM than by BM, but the clamped QBM distribution is very different from the conditional one and gives a KL-divergence much higher than the classical BM value.]
As discussed in Ref. [32], an open-system quantum annealer follows a quasistatic evolution, staying close to the equilibrium distribution

ρ = Z^{−1} e^{−βH(s)},    (60)

up to a point where the dynamics become too slow to establish equilibrium. Here, β = (k_B T)^{−1}, with T being the temperature and k_B the Boltzmann constant. The system then deviates from the equilibrium distribution, and soon after, the dynamics freeze (see Fig. 2c in Ref. [32] and the related discussion).
In general, a quantum annealer with linear annealing schedule s = t/t_a does not return a Boltzmann distribution. However, as argued in Ref. [32], if the dynamical slow-down and freeze-out happen within a short period of time during the annealing, then the final distribution will be close to the quantum Boltzmann distribution (60) at a single point s*, called the freeze-out point. In such a case, the quantum annealer with linear annealing schedule will provide approximate samples from the Boltzmann distribution corresponding to the Hamiltonian H(s*). Moreover, if A(s*) happens to be small enough that the quantum eigenstates at s* are close to the classical eigenstates, then the resulting Boltzmann distribution will be close to the classical Boltzmann distribution. In such a case, the quantum annealer can be used as an approximate classical Boltzmann sampler for training a BM, as was done in [15,16]. Unfortunately, not all problems have a narrow freeze-out region, and A(s*) is not always small. If the freeze-out does not happen in a narrow region, then the final probability distribution will depend on the history within this region and will not correspond to a Boltzmann distribution at any particular point. This limits the applicability of quantum annealers for Boltzmann sampling.
In principle, it is possible to controllably freeze the evolution at a desired point, s*, in the middle of the annealing and read out the qubits. One way to do this is via a nonuniform s(t) that anneals slowly at the beginning, up to s*, and then moves very fast (faster than all dynamics) to the end of the annealing. An experimental demonstration of such controlled sampling was done in [33] for a specially designed 16-qubit problem. If s* lies in the region where the evolution is still quasistatic, the quantum annealer will provide samples from the Boltzmann distribution of Hamiltonian (16), with

Γ_a = βA(s*),  b_a = −βB(s*) h_a,  w_ab = −βB(s*) J_ab.    (61)

Since h_a and J_ab are tunable parameters, if one can control the freeze-out point s*, then all the dimensionless parameters in (16), i.e., Γ, b_a, and w_ab, can be tuned, and therefore the quantum annealer can be used for training a QBM. The applicability of the controlled sampling technique used in [33] is limited by how fast the second part of the annealing can be done, which is ultimately determined by the bandwidth of the filters that bring electrical signals to the chip. Because of this, such a technique is currently only applicable to specially designed problems that have very slow dynamics. With some modifications to the current hardware design, however, such techniques may become possible for general problems relevant to QBM in the near future.

VI. CONCLUSION
We have examined the possibility of training a quantum Boltzmann machine (QBM), in which the classical Ising Hamiltonian is augmented with a transverse field. Motivated by the success of stochastic gradient descent in training classical Boltzmann machines, one may wish to use a similar technique to optimize the log-likelihood of the QBM. However, unlike the classical BM, for which the gradients of the log-likelihood can be estimated using sampling, the existence of a transverse field in the QBM makes the gradient estimation nontrivial. We have introduced a lower bound on the log-likelihood, for which the gradient can be estimated using sampling. We have shown examples of QBM training through maximizing both the log-likelihood and its lower bound, using exact diagonalization, and compared the results with classical BM training. We have shown small-size examples in which the QBM learned the data distribution better than the BM. Whether QBM can learn and generalize better than BM at larger sizes is a question that needs to be answered in future work.
Our method is different from other existing quantum machine learning proposals [3-13, 15, 16], because quantum mechanics is not only used to facilitate the training process, as in other proposals, but is also exploited in the model. In other words, the probabilistic model that we use in our QBM, i.e., the quantum Boltzmann distribution, is different from any other model that has been studied in the machine learning community. Therefore, the potential of the model for machine learning is unexplored.
We should mention that the similarity between BM and QBM training may not hold in all situations. For example, as we have shown in Sec. III A, sampling from a conditional distribution cannot be performed by clamping in QBM, as it is commonly done in classical BM. The two models may also differ in other aspects. Therefore, careful examination is needed before replacing BM with QBM in existing machine learning techniques.
Finally, we have discussed the possibility of using a quantum annealer for QBM training. Although the current commercial quantum annealers like D-Wave are not designed to provide quantum Boltzmann samples, with minor modifications to the hardware design, such a feature can become available. This would open new possibilities in both quantum information processing and machine learning research areas.