Entanglement Induced Barren Plateaus


A key part of a successful quantum machine learning algorithm is an efficient training algorithm. In recent years, several barren plateau results [12][13][14][15][16] have put limitations on the gradient-based training of QNNs. Our result complements the growing literature on barren plateaus in quantum computing. McClean et al. [12] first showed that unitary quantum neural networks generically suffer from gradients that vanish exponentially in the number of qubits. This issue stems from the concentration of measure [17,18] and was subsequently demonstrated for other QNNs [13,14]. Another type of barren plateau emerges from hardware noise in the system [15]. The key observation that we put forward in this work is that barren plateaus can also occur because of an excess of entanglement in deep quantum models.
In this paper, we prove that entanglement between visible and hidden units hinders the learning process. The inclusion of hidden units is essential in traditional machine learning. Without them, the expressive power of neural networks would be severely limited and deep learning all but impossible. In spite of this, little attention has been paid to the effect of hidden units on the training of QNNs. Surely, the expressive power of hidden units would translate to the quantum world? Numerical experiments seem to contradict this intuition. A small-scale numerical study [19] showed that adding hidden units to quantum Boltzmann machines did not lead to a higher quality of reproduction. While this could be attributed to the small size of the QNN and the simplicity of the data, in our work we show that quantum Boltzmann machines do not benefit from a large number of hidden units.
We build intuition by exploring the statistical relationship between a random state and maximally entangled states in a bipartite quantum system. A classic thermalization result [20] shows that for a random initial state, the reduced state on the visible units is, with high probability, exponentially close to the maximally mixed state. However, if the state is chosen from a k-design, its distance to the maximally mixed state is bounded by a polynomial in k [21]. We show that it is very difficult to escape from this state because the gradients will be exponentially small. As such, for a wide array of QNNs, randomness and entanglement hinder the training. This surplus of entanglement to some extent defeats the purpose of deep learning by causing information to be stored non-locally in the correlations between the layers rather than in the layers themselves. As a result, when one tries to remove the hidden units, as is customary in deep learning, we find that the resulting state is close to the maximally mixed state. Indeed, we show that such situations are generic and that gradient descent methods are unlikely to allow the user to escape from such a plateau at low cost. This observation holds for both "feedforward" QNNs and Boltzmann machines and suggests that if quantum effects are to be used to improve classical models, then they must be used surgically. Furthermore, our work establishes a link between the thermalization literature and quantum machine learning that has hitherto been absent.
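This concentration can be checked numerically. The following sketch (our own illustration, not taken from the paper) samples Haar-random pure states on n_v + n_h qubits and measures the trace distance between the visible marginal and the maximally mixed state; the distance shrinks as hidden units are added, consistent with the E[T] ≤ (1/2)√(D_v/D_h) bound quoted later.

```python
# Sketch (ours): reduced states of Haar-random pure states concentrate
# around the maximally mixed state as the hidden dimension grows.
import numpy as np

rng = np.random.default_rng(0)

def haar_state(dim):
    """Sample a Haar-random pure state as a normalized complex vector."""
    v = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    return v / np.linalg.norm(v)

def trace_distance_to_mixed(n_v, n_h):
    """Trace distance between Tr_h(|psi><psi|) and I/D_v."""
    D_v, D_h = 2**n_v, 2**n_h
    psi = haar_state(D_v * D_h).reshape(D_v, D_h)
    rho_v = psi @ psi.conj().T          # partial trace over the hidden units
    delta = rho_v - np.eye(D_v) / D_v
    return 0.5 * np.abs(np.linalg.eigvalsh(delta)).sum()

# The average distance shrinks roughly like sqrt(D_v/D_h).
dists = [np.mean([trace_distance_to_mixed(1, n_h) for _ in range(50)])
         for n_h in (2, 4, 6, 8)]
assert all(a > b for a, b in zip(dists, dists[1:]))
```

The specific qubit counts and sample sizes here are arbitrary choices for illustration.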

II. REVIEW OF QUANTUM NEURAL NETWORKS
We focus on two types of QNNs prominent in the quantum machine learning literature. The first network is characterized by a unitary ansatz of the form U(θ_1, . . ., θ_N) = ∏_{j=N}^{1} e^{-iH_j θ_j}, (1) where the θ_j are the parameters we aim to learn and the H_j are Hamiltonians that specify the QNN. This model is reminiscent of the QAOA setup, but it is more general. The output is then U(θ_1, . . ., θ_N)|ψ_0⟩, where |ψ_0⟩ can be taken to be |0 · · · 0⟩ for generative learning. In this model, the visible units correspond to the qubits on which we evaluate the objective function. The remaining qubits are called hidden units, as in the example in Figure 1a.
The second type is a quantum Boltzmann machine [8,19,22]. Quantum Boltzmann machines model the output as a thermal state ρ(θ) = e^{-H(θ)}/Z(θ), where Z(θ) = Tr(e^{-H(θ)}) is the partition function. Without loss of generality, we will take Tr(H) = 0 for all quantum Boltzmann machines. The aim when training a quantum Boltzmann machine is to learn a vector θ such that, for a training objective function O_obj that acts on the visible subsystem, we maximize Tr(O_obj Tr_h(e^{-H(θ)}/Z(θ))).
Quantum Boltzmann machines can also be trained generatively [19], meaning that rather than optimizing a training objective function that is a linear function of the density operator, such as Tr(O_obj Tr_h(e^{-H}/Z)), we aim to optimize a non-linear function of the density operator such as the quantum relative entropy, i.e. S(ρ_train||ρ(θ)) = Tr(ρ_train log(ρ_train) − ρ_train log(ρ(θ))), by generating a quantum state ρ(θ) using the Boltzmann machine that optimizes this divergence with the training density operator.
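Both objectives above can be written down concretely for a toy system. The sketch below (ours; the two-qubit Hamiltonian terms and the target state are hypothetical choices, not from the paper) builds a thermal state for one visible and one hidden qubit, evaluates the linear objective Tr(O_obj Tr_h(ρ)), and evaluates the relative entropy against a training state.

```python
# Minimal sketch (ours) of the two quantum Boltzmann machine objectives:
# a linear observable objective and the quantum relative entropy.
import numpy as np
from scipy.linalg import expm, logm

rng = np.random.default_rng(1)

# One visible and one hidden qubit; H(theta) = theta_0 Z_v + theta_1 Z_v X_h.
Z = np.diag([1.0, -1.0])
X = np.array([[0.0, 1.0], [1.0, 0.0]])
I2 = np.eye(2)
terms = [np.kron(Z, I2), np.kron(Z, X)]    # traceless, as assumed in the text

def thermal_state(theta):
    H = sum(t * h for t, h in zip(theta, terms))
    rho = expm(-H)
    return rho / np.trace(rho).real

def visible_marginal(rho, D_v=2, D_h=2):
    """Partial trace over the hidden qubit via a reshape."""
    return rho.reshape(D_v, D_h, D_v, D_h).trace(axis1=1, axis2=3)

theta = rng.normal(size=2)
rho_v = visible_marginal(thermal_state(theta))

O_obj = Z                                   # observable on the visible qubit
linear_obj = np.trace(O_obj @ rho_v).real

rho_train = np.array([[0.75, 0.0], [0.0, 0.25]])  # hypothetical target state
rel_ent = np.trace(rho_train @ (logm(rho_train) - logm(rho_v))).real
assert rel_ent >= -1e-9                     # Klein's inequality: S(a||b) >= 0
```

Maximizing `linear_obj` trains discriminatively; minimizing `rel_ent` trains generatively.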

III. THE IMPACT OF ENTANGLEMENT ON DEEP MODELS
The central question of our work is how the entanglement in the neural network affects the visible units. Instead of providing a speedup, entanglement between visible and hidden units causes thermalization on the visible subsystem. Thus, the inclusion of entanglement between the hidden and visible layers of a QNN can be, unless carefully controlled, harmful to the neural network model. The relationship between the representational power of a neural network and the degree of entanglement between the visible and hidden systems was first discussed in [23]; however, here we re-examine this question and arrive at a different conclusion. Specifically, we conclude that large amounts of entanglement (as quantified by a volume law) can be catastrophic for the model, whereas an area-law scaling for the entanglement entropy between the hidden and visible units can often be tolerated.
To see this, we need to make a few formal definitions. Let S ⊂ C^{D_v D_h × D_v D_h} be a family of parameterized density operators, where D_v = 2^{n_v} is the dimension of the n_v-qubit visible subspace and D_h = 2^{n_h} the dimension of the hidden subspace. For each ρ ∈ S, the qubits can be uniquely assigned to the vertices of a graph G on a vertex set V_v ∪ V_h. Define n_j^{(v)} to be the number of vertices in V_v that are at least graph distance j away from the vertices in V_h, and define n_j^{(h)} to be the analogous number for the vertices in V_h. We then say that S satisfies an area law if, for all ρ ∈ S, the entanglement entropy between the visible and hidden subsystems scales with the number of qubits in the first visible and hidden layers, i.e. with the size of the boundary between the two layers; it satisfies a volume law if the entropy instead scales with the total number of qubits in the subsystem.

FIG. 2: For an area law, the entanglement entropy scales as the number of qubits on the boundary (in the dashed rectangle). In contrast, the entanglement entropy for a volume law scales as the number of qubits in the bulk.

This distinction controls how well any operator O_obj on the visible subsystem can distinguish the visible marginal from the maximally mixed state. Proof. The proof follows from standard inequalities for the quantum relative entropy together with the von Neumann trace inequality. If ρ satisfies an area-law scaling, then there exists α > 0 such that the entanglement entropy is at most α times the number of boundary qubits, from which the claimed result for the area-law scaling immediately follows. If instead we assume that ρ obeys a volume law, then S(Tr_h(ρ)) is close to its maximal value, so Tr_h(ρ) is close to the maximally mixed state. This shows that if our quantum neural network outputs states that satisfy a volume law, then asymptotically the predictions of the neural network would be no better than random guessing. In contrast, quantum neural networks will not necessarily suffer from this problem if the entanglement entropy follows an area-law scaling, unless the number of hidden units in the first layer becomes much larger than the number of visible units. We therefore see that uncontrolled entanglement, such as that yielded by volume laws, can be catastrophic for deep quantum neural networks (i.e. 
for D_h ≫ D_v), but the comparably limited entanglement yielded by area laws may be more desirable. This means that when designing quantum neural networks, it is vital to aim for sub-volume-law scaling; however, such states often have concise representations using matrix product states [24] and so may be no more performant than classical neural networks. Nonetheless, we show below that such sub-volume-law scalings are not typical and that almost all quantum neural networks within the ensembles we consider obey volume-law scalings.
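The volume-law behavior of random states is easy to see directly. The sketch below (ours) computes the entanglement entropy between the visible and hidden subsystems: for Haar-random states it is near-maximal in the visible size (a volume law, close to the Page value), while a product state carries none.

```python
# Sketch (ours): Haar-random states obey a volume law for the
# visible-hidden entanglement entropy; product states do not.
import numpy as np

rng = np.random.default_rng(2)

def entanglement_entropy(psi, D_v, D_h):
    """Von Neumann entropy (in bits) of the visible marginal of |psi>."""
    rho_v = psi.reshape(D_v, D_h) @ psi.reshape(D_v, D_h).conj().T
    ev = np.linalg.eigvalsh(rho_v)
    ev = ev[ev > 1e-12]
    return float(-(ev * np.log2(ev)).sum())

n_h = 8
for n_v in (1, 2, 3):
    D_v, D_h = 2**n_v, 2**n_h
    v = rng.normal(size=D_v * D_h) + 1j * rng.normal(size=D_v * D_h)
    psi = v / np.linalg.norm(v)
    S = entanglement_entropy(psi, D_v, D_h)
    # Haar-random states are near-maximally entangled: S is close to n_v bits.
    assert S > n_v - 0.5

# A product state |0>_v |0>_h carries no visible-hidden entanglement.
prod = np.zeros(2 * 2**n_h); prod[0] = 1.0
assert entanglement_entropy(prod, 2, 2**n_h) < 1e-9
```

The qubit counts are illustrative; any n_h ≫ n_v shows the same near-maximal entropy.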

IV. TYPICALITY OF VOLUME-LAW SCALING
While area laws occur for certain systems, such as ground states of gapped, translationally invariant Hamiltonians on lattices, we expect volume-law scalings to be much more common. This intuition can be made rigorous by making appropriate assumptions about the interactions between the visible and hidden layers in the model. In particular, we assume that the quantum states on the joint system of the QNN approximate a Haar-random state. In practice, this assumption is too strong, as Haar-random states typically require exponentially long quantum circuits to generate. Instead, we focus on ensembles generated by unitary 2-designs, which model the states generated by random sequences of universal gates [25].
Proposition 2, stated below, then shows that the visible marginal is indistinguishable from the maximally mixed state with high probability over U.

Proof. Let us first examine the case ρ = U|0⟩⟨0|U†.
We then have that if we take the expectation value over U drawn from a 2-design, then, since the partial trace of a density operator is again a density operator, the argument is positive semi-definite, and the result can be written in terms of the flip (swap) operator F_vv' that swaps the two visible subsystems. Since the result is quadratic in the probability distribution, we have from the definition of a unitary 2-design that the expectation coincides with the Haar expectation value E_Haar. The result then follows immediately from invoking Theorem 2 of Popescu et al. [20]. Next, let us assume that ρ = e^{-H}/Tr(e^{-H}). We have from the definition of H and the previous result that the bound holds for any eigenvector |j⟩ of H; the required result immediately follows by interchanging the order of the expectation values over the mixed state and over the unitary 2-design. These results also hold with high probability as a consequence of Markov's inequality. This shows that for both the Boltzmann machine and unitary quantum networks, any observable measured on the visible layers will be indistinguishable, in expectation, from the maximally mixed state with high probability. In other words, rather than strengthening the analogous classical model, the presence of entanglement actually weakens it as the dimension of the hidden subsystem grows relative to the visible subsystem. For deep networks, we anticipate that there will be many more hidden neurons than visible neurons, and hence entanglement is generically a bane rather than a boon for deep QNNs.
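The flip-operator manipulation used here rests on the identity Tr((A ⊗ B)F) = Tr(AB), which turns quantities quadratic in ρ into a single expectation over two copies. The following check (ours, with arbitrary random matrices) verifies the identity and its purity special case.

```python
# Verify the flip-operator identity Tr((A (x) B) F) = Tr(A B) used in the
# 2-design argument (our own numerical check).
import numpy as np

rng = np.random.default_rng(3)
d = 4

# Swap operator: F |i>|j> = |j>|i>.
F = np.zeros((d * d, d * d))
for i in range(d):
    for j in range(d):
        F[j * d + i, i * d + j] = 1.0

A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
B = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))

lhs = np.trace(np.kron(A, B) @ F)
rhs = np.trace(A @ B)
assert abs(lhs - rhs) < 1e-10

# In particular, purity is a two-copy expectation: Tr(rho^2) = Tr((rho(x)rho) F).
rho = A @ A.conj().T
rho /= np.trace(rho).real
assert abs(np.trace(np.kron(rho, rho) @ F) - np.trace(rho @ rho)) < 1e-10
```

This is exactly the reduction that lets the 2-design condition be invoked: the squared trace is linear in ρ ⊗ ρ.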
There are a number of caveats to this analysis. First, we assume that the states in question are typical of a unitary 2-design. This assumption may not be appropriate if a structured ansatz is used or if the circuits employed are shallow. The next assumption is that the observable is supported on the visible system only. The final potential caveat is that gradient-based optimizers may allow us to train our way out of these typical points and thereby find a way to productively leverage quantum effects. While the first two caveats do speak to ways of escaping this apparent no-go result, the ubiquity of "entanglement induced barren plateaus" will make the third option fail with high probability.

V. ENTANGLEMENT INDUCED BARREN PLATEAUS
Our argument for why gradient descent will fail to improve the quality of a training objective function, due to entanglement between the visible and hidden layers, follows from reasoning similar to that employed in Proposition 2. However, the specific arguments require slightly more nuanced assumptions, since we need to worry about how perturbations to the model parameters impact the resulting state. Such assumptions are also made, for example, in the original work of McClean et al. that identified barren plateaus for unitary networks [12]. Further, while we were able to directly employ existing results from the thermalization literature to prove Proposition 2, the necessary conditions do not hold for the gradient operators. We state the main results below and provide explicit proofs in Appendices A and B.

A. Plateaus for Unitary networks
We will first consider the case of unitary networks of the form given in (1). We consider the case where one of the parameters is shifted by a constant amount δ_k and bound the maximum possible shift in the expectation of an observable that is supported only on the visible subsystem.
A major challenge in analyzing what happens when shifting the parameters of a unitary network is that such networks are so complicated that the impact of the perturbation is difficult to measure. An example of such an effect can be seen in the Loschmidt echo, which shows exponential sensitivity to perturbations in the parameters of complex quantum dynamics [26,27]. Our solution, similar to that taken in [12], is to assume that the dynamics scrambles the states so much that almost all subsequences of the product ∏_{j=1}^{k} e^{-iH_j θ_j} form a unitary 2-design. This assumption is reasonable for a sufficiently deep random circuit [28][29][30]. We then see that the value of the objective function is Lipschitz continuous with a constant that scales inversely with the hidden dimension D_h = 2^{n_h}. This shows that the plateau exists both for gradient descent and for gradient-free methods¹. A formal statement of this intuition and the result is given below.
Theorem 3 (Gradient in unitary networks). Assume that ρ(θ) is drawn from a unitary 2-design, where ρ(θ) is generated through a unitary ansatz of the form ∏_{j=N}^{1} e^{-iH_j θ_j} acting on a Hilbert space that is the product of a hidden and a visible space of dimensions D_h and D_v respectively. Then the objective function Tr(O_obj Tr_h(ρ(θ))) is Lipschitz continuous in each parameter with a constant that, in expectation, scales inversely with D_h.

The proof of the theorem follows by using the unitary invariance of the trace norm and Hadamard's lemma to rewrite the difference between the perturbed exponential and the original exponential as a commutator series of H_k(θ) and ρ(θ), and then by using the triangle inequality, the Cauchy-Schwarz inequality, and the independence assumptions made above to arrive at the result. An explicit proof is given in Appendix A.
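The message of Theorem 3 can be observed numerically. The sketch below (ours, not the paper's experiment) models the scrambling layers by a single Haar-random unitary, perturbs one generator, and estimates the shift in a visible observable by finite differences; the gradient magnitude decays as hidden qubits are added.

```python
# Sketch (ours): finite-difference gradients of a visible observable shrink
# with the hidden dimension for a scrambled unitary ansatz.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(4)

def haar_unitary(d):
    """Haar-random unitary via QR of a complex Ginibre matrix."""
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    Q, R = np.linalg.qr(A)
    return Q * (np.diag(R) / np.abs(np.diag(R)))   # fix the phases of R

def grad_magnitude(n_v, n_h, delta=1e-4):
    """|d/d theta of <Z_v>| at theta = 0, via finite differences."""
    D_v, D_h = 2**n_v, 2**n_h
    D = D_v * D_h
    U_scr = haar_unitary(D)                        # stand-in for deep scrambling
    A = rng.normal(size=(D, D)) + 1j * rng.normal(size=(D, D))
    H_k = (A + A.conj().T) / (2 * np.sqrt(D))      # generator of the varied layer
    O = np.kron(np.diag([1.0, -1.0]), np.eye(D_h)) # Z on the visible qubit
    psi0 = np.zeros(D, dtype=complex)
    psi0[0] = 1.0

    def objective(theta):
        psi = expm(-1j * theta * H_k) @ (U_scr @ psi0)
        return (psi.conj() @ O @ psi).real

    return abs(objective(delta) - objective(0.0)) / delta

g = [np.mean([grad_magnitude(1, n_h) for _ in range(20)]) for n_h in (2, 4, 6)]
assert g[0] > g[-1]   # gradients decay as the hidden dimension grows
```

Replacing the single Haar unitary with an actual layered circuit would match the theorem's setting more closely but does not change the qualitative decay.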

B. Plateaus for Boltzmann Machines
Next, we turn our attention to Boltzmann machines. We show that parameterized Hamiltonians drawn from a unitary ensemble also experience an entanglement induced barren plateau. The nature of this plateau, however, differs from that of the unitary network's plateau in that the plateau occurs under reasonable assumptions if Tr(h_h)^2/Tr(h_h^2) ∈ o(D_h), as we see below.
Theorem 4 (Gradient for Boltzmann machines). Assume H ∈ C^{D×D} is a random Hermitian matrix drawn from an ensemble in the following manner: a diagonal matrix with eigenvalues E_j ∈ R is chosen according to a probability distribution Pr(E_j) and then conjugated with a unitary drawn from a distribution that forms a unitary 2-design. Let us then define, for a fixed Hermitian H_k ∈ C^{D×D} that can be written as H_k = h_v ⊗ h_h for Hermitian h_v, h_h, the state ρ(θ_k) := e^{-(H+θ_k H_k)}/Tr(e^{-(H+θ_k H_k)}). Finally, let O_obj ∈ C^{D_v×D_v} be a Hermitian matrix. Then, with high probability over the ensemble, the derivative of the objective function with respect to θ_k vanishes as D_h grows whenever Tr(h_h)^2/Tr(h_h^2) ∈ o(D_h).
The proof of Theorem 4 can be found in Appendix B. The sketch of the proof is relatively simple. We use the assumption that the eigenvectors are taken to be columns of matrices drawn from a unitary 2-design and then use perturbation theory to argue about the perturbed H. The use of perturbation theory introduces the parameter Γ, which characterizes the inverse minimal gap. We then take the partial trace of the resulting perturbed eigenvectors to show that if the reduced density matrix over the hidden units of the perturbation Hamiltonian H_k has zero trace, then the partial trace over the hidden layers of each eigenvector remains the maximally mixed state, as per Proposition 2. This partial-trace assumption is needed because if bias terms were added to the hidden units, then one could disentangle them from the visible units in the ground state through the perturbation. While such a perturbation may save the predictive power of the Boltzmann machine, it would effectively eliminate the hidden layers, reverting the model to a shallow one. With these observations, the results then follow from the use of standard inequalities and the Haar expectation value of random states given, for example, in [31]. The result holds with high probability as a consequence of the Markov inequality.
In particular, we find that the gradient of the objective function with respect to terms that act non-trivially on the hidden layers is exponentially small in the number of hidden qubits, since without loss of generality we may take Tr(h_h) = 0 for all such terms. In contrast, the gradient with respect to the visible Hamiltonian coefficients need not be exponentially small in the number of hidden qubits. Indeed, if we have a k-local random Hamiltonian where each Hamiltonian coefficient is chosen independently from a distribution that is independent of D, then Γ ∈ O(log(D)^{1−k}), and thus for any k ≥ 2 the gradient may only be polynomially small.
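The suppression of hidden-term gradients can be probed directly. The sketch below (ours; the ensemble sizes and the choice H_k = Z_v ⊗ Z_h are illustrative) builds thermal states of Hamiltonians with a scrambled eigenbasis and estimates the gradient of a visible observable with respect to a coupling whose hidden part is traceless; the gradient shrinks as hidden qubits are added.

```python
# Sketch (ours): Boltzmann machine gradients for a hidden-acting coupling
# (Tr(h_h) = 0) shrink with the hidden dimension for scrambled Hamiltonians.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(5)
Z = np.diag([1.0, -1.0])

def haar_unitary(d):
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    Q, R = np.linalg.qr(A)
    return Q * (np.diag(R) / np.abs(np.diag(R)))

def hidden_grad(n_v, n_h, delta=1e-4):
    """Finite-difference gradient of Tr(Z_v rho_v) for a hidden-acting term."""
    D_v, D_h = 2**n_v, 2**n_h
    D = D_v * D_h
    U = haar_unitary(D)
    H = U @ np.diag(rng.normal(size=D)) @ U.conj().T  # scrambled eigenbasis
    H_k = np.kron(Z, np.kron(Z, np.eye(D_h // 2)))    # h_v (x) h_h, Tr(h_h) = 0

    def f(t):
        rho = expm(-(H + t * H_k))
        rho /= np.trace(rho).real
        rho_v = rho.reshape(D_v, D_h, D_v, D_h).trace(axis1=1, axis2=3)
        return np.trace(Z @ rho_v).real

    return abs(f(delta) - f(0.0)) / delta

g = [np.mean([hidden_grad(1, n_h) for _ in range(20)]) for n_h in (2, 4, 6)]
assert g[0] > g[-1]   # hidden-term gradients shrink with the hidden dimension
```

With a structured (e.g. k-local) Hamiltonian in place of the scrambled one, the visible-term gradients would not be suppressed this way, matching the contrast drawn above.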
A side effect of these observations is that they explain, in part, the observation in [19] that the number of hidden units included in the model did not increase the performance of quantum Boltzmann machines. This can now be understood from the fact that typical Hamiltonians generate thermal states whose visible marginals are close to the maximally mixed state. Thus, the inclusion of hidden units typically cannot be expected to increase the performance of quantum Boltzmann machines.

VI. HAAR RANDOM UNITARIES
In the previous sections we assumed that the neural networks scramble at least as effectively as a unitary 2-design. However, if we assume that, in the case of the unitary networks, the gate sequence is Haar-random, or, in the case of the Boltzmann machine, that the eigenbasis is Haar-random, then the type of concentration that we see can be radically improved. Specifically, Levy's lemma [20] can be used in place of Markov's inequality to show that the vast majority of randomly selected networks will have vanishing gradients. In particular,

Lemma 5 (Levy). Given a function f : S^d → R defined on the d-dimensional hypersphere S^d and a point φ ∈ S^d chosen uniformly at random, Pr(|f(φ) − E[f]| ≥ ε) ≤ 2 exp(−C d ε²/η²) for all ε > 0, where η is the Lipschitz constant of f and C ∈ Θ(1).
This result allows us to use an even tighter concentration result than is possible using Markov's inequality, because it shows that the probability of a large deviation from the Haar expectation is exponentially small. This further means that a substantial deviation from the results stated above is in fact exponentially less likely than what would be expected if we only had a 2-design condition. If unitary k-designs are used in place of 2-designs, then it is possible to interpolate between these two results [21]; however, the bounds that arise from using this result under the assumption that we only have a 2-design are not superior to our Markov-based analysis.
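Levy-type concentration is easy to see empirically. The sketch below (ours) samples Haar-random states and measures the spread of the expectation value of a bounded observable; the spread collapses rapidly as the Hilbert-space dimension grows, consistent with the exponential concentration above (the observable and sample counts are arbitrary illustrative choices).

```python
# Empirical illustration (ours) of concentration of measure for Haar-random
# states: the spread of <phi|P|phi> shrinks quickly with dimension.
import numpy as np

rng = np.random.default_rng(6)

def haar_state(dim):
    v = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    return v / np.linalg.norm(v)

def spread(dim, samples=2000):
    """Std of <phi|P|phi> over Haar states, P a projector onto half the space."""
    diag = np.zeros(dim)
    diag[: dim // 2] = 1.0                 # ||P|| = 1, Tr(P)/dim = 1/2
    vals = [float(diag @ (np.abs(haar_state(dim)) ** 2)) for _ in range(samples)]
    return float(np.std(vals))

spreads = [spread(d) for d in (8, 32, 128, 512)]
# Concentration: the spread decays roughly like 1/sqrt(dim).
assert all(a > b for a, b in zip(spreads, spreads[1:]))
```

The same experiment with a 2-design in place of the Haar measure reproduces the variance but not the exponential tail bound, which is the distinction drawn in this section.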

VII. NUMERICAL RESULTS
We ran a series of numerical experiments, summarized in Figure 3 and Appendix C, to demonstrate that our asymptotic results apply to small quantum networks. In Figure 3a and Appendix C, we compare the scaling of the trace distance to the maximally mixed state for three models: the Gaussian unitary ensemble model, the unitary QNN, and the quantum Boltzmann machine. In Figure 3a, we see that with an increasing number of hidden units these models produce states close to the maximally mixed state. This result can be understood in the context of Section IV. In Appendix C, Figure 4 highlights this effect on the data histograms: as we increase the number of hidden units, we see the trace distance concentrating around zero. Moreover, in Appendix C we describe the general form of the Hamiltonian and the Hamiltonian terms used in our experiments.
In Figure 3b, we performed a similar analysis on the gradients of the unitary QNN. We generated a fixed thermal state using a random two-local Hamiltonian (see Appendix C1). The on-site coefficients are drawn from a normal distribution with mean 0 and variance 0.01, i.e. N(0, 0.01), and the off-site coefficients from N(0, 1). We then estimated the gradient vector of the fidelity F between our model and the target state using finite differences. We observed an overall decrease in the ∞-norm of the gradient vector as we increased the size of the hidden layer. In particular, Figure 3b shows a decrease in the variance of the ∞-norm of the gradient vector on a semi-log scale. We also calculated the exponential rate of decay using least-squares fitting. This overall decay is predicted by Theorem 3 as we increase the number of hidden units.
The Boltzmann machine results are summarized in Figure 3c. In this case, we estimated the gradient vector of the trace distance T between our model and its perturbation for each parameter using finite differences. In order to observe gradient decay, we had to draw the on-site coefficients from N(0, 0.01) and the off-site coefficients from N(0, 1). Moreover, we had to normalize the Hamiltonian by its operator norm. Our goal was to amplify the effect of the off-site terms relative to the on-site terms to encourage a volume-law scaling in the examples simulated. The emergence of these volume laws can be understood from perturbation theory, since the leading-order shift in an eigenvector |n⟩ with eigenvalue E_n is proportional to ∑_{j≠n} |j⟩⟨j|H_k|n⟩/(E_n − E_j). This shows that if we take |n⟩ to be an eigenstate of the 1-body terms in the Hamiltonian, then the entanglement generated by H_k is suppressed by the energy gaps between these states. We therefore chose these magnitudes to be small so that significant entanglement can be introduced in the eigenstates despite the small values of D that can be explored on a classical computer. This phenomenon is predicted by Theorem 4.
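The first-order perturbation formula invoked here can be checked directly. The sketch below (ours, with small random symmetric matrices) compares the correction ∑_{j≠n} |j⟩⟨j|H_k|n⟩/(E_n − E_j) against exact diagonalization of the perturbed Hamiltonian.

```python
# Check (ours) of the first-order eigenvector correction under H -> H + t*H_k.
import numpy as np

rng = np.random.default_rng(7)
D, t = 8, 1e-6

A = rng.normal(size=(D, D))
H = (A + A.T) / 2
B = rng.normal(size=(D, D))
H_k = (B + B.T) / 2

E, V = np.linalg.eigh(H)           # eigenvalues E_j, eigenvectors |j> = V[:, j]
n = 3                              # an interior, generically non-degenerate level

# First-order correction from the formula quoted in the text.
corr = np.zeros(D)
for j in range(D):
    if j != n:
        corr += V[:, j] * (V[:, j] @ H_k @ V[:, n]) / (E[n] - E[j])

# Compare against the exact perturbed eigenvector (fixing the sign convention).
Ep, Vp = np.linalg.eigh(H + t * H_k)
v_exact = Vp[:, n] * np.sign(Vp[:, n] @ V[:, n])
err = np.linalg.norm(v_exact - (V[:, n] + t * corr))
assert err < 1e-6                  # agreement to first order in t
```

The denominators E_n − E_j make the gap suppression explicit: shrinking the gaps (as done above by scaling down the on-site terms) amplifies the entanglement generated by H_k.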

VIII. CONCLUSION
We showed that for Haar-random pure states and thermal states of random Hamiltonians, the gradient of an observable objective function vanishes exponentially with the number of hidden units. This shows not only that common types of QNNs are generically difficult to train via local optimization methods, but also that adding hidden units will not always increase the power of QNNs. Indeed, asymptotically we see that unless the states generated satisfy an area law, such hidden neurons will likely be harmful.
One can prevent these entanglement induced barren plateaus by violating any of the assumptions in our proofs. The first option is to choose an atypical initial state, which has already been explored in [32]. Next, one could depart from the use of gradient-based optimization to train such quantum models. However, it is unlikely that, without knowledge of the global properties of the training objective function, such methods would succeed in light of Proposition 2. Lastly, one can train models using an objective function that does not correspond to an observable, i.e. one that is not a linear function of the density operator.
Of the three approaches, it is this last one that we advocate receive greater attention in quantum machine learning. One tactic that can be used to circumvent our pessimistic results is to begin a discriminative learning task by first training generatively according to a quantity such as the quantum relative entropy [8,19], which is non-linear in the quantum state ρ. We will show in subsequent work that this quantum generative pre-training approach can be used to successfully train both Boltzmann machines and unitary networks and thereby mitigate some of the challenges identified here for training deep QNNs.
As a final point, it is important to recognize that while entanglement is a powerful tool to add to our models, it must be used like a scalpel and not a sledgehammer. Quantum properties such as entanglement may be harmful if not surgically deployed and judiciously used. Understanding the role that such quantum effects have on a model is very likely necessary [33] if we are to build quantum models that can successfully leverage quantum effects.

Proof. First, using the definitions above, we wish to analyze the distribution over θ of |Tr_v(O_obj Tr_h(ρ(θ) − ρ(θ + δ_k)))| under the assumption that the unitaries satisfy a 2-design condition.
Using Hadamard's lemma, we can express the difference between the expectation values as a commutator series. Next, making the assumption that H_k(θ) and ρ(θ) are uncorrelated, we can further simplify this result. Our exposition will now follow that of Popescu, Short and Winter [20], which we modify to show that a concentration of measure exists for the commutators of H_k(θ) and ρ(θ).
We will now work under the assumption that the expectation values are independent. We denote the expectation value over the Hamiltonian by E_H and the expectation value over the state by E_φ. If this independence assumption holds, then we need to argue about the magnitude of terms of the form E_φ(Tr_v(Tr_h(H_k(θ)ρ(θ))^2)). We can estimate this by introducing two copies of the quantum state and linking both terms through the use of a flip operator F_vv'. In the following, we use primed indices to refer to the visible and hidden subsystems of the second copy. The commutators in general consist of many different products of H_k and the state operator; below we argue about their form in generality. Let us assume that p_1, p_2, q_1, q_2 are positive integers. We then wish to compute products of traces of the form Tr_h(H_k(θ)^{p_1} ρ H_k(θ)^{p_2}) Tr_h(H_k(θ)^{q_1} ρ H_k(θ)^{q_2}), applying the flip operator and taking the quantum state ρ(θ) to be |φ⟩⟨φ|. The next step is to recognize that the above tensor products of |φ⟩⟨φ| form a symmetric quantum state. Therefore, if we express the state as the sum of its anti-symmetric and symmetric components, the anti-symmetric component must be zero [20]. We then see, from the fact that ρ(θ) is assumed to be drawn from a unitary 2-design, that the expectation value is unitarily invariant, and we can follow the arguments laid out in [20]. Next, if we define the flip operator on the dilated space including the hidden and visible units to be F_rr' = F_vv' ⊗ F_hh', then we can express Π_sym = (1/2)(I + F_rr'). Finally, using the properties of the flip operator, the triangle inequality, and the fact that the Schatten ∞-norm is unitarily invariant, we can bound the expectation value of each term in the expansion. Every term in Ad^q_{H_k(θ)}(ρ(θ)) 
consists of q factors of H_k(θ), and further 2^{q−1} terms have positive coefficient and 2^{q−1} terms have negative coefficient. The proof of this fact is inductive; the case q = 1 follows directly from the definition of the commutator, which demonstrates the base case. Now assume that the claim is valid for q = p; the induction step immediately follows from expanding the additional commutator, and it is clear that the claim is valid for all q. Now, if we expand (Tr_h(Ad^q_{H_k(θ)}(ρ(θ))))^2 using the linearity of the partial-trace operation, we find that each term is of the form Tr_h(H_k(θ)^{p_1} ρ H_k(θ)^{p_2}) Tr_h(H_k(θ)^{q_1} ρ H_k(θ)^{q_2}), where p_1 + p_2 = q = q_1 + q_2. The expression in (A7) then shows that we can replace each term with its average while incurring a small error. Importantly, this value is independent of p_1, p_2, q_1, q_2. Thus, since there are 2^{q−1} such terms with negative coefficient and 2^{q−1} with positive coefficient for each of the partial traces, there are similarly 2^{2q−1} terms with negative coefficient in the expansion and 2^{2q−1} with positive coefficient. Ergo, the sums over all such terms present in the adjoint cancel, up to the small error terms given in (A7). Then, from (A10) and the independence assumption, our claim about the Lipschitz constant immediately follows from the definition of Lipschitz continuity and from (A2).
From Taylor's theorem, we have that if the Hamiltonian H + sH_k has no level crossings on the interval s ∈ [0, θ_k], then to order O(θ_k^2) the eigenvectors of H + θ_k H_k can be identified using perturbation theory. In particular, for any p ∈ {0, . . ., D − 1}, let |p⟩ be an eigenvector of H with eigenvalue E_p; then the eigenvector of H + θ_k H_k that corresponds to the eigenvector |n⟩ of H can be expanded in the product basis |pq⟩ := |p⟩_v ⊗ |q⟩_h, where each amplitude ⟨pq|n⟩ is a complex number, for some appropriate bases of the visible and hidden subsystems. The expectation value over the state vectors can then be thought of as an average of these coefficients.
Among the many possible choices for the eigenbasis, we further exploit the fact that H_k := h_v ⊗ h_h to choose the bases of the visible and hidden subsystems to diagonalize h_v and h_h. Thus we can write h_v ⊗ h_h |pq⟩ = λ_pq |pq⟩ for λ_pq ∈ R.

With these choices in place, we can write the perturbed expectation value as a sum of products of these coefficients.
There are a total of 8 terms that arise when we expand the above products. Let us consider the first case, which emerges in the expectation value of the trace of the previous result. Here we invoke the fact that the eigenvectors are sampled from a unitary 2-design, which means that any quantity that is at most quadratic in the probability will have an expectation value that coincides with the Haar average. A final point to note is that, as a consequence of unitary invariance and the discussion contained in [31, Appendix A], the expectation value of the product of any two terms is zero unless all of their indices match, up to relative errors that are O(1/D). The claim that this bound on the derivative holds with high probability over the ensemble is then a direct consequence of Markov's inequality.

FIG. 1: Examples of QNNs. (a) A quantum unitary network recreating a deep-net structure. Visible units correspond to the last two registers. (b) A quantum Boltzmann machine defined on a graph. Each edge and each vertex corresponds to a weight on a local Hamiltonian term acting on the pair of qubits or the single qubit. The top layer (circles) corresponds to visible units and the bottom layer (rectangles) to hidden units.

Proposition 2. Let U ∈ C^{D_v D_h × D_v D_h} be drawn from a unitary 2-design and let H = U† S U for some diagonal matrix S ∈ C^{D×D}. If either ρ = U|0⟩⟨0|U† (unitary network) or ρ = e^{-H}/Tr(e^{-H}) (Boltzmann machine), then for any bounded operator O_obj ∈ C^{D_v×D_v} acting on the visible subspace, the expectation of Tr(O_obj Tr_h(ρ)) is close to the maximally mixed value Tr(O_obj)/D_v, with corrections vanishing as D_v/D_h → 0.

FIG. 3: (a) Log-log plot showing the trace distance data in relation to the bound. The blue and orange marked values correspond to the estimated maximum peak of the data histograms, where D_v = 2^1 = 2. The green marked values correspond to the bounds we obtain after substituting in E[T(ρ, I/D)] ≤ (1/2)√(D_v/D_h). (b-c) Semi-log plots highlighting the decay in variance of the ∞-norm of the gradient vector over an ensemble of initialized models. The dashed blue line represents the average over 1000 model instances. The dashed green line is the best fit obtained from least squares. (b) Gradient estimates for the unitary model. (c) Gradient estimates for the normalized quantum Boltzmann machine.

FIG. 4: Trace distance between the reduced density matrix of our models and the maximally mixed state for 1000 instances. The models considered have only one visible unit, i.e. D_v = 2^1 = 2. (a) Empirical trace-distance distribution of a real-time evolution (t = 10) of Hamiltonians drawn from the Gaussian unitary ensemble (GUE). (b) Empirical trace-distance distribution of the unitary model. All coefficients are drawn from a uniform distribution over [0, 1). (c) Empirical trace-distance distribution of the quantum Boltzmann machine. The on-site coefficients, J_a^i, are drawn from N(0, 0.01). The off-site coefficients, J_{a,b}^{i,j}, are drawn from N(0, 1). Moreover, the Hamiltonian is normalized by its operator norm.