On the geometry of learning neural quantum states

Combining insights from machine learning and quantum Monte Carlo, the stochastic reconfiguration method with neural network Ansatz states is a promising new direction for high precision ground state estimation of quantum many body problems. At present, the method is heuristic, lacking a proper theoretical foundation. We initiate a thorough analysis of the learning landscape, and show that it reveals universal behavior reflecting a combination of the underlying physics and of the learning dynamics. In particular, the spectrum of the quantum Fisher matrix of complex restricted Boltzmann machine states can dramatically change across a phase transition. In contrast to the spectral properties of the quantum Fisher matrix, the actual weights of the network at convergence do not reveal much information about the system or the dynamics. Furthermore, we identify a new measure of correlation in the state by analyzing entanglement the eigenvectors. We show that, generically, the learning landscape modes with least entanglement have largest eigenvalue, suggesting that correlations are encoded in large flat valleys of the learning landscape, favoring stable representations of the ground state.


I. INTRODUCTION
Recently, the fields of machine learning and quantum information science have seen a lot of crossbreeding. On the one hand, a number of promising results have been obtained suggesting the potential for performing quantum or classical machine learning tasks on a quantum computer 1 . In particular, the variational quantum eigensolver 2 -perhaps the most promising quantum algorithms for first generation quantum computers -is based on the variational optimization of a cost function to be evaluated on a quantum device, providing a new playground for hybrid quantum-classical learning 3,4 . However, arguably the most significant advances have been in the field of classical variational algorithms for quantum many body systems. A number of studies have shown that machine learning inspired sampling algorithms can reach state of the art precision; including ground state energy estimation 5,6 , time evolution 5,7 , identifying phase transitions [8][9][10] , and decoding quantum error correcting codes 11,12 .
A model that has gathered a particularly large amount of attention is the complex restricted Boltzmann machine (RBM) state Ansatz with stochastic reconfiguration optimization introduced by Carleo and Troyer 5 . The authors show that ground state energy evaluations can match state of the art tensor network methods on benchmark problems.
At present, however, there lacks a theoretical underpinning for explaining why the complex RBM wave-functions -or any other machine learning inspired parametrization -is a good Ansatz for describing ground states of physical Hamiltonians. In particular, it is difficult to assess and quantify the role of entanglement in these new classes of wave-functions. This is sometimes referred to as the 'black box' problem with machine learning inspired approaches, that success is based on loose heuristic arguments, rather than a proper theoretical underpinning. Such a situation might be sufficient for real world commercial applications of machine learning but is unsatisfactory when it comes to describing physical models, where we specifically hope to gain insight about some underlying or emergent physical principles.
Some studies try to relate complex RBM states to Tensor Network states 13,14 where the role of entanglement is natu- rally built into the model. But these studies are mostly based on constructing abstract mappings between RBM wavefunctions and tensor network states, and usually provide at best existence proofs.
In this paper, we aim to obtain a better understanding of the learning dynamics with complex RBM wavefunctions by analyzing the geometry induced in parameter space. Indeed, the stochastic reconfiguration method updates the variational parameters of the wavefunction by gradient descent of the energy, weighted by a 'quantum Fisher matrix', which is the quantum analogue of the Fisher information matrix. The Fisher information matrix is know to be the unique Riemannian metric associated to a probability space invariant under sufficient statistics 15 . Hence it is the natural candidate for associating an 'information geometry' to a statistical model.
We analyse the spectral properties of the 'quantum Fisher matrix' for a variety of physics models. We argue that the information geometry provides us with clues of both the expressibility of the Ansatz state and of the underlying physics, provided the optimization converges. In particular, we identify a number of features which we believe to be universal for spin models: (i) The spectrum of the quantum Fisher matrix becomes singular in phases connected to a product state (in the computa-tional basis). The singularity is more pronounced the closer one gets to the product state; (ii) Critical phases have a smooth and extended spectrum, which is also reminiscent of image recognition models in classical machine learning; (iii) Kinks in the spectrum reveal symmetries in the state.
(iv) The eigenvalues are exponentially decaying in value. The largest eigenvalues have eigenvectors that are dominated by first moments; i.e. they do not contain much information about correlations in the system. This feature is accentuated the sharper the spectrum profile of the quantum Fisher matrix.
The above insight was extracted from extensive numerical data calculated using quantum spin Hamiltonians such as transverse field Ising and Heisenberg spin-XXZ models as well as coherent Gibbs states for the two dimensional classical Ising model. Various Monte Carlo sampling strategies were used to optimize the results on large system sizes.
Importantly, we observe that the bare values of the variational parameters reveal very little information about the physical properties of the system, contrary to what is often claimed that 'activations indicate regions of activity in the underlying data'. We take this as evidence that there are many equivalent representations of the states in the vicinity of the ground state, suggesting that the optimizer preferentially choses robust representation of the ground state. Robustness of the Monte Carlo methods might be related to the generalization property in supervised learning. Our methods promise to be an essential diagnostic tool for further exploration with complex RBM wavefunctions as well as with other machine learning inspired wavefunctions.

A. Complex RBM and optimization by stochastic reconfiguration
The complex Restricted Boltzmann Machine (RBM) neural network quantum state specifies the amplitudes of a wavefunction |ψ θ = x ψ θ (x)|x in some chosen computational basis {|x } by the exponential family: where the vectors {a, b} and the matrix w contain complex parameters to be varied in the optimization, and y is a binary vector indexing 'hidden' units. Z = x |ψ θ (x)| 2 is a constant guaranteeing normalization of the state ψ. The complex RBM can be visualized as a binary graph (V, E) between the visible nodes x and the hidden nodes y (see Fig. 1). To each edge e ∈ E we associate a variational parameter w e , and at each vertex v ∈ V we associate a bias weight a or b to a visible (x) or hidden (y) binary degree of freedom. We will often express the variational parameters as a concatenated vector labelled θ = (a, b, vec(w)). For classical RBMs, the normalization constant is the partition function of a joint probability distribution on the hidden and visible units. This is generally not true in the complex case. The goal of variational Monte Carlo is to find the optimal paramters θ that minimize the energy of a given Hamiltonian in the state |ψ θ . The standard approach would be to use gradient descent, but this performs very poorly for spin Hamiltonians, as the updates tend to get stuck oscillating back and forth along steep wells of the energy landscape rather than falling down the more shallow directions. The stochastic reconfiguration (SR) method 16 for energy minimization is derived as a second order iterative approximation to the imaginary time ground state projection method (see Appendix A for a self contained derivation). In SR, the parameters of the Ansatz wavefunction are iteratively updated as where η is a constant specifying the rate of learning. The second order effects which take curvature into account are determined by the matrix of the diagonal operators O α , with α ∈ θ, which act for instance as in the computational basis {x}. We will call the matrix S the quantum Fisher matrix, because of its connection with information geometry as discussed in detail in the next section. The quantum Fisher matrix can be reformulated as a classical covariance matrix of the operators O α , O β , and similarly where E[A] = x A(x)|ψ θ (x)| 2 is the classical expectation of operator A in the state |ψ θ (x)| 2 , and is called the local energy. For the RBM Ansatz, the diagonal operators O α take on the simple form: where χ j (x) = b j + i w ij x i , and indices i run over [1, · · · , N ] visible vertices and j run over [1, · · · , M ] hidden vertices. Thus the size of the quantum Fisher matrix is The SR method is computationally efficient when the following are true: 2. The probability distribution |ψ θ (x)| 2 can be sampled from for any values of θ; meaning that any single Monte Carlo update can be computed efficiently. In practice we require that each Monte Carlo update is independent of system size; i.e. updates are local.
The complex RBM Ansatz guarantees that (1) and (2) hold whenever the number of hidden units is a constant multiple of the visible units. However, like essentially any sampling algorithm, provably guaranteeing (3) seems nearly impossible in any practically relevant problem. However, experience has shown that convergence often is rapid in practice, or can be curtailed, whenever one steers clear of frustration or the Fermionic sign problem. It is worth pointing out, though, that convergence of the sampler can depend sensitively on the chosen basis and the initial state, as evidenced in Sec. III B.

B. Natural gradient and SR
The stochastic reconfiguration method can be understood as a quantum extension of Amari's natural gradient optimization 17 . Plain vanilla gradient descent optimizes a multivariate function L(θ) by updating the parameters in the direction of steepest descent: at a certain rate η. In systems where the landscape of the function L(θ) is very steep in certain directions and shallow in others, convergence can be very slow as the updates fluctuate back and forth in a deep valley, but take a long time to 'drift' down a shallow one. The natural gradient method proposes to update the parameters according to the natural (Riemannian) geometric structure of the information space, so that the landscape is made locally euclidean before the update. Suppose the coordinate space is a curved manifold in the sense that the infinitesimal square length is given by the quadratic form where the matrix g(θ) is the Riemannian metric tensor. Amari showed that the steepest descent direction of the function L(θ) in the Riemannian space is given by The action of the inverse of g can be heuristically understood as 'flattening' out the space locally. For general optimization problems, the Hessian is a natural choice for g(θ), as it reproduces Newton's second order method. In machine learning applications, and with RBMs in particular, the Hessian is hard to construct from sampling. It also appears to be attracted to saddle points 18 .
When the parameter space in question is naturally associated with a classical probability distribution, the 'natural' geometry is chosen to be the Fisher information matrix as it is the unique metric that is invariant under sufficient statistics 15 . For pure parametrized quantum states, the natural Riemannian metric is derived from the Fubini-Study distance: Infinitesimal distances are given by: which reproduces the quantum Fisher matrix for parametrization θ as ds 2 = αβ S αβ dθ * α dθ β . In particular, when the wavefunction is positive in a given computational basis, the quantum state can be written as |ψ = x p θ (x)|x , and the quantum Fisher matrix is and F is the Fisher information matrix associated to the probability distribution p θ (x). Thus, the SR method reproduces the natural gradient method for positive wave functions. For this reason, we will be calling the S matrix associated to a pure quantum state the quantum Fisher matrix.

C. Spectral analysis of the quantum Fisher matrix
In this paper, we will argue that spectral properties of the quantum Fisher matrix reveal essential information about the physical properties of the system under study as well as the dynamics of optimization.
The quantum Fisher matrix is positive semi-definite, implying that its spectrum is real and there exists a set of orthonormal eigenvectors. The magnitude of an eigenvalue determines how steep the learning landscape is in that particular direction. The spectrum will generically be sloppy 19 , with a spectral function bounded above by a decaying exponential.
It is often argued in the machine learning community that gradient descent algorithms favor regions in parameters space where most eigenvalues are close to zero 20,21 . This implies that at convergence, most directions in the landscape are nearly flat, suggesting that nearby points in parameter space encode much of the same physical properties. In classical supervised learning, the flatness of the landscape has been associated with the 'generalization' ability of the learned model 22 ; in the physics setting we interpret it to mean that the representation is robust.
Because of the bipartite graph structure of the RBM Ansatz, it is natural to talk about correlations between the visible and hidden units. The quantum Fisher matrix is a square (N + M + N M ) matrix, with the first two blocks corresponding to the biases a, b, and the third block corresponds to the weights matrix w. The main w block describes the orientations in parameter space that can affect correlations in the model. We will see later that eigenvectors associated to eigenvalues of large magnitude are typically close to a product state between the visible and hidden part, meaning that they mostly just affect the first moments of the spin variables.
To measure correlations in the eigenvectors {ψ α }, we truncate the first two blocks of the eigenvectors associated with the biases, and renormalize the 'w' part to have Hilbert Schmidt norm 1. We then calculate the entanglement in the eigenstate ψ w α : where Tr h is the partial trace over the hidden layer, and S(·) is the von Neumann entropy of the reduced density matrix.

II. RESULTS
In this section, we analyze the spectral properties of the quantum Fisher matrix during the learning process of finding the ground state of the transverse field Ising (TFI) model. The TFI Hamiltonian is given by where σ σ σ i = {σ i x , σ i y , σ i z } are Pauli spin operators, and h is the external field. The system has Z 2 symmetry (σ i z → −σ i z ) which is explicitly broken for h < 1 in the thermodynamic limit (N → ∞). A second order phase transition occurs at h = 1. At zero external field the model has two degenerate ground states |0 ⊗N and |1 ⊗N , whereas in the limit of h → ∞ the ground state is unique, given by |+ ⊗N .
The spectral properties of the quantum Fisher matrix, as well as the energy during the learning process are plotted in Fig. 2. Figure 2(a) confirms that the optimization procedure successfully finds the ground state for all values of h, albeit at different speeds. The quantum Fisher matrix is constructed approximately by Monte-Carlo sampling and its full spectrum is evaluated every 5 epochs during learning. The eigenvalues at some representative epochs are plotted in decreasing order in Fig. 2 The dynamics of the learning process proceeds in two distinct stages. The first stage is observed at the very beginning of the learning, lasting for roughly 25 epochs 23 , and is the same for all values of h. The initial shape of the spectrum has two sharp drops located at N and N (N + 1)/2 (see Fig. 2(c)). This is a consequence of the random initialization with small weights. An analytic justification of this behavior is provided in Appendix B. The spectrum then gets pushed up until approximately the 25'th epoch, revealing that more and more dimensions in the information space become relevant.
The second stage of learning then slowly transforms the distribution to that of the final converged state. We observe that the spectrum falls off very sharply (exponentially) in all cases examined ( Fig. 2(b)), but the exact spectral profile depends strongly on the details of the model, yet not on the system size or on the specific values of the learned weights (see Appendix C for an in depth discussion). We take this as evidence that the learned state not only minimizes the energy, but also closely matches the actual ground state of the model. The behavior of the spectrum of the quantum Fisher matrix for each phase of the TFI model is discussed in the next subsection.
A. Phases of the TFI model a. The ferromagnetic phase (h < 1.0). Let us start by considering the extreme case with h = 0.0. The quantum Fisher matrix after convergence becomes a pure state up to numerical precision. The singularity of the quantum Fisher matrix in this case can be explained from the properties of the ground state: When h = 0.0, the Hamiltonian Eq. (19) has two ground states |0 ⊗N and |1 ⊗N . We first note that the optimization consistently found a solution with a ≈ 0 and b ≈ 0, leading to a Z 2 symmetric state. Let us therefore assume that the solution we have exactly describes the Z 2 symmetric ground state; i.e. a = b = 0. Then the ground state is Thus, the quantum Fisher matrix is which is rank 1. We note that the above argument does not depend on the details of the weights w, rather only on its magnitude |w|, so that any set of RBM weights that accurately model the ground state will exhibit the same behavior. The SR optimization typically favors small weights. As the external field h increases, the number of terms of the ground state in the computation basis increases, thus we also expect that rank of S to increase as . This is consistent with the results from our numerical data in Fig. 2(b). Importantly, rank deficiency is observed throughout the ferromagnetic phase, albeit  much more pronounced in the vicinity of h = 0. We interpret this behavior as a signature that the phase is connected to a product state in the physical basis. For values of h close to one, the rank deficiency can only be seen for at large system sizes, and after many training epochs. b. The critical point (h = 1.0) At the critical point, the distribution of eigenvalues after convergence is smooth, and decreasing exponentially. This behavior is also seen in many classical image processing tasks in machine learning 21,24 , suggesting that it might be signature of (critical) long range order. Indeed, each element of the quantum Fisher matrix can be expanded in terms of correlation functions, all of which are sizeable in the critical case. This eigenvalue distribution is characteristic of 'Sloppy model universality', which has been shown to reflect systems with certain forms of scale invariance 19 , further corroborating the claim. We will see in section III A that this behavior is seen in many other systems and reveals that the RBM is fine tuning a solution with the help of a large number of hidden units.
c. The paramagnetic phase (h > 1.0) In this case, we see that the energy converges rapidly and the eigenvalues almost do not change after the initial learning stage. In particular the second jump in the spectrum of the initial random RBM survives until the end. When h = 2.0, the jump is located at N + N (N − 1)/2 = 406, revealing that the quantum Fisher matrix has no support on the anti-symmetric subspace (see Appendix B). Precisely, the 406th eigenvalue has magnitude ≈ 4.08×10 −2 and the next one has magnitude ≈ 1.38×10 −3 in our numerical data.
To understand the stepwise behavior, we first focus on the randomly initialized RBM case; i.e. at epoch 0. As we initialize the parameters of the RBM with small random Gaussian values (sampled from N (0, σ 2 ) where σ ∼ 10 −2 ), the classical probability distribution |ψ θ (x)| 2 would be similar to the case when all parameters are zero. When a = b = w = 0, the RBM gives |ψ θ (x)| 2 = 1/2 N , i.e. the identity distribution. We can then perturbatively expand the quantum Fisher matrix in terms of the parameters. The derivation up to O(σ 3 ) is given in Appendix B. Our derivation gives N eigenvalues of O(1) associated with the visible biases block of the matrix and N (N − 1)/2 eigenvalues of order O(σ 2 ) in the weights block of the quantum Fisher matrix. This explains the first and the second jumps in the eigenvalue distribution of the random RBM.
The randomly initialized RBM also hints at the fact that the quantum Fisher matrix throughout the paramagnetic phase strongly retains properties of the h 1 limit with product state |+ N . We can compare the spectra of the quantum Fisher matrix for h = 2.0 and the randomly initialized case in Fig. 2(c). It shows that the second step is preserved but the first step disappears. This is because the first step depends on the details of weights but the second one is the consequence of the symmetry. We made detailed comparison between the quantum Fisher matrix for the paramagnetic phase and randomly initialized RBM in Appendix C. We there show that the converged matrix has larger diagonal elements in the w part of the matrix than the random RBM case which also support eigenvalues between N to N (N + 1)/2.
Throughout the phase diagram of the TFI, the spectrum of the quantum Fisher matrix at convergence has two special points at N and at N (N + 1)/2, as seen in Fig. 2(c). The location of these points is independent of the number of hidden units, suggesting that they originate from the Z 2 nature of the physical system, and the overall bipartite structure of the RBM, rather than any details of the RBM graph.

B. Eigenvectors
Above, we have argued the eigenvalues of the quantum Fisher matrix reveal signatures of the phase of matter being simulated. We now ask whether the eigenvectors can teach us anything about how correlations are conveyed in the learning landscape. In particular, since the complex RBM is constructed from a bipartite graph with no connections among the hidden and visible units, we know that all correlations have to be mediated by weights. Entanglement in the information manifold is therefore completely contained in the weights block of the Fisher matrix.
In Fig. 2(d), we plot the entanglement between the visible and hidden units of the w part of each eigenvector (see Eqn. (18)). We observe that the first N eigenvectors have very little entanglement when 0 ≤ h ≤ 1. This suggests that the directions of largest curvature are almost exclusively associated with the biases, or first moments, of the distribution. Note that this does not imply that the values of the w weights are small, as representations of the first moments are distributed over the biases and the weights. Rather it is a reminder that the actual values of the weights of the network reveal little information of the correlations in the system, as is manifest in Fig. 6 of Appendix C. This behavior is less pronounced for h > 1 as the quantum Fisher matrix behaves more like a random matrix whose eigenvectors are expected to have a more homogenous amounts of entanglement.
The entanglement increases in the bulk of the spectrum. Interestingly, this means that the directions in parameter space that encode information about correlations are typically dense, smooth and flat. In the context of classical ML, these properties are akin to good generalisation ability of the learning models, whereas in the present physics context, we interpret it to meant that the algorithm preferentially learns stable configurations; where changes (even large) in most directions in configuration space will not affect the physically observable properties of the system. Similar conclusions have been alluded to in the context of sloppy models universality in statistical mechanics 25 .

C. Numerics
For numerical simulation, we set the ratio between the numbers of hidden units and visible units of the complex RBM to α = m/n = 3. Thus the RBM has (α + 1)N + αN 2 parameters overall (N and αN for biases and αN 2 for the weight matrix w). To sample from the RBM, Markov chain Monte-Carlo (MCMC) method enhanced with parallel tempering was employed. 26 We used 16 parallel Markov chains with linearly divided temperatures from 1/16 to 1. For each Markov chain, we used local spin flip updates. To directly compare the results from variational Monte-Carlo with exact digitalization, we use the size of system N = 28 and impose the periodic boundary condition throughout the paper unless otherwise stated. In practice, SR has two hyper-parameters: the learning rate (η in Eq. (2)) and the regularization that should be added to the diagonal of the Fisher information matrix for numerical stability in when computing the inverse. For the simulation of TFI, we have used the hyper-parameters η = 0.01 and = 0.001.

D. Predictions
From the spectral analysis of the quantum Fisher matrix for the transverse field Ising model, we make the following predictions, which we expect to hold more generally for ferromagnetic quantum spin models: 1. The spectral profile is universal within a phase of the model, and is only weakly dependent on system size away from phase transition points. The spectrum of the quantum Fisher matrix is therefore a good indicator of the existence of a phase transition if it is possible to find two points in phase space with vastly different spectral profiles.
2. The first N eigenvectors are close to product states, and hence do not encode correlations in the system. They mostly pertain to first moments of the distribution.
3. A rank deficient quantum Fisher matrix is evidence that the state is in a phase connected to a product state in the chosen computational basis. A smoothly decaying spectrum is a sign that the system contains a lot of correlation; often a critical phase with polynomial decaying correlation functions.
4. Kinks in the spectrum reveal symmetries in the model. In the case of the TFI, the persistent kink at N (N + 1)/2 is a sign that the symmetric and anti-symmetric subspaces are strictly separated everywhere except at the critical point.

III. FURTHER EXPERIMENTS
In this section, we study two further models to test whether the predictions made in Sec. II D extend to more general spin systems. The first model is the two dimensional coherent Gibbs state, whose quantum Fisher matrix is evaluated exactly without having recourse to learning. The second is the XXZ model, where we explore all three phases with the tools developed above.
A. Coherent Gibbs state of the two dimensional classical Ising model We consider the RBM representation of the coherent Gibbs state of the two dimensional classical Ising model. Recall the classical Ising model where x is the configuration of the spin and i, j are nearest neighbors on a two dimensional lattice. For convenience, we set J = 1. We consider a system in thermal equilibrium with inverse temperature β = 1/T . At high temperature β < β c , the system exhibits a disordered paramagnetic phase characterized by zero magnetization x = 0 , whereas it shows a Z 2 symmetry broken ferromagnetic phase with non-zero magnetization at sufficiently low temperature β > β c 27 . The phase transition takes place at β = β c ≈ 0.44 in the thermodynamic limit and is second-order. We thus have polynomial decay of the correlation function x i x j c ∼ 1/dist(i, j) α at the critical point.
The coherent Gibbs state for the model with inverse temperature β is given by in a chosen computational basis {x} and Z = {x} e −βH(x) is the normalization factor which is the same as the partition function of the classical model. A key observation is that correlation functions of spin-z operators are exactly the same as that of the classical model, i.e. ϕ(β)|σ i z σ j z |ϕ(β) = x i x j x∼p(x) where p(x) = e −βH(x) /Z is the Boltzmann distribution. Thus we also have polynomially decaying quantum correlation functions for this state at β = β c .
It is known that coherent Gibbs states of Ising type models can be represented exactly as an RBM 28 by associating each edge of the lattice to one hidden unit (we provide a selfcontained derivation in Appendix D). In particular, the coherent Gibbs state of an Ising-type model defined on a graph G = (V, E) can be described using the RBM with parameters a = b = 0 and a |V | by |E| sparse weight matrix w.
Using this mapping, we construct the quantum Fisher matrix of the RBM representation for coherent Gibbs states . To sample from the distribution, we have employed the Wolff algorithm 29 instead of usual local update scheme in this case as it is more efficient close to the transition point. The spectral profiles of the quantum Fisher matrix for different values of β are shown in Fig. 3a). The figure shows very similar shape to that of the TFI case when they are deep in the ferromagnetic or paramagnetic phase. The eigenvalues exhibit a collapsing distribution in the ferromagnetic phase for large β and get progressively more singular as we increase β. Compare this behavior to the TFI for h < h c depicted in Fig. 2. In the paramagnetic phase (β < β c ), we see a stepwise distribution where the step is exactly located at N (N + 1)/2, very much like the TFI model at large h. Thus for coherent Gibbs states that are deep in each phase, we get the same qualitative behavior of the quantum Fisher matrix in both models.
In contrast to the learned TFI case in Section II, the drop-off at N (N + 1)/2 survives also at criticality. This can be understood by the fact that the quantum Fisher matrix is constructed from the exact coherent Gibbs state which is exactly symmetric in the exchange of spins. Hence the quantum Fisher matrix has zero support on the anti-symmetric subspace also at criticality. In Fig. 3c), we have plotted the quantum Fisher information which is simply the trace of the quantum Fisher matrix for different values of β. We see that the quantum Fisher information reaches a maximum in the vicinity of the phase transition point, hence acting as an order parameter reminiscent of the magnetic susceptibility. A more detailed analysis of the quantum Fisher information as a witness of phase transitions for this and other models will be presented elsewhere.

B. The XXZ model
We now consider the Heisenberg XXZ model This model is exactly solvable using Bethe Ansatz. The solution shows three distinct phases: (1) a gapped ferromagnetic phase for ∆ ≤ −1.0, (2) a critical phase for −1.0 < ∆ ≤ 1.0, and (3) a gapped anti-ferromagnetic phase for ∆ > 1.0. The ground state when ∆ ≤ 1.0 is a superposition between |0 ⊗N and |1 ⊗N . It is also known that the ground state is in J z := i σ i z = 0 subspace for ∆ > −1.0. In the critical phase (−1.0 < ∆ ≤ 1.0), the Hamiltonian is gappless in the thermodynamic limit and the correlation length diverges. The phase transition at ∆ = −1.0 is first order and an infinite order Kosterlitz-Thouless transition takes place at ∆ = 1.0.
We will again look at the spectral properties of the Fisher information matrix in this model for ∆ = −1.0, 0.0, and 1.0. For ∆ = 0.0 and 1.0, we have restricted the wave function to the U (1) symmetric subspace J z = 0 by applying the swap update rule in MCMC. Fig. 4(a) shows the convergence of sampled energy over SR iterations. We see that SR successfully finds the ground states in all cases but the initial drift starts later in the XXX case (∆ = 1.0). The spectrum of the quantum Fisher matrix shown in Fig. 4(b) also verifies slow initial learning in the XXX case. The spectrum begins to change slowly compared to other cases. We suspect that the SU(2) symmetry of the Hamiltonian is related to slow learning in the initial stage. When we compare this to the result from other values of ∆, the quantum Fisher matrix does not differ much as it only depends on the parameters of the RBM but the gradient of the energy ∇ θ H is much smaller when ∆ = 1.0 than other cases.
We plot the converged spectra in Fig. 4(c). Using this, we can extract some information of the converged ground state when ∆ = −1.0. As the first order phase transition occurs at this point, the system has two different types of ground states: one that is a superposition of |0 ⊗N and |1 ⊗N from ∆ ≤ −1.0 and the other one living in a subspace J z = 0 from ∆ > −1.0. As the converged spectrum is singular, we can expect that the ground state found in our simulation is ferromagnetic. We indeed have calculated J 2 z from Monte-Carlo samples and it gives J 2 z /N 2 ≈ 0.984 which means a large portion of the state is in |0 ⊗N and |1 ⊗N . When ∆ = 0.0 and 1.0, we see broader converged spectra. We note that there is a small step at ∼ N (N + 1)/2 when ∆ = 0.0 even though the whole spectrum is dense. In comparison, more smooth spectrum is obtained when ∆ = 1.0.
One should also ask about the behavior of quantum Fisher matrix in the anti-ferromagnetic phase. However, we found that usual MCMC does not produce unbiased samples in the anti-ferromagnetic phase, so usual SR does not converge to the real ground state 30 . As a consequence, we checked performed the optimization using the exactly constructed quantum Fisher matrix for small enough systems from the probability distribution |ψ θ (x)| 2 . The result obtained from the exact simulation for the system size N = 20 is shown in Appendix E. One observation is that we see a dense converged spectrum when ∆ = 2.0 despite the system being gapped. Thus the gap of the system alone does not implies a dense spectrum of the quantum Fisher matrix.

IV. IMPLICATION TO OPTIMIZATION
In this section, we use the insight gained about the structure of the quantum Fisher matrix to construct a new optimization method for quantum spin systems. The new method allows for significant savings in evaluation time for solving the inverse linear problem in stochastic reconfiguration. Precisely, in each step of SR, we need to solve the linear equation for a given quantum Fisher matrix S. Even when the matrix S is well-conditioned, the complexity of solving this equation scales as O(D 2 ) where D is the dimension of the S matrix, or number of parameters. As D itself scales like O(αN 2 ), the time cost is quartic in N . This is one of the main reasons why second order methods, including natural gradient descent, are not widely used in classical large scale deep learning applications.
Our new optimization method can be seen as an extension of RMSProp 31 . The method provides a significant advantage in computation time as it does not involve solving a large system of linear equations. However, the method is not always a good approximation of the natural gradient, but rather depends decisively on the structure of the quantum Fisher matrix.
Before describing our method, we briefly review RMSProp for classical machine learning and how it is related to the Fisher information metric from the viewpoint of Ref. 32 . For convenience, the original RMSProp is described in Appendix F. This algorithm improves a naive stochastic gradient descent by using v t , the running average of the squared gradients, to rescale the instantaneous gradient for updating weights. An observation in Ref. 32 is that v t is a diagonal approximation of the uncentered covariance matrix of gradients when the learning is in the steady state. When the function we want to optimize f is the logarithmic likelihood (which is typical in classical machine learning), v t recovers the diagonal part of the Fisher information metric at stationarity. The additional square root and prefactor in the last step are added to correct for "poor conditioning" 33 . This provides a plausible argument for why such a simple algorithm works incredibly well. One can also argue that other popular and efficient optimizers such as Adagard, Adadelta and Adam similarly use a type of diagonal approximation of the Fisher information metric 32 .
We now describe our variant of RMSProp applied to the ground state optimization problem. Using the same principle as above, one may use O to estimate the diagonal part of the uncentered quantum Fisher matrixS α,α = O † α O α . The details of the algorithm are outlined in Alg. 1. A distinguishing property of this algorithm to the original RMSProp is that it uses different vectors for a gradient decent direction and estimating the curvature: v t is calculated by O but the gradient of the energy is used for update in the last step. The algorithm suggested here is also different from the method used in Refs. 34,35 that put energy gradient directly to the classical optimizers.
Require: η: Learning rate Require: β: Exponential decay rate Require: θ0: Initial parameter vector 1: t ← 0 (Initialize timestep) 2: v0 ← 0 (Initialize 2nd moment vector) 3: while θt is not converged do 4: gt ← Gradient of the energy 6: We have tested the proposed version of RMSProp using different learning rates η for the TFI. The results for the ferromagnetic phase and the critical case (h = 0.0 to 1.0) are shown in Fig. 5. For small h, we see that RMSProp gets easily stuck in local minima unlike SR. When h = 0.0 and 0.2, the figure shows that the energy converges to that of the ground state for some learning rate η. However, such a convergence is probabilistic. For h = 0.0, 0.2 and 0.4, we ran the same simulation several times and found that, for any η, some instances converge to the ground state whereas others get stuck in lo- For larger h such as h = 0.6, 0.8, the proposed RMSProp shows better convergence behaviors for most values of η but it still show stepwise dynamics. In the critical case h = 1.0, the learning curves of RMSProp are smooth and insensitive to the choice of the learning rate, suggesting that the system no longer gets stuck in problematic local minima.
Our results suggest that preserving the singular nature of the quantum Fisher matrix is essential for ensuring convergence to the ground state energy. Indeed, the converged quantum Fisher matrices studied in Appendix C show that the diagonal of the Fisher matrices give rank N + M = 112 for h = 0.0 and full rank (N M + N + M = 2464) for other values of h. In contrast, the real ranks of the quantum Fisher matrices (measured by counting the number of eigenvalues larger than 10 −10 ) are given as 1, 78, 242, 726, 1698, 2464 for h = 0.0, 0.2, 0.4, 0.6, 0.8 and 1.0, respectively.
We still note that even though the rank provides a plausible argument for the behavior of the learning curves, it does not for the converged energies; the converged energies for h = 0.8 and 1.0 are slightly larger than the ground state energies. Moreover, the convergence behavior in the paramagnetic phase (h > 1.0) is more complicated and cannot be solely explained from the quantum Fisher matrix. A partial reason is that the path taken by RMSProp deviates from that of the SR in initial stage of learning (see Appendix F). Detailed investigations in this regime remain for future work.

V. CONCLUSION
We have initiated a detailed study of the quantum information geometry of learning ground states of spin chains in the artificial neural neural network framework. We have focused on complex restricted Boltzmann states and the stochastic reconfiguration method which implements a quantum version of Amaris natural gradient update scheme. Our main result is that the eigenvalues and eigenvectors of the quantum Fisher matrix reflect both the learning dynamics, which is unsurprising, as well as the intrinsic static phase information of the model under study -which is rather surprising. In particular, we found that in the entire non-critical ferromagnetic phase of a number of models, the spectrum of the quantum Fisher matrix has reduced rank. The matrix becomes highly singular in regions of the phase that are close to product states. In critical phases, the spectrum becomes smooth with more and more eigenvectors contributing to the information geometry landscape.
We have identified a universal behavior of the leading eigenvectors of the quantum Fisher matrix: they all convey little entanglement, as measured by the entanglement entropy between the visible and hidden layers. This, in combination with the insight that critical models have smooth spectra, suggests that correlations in complex RBM Ansatz are preferentially represented in the bulk of the information geometry space. Our interpretation of this key dynamical feature of RBM learning is that the model preferentially chooses stable representations, where the entropy of the landscape dominates over the energy. A similar phenomenon is classical supervised machine learning is frequently observed in discussion of 'generalization'. Finally, we explored strategies for diagonal approximations of the quantum Fisher matrix, and found that their success crutially depends on the phase of the model under study. We therefore do not expect any diagonal approximation of the quantum Fisher matrix to be effective in general.

VI. ACKNOWLEDGEMENTS
We thank S. Trebst and D. Gross for helpful discussions. We acknowledge support from the DFG (CRC TR 183), and the ML4Q excellence cluster. Source codes for the current manuscript can be found in CYP's github repository 36 . The numerical simulations were performed on the CHEOPS cluster at RRZK Cologne.
For the readers convenience, we derive the stochastic reconfiguration method of Sorella 16 . The main idea of Stochastic Reconfiguration (SR) is to modify the parameters of a trial wavefunction in such a way that it approaches the ground state along a path dictated by the projection 1− H, where is chosen such that 1 − H ≥ 0.
Let |ψ θ be a state in our ansatz class, with θ its vector of parameters. From now on, we will suppress the parameters θ. Then, for sufficiently small , we can write where |ψ α = ∂ ∂θα |ψ , {e α } are coefficients, and |ψ ⊥ is a state in the orthogonal subspace. Note the identity |ψ α = O α |ψ , where the operators O α are defined as: were |x is the computational basis.
We can now obtain a system of linear equations for the e α coefficients by multiplying Eqn. (A1) by ψ| and by ψ α | to get The averages are taken in the states |ψ . We can then solve for e 0 to get where the matrix S is given by and the vector R α is given by We can now identify the coefficients e α as the update coefficients for the variables θ α , up to an overall constant e 0 , which can be interpreted as the learning rate. The SR update scheme can then be summarized as: for some learning rate η. Here, is regularization constant that is typically ∼ 10 −3 .

Appendix B: quantum Fisher matrix of random RBM
We provide an explanation of the stepwise structure of the spectrum of the quantum Fisher matrix upon small random initialisation of the weights. The quantum Fisher matrix is broken up into three main sectors: [a, b, w], corresponding to the visible biases, the hidden biases and the weights.
As in the main text, we use N = |a| and M = |b| to indicate the number of visible and hidden units, respectively. In our simulations, the weights are initialized to be Gaussian distributed with an average magnitude of order σ = 10 −2 . We therefore make the following assumption about the initial state: the classical probability distribution associated with the initial quantum state is close to the identity, and in particular is separable. This implies that each spin has zero expectation value at initialization x j = 0 for all j, and that x j x k ∝ δ jk for all jk.
As the entries of the visible biases block are: we get the identity matrix for the a part. The covariance between the visible and hidden units involves the term x i tanh(χ j (x)) . Recall that the argument of the hyperbolic tangents are where b j are the hidden biases and w ij are the weights connecting the hidden and visible units. Under the assumption that all parameters are small, we approximate tanh(χ j (x)) ≈ χ j (x). Then Likewise, we can obtain the full unary part ([a, b]) of the S matrix as We can easily see this is rank N as the first N row generates the remaining rows. This explains the first N eigenvalues which are O(1). Next, the w part of the quantum Fisher matrix is given by where i, i label the visible units and j, j label the hidden units. Using the expansion we have where w is the N × M matrix of weights, |b = j b j |j is a vector form of the bias b, and X = ijkl x ikjl |ik jl| with Using the assumption of small initial weights, we have Then the X matrix is approximately where V = jk |jk kj| is the swap operator. The rank of X is given by N (N − 1)/2. Moreover, X is the projector that preserves the symmetric states except the copied state, i.e. X(|ab + |ba ) ∝ |ab + |ba when a = b but X|aa = 0. When b = 0, the whole covariance matrix is given by S = S un ⊕ S w and the matrix S w (Eq. (B8)) has rank N (N − 1)/2. This explains the small sub-leading eigenvalues of order O(σ 2 ).
However, the block-diagonal assumption breaks down when we have non-zero bias in the hidden layer (b = 0) as we have off-diagonal blocks between the unary and w part.
An additional 1 ⊗ |b b| also enters into S w . Still, it is not difficult to see that this does not change the overall rank. A precise calculation gives up to third order corrections. It is simple to see that first N rows still generate the next M rows. Moreover, applying |b to the first N rows gives the additional terms in the last N M rows so the rank of the S matrix from the w part also does not change. Thus we have exactly the same rank even when we turn on hidden biases b.
Appendix C: Further properties of the quantum Fisher matrix In this section, we investigate further properties of the quantum Fisher matrix. We use the same numerical data as in the main text; the TFI with system size N = 28.

Converged weights
Converged parameters of neural networks are often claimed to reveal features of the data or system under study 5,37 . We compare the converged weights and the quantum Fisher matrix for different values of h in Fig. 6. We find that, in contrast with the spectral information of the quantum Fisher matrix, it is difficult to infer any information from the converged weights of the network. For example, converged weights for h = 0.6, 1.0 and 1.4 are not sensibly different, whereas the quantum Fisher matrices that reveal essential features of the phase of the system. This brings to light one the of the key subtleties of RBM Ansatze, which is the extreme redundancy of representation. Let us illustrate this fact by constructing three completely different solutions of the RBM parameters that (approximately) represent the same quantum state |0 ⊗N + |1 ⊗N . As a first solution, consider the one obtained from our numerical simulation Fig. 6 (a). This solution is fully complex, i.e. real and imaginary parts of the weights are both non-zero. On the other hand, a real solution can be found from the coherent Gibbs states for classical Ising model as discussed in Sec. D. The state is obtained by letting J ij = −1 and β → ∞ for a classical Ising model defined on any graph that does not have an isolated vertex. We note that the parameters obtained using this scheme are real as e −βJi,j ≥ 1 (see Sec. D for details). Finally, it is also possible to represent this state only using pure imaginary parameters. By letting a = 0, b = (iπ/2, · · · , iπ/2), and the weight w as It is clear from these examples that inferring information of quantum states solely from the activation parameters of the RBM is very ambiguous.

Non-zero elements of Fisher information matrix
We investigate the rank of the quantum Fisher matrix more closely. Let us first focus on the ferromagnetic phase (h < 1.0). In the main text, we have shown that the rank of the quantum Fisher matrix increases as h increases. A question we are interested in is how non-zero elements are distributed in unary and w parts of the matrix. To answer this question, we use the quantum Fisher matrix itself after convergence plotted in Fig. 6(b). When h = 0.0, we see that the Fisher information matrix only has non-zero elements in the unary part. In contrast, the w part of the matrix shows non-zero elements (especially in diagonal part) when h = 0.6. To see this clearly, we have counted the number of diagonal elements of the quantum Fisher matrix that are larger than 10 −4 . It shows there are N +M = 112 such diagonal elements when h = 0.0 but N +M +N M = 2464 for all larger h = 0.2, 0.4, 0.6, 0.8. As the rank of the full matrix is small even for larger h, the non-zero elements in the w part in this case implies the eigenvectors with dominant eigenvalues have compelling w part. In addition, this provides an argument why RMSProp that is studied in Sec. IV works badly for small h.
Next, we consider the paramagnetic phase (h > 1.0). In the main text, we have shown that the Fisher information matrix when h = 2.0 shows a step at N (N + 1)/2. The whole shape of the spectrum remains similar for smaller h even though the location of step can be little shifted. Compared to the randomly initialized RBM, we see larger diagonal elements in w part. As Fig. 2 shows that eigenvalues between N th to N (N + 1)/2 are much larger for the converged Fisher information matrix than the random RBM, we expect that w part of the matrix contributes to these eigenvalues. To test this, we have diagonalized only the w part of quantum Fisher matrix when h = 2.0 where we could observe a step at N (N − 1)/2. Thus despite the whole spectrum does not show a clear step at N -th eigenvalue, we may still consider that N eigenvalues are from the unary part and N (N − 1)/2 are from the w part. We also found that all diagonal elements of the quantum Fisher matrix is larger than 10 −2 when h ≥ 1.0 so the diagonal approximation of the quantum Fisher matrix is full rank.

System size dependence of the spectral profile
When we use the same parameter α = M/N and the Hamiltonian, we observe that spectra of the converged Fisher information matrix behaves almost the same for varying N . In Fig. 7, we show the spectra of the converged quantum Fisher matrix for different values of N = [28,32,36,40] using the TFI with different values of h = [0.0, 0.6, 1.0, 1.4, 2.0]. We clearly see that eigenvalue distributions for the same h only vary little with the change of the system size N . Still, it is not easy to make a exact correspondence between the results from different N as the order of the quantum Fisher matrix is given by αN 2 + (α + 1)N which is not monomial. Thus there is no single constant scale factor we can use for rescaling the results. Still, this suggests that the spectrum of the quantum Fisher matrix can be used as a faithful diagnostic tool on small systems to infer qualitative behavior on larger systems. spectrum as compared to Fig. 4 in the main text. We conjecture that this is related to the fact that the ground state found using ER has more component in J z = 0 subspace compared to SR case. Indeed, we have J 2 z /N 2 ≈ 0.963 which is slightly smaller than what is found in the SR case in the main test. Second, the converged quantum Fisher matrix shows a smooth spectrum when ∆ = 2.0 even though the system has a gapped anti-ferromagnetic ground state. It implies that a smooth spectrum of the converged quantum Fisher matrix is not sufficient to infer criticality. gt ← ∇ θ f (θt−1) 6: vt = βvt−1 + (1 − β)gt gt 7: θt = θt−1 − ηgt 1/( √ vt + ) 8: end while We here study RMSProp introduced in Sec. IV for the paramagnetic phase of TFI. The learning curves for 5 different values of h are shown in Fig. 9. We can see that the learning curves are more complex than what we have seen for the ferromagnetic and critical cases. Specifically, we have three distinct observations as follows. First, there is a spike of the rescaled energy that goes up in the initial stage of learning. In addition, the size of the spike grows with h. This means that an initial direction that optimizer selects is different to the optimal direction. Second, the properties of the quantum Fisher matrix after convergence are not helpful to understand the learning. In Appendix C, we have shown that the properties of the quantum Fisher matrix do not change much within the paramagnetic phase. However, it does not seem like there is a common property of the learning curves from different h. Third, the converged energy can be as low as that of the ground state. This is interesting as it indicates the optimizer sometimes finds the proper solution even though the learning dynamic is not good.
From these observations, we suspect that RMSProp takes a different learning pathway than SR in the paramagnetic phase. To understand the applicability and details of the learning dynamics of the algorithm better, more detailed investigations such as tracking the path of optimization are required. We leave such a detailed investigation of this optimizers and the comparison to other optimizers for future work.