Exploring entanglement and optimization within the Hamiltonian Variational Ansatz

Quantum variational algorithms are one of the most promising applications of near-term quantum computers; however, recent studies have demonstrated that unless the variational quantum circuits are configured in a problem-specific manner, optimization of such circuits will most likely fail. In this paper, we focus on a special family of quantum circuits called the Hamiltonian Variational Ansatz (HVA), which takes inspiration from the quantum approximation optimization algorithm and adiabatic quantum computation. Through the study of its entanglement spectrum and energy gradient statistics, we find that HVA exhibits favorable structural properties such as mild or entirely absent barren plateaus and a restricted state space that eases their optimization in comparison to the well-studied"hardware-efficient ansatz."We also numerically observe that the optimization landscape of HVA becomes almost trap free when the ansatz is over-parameterized. We observe a size-dependent"computational phase transition"as the number of layers in the HVA circuit is increased where the optimization crosses over from a hard to an easy region in terms of the quality of the approximations and speed of convergence to a good solution. In contrast with the analogous transitions observed in the learning of random unitaries which occur at a number of layers that grows exponentially with the number of qubits, our Variational Quantum Eigensolver experiments suggest that the threshold to achieve the over-parameterization phenomenon scales at most polynomially in the number of qubits for the transverse field Ising and XXZ models. Lastly, as a demonstration of its entangling power and effectiveness, we show that HVA can find accurate approximations to the ground states of a modified Haldane-Shastry Hamiltonian on a ring, which has long-range interactions and has a power-law entanglement scaling.


I. INTRODUCTION
With the advent of Noisy Intermediate-Scale Quantum (NISQ) computers [1], near-term quantum algorithms such as the Variational Quantum Eigensolvers (VQE), may offer computational capabilities beyond the best current classical computers and algorithms for approximating ground states of quantum many-body systems. A VQE algorithm contains three ingredients: A variational quantum circuit ansatz specified by a set of parameters θ, an energy function given by the expectation value of a local Hamiltonian H composed of local measurements on the variational circuit state and a classical optimizer. A natural first approach is the random quantum circuit ansatz [2][3][4], capable of expressing a wide variety of states. However, this was shown to be ineffective for gradient-based optimization strategies due to the barren plateau phenomenon [5], which causes the optimization of randomly initialized circuits to get stuck on flat ar-eas in the cost landscape where gradients are exponentially small. These observations suggest that an effective ansatz for VQE requires a circuit that is problem specific, such that the optimization landscape of the problem is not hindered by barren plateaus. On the other hand, Ref. [6] suggests, for quantum many-body problems, a novel variational circuit that is now called the Hamiltonian Variational Ansatz (HVA). While there is no rigorous proof that HVA will be an effective ansatz, recent work has demonstrated that HVA is rather effective for several one-and two-dimensional quantum many-body models [7,8]. It is thus an intriguing question to further understand the empirically observed effectiveness of HVA.
For the purpose of understanding the effectiveness of such ansätze, it is useful to note that quantum entanglement provides a window into the capabilities of several families of numerical techniques and algorithms aimed at understanding the properties of quantum many-body states, as well as helps us delineate the boundary between quantum states that can be simulated classically and those which call for quantum simulators and quantum computers for their accurate description. For instance, for a one-dimensional (1D) gapped local Hamiltonian, the entanglement entropy of the ground state obeys an area law, i.e., the entanglement entropy grows proportional to the boundary area of the system instead of the system size [9]. This remarkable result allows us to combat the exponential scaling of the Hilbert space, since this area law provides evidence that the relevant physics of a system only takes place in a restricted part of the full state space. These observations have inspired a variety of variational numerical methods, most notably, Tensor Network approaches such as Matrix Product State (MPS), Multiscale Entanglement Renormalization and Projected Entangled Pair States [10], but also deep learning inspired variational approaches, which have been successful at representing quantum many-body states [11][12][13][14].
In this paper we study various entanglement properties of HVA and present several results on the favorable features of HVA that shed light on the underlying reasons for its effectiveness for solving natural many-body problems. Our findings suggest that HVA is highly expressive but yet structured enough to allow for efficient optimization. Through the study of two prototypical models in condensed matter physics, namely the 1D transverse field Ising (TIFM) and XXZ models, we find that entanglement entropy and entanglement spectrum can shed light onto the initialization and optimization properties of HVA in the context of the VQE algorithm. Whereas HVA provides a restricted and effective state space for the TFIM which yields ground state approximations largely insensitive to the circuit initialization, the 1D XXZ model ansatz requires a careful parameter initialization for its successful optimization. Through the study of the dynamics of entanglement spectrum during the optimization of the XXZ model we find that initializing the HVA near the identity operator enables a restricted and effective subspace during optimization that yields accurate approximations to the ground state with fast convergence. Furthermore, we show evidence that the gradient vanishing problem in HVA, especially if the HVA is initialized near the identity operator, is mild or entirely absent in comparison to the random circuit ansatz, where barren plateaus in the energy landscape cause gradients to decay exponentially with increasing system size. We also explore the over-parameterization phenomena in HVA and observe a "computational phase transition" between an under-parameterized and over-parameterized regime where the optimization landscape of HVA crosses over to a regime with faster convergence and absence of low-quality solutions. Lastly, as a demonstration of the entangling power and effectiveness of HVA, we study a modified Haldane-Shastry (MHS) Hamiltonian which has long-range interactions and a power-law scaling entanglement entropy [15]. We observe that HVA can find approximations to the ground state of the MHS Hamiltonian reaching fidelities > 99% for system sizes N = 4, 8, 12, 16 and circuit depths p = N . Our findings point to important features of HVA that will lead to a deeper understanding of its effectiveness, and point the way to developing more sophisticated ansätze for other many-body problems, as well as more informed optimization strategies.
In section II we introduce the basic concepts of VQE and HVA. In section III, we introduce two paradigmatic quantum many-body models which we use in our study, the Transverse Field Ising Model and the XXZ model, as well as their respective ansätze. We also introduce the necessary entanglement concepts used in this paper. In section IV, our main results are presented, and we conclude in section V. In the supplementary materials, we include the computational details in section A, some extra results on the dynamics of entanglement entropy in section B and some additional numerical results in section C.
II. VARIATIONAL QUANTUM EIGENSOLVER AND HAMILTONIAN VARIATIONAL ANSATZ VQE [16] is a hybrid classical-quantum algorithm for finding eigenstates of a quantum many-body Hamiltonian. According to the variational principle of quantum mechanics, a parameterized wave function |ψ(θ) provides an upper bound on the ground state energy, where H is a k-local lattice Hamiltonian. Hence, we can approximate the ground state by minimizing E(θ) with respect to the parameters θ. In the case of VQE, the wave function |ψ(θ) corresponds to a depth p quantum circuit specified by a unitary matrix U (θ), i.e., |ψ(θ) = U (θ) |0 where a number of m parameters specify the unitary θ ∈ R m . We can estimate the variational energy E(θ) p , where E(θ) p denotes the energy at p-level circuit, by measuring the observables that compose the Hamiltonian of the system over the quantum state U (θ) |0 . We use a classical optimization procedure to find the optimal parameters θ * that minimize the energy. As with other variational methods for approximating the ground state, a key ingredient to the success of the method is finding a good parameterization scheme of the wave function. Ideally, the manifold of states parameterized by the ansatz of choice contains the ground state of interest, and this ground state can be reached using a numerical optimization. The Hamiltonian Variational Ansatz (HVA) [6] is a quantum circuit ansatz inspired by the quantum approximation optimization algorithm (QAOA) [17] and adiabatic computation [18]. Instead of using only two (non-commuting) operators as in QAOA, HVA uses more terms of the Hamiltonian. More specifically, assume where s is a subgraph of the lattice under consideration. We assume that each pair of H s and H s do not commute, i.e., [H s , H s ] = 0. A depth p HVA is given by where |ψ 0 is the ground state of one of terms in equation eq. (2), i.e. H s0 . When ordering the unitaries, we make sure that H s0 is not the first H s acting on |ψ 0 . Note that due to the periodicity of the complex exponent, we can restrict the parameters to [0, 2π], although in the case of certain symmetries, this restriction can be made tighter without losing expressive power [7]. Since these circuits are model-specific, the properties of the circuit can vary per problem. We give some concrete examples of HVA in the next section.

III. METHODS & MODELS
A. Models

Transverse Field Ising-Model
The TFIM is a paradigmatic model for studies of quantum magnetism. The Hamiltonian for the onedimensional chain is given by: with where we assume g > 0 and use periodic boundary conditions σ z N +1 ≡ σ z 1 . Here, σ α i corresponds to a Pauli matrix α = x, y, z acting on a site i, where the Pauli matrices are defined as follows: The Hamiltonian has a Z 2 symmetry so it is invariant under a spin flip operation. For g < 1, the system is in a ferromagnetic phase where the Hamiltonian favors spin alignment along the z direction. For g > 1 the system transitions to a disordered paramagnetic phase. In the limit that g → ∞, the σ x term dominates the Hamiltonian, and the ground state becomes |+ ⊗N . At g = 1 there is a critical point, and the system becomes gapless in the thermodynamic limit. A depth-p HVA circuit for the TFIM corresponds to Hence for a depth p circuit, we have 2p parameters. Figure 1 illustrates the corresponding quantum circuit for N = 4 and p = 1. Note that we choose |ψ 0 in (3) to be the ground state of H x = − N i=1 σ x i , i.e., |ψ 0 = |+ ⊗N . The HVA circuit of (5) is the same as the QAOA ansatz used in [17] for solving the MaxCut problem. By using the Jordan-Wigner transformation, it was shown that the ground state can be represented accurately with a depth p = N/2 circuit for the case that g = 0 [19]. For the case that g = 0, there is only numerical evidence to support this claim [7,20]. In appendix section C we confirm that for the TFIM one can consistently find the ground state for g ∈ {0.5, 0.52, . . . , 1.5} with a depth p = N/2 circuit.

XXZ-model
Another prototypical model for studying quantum magnetism is the XXZ model. For the 1D XXZ model, the Hamiltonian is given by Again we assume periodic boundary conditions. The parameter ∆ controls the spin anisotropy in the model. For ∆ = 1, this model has an SU (2) symmetry and is equivalent to the Heisenberg chain. For ∆ = 1, this symmetry gets reduced to a U (1) × Z 2 symmetry. For 1 < |∆| the system is in the XY quasi-long-range ordered state and becomes gapless in the thermodynamic limit. At |∆| = 1 there is a phase transition to the Néel ordered state. This model can be solved exactly using the Bethe-ansatz for N → ∞ [21].
Inspired by [7], we decompose the 1D chain into even and odd links and separate the Hamiltonian into two for α = x, y, z. Our numerical experiments indicate that separately parameterizing these bonds gives better performance when studying the anisotropic system ∆ = 1. Additionally, we parameterize H xx , H yy and H zz terms with their own respective parameter. The reason for this is that for ∆ = 1 the anisotropy in the model cannot be accounted for by a single parameter. A depth-p HVA circuit for the XXZ model corresponds to Hence for a depth-p circuit, we have 4p parameters. Figure 2 illustrates a quantum circuit for N = 4 and p = 1. We choose the initial state |ψ 0 in (3) to be the ground state of H even , i.e., |ψ 0 = It was shown in [7] that the Heisenberg chain (i.e., ∆ = 1) can be solved accurately with HVA with p = N/2. Note that for the case of ∆ = 1, one can use a single parameter for H xx + H yy + H zz . In appendix section C, we find that for ∆ ∈ {0.5, 0.52, . . . , 1.5} and ∆ = 1, a depth p = N/2 HVA circuit is sufficient to find a close approximation to the ground state. Here, the X gates are given by X = σ x i . Together with a single Hadamard gate and a CNOT on even links, we prepare the Ψ − Bell state. The 2-local qubit rotations are all of the form AA = exp −ix/2 σ a i σ a j , with x = θ, φ, β, γ depending on whether the links are even or odd and ZZ or XX, YY.
In this work, we consider the problem of approximating the ground state at the critical points g = 1 and ∆ = 1 for the TFIM and XXZ model respectively since their particular entanglement scaling properties makes them harder to approximate with classical methods [22], such as density matrix renormalization group (DMRG). Due to the criticality of the aforementioned systems at these order values, the energy spectrum becomes gapless in the thermodynamic limit and hence there is a logarithmic correction of S ∝ log N to the area law of entanglement entropy. A matrix product state with bond dimension D bounds the entanglement of the state to S ≤ 2 log D, so the necessary bond dimension to express the ground state grows polynomially in a DMRG calculation [22].

Performance Metrics
We use the fidelity F between the VQE optimized state |ψ(θ * ) and the true ground state |ψ ground obtained from exact diagonalization: Note that for the models studied in this work, |ψ ground is always non-degenerate. If the fidelity is > 99.9%, we assume that we have successfully found the ground state. When assessing the quality of an optimized HVA circuit, the fidelity is a strong indicator of the success for solving the ground state problem, since the infidelity upper bounds the difference between the ground state and variational expectation value of any observable. Let where c is the operator norm of O [23].

B. Entanglement
In the context of quantum many-body physics, quantum correlations play a central role in our current understanding of the equilibrium and out-of-equilibrium properties of several systems in condensed matter. The source of these correlations is inherently non-local, and can be traced back to the presence of entanglement in the quantum state. In this section we introduce several commonly used entanglement quantities in quantum manybody physics.
In classical systems, one uses entropy to quantify our lack of knowledge of the state of the system due to thermal fluctuations. However, for a quantum system at zero temperature, the entropy of a subsystem has a different origin: entanglement. To quantify it, we use the bipartite entanglement entropy [9], which is defined as the von Neumann entropy of the reduced density matrix ρ A . To obtain this reduced density matrix, we divide the system into two subsystems A and B and trace out subsystem B, where |ψ is a pure state. For example, for an 8-spin model on a ring, a typical bipartition is given in fig. 3. The von Neumann entropy generalizes the concept of Shannon entropy to quantum states, and is given by Since a bipartite quantum state can always be rewritten using the Schmidt decomposition, = δ km , the von Neumann entropy reduces to [24] In recent years, the importance of entanglement in condensed matter physics has been elucidated in several systems through the study of the scaling behaviour of the entanglement entropy, which has enabled the identification and characterization of exotic phases of matter such as topological quantum states [25] and quantum spin liquids [26,27]. Fully characterizing the entanglement properties of a system cannot be done by looking solely at the entanglement entropy [24,28,29]. The so-called entanglement spectrum has a much richer structure, and has been used to study many-body localization [28], observable thermalization [30], and irreversibility in quantum circuits [29]. In addition, the entanglement spectrum has been used to study the properties of variational methods such as the Restricted Boltzmann Machine [15]. The entanglement spectrum is defined as the eigenvalue spectrum of the entanglement Hamiltonian From eq. (10) it follows directly that this Hamiltonian has eigenvalues ξ k . For random quantum states distributed according to the Haar measure, the entanglement spectrum follows the Marchenko-Pastur distribution [31,32]. This distribution describes the asymptotic average density of eigenvalues of Wishart matrices, i.e., matrices of the form XX * where X be m × n random matrices. Finally, the Page entropy [33] describes the average entanglement entropy over randomly drawn pure states in the entire Hilbert space, and is given by where d A and d B are the dimensions of subsystem A and B, respectively.

IV. MAIN RESULTS
A. The ansatz space through the lens of entanglement spectrum The effectiveness of a VQE optimization is determined by two factors. First, one requires an expressive enough ansatz space that contains the ground state. The ansatz space of a specific model H and depth p refers to the set of all possible quantum states that can be reached by applying a depth-p HVA circuit corresponding to H to a fixed initial state |ψ 0 which depends on the model. Secondly, the non-convex cost landscape induced by the variational energy of eq. (1) must be favorable, in the sense that the optimization does not get stuck in local minima and can reliably reach the ground state.
Here, we investigate the properties of the ansatz space by examining the entanglement spectra of HVA quantum states generated with random parameters sampled uniformly in the range [0, π] for the TFIM and [0, 2π] for the XXZ model. For each model, we sample 1e3 sets of parameters and calculate the entanglement spectrum of the resulting state. If the spectrum of the sampled states follows a distribution close to the Marchenko-Pastur (MP) distribution, a random HVA state has entanglement spectrum that resembles that of a Haar random state. On the contrary, a distribution far away from the MP distribution indicates a restricted manifold of states that has a non-random structure. We hypothesize that the shape of the average entanglement spectrum can be directly linked to the performance of the VQE optimization. Figure 4 shows the average entanglement spectrum for a state in the ansatz space of circuits with depths ranging from 1, 2, . . . , N for the N = 16-qubit TFIM and XXZ models. From the insets we see that both ansätze have enough entangling power to express the ground state, even for low depth circuits. For the TFIM with 16 qubits (fig. 4a), we see that for all p the HVA spectrum is further away from the MP distribution, and the HVA space corresponding to the TFIM appears to be a manifold of states with restricted entanglement structure. In contrast, for the XXZ model, we see that the average spectra gets closer to the MP distribution as p increases. This suggests that the HVA space for the XXZ model is not as restricted as for the TFIM. This can be understood directly by looking at the circuit complexity, which for the XXZ model contains more gates and parameters per layer. However, this is necessary because the XXZ model is inherently a much richer model in terms of physics, and it may be necessary for HVA space to accommodate more variety of states.
We now turn to examining the entanglement features of the XXZ model HVA states explored during optimization. For the variational minimization of eq. (1) we use a gradient descent algorithm (see section A for details). Since the cost function is non-convex, the quality of the solution will vary significantly between different starting points in parameter space. We compare the following initialization strategies: 1. A completely random-state initialization, where all parameters are sampled as θ ∼ U(0, 2π).
2. An identity initialization. We set all parameters equal to π, so that our circuit is equal to the identity circuit.
Our approach of starting close to the identity is similar to the block identity initialization strategy discussed in [34], however, we study a simpler version by setting all parameters equal to π. For both parameter initializations, we extract the final layer state from the circuit at multiple times during the optimization and calculate its entanglement spectrum with eq. (12). Not surprisingly, our experiments indicate that a random start is prone to getting stuck in a local minimum, due to our local optimization strategies combined with a non-convex energy landscape. The identity start on the other hand allows one to consistently find a high fidelity state for both systems with a depth p = N/2 circuit (see appendix section C).
To study this finding in more detail, we study the dynamics of the entanglement spectrum for different initialization strategies. In fig. 5a we see that an identity state initialization stays far away from the MP distribution at all times, indicating that we are accessing a highly structured restricted subspace of the full HVA space. Additionally, this initialization reaches a state with > 99.9% fidelity state. On the contrary, the random state initialization in fig. 5b starts close to the MP distribution and then moves to a more structured, local minima with 70% fidelity. We conclude that even though the shape of the entanglement spectrum from fig. 4b indicates a possible large unstructured ansatz space, a local optimization is still capable of finding the ground state if we choose a suitable parameter initialization. We further investigate the qualitative properties of the optimization dynamics in appendix section B. In the next section, we will see that the disadvantage of starting at a bad initial point can be overcome by making the circuit sufficiently deep, a process known as over-parameterization.

B. Over-parameterization in HVA
Over-parameterization is a phenomenon in certain types of non-convex optimization problems. For an over-parameterized model, the optimization landscape becomes dramatically better (e.g., almost trap free or almost-convex) as the number of parameters reaches some threshold. In most cases the rate of convergence also becomes better, sometimes even exponentially faster after passing this threshold. Over-parameterization has been studied extensively in the classical deep neural network literature [35][36][37][38]. For example, in [36] it was shown that under certain mild assumptions, the optimization landscape of a deep neural network is almost-convex in a large neighborhood of with the ground state. Since this initialization strategy starts with the identity circuit, we find the t = 0% state to be a product state, as indicated by the single eigenvalue. (b) The random initialization starts close to the MP distribution and converges to a local minimum with ≈ 70% fidelity. a random starting point. As a consequence the stochastic gradient descent algorithm can almost always find an accurate solution.
Although for VQE algorithms it is clear that we have min E(θ) p+1 ≤ min E(θ) p , it is not clear if this minimum can be found consistently due to the non-convexity of the energy landscape. Hence, a deeper understanding of the energy landscape with increasing depth is required. There is some work on over-parameterization in the context of controllable quantum systems with unconstrained time-varying controls [39][40][41], where the authors show that there are no sub-optimal local minima in the optimization landscape. For the case of a constrained controllable quantum system, a recent work [42] considers the problem of learning d-dimensional Haar random unitaries U (d) by gradient descent using general alternating operator ansatz of the form e −iγpA e −iβpB · · · e −iγ1A e −iβ1B , where A and B are matrices sampled from the Gaussian Unitary Ensemble [43]. The authors show that gradient descent always converges to an accurate solution when the number of parameters is d 2 or greater, and a "computational phase transition" is observed between an underparameterization (< d 2 ) and over-parameterization (> d 2 ) regimes.
Since HVA also has the form of an alternating operator ansatz, and the problem of finding the ground states can also be seen as a constrained quantum control problem, we expect a similar over-parameterization phenomenon in our setting. To investigate this, we randomly sample 20 initial parameters θ (uniformly drawn from the inter-val [0, π] for the TFIM and [0, 2π] for the XXZ model) and perform the optimization for increasing values of p. Here, we set the stopping criterion for the optimization to res = E(θ) p −E ground < 1e−4 and the max mum number of iterations to 3000. Indeed, fig. 6 shows that the overparameterization phenomenon also occurs in HVA for the 12-qubit TFIM and XXZ model. Since it is possible that one random starting point converges faster than the others, we have used the tolerant mean to calculate the average energy at each iteration, e.g.,ˆ res = {ˆ 1 , . . . ,ˆ 20 }, whereˆ i is the mean energy over i non-converged runs. We see that for both the TFIM and XXZ model, gradient descent from all 20 random starting points converges to an accurate solution once the depth p reaches a certain thresholdp(N ). Moreover, we also observe a "computational phase transition" around this threshold where the convergence speed becomes exponentially fast. However, this thresholdp(N ) is not tight, i.e., for depth p <p(N ) it is possible that gradient descent still converges to a high fidelity state. This indicates that in the setting of finding ground states using HVA, the problem is more structured and gradient descent is effective. In fig. 7 we see that for all system sizes, the mean number of iterations eventually converges to about 100 iterations, independent of system size. In addition, we can find the over-parameterization thresholdsp(N ) in table I for the TFIM and XXZ model with different system sizes. Our data suggests thatp(N ) has at most a polynomial scaling, which is compatible with the analogous parameter count required to express critical 1D ground states with an MPS. This is a striking difference with [42] where the number of parameters to achieve over-parameterization is (2 N ) 2 . From fig. 7 we can also see that the iteration time decreases substantially as p increases, which saturates to around 100 iterations after a certain p for all N .
The over-parameterization phenomenon in HVA shows a clear difference between HVA and parameterized random quantum circuits (RQC), because there is no indication or evidence that the landscape of RQC gets better as one increases the depth. On the contrary, in our experiments with random circuits of comparable depths to our HVA circuits, we were unable to observe the same overparameterization phenomenon. This can be explained from the barren plateau point of view and the lack of structure in the ansatz space.   Note that if the number of iterations is smaller than 3000, then we know that res ≤ 1e − 4, indicating that the optimization has converged to a good ground state approximation. We see that the error bars decrease systematically with depth. For both models, there is a critical p after which all random initializations converge to a good ground state approximation. Moreover, for depth p = 34 and p = 52 for the TFIM and XXZ model respectively, the number of iterations to find the ground state is of the order of 100 iterations for every starting point.

C. Ameliorated barren plateaus in HVA
In Ref. [5], a barren plateau phenomenon was observed for VQE on random quantum circuits, where all gradients are exponentially close to zero with overwhelmingly high probability, making local optimization within the ansatz space extremely challenging. The barren plateau phenomenon is due to the fact that RQCs consisting of single-and two-qubit gates form a 2-design, which means that gradients of the energy objective function will obey the same concentration of measure properties as if the circuits were Haar-random unitaries.
In contrast to the RQC ansatz, we show that the optimization landscape of HVA is much more favorable. This is clearly illustrated when optimizing the HVA corresponding to the TFIM: to begin with, as discussed in section IV A the manifold of states has a much more restricted entanglement structure than a typical, Haarrandom state -this already indicates that the HVA circuits do not form 2-designs, and thus do not obey the same kind of concentration of measure phenomenon as RQCs. On the other hand, the entanglement spectrum of the ansatz space corresponding to the XXZ model does not immediately rule out the same barren plateau behaviour as exhibited by RQCs.
Nonetheless, we determined that the barren plateau problem is significantly ameliorated in both the TFIM and XXZ models. In fig. 8, we calculated the variance of gradients as a function of qubits number N and depth p over 20 random points per N and per p. For the TFIM, the flatness of the variance curve indicates no barren plateau problem. However for the XXZ model, we see an exponential decay, but this decay is not as strong as in RQCs [5]. Nonetheless, we can reliably find an accurate solution when choosing an identity start (see appendix section C), where the barren plateau problem is absent. Indeed, sampling gradients close to the identity initialization gives a constant gradient variance for all N . This indicates that the vanishing gradient problem can be circumvented by choosing a suitable initialization strategy.

D. The entangling power of HVA circuits
For a 1D gapped quantum system, the entanglement entropy of the ground state obeys an area law [44][45][46], i.e., the entanglement entropy grows proportionally to the boundary area |∂I| of the subsystem ρ A : In 1D, the boundary area |∂I| is either 1 (for an open chain) or 2 (for a closed chain), and the area law simply says that the entanglement entropy should be constant as N increases. For a 1D conformally invariant gapless (critical) system, the entanglement entropy of the ground state has a logarithmic scaling instead [47], i.e.,

S(ρ A ) = O(log(n)).
Entangling power is an important factor for characterizing the expressiveness and efficiency of many variational ansätze in condensed matter physics. For example, in the matrix product state representation, the entangling power is limited by the so-called bond dimension D which affects the expressive power and computational cost of the ansatz. For a 1D gapped system with energy gap , the ground state can be approximated well by an MPS with sublinear bond dimension ) [48]. In the case of HVA, the amount of entanglement generated by the circuit depends on the depth p of the circuit. Indeed, we observed in fig. 4 numerically that the HVA circuits for the TFIM and XXZ model have enough entangling power to express the ground states.
As a demonstration that the full entangling power of HVA can be utilized effectively, we solve for the ground state of the so-called modified Haldane-Shastry (MHS) Hamiltonian. This model has long range interactions and is expected to have power-law entanglement scaling in the ground state [49,50]. The MHS Hamiltonian is given by where d ij = N π |sin(π(j − k)/N )|. Due to the form of the Hamiltonian, we can use the same HVA (7) as for the XXZ model. In figure fig. 9 we see that it is possible to find the ground state with > 99.7% fidelity using a depth p = N circuit for N = 4, 8, 12, 16.

V. CONCLUSION
In this work, we shed light on some of the desirable properties of HVA as a critical ingredient in the variational quantum eigensolver algorithm. In particular, we show evidence that there are only mild or entirely absent barren plateaus in HVA. This is strikingly different from the commonly used random quantum circuits. Moreover, we also observe an over-parameterization phenomenon in HVA. Similar to what was observed in the deep neural networks, the optimization landscape of HVA becomes increasingly better as the ansatz is overparameterized and eventually becomes trap free as the over-parameterization reaches a certain threshold. In contrast with the case of learning Haar random unitaries, we observe that such threshold in HVA scales at most polynomially with the system size. Finally, we provide numerical evidence that HVA can be used to find the ground state of the MHS Hamiltonian, which has a power-law scaling entanglement. We believe that our findings point to important features of HVA that will lead to a deeper understanding of its effectiveness, and point the way to developing more sophisticated ansätze for other many-body problems, as well as more informed optimization/initialization strategies.
As for future work, since most 1D quantum manybody systems can be simulated efficiently with classical methods, the crucible for HVA will be 2D systems. If low-depth circuits are capable of reproducing non-trivial 2D quantum states, then one can start thinking when a quantum advantage can be reached for systems where classical methods are computationally expensive or even ineffective. The effectiveness of the identity initialization, both in terms of the absence of vanishing gradients and reliability of finding a good ground state approximation is striking. Scrutinizing the mechanism for why this is the case will require a deeper understanding of the energy landscape of HVA. Our preliminary results for the XXZ model and TFIM on rectangular lattices show that this initialization strategy remains effective even for 2D systems.
Lastly, the over-parameterized regime is a double edged sword. On the one hand, it implies that we can improve the energy landscape by increasing the depth of the circuit, ameliorating the effects of local minima. On the other hand, the growth in circuit depth, may well nullify this increase in performance due to the longer coherence times required and multiplicative gate errors. In order to asses how useful this regime is for hardware implementations, it would require an understanding of the effect that noise has on the optimization in the overparameterized regime. A possible direction to analyze this phenomenon further is to connect to the rich literature on over-parameterization in the machine learning community.

A. COMPUTATIONAL DETAILS
For the implementation of our quantum circuits, we use Zyglrox [51], a powerful TensorFlow-based quantum simulator. For the classical optimization process, we use Adam (adaptive moment estimation) [52], a gradient descent-based optimizer, which is widely used in the machine learning community. Compared to vanilla gradient descent and its other variants, Adam updates the learning rates adaptively on a per parameter basis by using estimates of the first and second moments of the gradients. In our own investigation for solving the ground energy problem with HVA, Adam outperformed all the other optimizers available in TensorFlow, with respect to fidelity and convergence times.
Unless stated otherwise, the stopping criterion for our optimization is defined as |E(θ t ) − E(θ t+1 )| < 1 × 10 −13 where t is the iteration number. The maximum number of iterations is set to 15000. We use an initial learning rate r = 0.01 for Adam which gives reasonably consistent results across all the models. Through our own investigation into initial Adam learning rates, we found a learning rate 1 × 10 −3 ≤ r ≤ 4 × 10 −2 to be a good choice for the optimization for both the TFIM and the XXZ models, as it balances optimization accuracy and convergence speed.
These simulations were performed on the University of Toronto Computer Science Department servers, which housed either AMD Ryzen Threadripper 2990WX, or Silicon Mechanics Rackform iServ R331.v4 with two 12-core Intel E5-2697v2 CPUs with access to at most 10gb of RAM. More computationally expensive simulations were done using Nvidia GeForce GTX 1050Ti GPUs.
On the Vector cluster we used Intel(R) Xeon(R) Silver 4110 CPUs and Nvidia 480 T4 GPUs with access to at most 32gb of RAM.

B. DYNAMICS OF ENTANGLEMENT ENTROPY DURING OPTIMIZATION
To further elucidate the difference in initialization strategies we qualitatively study the dynamics of the entanglement entropy during optimization. In fig. 10, we calculate the entanglement entropy of ρ A at each layer of the circuit during the optimization. Although not much can be said about the intermediate states for the random state initialization, except that they are highly entangled, the entanglement entropy dynamics for the identity initialization have a distinct structure that is consistent as we increase the system size. In fig. 11, we compare the scaling of the entanglement entropy for the identity start halfway through the circuit for different system sizes.

C. ADDITIONAL NUMERICAL RESULTS
In fig. 12 we show the infidelities we find after optimization for g ∈ {0.5, 0.52, . . . , 1.5} and ∆ ∈ {0.5, 0.52, . . . , 1.5} for the TFIM and XXZ model, respectively. Our numerical results for the aforementioned ranges of order values are available in the dataset module of Tensorflow Quantum [53]. In this region, the optimization has not fully converged, but is stopped after 15000 iterations. We note that for increasing N the time until convergence is polynomial in N (not shown here), similar to what was observed in [20]. Additionally, we observe a worsening of this scaling with increased g. (b) For the XXZ model we are unable to consistently reach machine precision fidelities. In addition, the fidelities we find become worse as N increases. However, except for a couple of outliers we are able to get > 99.9% fidelities for N ≤ 16 or all order values.