Avoiding local minima in variational quantum eigensolvers with the natural gradient optimizer

We compare the BFGS optimizer, ADAM and Natural Gradient Descent (NatGrad) in the context of Variational Quantum Eigensolvers (VQEs). We systematically analyze their performance on the QAOA ansatz for the Transverse Field Ising Model (TFIM) as well as on overparametrized circuits with the ability to break the symmetry of the Hamiltonian. The BFGS algorithm is frequently unable to find a global minimum for systems beyond about 20 spins and ADAM easily gets trapped in local minima. On the other hand, NatGrad shows stable performance on all considered system sizes, albeit at a significantly higher cost per epoch. In sharp contrast to most classical gradient based learning, the performance of all optimizers is found to decrease upon seemingly benign overparametrization of the ansatz class, with BFGS and ADAM failing more often and more severely than NatGrad. Additional tests for the Heisenberg XXZ model corroborate the accuracy problems of BFGS in high dimensions, but they reveal some shortcomings of NatGrad as well. Our results suggest that great care needs to be taken in the choice of gradient based optimizers and the parametrization for VQEs.


I. INTRODUCTION
Variational quantum algorithms such as the Variational Quantum Eigensolver (VQE) or the Quantum Approximate Optimization Algorithm (QAOA) [1] have received a lot of attention of late. They are promising candidates for gaining a quantum advantage already with Noisy Intermediate-Scale Quantum (NISQ) computers in areas such as quantum chemistry [2], condensed matter simulations [3], and discrete optimization tasks [4]. A major open problem is that of finding good classical optimizers which are able to guide such hybrid quantumclassical algorithms to desirable minima and to do that with the smallest possible number of calls to a quantum computer backend. In classical machine learning, the Adaptive Moment Estimation (ADAM) [5] optimizer is among the most widely used and recommended algorithms [6,7], and has been one of the most important enablers of progress in deep learning in recent years. Such an accurate and versatile optimizer for quantum variational algorithms is yet to be found.
We are here mostly interested in variational algorithms for quantum many-body problems. To make progress towards finding an efficient and reliably optimizer for this domain, we concentrate on cost functions derived from typical quantum many-body Hamiltonians such as the Transverse Field Ising Model (TFIM) and the Heisenberg XXZ Model (XXZM) for two reasons: First, their system size can be varied allowing us to systematically study scaling effects. Second, for integrable systems such as the TFIM the exact ground states are known and it is possible to construct ansatz classes for VQE circuits that * wierichs@thp.uni-koeln.de provably contain the global minimum and can be simulated more efficiently. Such systems thus allow us to distinguish between the performance of the optimizers and the expressiveness of the ansatz.
As a first result we show that the commonly used optimization strategies ADAM [8] and Broyden-Fletcher-Goldfarb-Shanno (BFGS) [9][10][11][12][13][14][15][16][17][18] both run into convergence problems when the system size of a VQE is increased. This happens already for system sizes within the reach of current and near future NISQ devices, which underlines the importance to a systematic search for suitable optimization strategies. The BFGS algorithm fails systematically for bigger systems above about 20 spins in the TFIM corresponding to 20 variational parameters. The performance of ADAM is shown to depend strongly on the learning rate via multiple effects and the number of epochs required for convergence increases fast with the problem size. Convergence can be improved but only with an expensive fine-tuning of the hyperparameters.
We then study the performance of an optimization strategy known as the Quantum Natural Gradient or NatGrad [19][20][21] and introduce Tikhonov regularization to the classical processing step in the VQE [22]. We find that Natural Gradient Descent (NatGrad) regularized in this way does consistently find a global optimum for the largest system sizes we test (40 qubits) and requires significantly fewer epochs to do so than ADAM (in the cases where ADAM converges at all). This is in sharp contrast to the usually very good performance of the ADAM optimizer and related (stochastic) gradient descend based techniques in the optimization of classical neural networks. A possible explanation for this good performance in usually overparametrized settings is the following: For common activation functions and random initialization, increasing overparametrization tends to transform local minima into saddle points [23,24]. The optimizer then mainly needs to follow a deep and narrow valley with comparably flat bottom to find a global minimum. The ADAM optimizer is perfectly suitable to pursue this path as it has per-parameter learning rates that also take into account the average of recent updates. In this way it avoids side-to-side oscillations in the valley and can build up momentum to slide down the relatively flat bottom of the valley.
The energy landscapes of typical variational quantum algorithms however look very different. First, having deep and wide circuits with many parametrized gates is prohibitive on NISQ computers, which excludes overparametrization as a tool to make the variational space more accessible. Second, the variational parameters usually feed into exponentially generated gates and thus the cost function is a combination of trigonometric functions of the parameters. It appears that NatGrad is able to effectively use the information about the ansatz class to navigate the resulting energy landscape with many local minima. Third, it is known that large parts of the parameter space form so-called barren plateaus with very small gradients [25]. A random initialization of the parameters in reasonably deep VQEs is thus almost certainly going to leave one stuck in such a plateau. Of course this also implies that one must prevent the optimizer from jumping to a random location in parameter space during optimization. This can be achieved in NatGrad by inhibiting unsuitably large steps by means of Tikhonov regularization. Finally, due to the small number of variational parameters in VQEs, the added (classical) computational cost of inverting the Fubini-Study metric is neglegible as compared to the cost of sampling from the quantum backend. This fact, combined with the highly correlated nature of the learning landscape in quantum many-body problems [26], might render second-order methods such as NatGrad more amenable to quantum than to classical settings, where samples are cheap, but there are many variational parameters.
Our second set of results concerns the effect of overparametrization in VQEs. We study the impact of adding redundant layers to the ideal circuit ansatz. Not only does this overparametrization not improve the performance, it actually appears to make finding the optimum significantly harder. The BFGS algorithm but also the ADAM optimizer, designed to thrive on additional degrees of freedom, fail frequently in this setting. This cannot easily be mitigated by increasing the iteration budget and reducing the learning rate of the ADAM optimizer. While also affected, NatGrad shows much higher resilience against this effect, compensating its higher perepoch cost with a higher chance to succeed.
In order to generalize our results, we consider the XXZM together with the Trotterized time evolution operator as circuit ansatz. Indeed we find BFGS to experience the same difficulties in high-dimensional parameter spaces and ADAM to exhibit a similar behaviour of the required number of epochs as for the TFIM. The performance of NatGrad however is not as reliable for this model as it shows very flat intermediate optimization curves which obscur the distinction between challenging phases in the computation and convergence to local minima. A detailed investigation of the origin of the deviating behaviour and potential improvements of NatGrad will be subject of future work.

II. MAIN RESULTS
In this section, we state and assess the main numerical results of the paper. For a detailed description of the optimizers and circuit models, see the Methods section (sec. III).
A. QAOA circuits for the TFIM We start our numerical investigation with the QAOA circuit for the TFIM on N qubits with a depth of p = N/2 blocks and analyze the accuracy, speed and stability of all three optimizers BFGS, ADAM and NatGrad (see sec. III A 1 a for the ansatz and III C 1 for the model). These circuits with n = N parameters are sufficiently expressive to contain the ground state and respect the symmetries of the Hamiltonian. For each system size we sample 20 points in parameter space and initialize each optimizer at these positions. This leads to statistically distributed performances of the algorithms and as we perform exact simulations without sampling and noise it is the only source of stochasticity. The minimal relative error δ min and the number of required epochs for each initial point and optimizer are shown in fig. 1.
Before we analyze the results, recall that the optimization problem can be solved exactly, i. e. the ansatz contains the true ground state. This enables us to identify optimization results with precisions δ min ≥ 10 −3 as local minima and we consider them to be unsuccessful. In practical applications the precision reached in both local and global minima would be much lower and in particular results with δ ≈ 10 −10 are unreasonable to measure in quantum machines. This choice of benchmark is made in order to clearly reveal intrinsic features of the optimizers. For realistic applications, a systematic study of noise needs to be taken into account as well.
Our first observation is that the BFGS optimizer systematically fails to converge for systems sizes larger than N = 20. For small system sizes, however, it reaches a global minimum in the smallest number of iterations and at low cost per epoch (see tab. I). The runs of BFGS interrupted at a δ < 10 −6 level could be improved to reach the goal of δ = 10 −10 by tuning the interrupt criterion. Therefore, these runs are considered successful.
For ADAM we here show the optimization results with η = 0.06 which similarly display a deterioration in accuracy for system sizes beyond N = 26. It is important to note that the failed ADAM runs are interrupted after 5 · 10 4 iterations and convergence with additional runtime is not excluded in general. The question is then: How many iterations are needed for convergence? We observe a polynomial scaling of the required iterations in the system size up to a transition point N * (η) which depends on the chosen learning rate. Above this system size both successful and failing runs take much longer and exceed the set budget of 5 · 10 4 iterations. In fig. 1 we present the ADAM runs for a medium learning rate in order to demonstrate the described behavior but not the best possible performance of the ADAM optimizer. We present a more detailed analysis of the influence of the learning rate on the performance of ADAM in appendix B. NatGrad shows reliable convergence to a global minimum for all sampled initial parameters. The number of epochs to convergence scales polynomially with the system size and there is little variance in the required number of epochs.
Using the scalings discussed in more detail in sec. III B 5, taking the translation symmetry of the TFIM into account and employing the estimates N M /N a ≈ 10 [27] and t 3 ≈ t 2 ≈ t 1 (see sec. III B 5 for definitions of these quantities) we show the expected optimization durations on a quantum computer in fig. 2. Due to the increased cost per epoch and a similar scaling of the number of iterations for all optimizers, the cost for NatGrad are considerably higher than those for BFGS and ADAM in the regimes in which they converge and ADAM does not suffer from the sudden increase in required epochs. We expect the scaling for ADAM, which is truncated in fig. 1 due to our epoch budget, to yield quantum runtimes comparable to those of NatGrad. As we show in appendix B, reducing the learning rate makes bigger system sizes accessible to ADAM, but also rather drastically increases run times because of slower convergence.
The structure of the investigated Hamiltonian has a major influence on the scalings as the translation symmetries in the presented spin chain models reduce K H to a small constant leading to high relative cost of obtaining the Fubini matrix. For chemical systems, for example, with at least quadratic scaling of K H in N and depending on the ansatz class, the relative additional cost per epoch for NatGrad can be significantly smaller, which in combination with the unreliable convergence of ADAM from system size N * onwards would make NatGrad an attractive optimization technique.
In summary, we find the BFGS optimizer to run into convergence problems already for medium sized systems, ADAM to take a large number of epochs with a transition into unpredictable cost at a certain system size and NatGrad to exhibit reliable convergence with fewer epochs than ADAM, but an overall high cost when running on a real quantum computer. Furthermore, the success of both commonly used optimizers, BFGS and ADAM, strongly depends on the initial parameters whereas NatGrad shows stable convergence and a small variance of the optimization duration.

B. Overparametrization by adding Y layers
We now extend the optimal QAOA circuit for the TFIM by adding redundant layers of Pauli Y rotations. These additional rotations can be deactivated by setting their variational parameter κ to zero. This means in particular that the new ansatz classes still contain the ground state and simply introduce a form of overparametrization.
As Pauli Y rotations cannot be represented in the free fermion basis of the Hamiltonian (see eqn. (16)), the overparametrized class can be seen as breaking a symmetry. This means that for any given κ = 0, the ansatz state will not be a global minimum and it will be crucial for an optimization algorithm to find the submanifold with κ = 0. This is clear for a single additional layer of gates, but we expect it to hold for multiple layers as well. Although the present situation is artificially constructed and the broken symmetry is manifest, similar behavior is expected in systems where we do not have an analytical solution. More generally, even for a suitable ansatz class a very specific configuration of the variational parameters is necessary to find the ground state and the chosen optimization algorithm consequentially should be resilient to local minima. Our choice of overparametrization leads to such local minima, constructing an optimization problem that can be used as a test for the resilience of the optimizer.
We look at two configurations of the extended circuits with y-rotation layers included at positions N 4 and N 4 , N 2 − 1 respectively. With this choice we avoid special points in the circuit and expect these setups to properly emulate the problem of (additional) local minima.
Again we sample 20 positions in parameter space close to the origin and initialize the three optimizers at these points, resulting in the precisions and success ratios shown in fig. 3 together with the estimated quantum computer runtimes in fig. 4. We observe a clear distinction between the optimizations that succeed to find a global minimum and those which converge to a local minimum only, such that we obtain a well-defined success ratio for this numerical experiment. In contrast to the results for the minimal QAOA circuit, no intermediate precisions caused by a finite iteration budget occur. All optimizers suffer from the introduced gates as they show convergence to local minima for system sizes they tackled successfully without overparametrization. For BFGS, this effect appears for some system sizes for one layer of Pauli Y rotations but is much stronger for two additional layers, reducing the fraction of globally minimized runs to less than 50% for multiple system sizes. We do not claim a scaling behaviour with the system size but note an alternating pattern for the configuration with two Y layers, demonstrating large fluctuations of the success ratio (c. f. in particular system sizes 10 and 12 for two Y layers).
For the ADAM optimizer we use a comparably small learning rate of η = 0.02 which pushes the jump of the optimization duration observed before well out of the treated system size range. Nonetheless, we observe runs stuck in local minima already for small systems without exceeding the iteration budget such that in contrast to sec. II A allowing for a longer runtime would not improve the performance. Also for ADAM the fraction of successful instances fluctuates with the system size but in particular for two Pauli Y rotation layers the effect becomes stronger for bigger systems and no successful runs were observed for N ≥ 14.
The performance of NatGrad on the other hand, for which we reduced the learning rate to η = 0.05, is more reliable and the success rate is the best for most of the circuits, with few exceptions. In particular there are only few system sizes with local convergence for one and two additional degrees of freedom each and overall the success rate of NatGrad does not drop below 60%.
For all optimizers we confirm that successful runs deactivate the additional Pauli Y rotation layers by setting the corresponding parameters to 0 and that all optimizations with worse precision failed to do so, leading to a local minimization only. The quantum runtimes demonstrate the expected scaling with NatGrad as the most expensive optimizer, where the small iteration count compensates the increased cost per epoch for small systems. However, the increased effort is rewarded with significantly higher success rates, making NatGrad a strong choice for (potentially) overparametrized VQE optimization. We again note that the relative cost of the Fubini matrix are high for spinchain systems and that the reduced number of epochs required by NatGrad will have a bigger impact in other systems.
Overall our numerical experiments with the extended QAOA circuits for the TFIM demonstrate the fragility of the three tested optimizers to perturbations of the ansatz class. A significant decrease in performance is caused by overparametrization outside of the symmetry sector of the model and the QAOA ansatz class. All algorithms were successful for the original QAOA circuits on the considered system sizes such that the reduced success ratio can directly be attributed to the extension of the ansatz class. This is in contrast to machine learning settings where heavy overparametrization is essential to make the cost function landscape tractable to local optimizers like ADAM. The strong fluctuations over the tested system sizes indicate that more repetitions of the optimization would be required to resolve systematic behaviour.
We note that the BFGS algorithm in some instances converges to a local minimum although it has access to non-local information via its line search subroutine. In particular in the presence of two misleading parameters in the search space, the local information determining the one-dimensional subspace does not seem to suffice any longer to find the global minimum, even though the approximated Hessian is used. For the ADAM optimizer the initial gradient leads to an activation of symmetry breaking layers and due to the restriction to local information the algorithm is not able to leave the resulting sector of the search space with local minima it enters initially. NatGrad also is affected by the limitation to local information but because of the access to geometric properties of the ansatz state class it was on average less likely to leave the Pauli Y -rotation layers activated. We attribute this to the fact that NatGrad performs the optimization in the locally undeformed Hilbert space by extracting the influence of the parametrization. As a consequence the optimizer does not follow the incentive to activate the Pauli Y rotations at the beginning when given the same gradient as ADAM, but stays within the minimal parameter subspace. A better foundation for this intuition and the observed exceptions will be subject to further investigations of NatGrad.
In general, one could expect the cost function of VQEs to behave differently than those in common machine learning models as the parameters enter in a very nonlinear manner via rotation gates. We were able to demonstrate such a difference with ADAM, which benefits from overparametrization in machine learning applications but suffers significantly from the additional parameters of the extended circuits. The restriction of NISQ devices to rather shallow circuits implies much smaller numbers of variational parameters than in machine learning such that NatGrad can be considered a viable option for VQE optimization.

C. Results on the Heisenberg model
To complement the study on scaling and overparametrization in the integrable TFIM we present here numerical results on the XXZM with the ansatz discussed in detail in sec. III C 2. The performance of the three optimizers, initialized at 20 distinct points close to 0, is shown in fig. 5 together with the number of epochs.
The behaviour of ADAM and BFGS is similar to the one observed on the TFIM, i. e. ADAM successfully achieves the target accuracy of 10 −5 but shows an abrupt increase in the iteration number and BFGS starts to fail  for medium sized systems. The number of variational parameters at which the respective transition occurs is similar to that in the TFIM: The cost of ADAM jump abruptly at n = 24 and n = 36 and similarly the runs with comparable learning rate for the TFIM show (less clear) transitions at n = 26 and n = 30. Likewise the BFGS optimizer starts failing significantly at n = 24 and n = 22 for the XXZM and the TFIM, respectively. The Hilbert space dimension however clearly differs at the transition points. It is intuitively clear that the main influence should be due to the properties of the parameter space, but in general the physical system size might affect the performance as well by shaping the energy landscape. The NatGrad optimizer is less performant on the XXZM as it is sometimes interrupted during phases of small updates. This might indicate either convergence to a local minimum or a too small learning rate. A preliminary further analysis showed that reducing η in NatGrad can prevent convergence for some instances that were optimized successfully before. This hints to the second scenario because a reduced learning rate should generally improve the quality of NatGrad. This will be investigated in a follow-up study.
We note that the attained precision in failed runs does not show a consistent gap across the system sizes which makes the analysis of the performance less clear. Nonetheless the deviation from the target precision is significantly smaller for NatGrad than for BFGS and if one extends the gap visible for N = 8 and N = 10 many instances of BFGS are categorized as unsuccessful.
We interpret the results on the XXZM as follows: Some difficulties of the commonly used BFGS optimizer and ADAM appear also in this model already for moderate system sizes. The size of the parameter space seems to primarily determine whether performance (BFGS) or runtime (ADAM) issues arise, not so much the Hilbert space dimension of the underlying many-body model. The very reliable performance of NatGrad seen in the TFIM can not necessarily be generalized to other spin chain models, let alone to other classes of Hamiltonians. However, the characteristics of the failed runs let us hope that systematic improvements to NatGrad might be possible.

A. Variational Quantum Eigensolver
The framework of our work is the VQE, a proposal to use parametrized circuits on a quantum computer in combination with classical optimization routines to prepare the ground state of a target Hamiltonian H. In the first part of a VQE one constructs a quantum circuit that contains parametrized gates. Given input parameters θ for the circuit, a quantum computer can then prepare the corresponding ansatz state and measure an objective function, chosen to be the energy of the Hamiltonian and for benchmark problems with known ground state energy E 0 , the relative error δ can be calculated as Additionally one can prepare modified versions of the circuit to determine auxiliary quantities like the energy gradient in the parameter space [28]. The second part of the VQE scheme is an optimization strategy on a classical computer which is granted access to the quantum black box just constructed. In the most straightforward scenario this is a black box minimization scheme, but using auxiliary quantities, more sophisticated optimization methods can be realized as well.
There are two main theoretical challenges for successfully applying VQE: First, the construction of a sufficiently complex, but not overly expensive, circuit that gives rise to an ansatz class containing the ground state expressivity. Second, the choice of a suitable optimizer that is able to search for the ground state within the created parameter spaceefficiency. The two challenges are often seen as independent, but explicit algorithms using information gathered about the variational space during optimization phases for adjusting the ansatz have been proposed as well, some of which are inspired by concrete applications in quantum chemistry or by evolutionary strategies [8,16,29,30]. We now establish some notation for the general VQE setting where we assume the most common objective: Finding the ground state energy of a Hamiltonian H. Starting from an initial product state |ψ , we apply parametrized unitaries {U j (θ j )} 1≤j≤n to construct the ansatz state The parameters are typically initialized randomly close to zero to avoid the barren plateau problem [25]. For this work, the unitaries are going to be translationally invariant layers of one-or two-qubit rotations; Consider for instance where we identified the qubits with index 1 and N+1, i. e. we adopt periodic boundary conditions. The ordering of the gates within a layer is not relevant because they commute but for convenience we write them such that terms acting on the first qubits are applied first. Z (k) is the Pauli Z operator acting on the k-th qubit and we tacitly assume the tensor product between operators that act on distinct qubits as well as the missing tensor factors of identities. Compared to proposed ansatz circuits that employ full Hamiltonian time evolution exp[−iθH] (see sec. III A 1 a), such a layer is rather easily implemented on present quantum machines because it only requires linear connectivity and one type of two-qubit rotation. There have been many proposed circuits to generate ansatz classes for a variety of problems, all of which can be boiled down to combining rotational gates and possibly other fixed gates such as the CNOT or SWAP gate (see sec. III A 1). For the presented optimization methods the derivatives w. r. t. the variational parameters {θ j } j are important and for the above example we observe the special structure of translationally symmetric layers of Pauli rotation gates: The derivative only produces an operator prefactor, and all prefactors can be summarized because the single gates commute. While the basic gates composing a unitary U j (θ j ) typically take the form of (local) Pauli rotations, the full unitary often is more complex than the above layer and in particular the terms in U j do not need to commute. However, the structure of rotations enables us in general to evaluate required expressions involving derivatives on a quantum computer, either via measurements of rotation generators or via ancilla qubit schemes.

A selection of ansatz classes
Among the ansatz families proposed in the literature we present the following which are used frequently and of which two are directly connected to this work: a. QAOA The Quantum Approximate Optimization Algorithm was first proposed by Farhi, Goldstone and Gutmann [1] in 2014 for approximate solutions to (classical) optimization problems by mapping them to a spinchain Hamiltonian. The algorithm looks similar to adiabatic time evolution methods with an inhomogeneous time resolution which is rather coarse for typical circuit depths. A lot of work has been put into proving properties of the QAOA both, in general and for certain problem types, including extensions to quantum cost Hamiltonians [31][32][33][34]. A the same time the algorithm has been refined, extended, and characterized on the basis of heuristics and numerical experiments, gaining insight into its properties beyond rigorous statements [14, 36? -38].
The QAOA circuit is constructed as follows: For a cost Hamiltonian H S and a so-called mixing Hamiltonian H B one alternatingly applies the unitaries exp [−iϑ j H S ] and exp [−iϕ j H B ] p times, giving rise to a VQE ansatz class with 'time' parameters {ϑ j , ϕ j } 1≤j≤p . Originally, the system Hamiltonian would encode a classical optimization problem and thus be diagonal while the mixing Hamiltonian was chosen to be off-diagonal and specifically has been kept fixed to the original H B = N k=1 X (k) for many investigations. However, new choices of mixers have been proposed and investigated as well, giving rise to the more general Quantum Alternating Operator ansatz (QAOa) [15,38,39].
Note that for quantum systems, the terms comprising the Hamiltonian H S do not commute in general such that very large gate sequences would be necessary to realize the exact QAOA approach including exp [−iϑH S ]. In practice these blocks commonly are broken up in a Trotter-like fashion instead, yielding circuits that are implemented more readily but deviating from the original ansatz. For the TFIM, such a modified QAOA ansatz has been studied intensively [14, 36? ] and we are going to use it as a starting point for our investigations.
b. Adaptive ansätze Most prominently for this type of ansätze, ADAPT-VQE tackles both the construction of a suitable ansatz class and the optimization within the constructed parameter space.
Instead of a fixed ansatz circuit layout, ADAPT-VQE takes a pool of gates as input and iterates the two steps of the VQE scheme: After rating all gates the most promising one is appended to the circuit (construction) and afterwards all the circuit parameters are optimized (minimization). The optimized parameters from the previous step are then used for both, the rating of the gates for the next construction step and the initialization for the following optimization, where newly added gates are initialized close to the identity. For both, the concept of allowed gates and the gate rating criteria, there are multiple options and we refer the reader to [16,29] for more detailed descriptions.
Besides ADAPT-VQE, multiple other methods which grow the ansatz circuit in interplay with the optimization have been proposed and demonstrated, including Rotoselect [8] and EVQE [30]. These demonstrations include the solution of 5-qubit spinchains and small molecules (lithium hydride, beryllium dihydride and a Hydrogen chain) to chemical precision using simulations with and without sampling noise or quantum hardware.
We will not be using any adaptive scheme in our work, but our results on stability and overparametrization raise serious doubts as to the reliability of any adaptive ansatz method.

B. Optimizers
A variety of optimizers have been used in the context of variational quantum algorithms. These optimizers are inspired by classical machine learning and can be sorted according to the order of information required about the cost function. Zeroth-order or direct optimization methods only evaluate the function itself, first-order methods need access to the gradient, and second-order optimization need access to the Hessian of the cost function, or some other metric reflecting the local curvature of the learning landscape.

Direct optimization
The most naive approach to optimizing a function over an input space is to simply "look at all possible inputs", i. e. to set up a grid and to evaluate the function on all vertices of the grid. Even though it is unlikely to find the minimum in this manner directly, subsequent refinements of the grid around potential minima make global optimization possible. On the one hand this method becomes exponentially expensive in the number of parameters and a 15-dimensional grid generated by only two values per parameter already requires 2 15 > 3 · 10 4 function evaluations. On the other hand, the naive grid search can be improved significantly which allows for global optimization. This approach has been demonstrated successfully for between 15 to 20 parameter with the DIRECT method and a budget of 2 · 10 5 evaluations [40]. For high-dimensional applications, i. e. circuits for realistic systems with parameter count at least linear in the size of the system, any global optimization strategy seems likely to suffer from the sparse information access and to become incapable of exploring a sufficiently big fraction of the search space.
As is the case for most of the work on VQEs we will not use any direct minimization methods, supported by the estimate that those strategies become unfeasible for relevant problem sizes and demonstrated deficiencies in comparison to gradient-based techniques [17].

First-Order Gradient Descent
Optimization techniques using the gradient of the cost function are at this point the most widely used in machine learning. Starting from the simple Gradient Descent method that updates the parameters according to the gradient and a fixed learning rate, a whole family of minimization strategies has been developed. The improved routines are inspired by physical processes like momentum, based on heuristics like adaptive learning rate schedules, or a smart processing of the gradient information as in the Nesterov Accelerated Gradient. A review of this development can be found e. g. in [7], here we just present the first-order method we are going to use, the ADAM optimizer.
ADAM, which was proposed in 2014 [5], is probably the most prevalent optimization strategy for deep feedforward neural networks [6] and has been used in VQE settings as well [8]. For completeness, we briefly outline the ADAM optimizer: Given the cost function E(θ), where θ recollects all variational parameters, a starting point θ (0) and a learning rate η, Gradient Descent computes the gradient ∇E(θ (t) ) at the current position and accordingly updates the parameters rescaled by η: As the gradient points in the direction of steepest ascend, the parameter update is directed towards the steepest descend of the cost function and for η small enough, the convergence towards a minimum can be understood intuitively. Small learning rates yield slow convergence which increases the cost of the optimization whereas choosing η too large leads to overshooting and oscillations which might prevent convergence. Furthermore, although the optimizer will diagnose convergence to a minimum due to a vanishing gradient, it cannot distinguish between local and global minima. In order to fix both issues, i. e. the need for an optimally scheduled learning rate and the liability of getting stuck in local minima, various improvements have been proposed and ADAM uses several of these upgrades. The first feature is an adaptive, componentwise learning rate, which was introduced in AdaGrad [41] and improved in RMSprop [42] to avoid suppressed learning. The second feature ADAM uses is momentum, which is inspired by the physical momentum of a ball in a landscape with friction. This is realized by reusing past parameter upgrades weighted with an exponential decay towards the past and enables ADAM to overcome some local minima. The final form of the ADAM algorithm is as follows: Initialize with hyperparameters {η, β 1 , β 2 , ε}, momentum m (0) = 0, average squared gradient v (0) = 0 and initial position θ (0) . At the t-th step, compute the gradient and update the momentum and the cumulated squared gradient as where x 2 denotes the elementwise square of a vector x. The parameter update then is computed from these updated quantities via with the square root of v (t) taken elementwise. Besides the learning rate η, we identify the hyperparameters β 1 and β 2 as exponential memory decay factors of m and v respectively and the small constant ε as regularizer, which avoids unreasonably large updates in flat regions and division by zero at initialization or for irrelevant parameters.
Because of the advanced features that ADAM uses, it has been very successful at many tasks and even though there are applications for which more basic gradientbased optimizers can be advantageous, we choose ADAM to represent the family of local first-order optimizers.

BFGS optimizer
The second optimizer we look at is the BFGS algorithm, which was proposed by its four authors independently in 1970 [9][10][11][12] . Using first-order resources only it approximates the Hessian of the cost function and performs global line searches in the direction of the gradient transformed by the Hessian inverse. Therefore it is a global quasi second-order method using local first-order information and its categorization is not obvious. The algorithm is initialized with the starting point θ (0) and a first guess for the approximate Hessian H (0) of the cost function E, which usually is set to the identity. At each step of the optimization one determines the gradient, computes the direction and performs a line search on {θ (t) + η n (t) |η ∈ R} which yields the optimal update in that direction and can optionally be restricted to a bounded parameter subspace. Given the new point in parameter space, θ (t+1) , the change in the gradient D (t) = ∇E(θ (t+1) ) − ∇E(θ (t) ) is calculated and used to update the approximate Hessian via As the parameter updates are found via line searches, the BFGS algorithm is not strictly local but due to its use of local higher-order information, the global search is much more efficient than direct optimization. The method has been successful in many applications and currently is of widespread use for VQEs. [13][14][15][16][17][18]

Natural Gradient Descent
The third optimization strategy we use is the NatGrad [19][20][21], which due to its increased cost per epoch is not adopted very often in machine learning settings itself but is connected to some successful methods. As an example, Stochastic Reconfiguration which is closely related to NatGrad [43] recently has been shown to work well for training Restricted Boltzmann Machines (RBMs) to describe groundstates of spin models [44]. Despite this success, the insights into why and under which conditions the method works remain limited and recent work has been put into understanding the learning process for the mentioned application of RBMs and the Natural Gradient Descent [26]. Before discussing NatGrad and its role in the VQE setting, we outline its update rule: Given a starting point θ (0) and a learning rate η, a step is performed by first constructing the Fubini-Study metric of the ansatz class at the current position and then updating the parameters via where we abbreviated |ψ (t) ) := |ψ(θ (t) ) and |∂ i ψ (t) := ∂ ∂θi |ψ(θ (t) ) . The Fubini-Study metric is the quantum analogue of the Fisher information matrix in the classical Natural Gradient [20]. It describes the curvature of the ansatz class rather than the learning landscape, but often performs just as well as Hessian based methods. In order to avoid unreasonably large updates caused by very small eigenvalues of F in standard Natural Gradient Descent η has to be chosen very small for an unpredictable number of initial learning steps. Alternatively one can use Tikhonov regularization which amounts to adding a small constant to the diagonal of F before inverting it.
Even though NatGrad is simple from an operational viewpoint, it is epochwise the most expensive optimizer of the three presented here (also see sec. III B 5). This is due to the fact that it not only uses the gradient but, in order to construct the (Hermitian) matrix F for n parameters, one also needs to evaluate 1 2 (n 2 + 3n) pairwise overlaps of the set {|ψ , |∂ 1 ψ , . . . , |∂ n ψ } (all but ψ|ψ = 1). Depending on the gates in the ansatz circuit, each of these overlaps requires at least one and possibly many individual circuit evaluations. For circuits containingñ simple one-or two-qubit Pauli rotation gates, the number of circuits required is 1 2 (ñ 2 + 3ñ), independent of the number of shared parameters. Symmetries of the circuit may reduce the number of distinct terms in which case fewer quantum machine runs suffice.
Taking the j-th parametrized unitary to have K j Hermitian generators P j,kj , e. g. Pauli words up to prefactors {c j,kj }, the factors in the second expression of F take the shape of an expectation value (see also eqn. (6)) The first term in eqn. (12) requires slightly more complex circuits using one ancilla qubit and a depth which depends on the indices of the matrix entry [13,17,45,46]. Both for simulation work and for applications on real quantum machines, the construction of the Fubini matrix is expected to take much more time than inverting it -in sharp contrast to typical classical machine learning problems. Given the scaling of the number of required circuits above and the fact that for a fixed number of qubits the depth has to grow at least linearly with the number of parameters, an asymptotic scaling of O ñ 3 is a lower bound for the construction of the full matrix. Standard matrix inversion algorithms do not only show smaller or equal scaling but also exhibit as prefactor the time cost of a FLOP whereas the evaluation scaling has prefactors based on sampling for expectation values.
As the number of parameters in a typical VQE circuit is considerably smaller than in neural networks and the circuit chosen in this work exhibits beneficial symmetries, the high cost of the method are expected to be less problematic for our setting and bearable for VQE applications. Indeed there have been some demonstrations of the Natural Gradient Descent and the Imaginary Time Evolution for small VQE instances [19,27,47] as well as comparisons to standard gradient descent methods and imaginary time evolution for one-and two-qubit systems [48]. Inspired by the classical machine learning context and aiming for reduced cost, modifications of Natural Gradient Descent have been proposed such as a (block) diagonal approximation to the Fubini-Study matrix [19]. We will later show that such simplifications have to be performed with caution and can disturb the optimization.

Optimization cost
To make a fair comparison between the optimization schemes, we briefly lay out the scaling of the required operations and the resulting cost per epoch.
We will use the following notation during the comparison: There are n variational parameters in the circuit, K H terms in the Hamiltonian and on average K = n j=1 K j /n Pauli generators per variational parameter, with an average of N M samples required for each expectation value. In practice, one of course would measure whole sets of operators both from the Hamiltonian and from the Pauli generator set simultaneously, such that K and K H essentially are numbers of bases in which measurements are required. For entries of the Fubini matrix we assume N a samples for sufficiently precise measurements, which has been shown to be smaller than N M numerically; For further discussion see [27]. Finally, we introduce the time scale t x := d x t gate + t wrap where t 1 is needed to initialize and measure the quantum register (t wrap ) and perform the circuit with depth d inbetween (dt gate ).
Evaluating the gradient of the energy function can be done with different methods yielding a tradeoff between precision and cost. On one hand, the analytic gradient can be evaluated up to measurement precision at the expense of an ancilla qubit and a scaling prefactor Kn. On the other hand there is the finite difference method, which can be performed symmetrically, asymmetrically or via Simultaneous Perturbation Stochastic Approximation (SPSA), with cost prefactors 2n, n + 1 and 2, respectively. This means that robustness to imprecise gradients in general is a relevant property of any optimization scheme used for VQEs because these gradients are much cheaper to evaluate. Computing the Fubini-Study metric requires two terms and although the measurement cost scales with O (Kn) 2 for the first and with O (Kn) for the second, we keep both terms in the overall cost scaling because the VQE regime implies moderate values of Kn.
For the scalings presented in table I we assume a homogeneous distribution of the variational gates in the circuit and that similar numbers of samples N M are required to measure expectation values of the Hamiltonian terms within one basis and each derivative for all gradient methods.
For the full optimization algorithms the cost are given per epoch as we do not have access to generic scaling of epochs to convergence. Using the cost per epoch one can rescale the optimization cost from epochs to estimated runtime on a quantum computer beyond estimates that are based on the classical simulation runtimes. For the BFGS algorithm we can not predict the number γ of evaluations that are required for the line searches but our numeric experiments and the linear scaling of the cost for non-SPSA gradients suggest that it can be neglected as compared to the gradient computation.
For the quantum runtime scalings shown in figs. 2 and 4 we give the time in units of t eval = N M K H t 1 , assumed  N M /N a ≈ 10 and approximated t 1 ≈ t 2 ≈ t 3 .
C. Models

Transverse Field Ising Model
Our main model is the TFIM on a spinchain with periodic boundary conditions (PBC). Its Hamiltonian reads (15) where we identify the sites 1 and N + 1 because of the PBC and t is the transverse field. For t = 0, the system is the classical Ising chain, which is also called ring of disagrees and is a special case of the MaxCut problem [1? ]. For t = 1 the problem is no longer motivated by a classical optimization task and for the critical point t = 1 the ground state exhibits long-ranged correlations.
The ground state of the TFIM is found analytically by mapping it to a system of non-interacting fermions, such that the transformed Hamiltonian can be diagonalized [49]. The translational invariance of the Hamiltonian is crucial for this step and it will be important that only a small number of different (Pauli) terms can be mapped to non-interacting fermions simultaneously. We show the explicit computation via the Jordan-Wigner transformation in appendix A, it can also be found in e. g. [? ]. Here we summarize the action of the mapping on the terms in the Hamiltonian which also generate the QAOA circuit (see eqn. (19) for the definition of α q ): where the expressions on the right are understood in a fermionic operator basis. The ground state of H TFI is just the product of the single-fermion ground states in momentum basis and we can write out the state and its energy as Because of the free fermion mapping, we can not only obtain the exact ground state of the system but also justify the success of the modified QAOA circuit for the TFIM. As mentioned in sec. III A 1 a, the original QAOA proposal would use the system Hamiltonian and a mixing term as generators for the parametrized gates. For the TFIM, however, separating the nearest-neighbour interaction terms H S from the transverse field terms H B recombines the latter with the mixing unitary next to it absorbing one variational parameter per block. The resulting parametrized circuit contains two types of translationally invariant layers, L x (ϕ) and L zz (ϑ), of one-and two-qubit rotation gates, respectively. Starting in the ground state of H B , that is |ψ = |+ ⊗N , we alternatingly apply these two layers p times starting with L zz . In the free fermion picture this translates to |ψ = |0 ⊗r and to rotations of the r fermionic states about the z-axis (L x ) and an axis e q = (0, sin α q , cos α q ) T which depends on the fermion momentum q (L zz ).
For t = 0 one can prove that this circuit can prepare the ground state exactly if and only if p ≥ r [14], whereas for the case t = 0 only numerical evidence and a nonrigorous explanation support this claim [36]. This explanation compares the number of independent parameters, 2p to the number of constraints from fixing the state of r free fermions, 2r. While solvability would be implied for a linear system, the given problem is non-linear and the argument remains on a non-rigorous level.
Finally, the equivalence to a system of free fermions has a practical implication for our simulations of the QAOA circuit: Storing the state of r free fermions just requires memory for 2r complex numbers. Applying the entire circuit needs 2pr two-dimensional matrix-vector multiplications, which is contrasted by 2pN matrix-vector multiplications in 2 N dimensions for a full circuit simulation in the qubit picture. Using the fermionic basis for numerical simulations, results on the VQE optimization problem for up to N = 200 and p > 120 have been obtained for t = 0 [14].

Heisenberg XXZ Model
As a second model we consider the 1D XXZM with PBC which is defined by ∆ is the anisotropy parameter. As in the TFIM, the sites 1 and N + 1 are identified. The Bethe ansatz reduces the eigen value problem for the XXZM to a system of N/2 nonlinear equations that can be solved numerically with an iterative scheme [50,51]. This results in polynomial cost for computing the ground state energy but does not yield a simple ansatz class to construct the ground state on a quantum computer or a simulation scheme of reduced complexity.
We therefore use the XXZM as a second benchmark which models the application case more closely: We do not know a finite gate sequence that contains the ground state but instead employ circuits composed of symmetrypreserving layers which we found to be relatively successful in experiments. The ansatz we choose is the first-order Trotterized version of the unitary time evolution with the system Hamiltonian applied to a antiferromagnetic ground state: where we only treat even N and |ψ is chosen symmetric under translation for (N mod 4) = 0 and antisymmetric for (N mod 4) = 2 in anticipation of the exact solution via the Bethe ansatz. We found this circuit to be more successful at finding the ground state than the QAOA circuit. Even though the terms N k=1 X (k) X (k+1) and N k=1 Y (k) Y (k+1) do not preserve the magnetization in the Z-basis in general they do so within the sector described by the above ansatz.

D. Simulation Details
The simulations of the QAOA circuit for the TFIM are done in the free fermion picture yielding a quadratic scaling in N for the cost function evaluation. The circuits including L y layers and for the XXZM do not obey the same symmetries and therefore are implemented as a full circuit simulation using ProjectQ [52]. The depth of the QAOA circuit for the TFIM is fixed to the smallest value containing the exact ground state p = N/2, which gives us N variational parameters with one added per L y in the second main experiment. For the (XXZ) model we choose p = N resulting in 3N variational parameters. All circuit simulations are performed exactly, i. e. without noise or sampling. Furthermore we use the SciPy implementation of the BFGS algorithm and in-house routines for ADAM and NatGrad [53]. All variational parameters are initialized uniformly i.i.d. over the interval [0.0001, 0.05] as this corresponds to initializing the circuit close to the identity and symmetric randomization around 0 has shown slightly worse performance in our experiments.
We bound the BFGS optimization to one period of the rotation parameters as this improves the line search efficiency and found only a small dependence on the position of the interval. For the ADAM optimizer we fixed β 1 = 0.9, β 2 = 0.999 and ε = 10 −7 and vary η in [0.005, 0.5] trying to build heuristics for the particular problems. We found non-trivial behaviour of ADAM w. r. t. the learning rate, observing a strong influence on the optimization duration, for details see sec. II A. Furthermore, an increased regularization constant ε did not yield any improvements of ADAM. For NatGrad we fix the Tikhonov regularization to ε T = 10 −4 and use learning rates of 0.5, 0.05 and 0.1. Employing block diagonal approximations to the Fubini-Study matrix as suggested in [19] was not successful due to long-range correlations between the variational parameters in the circuit.

IV. CONCLUSION
Our first main result shows that the BFGS optimizer, while quick and reliably for small systems, has an increased chance getting stuck in local minima already in medium sized VQEs, in the range of present day and near future NISQ devices. This may be surprising as it has access to non-local information due to its line search subroutine. We suspect that this aspect of the algorithm becomes less helpful for finding a global minimum because of its sparsity in high-dimensional parameter spaces.
The ADAM optimizer on the other hand is able to find global minima also in larger parameter spaces (up to 42) for suitably small learning rates but this comes at the cost of a quickly increasing number of epochs to complete the optimization. In particular we observed two effects of the learning rate η on the runtime of ADAM: On the one hand, there is a threshold size of the parameter space that depends on η above which the epoch count rapidly increases, which means that a small enough value of the learning rate is essential to avoid extremely long runtimes. On the other hand, the optimization duration for sizes below the threshold is significantly increased when reducing η such that it is undesirable to choose it smaller than strictly necessary. It thus appears that tedious hyperparameter tuning may be necessary to balanced these two effects.
The NatGrad optimizer recently proposed for VQEs shows very reliable convergence to a global minimum for all tested system sizes within fewer epochs but at high per-epoch cost. We found that Tikhonov regularization can fix the problem of getting lost in barren plateaus even after a suitable initialization. This makes the algorithm a promising, although expensive, candidate for the optimization of future VQEs. The increased cost for determining the Fubini matrix at each step have a particularly strong effect on the estimated quantum runtime for spin chain systems, such that for other systems with more favourable scaling NatGrad might not only be more reliable but additionally exhibit competitive cost.
Our second main experiment treats overparametrization in VQE ansatz classes using the example of additional rotation gates that break the symmetry of the Hamiltonian. The BFGS optimizer fails to find a global minimum in some instances even for small systems and in general exhibits a strongly fluctuating performance which decreases with the number of additional gate layers. The simulation cost restricted the maximal system size for this second experiment but there is no reason to assume that a stronger overparametrization with more symmetry breaking layers would resolve these problems.
Also ADAM showed strong susceptibility to the additional degrees of freedom. Beyond the implications on applications, this is interesting because overparametrization is heavily used in machine learning to make the cost function tractable for optimizers like ADAM and we therefore appear to observe a fundamental difference between classical machine learning and VQEs.
Finally, NatGrad showed some failed optimization runs for selected system sizes as well but mostly remained successful even for multiple additional gate layers. It therefore rewards its increased cost per epoch with higher success rates and is the only tested optimization strategy that showed resilience to both, big search spaces and local minima caused by overparametrization.
The extension of our analysis to the XXZM confirmed the problems of the BFGS optimizer with big search spaces and the rapid runtime growth for ADAM. NatGrad performed less reliably on the XXZM and the per-epoch cost dominate the reduced number of epochs. The convergence issues might be either due to local minima or optimization interrupts based on small improvements with a series of updates, where preliminary insights suggest that the latter is the case and that NatGrad could be improved by tailoring it to VQEs.
Our investigations have shown that NatGrad might enable VQEs to solve more complex and bigger problems as it performs well on a test model with challenges representative of those in potential future applications of VQEs. Caution is in order, however, when generalizing this result to other models as we saw in the case of the XXZM.
For the transformation of the Hamiltonians H S and H B , which comprise both the TFIM Hamiltonian and the generators for the unitaries in the QAOA ansatz, note that Using eqn. (A1) and the above properties the transformed Hamiltonians read where we denote by G := N l=1 N l the gauge factor in the term generated by the periodic boundary conditions and the non-local transformation eqn. (A3) which also has a reversed sign. G interacts with the initial state of The shown fits are based on filtered data in order to determine the apparent scaling for small system sizes and thus do not aim at describing the entire data. The biggest system size partially included in the fit is marked. For the shown learning rates in descending order, we obtain the exponents 2.3, 2.3, 1.9 and 1.4 but prefactors 1.8, 1.9, 7.3 and 74.7.
cuit of the TFIM, we tested it at multiple learning rates from the interval [0.005, 0.1] observing a major influence on the runtime, see fig. 6. For a given learning rate η, the required number of epochs grows polynomially with the system size up to a size N * above which ADAM takes much longer, exceeding the budget of 5·10 4 iterations. In this second phase we find the optimizer to require excessively many iterations both when succeeding and when getting stuck in a local minimum (see e. g. η = 0.06), which prevents us from systematically distinguishing the two cases before convergence. The observed transition point N * (η) can be shifted towards bigger system sizes by decreasing the learning rate, i. e. N * (η) is monotonically decreasing. Meanwhile, reducing η increases the epoch count significantly for smaller system sizes without disrupting the convergence as is expected for well-behaved systems. Even though the scaling exponent is smaller for lower learning rates the optimization requires more iterations which is due to a large prefactor, such that the cost are increased for all system sizes before the jump. The observed dependencies of the runtime on η result in a system size dependent optimal learning rate which trades off the systematically increased epoch counts for small η against the position of the jump in optimization duration. This demonstrates that heuristics for ADAM are needed in order to achieve systematic global optimization and that the required numer of optimization steps can be unpredictably large depending on the hyperparameters.