Quantum Analytic Descent

Variational algorithms have particular relevance for near-term quantum computers but require non-trivial parameter optimisations. Here we propose Analytic Descent: Given that the energy landscape must have a certain simple form in the local region around any reference point, it can be efficiently approximated in its entirety by a classical model -- we support these observations with rigorous, complexity-theoretic arguments. One can classically analyse this approximate function in order to directly `jump' to the (estimated) minimum, before determining a more refined function if necessary. We derive an optimal measurement strategy and generally prove that the asymptotic resource cost of a `jump' corresponds to only a single gradient vector evaluation.


I. INTRODUCTION
Quantum devices have already been announced whose behaviour cannot be simulated using classical computers with practical levels of resource [1-4]. In this era, quantum computers may have the potential to perform tasks of practical value. The early machines will not have a comprehensive solution to accumulating noise [5], and therefore it is a considerable and fascinating challenge to achieve a valuable function despite the imperfections. One very promising class of approaches is generically called quantum variational algorithms, in which one seeks to make use of a quantum circuit of (presumably) relatively low depth [6-10], by adjusting the function it performs to tune it to the desired task.
Typically a simple-to-prepare reference state (such as the all-zero state) is passed into a quantum circuit, called the ansatz circuit, within which there are numerous parametrised gates. The idea exists in many variants, both theoretical and experimental [6, 7, 11-27]; refer also to the review articles [8-10]. In a typical implementation, each gate implements a unitary which is therefore also parametrised, for example U(θ) = exp(-i θ σ_x / 2), where σ_x is the Pauli X operator acting on a given qubit and θ is the classical parameter. For a suitably-chosen ansatz circuit and an appropriate number of independently parametrised gates, the emerging state (also called the ansatz state) may be very complex, while inevitably being restricted to a small proportion of the exponentially large Hilbert space. A given problem, for example the challenge of finding the ground state of some molecule of interest, is encoded by deriving a Hamiltonian H whose ground state represents an acceptable solution. This is of course a non-trivial challenge in itself for many systems of interest. Importantly, this challenge is not decoupled from the task of selecting a suitable ansatz circuit, or that of choosing the initial parameters for that circuit.
Assuming that all these tasks have been appropriately performed, the hope is that there exists some set of parameters, to be discovered, for which the ansatz state emerging from the circuit is indeed (acceptably close to) the desired solution state. The problem then becomes one of parameter search: there might easily be hundreds of parameters, so that techniques from classical optimisation are very relevant to the prospects of successfully finding the proper configuration. A popular choice is gradient descent; in the basic form of this method one evaluates the gradient of the expected energy ⟨H⟩ with respect to each of the ansatz parameters. One then takes a 'small step' in the direction of steepest descent, and re-evaluates the gradient. Numerous adaptations are of course possible, ranging from varying the size of the step through to more advanced protocols for obtaining a valid direction of progress [28].
Although gradient descent and its more advanced variants, such as natural gradient [29-31], are popular choices, they have their limitations and costs. In quantum chemistry or in recompilation problems the energy must be determined to a very high accuracy to be useful (e.g. chemical accuracy, equivalent to 3 or 4 decimal places [32]), while finding the minimum of the energy landscape very precisely is generally expensive due to shot noise [22, 28, 33].
In the present work we study an alternative method of particular relevance in the latter stages of a quantum variational algorithm, when we begin to approach the minimum of the cost function. Using an ansatz circuit within which gates correspond to Pauli strings (a universal construction), we observe that the cost function, i.e. the expected energy of the output state with respect to the problem Hamiltonian, will necessarily have certain simple properties in the local region around any reference point. Exploiting this knowledge, we sample from the ansatz circuit to determine an analytic function valid in the near-minimum region. Given this function, we can descend towards the minimum classically and then take a 'large jump' (as compared to the small incremental steps taken in generic gradient descent) direct to that point. If necessary we then repeat the procedure of refining the analytic function and jumping again, until we reach a point satisfactorily close to the true minimum. We derive an optimal measurement strategy whereby we occasionally collect further samples during a descent to reduce shot noise in our classical approximations. In numerical simulations of this approach we find that a single jump can be equivalent to thousands of steps of generic gradient descent.

II. EXPANDING THE ANSATZ CIRCUIT
Quantum gates generated by Pauli strings have only two distinct eigenvalues, and consequently as we vary the parameter θ associated with such a gate, the corresponding slice of the energy surface is simply of the form a + b cos θ + c sin θ for some a, b, c ∈ R. For further discussion refer to [34][35][36][37][38][39].
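This single-parameter structure is easy to verify numerically. The following minimal NumPy sketch (an illustration with a hypothetical random two-qubit Hamiltonian, not taken from the paper's numerics) samples the energy of a Pauli-generated gate at three points, fits a + b cos θ + c sin θ, and checks the fit at random angles.

```python
import numpy as np

# Pauli matrices and a random Hermitian observable (hypothetical example).
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
H = M + M.conj().T                      # Hermitian "Hamiltonian"

P = np.kron(X, Z)                       # gate generator: a Pauli string, P^2 = Id
psi0 = rng.normal(size=4) + 1j * rng.normal(size=4)
psi0 /= np.linalg.norm(psi0)            # arbitrary reference state

def energy(theta):
    # exp(-i theta P / 2) = cos(theta/2) Id - i sin(theta/2) P
    U = np.cos(theta / 2) * np.eye(4) - 1j * np.sin(theta / 2) * P
    psi = U @ psi0
    return (psi.conj() @ H @ psi).real

# Fit a + b*cos(theta) + c*sin(theta) from three samples, then test elsewhere.
a = (energy(0) + energy(np.pi)) / 2
b = (energy(0) - energy(np.pi)) / 2
c = (energy(np.pi / 2) - energy(-np.pi / 2)) / 2
for theta in rng.uniform(-np.pi, np.pi, size=5):
    assert abs(energy(theta) - (a + b * np.cos(theta) + c * np.sin(theta))) < 1e-10
```

Three energy queries therefore pin down the entire slice exactly, for any Hamiltonian and any reference state.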
It immediately follows from the above property that the Fourier spectrum of the full energy surface is determined by 3^ν coefficients, where ν is the number of parameters. Despite the very simple structure of such functions, determining them classically is intractable. Nevertheless, previous works proposed that the exact energy surface can be reconstructed for a classically tractable number of parameters, e.g., ν = 1, 2, while freezing the other parameters, thereby sequentially optimising the surface using a classical computer [34-36]. Here we make the fundamental observation that one could efficiently obtain (by estimating at most a quadratic number of terms) a good classical approximation of the full energy surface, and its full gradient vector, that is valid in the vicinity of any reference point. We support these observations with rigorous, complexity-theoretic arguments and with an optimal measurement strategy. Let us introduce our model.
We consider ansatz circuits built of parametrised gates U_k(θ_k) = exp(-i θ_k P_k / 2) (Eq. (2)), where the P_k are products of single-qubit Pauli operators, P_k ∈ {Id, σ_x, σ_y, σ_z}^⊗N. For any such ansatz circuit, we can expand every gate into the following form. First, let us fix θ_0 and express the continuous dependence of the quantum gate (as a mapping Φ_k over density matrices) on the angle θ around the fixed θ_0 as

Φ_k(θ_0 + θ) = a(θ) Φ_ak + b(θ) Φ_bk + c(θ) Φ_ck,   (3)

where a(θ), c(θ) = (1 ± cos θ)/2 and b(θ) = sin(θ)/2 are simple Fourier components. The transformations can be specified as Φ_ak = Φ_k(θ_0), via parameter shifts as Φ_bk = Φ_k(θ_0 + π/2) - Φ_k(θ_0 - π/2), and Φ_ck = Φ_k(θ_0 + π). Note that these transformations are discrete in nature, and implicitly depend on the constant θ_0 which we have fixed as a reference point. Refer to Appendix A 1 for more details. We now expand the full ansatz circuit from Eq. (2) into the above form, assuming that all gates are generated by Pauli strings via Eq. (3). We again fix θ_0 and express the continuous dependence on θ around this reference point in parameter space as

Φ(θ_0 + θ) = ∏_{k=1}^{ν} [ a(θ_k) Φ_ak + b(θ_k) Φ_bk + c(θ_k) Φ_ck ].   (4)

The above product can be expanded into a sum of 3^ν terms, which is classically intractable. Nevertheless, in the following we aim to approximate this mapping via a polynomial number of important summands and discard the remaining, less important terms. In particular, we introduce δ := ‖θ‖_∞, which denotes the absolute largest entry in the parameter vector. We will now expand the above quantum circuit into a quadratic number of terms in ν, which introduces an error O(sin^3 δ).
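The single-gate decomposition can be checked directly in matrix form. The sketch below assumes the assignments Φ_ak = Φ_k(θ_0), Φ_bk = Φ_k(θ_0 + π/2) - Φ_k(θ_0 - π/2) and Φ_ck = Φ_k(θ_0 + π), with a(θ) = (1 + cos θ)/2, b(θ) = sin(θ)/2 and c(θ) = (1 - cos θ)/2 (our reading of Appendix A 1), and verifies that the gate's continuous action is exactly reproduced by the three discrete mappings.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
P = np.kron(X, Y)                                  # Pauli-string generator
U = lambda t: np.cos(t / 2) * np.eye(4) - 1j * np.sin(t / 2) * P
conj = lambda t, rho: U(t) @ rho @ U(t).conj().T   # the channel Phi_k at angle t

rng = np.random.default_rng(1)
w = rng.normal(size=4) + 1j * rng.normal(size=4)
rho = np.outer(w, w.conj()); rho /= np.trace(rho)  # arbitrary input state
theta0 = 0.7                                       # fixed reference point

# Discrete mappings: no continuous theta-dependence remains in these.
Phi_a = conj(theta0, rho)
Phi_b = conj(theta0 + np.pi / 2, rho) - conj(theta0 - np.pi / 2, rho)
Phi_c = conj(theta0 + np.pi, rho)

a = lambda t: (1 + np.cos(t)) / 2                  # assumed Fourier components
b = lambda t: np.sin(t) / 2
c = lambda t: (1 - np.cos(t)) / 2

for theta in rng.uniform(-np.pi, np.pi, size=5):
    lhs = conj(theta0 + theta, rho)
    rhs = a(theta) * Phi_a + b(theta) * Phi_b + c(theta) * Phi_c
    assert np.allclose(lhs, rhs)
```

The identity holds exactly for every angle θ, not just near θ_0; the truncation error of the scheme arises only once products over many gates are expanded and cut.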
We derive the explicit form of this approximate mapping in Appendix A 2 as

Φ(θ_0 + θ) ≈ A(θ) Φ^(A) + Σ_{k=1}^{ν} B_k(θ) Φ^(B)_k + Σ_{k=1}^{ν} C_k(θ) Φ^(C)_k + Σ_{k<l} D_kl(θ) Φ^(D)_kl.   (5)

Here A, B_k, C_k, D_kl : R^ν → R are multivariate functions (in fact, monomials in a(θ), b(θ) and c(θ)) and they multiply the discrete mappings as, e.g., A(θ)Φ^(A). As such, these monomials are products of simple univariate trigonometric functions and they completely absorb the continuous dependence on the parameters θ.
Our derivation of Eq. (5) is detailed in Appendix A 2 and relies on the following 3 steps. First, we substitute the explicit forms of the univariate trigonometric functions a(θ_k), b(θ_k) and c(θ_k) into Eq. (4). Second, we expand the resulting product into a sum of 3^ν terms. Third, we discard all contributions that contain a product of 3 or more sin θ_k factors, thereby introducing an error O(sin^3 δ). We could similarly approximate the mapping via a sum of O(ν^3) terms up to an error O(sin^4 δ), or beyond.
Our multivariate trigonometric series has the significant advantage that it can capture some global features, in contrast to local Taylor expansions: for example, the approximation is exact along single-parameter slices and respects symmetries of the objective function, such as its periodicity. In particular, while the error term is generally upper bounded by the 'pessimistic' monomial const · δ^3, just like in the case of a Taylor expansion, the actual error can be significantly below this bound and, e.g., can be zero along single-parameter slices. Moreover, the constant prefactor in the above upper bound can be significantly smaller than in the case of a Taylor expansion; refer to Appendix B 7 and Appendix B 6.

FIG. 1. Error of our trigonometric-series approximation of the entire energy surface (a) and gradient vector (b) as a function of the distance δ from the reference point θ_0 of our model, where δ = ‖θ‖_∞ is the absolute largest parameter θ_k. As long as δ is small, we can classically approximate the gradient vector and use it in an analytic descent optimisation. The approximation error of the gradient vector is computed as the similarity measure 1 - f; refer to text. We used a 12-qubit spin-ring Hamiltonian as in Fig. 2(b) and an 84-parameter ansatz circuit, and included the empirical scaling of the errors as O(δ^3.1) and O(δ^4.2).

III. CLASSICALLY COMPUTING THE ENTIRE ENERGY SURFACE
A large class of potential applications in the context of variational quantum algorithms assumes a target function that corresponds to a linear mapping of the form E(θ) := Tr[H Φ(θ) ρ_0], which can be used to model, e.g., the expected energy of a physical system when H is a Hamiltonian [8-10].
Using our expansion in Eq. (5), we can express the entire energy surface explicitly as

E(θ_0 + θ) ≈ A(θ) E^(A) + Σ_k B_k(θ) E^(B)_k + Σ_k C_k(θ) E^(C)_k + Σ_{k<l} D_kl(θ) E^(D)_kl,   (6)

where the coefficients E^(A), E^(B)_k, E^(C)_k, E^(D)_kl ∈ R can be reconstructed by estimating the energy expectation value at discrete points in parameter space using a quantum device. For example, E^(A) is just the energy at the fixed point θ_0 and E^(C)_k is obtained similarly by shifting the k-th parameter by π. As such, our classical approximation algorithm depends on a quadratic number of coefficients that can be fully determined by querying energy expectation values. Indeed, error mitigation techniques are applicable [8, 40-42].
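The construction of this classical model can be sketched end to end on a toy problem. In the illustration below (a hypothetical 3-parameter, 2-qubit example; the monomial forms and the signed four-point combination used for E^(D)_kl are our reconstruction of the parameter-shift rules, not a verbatim transcription of the paper's equations), an exact simulator stands in for the quantum device, the coefficients are obtained from shifted energy queries, and the model error is confirmed to shrink rapidly with the distance δ.

```python
import numpy as np
from itertools import combinations

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.diag([1.0, -1.0]).astype(complex)
I2 = np.eye(2)

# Toy 2-qubit, 3-parameter problem (hypothetical numbers, not the paper's).
gens = [np.kron(X, I2), np.kron(I2, Y), np.kron(Z, Z)]
H = np.kron(Z, Z) + 0.5 * np.kron(X, I2)
theta0 = np.array([0.4, -0.3, 0.8])              # reference point
nu, v = 3, np.eye(3)

def energy(t):
    """Exact energy at parameters theta0 + t (stands in for the device)."""
    psi = np.zeros(4, dtype=complex); psi[0] = 1
    for P, angle in zip(gens, theta0 + t):
        psi = np.cos(angle / 2) * psi - 1j * np.sin(angle / 2) * (P @ psi)
    return (psi.conj() @ H @ psi).real

# Coefficients of the classical model, from shifted energy queries.
E_A = energy(np.zeros(nu))
E_B = [energy(np.pi / 2 * v[k]) - energy(-np.pi / 2 * v[k]) for k in range(nu)]
E_C = [energy(np.pi * v[k]) for k in range(nu)]
E_D = {(k, l): sum(s1 * s2 * energy(np.pi / 2 * (s1 * v[k] + s2 * v[l]))
                   for s1 in (1, -1) for s2 in (1, -1))
       for k, l in combinations(range(nu), 2)}

a = lambda t: (1 + np.cos(t)) / 2     # assumed explicit Fourier components
b = lambda t: np.sin(t) / 2
c = lambda t: (1 - np.cos(t)) / 2

def model(t):
    """O(nu^2)-term classical approximation of the energy surface, Eq. (6)."""
    pa = np.prod(a(t))
    out = pa * E_A
    out += sum(b(t[k]) / a(t[k]) * pa * E_B[k] for k in range(nu))
    out += sum(c(t[k]) / a(t[k]) * pa * E_C[k] for k in range(nu))
    out += sum(b(t[k]) * b(t[l]) / (a(t[k]) * a(t[l])) * pa * E_D[k, l]
               for k, l in combinations(range(nu), 2))
    return out

rng = np.random.default_rng(2)
direction = rng.uniform(-1, 1, size=nu)
err = lambda d: abs(model(d * direction) - energy(d * direction))
assert err(0.02) < 1e-3 and err(0.02) <= err(0.2)   # error shrinks ~ delta^3
```

Note that the B and C coefficients are exact along single-parameter slices, so the only approximation error comes from the discarded cross terms containing three or more sine factors.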
Let us here briefly summarise our derivation of Eq. (6) from Appendix B. We first apply both sides of Eq. (5) to our reference state ρ_0, i.e., on the left-hand side we obtain the exact, continuously parametrised quantum state Φ(θ)ρ_0, while on the right-hand side we obtain an approximation to it in terms of the discrete mappings, such as Φ^(B)_k ρ_0. We finally obtain Eq. (6) by computing quantum-mechanical expected values via the linear mapping Tr[H ·]; for example, we obtain the leading coefficient as E^(A) = Tr[H Φ^(A) ρ_0].

Fig. 1(a) shows approximation errors obtained from a simulated ansatz circuit of 12 qubits as a function of the absolute largest entry in the parameter vector, given by the norm δ = ‖θ‖_∞. We computed the approximate energy via Eq. (6) at 1000 randomly generated points in parameter space in the vicinity of our reference point θ_0, close to the global optimum. Fig. 1(a) confirms the error scaling O(δ^3) and illustrates that the error is smaller than 10^-3 as long as the parameter vector norm ‖θ‖_∞ is smaller than 0.1. We further remark that in Appendix B 6 we derive exact and approximate symmetries of the energy function around local minima; the objective function is approximately reflection symmetric via E(θ) ≈ E(-θ) and exactly reflection symmetric along single-parameter slices θ_k.

IV. CLASSICALLY COMPUTING THE GRADIENT
We derive the full analytic gradient of the approximate energy surface from Eq. (6) in Appendix B 1 and propose an efficient classical algorithm for computing it for a given input θ in Appendix D 1; this has a classical computational complexity of O(ν^3). We simulate an ansatz circuit of 12 qubits in Fig. 1(b) and compute the approximation error of the analytically calculated gradient vector. We quantify this error using the similarity measure f = ⟨g, g̃⟩ / (‖g‖ ‖g̃‖), the normalised scalar product between the exact gradient vector g and the approximate one g̃. We plot 1 - f in Fig. 1(b), and conclude that our approximation is very good and that our error measure scales with the parameter vector norm in fourth order as 1 - f = O(δ^4).

We aim to use this gradient vector in a classical optimisation routine, but we first have to take into account shot noise: when using a quantum device to estimate the coefficients in Eq. (6), one needs to collect a large number of samples in order to sufficiently reduce the statistical uncertainty in those estimates. This uncertainty is quantified by the variances, e.g., Var[E^(B)_k] when estimating the coefficient E^(B)_k. As such, we want to determine the gradient vector to a fixed precision, defined via the expected Euclidean distance as ε^2 := ‖∆g‖^2 = Σ_{m=1}^{ν} Var[∂_m E(θ)], for which we derive the error propagation formula

ε^2 = Ã(θ) Var[E^(A)] + Σ_k B̃_k(θ) Var[E^(B)_k] + Σ_k C̃_k(θ) Var[E^(C)_k] + Σ_{k<l} D̃_kl(θ) Var[E^(D)_kl].   (7)

Here Ã, B̃_k, C̃_k, D̃_kl : R^ν → R are trigonometric polynomials that depend on the parameters θ and we can efficiently compute them [43]. Note that the above statistical uncertainties are directly proportional to single-shot variances of estimating energy expectation values, e.g., Var[E^(A)] = Var[E(θ_0)], and advanced estimation techniques can be applied [44]. We analytically derive an optimal measurement strategy in Theorem 1, which does not require us to determine the coefficients themselves, such as E^(B)_k, in order to predict their measurement costs, but only their variances, which is relatively cheap.
As such, using a small overhead in quantum resources, our classical algorithm takes an input parameter vector θ and exactly determines how many measurements need to be assigned to estimating the individual coefficients. Most importantly, when we are close to our reference point, almost all measurements are assigned to the coefficients E^(B)_k, which guarantees that the cost of our approach is comparable to a single iteration of gradient descent.
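The generic principle behind such a measurement distribution is the standard variance-weighted shot allocation for a linear combination of independently estimated quantities (a sketch of the underlying idea, not an implementation of Theorem 1; all numbers below are hypothetical).

```python
import numpy as np

# An estimator sum_i w_i * m_i, where m_i has single-shot variance sigma_i^2
# and receives N_i shots, has total variance sum_i w_i^2 sigma_i^2 / N_i.
# Minimising this under a fixed budget sum_i N_i = N (Lagrange multipliers)
# gives the allocation N_i proportional to |w_i| * sigma_i.
rng = np.random.default_rng(3)
w = rng.normal(size=20)            # hypothetical monomial weights A, B_k, ...
sigma = rng.uniform(0.5, 2.0, 20)  # hypothetical single-shot std deviations
N = 10_000                         # total shot budget

def total_var(shots):
    return np.sum(w**2 * sigma**2 / shots)

uniform = np.full(20, N / 20)
optimal = N * np.abs(w) * sigma / np.sum(np.abs(w) * sigma)

assert total_var(optimal) <= total_var(uniform) + 1e-12
# Closed form of the optimum: (sum_i |w_i| sigma_i)^2 / N
assert np.isclose(total_var(optimal), np.sum(np.abs(w) * sigma)**2 / N)
```

Because the weights of the E^(B)_k coefficients dominate near the reference point, such an allocation concentrates the shot budget on exactly the quantities a plain gradient evaluation would also need.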
For this reason, we prove in Theorem 2 an approximate upper bound on the full measurement cost relative to determining a single gradient vector, which is of the form 1 + O(S ν δ) (Eq. (8)). Here S is the ratio of the minimal and maximal single-shot variances due to estimating the energy E(θ) at different points θ, while ν is the number of ansatz parameters and δ is the distance from our reference point θ_0. In our proof in Appendix B 5 we expand the exact variance propagation formula from Eq. (7) and obtain Eq. (8) by keeping only the leading terms in δ and upper bounding the single-shot variances.
Our upper bound in Eq. (8) guarantees the following: a) initialising analytic descent at the reference point θ_0 costs exactly the same as determining a single gradient vector; b) when not moving very far from the reference point, e.g., δ ≤ 2/ν, the overall measurement cost is only a small constant factor more expensive than estimating a single gradient vector; c) we generally prove that as we asymptotically approach the optimum, analytic descent costs exactly the same as determining a single gradient vector.

V. QUANTUM ANALYTIC DESCENT
Instead of determining the gradient at every step, we use our classical approximation of the entire objective function E(θ) and its gradient vector to descend towards its minimum using a classical computer. We propose an iterative optimisation in two nested loops. First, in an external loop we use the quantum device to estimate the coefficients in Eq. (6) which allow us to build a classical model of the full objective function around the reference point θ 0 . This initialisation costs exactly the same number of measurements as determining a single gradient vector. In the internal loop, we compute our classical approximation of the gradient vector at every iteration step and propagate our parameters θ according to a suitable update rule. With our efficient C code for computing the gradient vector, descending 1000 steps towards the minimum can be performed in a matter of minutes on a single thread for up to many hundreds of parameters [43].
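The two nested loops can be summarised in a short skeleton. In the sketch below, grad_model_factory stands in for the quantum-device queries plus the classical model construction of Eq. (6); all names and hyperparameters are illustrative, and the demo uses an exact surrogate in place of the shot-noise-limited model.

```python
import numpy as np

def analytic_descent(grad_model_factory, theta, outer=5, inner=200,
                     eta=0.1, delta_max=0.5):
    """Two-loop skeleton: the outer loop (re)builds a local classical model
    around the current reference point; the inner loop descends on it
    classically, then we 'jump' to the estimated minimum."""
    for _ in range(outer):
        model_grad = grad_model_factory(theta)   # device queries happen here
        t = np.zeros_like(theta)
        for _ in range(inner):                   # purely classical iterations
            t = t - eta * model_grad(t)
            if np.linalg.norm(t, ord=np.inf) > delta_max:
                break                            # model no longer trusted
        theta = theta + t                        # jump to the new reference
    return theta

# Demo: minimise f(theta) = -sum(cos(theta)), whose minimum is at 0; here the
# surrogate gradient is exact, so the scheme reduces to re-centred descent.
factory = lambda ref: (lambda t: np.sin(ref + t))
theta_opt = analytic_descent(factory, np.array([0.3, -0.2]))
assert np.linalg.norm(theta_opt, ord=np.inf) < 1e-3
```

The validity check on ‖t‖_∞ plays the role of the abort criteria discussed below: once the inner loop wanders too far from the reference point, the surrogate is rebuilt with fresh device data.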
The internal, classical optimisation loop is aided by feedback from the quantum device: as we move away from the reference point, shot noise in our classical approximation is magnified via Eq. (7), which would degrade the precision of our classical gradient. Therefore, our optimal measurement distribution algorithm determines to which coefficients we need to assign further measurements in order to keep this precision below a threshold. For example, when moving along a single slice θ_1, we need to use the quantum computer to sample only a linear number of coefficients, namely the E^(D)_{1,l}. Furthermore, note that our upper bound in Eq. (8) guarantees that the overall number of additional measurements is generally proportional to the distance δ from the reference point.
Besides keeping the precision below a threshold, we also need to ensure that our analytical approximation is valid (as it breaks down for large δ). For example, one could estimate the energy with the quantum device every t iterations: if the deviation from the analytical energy is too large, then the internal loop should abort and our approximation in Eq. (6) should be re-initialised at the new reference point. Other possibilities include, e.g., estimating the previously discussed similarity measure 1 - f, or simply aborting when ‖θ‖_∞ or the iteration depth exceeds a certain threshold.

VI. NUMERICAL SIMULATIONS
Let us now demonstrate our approach on two problems of practical relevance. We consider a hardware-efficient ansatz construction which is built of alternating layers of parametrised single-qubit X and Y rotations and two-qubit parametrised Pauli ZZ gates, as illustrated in Fig. 7. First, we consider an 8-qubit recompilation problem of a unitary operator U that acts on 4 qubits as in [45]: since recompiled unitaries may be repeated as part of a quantum algorithm, they need to be determined to a very good approximation. Second, we simulate a spin-ring Hamiltonian [46, 47] that is important in the context of many-body localisation, and we aim to determine its ground state to a high precision. Fig. 2 shows the decreasing distance from the exact ground-state energy: we compare not the number of iterations but the number of measurements (quantum resources). As such, in Fig. 2 (black) analytic descent reaches the optimum faster than the other techniques (with fewer measurements and within about 5 iterations). Furthermore, analytic descent appears to have a considerably accelerated asymptotic convergence rate, i.e., a significantly steeper slope, even though in the early stages of the optimisation its efficacy is comparable to simple gradient descent (red dashed lines).

FIG. 2. Distance from the exact ground-state energy as a function of the number of measurements. A classical approximation of the energy surface is determined at each iteration step of analytic descent (solid lines) and in an internal loop we descend towards its minimum using a classical computer via gradient descent (not shown here). Our approximation is occasionally refined with optimally distributed additional measurements to keep shot noise (via ε^2) below a threshold. Note that the hyperparameters, especially the sampling rates, have been optimised for each technique specifically so that the low-energy regime can be reached. Consequently they do oversample in the early evolution and a left-to-right shift should be viewed as an artefact of this choice. All four techniques rely on determining the coefficients, such as E^(B)_k.
The explanation is straightforward: building a classical approximation of a local region may not be beneficial while we take big jumps in the early evolution. In contrast, as we approach the optimum, our analytical approximation acts as a 'memory' and we can thus descend towards the optimum with very little overhead in measurement costs, which is confirmed by the steep decrease in Fig. 2. Note also that the overall cost of any optimisation algorithm is dominated by these later stages of evolution due to the fundamental shot-noise limit.

We show in Appendix B 7 that the Hessian is determined by our coefficients in Eq. (6) and similarly provides an O(δ^3) approximation of the local energy surface. Fig. 2 (blue) confirms that initially the Hessian-based Newton-Raphson approach is faster than gradient descent (steeper slope), but it then slows down when approaching the optimum, for three main reasons, as expected from ref. [28]: a) as opposed to analytic descent, we need to determine all the O(ν^2) coefficients to a relatively high precision for computing the inverse of the ill-conditioned Hessian; b) the measurement cost grows with the fourth power η^4 of a regularisation parameter η, and we therefore set η = 0.1 to keep costs practical, whereas an increased η reduces the convergence rate; c) we use the Hessian to determine a jump in parameter space, but this jump is taken with respect to a Euclidean geometry, as opposed to the relevant Riemannian geometry, which has substantial off-diagonal entries in the metric tensor [29, 30].

We also compare against sequential optimisation [34-36], whereby we repeatedly jump to the global minimum along single-parameter slices θ_k. The approach is initially faster than gradient descent (steeper slope). We note that the hyperparameters, in particular the sampling rate, of each technique have been specifically optimised such that a convergence criterion ∆E = 10^-4 can be reached. We therefore inevitably oversample in the early evolution and a left-to-right shift in Fig. 2 should be viewed as an artefact: in a low-precision setting, e.g., ∆E = 10^-2, sequential optimisation may even outperform the others [34-36].
We finally remark that quantum natural gradient has been shown in numerical studies to significantly outperform classical optimisers and to be less vulnerable to getting stuck in local optima [16, 29-31, 48, 49]. We numerically demonstrate in Appendix D 4 that analytic descent is further enhanced by taking into account the metric information, building a classical approximation of the quantum Fisher information [F_Q]_mn. In particular, the metric tensor entries can also be approximated classically, in analogy to Eq. (6), as a sum of trigonometric monomials multiplying real coefficients that we can estimate by computing overlaps between quantum states (Eq. (9)).

VII. CONCLUSION AND DISCUSSION
In this work we considered analytical characterisations of variational quantum circuits that are composed of Pauli gates; although exponentially many coefficients determine a full trigonometric expansion, we propose an efficient, approximate approach for characterising the ansatz landscape in the vicinity of any reference point.
We propose a novel optimisation technique: a quantum device is used to determine a classical approximation of the entire energy surface. A classical optimisation routine is then used in an internal loop to descend towards the minimum of this approximate surface. We have devised an exact, optimal measurement distribution strategy whereby the quantum computer is occasionally used to perform further targeted measurements to reduce shot noise in our classical model: we generally prove that asymptotically the measurement cost of an entire 'jump' in our approach corresponds to determining just a single gradient vector.
We numerically simulated practical problems and observed that indeed analytic descent significantly outperforms other techniques both in terms of the number of measurements and in terms of its convergence rate. We have made our efficient C implementation of the approach publicly available [43].
There are a number of apparent, promising extensions. First, we could use the information from previous iterations as a Bayesian prior when re-estimating our classical model in the next step. Second, we can similarly build a classical model of the quantum Fisher information matrix and compute it in the internal optimisation classically, without using the quantum device.
We note that our approach is completely general and can be applied to any Hamiltonian H, although we expect that increasingly complex Hamiltonians, such as those in quantum chemistry, might result in more complex energy surfaces which are more difficult to approximate classically. Nevertheless, a significant advantage of our approach is that in all cases it guarantees an approximation error of the gradient vector that scales with the fourth power of the distance from the reference point, as shown in Fig. 1(b). While our analytical approximation may be accurate for relatively large jumps δ, we have shown that its measurement cost relative to determining a gradient vector grows with the distance δ.
As such, the main limitation of the present approach is that in the early evolution it may be less beneficial to build a local approximation of the energy surface due to the increased sampling costs. The present work therefore motivates a hybrid approach whereby analytic descent complements other techniques: in the early stages of the evolution one may benefit from, e.g., applying a sequential optimisation [34-36] or natural gradient [29-31, 48, 49], while in the later stages one would switch to analytic descent. However, it is important to recognise that the bulk of the optimisation costs are absorbed by the later stages of the evolution. For example, in Fig. 2 we spend less than 10^9.5 shots in the early stages, while it takes an order of magnitude more, 10^10.5 shots, to reach our convergence criterion with standard gradient descent. As such, Quantum Analytic Descent could reduce the overall cost of optimisation by at least an order of magnitude, and this figure is further increased when using more advanced techniques for adaptively setting sampling rates.

ACKNOWLEDGMENTS
SCB acknowledges financial support from EPSRC Hub grants under the agreement numbers EP/M013243/1 and EP/T001062/1, and from the IARPA funded LogiQ project. BK and SCB acknowledge funding received from EU H2020-FETFLAG-03-2018 under the grant agreement No 820495 (AQTION). BK thanks the University of Oxford for a Glasstone Research Fellowship and Lady Margaret Hall, Oxford for a Research Fellowship. The numerical modelling involved in this study made use of the Quantum Exact Simulation Toolkit (QuEST), and the recent development QuESTlink [50] which permits the user to use Mathematica as the integrated front end. The authors are grateful to those who have contributed to both these valuable tools. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the view of the U.S. Army Research Office. Let us finally remark that the present technique has recently been extended to general quantum gates in ref. [51].

Appendix A: Expanding the ansatz circuit

Expanding a single quantum gate

Let us consider a single gate in the ansatz circuit U_k(θ_k), where k indexes its position and k ∈ {1, 2, . . . , ν}, with ν denoting the number of parameters. We assume that this gate is generated by a Pauli string P_k and ideally (when the gate is not noisy) it corresponds to the unitary operator

U_k(θ_k) = e^{-i θ_k P_k / 2} = cos(θ_k / 2) Id - i sin(θ_k / 2) P_k,   (A1)

where the second equality straightforwardly follows from the algebra P_k^{2n} = Id and P_k^{2n+1} = P_k. We now fix the parameter dependence of this gate at the reference point θ_0 and express the action of this gate on any quantum state using the continuous angle θ.
Let us first define the quantum gate as a mapping Φ_k(θ) : D → D over density operators, where D denotes the set of density operators, i.e., positive, unit-trace operators over the Hilbert space C^{2^N}. The gate can then be expressed as a general mapping over arbitrary density matrices ρ as the unitary conjugation U_k(θ_0 + θ) ρ U_k†(θ_0 + θ), and this can be expanded into the following transformations

U_k(θ_0 + θ) ρ U_k†(θ_0 + θ) = a(θ) ρ_ref - i b(θ) (P_k ρ_ref - ρ_ref P_k) + c(θ) P_k ρ_ref P_k.   (A3)

Here we have used the notation ρ_ref := U_k(θ_0) ρ U_k†(θ_0). We can now formalise Eq. (A3) by separating it into discrete mappings over density matrices which are multiplied by continuous functions that depend on the parameter θ as

Φ_k(θ)[ρ] = a(θ) Φ_ak[ρ] + b(θ) Φ_bk[ρ] + c(θ) Φ_ck[ρ].   (A4)

Here the mapping depends on the parameter θ via the Fourier components a(θ), b(θ), c(θ) : R → R, whose explicit forms we define as

a(θ) = (1 + cos θ)/2 = 1 - O(θ^2),   b(θ) = sin(θ)/2 = O(θ),   c(θ) = (1 - cos θ)/2 = O(θ^2),   (A5)

where we have also included their scaling when approaching θ → 0. Note that we have intentionally introduced the constant shift θ_0 and, of course, our definition corresponds to the action Φ_k(0)[ρ] = U_k(θ_0) ρ U_k†(θ_0) for the case θ → 0. The discrete mappings Φ_ak, Φ_bk and Φ_ck in Eq. (A4) can be specified via their action on arbitrary density matrices as

Φ_ak[ρ] = ρ_ref,   Φ_bk[ρ] = U_+ ρ U_+† - U_- ρ U_-†,   Φ_ck[ρ] = P_k ρ_ref P_k,   (A6)

where we have denoted U_+ := U_k(θ_0 + π/2) and U_- := U_k(θ_0 - π/2). We finally conclude by recollecting their forms in terms of the parametrised channel as

Φ_ak = Φ_k(0),   Φ_bk = Φ_k(π/2) - Φ_k(-π/2),   Φ_ck = Φ_k(π).   (A8)

We can use the above expressions to express any linear mapping, such as the energy functional E(ρ) : D → R, via the trace relation E(ρ) = Tr[H ρ], which is often referred to as an expectation value; here H is any Hermitian operator over the Hilbert space C^{2^N}. We now consider the parametric mapping E(θ) : R → R, which we define as

E(θ) := Tr[H Φ_k(θ)[ρ_0]],

and we refer to it as the energy function, or energy landscape. The reference state can be, e.g., the computational zero state ρ_0 := |0⟩⟨0|.
We can express the energy function explicitly via the following Fourier series

E(θ) = α_k + β_k cos θ + γ_k sin θ.

The Fourier coefficients α_k, β_k, γ_k ∈ R can be completely determined by discrete samples of the energy function via the discrete mappings of the density matrix as

α_k = [E(0) + E(π)]/2,   β_k = [E(0) - E(π)]/2,   γ_k = [E(π/2) - E(-π/2)]/2.

The above formula informs us that we can completely and analytically determine the full energy function E(θ) just by querying it at the four points (-π/2, 0, π/2, π). Of course, Nyquist's theorem also informs us that this is suboptimal, since the Fourier spectrum of E(θ) is bounded with only 3 frequency terms present (-1, 0, 1). This guarantees that querying the function E(θ) at only 3 points would be sufficient for completely reconstructing it. Note that, due to our definitions, the parameter θ is shifted by the constant θ_0 and, for example, E(0) corresponds to the energy at the reference point θ_0.
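The four-point reconstruction above can be verified for any function with this bounded Fourier spectrum; the sketch below assumes the explicit component forms a(θ) = (1 + cos θ)/2, b(θ) = sin(θ)/2, c(θ) = (1 - cos θ)/2.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, beta, gamma = rng.normal(size=3)
E = lambda t: alpha + beta * np.cos(t) + gamma * np.sin(t)   # any such slice

a = lambda t: (1 + np.cos(t)) / 2    # assumed explicit forms of the
b = lambda t: np.sin(t) / 2          # Fourier components a, b, c
c = lambda t: (1 - np.cos(t)) / 2

# Reconstruction from the four query points (-pi/2, 0, pi/2, pi):
recon = lambda t: (a(t) * E(0) + b(t) * (E(np.pi / 2) - E(-np.pi / 2))
                   + c(t) * E(np.pi))
for t in rng.uniform(-np.pi, np.pi, size=5):
    assert abs(E(t) - recon(t)) < 1e-12
```

The reconstruction is exact for every angle, in agreement with the fact that only three independent Fourier coefficients are present.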

Expanding the full ansatz circuit
Let us now consider the effect of the full ansatz circuit on the reference state ρ_0 := |0⟩⟨0| as U(θ_0 + θ) ρ_0 U†(θ_0 + θ), using the notation

U(θ_0 + θ) := U_ν(θ_{0,ν} + θ_ν) · · · U_1(θ_{0,1} + θ_1).   (A12)

Here θ_0 ∈ R^ν is a vector that represents a fixed, constant shift of the parameters, while the circuit depends continuously on the parameters θ ∈ R^ν.
Using results from the previous subsection, we can build an analytical model of the superoperator representation Φ(θ) of the full ansatz circuit as the mapping

Φ(θ) = ∏_{k=1}^{ν} [ a(θ_k) Φ_ak + b(θ_k) Φ_bk + c(θ_k) Φ_ck ].   (A13)

The above equation expresses the full ansatz circuit and its dependence on the parameters θ. Of course, fully expanding the above expression would result in a sum of 3^ν different terms. Nevertheless, we expand this into a sum and truncate the expansion such that the remaining terms are correct up to an error O(sin^3 δ). For this we define δ := ‖θ‖_∞ to denote the absolute largest entry in the vector θ. We assume that the continuous parameters are only used to explore the vicinity of the reference point θ_0 in parameter space via a sufficiently small δ. This is the case, e.g., when the reference parameters θ_0 are already a good approximation of the optimal ones, i.e., ‖θ_0 - θ_opt‖_∞ < δ with δ ≪ 1, and we search for the ground-state energy by optimising the continuous parameters.
Let us now derive our approximation: we first substitute the explicit forms of the trigonometric functions into the expression above and expand this product into a sum of 3^ν terms. We drop all terms that contain a product of 3 or more sin(θ_k) factors, thereby obtaining an approximate mapping that is correct up to O(sin³ δ). Here the functions A(θ), B_k(θ), C_k(θ) and D_kl(θ) absorb the dependence on the parameters θ, while the Φ are superoperators corresponding to discrete mappings. We compute the explicit forms of the terms appearing in the summation in Eq. (A14); the discrete mappings can be further simplified using Eq. (A8), where v_k ∈ R^ν denotes the k-th standard basis vector, e.g., (0, 0, . . . 0, 1, 0, . . . 0). We further remark that, due to our definitions, the parameters θ are shifted by the constant vector θ_0 and, for example, Φ(0)ρ_0 = U(θ_0)ρ_0 U†(θ_0).
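A hedged numerical sketch of such monomials is given below. We assume the half-angle product forms A(θ) = Π_i cos²(θ_i/2), B_k(θ) = sin(θ_k/2)cos(θ_k/2) Π_{i≠k} cos²(θ_i/2), C_k(θ) = sin²(θ_k/2) Π_{i≠k} cos²(θ_i/2) and the analogous D_kl; these are consistent with ∂B_m/∂θ_m|_{θ=0} = 1/2 used later, but the exact normalisation should be checked against Eq. (A14).

```python
import numpy as np

def monomials(theta):
    """Trigonometric monomials A, B_k, C_k, D_kl of the O(sin^3 δ) model.
    Assumed forms: a product of cos^2(θ_i/2) factors, with up to two factors
    replaced by sin(θ/2)cos(θ/2) or sin^2(θ/2).
    (Assumes θ_k != ±π so that cos^2(θ_k/2) != 0.)"""
    theta = np.asarray(theta, dtype=float)
    c2 = np.cos(theta / 2.0) ** 2                    # cos^2(θ_i/2)
    sc = np.sin(theta / 2.0) * np.cos(theta / 2.0)   # sin(θ_i/2) cos(θ_i/2)
    s2 = np.sin(theta / 2.0) ** 2                    # sin^2(θ_i/2)
    A = np.prod(c2)
    # Replace the k-th cos^2 factor by sin*cos (B_k) or sin^2 (C_k).
    B = np.array([A / c2[k] * sc[k] for k in range(len(theta))])
    C = np.array([A / c2[k] * s2[k] for k in range(len(theta))])
    # Replace two factors (k < l) by sin*cos each (D_kl).
    D = {(k, l): A / (c2[k] * c2[l]) * sc[k] * sc[l]
         for k in range(len(theta)) for l in range(k + 1, len(theta))}
    return A, B, C, D
```

At θ = 0 only A survives (A = 1, all B, C, D vanish), consistent with Φ(0)ρ_0 = U(θ_0)ρ_0 U†(θ_0) above.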
We can quantify the error of the approximate mapping in Eq. (A14) via the trace distance of the resulting density operators, which we express as ‖Φ(θ)ρ − Φ̃(θ)ρ‖_tr = O(sin³ δ). We remark that our expansion in Eq. (A14) consists of a sum of 1 + ν + ν²/2 different terms and describes the variational mapping up to an error O(sin³ δ). We could similarly expand the mapping into a sum of O(ν³) terms with an error O(sin⁴ δ), or beyond. In general, we thus obtain a family of approximations to the energy landscape: discarding all terms that contain a product of q or more sin θ_k factors yields an approximation as a sum of O(ν^{q−1}) terms with an approximation error O(sin^q δ).

Appendix B: Approximating the full energy surface locally
We can express the full energy surface following our definition in the previous section by evaluating the discrete mappings, which can in turn be expressed as queries of the energy function at discrete points in parameter space. Here v_k ∈ R^ν denotes the k-th standard basis vector, e.g., (0, 0, . . . 0, 1, 0, . . . 0). Note that, due to our definitions, the parameters θ are shifted by the constant vector θ_0. Using the above expressions, one can determine an O(sin³ δ) approximation of the full energy surface by querying the energy function E(θ) at a total number of Q points, where Q = 1 + ν + 2ν + 4(ν²/2 − ν) = 2ν² − ν + 1. (B2)

Expressing the gradient analytically
We now derive the dependence of the gradient-vector components g_m := ∂_m E(θ) on the parameters θ using our approximation from Eq. (B1). We first compute the derivatives of the single-variate functions from Eq. (A5), and then the partial derivatives of the monomials; note that for the mixed terms we do not assume that m > k. One can therefore compute the full gradient vector analytically, up to an error O(sin² δ), via the monomials A(θ), B_k(θ), C_k(θ) and D_kl(θ) and the corresponding energy coefficients. These coefficients can be determined by querying the energy function at O(ν²) points, as discussed in Sec. B. We propose an efficient classical algorithm for computing this gradient vector with computational complexity O(ν³); refer to Sec. D 1.
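Closed-form monomial derivatives of this kind can be checked against central finite differences; the sketch below does this for ∂B_k/∂θ_m under the assumed form B_k(θ) = sin(θ_k/2)cos(θ_k/2) Π_{i≠k} cos²(θ_i/2) (illustrative, not necessarily the paper's exact normalisation).

```python
import numpy as np

def B(theta, k):
    """Assumed monomial B_k(θ) = sin(θ_k/2) cos(θ_k/2) · Π_{i≠k} cos²(θ_i/2)."""
    theta = np.asarray(theta, dtype=float)
    rest = np.prod([np.cos(t / 2) ** 2 for i, t in enumerate(theta) if i != k])
    return np.sin(theta[k] / 2) * np.cos(theta[k] / 2) * rest

def dB(theta, k, m):
    """Closed-form partial derivative ∂B_k(θ)/∂θ_m."""
    theta = np.asarray(theta, dtype=float)
    if m == k:
        # sin(θ/2)cos(θ/2) = sin(θ)/2, hence d/dθ gives cos(θ)/2.
        rest = np.prod([np.cos(t / 2) ** 2 for i, t in enumerate(theta) if i != k])
        return 0.5 * np.cos(theta[k]) * rest
    # d/dθ_m cos²(θ_m/2) = -sin(θ_m)/2.
    rest = np.prod([np.cos(t / 2) ** 2
                    for i, t in enumerate(theta) if i not in (k, m)])
    return np.sin(theta[k] / 2) * np.cos(theta[k] / 2) \
        * (-0.5 * np.sin(theta[m])) * rest
```

At θ = 0 this reproduces ∂B_m/∂θ_m = 1/2 and ∂B_k/∂θ_m = 0 for m ≠ k, the leading-order structure used in the variance analysis below.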

FIG. 3. (left) This verifies our analytical expression derived in Sec. B 3, which we have numerically exactly computed using our efficient C code [43]. (right) We compute the exact expression for the function T(θ) as defined in Eq. (B9) using our efficient C code and compare it to the analytical approximation in (B14), obtaining the expected O(δ²) error term. The analytical approximation in (B14) is used to derive the scaling of the measurement cost of the analytic descent approach.

Error propagation and variances
Using the usual linear error-propagation formula, the variance of the gradient estimator can be computed term by term. Here the variances, such as Var[E^(A)] = Var[E(0)], are directly proportional to the precision of the energy estimation; this variance scales inversely with the number of times the energy estimator is sampled. Now, using the scaling of the multivariate functions from Sec. B 1, we can expand the above variance into a leading term [∂B_m(θ)/∂θ_m]² = O(1) and into terms that scale with δ. As long as the norm ‖θ‖∞ < δ is sufficiently small, the variance of the gradient vector is dominated by the variances Var[E^(B)_k]. This means that, even though one has to query the energy function at O(ν²) points, most of these queries need not be very precise. In fact, the variance of the gradient component is dominated by the precision of the O(ν) queries used to determine the coefficients E^(B)_k. Let us now derive an optimal measurement strategy that confirms these expectations.
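The linear error-propagation step itself can be sketched generically: for an estimator that is linear in noisy coefficients x_i with weights c_i, Var[Σ_i c_i x̄_i] = Σ_i c_i² Var[x_i]/N_i when x̄_i is the mean of N_i shots. The weights and variances below are placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative weights c_i (e.g., monomial derivatives at some θ) and
# single-shot variances Var[x_i] of the coefficient estimators.
c = np.array([0.5, 0.1, -0.3])
var = np.array([1.0, 4.0, 2.0])
n_shots = np.array([400, 100, 200])   # shots spent on each coefficient

# Linear error propagation: Var[Σ_i c_i x̄_i] = Σ_i c_i² Var[x_i] / N_i.
predicted = np.sum(c**2 * var / n_shots)

# Empirical check: draw the sample means x̄_i directly and combine them.
means = rng.normal(0.0, np.sqrt(var / n_shots), size=(200_000, 3))
empirical = (means @ c).var()
```

The Monte Carlo estimate agrees with the propagated variance to within sampling error, which is the formula applied component-wise to the gradient estimator above.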

Optimal measurement distribution
Using techniques from [28] we now derive an optimal measurement strategy for estimating coefficients in our analytical approximation of the gradient vector; refer also to [52]. Let us define the precision ε² of determining the full gradient vector via the expected Euclidean distance from the mean. We can express this precision explicitly using the above variance-propagation formula, and simplify the result by introducing abbreviations for the trigonometric polynomials, through which we express the precision of determining the full gradient vector. We confirm the validity of this error-propagation formula in Fig. 3. Notice that the resulting expression is a sum over nonnegative terms of the form c_i² Var[x_i]/N_i, where i runs over an index set I, the x_i are statistical variables that correspond to coefficients in our analytical approximation, and the coefficients c_i are given by, e.g., B_k. We can reduce ε² by increasing the number of measurements used to determine, e.g., E^(B)_k. In the following, the variance Var[x_i] denotes the variance of a single measurement. We distribute overall N measurements optimally by assigning N_i measurements to estimating the mean of each individual variable x_i. Indeed, the overall number of measurements is determined as N = T²/ε², and in this optimally distributed scheme the reduced individual variances are Var[x_i]/N_i. Let us note that the quantity T, which determines our measurement cost, depends on the parameters θ; for example, at θ = 0 we exactly recover the measurement cost of the gradient vector. Here we have used that ∂B_m(θ)/∂θ_m|_{θ=0} = 1/2 and, via N_grad = T²_grad/ε², we exactly obtain the measurement cost of determining the gradient vector using parameter-shift rules.
Let us summarise these results in the following theorem.

Theorem 1. Let us denote the variances of the single-measurement energy estimators as, e.g., Var[E^(C)_k]. In order to determine the full gradient vector to a precision ε² := Σ_{m=1}^{ν} Var[∂_m E(θ)], we need to distribute overall N = T²/ε² measurements, where T is defined in Eq. (B9). When optimally distributed, estimating each coefficient in our analytical approximation requires a number of measurements, e.g., N^(A) for the coefficient E^(A).

It is important to recognise that these measurement numbers, e.g., N^(A), depend on the parameters θ; however, this dependence is completely absorbed by the trigonometric polynomials, e.g., A(θ). It follows that we do not even need to know the coefficients, e.g., E^(A), explicitly or exactly (only their variances, which are significantly cheaper to estimate) in order to determine how many measurements are required to estimate the gradient vector via the analytic descent approach to a given precision. We provide an efficient C code that exactly computes these trigonometric polynomials [43].
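An optimal allocation of this Lagrange-multiplier type takes the familiar form N_i ∝ |c_i| σ[x_i] with T = Σ_i |c_i| σ[x_i]; the sketch below verifies that N = T²/ε² then achieves the target variance exactly. Variable names are illustrative, not the paper's notation.

```python
import numpy as np

def optimal_shots(c, sigma, epsilon):
    """Distribute measurements for a linear estimator with weights c over
    coefficients with single-shot standard deviations sigma, so that the
    total estimator variance equals epsilon^2 (N_i ∝ |c_i|·sigma_i)."""
    c = np.abs(np.asarray(c, dtype=float))
    sigma = np.asarray(sigma, dtype=float)
    T = np.sum(c * sigma)                 # analogue of the quantity T
    N_i = c * sigma * T / epsilon**2      # optimal per-coefficient shot counts
    return N_i, T

c = np.array([0.5, 0.1, 0.3])       # illustrative weights c_i
sigma = np.array([1.0, 2.0, 1.5])   # illustrative single-shot std devs
eps = 1e-2
N_i, T = optimal_shots(c, sigma, eps)

total = N_i.sum()                                 # should equal T^2 / eps^2
achieved_var = np.sum(c**2 * sigma**2 / N_i)      # should equal eps^2
```

That this allocation is optimal follows from the Cauchy-Schwarz inequality: any other split of the same total N gives a strictly larger Σ_i c_i² σ_i²/N_i.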

Measurement cost as a function of distance from the reference point
While our classical algorithm evaluates the exact coefficients from Eq. (B8) via the efficient C code [43], here we obtain local approximations of these coefficients in order to compare the measurement cost of the analytic descent approach to that of existing techniques in general terms. In Fig. 2 of the main text we plot the exact relative measurement cost N/N_grad (which is upper bounded via Theorem 2) as a function of the classical iterations. In the initial evolution δ is relatively large and analytic descent is therefore expensive. However, as we approach the optimum δ shrinks; the measurement overhead decreases and is guaranteed to vanish asymptotically. Sudden jumps in the plot indicate positions where we abort the internal classical loop and re-determine our classical approximation at the new reference point, which costs exactly the same as determining the gradient vector. Red corresponds to the recompilation problem while black corresponds to the spin-ring Hamiltonian.
We also expand the terms B_k and C_k, and finally the terms D_kl. Similarly, we can compute square roots of the above terms using the series expansion √(a+b) ≈ √a + b/(2√a), valid for b smaller than a. Let us now substitute these approximations back into the expression for T, where we have introduced the abbreviation σ[·] := √Var[·]. We verify the validity of this analytical approximation in Fig. 3, and our numerical simulations confirm that the dominant error term scales as O(δ²).

Upper bounding single-shot variances and relative costs
Let us now state our result on bounding the measurement cost of analytic descent relative to the cost of a gradient evaluation.

Theorem 2. Let S²_max denote an upper bound on the single-shot variance of the energy estimator, relative to the minimal single-shot variance S²_min of determining a full gradient vector, where S²_min is defined in Eq. (B19). The measurement cost of the analytic descent approach relative to determining a single gradient vector to the same precision is generally upper bounded such that the relative measurement cost scales as 1 + O(δν) for small displacements.

This also guarantees that asymptotically, when approaching the optimum and therefore δ → 0, the cost of analytic descent is the same as the cost of determining a gradient vector.
Proof. Let us first introduce an upper bound Var[E(θ)] ≤ S²_max on the single-shot variance when estimating the energy. This inequality provides a convenient, explicit formula in the specific case when we can express the expected value E(θ) = Tr[Hρ(θ)] via a Hamiltonian H = Σ_{k=1}^{r_H} c_k P_k that decomposes into Pauli strings P_k. The above upper bound establishes that, given the Pauli decomposition of the Hamiltonian (as is typical in practice), we can generally upper bound the single-shot variance in terms of the coefficients c_k [28]. Note that S²_max typically grows polynomially (via the number of Pauli terms r_H) with the number of qubits, and S²_max can be significantly reduced by optimally distributing measurements or by simultaneously measuring commuting Pauli terms via advanced techniques [44]. We illustrate the upper bound and the actual single-shot energy variances in Fig. 4.
Using the single-shot variance upper bound above, it follows that the standard deviations of estimating our coefficients, e.g., σ[E^(D)_kl], are upper bounded as σ[E^(D)_kl] ≤ 2S_max. It may be useful in practice that we can thereby explicitly upper bound the measurement cost T from Eq. (B14). Instead of upper bounding the cost of analytic descent directly, let us now derive an explicit upper bound on the measurement cost of the analytic descent approach relative to the cost of determining a gradient vector.
For this reason we first compute the ratio T/T_grad, using the explicit form of T_grad. Here we have defined a minimal single-shot variance S²_min as an average over parameter shifts, where v_m are the standard basis vectors in parameter space. This minimal single-shot variance is generally lower bounded as S²_min ≥ min_θ Var[E(θ)]; note, however, that the variance of the energy measurement can vanish, Var[E(θ)] → 0, e.g., when approaching an eigenstate of a diagonal problem Hamiltonian. For such systems we need our general definition of S_min, which indeed cannot vanish: S_min > 0 except for trivial problem definitions that are not relevant in practice, e.g., when all quantum gates in the ansatz, the problem Hamiltonian and the quantum state ρ are diagonal in the same basis.
In the numerator of Eq. (B18) the individual terms can be upper bounded, and the measurement cost of the analytic descent approach relative to determining the gradient vector follows, where we have introduced the abbreviation S := S_max/S_min for the ratio of the upper and lower bounds.

Symmetry of the energy surface around the optimum
At a local optimum the gradient vanishes: g_m = 0 for m ∈ {1, . . . , ν}. We set θ_0 := θ_opt and therefore θ = 0. In the explicit form of the leading terms of the energy surface we have used that E^(B)_k = 0, due to g_k = 0. We now make two observations which pose strict constraints on the geometry of the energy surface around local optima. First, the energy function in this case is (approximately) reflection symmetric, E(θ) ≈ E(−θ), due to the reflection symmetry of the basis functions A(θ), C_k(θ) and D_kl(θ). Second, any single-parameter slice of the energy function is just a shifted cosine function, which can be written as a + b cos(θ_k) with a = (E^(A) + E^(C)_k)/2 and b = (E^(A) − E^(C)_k)/2.

Relation to the Hessian matrix and to a Taylor expansion
One can show that the coefficients used to determine our approximation of the energy surface are related to partial derivatives of the energy surface. In particular, the gradient-vector component g_m from Eq. (B3) can be expressed exactly at the point θ. This is the well-known parameter-shift rule, which estimates the gradient by sampling the energy function at two different points [37].
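The parameter-shift rule can be sketched on the single-parameter Fourier form E(θ) = α + β cos θ + γ sin θ, for which g(θ) = ½[E(θ + π/2) − E(θ − π/2)] is the exact derivative (the toy coefficients below are illustrative).

```python
import numpy as np

def param_shift_grad(E, theta, shift=np.pi / 2):
    """Parameter-shift rule: exact derivative for energy functions whose
    Fourier spectrum is restricted to frequencies {-1, 0, 1}."""
    return 0.5 * (E(theta + shift) - E(theta - shift))

# Toy single-parameter energy with the assumed bounded spectrum.
E = lambda t: 0.3 + 0.7 * np.cos(t) - 0.2 * np.sin(t)
dE = lambda t: -0.7 * np.sin(t) - 0.2 * np.cos(t)   # exact derivative
```

Unlike a finite-difference quotient, no step size tends to zero here: the rule is exact for any function with this bounded spectrum, which is why it is robust to shot noise.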
The mixed second partial derivatives can similarly be expressed exactly using Eq. (B1). To conclude, we can express elements of the gradient vector and elements of the Hessian matrix explicitly.
As such, when not considering shot noise, we require the same quantum resources to determine both the analytic descent approximation and the well-known Taylor expansion of the energy surface, which has the same asymptotic scaling in δ as the analytic descent approach. Let us now explain why the analytic descent approach may nevertheless be preferable. First, the Taylor expansion is a quadratic polynomial in the variables θ_k, i.e., it contains no degree-3 contribution. In contrast, the analytic descent approximation is an infinite-degree polynomial in the variables θ_k, i.e., an analytic function, as it is composed of trigonometric functions such as cos(θ_k). Even though in the limit θ_k → 0 for all k both approaches asymptotically have the same approximation errors, in practice one always uses the approximations at finite, non-vanishing parameters θ_k. We compare approximation errors of analytic descent and of the Taylor expansion in Fig. 5 (left). Indeed, our trigonometric expansion typically gives a better approximation of the energy surface (the majority of points lie above the red line) and in some cases the approximation errors are about 2 orders of magnitude smaller. Let us consider a specific example that nicely illustrates how our trigonometric approximation outperforms the Taylor expansion. Assume that we move away from the reference point only along a single variable θ_k. In this case our trigonometric series is exact for arbitrarily large θ_k, whereas the Taylor expansion breaks down and its error grows without bound: in the extreme scenario θ_k = 10^10 the Taylor approximation error can be as large as 10^10 while our trigonometric series remains exact. The argument approximately holds even when we move along every parameter but a few components dominate, e.g., θ_k, θ_{k+1}, etc. This can often happen in practice.
This illustrates that while the Taylor expansion only captures the energy surface locally, analytic descent also captures some of its global features.
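The point can be illustrated on a single cosine term: a second-order Taylor expansion of cos θ degrades away from the reference point, while the trigonometric form is exact along that axis. This is a toy comparison, not the full ν-parameter model.

```python
import numpy as np

def taylor2_cos(t):
    """Second-order Taylor expansion of cos θ around θ = 0."""
    return 1.0 - t**2 / 2.0

# Along a single-parameter slice the trigonometric model is cos θ itself,
# so its error vanishes for any θ, while the Taylor error grows without bound.
thetas = np.array([0.1, 1.0, 3.0])
taylor_err = np.abs(taylor2_cos(thetas) - np.cos(thetas))
trig_err = np.zeros_like(thetas)   # exact along a single axis
```

Near the reference point both errors are small, but already at θ ≈ 3 the quadratic model is off by order one, matching the discussion of large single-parameter displacements above.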
Of course, in our optimisation algorithm we do not actually use the approximation of the energy surface itself, but rather the resulting analytical gradient vector. It is therefore more meaningful to compare how well the gradient vector is approximated by the two techniques. Our optimisation technique is compatible with a (linear) Taylor expansion, as its analytical gradients could equally be used in our algorithm. When not considering shot noise, the resulting approach would require the same quantum resources as the trigonometric series, i.e., the same number of coefficients to be determined, but reduced classical computational resources. Nevertheless, since we assume that the classical computational resources required for computing the trigonometric series are free, we prefer to use the trigonometric series. In Fig. 5 we compare these gradient approximation errors and find that the superiority of the trigonometric series is even more pronounced: analytic descent almost always outperforms the Taylor expansion, as almost all dots lie above the red line.

Appendix C: Expanding the metric tensor entries
It was shown in [30] that the quantum Fisher information matrix can be approximated by the scalar product of state derivatives, a relation that becomes exact in the limit of pure states.
Here we have denoted ρ(θ) := Φ(θ)ρ_0. We can straightforwardly express the partial derivatives via the partial derivative of the mapping, which we aim to express explicitly using our approximate mapping Φ̃(θ) from Eq. (A14). We can compute the derivative analytically; note that this expression is directly analogous to the gradient vector from Eq. (B3), and we have defined the partial derivatives of the monomials, such as ∂A(θ)/∂θ_m, in Sec. B 1. Expanding the quantum Fisher information to leading terms only, we obtain the approximation below. We do not write out all the terms explicitly for clarity; however, they could be computed straightforwardly.
Similarly as before, we have monomials that completely absorb the continuous dependence on the parameters θ, and their explicit forms can be computed accordingly. These functions multiply the coefficients, e.g., F_BB, which can be computed via the discrete transformations.

FIG. 6. Classically computing the gradient vector using our efficient C code [43]. Execution times estimated on a laptop for an increasing number of parameters ν confirm the theoretical complexity O(ν³) from Sec. D 1. Descending 1000 steps towards the minimum of the classical function can be performed in a matter of minutes for up to many hundreds of parameters. Our code was executed on a single thread, but the algorithm described in Sec. D 1 could be parallelised.
The coefficients can therefore be estimated by estimating overlaps between states such as ρ(0) and ρ((π/2)v_k). These can be straightforwardly estimated using SWAP tests or, in the case of pure states, using Hadamard tests. We compute the full product of single-variate functions once and store it. All other monomials are obtained from this stored product just by dividing it by, e.g., a(θ_k), and then multiplying it with, e.g., ∂b(θ_k)/∂θ_k, components which we have already precomputed. For example, the monomial ∂B_k(θ)/∂θ_m is obtained in this way when k = m, and note that we have already precomputed all components in this expression. In conclusion, evaluating all νQ = O(ν³) basis functions in the gradient vector for a given input vector θ can be done in O(ν³) time and requires O(ν²) storage. We have estimated execution times of our C implementation from [43], which confirm this theoretical complexity; refer to Fig. 6.

Classical algorithm for computing the optimal measurement distribution
Recall from Sec. B 3 that the gradient variance can be expressed in terms of A, B_k, C_k and D_kl, trigonometric polynomials that depend on the parameters θ. The variances, such as Var[E^(A)], are proportional to single-shot variances of estimating the energy expectation values; we therefore assume that the experimentalist has explicit knowledge of these (they can be estimated efficiently experimentally). The optimal measurement distribution can then be obtained via Theorem 1 by explicitly computing the above trigonometric polynomials. The analytical forms of these trigonometric polynomials are defined in Eq. (B13): they are sums of squares of the monomial derivatives, e.g., ∂B_k(θ)/∂θ_m, over the index m. As such, we only need to sum up the squares of these trigonometric monomials, which our classical algorithm from Sec. D 1 computes in O(ν³) time. In summary, we require O(ν³) time and O(ν²) storage for determining the optimal measurement distribution. We have made our efficient C code available online [43].

Simulations in main text
In Fig. 2 of the main text we considered two specific problems that aim to find the ground state of a Hamiltonian. In the first case we consider a recompilation problem whereby we aim to recompile a 4-qubit quantum circuit C_4[SWAP_12]C_4[SWAP_23] that contains two consecutive controlled-SWAP operators, as relevant in the context of error mitigation [8,[40][41][42]]. The recompilation requires an 8-qubit circuit overall, which is initialised by entangling every qubit in the 4-qubit register with qubits in an ancillary 4-qubit register, as described in [45]. Our 4-qubit ansatz circuit consists of 124 parametrised quantum gates, as illustrated in Fig. 7, and we aim to optimise the parameters of this circuit such that the ground state of the Hamiltonian −Σ_{k=1}^{8} Z_k is found. In the second scenario we consider a spin-ring Hamiltonian, in which we identify site N+1 with site 1 (periodic boundary) and X, Y and Z are Pauli matrices. We randomly generate −1 ≤ ω_i ≤ 1 and set J = 0.1. Our 8-qubit ansatz consists of 104 parametrised quantum gates, as illustrated in Fig. 7.
We simulate four different optimisers and estimate the level of quantum resources required to reach a certain precision with respect to the exact ground-state energy. For this purpose we determined the optimal set of parameters θ_opt and initialise the optimisation in its vicinity: we disturb the optimal parameters θ_k by adding uniformly randomly generated numbers in the range (−0.05, 0.05). We estimate measurement costs by assuming that a single call to the quantum subroutine determines a coefficient, such as E^(B)_k, with single-shot variance Var[E^(B)_k] = 1, and we count the overall number of calls N_s at every iteration. We simulate shot noise in all cases by adding Gaussian-distributed random numbers to the coefficients; the standard deviation is related to the number of shots N_E used to estimate a single coefficient as σ_E = 1/√N_E. Let us now detail how we set the hyperparameters of each optimisation technique such that they can all consistently reach a precision ∆E = 10⁻⁴ in determining the ground-state energy.
Simple gradient descent: Recall that the simple gradient-descent update rule is θ_{k+1} = θ_k − λg, where g is the gradient vector and we refer to λ as the step size. We set the largest stable step size 0.2 and fix the precision of determining the full gradient vector via the Euclidean distance from the mean. Recall that gradient descent is guaranteed to converge in principle under an arbitrarily small precision [53]; we therefore set a relatively low, constant precision ε² = 10⁻⁵ such that the evolution approaches the convergence criterion at an approximately uniform rate. We use the parameter-shift rule from Eq. (B23), and thus the measurement cost of a single iteration can be computed via Eq. (B11) as N_s = ε⁻²T²_grad = ε⁻²ν²/4, given the single-shot variance Var[E^(B)_k] = 1. Note that while the measurement cost of a single iteration is N_s, the number of shots to determine one of the E^(B)_k coefficients is N_E = N_s/ν = ε⁻²ν/4, i.e., for the spin-ring Hamiltonian we used N_E = 10^6.41 shots, while for the recompilation problem we used N_E = 10^6.49 shots per coefficient. One can certainly increase the efficiency of simple gradient descent by adaptively setting the gradient precision [54] or by using ADAM or SPSA variants. However, we stress that in order to compare vastly different optimisation techniques and their convergence rates, we decided to set a constant precision. Of course, all other techniques would also benefit from more advanced adaptive strategies, but this is beyond the scope of the present work.
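The update loop above can be sketched on a toy two-parameter landscape, combining the parameter-shift gradient with θ ← θ − λg; the landscape and step size here are illustrative, not the simulated Hamiltonians.

```python
import numpy as np

def energy(theta):
    """Toy separable landscape with Fourier spectrum {-1, 0, 1} per axis."""
    return -np.sum(np.cos(theta))          # minimum at theta = 0 (mod 2π)

def grad_param_shift(theta):
    """Gradient via the parameter-shift rule, one pair of queries per axis."""
    g = np.empty_like(theta)
    for m in range(len(theta)):
        v = np.zeros_like(theta)
        v[m] = np.pi / 2
        g[m] = 0.5 * (energy(theta + v) - energy(theta - v))
    return g

theta = np.array([0.4, -0.3])
lam = 0.2                                   # step size λ
for _ in range(200):
    theta = theta - lam * grad_param_shift(theta)
```

With 2ν energy queries per iteration, the cost of many such iterations is exactly what analytic descent amortises by jumping within a classical model instead.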
Analytic descent: We set a small step size of 0.01 such that our classical gradient-descent optimisation follows a smooth evolution path, i.e., we assume that classical computation is free. Furthermore, we use the same relatively low precision ε² = 10⁻⁵ for determining the classical approximation of the gradient vector as in the case of simple gradient descent. As such, the optimally distributed measurement cost of determining our classical approximation at a reference point θ_0 is exactly the same as in the case of simple gradient descent, N_s = ε⁻²ν²/4. This measurement cost increases slightly as we move away from the reference point, since we then need to collect further samples using the quantum computer via our optimally distributed measurement scheme, as illustrated in Fig. 4(right). The measurement cost is a function of the parameters, N_s := N_s(θ), and we approximate the overall cost of a single iteration as the maximum of this function. We find that in the early evolution, where analytic descent is less beneficial, this measurement cost is at most a factor of 10 more expensive than the cost of determining a single gradient vector to the same precision, since there the optimiser takes large jumps and the measurement cost grows with the size of the jump. These measurement costs are explicitly shown in Fig. 4(b). It is also evident from Fig. 4(b) that in the later evolution the cost of an analytic-descent iteration is only a small factor ≤ 2 more expensive than determining a gradient vector. Note that in the case of analytic descent the number of shots N_E used to determine individual coefficients, e.g., E^(C)_k, is distributed optimally via Theorem 1 and thus cannot directly be compared to the sampling rates of the other techniques.
In the case of analytic descent we abort the internal classical optimisation loop if the exact energy, as determined via the quantum computer, has increased. In a later section we explore an abort condition based on our similarity measure f.
Hessian-based optimisation: We determine the Hessian matrix from Eq. (B27) and the gradient vector from Eq. (B23) via the parameter-shift rules by estimating the coefficients, e.g., E^(D)_kl. We apply the inverse of the Hessian to the gradient vector to update our parameters. We have proposed an optimal measurement-distribution scheme in ref. [28] that is applicable to Hessian-based optimisations: the measurement costs grow with the fourth power of a regularisation parameter, which we set to η = 0.1. We therefore determine the coefficients using a fixed number of shots N_E = 10⁵ to populate the Hessian matrix, and we determine coefficients using N_E = 10⁷ shots to populate the gradient vector; the latter sampling budget is comparable to the case of gradient descent, albeit slightly higher, such that the evolution remains stable until reaching the convergence criterion [28]. We use a simple Tikhonov regularisation as discussed in ref. [28]. Note that a significant advantage of analytic descent is that it does not require a matrix inversion.
Sequential optimisation: We consider the sequential optimisation techniques introduced in refs. [34][35][36]. As such, we determine and jump to the global minimum of the energy along a single parameter slice θ_k via an update rule determined by coefficients in our analytical approximation, where arctan(·, ·) is the 2-argument arctangent. We set the number of shots N_E = 10⁸ for evaluating the coefficients, such as E^(C)_k, such that the evolution can reach our convergence criterion. This inevitably oversamples in the early evolution and, of course, one could adaptively set the precision. We stress again, however, that all other approaches would likewise benefit from optimally setting sampling rates throughout the evolution, as already discussed in the specific case of gradient descent. Nevertheless, we use a constant sampling rate in order to be able to compare vastly different optimisation techniques and their convergence rates. As such, a left-to-right shift in Fig. 2 in the main text should be viewed as an artefact of our choice.

FIG. 8. Sequential optimisation of the 8-qubit spin-ring Hamiltonian (residual energy as a function of the cumulative total number of measurements), determining the coefficients, e.g., E^(C)_k, using N_E = 10^6.41 shots instead of N_E = 10⁸. The approach becomes unstable before approaching our convergence criterion ∆E = 10⁻⁴, even though initially it outperforms the others.
We also note that it was necessary to choose a sampling rate N_E larger than in the case of gradient descent, where N_E = 10^6.41 sufficed. We repeated our simulation of sequential optimisation with N_E = 10^6.41 and indeed Fig. 8 confirms that the evolution (orange dots) can become unstable before we approach our convergence criterion. It is also evident that sequential optimisation may be favourable in the early evolution; note, however, that the overall cost of the optimisation is dominated by its later stages.
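The single-slice jump can be sketched as follows: with the slice E(θ_k) = a + b cos θ_k + c sin θ_k fitted from three queries, its global minimum lies at θ_k = π + arctan(c, b). This reproduces the rotosolve-style rule of refs. [34-36] in spirit; the function and variable names are illustrative.

```python
import numpy as np

def sequential_step(E, theta, k):
    """Jump to the global minimum of E along the single parameter slice θ_k.
    Fits a + b*cos(t) + c*sin(t) on the slice from three queries, then uses
    that the minimum of b*cos(t) + c*sin(t) lies at t = π + arctan2(c, b)."""
    theta = theta.copy()
    base = theta[k]
    def slice_E(t):
        theta[k] = base + t
        return E(theta)
    e0, ep, em = slice_E(0.0), slice_E(np.pi / 2), slice_E(-np.pi / 2)
    a = 0.5 * (ep + em)
    b = e0 - a
    c = 0.5 * (ep - em)
    theta[k] = base + np.pi + np.arctan2(c, b)
    return theta
```

Sweeping k = 1, …, ν and repeating constitutes the sequential optimiser; each jump is exact along its slice (up to shot noise in the three queries), which explains the strong early performance noted above.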

Analytic Descent with Quantum Natural Gradient
Let us now show that our approximation of ansatz circuits can be used to obtain a classical model for how the quantum Fisher information matrix F_Q of the variational states ρ(θ) := Φ(θ)ρ_0 depends on the continuous parameters θ. F_Q reduces to other notions in special cases, such as the Fubini-Study metric tensor, and has been used extensively, e.g., in variational simulation and natural-gradient optimisation [16,[29][30][31]48]. This metric tensor was first proposed in the context of variational quantum algorithms in [16] and has been used to improve the convergence speed and accuracy of optimisations as well as to avoid local minima [29-31, 48, 49]. A general approach for optimising arbitrary quantum states was proposed in [30] via the quantum Fisher information matrix; a general approximation for noisy quantum states can be estimated via SWAP tests as [F_Q]_mn = 2Tr[(∂_m ρ(θ))(∂_n ρ(θ))]. Indeed, in the limit of pure states, entries of this metric tensor can be estimated using Hadamard tests [16,29,48]. We have derived the approximation of the general matrix elements [F_Q]_mn in Appendix C; for present purposes we need only note that the leading terms contain multivariate trigonometric functions, e.g., F_BB(θ), which can be straightforwardly computed using the previously outlined techniques. These functions multiply the real coefficients, such as F_BB, which can be computed from quantum-state overlaps, where v_k are basis vectors in parameter space. These overlaps Tr[ρ(θ′)ρ(θ″)] correspond to variational states at shifted parameters θ′ and θ″, and can be estimated using SWAP tests; when the states are approximately pure, ρ(θ) ≈ |ψ(θ)⟩⟨ψ(θ)|, the overlaps |⟨ψ(θ′)|ψ(θ″)⟩|² can be estimated via Hadamard tests. The latter only requires a single copy of the state.
Let us now apply the quantum natural-gradient optimiser, whereby we multiply our classical approximation of the gradient vector by the inverse of the metric tensor F_Q [29][30][31]. Although the metric tensor can be approximated classically via Eq. (9), we remark that its quantum estimation cost becomes negligible in the vicinity of the optimum [28].
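One such regularised update, θ ← θ − λ(F_Q + η·I)⁻¹g, can be sketched as below; the metric, gradient, step size λ and regularisation strength η are stand-in values, with Tikhonov regularisation being the same device discussed for the Hessian-based optimiser.

```python
import numpy as np

def natural_gradient_step(theta, grad, metric, lam=0.1, eta=1e-3):
    """One quantum-natural-gradient update θ ← θ − λ (F_Q + η·I)^{-1} g,
    where the Tikhonov term η·I guards against an ill-conditioned F_Q."""
    F_reg = metric + eta * np.eye(len(theta))
    return theta - lam * np.linalg.solve(F_reg, grad)

# Stand-in metric: positive semi-definite and nearly singular in one direction,
# mimicking the ill-conditioned F_Q discussed in the text.
M = np.array([[1.0, 0.2, 0.0],
              [0.0, 0.7, 0.1],
              [0.0, 0.0, 1e-4]])
F_Q = M.T @ M
g = np.array([0.3, -0.1, 0.05])      # stand-in gradient vector
theta_new = natural_gradient_step(np.zeros(3), g, F_Q)
```

Without the η·I term, the near-null direction of F_Q would blow the step up by several orders of magnitude, which is why the small-step-size and regularisation remarks below matter in practice.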
We simulate the effect of shot noise in the following way. In conventional gradient-based optimisations one would estimate entries of the gradient vector to a precision ε. We set this precision such that the relative uncertainty in the gradient vector is 10%, i.e., (0.1‖g‖)² = Σ_{k=1}^{ν} Var[g_k], where Var[g_k] is the variance of a single vector entry [28]. One could distribute measurements optimally [28], but we simply set the number of measurements such that the standard deviation of each gradient entry is 0.1‖g‖/√ν. In order to compare this to our analytic descent technique, we determine the coefficients E^(B)_k to the same standard deviation 0.1‖g‖/√ν and we determine all other coefficients to the proportionally inferior precision 0.1‖g‖. Since the variance of our classical gradient vector in Eq. (B7) is dominated by the uncertainty in E^(B)_k, the overall number of measurements required for analytic descent is then only a factor of 2 more than for determining the gradient vector. Note that our optimal measurement-distribution strategy would of course be preferable.

Fig. 9(a) shows simulation results for a LiH Hamiltonian of 6 qubits. We use an ansatz circuit with 4 blocks and overall 78 parameters. We start every optimisation from a randomly selected point in parameter space that is close to the Hartree-Fock solution. In Fig. 9 (solid lines) we only plot the external optimisation loop of analytic descent, and we propagate the data points by 2 steps at every iteration to reflect the relative measurement costs. We used a very fine step size in the case of analytic descent, which allows us to follow the natural-gradient evolution of the parameters very smoothly, ranging up to many thousands of conventional gradient steps per single classical optimisation procedure (one iteration in Fig. 9). This small step size has several advantages; for example, it keeps the evolution stable even when the inverse of the ill-conditioned metric tensor F_Q is applied to the gradient vector. Fig. 9(b) shows simulation results for a spin-ring Hamiltonian. We have determined the ground state of this Hamiltonian using the previously introduced ansatz circuit, which consists of 2 blocks and overall 84 parameters. Analytic descent performs better than natural gradient even far from the optimum. The gradient in this regime is typically large and results in large steps that quickly drive the parameters away from the reference point θ_0. Most importantly, both Fig. 9(a) and Fig. 9(b) confirm our expectations: analytic descent crucially outperforms natural gradient in the vicinity of the optimum. In some regions, especially when approaching the optimum, analytic descent even appears to yield an improved convergence rate (steeper slope in the figure).

FIG. 9. Simulations of the LiH and spin-ring systems. Logarithmic plots show the distance from the exact ground-state energy (residual energy) as a function of the iterations. A classical approximation of the entire energy surface is determined at each iteration step of analytic descent (solid lines), and in an internal loop we descend towards its minimum using a classical computer (not shown here). Analytic descent (solid lines) crucially outperforms conventional natural gradient (dashed lines) and appears to increase its convergence rate (steeper slope on plots). We simulate the effect of shot noise due to finite measurements; determining one step of analytic descent requires a factor of 2 more measurements (included in graphs) than natural gradient. Dashed grey lines show our convergence criterion 10⁻³, which is comparable to chemical accuracy.

Details of the simulation
We use the ansatz circuit structure shown in Fig. 7 in our simulations. This consists of layers of single-qubit X and Y rotations as well as layers of two-qubit Pauli ZZ gates.
In the case of Analytic Descent, at every step there is a classical optimisation procedure involved, for which we used the natural-gradient update rule; we aborted this internal loop once the similarity measure $f$ became too low (the loop continues while $1-f < 0.5$). We estimated the metric tensor and inverted it using a large regularisation parameter $\eta = 0.01$ to ensure that its measurement cost remains reasonable. The step size is 0.001 (0.1) in the case of analytic descent (natural gradient).
Let us now briefly compare the measurement cost of determining $f$ to that of determining a single gradient vector. We first compute the variance of the estimator, approximating the exact vector norm by the norm of our classical approximation as $\|g\| \approx \|\tilde{g}\|$. For example, assuming that $\mathrm{Var}[g_k] = S^2$ is constant, determining the gradient vector to a relative precision $\epsilon = r\|g\|$ requires overall $N_g = r^{-2}\nu^2 S^2/\|g\|^2$ samples. In such a scenario we find that the number of shots required to determine $f$ to a precision $r$ (which can be, e.g., $r = 0.1$) is $N_f = r^{-2}\nu S^2/\|g\|^2$, and therefore the ratio $N_f/N_g \approx 1/\nu$ is small in practically relevant scenarios; e.g., in our simulations the number of parameters is large. Furthermore, we certainly do not need to query $f$ at every iteration but, e.g., only at every 10 iterations.

We consider a 6-qubit Hamiltonian of the LiH molecule in the following. We use an ansatz circuit with 4 blocks and overall 78 parameters and start the optimisation in the vicinity of the Hartree-Fock solution; we do so by adding uniform random numbers from $(-0.5, 0.5)$ to the initial parameters of the Hartree-Fock solution. The step size is 0.001 (0.1) in the case of analytic descent (natural gradient). We also determine the metric tensor at every iteration step and regularise it with a large $\eta = 0.01$.
We also consider the 12-qubit spin-ring Hamiltonian from Eq. (D2): we randomly generate the $\omega_i$ and set $J = 0.05$. We use an ansatz circuit of 2 blocks and overall 84 parameters. We start the optimisation from the lowest-energy computational basis state by adding uniform random numbers from $(-0.5, 0.5)$ to its parameters. The step size is 0.01 (0.01) in the case of analytic descent (natural gradient).
We simulate shot noise when determining the gradient vector (in the case of conventional natural gradient) and the coefficients in Eq. (B1) by adding Gaussian-distributed random numbers to their exactly determined values, as discussed in the main text.
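This noise model is straightforward to sketch: exactly computed quantities are perturbed by zero-mean Gaussian noise with the targeted standard deviation, emulating the statistics of a finite number of shots. The function name and fixed seed below are ours, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def add_shot_noise(exact_values, std_per_entry):
    """Emulate finite-shot estimation: perturb exactly computed quantities
    (gradient entries or the coefficients of the classical model) with
    zero-mean Gaussian noise of the targeted standard deviation."""
    exact_values = np.asarray(exact_values, dtype=float)
    return exact_values + rng.normal(0.0, std_per_entry, size=exact_values.shape)
```

For example, `add_shot_noise(exact_gradient, 0.1 * g_norm / np.sqrt(nu))` reproduces the per-entry precision used for the conventional natural-gradient baseline described earlier.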