Training variational quantum circuits with CoVaR: covariance root finding with classical shadows

Exploiting near-term quantum computers and achieving practical value is a considerable and exciting challenge. Most prominent candidates as variational algorithms typically aim to find the ground state of a Hamiltonian by minimising a single classical (energy) surface which is sampled from by a quantum computer. Here we introduce a method we call CoVaR, an alternative means to exploit the power of variational circuits: We find eigenstates by finding joint roots of a polynomially growing number of properties of the quantum state as covariance functions between the Hamiltonian and an operator pool of our choice. The most remarkable feature of our CoVaR approach is that it allows us to fully exploit the extremely powerful classical shadow techniques, i.e., we simultaneously estimate a very large number $>10^4-10^7$ of covariances. We randomly select covariances and estimate analytical derivatives at each iteration applying a stochastic Levenberg-Marquardt step via a large but tractable linear system of equations that we solve with a classical computer. We prove that the cost in quantum resources per iteration is comparable to a standard gradient estimation, however, we observe in numerical simulations a very significant improvement by many orders of magnitude in convergence speed. CoVaR is directly analogous to stochastic gradient-based optimisations of paramount importance to classical machine learning while we also offload significant but tractable work onto the classical processor.

exciting challenge. Most prominent candidates as variational algorithms typically aim to find the ground state of a Hamiltonian by minimising a single classical (energy) surface which is sampled from by a quantum computer. Here we introduce a method we call CoVaR, an alternative means to exploit the power of variational circuits: We find eigenstates by finding joint roots of a polynomially growing number of properties of the quantum state as covariance functions between the Hamiltonian and an operator pool of our choice. The most remarkable feature of our CoVaR approach is that it allows us to fully exploit the extremely powerful classical shadow techniques, i.e., we simultaneously estimate a very large number > 10 4 − 10 7 of covariances. We randomly select covariances and estimate analytical derivatives at each iteration applying a stochastic Levenberg-Marquardt step via a large but tractable linear system of equations that we solve with a classical computer. We prove that the cost in quantum resources per iteration is comparable to a standard gradient estimation, however, we observe in numerical simulations a very significant improvement by many orders of magnitude in convergence speed. CoVaR is directly analogous to stochastic gradient-based optimisations of paramount importance to classical machine learning while we also offload significant but tractable work onto the classical processor. As we demonstrate numerically, the approach shares features with phase-estimation protocols that prepare eigenstates with a dominant initial fidelity contribution.

I. INTRODUCTION
Quantum computers are becoming a reality and with an accelerating pace experiments set more and more impressive records [1][2][3][4][5]. Current generations of machines are already well beyond the 50-qubit frontier and have been demonstrated to being capable of significant computational advantage over the best classical supercomputers. Despite rapid progress in improving hardware it is generally believed the fault-tolerant, error corrected systems that are expected to emerge ultimately require significantly better and larger hardware and may thus not be within reach in the near term. The reason is that quantum states are highly vulnerable to experimental imperfections and correcting those errors requires highly non-trivial measures, such as encoding a single logical qubit into potentially thousands of physical qubits.
It is thus a very exciting challenge in the near term to achieve practical value with these noisy intermediatescale quantum (NISQ) [6] devices despite the damaging noise in the hardware. The most promising candidates, generally known as variational quantum algorithms [7][8][9][10][11], are robust against noise given the quantum circuit is restricted to a shallow depth. The most prominent example is the variational quantum eigensolver (VQE) whereby a circuit of shallow depth is constructed of parametrised quantum gates such that the emerging quantum state is powerful enough to express the ground state of a problem of interest, e.g., the Hamiltonian of a chemical system. Nearly all such techniques * balint.koczor@materials.ox.ac.uk proceed by efficiently estimating the energy (expected value of the Hamiltonian) or an equivalent cost function via sampling with a quantum computer and then the circuit parameters are variationally optimised to find the solution to the desired problem. While these techniques seem promising there are many challenges, especially in reducing high sampling costs and performing non-linear parameter optimisations which suffer from the presence of local traps and possibly flat regions as barren plateaus [12][13][14][15].
Here we make significant progress towards addressing these challenges: First, our approach converges faster than VQE in order(s) of magnitude fewer iterations and has a logarithmic measurement cost via classical shadows when increasing our constraint size -finding a solution as an eigenstate thus has a significantly reduced sampling cost. Second, VQE optimisations have been shown to be NP hard [12] due to local traps -our approach is particularly robust against local traps due to a stochastic generation of a large number of constraints. Third, since in the present work we resort to local Hamiltonians as we use the NISQ-friendly variant of classical shadows [16], barren-plateaus do not necessarily exist and thus pose a less significant issue than local traps [17,18].
In contrast to usual variational minimisation of a single cost function, we define an entirely new class class of algorithms by leveraging the following observation: in order to find an eigenstate a large number of properties of the variational quantum state must satisfy certain uncertainty relations with respect to observable measurements. We define these properties as covariances [19][20][21]  (left) A toy-example of a 2-qubit problem whose eigenstates we aim to find by finding parameters of a variational quantum state |ψ(θ1, θ2) prepared by two parametrised gates. Covariances f k := O k , H ψ between our problem Hamiltonian and between observables O k span classical surfaces (orange surfaces) and express uncertainty relations between those operators. Blue lines show roots as regions in parameter space where these uncertainties (covariances) vanish, i.e., at roots θ the equation is satisfied f1(θ ) = 0. Intersections of the lines in the above surface with those of the below surface (red dots) guarantee eigenstates of the problem Hamiltonian as joint roots f1(θ) = f2(θ) = 0. (right) We use the extremely powerful classical shadow techniques to determine a very large number of these covariances f k (θ1) whose slices along the parameter θ1 are shown in a practically relevant variational circuit (solid lines). We initialise at θ1 = 0 and iteratively find the joint root at θ 1 = 0 (red dot): We use a Levenberg-Marquardt step whereby we linearise the covariances through computing a Jacobian and solve the resulting large, overdetermined linear system of equations.
to pose the problem of finding eigenstates as joint roots of covariances. As we illustrate in Fig. 1(left) these covariances form surfaces as a function of circuit parameters and roots of the individual covariances form submanifolds (blue lines in Fig. 1(left)). Intersections of these as joint roots (red dots) then correspond to eigenstates of the problem Hamiltonian. In our CoVariance Root finding (CoVaR) approach we randomly select a large number of such covariances as illustrated in Fig. 1(right) and apply powerful classical numerical techniques: We linearise the surfaces by computing their analytical Jacobian with a quantum computer and solve a large but tractable linear system of equations to estimate the root (θ 1 = 0 in Fig. 1(right)). We iteratively repeat this procedure until a sufficiently good approximation of an eigenstate is found -which we can verify classically efficiently from our reconstructed covariances.
The most significant advantage of CoVaR is that we can use classical shadows to reconstruct these covariances with an extreme efficiency: we prove that the cost of estimating a very large Jacobian is comparable to a standard gradient estimation and grows only logarithmically with the number of covariances. CoVaR is directly analogous to stochastic gradient-based optimisers that have been the de-facto standard choice for most typical variants of machine learning, e.g., Levenberg-Marquardt is considered to be the fastest method for training classical neural networks [22][23][24][25]. As such, CoVaR is a quantumclassical hybrid that ideally combines the fast convergence speed of Levenberg-Marquardt with the logarithmically efficient (quantum) computation of our large Jacobian.
We demonstrate in a comprehensive set of numerical experiments that the efficacy of root finding is significantly increased by employing such large datasets and our optimisation procedure is robust against local traps, circuit noise, shot noise and noise due to random sampling of constraints. We cover a number of important practical applications, such as recompilation, finding ground and excited states of local Hamiltonians, where CoVaR is particularly powerful as we demonstrate in numerical simulations. Given the rapidly growing literature on variational quantum algorithms we discuss in detail connections and differences to similar approaches.
The structure of this work is the following. In the rest of this introduction section we briefly introduce covariances and related basic notations in Section I A which are both fundamental to quantum mechanics but also form the basis of our approach. In Section I B we then briefly recapitulate notations related to shallow variational quantum circuits. Our main, general results are presented in Section II where we state conditions for finding eigenstates based on covariances and pose our problem as root finding. In Section III we introduce our Co-VaR approach that uses classical shadows to find eigenstates of local Hamiltonians and relies on finding joint roots of very large systems. In Section IV we numerically demonstrate the power and utility of CoVaR in solving practical problems while we compare our technique to various other in Section V.

A. Preliminaries: operator covariances and their properties
In this section we introduce all necessary tools for deriving our main results. First, recall that a pure quantum state is an element of the complex Hilbert space |ψ ∈ C d with the dimension, e.g., in a system of N qubits d = 2 N . We will consider observables as Hermitian operators that act on this Hilbert space as complex Hermitian matrices A ∈ C d×d . For any pair of such Hermitian operators we can define the following bilinear form that we will refer to as a covariance.
Definition 1 (Covariances). Given two arbitrary Hermitian operators A, B ∈ C d×d we can define a covariance functional between them that depends on a pure quantum state |ψ via the bilinear form These covariances are fundamentally important in quantum mechanics and they are closely related to the statistics when an observable property of a quantum system is measured -covariances then express the compatibility of these observable properties of a quantum system. It simplifies following derivations to introduce an orthonormal set of Hermitian operators: for example, Pauli strings O k ∈ {Id 2 , X, Y, Z} ⊗N form a complete orthonormal set with respect to the Hilbert-Schmidt scalar product Tr[P † k P l ]/2 N = δ kl . Here δ kl denotes the Kronecker delta and X, Y and Z are Pauli matrices. Let us now define our operator pool as a suitable set of such operators.
Definition 2 (Operator pool). We define an operator pool P as a collection of r p orthonormal Hermitian operators as where For example, our operator pool can be constructed of Pauli strings as O k ∈ {Id 2 , X, Y, Z} ⊗N where k ∈ {0, 1, 2, 3} N and the overall number of terms is denoted as r p ≤ 4 N .
Let us define the covariance matrix associated to our operator pool from Definition 2 which depends on a quantum state |ψ .
Definition 3 (Complex covariance matrix). Given a collection of operators O k , O l ∈ P from Definition 2 we define an associated Hermitian covariance matrix C(ψ) ∈ C rp×rp that depends on a pure quantum state |ψ and has the matrix entries Note that the above covariance matrix C(ψ) expresses fundamental uncertainty relations between the observables O k in the operator pool via the matrix inequation C ≥ 0 [21]. We remark that our definition above of a covariance matrix has complex entries: While this definition will simplify our following arguments, it is worth noting that in the literature other conventions are also commonly used [19][20][21]. For example, ref. [21] defines a covariance matrix in terms of the anticommutator as the real part (4) while the imaginary part is often referred to as the commutator matrix 1 2 We will find it convenient to compactly describe the covariance matrix with complex entries thereby simultaneously referring to both the anticommutator and commutator matrices.
Given the above definitions we can straightforwardly derive a number of useful identities which we need not prove here. Corollary 1. Given the decompositions A = k a k O k and B = k b k O k of Hermitian operators A and B in terms of the orthonormal operator basis from Definition 2 we can obtain the covariance functional from the covariance matrix as where a, b ∈ R rp are coefficient vectors of the operators and C(ψ) is the covariance matrix in this operator basis. We will later find it useful to express the special cases as the vector of covariances, or covariance functions O k , A ψ = C a, i.e., our primary quantities of concern will be covariances with the system Hamiltonian O k , H ψ The variance of any Hermitian operator A can also be calculated conveniently from the covariance matrix using the decomposition of A into orthogonal operators. Lemma 1. Given the decomposition A = k a k O k of any Hermitian operator A in terms of the orthonormal operator basis from Definition 2 we obtain the variance of the operator as where a ∈ R rp is a coefficient vector and a C a ≥ 0 guarantees that C is positive semidefinite given its Hermiticity. We will later find it useful to express the variance in terms of covariance functions Var[A] = k a k O k , A ψ .

B. Parametrisation through variational quantum circuits
A variational quantum circuit U (θ) usually refers to a series of ν parametrised quantum gates U (θ) := U ν (θ ν ) . . . U 2 (θ 2 )U 1 (θ 1 ) [10] in some specific form, or ansatz which often has layers of the form Where H m is the Hermitian generator of the gate parametrised by θ m and W m can be a non-parametrised unitary associated with this layer. We will find it useful later to consider specific parametrised gates where H m ∈ {Id 2 , X, Y, Z} ⊗N is a Pauli string as typical in practice -we will refer to these as Pauli gates. The number of gates in a shallow ansatz circuit is usually chosen such that the circuit depth grows slowly, such as polylog(N ) [7][8][9][10][11]. Let us list here a number of well-studied ansätze.
Hardware Efficient Ansätze are designed to be optimised for low circuit depth and maximal expressivity. U l (θ l ) are typically chosen as the native gates of the given hardware platform and the rotation angles θ l of each gate are treated as parameters to be optimised [26].
Unitary Coupled Cluster ansatz (UCC) is a problem inspired ansatz for quantum chemistry. It proposes the candidate ground state using excitations of orbitals from some reference state |ψ 0 , typically the Hartree-Fock state as e T (θ)−T (θ) † |ψ 0 . Here the cluster operator T (θ) [27] is often restricted to single and double excitations, leading to the 'UCCSD' ansatz (SD for single and double).
Quantum Alternating Operator Ansatz, Hamiltonian Variational Ansatz and further variants are motivated by a time-discretised and trotterised adiabatic evolution that is guaranteed to drag the eigenstate of a trivial Hamiltonian H 0 to the desired problem Hamiltonian H for a sufficiently deep ansatz. The evolution time θ l of each piecewise constant, trotterised evolution U l (θ l ) = m e −iθ l Hm is variationally optimised to find the ground state.
Applying this quantum circuit to an easy-to-prepare reference state of N qubits defines our parametrised ansatz states as Parameters of the ansatz circuit θ are then varied through classical optimisation techniques such that the quantum state |ψ(θ opt ) at the optimal set of parameters is a solution to our problem [10]. Usually this optimisation is done by minimising a cost function, most typically the energy of a problem Hamiltonian E(θ) = ψ(θ)|H|ψ(θ) [8,28], but note that variants of the VQE paradigm allow for the optimisation of other cost functions, such as the variance of the Hamiltonian [29] or non-linear functions of expected values [30].
In the usual case when the gate generators H m in Eq. (5) are Pauli gates, the cost function E(θ) has been shown to be a trigonometric polynomial [31]. Finding the global minimum of E(θ) as trignomoteric functions has been shown to be NP hard [12] given the rapidly increasing number of local minima.
Here we introduce a different paradigm; Instead of searching for the minimum of a single classical function E(θ), we efficiently estimate a large number of covariances that each depend on the set of parameters θ and thus each corresponds to a unique surface as a function of θ as illustrated in Fig. 1. We prove that these parametrised covariances are indeed smooth functions of the circuit parameters θ.
Lemma 2 (Smooth covariance functions). Given a variational quantum state |ψ(θ) := U (θ)|0 ⊗N as defined via a variational quantum circuit U (θ) ∈ SU(2 N ) we define the parametrised covariances as The covariance functions f k : R ν → C are smooth, infinitely differentiable functions of the circuit parameters θ ∈ R ν for any Hermitian operator O k and problem Hamiltonian H.
Refer to Appendix B 1 for a proof. Above we have introduced the more compact notation for these covariance functions as f k (θ) to highlight that we pose our problem of finding eigenstates by finding simultaneous roots of a cohort of smooth functions {f k (θ)} rp k=1 . Furthermore, in the practically important special case when the ansatz circuit is composed of Pauli gates we show that the covariances f k (θ) are actually trigonometric polynomials in θ via ref. [31].
Corollary 2 (Trigonometric polynomials). In the specific but pivotal scenario when parametrised gates in the ansatz circuit in Eq. (5) are Pauli gates, the covariances are trigonometric polynomials as f k (θ) = where the prefactors C j ∈ C depend on the index k while T j (θ) are trignometric monomials, i.e., products of single-variate sine and cosine functions.
As such, finding parameters such that f k (θ) = 0 for all k is equivalent to finding roots of the corresponding (trigonometric) polynomial system.

II. GENERAL RESULTS: FINDING EIGENSTATES BY FINDING ROOTS
This section introduces the main theoretical underpinnings of our approach in a general setting, i.e., without making any assumptions about the problem Hamiltonian or the type of operator pool. In contrast, in Section III we will introduce CoVaR which is a specific, practically motivated approach where we restrict operators to local Pauli strings which in return allows us to utilise the powerful classical shadow technique. We note that we will also investigate another theoretically interesting special case of operator pools in Appendix A.

A. Finding eigenstates of a problem Hamiltonian
Finding approximate representations of eigenstates of a problem Hamiltonian is a key application of near-term quantum computers. The primary hope for quantum advantage in the near term is usually placed on variational quantum algorithms whereby the solution to a problem is encoded into the ground state of a Hamiltonian. Most notable is the Variational Quantum Eigensolver [8] which aims to find the ground state of a Hamiltonian via a variational minimisation of the system's energy. Finding excited states is also of particular importance for, e.g., analysing chemical reactions in drug discovery or in catalysis [32].
Furthermore, applications for finding eigenstates also exist outside of quantum simulation, for example, in solving classical optimisation problems using quantum hardware via the Quantum Approximate Optimisation Algorithm (QAOA). These were introduced to solve problems such as constraint satisfaction and Max-Cut [7] but have been extended beyond. Furthermore, finding eigenstates is also highly relevant to the continued design and improvement of applications and quantum algorithms. For example, recompilation problems are highly relevant as we will show. Similarly, the preparation of logical states in quantum error correction can be cast as eigenstate finding procedures [33,34].
The aforementioned techniques typically proceed by exploiting the fact that the problem Hamiltonians of interest r a=1 h a H a decomposes into a polynomially growing number r ∈ poly(N ) of Pauli operators whose expected values can be estimated efficiently with a quantum computer. In the following we will denote the collection of these Pauli strings as Q = {H a } r a=1 . Let us now introduce our main result which uses covariances described in the previous section to finding eigenstates of a problem Hamiltonian.
Theorem 1. Given the decomposition of a fixed problem Hamiltonian H = r a=1 h a H a into a set of basis operators H a ∈ Q ⊆ P which form a subset of our operator pool. This subset usually has a polynomial size as r ∈ poly(N ). Given a fixed quantum state |ψ , simultaneous roots of all covariances provide a sufficient condition for the eigenvalue equation  We note that the individual covariance functions may vanish without implying the presence of eigenstates of the problem Hamiltonian, for example H a , H ψ =0 can be satisfied for a single index a in the special case when |ψ is an eigenstate of H a . These form a submanifold of the smooth covariances when viewed as a function of circuit parameters as illustrated with blue lines in Fig. 1. We therefore predicate that all covariance functions in our operator pool simultaneously vanish (red dots in Fig. 1) for all indexes k which necessarily implies an eigenstate of the problem Hamiltonian. The problem of searching for eigenstates of H then becomes that of finding simultaneous roots of a system of covariances.
The above theorem ensures us that in an eigenstate all the exponentially many covariances vanish (necessary conditions), however, it is sufficient to verify only that the polynomially growing number of covariances are zero (sufficient conditions). Of course Var[H] = 0 certainly guarantees an eigenstate, however, the experimental estimation of Var[H] proceeds by computing expected values of individual Pauli terms and is thus informationally equivalent to estimating the above covariances [9][10][11].
While Eq. (7) lists all sufficient conditions with respect the minimal operator pool that only contains the Paulidecomposition terms of our problem Hamiltonian as Q, in the following we consider unions such that our operator pool is enlarged as with a polynomially growing number of operators O k that are orthogonal to our problem Hamiltonian Tr[HO k ] = 0 (via not including common terms {H a } ∩ {O k } = ∅). Roots of all covariances with respect to our enlarged operator pool then signify an eigenstate and we will show below that the enlarged operator pool increases the efficacy of our optimisation algorithm, i.e., by over-constraining the Jacobian of our root finding approach. We will refer to the size of our enlarged pool |Q | = N c as the number N c of constraints.

B. Finding joint eigenstates of commuting observables
Many problems of practical interest are concerned with finding joint eigenstates of a group of observables that all commute with each other. For example, to prepare logical states for quantum error correction we wish to produce an eigenstate of the generators of the corresponding stabiliser group -these generators mutually commute [35]. Another example is the case of recompilation of quantum circuits. Here we wish to transform a given gate sequence V into a native gate sequence U with an optimal circuit depth, e.g., to make it resilient to noise.
Both in the case of Full Unitary Matrix Compilation (FUMC) [33] and Fixed Input State Compilation (FISC) [36] the problem can be stated as applying U † after V onto our reference state |0 (see section IV A for details) which at the solution V = U would correspond to the identity operation U † V = Id and the resulting state |0 is then the ground state of the Hamiltonian − N j=1 Z j . While one ultimately aims to find the ground state of this Hamiltonian, note that we can also accept any computational basis state |n which are simultaneous eigenstates of the mutually commuting terms {Z j }. This motivates our approach of finding joint eigenstates of the individual Hamiltonian terms. Proof. For each individual index a we can apply Theorem 1 to the corresponding operator H a and obtain necessary and sufficient conditions such that |ψ is an eigenstate of the particular operator H a . It follows that if all sufficient conditions from Theorem 1 are satisfied for all indexes a, as listed above, then |ψ is a simultaneous eigenstate of all H a .

C. Conventional techniques for finding roots
As introduced above, our approach is based on estimating operator covariances with a quantum computer which we use to inform our decision of updating parameters of our variational quantum circuit. Our aim ultimately is to find a simultaneous root of these covariances at which parameters the variational state is guaranteed to be an eigenstate of our problem Hamiltonian.
There are a large number of well-established techniques for finding simultaneous roots of vector-valued functions and almost all such techniques are in some way related to Newton's original method [37,38]. Newton's method proceeds by linearising the non-linear (but smooth) vector of covariance functions f (θ) via the first-order Taylor expansion as Given each covariance function is an infinitely differentiable, smooth function of the parameters θ one can indeed apply Newton's method and can approximate roots by solving the equation f (θ +∆θ) = 0 using the above expansion f (θ) = −J∆θ and neglecting second-order terms. This results in a linear system of equations which can be solved using techniques from linear algebra.
The approach results in an iterative procedure whereby at every iteration we compute the Jacobian J with a quantum computer and apply its (regularised pseudo)inverse to the vector of covariances f to compute the parameter-update rule as We derive expressions for computing the Jacobian with a quantum computer in Appendix B 3 using well established techniques from the literature [9][10][11], e.g., parameter-shift rules. Furthermore, we also discuss in Appendix C 4 that by stacking real parts of J and f on top of the imaginary parts results in realJ andf -enforcing that the solution of the linear system of equations is a real vector ∆θ. While we aim to compute the Jacobian and the covariances with a quantum computer, we note that the resulting linear systems of equations are solved with a classical computer. It is important to note that we would obtain an under-determined system of equations if our operator pool were smaller than the number of ansatz parameters as r p < ν. This is the reason why we require that our operator pool, and thus the dimension of the vector f is at least as large as the number of circuit parameters -indeed, later we will aim to set up highly over-determined systems of equations.
While powerful, the vanilla Newton method has its limitations and is only guaranteed to converge when starting near a root -given the linear model in Eq. (9) is only accurate for small ∆θ . Nevertheless, a number of advanced techniques have been developed to increase the radius of convergence and some variants of the Newton method have been proved to be globally convergent under mild continuity conditions of the functions [37,39,40]. In particular, the simplest globally convergent approach first attempts a conventional Newton step and if the norm of the vector-valued function f does not decrease then a line search is attempted in the step direction λJ −1f [37] whereby one searches for λ that minimises f along the 1-dimensional search direction, see Appendix C 4 for more details. This approach is guaranteed to converge to a root as long as the Jacobian is non-singular and wellconditioned [37].
Another family of closely related approaches are the Levenberg-Marquadt (LM) methods which are additionally robust against ill-conditioned Jacobian matrices. The approach can be shown to be equivalent to the Gauss-Newton algorithm for least-squares minimisation with a trust-region method [37]. It attempts steps along what is formally a "regularised Newton direction" via the regularised inverse ∆θ = [A + λId] −1f with A =J J and accepts the regularisation parameter λ based on some condition, e.g., such that f decreases. The regularisation matrix is either Id, but in practice it is often chosen to be the diagonal matrix diag(J J ). In many practical applications of non-linear least-squares fitting LM can be interpreted as an approximate Hessian optimisation, however, we detail in Section V D that this is not the case for our root finding approach.

III. COVARIANCE ROOT FINDING VIA CLASSICAL SHADOWS
In the previous sections we have described the general theory as the basis for our quantum optimisation algorithm. We now detail concrete settings where our approach may achieve significant practical value in exploiting near-term quantum devices. Our aim is that the number of constraints N c in the linear system of equations is tractable but is significantly larger than the number of circuit parameters as ν N c . For this reason we choose a p-local operator pool of size r p ∈ O(N p ) that consists of all Pauli strings that act non-trivially on only p qubits as P = {P k ∈ {Id 2 , X, Y, Z} ⊗N : P k is p-local}.
This fits very well with the use of classical shadows for determining a very large number of local Pauli strings. In particular, the recent development of the classical shadows method [16] allows us in a NISQ-friendly way to measure N c covariances and their derivatives using a mea- Performance improvement when we increase the number of constraints Nc illustrated on a 14-qubit recompilation problem of 'rediscovering' unknown parameters (ν = 124) of an ansatz (refer to Section IV A for more details). CoVaR was run for a fixed number 20 of iterations (initial average fidelity F = 46 ± 7%) and the final achieved infidelity 1−F with respect to the ground state is reported: Blue crosses show the best of three runs of CoVaR and we plot their fit with the function a(Nc/ν) −b + c (blue curve) as we detail in Appendix E 1. The worst of three runs were also fitted with the same function (blue shade) confirming a polynomially (in Nc) increasing performance with b ≈ 3.2 -although we expect this degree to depend on the problem. The large spread of datapoints is due to our randomly generated constraints (with respect to 3-local Pauli strings). surement count that is only logarithmic in N c . Therefore, it is possible to offload processing to the classical computer with only a small increase in the number of measurements (quantum resources) required. This combination is ideal for NISQ-era algorithms where quantum resources are limited and it is generally to our benefit if we can offload large, but tractable calculations to a classical computer. In the remainder of this section we describe the application of this method to local Hamiltonians, using a measurement channel of single-qubit Pauli gates [16]. In Fig. 2 we provide a diagrammatic representation of the CoVaR algorithm.

A. Stochastic optimisation with very large operator pools
Despite very promising experimental progress [1-5], near-term quantum devices are noisy; to avoid a practically prohibitive accumulation of errors, the circuit depth D(N) is required to be shallow and is usually assumed to grow poly-logarithmically as D(N) ∈ polylog(N). The Jacobian is generally a non-square matrix J ∈ C^{N_c×ν}, where ν is the number of ansatz parameters, typically scaling as ν = N polylog(N) due to the shallow circuit depth. As such, for a sufficiently large system we can always over-constrain the Jacobian just by including covariances with respect to only two-local Pauli strings, since the number of constraints N_c = O(N^2) then grows faster than the number of circuit parameters. We can thus conveniently define a very large operator pool for Theorem 1 relative to the number of parameters in the ansatz circuit.
For this reason we set our operator pool P to contain all p-local Pauli strings and we randomly select constraints f ∈ C^{N_c} of a large size N_c that is nevertheless much smaller than the full operator pool, i.e., ν ≪ N_c ≪ r_p. This construction has the following advantages. First, the large (but tractable) size of the Jacobian yields an over-constrained linear system of equations in Eq. (10), which significantly improves convergence speed as we demonstrate below. Second, randomly choosing constraints helps navigate out of local traps, as we numerically simulate in Appendix D 2. Third, we employ stochastic Levenberg-Marquardt (LM) methods that adaptively regularise the Jacobian and are thus by construction robust against the noise produced by the random choice of constraints, as well as against hardware/shot noise on expectation values; rigorous proofs of convergence are available in the literature [41, 42]. These are indeed the properties that have made stochastic LM and stochastic gradient descent extremely popular in classical machine learning, i.e., their robustness against noise as well as against getting stuck in local traps [43-45].

In Fig. 3 we confirm numerically on a 14-qubit recompilation problem that the performance of root finding indeed improves as the number of constraints N_c/ν in the linear system of equations is increased. As we detail below in Section IV A, this recompilation problem is a hard benchmarking task with the advantage that our ansatz is capable of expressing the exact solution. In Fig. 3 we ran CoVaR for a fixed number of 20 iterations and plot how close the evolution came to the solution, i.e., the infidelity with respect to the ground state. We assume an idealised simulation with no shot noise or circuit noise; thus the only source of 'error' is the linearisation of the non-linear covariances via Eq. (9), while the performance is significantly improved as we increase the number of constraints.
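A single stochastic LM iteration of the kind described above can be sketched in a few lines of numpy. This is a minimal illustration (our own construction, with a toy linear constraint vector rather than real covariance data): N_c constraints are drawn at random from the pool, and the damped normal equations (JᵀJ + λI)Δθ = −Jᵀf are solved classically.

```python
import numpy as np

def stochastic_lm_step(f_full, J_full, n_constraints, lam, rng):
    """One stochastic Levenberg-Marquardt update.

    f_full : (R,) values of all pool covariances at the current parameters
    J_full : (R, nu) their Jacobian
    n_constraints : number N_c of randomly retained constraints (nu << N_c << R)
    lam : LM damping parameter regularising J^T J
    """
    idx = rng.choice(len(f_full), size=n_constraints, replace=False)
    f, J = f_full[idx], J_full[idx]
    A = J.T @ J + lam * np.eye(J.shape[1])   # small nu x nu system
    return np.linalg.solve(A, -J.T @ f)

# Toy usage: exactly linear constraints f(theta) = J (theta - theta*), so a
# single, essentially undamped step recovers the root theta* when the
# system is over-constrained.
rng = np.random.default_rng(0)
nu, R = 8, 500
theta_star = rng.normal(size=nu)
J_full = rng.normal(size=(R, nu))
theta = np.zeros(nu)
f_full = J_full @ (theta - theta_star)
theta = theta + stochastic_lm_step(f_full, J_full, 100, 1e-9, rng)
print(np.allclose(theta, theta_star, atol=1e-4))  # True
```

For the genuinely non-linear covariance functions of the text, this step is iterated and λ is adapted between iterations.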
Let us attempt to explain intuitively, on an analytical example, why such an increasingly over-constrained system of equations improves our ability to find the solution. Take the simple case where the ansatz circuit has a single parameter θ (as illustrated in Fig. 1 (right)); the covariance function vector f(θ) can then be linearised via Eq. (9) as f(θ+Δθ) ≈ f(θ) + J Δθ. Here the Jacobian has entries J_k = f_k'(θ) (assuming J_k and f_k are real, as we have stacked real and imaginary parts on top of each other). For each individual function f_k(θ), Newton's single-variate parameter update approximates the root as Δ_k θ = −f_k(θ)/J_k; however, we incur an error due to the non-linearity of f_k(θ), as we illustrate in Fig. 4 (blue lines). On the other hand, the least-squares solution simultaneously takes into account all linearised constraints f(θ+Δθ) = 0 and is given analytically as Δθ = −[Σ_k J_k²]^{−1} Σ_k J_k f_k(θ).

Let us now analyse the time complexity of classically computing the (pseudo)inverse of the Jacobian J̃. We prove in Appendix C 6 that computing the least-squares solution to the linear system of equations is dominated by the step of computing A := J̃ᵀJ̃, which can be performed in time O(ν² N_c) and, as such, scales linearly with the number of constraints. The rest of the procedure, including the computation of the inverse of the small square matrix A ∈ C^{ν×ν}, can then be completed in negligible additional time. Given that the number of constraints grows at most as N_c ≤ O(N^p) for our specific choice of p-local Pauli strings, the computation time t grows at most as t ≤ O[N^{p+2} polylog(N)] with the number of qubits N.
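The single-parameter picture above can be made concrete with a toy set of trigonometric constraint functions sharing a common root (our own illustrative choice of functions). Each per-constraint Newton step lands at a different distance from the root, depending on that constraint's non-linearity, while the least-squares step combines all constraints at once and is robust to the worst individual linearisation errors:

```python
import numpy as np

# Constraint functions with a common root at theta = 0.3 but different
# curvatures, mimicking several covariance functions of one circuit parameter.
root = 0.3
freqs = np.array([1.0, 2.0, 3.0, 5.0])
f  = lambda th: np.sin(freqs * (th - root))          # f_k(theta)
df = lambda th: freqs * np.cos(freqs * (th - root))  # J_k = f_k'(theta)

theta = 0.0
fk, Jk = f(theta), df(theta)
newton_updates = theta - fk / Jk           # per-constraint single-variate Newton
ls_update = theta - (Jk @ fk) / (Jk @ Jk)  # least-squares solution of J*dtheta = -f

print(np.abs(newton_updates - root))  # individual linearisation errors
print(abs(ls_update - root))          # combined least-squares step
```

Here the most non-linear constraint (ω = 5) yields a badly misplaced Newton step, while the combined step is bounded well away from the worst case.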
We confirm these expectations in Fig. 5 and estimate that a very large matrix with N_c = 10^6 constraints for a variational circuit of ν = 10^3 parameters can straightforwardly be computed in a matter of minutes and fits into the RAM of a typical single node, while distributed computation for larger datasets (> 10^7) is possible with negligible communication between nodes. We expect that determining the necessary expected values from classical shadows has a comparable computation time, which we detail below.
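The claim of negligible inter-node communication follows because the dominant product A = JᵀJ decomposes into a sum over row-blocks of constraints, so each node only ever communicates a single small ν × ν matrix. A minimal numerical check (our own sketch):

```python
import numpy as np

# A := J^T J can be accumulated over row-blocks of constraints, so the
# O(nu^2 * N_c) dominant step parallelises over nodes with only one
# nu x nu partial sum communicated per node.
rng = np.random.default_rng(1)
nu, n_c, n_blocks = 50, 10_000, 8
J = rng.normal(size=(n_c, nu))

A_blockwise = sum(
    block.T @ block for block in np.array_split(J, n_blocks, axis=0)
)
print(np.allclose(A_blockwise, J.T @ J))  # True
```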

B. Noise robustness
Let us now demonstrate the aforementioned noise-robustness of our approach: as we experimentally estimate the Jacobian and the covariances, we always incur a certain amount of shot noise (due to finite sampling) but also possibly noise due to experimental imperfections. While we demonstrated in a noise-free environment that performance improves when increasing the number of constraints, one might think that this could also lead to an accumulation of noise. For this reason we prove in Appendix C 5 that the error in our estimate of the update rule Δθ in Section III A does not accumulate as we increase the number of constraints, i.e., the error is constant, bounded by the worst-case error in a single Jacobian/covariance entry.
We obtain a similar conclusion for the error (shot-noise) propagation in the general multi-variate case by applying the error-propagation formula of ref. [15] for matrix inversion. In particular, the error in the update rule scales with the fourth power of the smallest inverse singular value of J (or of the regularisation parameter, as λ^{−4}). Given that the singular values of our N_c × ν-dimensional Jacobian matrix grow with N_c, we expect CoVaR to be particularly robust against shot noise. In our ν = 1-dimensional example in the previous subsection, J has the single singular value [Σ_{k=1}^{N_c} J_k²]^{1/2}, which indeed grows with the square root of N_c for non-zero derivatives |J_k| > 0.
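The growth of the singular values with the number of constraints is easy to observe numerically. In this toy check (our own, with a random Gaussian Jacobian as a stand-in) the smallest singular value tracks √N_c − √ν, so the pseudoinverse — and with it the shot-noise amplification of the update — shrinks as constraints are added:

```python
import numpy as np

# Smallest singular value of a random N_c x nu Jacobian: for Gaussian
# entries it concentrates near sqrt(N_c) - sqrt(nu), i.e. it grows with N_c.
rng = np.random.default_rng(2)
nu = 20
s_mins = []
for n_c in (100, 400, 1600):
    J = rng.normal(size=(n_c, nu))
    s_min = np.linalg.svd(J, compute_uv=False).min()
    s_mins.append(s_min)
    print(n_c, round(s_min, 2))  # grows as N_c is increased
```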
In Fig. 6 we repeated our simulations from Fig. 3 with added shot noise and circuit noise. In particular, Fig. 6 (orange) shows our simulations with only shot noise added and confirms our above analytical arguments: as we increase N_c/ν the optimisation is able to come closer and closer to the root in a fixed number of 20 iterations, up until the point when we reach a shot-noise floor where the performance no longer increases. This shot-noise floor is indeed below the precision of determining individual covariances.

Fig. 6 (black) shows the performance of root finding under simulated circuit noise but without shot noise. As we detail in Appendix E 2, we have assumed two- and single-qubit gate error rates ε_2 = 10^{−3} and ε_1 = ε_2/4, respectively, which is comparable to the performance of state-of-the-art hardware. While the optimisation is performed with noise, the plotted infidelities are calculated without noise to reflect that the correct parameters are found, as, e.g., error-mitigation techniques are typically applied for extracting noise-suppressed expected values from a final state [46-49]. These results show a very similar performance to the case with shot noise only in Fig. 6 (orange): the performance increases up until a point where a noise floor is reached, and the magnitude of this noise floor in our example appears to be very close to the case of shot noise only. Interestingly, our approach finds circuit parameters very close to the ideal ones (small final infidelities) despite circuit noise; this indeed resembles the phenomenon of Optimal Parameter Resilience [50], meaning this recompilation task is not merely learning the applied circuit noise. These simulations confirm the robustness of our approach against experimental noise.

C. Estimating a large number of covariances via classical shadows
[FIG. 5 caption:] Computation time, with a Mathematica code, of the update rule Δθ = J̃^{−1} f via the regularised inverse of the non-square Jacobian matrix J̃ ∈ R^{2N_c×ν} with a fixed number of circuit parameters ν = 1000 and an increasing number of constraints (covariances) N_c. The time complexity O(N_c) is linear in the leading dimension (the number of constraints), and the absolute time is very reasonable, i.e., less than an hour even for very large matrices with N_c = 10^7 (computed with a desktop PC).

Recall that a p-local problem Hamiltonian can be specified in terms of its Pauli decomposition as H = Σ_a h_a P_a^{(p)},
where the Pauli strings P_a^{(p)} ∈ Q^{(p)} are p-local, i.e., they act non-trivially on only p qubits. Such local Hamiltonians are highly relevant in many important problems, including, for example, recompilation, spin models in materials science, boolean satisfiability problems (3SAT), and fermionic models using mappings that retain operator locality [9, 11, 51-54].
Let us consider an operator pool P_q that contains all q-local Pauli strings and thus has size r_q ∈ O(N^q). We randomly choose covariances f_k from this operator pool such that N_c ≪ r_q and, as discussed, we aim to estimate a large number of covariances (constraints) via a large but tractable N_c. Let us now establish that we need only reconstruct expected values of at most (p+q)-local Pauli strings in order to determine all covariances.

Proof. Let us explicitly write the covariances as ⟨P_k^{(q)}, H⟩ = Σ_a h_a [⟨P_k^{(q)} P_a^{(p)}⟩ − ⟨P_k^{(q)}⟩⟨P_a^{(p)}⟩]. Above, the product of the Pauli strings P_k^{(q)} P_a^{(p)} is proportional to an at most (p+q)-local Pauli string, up to a possible prefactor ±1 or ±i depending on whether P_k^{(q)} and P_a^{(p)} commute or anticommute, as discussed in Appendix B 3. As such, we obtain a weighted sum of expected values of Pauli strings only, and thus conclude that any such covariance is fully determined by expected values of at most (p+q)-local Pauli strings. Note that determining all covariances that satisfy the sufficient conditions in Eq. (7) requires that the locality of the operator pool is at least as large as the locality of the problem Hamiltonian, i.e., q ≥ p.
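The decomposition in the proof can be verified directly with dense matrices on a few qubits. This sketch (our own, with an arbitrary toy Hamiltonian and pool operator) checks that the covariance ⟨P_k, H⟩ = ⟨P_k H⟩ − ⟨P_k⟩⟨H⟩ equals the weighted sum of pairwise Pauli expectation values:

```python
import numpy as np
from functools import reduce

# Dense 3-qubit check that a covariance <P_k, H> - with H given by its Pauli
# decomposition - reduces to expected values of products P_k P_a, each again
# a Pauli string (up to a phase) of locality at most p+q.
I2 = np.eye(2); X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]]); Z = np.diag([1.0 + 0j, -1.0])
P = {"I": I2, "X": X, "Y": Y, "Z": Z}
kron = lambda s: reduce(np.kron, [P[c] for c in s])

h = {"ZII": 0.7, "IZI": -0.4, "XXI": 0.2}      # a toy 2-local Hamiltonian
H = sum(c * kron(s) for s, c in h.items())

rng = np.random.default_rng(3)
psi = rng.normal(size=8) + 1j * rng.normal(size=8)
psi /= np.linalg.norm(psi)
ev = lambda A: psi.conj() @ A @ psi

Pk = kron("YIZ")                               # a 2-local pool operator
cov_direct = ev(Pk @ H) - ev(Pk) * ev(H)
cov_sum = sum(c * (ev(Pk @ kron(s)) - ev(Pk) * ev(kron(s)))
              for s, c in h.items())
print(np.allclose(cov_direct, cov_sum))  # True
```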
The classical shadow procedure [16] fits very well with our CoVaR approach as it allows us to estimate a very large number of these covariances such that the number of samples (quantum resources) increase only logarithmically with the number of constraints -while the required measurements are very NISQ friendly. Let us briefly recapitulate the main steps to reconstructing Pauli strings using classical shadows.
• We apply a random unitary U to rotate the state. In our case of local Pauli strings, the unitaries are chosen randomly from single-qubit Clifford gates on each qubit; the procedure is thus equivalent to randomly selecting to measure in the X, Y or Z basis on each qubit. We measure each qubit to obtain N-bit measurement outcomes |b_i⟩ ∈ {0,1}^N.
• We then generate the classical shadows by applying the inverse of the measurement channel M, which can be done efficiently as the chosen channel is a distribution over Clifford circuits. The classical snapshots are generated as ρ̂_i = M^{−1}(U† |b_i⟩⟨b_i| U), and the classical shadows are collections of these snapshots S(ρ; N) = [ρ̂_1, ..., ρ̂_N].
• From these classical shadows we can construct K estimators of ρ from our N_batch snapshots as ρ̂_{(k)} = (1/r) Σ_{i=(k−1)r+1}^{kr} ρ̂_i with r = N_batch/K, and classically calculate median-of-means estimators ô_i(N, K) of the Pauli expectation values. The classical computational resources required are quite modest.
• The sample complexity of obtaining these estimators of M Pauli operators of locality l to error ε is O(3^l log(M)/ε²) [16].

[FIG. 6 caption:] Black results show the performance of CoVaR under simulated circuit noise as described in Appendix E 2, indicating the resilience of the method to reasonable levels of circuit noise. In the large-N_c limit a noise floor is approached whose magnitude in our particular example is comparable for both cases of only shot noise (orange) and only circuit noise (black). The large spread of the datapoints is due to our randomly generated constraints; fits of the best (solid curves) and worst (shaded area) of 3 runs of CoVaR are included.

Using classical shadows thus allows us to reconstruct all (p+q)-local Pauli strings with a sample complexity that is merely logarithmic in the system size. This fits particularly well with the present approach: when the locality p+q of the Pauli strings is modest, we can obtain a large, polynomially growing number of constraints N_c ∈ O(N^q). Furthermore, given that the covariances are fully determined by expected values (of local Pauli strings), we
show that their analytical derivatives can be estimated using expected values at shifted circuit parameters via the so-called parameter-shift rules [55]. In particular, each partial derivative in the Jacobian [J]_kn := ∂_n f_k(θ) is determined by estimating expected values at two different shifted parameters, as discussed in Appendix C 1. As such, we can fully determine our Levenberg-Marquardt step using only expected values of local Pauli strings. Let us now state the sample complexity of CoVaR, whose (quantum) cost is dominated by estimating the Jacobian, and let us compare it to the cost of determining a gradient vector used in energy minimisation.

Statement 2. Given a p-local problem Hamiltonian H, we use classical shadows to determine a large number N_c of covariances with respect to q-local Pauli strings. The sample complexity of determining the Jacobian of size J ∈ C^{N_c×ν} to an error ε is N_s ≤ (2ν+1) 3^{p+q} (68/ε²) log(6rN_c/δ). In contrast, determining the gradient of the energy expected value ⟨H⟩ using classical shadows has a complexity O(3^p ν log(r)/ε²). As such, determining a very large Jacobian is only logarithmically more expensive than determining an energy gradient (up to a multiplicative constant 3^q that depends on the modest locality of our choice, e.g., q = 2, 3).

Proof. Theorem 1 of ref. [16] established that M Pauli strings O_i of locality l can be estimated to precision parameters ε, δ via the number of batches K = 2 log(2M/δ) and the number of samples in the individual batches N_batch = (34/ε²) ‖O‖²_shadow. This results in an overall number of samples N_batch K, and the norm is given in Lemma 3 in ref. [16] as ‖O_k‖²_shadow = 3^l. We have established in Lemma 6 that we can determine the Jacobian matrix J ∈ C^{N_c×ν} by applying the classical shadow procedure 2ν+1 times at different circuit-parameter configurations with M ≤ 3rN_c. As such, determining these Pauli strings of locality l = p+q requires a number of samples N_batch K ≤ (68/ε²) 3^{p+q} log(6rN_c/δ).
Given that we apply the classical shadow procedure 2ν+1 times, we obtain the following upper bound on the sample complexity: N_s ≤ (2ν+1) 3^{p+q} (68/ε²) log(6rN_c/δ). In contrast, determining a gradient vector for gradient descent requires 2ν applications of the classical shadow procedure, each with M = r, and thus we obtain the sample complexity N_s^{(grad)} ≤ 2ν 3^p (68/ε²) log(2r/δ). In both cases we have determined the necessary Pauli strings to precision ε, and both the energy gradient and the covariance Jacobian are then obtained from these as linear combinations with respect to the Hamiltonian coefficients, which leads to a worst-case error propagation of O(r).
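The parameter-shift rule invoked above can be illustrated on a single gate. For a gate exp(−iθP/2) with a Pauli generator P, the exact derivative of any expected value is half the difference of the expected values at θ ± π/2; the following self-contained check (our own toy circuit) verifies this for ⟨Z⟩ after an Rx rotation:

```python
import numpy as np

# Parameter-shift rule for exp(-i theta X / 2):
#   d<O>/dtheta = ( <O>(theta + pi/2) - <O>(theta - pi/2) ) / 2,
# so each Jacobian entry requires expected values at two shifted parameters.
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.diag([1.0 + 0j, -1.0])
I2 = np.eye(2)

def rx(theta):
    # exp(-i theta X / 2) = cos(theta/2) I - i sin(theta/2) X
    return np.cos(theta / 2) * I2 - 1j * np.sin(theta / 2) * X

def expval(theta, O):
    psi = rx(theta) @ np.array([1.0, 0.0])
    return np.real(psi.conj() @ O @ psi)

theta = 0.37
shift = (expval(theta + np.pi / 2, Z) - expval(theta - np.pi / 2, Z)) / 2
print(np.isclose(shift, -np.sin(theta)))  # True, since <Z> = cos(theta)
```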
Actually, this bound on the number of required measurements in terms of the locality is noted to be conservative, and the actual constants are expected to be much smaller in practice [16]. Furthermore, the development of derandomized classical shadows [56] has the potential to significantly reduce the number of required measurements; e.g., an order-of-magnitude reduction has been demonstrated in numerical experiments [16]. These techniques could thus greatly improve the speed at which the covariances can be extracted by optimising the Pauli measurement bases for the specific Pauli strings in our operator pool, but we do not consider them in our above performance bounds. Furthermore, the overhead of CoVaR relative to determining a single gradient vector in gradient descent is the constant 3^q (up to the logarithmic dependence on N_c) and is only due to the increase in the locality of the Pauli strings with q = 2, 3, etc. We can thus expect that determining a very large Jacobian has a complexity comparable to determining just a single gradient vector in gradient descent. We will demonstrate in the following that this increased size of the Jacobian has significant advantages in practical applications.
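The measurement pipeline summarised in the bullet points above (random single-qubit Pauli bases, inverse-channel snapshots 3U†|b⟩⟨b|U − I per qubit, median of means) can be sketched end to end. This is a toy demonstration after the protocol of ref. [16], restricted to product states purely to keep the simulator small — the protocol itself is unchanged for entangled states:

```python
import numpy as np

# Minimal single-qubit-Clifford classical-shadow sketch: measure each qubit
# in a random X/Y/Z basis, form snapshot factors 3|v><v| - I (the inverse
# channel M^-1 per qubit), and estimate local Pauli strings by median of means.
I2 = np.eye(2); X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]]); Z = np.diag([1.0 + 0j, -1.0])
PAULI = {"I": I2, "X": X, "Y": Y, "Z": Z}
EIGVECS = {b: np.linalg.eigh(PAULI[b])[1] for b in "XYZ"}  # measurement bases

def shadow_estimate(rho_1q_list, pauli_string, n_snapshots, k_batches, rng):
    """Estimate <pauli_string> of a product state given per-qubit density
    matrices (product states keep this toy simulator cheap)."""
    estimates = []
    for _ in range(n_snapshots):
        val = 1.0
        for q, rho in enumerate(rho_1q_list):
            basis = rng.choice(list("XYZ"))
            probs = np.array([np.real(v.conj() @ rho @ v)
                              for v in EIGVECS[basis].T])
            probs = np.clip(probs, 0, 1); probs /= probs.sum()
            v = EIGVECS[basis][:, rng.choice(2, p=probs)]
            snap = 3 * np.outer(v, v.conj()) - I2       # inverse channel
            val *= np.real(np.trace(PAULI[pauli_string[q]] @ snap))
        estimates.append(val)
    batches = np.array_split(np.array(estimates), k_batches)
    return np.median([b.mean() for b in batches])

rng = np.random.default_rng(4)
plus = np.array([[0.5, 0.5], [0.5, 0.5]])   # |+><+|, so <X> = 1
zero = np.diag([1.0, 0.0])                  # |0><0|, so <Z> = 1
est = shadow_estimate([plus, zero], "XZ", n_snapshots=3000, k_batches=10, rng=rng)
print(est)  # close to the true value <XZ> = 1
```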

A. Recompilation
The ability to recompile a given quantum circuit into an equivalent but practically feasible or more favourable representation is crucial for the successful exploitation of quantum computers. The idea exists in many variants, from the application of classically tractable analytical gate-replacement rules to automatic discovery techniques [57-59]. In variational recompilation, we want to find a parametrised unitary circuit U(θ) that approximates a target unitary V. The ansatz is required to approximate the action of V either on the entire Hilbert space, in the case of recompiling a Full Unitary [33] in Fig. 7(b), or only on a specific input state, in Fig. 7(a). After applying the circuits V and U(θ)† consecutively, the goal is to find circuit parameters θ such that the state of the register is the ground state |0⟩ of the Hamiltonian H = −Σ_{j=1}^N Z_j [36]. However, the problem would equally be solved by finding ansatz parameters that produce any computational basis state (i.e., any eigenstate of H), given that we can then just append single-qubit X rotations to the ansatz to produce the desired operation. This feature, along with the local Hamiltonian allowing for the efficient measurement of many covariances, makes the problem particularly amenable to root finding, which is not limited to searching only for the ground state.
For this reason we apply Corollary 3 to the present problem and define our problem Hamiltonians as Q := {Z_a}_{a=1}^N. We can indeed enlarge this pool by further considering products of single-qubit Pauli Z operators. Our aim is then to find joint roots of the covariances from Corollary 3, which then guarantee that the solution corresponds to a joint eigenstate of all operators in Q, i.e., one of the computational basis states. After having found one of these computational basis states, we just apply single-qubit X rotations to our ansatz to map it to the |0⟩ state.
Here we consider an example of parameter rediscovery as a benchmark, whereby we recompile a unitary V = U(θ*) that has the same form as the parametrised quantum circuit U(θ) but with the parameters fixed at some random solution values θ*. This has the advantage of being a very hard problem to solve variationally, while also guaranteeing that our circuit is capable of expressing the solution; note that below we benchmark our CoVaR approach on practical problems as well. Fig. 7(c) shows the performance of root finding when we initialise relatively close to the solution (by disturbing parameters |Δθ_k| ≤ 0.3) on a 10-qubit, 2-layer parameter-rediscovery problem and compares it to the performance of gradient descent. In this recompilation problem our operator pool P contained all 3-local Pauli strings and we chose N_c operators randomly at every iteration; see details in Appendix E 2. Indeed, Fig. 7(c) confirms that root finding converges significantly faster to the shot-noise floor, i.e., the limitation due to finite sampling of expected values. Furthermore, Fig. 7(c) confirms that root finding has a significantly improved convergence rate (steeper slope), which is further improved with a greater number of constraints (blue vs. light blue), while the quantum resources required for a single iteration are comparable to those of gradient descent. Fig. 7(d/blue) shows root finding applied to an initial state that we obtained by a short period of gradient descent from a random state; applying gradient descent to a random state has the effect of producing a state with an appreciable overlap with the lowest-energy eigenstates, allowing root finding to converge efficiently. In contrast, in Fig. 7(d/black) we demonstrate the performance of root finding when we start from a randomly chosen initial point in parameter space on the same problem.
It fails to make any progress; this is due to the fact that root finding works well only when a small number of eigenstates significantly contribute to the state produced by the PQC. In contrast, random states, being nearly equal superpositions of a large number of basis states, do not contain a dominant eigenstate towards which root finding could converge, and thus CoVaR fails to make significant progress. This is directly analogous to fault-tolerant phase-estimation protocols, which similarly fail under random initialisation but enable us to efficiently prepare any eigenstate given that a good initialisation is possible. These results signify the importance of initialisation when searching for eigenstates and clearly demonstrate that even a short period of gradient descent may be sufficient for this purpose.
The performance of variance-VQE (a gradient-based method that minimises the variance of the Hamiltonian, see Section V B for more details) is also shown for comparison; it is another method which, like root finding, does not only search for the ground state of the Hamiltonian. Variance-VQE does not get stuck in the same way as root finding, but makes slow progress due to its (relative to root finding) slow convergence speed.

B. Spin Models
Spin models are highly relevant in the study of condensed matter physics and quantum statistical mechanics, and many problems in practice can be mapped to spin models, such as NP problems [60] (cf. quantum approximate optimisation algorithms (QAOA)) or spin systems in materials science [9, 11, 52-54]. Furthermore, lattice models of quantum field theories [61] typically have local Hamiltonians.
Here we simulate CoVaR when used to search for low-energy excited states of a spin chain described by the Hamiltonian H = J Σ_{i=1}^N σ⃗_i · σ⃗_{i+1} + Σ_{i=1}^N c_i Z_i, where we use periodic boundary conditions (N+1 ≡ 1), the c_i are randomly selected on-site interactions comparable in strength to the coupling J, and σ⃗_i is the Pauli vector for the ith qubit. This Hamiltonian cannot be simulated classically for large N with reasonable computational resources despite its simple structure [62, 63], and it is relevant for studying the phenomenon of many-body localization in condensed matter systems [64]. We use a hardware-efficient ansatz of 20 layers for 10 qubits. Before root finding was performed, the ansatz parameters were initialised to θ_imag using Imaginary Time Evolution [65] to a state with low expected energy. This ensured that only a limited number of eigenstates contributed significantly to the state produced by the ansatz. CoVaR was then performed from points in parameter space with small random variations from θ_imag to map out the low-energy subspace. This process is shown in Fig. 8 in mapping out the lowest-energy subspace of a 10-qubit spin chain; CoVaR rapidly finds a state an order of magnitude closer to the ground state in energy error than Imaginary Time Evolution is able to converge to. This indeed confirms that root finding has a significantly improved convergence rate when compared to Imaginary Time Evolution, where the latter is equivalent to natural gradient [66, 67] and is thus a second-order method that requires increased absolute quantum resources (samples) [66]. While here we focused on practical applications of CoVaR, we demonstrate additional numerical simulations in Fig. 12 in the Appendix, whereby we explicitly compare CoVaR to VQE; we conclude that, in comparison to other variational methods, CoVaR again exhibits superior performance on the present spin-ring problem.
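For small N the spin-ring Hamiltonian above can be built densely and diagonalised exactly, which is how reference energies for such benchmarks are typically obtained. The sketch below is our own and assumes the Heisenberg form H = J Σ_i σ⃗_i·σ⃗_{i+1} + Σ_i c_i Z_i suggested by the surrounding description:

```python
import numpy as np
from functools import reduce

# Dense construction of a periodic Heisenberg ring with random on-site terms,
# H = J sum_i vec(sigma)_i . vec(sigma)_{i+1} + sum_i c_i Z_i, site N+1 = 1.
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]]); Z = np.diag([1.0 + 0j, -1.0])
I2 = np.eye(2)

def op_on(site_ops, n):
    """Tensor single-site operators (dict: site -> 2x2) into 2^n x 2^n."""
    return reduce(np.kron, [site_ops.get(q, I2) for q in range(n)])

def spin_ring(n, J, c):
    H = np.zeros((2**n, 2**n), dtype=complex)
    for i in range(n):
        j = (i + 1) % n                      # periodic boundary: N+1 = 1
        for S in (X, Y, Z):
            H += J * op_on({i: S, j: S}, n)
        H += c[i] * op_on({i: Z}, n)
    return H

rng = np.random.default_rng(5)
n = 6
H = spin_ring(n, J=1.0, c=rng.uniform(-1, 1, size=n))
print(np.allclose(H, H.conj().T))             # Hermitian
print(np.linalg.eigvalsh(H)[0])               # exact ground energy at small n
```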
[FIG. 9 caption:] While here we only report the probability of converging to the ground state specifically, we note that 97.1% of all runs converged to one of the low-energy eigenstates; the distribution of eigenstates found is detailed in Appendix D 3. Initialisation was done in the same way as in Fig. 8 by running Imaginary Time Evolution from a random initial state until reaching selected energies between −5.8 and the ground state. A linear decay of the success probability with ground-state overlap, reminiscent of phase-estimation protocols, is found, although we expect the rate of this decay to depend strongly on the population distribution of the lowest-energy eigenstates.

Finding excited states using orthogonality-constrained VQE techniques [36, 68] can be difficult because we need to discover the parameters for, and project out, every state from the ground state up to the energy of the state we wish to find. CoVaR is agnostic to the energy of the state and acts to find states close to the one it is initialised into. Although CoVaR could converge to states that have been found in previous runs of the algorithm, similar techniques of projecting out previously found eigenstates could also be applied. We can also potentially use classical techniques, such as Interval Analysis [57], to find all the roots within an area of parameter space, reducing the problem of converging to an already known eigenstate (root).
In Fig. 9 we show the probability of converging to the ground state of our spin chain as a function of the overlap between the ground state and the initial state. These initial states were obtained by performing Imaginary Time Evolution from random points in parameter space down to energies between −5.8 and the ground-state energy (E = −5.99). The initial fidelity at which root finding was started was recorded (Fig. 9, horizontal axis) and the fraction of runs that converged to the ground state is reported. Indeed, we find a nearly linear relationship, and the vast majority of runs that do not converge to the ground state converge to one of the other low-lying energy levels. We provide the distribution of how these runs converged in Appendix D 3. Furthermore, the observed relation between fidelity and success probability is directly analogous to phase estimation, whereby a measurement is used to collapse the system into the desired eigenstate with a probability given by the fidelity with respect to that state. In contrast, variational quantum algorithms converge to local minima, and only an exponentially small fraction of such local energy minima may be close to the ground state, as proved in ref. [17].

V. COMPARISON TO EXISTING TECHNIQUES
A. Relation to variational quantum algorithms

A large subset of variational quantum algorithms is concerned with minimising a cost function E(θ), which is usually the expected value ⟨H⟩ of a Hamiltonian, as in the case of VQE. Among other practical limitations, this surface may have a large number of local minima that can trap local optimisers [69]. The main difference is that CoVaR uses a large number of such surfaces that are randomly selected, and computing this large amount of data requires quantum resources (shots) similar to a standard gradient estimation in variational quantum algorithms. In the following sections we compare CoVaR in more detail to specific variational quantum algorithms and related techniques.

B. Comparison to variance minimisation
Minimising the variance σ² := ⟨H²⟩ − ⟨H⟩² of a Hamiltonian allows us to find eigenstates and has been explored in the context of quantum chemistry [29]. Furthermore, the so-called variance-VQE [70] approach uses variational quantum circuits and estimates this variance, as well as its derivative, using a quantum computer. Note that the tools we have introduced can naturally be applied in this context: given a minimal pool P = {H_a}_{a=1}^r that contains only the Pauli terms of the Hamiltonian, and thus a covariance matrix ⟨H_a, H_b⟩_ψ of the Hamiltonian terms only, our Jacobian allows us to compute the gradient of the variance as ∇σ² = J† h, as we detail in Appendix C 3.
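The underlying identity, Var[H] = h† C h with C_ab = ⟨H_a H_b⟩ − ⟨H_a⟩⟨H_b⟩ the covariance matrix of the Hamiltonian terms (cf. Lemma 1 referenced later in the text), is easy to verify numerically. A small dense check of our own on two qubits:

```python
import numpy as np

# Numerical check that Var[H] = h^dag C h for H = sum_a h_a H_a, where
# C_ab = <H_a H_b> - <H_a><H_b> is the covariance matrix of the terms.
X = np.array([[0, 1], [1, 0]], dtype=complex); Z = np.diag([1.0 + 0j, -1.0])
I2 = np.eye(2)
terms = [np.kron(Z, I2), np.kron(I2, Z), np.kron(X, X)]
h = np.array([0.9, -0.5, 0.3])
H = sum(c * T for c, T in zip(h, terms))

rng = np.random.default_rng(6)
psi = rng.normal(size=4) + 1j * rng.normal(size=4)
psi /= np.linalg.norm(psi)
ev = lambda A: psi.conj() @ A @ psi

C = np.array([[ev(A @ B) - ev(A) * ev(B) for B in terms] for A in terms])
var_direct = np.real(ev(H @ H) - ev(H) ** 2)
print(np.isclose(var_direct, np.real(h @ C @ h)))  # True
```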
While our CoVaR approach contains full information about the variance and its gradient, it is important to recognise that a gradient-descent optimisation has an inferior convergence rate when compared to the quadratic Newton method. In Fig. 7 we compare the two techniques and indeed find that CoVaR has superior performance. Furthermore, note that CoVaR uses strictly more information than variance minimisation: while the operator pool P = {H_a}_{a=1}^r containing only the Hamiltonian terms is sufficient for Theorem 1, we significantly enlarge this pool such that N_c ≫ r and aim to find joint roots of this large number of covariances.

C. Comparison to subspace expansion
Subspace expansion [71] is a method for the discovery of low-energy excited states starting from an estimated ground state |ψ_G⟩. One then explores directions in Hilbert space by applying low-weight operators as excitations to the ground state. Typically, operator pools of Pauli operators are used to produce a new set of states |ψ_k⟩ = P_k|ψ_G⟩ for calculating the overlaps H_kj = ⟨ψ_k|H|ψ_j⟩ and S_kj = ⟨ψ_k|ψ_j⟩. Diagonalising H_kj (with respect to the overlap matrix S_kj) then reveals better ground-state energies than that of |ψ_G⟩. As opposed to CoVaR, subspace expansion cannot prepare a 'good quality' representation of the eigenstates with a quantum computer, but is rather limited to estimating their energies with a classical computer.
The connection to CoVaR is elucidated further in Appendix A 1, where we express covariance functions in terms of a set of quantum state overlaps O_k. We can make a weak analogy to subspace expansion based on the following observation. Given that covariance functions can be expressed as state overlaps, our method explores possible directions in Hilbert space, via the operator pool, that are beyond the capabilities of the ansatz, and through the linearisation via the Jacobian we gain information about the parameters at which the nearest eigenstate (with vanishing overlaps) may be found. Similarly, subspace expansion also uses operators additional to the ansatz to explore around the estimated ground state to extract low-energy excited states. As such, we may be able to reuse existing heuristics from subspace expansion for selecting problem-specific operator pools.

D. Comparison to Hessian optimisation
A Hessian-based Newton-Raphson optimisation [72] of the function E(θ) uses the update rule Δθ = −H^{−1} ∇E(θ), where H is the Hessian matrix of second derivatives ∂_n∂_m E(θ), which can be estimated on a quantum device using standard techniques such as the parameter-shift rule [31, 73]. Let us compare our root finding with this Hessian optimisation. A similarity is that both methods use additional information to achieve improved convergence over methods that use only gradient information. The Hessian obtains information about the local curvature of the E(θ) manifold, which is a "classical" multivariate function whose local curvature can even be accurately captured by analytical approximations [31], with the use of a quantum computer. In contrast, CoVaR uses information from operator covariances of the variational state |ψ(θ)⟩, extracted by exploring directions in Hilbert space through a randomly chosen, large operator pool as detailed in Appendix A 1. In this sense, CoVaR is a quantum-aware method, as it does not only use the fact that E(θ) (and its derivatives) are efficiently calculable on a quantum computer, but also the relationship between |ψ(θ)⟩, the Hamiltonian and possible directions in Hilbert space. The most pronounced difference between the two techniques is practical: while we can use the classical shadow procedure O(ν) times to estimate a very large Jacobian, for the Hessian we need to use it O(ν²) times at different circuit-parameter configurations. In this sense CoVaR obtains significantly more information using fewer measurements.
It is also interesting to point out that the Gauss-Newton and LM techniques can be interpreted as approximate Hessian optimisations of the vector norm ‖f‖² when H ≈ 2J†J gives a good approximation of the Hessian matrix, keeping only the first-order derivative products [∂_n f_k(θ)][∂_m f_k(θ)] and neglecting second-order derivatives of the form ∂_n∂_m f_k(θ). This, however, does not apply to CoVaR, given that in our case the functions f_k(θ) are trigonometric polynomials for which the second-order derivatives are dominant, especially near solutions. Take for example the simple function g(θ) = −Σ_{n=1}^ν cos(θ_n), which has a minimum at θ = (0, 0, ..., 0); while its first derivatives vanish near the minimum, the second derivatives |∂_n∂_n g(θ)| ≈ 1 are dominant. Note also that the vector norm ‖f‖² has no immediate relation to the energy surface E(θ), and thus it is not related to a second-order energy minimisation.

E. Comparison to natural gradient
Quantum Natural Gradient, which is equivalent to Imaginary Time Evolution for ideal, unitary circuits, improves over both the convergence rate and the ability to avoid traps of vanilla gradient descent by taking into account the geometry of the space of quantum states. Here we estimate the Quantum Fisher Information [30,65-67] as the real part of the Quantum Geometric Tensor, $\mathcal{F}_{ij}(\boldsymbol\theta) = \mathrm{Re}[\langle \partial_i \psi | \partial_j \psi \rangle - \langle \partial_i \psi | \psi \rangle \langle \psi | \partial_j \psi \rangle]$, which is used to compute the parameter update $\boldsymbol\theta \rightarrow \boldsymbol\theta - \eta\,\mathcal{F}^{-1}(\boldsymbol\theta)\nabla E(\boldsymbol\theta)$. Similarly to Hessian optimisation, this method has a complexity $\mathcal{O}(\nu^2)$ for calculating the tensor which, as opposed to the Hessian, is independent of the energy surface, and rather expresses relations between states reached by varying different parameters.
CoVaR can also be thought of as an optimisation using additional information about the space of quantum states, but in this case extracted from the many covariance functions (which do depend on H) instead of the geometric tensor.
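The natural-gradient parameter update above can be sketched in a few lines. The small Tikhonov shift `eps` used to regularise a possibly singular metric is our own assumption for the sketch, not part of the text.

```python
import numpy as np

def natural_gradient_step(qgt, grad, eta=0.1, eps=1e-8):
    """Sketch of a quantum-natural-gradient update: precondition the energy
    gradient with the real part of the Quantum Geometric Tensor. We solve
    the linear system rather than inverting F explicitly; the Tikhonov
    shift `eps` guards against a singular metric (our assumption)."""
    F = np.real(np.asarray(qgt))
    dtheta = np.linalg.solve(F + eps * np.eye(len(grad)), np.asarray(grad))
    return -eta * dtheta  # parameter update: theta -> theta + step

# with a trivial metric F = Id the step reduces to plain gradient descent
step = natural_gradient_step(np.eye(2), [1.0, 2.0])
```

Solving the regularised linear system is the standard numerically stable choice here, since the metric estimated on hardware is noisy and often rank-deficient.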

F. Relation to parent Hamiltonians
Let us now relate CoVaR to existing techniques that do not aim to find eigenstates, but rather aim to find Hamiltonians $H_{\mathrm{parent}}$ that encode a fixed state $|\psi\rangle$ as an eigenstate. Such an $H_{\mathrm{parent}}$ is then called a parent Hamiltonian of the state $|\psi\rangle$.
In particular, these techniques proceed by assuming that the parent Hamiltonian can be expressed in terms of the ansatz as a linear combination of basis operators $H_a$ as $H_{\mathrm{parent}} = \sum_{a=1}^{r} h_a H_a$ [74,75] via the real coefficient vector $h \in \mathbb{R}^r$. Here $r$ is the rank of the decomposition, i.e., the number of independent basis operators. The covariance matrix $C(\psi) \in \mathbb{C}^{r\times r}$ that we have defined in Eq. (3) then depends on our trial quantum state $|\psi\rangle$. The parent Hamiltonian is found by computing the nullspace of this covariance matrix: every coefficient vector $h$ in the nullspace satisfies $Ch = 0$, and given our expression for the variance in Lemma 1 this guarantees that $h^\dagger C h = \mathrm{Var}[H_{\mathrm{parent}}] = 0$ in the particular state $|\psi\rangle$. CoVaR clearly works according to a reverse logic whereby the problem Hamiltonian is fixed and we search for quantum states that yield vanishing covariances. We then search the space of quantum states via an efficient parametrisation, i.e., variational circuits, using a quantum computer.
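The parent-Hamiltonian logic can be illustrated with a deliberately small numerical sketch. For a single-qubit trial state and a hypothetical two-operator basis $\{Z, X\}$ (our own choice for illustration), a nullspace vector of the covariance matrix yields a parent Hamiltonian with vanishing variance.

```python
import numpy as np

# single-qubit basis operators (hypothetical minimal example)
Z = np.diag([1.0, -1.0]).astype(complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
basis = [Z, X]

psi = np.array([1, 0], dtype=complex)  # trial state |0>

def expval(A):
    return psi.conj() @ A @ psi

# covariance matrix C_ab = <H_a H_b> - <H_a><H_b>
C = np.array([[expval(A @ B) - expval(A) * expval(B) for B in basis]
              for A in basis])

# nullspace vector h with C h = 0, via the smallest singular value
_, s, Vh = np.linalg.svd(C)
h = np.real(Vh[np.argmin(s)].conj())

H_parent = sum(c * P for c, P in zip(h, basis))
variance = expval(H_parent @ H_parent) - expval(H_parent) ** 2
print(abs(variance))  # vanishing variance: |0> is an eigenstate of H_parent
```

Here the nullspace vector singles out $H_{\mathrm{parent}} \propto Z$, as expected since $|0\rangle$ is a $Z$ eigenstate.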

VI. DISCUSSION
We have demonstrated that CoVaR achieves significantly improved performance, by many orders of magnitude, compared to analogous variational algorithms due to its effective use of classical shadows. However, a main limitation is its vulnerability to random parameter initialisation. Although this seemingly resembles barren plateaus, whereby expected-value landscapes suffer from flat regions due to vanishing gradients [13,17,18], the present limitation is quite different in nature. In particular, recall that phase-estimation protocols provably and efficiently find an eigenstate of an efficiently simulable Hamiltonian given an initial state with a sufficiently large overlap with the desired eigenstate; our approach is quite similar, as it gets attracted to any eigenstate with a significant contribution to the initial state. For this reason we expect that the main limitation of the present approach is not decoupled from the general challenge of finding good initialisations for fault-tolerant phase-estimation protocols or finding problem-specific ansätze for VQE problems, a challenge that may be attributed to general hardness results of finding ground/eigenstates [76]. Interestingly, we have demonstrated in numerical simulations that even a short period of gradient-descent evolution provides sufficient initialisation in practice.
Furthermore, barren plateaus do not necessarily exist for our focus of local Hamiltonians and shallow ansätze [17,18]. Nevertheless, random initialisation of gradient-based VQE optimisers still prohibits finding eigenstates of large systems due to local traps [17]: first, optimising VQAs has been shown to be NP-hard due to persistent local minima [69]; second, ref. [17] proved that a broad class of shallow VQA models that exhibit no barren plateaus are untrainable due to local traps. As we demonstrated in numerical simulations, our stochastic Levenberg-Marquardt approach does indeed mitigate the effect of these local traps: while gradient-based optimisers fail to make progress around a local trap as the gradient of the energy surface vanishes, CoVaR is not an energy minimiser and those specific parameters may well yield a non-zero step for CoVaR. Furthermore, CoVaR is also less vulnerable to getting stuck due to our randomly generated constraints (covariances): even if a single iteration makes no progress, in the next iteration a new, randomly generated set of constraints may well yield a non-zero step, as we demonstrate in Fig. 13 where sometimes several steps of CoVaR are required to escape a trap. This is analogous to the well-known advantage of stochastic gradient descent in the machine-learning context [43], and we note that exploring globally convergent root-finding techniques may also be a fruitful direction for future research [37,39,40].
There are a number of natural extensions to our approach that we have not considered here. First, given that the classical shadows are stored on a classical computer, we can in principle determine multiple sets of update rules from them and apply the one that most decreases the variance, or any other metric, as opposed to our fully randomised scheme. Second, it would be worth exploring some specific use cases of CoVaR in more detail, such as finding highly excited states. Third, in this work we have used classical shadows to extract a large number of covariances in the case where both the Hamiltonian and the operator pool are constructed of local Pauli strings. It is an interesting direction for further work to attempt to use other randomised measurement channels to measure covariances with similar efficiency for non-local Hamiltonians or operator pools. Several works have appeared recently that make significant progress by developing shadow-measurement channels that interpolate between Pauli measurements (as in this work) and the powerful global Clifford measurements [77,78]. These intermediate-depth techniques are amenable to NISQ devices and allow for the measurement of non-local properties. As a matter of fact, related techniques leveraging simultaneous measurements of commuting observables are also highly relevant given that they allow the efficient reconstruction of a large number of not necessarily local Pauli strings, which is crucial in applications of quantum chemistry [79-82].
Finally, we expect the present approach to be resilient against reasonable levels of experimental noise. First, our update rule in Eq. (10) is invariant under global depolarising noise when the expected value $\langle H \rangle$ is known to high precision, e.g., via well-established error-mitigation techniques [46-49]. Second, we numerically simulated an approximate noise model that goes beyond global depolarisation and we observed very good robustness against experimental noise. While these observations speak to the practicality of the present approach, we leave it to future work to confirm the performance on current and near-future generations of hardware.

VII. CONCLUSION
In this work we considered powerful variational quantum circuits that have been extensively investigated in the hope of exploiting near-term quantum computers. Most of these near-term quantum algorithms aim to encode the solution to a practical problem of interest into eigenstates of a Hamiltonian, typically the ground state. As a direct analogy to successful variational techniques in quantum chemistry, nearly all quantum variational algorithms so far have proceeded by posing the problem as a variational search. In this paradigm we minimise a single classical cost function, typically the expected value of a Hamiltonian, with respect to circuit parameters.
Our work opens a new research direction in the effort to achieve practical value with near-term quantum computers: we observe that the condition for finding eigenstates can be posed as finding joint roots of a large number of properties of the quantum state as covariance functions; these express fundamental quantum-mechanical uncertainty relations. We have devised the powerful root-finding technique CoVaR and demonstrated that increasing the number of these constraining covariances significantly increases the efficacy of the search procedure.
The most remarkable feature of CoVaR is that it allows us to fully exploit the extremely powerful classical shadow techniques in a way that prior variational techniques could not, i.e., we simultaneously estimate a very large number, e.g., $>10^4$ to $10^7$, of randomly chosen properties of the quantum state and their derivatives with respect to circuit parameters. These inform our search procedure via a large but tractable linear system of equations that we solve with a classical computer.
Our approach can be viewed as directly analogous to (stochastic gradient descent and) stochastic Levenberg-Marquardt techniques that have been extremely popular in the context of classical machine learning, and we generally expect CoVaR to inherit the fast convergence speed of Levenberg-Marquardt, as we indeed demonstrated in practical examples. In fact, Levenberg-Marquardt is the default and fastest method for training classical neural networks [22-25], but with the limitation that handling a large Jacobian becomes the bottleneck for very deep neural networks. In stark contrast, we view this limitation of the classical technique as a major advantage of our approach given that we can populate the large Jacobian using only logarithmic quantum resources. CoVaR thus allows us to offload non-trivial but tractable calculations onto the classical computer and combines the best of both worlds, i.e., fast convergence and fast (quantum) computation of the Jacobian. Furthermore, as we demonstrated, using a large number of randomly generated constraints makes CoVaR particularly robust against getting stuck in local traps, in analogy with stochastic techniques in the classical machine-learning context [22,43].
We proved that the quantum resources, using classical shadows, required for a single iteration of our procedure are comparable to those of a standard gradient estimation in conventional VQE. In addition to its significantly improved convergence speed, our approach exhibits robustness against shot noise and against experimental imperfections thanks to our large dataset. We have explored a number of practically motivated, important applications whereby the problem Hamiltonian is local, given that the classical shadow procedure is very NISQ-friendly in such scenarios, requiring only single-qubit measurements in a random basis. These include recompiling quantum circuits and finding ground and excited states. Our numerical simulations confirm the superiority of our approach and indicate that it can significantly outperform others by many orders of magnitude. Furthermore, previous techniques for finding excited states of Hamiltonians assumed a sequential search, whereas ours naturally converges to any of the eigenstates and can thus be applied naturally in this context. Similarly, recompilation is another natural set of problems for CoVaR given that any eigenstate of the problem Hamiltonian can be accepted as a solution. While the presented applications tackling local Hamiltonian problems are ideal for CoVaR, important quantum chemistry problems may be non-local and may thus be challenging for classical shadows depending on the encoding. Fortunately, two fields of active research are making progress to alleviate this issue: first, compact fermion encodings result in local Hamiltonians at the cost of a modest qubit overhead, and second, recent advancements in classical shadows allow for efficiently measuring non-local properties or, specifically, fermionic operators [77,78,83,84]. It is thus expected that the present approach will be highly competitive and will spark further developments in the field.
Finally, our work makes exciting connections to various fields, including fundamental uncertainty relations in quantum mechanics as covariances, exploitation of classical shadows, stochastic optimisation in machine learning and working with big data. We believe it will be interesting to explore these connections to further improve the presented techniques in the hope of achieving practical value with near-term quantum computers.

ACKNOWLEDGMENTS
We thank Jonathan Foldager, Hsin-Yuan Huang and Suguru Endo for providing us with useful comments. We thank Simon C Benjamin for his support throughout this work. B.K. conceived the idea and contributed to writing the manuscript, G.B. performed numerical simulations and contributed to writing the manuscript. G.B. and B.K. acknowledge the EPSRC Hub grant under the agreement number EP/T001062/1 for hardware provision. B.K. thanks the University of Oxford for a Glasstone Research Fellowship and Lady Margaret Hall, Oxford for a Research Fellowship. The numerical modelling involved in this study made use of the Quantum Exact Simulation Toolkit (QuEST), and the recent development QuESTlink [85] which permits the user to use Mathematica as the integrated front end. We are grateful to those who have contributed to both these valuable tools.

Appendix A: Orthogonal constraints
We presented the general theory of our approach in Section II, whereby we compute covariances with respect to an arbitrary operator pool $P$ in order to search for eigenstates of an arbitrary Hamiltonian $H$. While our practically motivated CoVaR approach leverages powerful classical shadows, it restricts the problem Hamiltonian and the operator pool to local Pauli strings. While Pauli strings are orthonormal in operator space, their actions on quantum states are generally not orthogonal directions in Hilbert space. In the present section we explore another kind of operator pool $P$ whereby the operators represent orthogonal directions in state space. Let us start by interpreting covariances as overlaps in Hilbert space.

Interpretation as state overlaps
Lemma 3. The covariances from Definition 1 can be interpreted as overlaps between quantum states as $\langle A, B \rangle_\psi = \langle \phi_A | \phi_B \rangle$, where we define the (unnormalised) vectors $|\phi_A\rangle := (A - \langle A \rangle)|\psi\rangle$ for any operator $A \in \mathbb{C}^{d\times d}$, with norm $\sqrt{\langle \phi_A | \phi_A \rangle} = \sqrt{\mathrm{Var}[A]}$.

Proof. The above property immediately follows from the defining expression of covariances in Definition 1.
The above lemma informs us that in Theorem 1 and in Corollary 3 we compute overlaps as $\langle O_k, H \rangle_\psi = \langle \phi_k | \phi_H \rangle$ and thus we actually decompose the quantum state $H|\psi\rangle$ into a set of quantum states $|\phi_l\rangle$ that we obtain by acting on $|\psi\rangle$ with elements of our operator pool. Given that $|\phi_H\rangle$ must be the null vector when the eigenvalue equation is satisfied, it is necessary that any (non-parallel) overlap with this vector must vanish.
While Pauli strings form an orthonormal basis of operator space, they have the disadvantage that in Hilbert space they result in non-orthogonal actions $\langle \phi_k | \phi_l \rangle \neq \delta_{kl}$, i.e., we decompose the vector $H|\psi\rangle$ into a non-orthogonal basis. On the other hand, it is possible to define an operator pool that corresponds to orthogonal directions in Hilbert space.
Lemma 4 (Orthogonal operator pool). Let us consider strings of $X$ Pauli operators $X_k \in \{\mathrm{Id}, X\}^{\otimes N}$ using the binary index $k \in \{0,1\}^N$. We define the orthogonal operator pool via the operators $P := \{\, O_k := U X_k U^\dagger,\; k \in \{0,1\}^N \,\}$, where the unitary quantum circuit $U$ maps our reference state onto our quantum state as $|\psi\rangle = U|0\rangle$. The quantum states $|\phi_k\rangle := O_k|\psi\rangle$ form an orthonormal system $\langle \phi_k | \phi_l \rangle = \delta_{kl}$ and thus the operators $O_k$ map to orthogonal directions in Hilbert space.
Proof. Orthonormality in Hilbert space follows from $\langle O_k \rangle = \delta_{k0}$ and via $|\phi_k\rangle = U X_k U^\dagger U|0\rangle = U X_k |0\rangle = U|k\rangle$, where $|k\rangle$ are standard basis vectors with $\langle k | l \rangle = \delta_{kl}$.
The resulting covariances can actually be shown to be entries in a column vector of the Hamiltonian matrix. In particular, given that the states $|\phi_k\rangle$ form an orthonormal basis, they can be used to represent the Hamiltonian matrix, and the covariances $f_k = \langle \psi | O_k H | \psi \rangle = \langle \phi_k | H | \phi_0 \rangle = [\mathrm{Col}_0(H)]_k$ are then actually elements of the first column vector of the problem Hamiltonian in this basis. These covariances, as entries of the Hamiltonian matrix, can be computed using the Hadamard-test techniques presented in ref. [86]. In particular, the covariances are obtained by applying our variational quantum circuit $U$ onto the standard computational basis states as $|\phi_k\rangle = U X_k U^\dagger |\psi\rangle = U|k\rangle$, and we compute the overlap of this state with our variational quantum state $|\psi\rangle$. The approach can straightforwardly be implemented via the Hadamard test, whereby we apply the $X_k$ operations in $|k\rangle = X_k|0\rangle$ controlled on an ancilla qubit. Derivatives of these covariances can be similarly computed by applying the generator of the quantum gate controlled on the same ancilla. Let us now show that the sum of squares of the covariances is equivalent to the variance of the Hamiltonian (Lemma 5). Proof.
As the $|\phi_k\rangle$ form a complete basis set, being a unitary transformation of the computational basis, $\sum_k |\phi_k\rangle\langle\phi_k| = \mathrm{Id}$, and we therefore obtain the claimed identity. Importantly, while this operator pool has the advantage that the covariances represent independent, orthogonal directions in state space, it is clear that we would generally need to compute all $2^N - 1$ of these orthogonal constraints, as elements of the first column of the Hamiltonian matrix, in order to compute the variance and thus verify that the quantum state $|\psi\rangle$ is an eigenstate of the Hamiltonian. In stark contrast, in Theorem 1 we have shown that given a decomposition of a Hamiltonian into an operator basis $H = \sum_{a=1}^{r} h_a H_a$, where $r$ typically grows polynomially with the system size, it suffices to compute only the corresponding polynomial number of covariances. Although these operators, such as Pauli strings, are orthonormal in operator space, they do not correspond to orthogonal directions in Hilbert space.
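These relations are easy to verify in a small statevector simulation. The sketch below uses a random toy $H$ and a random unitary $U$ on two qubits (stand-ins for the problem Hamiltonian and circuit, not an actual ansatz) to check that the covariances are the first-column entries of the rotated Hamiltonian and that their squared magnitudes sum to $\mathrm{Var}[H]$, as in Lemma 5.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy setup: random Hermitian Hamiltonian and random "circuit" unitary, d = 4
d = 4
A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
H = (A + A.conj().T) / 2
U, _ = np.linalg.qr(rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)))
psi = U[:, 0]                      # |psi> = U|0>

Hrot = U.conj().T @ H @ U          # H in the orthonormal basis |phi_k> = U|k>
f = Hrot[1:, 0]                    # covariances f_k = <phi_k|H|phi_0>, k >= 1

variance = (psi.conj() @ (H @ H) @ psi - (psi.conj() @ H @ psi) ** 2).real
print(np.isclose(np.sum(np.abs(f) ** 2), variance))  # → True
```

The identity holds because $\sum_k |\phi_k\rangle\langle\phi_k| = \mathrm{Id}$, so the off-diagonal column entries resolve $\langle H^2\rangle - \langle H\rangle^2$ exactly.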

Finding eigenstates via orthogonal operator pools
Let us now apply our orthogonal constraints to finding eigenstates by finding roots. It is interesting to note that it follows from our relation in Lemma 5 that the Newton step through the inverse Jacobian from Eq. (10), $J^{-1}\boldsymbol f$, is guaranteed to represent a descent direction for the variance $\mathrm{Var}[H]$, given that the gradient vector satisfies $\nabla(\|\boldsymbol f\|^2) = 2 J^\dagger \boldsymbol f = \nabla(\mathrm{Var}[H])$.
Furthermore, we have written our problem as a least-squares minimisation of the constraints $f_k$ and thus the Gauss-Newton and the Levenberg-Marquardt approaches can be interpreted straightforwardly: our root-finding approach is equivalent to a non-linear least-squares minimisation.
An advantage of this scheme is that we can randomly sample the constraints according to importance sampling, i.e., a constraint is picked with a probability proportional to its magnitude as $p_k \propto |f_k|^2$. We can efficiently estimate these probabilities in an experiment via $p_k = |\langle \phi_k | H | \psi \rangle|^2 = |\langle k | U^\dagger H | \psi \rangle|^2$. In particular, we run the quantum circuit $U^\dagger H_a U$ and measure samples in the standard basis, whereby we obtain the binary string $k$ with probability $|\langle k | U^\dagger H_a | \psi \rangle|^2$. It then follows that we can upper bound $p_k \le \sum_a c_a |\langle k | U^\dagger H_a | \psi \rangle|^2$. There is of course no guarantee that these probabilities have structure; however, when performing energy minimisation first, the probabilities are more likely to be peaked around the lower-energy basis vectors $|\phi_k\rangle$.
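A statevector sketch of this importance-sampling step might look as follows, with a random toy $H$ and random unitary $U$ (our stand-ins for the Hamiltonian and circuit): constraint indices are drawn with probability proportional to $|\langle k|U^\dagger H|\psi\rangle|^2$, and the empirical frequencies reproduce the distribution.

```python
import numpy as np

rng = np.random.default_rng(1)

# toy setup: random Hermitian H and random "circuit" U on 3 qubits (d = 8)
d = 8
A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
H = (A + A.conj().T) / 2
U, _ = np.linalg.qr(rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)))
psi = U[:, 0]                           # |psi> = U|0>

amps = U.conj().T @ (H @ psi)           # <k|U^dag H|psi> for every basis state k
p = np.abs(amps) ** 2
p = p / p.sum()                         # importance-sampling distribution

ks = rng.choice(d, size=2000, p=p)      # sampled constraint indices
freq = np.bincount(ks, minlength=d) / len(ks)
```

On hardware the distribution would of course be sampled directly by measuring in the standard basis rather than computed from the statevector.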
In the case of finding eigenstates of a set of mutually commuting Hamiltonians $H_a$ we can compute covariances $\langle \phi_k | H_a | \psi \rangle$ individually for all operators $H_a$. If all such covariances vanish then we are guaranteed that $\mathrm{Var}[H_a] = 0$ for all $a$. In Fig. 10 we compare the performance of root finding for two choices of operator pool on a parameter-rediscovery problem (3-local Pauli strings and the orthogonal operator pool), showing that both pools give very similar performance.
Let us first compute covariances assuming the Hamiltonian $H = \sum_{a=1}^{r} h_a H_a$ is given in terms of Pauli strings $H_a \in \{\mathrm{Id}_2, X, Y, Z\}^{\otimes N}$. The covariances $f_k = \langle O_k, H \rangle_\psi = A_{\mathrm{re}} + i A_{\mathrm{im}} - BC$ are completely determined by the expected values of Hermitian operators from Eq. (B2). We can significantly simplify these when $O_k$ and $H_a$ are Pauli strings: any two Pauli strings $O_k$ and $H_a$ either commute or anticommute, and their product is accordingly either a Hermitian Pauli string $P_{ka}$ or $i$ times a Hermitian Pauli string $Q_{ka}$, where indeed $P_{ka}, Q_{ka} \in \pm\{\mathrm{Id}_2, X, Y, Z\}^{\otimes N}$. Here the particular Pauli strings $P_{ka}$ and $Q_{ka}$ and their signs $\pm$ can be determined straightforwardly and efficiently from the indexes $k, a \in \{0,1,2,3\}^N$ using the algebra of Pauli matrices, i.e., the Pauli group. We therefore conclude that the covariances can be computed in terms of only expected values of Hermitian Pauli strings. We need to estimate overall $3r$ expected values of Pauli strings to estimate a covariance $f_k$, given $P_{ka} = 0$ when $Q_{ka} \neq 0$ and vice versa.
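The Pauli-group bookkeeping above is purely classical. A minimal sketch (our own illustrative helper, not the authors' implementation) computes the product string and its phase qubit by qubit and verifies the result against explicit matrices.

```python
import numpy as np
from functools import reduce

PAULIS = {"I": np.eye(2, dtype=complex),
          "X": np.array([[0, 1], [1, 0]], dtype=complex),
          "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
          "Z": np.array([[1, 0], [0, -1]], dtype=complex)}

def string_matrix(s):
    return reduce(np.kron, [PAULIS[c] for c in s])

def pauli_product(s1, s2):
    """Return (phase, string) such that matrix(s1) @ matrix(s2)
    == phase * matrix(string); phase is one of {1, -1, 1j, -1j}."""
    cyclic = {("X", "Y"): (1j, "Z"), ("Y", "Z"): (1j, "X"), ("Z", "X"): (1j, "Y"),
              ("Y", "X"): (-1j, "Z"), ("Z", "Y"): (-1j, "X"), ("X", "Z"): (-1j, "Y")}
    phase, out = 1 + 0j, []
    for a, b in zip(s1, s2):
        if a == "I":
            out.append(b)
        elif b == "I":
            out.append(a)
        elif a == b:
            out.append("I")
        else:
            ph, c = cyclic[(a, b)]
            phase *= ph
            out.append(c)
    return phase, "".join(out)

phase, prod = pauli_product("XZ", "YZ")   # (X⊗Z)(Y⊗Z) = i (Z⊗Id)
assert np.allclose(string_matrix("XZ") @ string_matrix("YZ"),
                   phase * string_matrix(prod))
```

An imaginary phase signals the anticommuting ($Q_{ka}$) case, a real phase the commuting ($P_{ka}$) case.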
We consider the covariances $f_k(\boldsymbol\theta)$ with respect to Pauli strings $O_k \in P$ in our operator pool as defined in Eq. (6) for a fixed Hamiltonian $H = \sum_{a=1}^{r} h_a H_a$. Recall that the Jacobian is defined in terms of the partial derivatives $J_{kn} := \partial_n f_k(\boldsymbol\theta)$. We can explicitly compute these derivatives by recalling that the covariances can be expressed in terms of expected values of the Pauli strings $P_{ka}$, $Q_{ka}$, $H_a$ and $O_k$. We can use well-established techniques for experimentally estimating derivatives of general expected values for a variety of ansatz constructions and gatesets [9-11]. Furthermore, in Section III we focus on the typical practical scenario when the ansatz circuit consists of Pauli gates and thus we can use parameter-shift rules [55] for computing derivatives (while generalisations in [88,89] are also applicable) as linear combinations of two expected values, $\partial_n \langle O \rangle(\boldsymbol\theta) = \tfrac{1}{2}[\langle O \rangle(\boldsymbol\theta + \tfrac{\pi}{2} v_n) - \langle O \rangle(\boldsymbol\theta - \tfrac{\pi}{2} v_n)]$, for any Hermitian observable $O$, where $v_n$ is the $n$-th standard Euclidean basis vector. As such, we can compute all derivatives in Eq. (C1) by applying parameter-shift rules.
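As a quick sanity check, the parameter-shift rule can be verified against a central finite difference for a single Pauli rotation; the one-qubit toy circuit below is our own choice for illustration.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.diag([1.0, -1.0]).astype(complex)

def expval(theta):
    # <0| U^dag Z U |0> with the parametrised Pauli gate U = exp(-i theta X / 2)
    U = np.cos(theta / 2) * np.eye(2) - 1j * np.sin(theta / 2) * X
    psi = U @ np.array([1, 0], dtype=complex)
    return (psi.conj() @ Z @ psi).real

theta = 0.7
shift_rule = 0.5 * (expval(theta + np.pi / 2) - expval(theta - np.pi / 2))
finite_diff = (expval(theta + 1e-6) - expval(theta - 1e-6)) / 2e-6
print(np.isclose(shift_rule, finite_diff, atol=1e-5))  # → True
```

Unlike the finite difference, the shifted evaluations are at macroscopically separated parameters, which is what makes the rule robust to shot noise on hardware.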

Computing the Jacobian using classical shadows
Let us now describe an explicit measurement protocol in the specific case when our Hamiltonian is local and the operator pool P = {O k } Nc k=1 is also local as in Statement 1 and thus we can use classical shadows to determine a large number M of local Pauli-string expected values. We then compute the large Jacobian by estimating Pauli strings at different shifted circuit parameters via the parameter-shift rules for computing partial derivatives.
Lemma 6. Given an operator pool $P = \{O_k\}_{k=1}^{N_c}$ and a problem Hamiltonian $H = \sum_{a=1}^{r} h_a H_a$ in terms of local Pauli strings $O_k$ and $H_a$, and a variational circuit with $\nu$ parametrised Pauli gates, we can determine the Jacobian $J \in \mathbb{C}^{N_c \times \nu}$ by applying the classical shadow procedure $2\nu + 1$ times at different circuit-parameter configurations, with each application estimating $M \le 3 r N_c$ Pauli expected values. In contrast, determining a gradient vector for gradient descent requires $2\nu$ applications of the classical shadow procedure, each with $M = r$.
Proof. First, at parameters $\boldsymbol\theta$ we determine overall $M = r + N_c$ expected values $\langle O_k \rangle$ and $\langle H_a \rangle$ from Eq. (C1) using a single application of the classical shadow procedure. Second, we estimate derivatives of expected values ($\partial \langle H_a \rangle / \partial\theta_n$, $\partial \langle O_k \rangle / \partial\theta_n$ and either $\partial \langle P_{ka} \rangle / \partial\theta_n$ or $i\,\partial \langle Q_{ka} \rangle / \partial\theta_n$) via the above parameter-shift rule. We determine all derivatives with respect to a fixed parameter $\theta_n$ by estimating Pauli strings at parameters $\boldsymbol\theta + v_n \pi/2$ as well as at parameters $\boldsymbol\theta - v_n \pi/2$, i.e., by two applications of the classical shadow procedure, each with $M \le r + N_c + r N_c$. Thus determining all derivatives requires $2\nu$ applications of the classical shadow procedure.

Computing the variance gradient from the Jacobian
The gradient of the variance can be computed via Lemma 1 when the operator pool is $P = \{H_a\}_{a=1}^{r}$; this confirms that the gradient of the variance is determined by our Jacobian. We can also compute the gradient of the vector norm $\|\boldsymbol f\|^2 = \sum_{k=1}^{r_p} |f_k|^2$. When the operator pool is larger than the set of problem-Hamiltonian terms, we compute the gradient of the norm of the covariance vector as $\nabla \|\boldsymbol f\|^2 = 2 \bar J^{T} \bar{\boldsymbol f}$, where in $\bar J$ and $\bar{\boldsymbol f}$ we have stacked real and imaginary parts on top of each other.

[Figure caption: Gradient descent (orange) for 1000 adaptive-size steps to reach local minima, followed by 100 iterations of CoVaR (blue). Several steps of CoVaR were sometimes required before finding a set of operators that produced a step escaping the minimum. Mean energy differences from the ground state: $\Delta E_{\mathrm{local\,min}} = 0.019$, $\Delta E_{\mathrm{CoVaR}} = 1.8\times10^{-4}$. The last step of each gradient-descent run had an energy improvement below $2\times10^{-5}$, indicating a local trap. Note that a single iteration of CoVaR does not necessarily decrease the energy given that CoVaR is not an energy minimiser, as demonstrated here and also visible in Fig. 8.]

Computing the update rule classically
To solve for real parameter updates only, we first need to stack real and imaginary parts on top of each other. In particular, we want to solve the linear system of equations $J \Delta\boldsymbol\theta = \boldsymbol f$, where we estimate both $J$ and $\boldsymbol f$ with a quantum computer and we want to compute the parameter update $\Delta\boldsymbol\theta$. If we merely compute the pseudoinverse of $J$ and apply it as $J^{-1} \boldsymbol f$ then the resulting solution vector $\Delta\boldsymbol\theta$ is generally complex. However, we require that our parameter update be real, $\Delta\boldsymbol\theta \in \mathbb{R}^{\nu}$, which we enforce via the regularised set of linear equations $(\bar J^{T} \bar J + \lambda R)\,\Delta\boldsymbol\theta = \bar J^{T} \bar{\boldsymbol f}$, where $\bar J$ and $\bar{\boldsymbol f}$ stack real and imaginary parts on top of each other. Here $\lambda$ is a regularisation parameter that we dynamically set by choosing $\lambda = 0.0001 \times 2^i$, where $i$ is incremented from 0 until the condition $\|\boldsymbol f(\boldsymbol\theta_t)\| < \|\boldsymbol f(\boldsymbol\theta_{t-1})\|$ is met, i.e., until the norm of the covariance vector has decreased from the previous iteration. If the step $\Delta\boldsymbol\theta$ to be taken is too large for any individual parameter, $|\Delta\theta_i| > 1$, the update step is rescaled to normalise this value to 1. This prevents our algorithm from taking overly large steps. For the selection of constraints, $N_c$ constraints are selected randomly from the chosen operator pool at every iteration.
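A classical stand-in makes the update loop concrete. The sketch below replaces the quantum-estimated covariance vector with a hypothetical residual $f_k(\theta) = \sin\theta_k$ and uses $R = \mathrm{Id}$ (the text leaves $R$ generic), but otherwise follows the scheme above: double $\lambda = 10^{-4} \times 2^i$ until the norm decreases, clip oversized steps, and take the canonical $\kappa = -1$ step.

```python
import numpy as np

def residuals(theta):
    # classical stand-in for the covariance vector f(theta); joint root at theta = 0
    return np.sin(theta)

def jacobian(theta):
    return np.diag(np.cos(theta))

def covar_style_step(theta):
    """One Levenberg-Marquardt update: solve (J^T J + lam R) dtheta = J^T f,
    doubling lam until the residual norm decreases, and clip large steps."""
    J, f = jacobian(theta), residuals(theta)
    R = np.eye(len(theta))                     # simple choice of regularisation matrix
    lam = 1e-4
    while lam < 1e6:
        dtheta = np.linalg.solve(J.T @ J + lam * R, J.T @ f)
        big = np.max(np.abs(dtheta))
        if big > 1:                            # rescale overly large individual steps
            dtheta = dtheta / big
        new_theta = theta - dtheta             # canonical step kappa = -1
        if np.linalg.norm(residuals(new_theta)) < np.linalg.norm(f):
            return new_theta
        lam *= 2                               # increase damping and retry
    return theta                               # no improving step found

theta = 0.3 * np.ones(5)
for _ in range(10):
    theta = covar_style_step(theta)
print(np.linalg.norm(residuals(theta)))  # converges rapidly towards the root
```

For this well-conditioned toy problem the loop accepts the near-Newton step almost immediately and converges superlinearly; on noisy quantum-estimated data the damping loop does the real work.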
We also implemented a line-search algorithm whereby we compute the update $\Delta\boldsymbol\theta$ and evaluate the vector norm $\|\boldsymbol f(\boldsymbol\theta)\|$ at parameter values $\boldsymbol\theta = \boldsymbol\theta_t + \kappa \Delta\boldsymbol\theta$ for small increments in $\kappa$. However, we observed empirically that a step close to the canonical choice $\kappa = -1$ was almost always chosen, so the line search was not used in the numerics in this work.

Existence of a shot-noise floor
We extend our example in Section III A and consider noisy entries of the Jacobian and of the covariance vector as $J_k + \epsilon_k$ and $f_k(\boldsymbol\theta) + \eta_k$ for random variables $\epsilon_k$ and $\eta_k$ due to shot noise (and possibly other sources of random error). The solution to our linear equation in Eq. (11) is modified accordingly, where we drop all higher-order terms, such as the products $\epsilon_k \eta_k$, as well as the term $\mathcal{O}(\epsilon_k/[\sum_k J_k^2]^2)$. Considering the error propagation to the solution $\Delta\boldsymbol\theta$ via the linear error-propagation formula, we find that by increasing the number of constraints we gain increasingly more information about the root; the random error that propagates into our solution, however, does not scale with the number of constraints.

[Figure caption: probability of converging to the ground state with a given initial overlap (convergence defined as an energy difference below $10^{-3}$) as a function of the initialisation energy. Each bar corresponds to one point in Fig. 9, with two additional points on the right of the figure.]
Time complexity of classically solving the linear system of equations

Let us now derive the time complexity of computing the regularised inverse $(\bar J^{T} \bar J + \lambda R)^{-1} \bar J^{T} \bar{\boldsymbol f}$ of the Jacobian $\bar J \in \mathbb{R}^{2N_c \times \nu}$, in which we have stacked real and imaginary parts on top of each other and $R$ is a regularisation matrix. This computation can be broken up into four steps.
• First, compute the square matrix $A := \bar J^{T} \bar J + \lambda R$ as the product of two non-square matrices via $[A]_{mn} = \lambda R_{mn} + \sum_{k=1}^{2N_c} \bar J_{km} \bar J_{kn}$, which requires $\mathcal{O}(\nu^2 N_c)$ operations.
• Second, compute the vector $v := \bar J^{T} \bar{\boldsymbol f}$, which requires $\mathcal{O}(\nu N_c)$ operations.
• Third, invert the $\nu \times \nu$ matrix $A$, which requires $\mathcal{O}(\nu^3)$ operations.
• Finally, we compute the matrix-vector product $A^{-1} v$, which requires $\nu^2$ operations.
Given $N_c \gg \nu$, the overall computation time is dominated by the first step, computing $A$, and thus the time complexity is $t \in \mathcal{O}(\nu^2 N_c)$, which is merely linear in the dominant dimension $N_c$. We confirm this theoretical scaling in Fig. 5 and conclude that the absolute times are very reasonable, i.e., we can compute the update rule in a matter of minutes for up to a very large number of covariances $N_c = 10^6$. Of course, the computation can be heavily parallelised and can also be performed in a distributed-memory model with negligible communication between nodes. As such, in principle one could straightforwardly use a very large number of covariances $N_c \approx 10^8$ that would still require only reasonable classical computational resources.

While in Section IV B we focused on practical applications of CoVaR, here we demonstrate a comparison of convergence speed between gradient descent and CoVaR. In particular, in Fig. 12 we compare the speed of convergence to the ground state of the spin chain, using an identical ansatz and hyperparameters as those used for runs of CoVaR in Fig. 8.

Demonstration of escaping local traps
As we noted in the main text, even when gradient descent is stuck in a local trap due to the vanishing gradient of the energy surface, CoVaR can be used to navigate out of such a trap given that CoVaR is not an energy minimiser and may yield a non-zero step, especially since the constraints that determine a CoVaR step are generated randomly at every iteration. We numerically demonstrate this in Fig. 13 using our spin-chain problem, with all parameters the same as those used for Fig. 8. The local traps were found using 1000 iterations of gradient descent with an adaptive step size, achieving a mean energy difference from the ground state of $\Delta E = 0.019$. CoVaR was then run for 100 further iterations and was able to achieve a mean energy difference of $\Delta E = 1.8 \times 10^{-4}$.

Convergence of root finding to eigenstates
In this section we further analyse the property of CoVaR that it converges to eigenstates that have a dominant contribution to the initial state, as reported in Fig. 9. For this reason, in Fig. 14 we performed simulations of CoVaR optimisations with initial states of increasing expected energy, starting near the ground state.
Recall that the expected value of the energy can be written as $E = \langle \psi | H | \psi \rangle = \sum_k E_k p_k$, where $E_k$ are eigenenergies (eigenvalues) of $H$ and $p_k$ are the probabilities (fidelities) that $|\psi\rangle$ is in the $k$-th eigenstate of $H$.
Given a gapped Hamiltonian with energies $E_0 < E_1 < E_2 < \dots$, quantum states with a low energy must necessarily have a high probability (fidelity) of being in the lowest-lying eigenstates. As such, an initial state whose expected energy $E = E_0 p_0 + E_1 p_1 + \dots$ is close to the ground-state energy, $E \approx E_0$, guarantees the high probability (fidelity) $p_0 \approx 1$. Indeed, CoVaR nearly always converges to the ground state in Fig. 14 (bars on the left) when $E \approx E_0$ and thus $p_0 \approx 1$.
On the other hand, starting CoVaR from initial states that have higher expected energies than $E_0$ causes CoVaR to not always converge to the ground state due to the necessarily increased populations of excited states: Fig. 14 (bars in the middle) features instances where CoVaR does not converge to the ground state; however, it then nearly always converges to the first excited state. For example, when no higher eigenstates are populated, the probability of the first excited state is $p_1 = 1 - p_0$. Interestingly, we find a nearly linear relationship between the probability of an eigenstate, such as $p_0$, and the percentage of runs in which CoVaR converges to that state in Fig. 9; note that this relationship would be exactly linear for fault-tolerant phase-estimation protocols. As we keep increasing the energy, higher excited states start to contribute, as in Fig. 14 (bars on the right), and thus CoVaR may also converge to those eigenstates.

Scaling of performance
We performed simulations to assess the scaling of the performance of CoVaR for both the recompilation and spin-chain problems and plot the results in Fig. 15. For the recompilation problem we knew the parameters $\boldsymbol\theta$ of the ground state by construction, and the circuit depth was constant due to a fixed number (2) of ansatz layers for all qubit counts $N$. For the spin-chain problem we first searched for the solution at each qubit number $N$ using natural gradient descent followed by CoVaR. The number of ansatz layers was then increased until a desired precision with respect to the ground state was achieved (here an infidelity of $10^{-5}$), thus obtaining a series of solution parameters $\boldsymbol\theta$ at every qubit count.
These solution parameters were perturbed to obtain initial states for CoVaR with a desired initial overlap with the ground state (50% for recompilation and 80% for the spin chain, to suppress convergence to excited states). Of course, such states could equally well have been created through alternative initialisation methods, including performing an initial period of gradient or natural gradient descent. Finally, CoVaR was run for a fixed number (20) of iterations, and statistics of the final achieved fidelity are plotted in Fig. 15. The left panel of Fig. 15 shows simulations for our recompilation problem from Section IV A using a constant initial fidelity $F_\mathrm{init} = 50\%$ and a constant ansatz depth of 2 layers. The right panel shows simulations for the spin-chain problem using an increasing ansatz depth that can estimate the ground state to an infidelity of at least $10^{-5}$; an initial overlap of $F_\mathrm{init} = 80\%$ was used to suppress convergence to excited states, and only the runs that did converge to the ground state (and not to an excited state) are included in these statistics. Compared to our recompilation simulations (left panel) using a constant ansatz depth, here the performance may appear to decrease slightly as we increase the number of qubits, which would be explained by the increasing depth of the ansatz circuit (polylogarithmic, see Section III A).

Appendix E: Details of numerical simulations
We performed all statevector simulations using the open-source tools QuEST [90] and its high-level Mathematica-based interface QuESTlink [85]. Shot noise was simulated by adding Gaussian noise of standard deviation $1/\sqrt{N_s}$ to the computed matrix and vector elements. Numerics for recompilation and the spin chain were both performed using a hardware-efficient ansatz of the form in Fig. 11.
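The shot-noise emulation described above can be sketched as follows; this is a minimal illustration on placeholder data, not the QuEST/QuESTlink implementation itself:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_shot_noise(elements, n_shots, rng=rng):
    """Emulate finite-shot sampling by adding Gaussian noise of standard
    deviation 1/sqrt(n_shots) to each exactly computed element."""
    elements = np.asarray(elements, dtype=float)
    return elements + rng.normal(0.0, 1.0 / np.sqrt(n_shots), size=elements.shape)

# Example: a 4x4 matrix of exact (here placeholder) covariance elements.
exact = np.zeros((4, 4))
noisy = add_shot_noise(exact, n_shots=10_000)  # noise std = 0.01
```

The same noise scale is applied to both the matrix and vector elements of the linear system, mimicking the statistical uncertainty of a measurement with $N_s$ shots.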

Effect of Constraint Number on Performance
Data for Fig. 3 and Fig. 6 was obtained by simulating a 14-qubit parameter-rediscovery problem using 2 layers of the ansatz in Fig. 11. The initial states were prepared close to the solution by randomly perturbing the solution parameters, resulting in an initial average fidelity of $F = 46 \pm 7\%$. Fits in these figures are of the form $1 - F_\mathrm{min} = a (N_c/\nu)^{-b} + c$, and we report the fitted parameters in Table I. The orange line in Fig. 10 is identical to the orange line in Fig. 6.
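A power-law fit of this form can be extracted as sketched below; for illustration we assume the offset $c$ is negligible, so that a log-log linear regression suffices (the values of $a$ and $b$ here are hypothetical, and the paper's actual fits include $c$):

```python
import numpy as np

# Synthetic data following 1 - F_min = a * (N_c / nu)^(-b), with assumed
# (hypothetical) values a = 0.5, b = 1.5 and nu = 196 ansatz parameters.
a_true, b_true, nu = 0.5, 1.5, 196
n_c = np.array([50, 100, 200, 400, 800, 1600])   # numbers of constraints
infidelity = a_true * (n_c / nu) ** (-b_true)

# With c ~ 0, taking logs linearises the model:
#   log(1 - F_min) = log(a) - b * log(N_c / nu)
slope, intercept = np.polyfit(np.log(n_c / nu), np.log(infidelity), 1)
a_fit, b_fit = np.exp(intercept), -slope
```

A nonlinear least-squares routine would be used instead when the offset $c$ must be fitted as well.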
The noisy simulations were performed assuming the following simplified noise model. Recall that global depolarising noise is a relatively good approximation in complex quantum circuits and becomes near-exact for random circuits; refer to ref. [91] for rigorous bounds. This error channel acts on any density matrix via the Kraus map $\mathcal{D}(\rho) = F \rho + (1 - F)\rho_\mathrm{max}$, where $\rho_\mathrm{max}$ is the maximally mixed state, i.e., white noise, and $F$ is the fidelity. In practice this error model does not capture more subtle physical processes that corrupt the expected-value measurement. Nevertheless, it was shown in ref. [48] that nearly all error models typically encountered in practice admit the decomposition $F \rho + (1 - F)\rho_\mathrm{err}$, where $\rho_\mathrm{err} \approx \rho_\mathrm{max}$ is an error density matrix that we do not expect to be exactly the maximally mixed state, albeit in practice it is relatively close to it via its vanishing commutator norm from ref. [48]. In order to go beyond global depolarisation, but without resorting to computationally infeasible explicit noise simulations, we compute the noisy expected value as $\langle O_k \rangle = \mathrm{Tr}[O_k \rho] = F \langle O_k \rangle_\mathrm{id} + (1 - F)\,\mathrm{Tr}[O_k \rho_\mathrm{err}]$ by approximating the term $\mathrm{Tr}[O_k \rho_\mathrm{err}] \sim \mathcal{N}(0, \sigma^2)$ using random Gaussian numbers that are unique to each observable indexed by $k$. Here we set $\sigma = 0.01$, which is determined by the distance of $\rho_\mathrm{err}$ from the maximally mixed state. Furthermore, we choose the fidelity $F = 0.9 \approx (1-\epsilon_1)^{\nu_1}(1-\epsilon_2)^{\nu_2}$ such that it approximates the performance of a typical, state-of-the-art experimental device with two-qubit error rates $\epsilon_2 = 0.001$ and single-qubit error rates 4 times smaller, $\epsilon_1 = 0.25\,\epsilon_2$, given that in our circuit we have $\nu_1 = 196$ and $\nu_2 = 52$ single- and two-qubit gates, respectively.

Fig. 7

The numerics for recompilation in Fig. 7(c) were done on a 10-qubit parameter-rediscovery problem for two layers of HEA ($\nu = 88$). Gradient descent was performed with a learning rate $\eta = 0.1$ for both VQE and V-VQE.
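The simplified noise model above can be sketched numerically as follows; the error rates and gate counts are those quoted in the text, while the function name is of course illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Error rates and gate counts from the text.
eps2 = 0.001          # two-qubit error rate
eps1 = 0.25 * eps2    # single-qubit error rate, 4 times smaller
nu1, nu2 = 196, 52    # number of single- and two-qubit gates

# Circuit fidelity under independent gate errors; evaluates close to F = 0.9.
F = (1 - eps1) ** nu1 * (1 - eps2) ** nu2

def noisy_expectation(ideal_value, sigma=0.01, rng=rng):
    """Noisy <O_k>: depolarising-like damping of the ideal expected value,
    plus a Gaussian surrogate for Tr[O_k rho_err] drawn per observable."""
    return F * ideal_value + (1 - F) * rng.normal(0.0, sigma)
```

Each observable $O_k$ gets its own Gaussian draw, so the residual error is not simply a global rescaling of all expected values.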

Spin-chain simulations
The Hamiltonian in Eq. (13) was used with parameters $J = 0.1$ and $c_i$ chosen randomly between $-1$ and $1$. For the spin-chain simulations in Fig. 8, imaginary-time evolution was used from a random initialisation until an energy of $E = -5.9$ was reached, yielding parameters $\theta_\mathrm{imag}$. These parameters were then perturbed by $|\Delta\theta_k| \le 0.05$ to produce 7 low-energy states, and CoVaR was then run from these initial states for 40 iterations. Imaginary-time evolution was also continued from $\theta_\mathrm{imag}$ until convergence, reaching an energy difference to the ground state of $\Delta E = 0.012$, compared to $4 \times 10^{-4}$ for CoVaR.
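Since Eq. (13) is given in the main text and not reproduced here, the following sketch builds a stand-in spin-chain Hamiltonian of the same general flavour; the Heisenberg-type form with random on-site $Z$ fields is an assumption for illustration only:

```python
import numpy as np

# Hypothetical stand-in for Eq. (13):
#   H = J * sum_i (X_i X_{i+1} + Y_i Y_{i+1} + Z_i Z_{i+1}) + sum_i c_i Z_i
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

def op_on(site_ops, n):
    """Tensor a {site: single-qubit operator} map into an n-qubit operator."""
    out = np.array([[1.0 + 0j]])
    for i in range(n):
        out = np.kron(out, site_ops.get(i, I2))
    return out

def spin_chain_hamiltonian(n, J=0.1, rng=np.random.default_rng(2)):
    c = rng.uniform(-1, 1, size=n)  # random on-site coefficients in [-1, 1]
    H = sum(c[i] * op_on({i: Z}, n) for i in range(n))
    for i in range(n - 1):
        for P in (X, Y, Z):
            H = H + J * op_on({i: P, i + 1: P}, n)
    return H

H = spin_chain_hamiltonian(5)
ground_energy = np.linalg.eigvalsh(H).min()  # exact diagonalisation reference
```

Exact diagonalisation of such small instances provides the reference ground-state energy against which the quoted $\Delta E$ values are measured.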