Measurement cost of metric-aware variational quantum algorithms

Variational quantum algorithms are promising tools for near-term quantum computers, as their shallow circuits are robust to experimental imperfections. Their practical applicability, however, strongly depends on how many times their circuits need to be executed to sufficiently reduce shot noise. We consider metric-aware quantum algorithms: variational algorithms that use a quantum computer to efficiently estimate both a matrix and a vector object. For example, the recently introduced quantum natural gradient approach uses the quantum Fisher information matrix as a metric tensor to correct the gradient vector for the co-dependence of the circuit parameters. We rigorously characterise and upper bound the number of measurements required to determine an iteration step to a fixed precision, and propose a general approach for optimally distributing samples between matrix and vector entries. Finally, we establish that the number of circuit repetitions needed for estimating the quantum Fisher information matrix is asymptotically negligible for an increasing number of iterations and qubits.


I. INTRODUCTION
With quantum computers rising as realistic technologies, attention has turned to how such machines could perform as variational tools. This results in a hybrid model with an iterative loop: a classical processor determines how to update the parameters describing a family of quantum states (parametrised ansatz states), while a quantum coprocessor generates and performs measurements on those states (via an ansatz circuit). This is of particular interest in the context of noisy, intermediate-scale quantum devices (NISQ devices) [25], because complex ansatz states can be prepared with shallow circuits [26-29]. Such shallow circuits will potentially enable obtaining useful value before the era of resource-intensive quantum fault-tolerance methods. As such, variational quantum algorithms promise to solve key problems that are intractable for classical computers, such as finding ground states [2,4,6,11,30], as relevant in quantum chemistry and in materials science, or approximately solving combinatorial problems [1] and beyond.
Despite their potential power, variational algorithms might require an extremely large number of quantum-circuit repetitions; optimally using quantum resources will therefore be of crucial economic importance. Attention has recently been focused on statistical aspects of these variational quantum algorithms [31-36], such as the effect of shot noise and the reduction of their measurement costs. It is our aim in this work to establish general scaling results by rigorously characterising the number of measurements required to obtain a single iteration step in the case of so-called metric-aware quantum algorithms. Let us first introduce basic notions.

A. Variational quantum algorithms
We consider variational quantum algorithms which typically aim to prepare a parametrised quantum state ρ(θ) := Φ(θ)[ρ_0], which we model via a mapping Φ(θ) that acts on the computational zero state ρ_0 of N qubits and depends continuously on the parameters θ_i with i ∈ {1, 2, . . . , ν}. This mapping can in general contain non-unitary elements, such as measurements [37,38], but in many applications one assumes that it acts (approximately) as a unitary circuit that decomposes into a product of individual quantum gates. These gates typically act on a small subset of the system, e.g., one- and two-qubit gates.
Recently a novel variational algorithm was proposed for simulating real-time quantum evolution using shallow quantum circuits [8] and was further generalised to imaginary time and natural gradient evolutions [37,39] which can be used as optimisers of variational quantum eigensolvers (VQE) [2,4,16,40]. This was shown to significantly outperform other approaches, such as simple gradient descent, in terms of convergence speed and accuracy according to numerical simulations [37,39,41].
In this work, we consider generalisations of the aforementioned techniques as variational algorithms that need to estimate the following two objects: (a) a positive-semidefinite, symmetric matrix, which is usually the quantum Fisher information that characterises sensitivity with respect to the parameters θ_k; (b) a vector object that is in many applications the gradient vector of the loss function. Examples of such algorithms are provided in references [37,39,42-44], and we will refer to them in the following as metric-aware quantum algorithms. The metric tensor typically depends only on the parameter values, while the vector object additionally depends on, e.g., a Hermitian observable H that in typical scenarios represents the Hamiltonian of a physical system and decomposes into a polynomially increasing number r_h of Pauli terms.

B. Quantum natural gradient
To be more concrete, in the following we will focus on one prominent algorithm, the recently introduced quantum natural gradient approach [37,44], which is equivalent to imaginary time evolution when quantum circuits are noiseless and unitary [37,39]. This approach can be used as a VQE optimiser when minimising the expectation value E(θ) := Tr[ρ(θ)H] over the parameters θ. However, the approach generalises to any Lipschitz-continuous mapping as an objective function [37].
In particular, natural gradient descent governs the evolution of the ansatz parameters according to the update rule [37]

θ(t+1) := θ(t) − λ F_Q^{-1}(θ(t)) g(θ(t)),  (1)

where t is an iteration index and λ is a step size. Here the inverse of the positive-semidefinite, symmetric quantum Fisher information matrix F_Q ∈ R^{ν×ν} corrects the gradient vector g_k := ∂_k E(θ) for the co-dependence of the parameters; both objects can be estimated efficiently using a quantum computer, while the inverse F_Q^{-1} is computed by a classical processor.
We discuss different protocols for estimating the matrix [F_Q]_kl and vector g_k entries for both pure (idealised, perfect quantum gates) and mixed quantum states (via imperfect quantum gates or non-unitary elements such as measurements) in the Appendix. We now highlight two results. a) We derive the general upper bound [F_Q]_kl ≤ r_g², where r_g is the maximal number of Pauli terms into which generators of ansatz gates can be decomposed (Lemma 1). This bound is a generalisation of what is known as the Heisenberg limit in quantum metrology [45]; refer also to [46-48]. b) The matrix F_Q might be ill-conditioned and the inversion in Eq. (1) requires a regularisation. We will use the simple variant of Tikhonov regularisation F̃_Q^{-1} := [F_Q + η Id]^{-1} in the following; we derive analytical lower and upper bounds on the singular values of this inverse matrix in the Appendix (Lemma 3) using a).
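As a minimal illustration of the classical post-processing in this scheme (a sketch only, with made-up toy values; the estimation of F_Q and g on the quantum device is discussed in the Appendix), a single Tikhonov-regularised natural gradient step can be written as:

```python
import numpy as np

def natural_gradient_step(theta, F, g, lam=0.1, eta=0.1):
    """One natural gradient update: theta <- theta - lam * (F + eta*Id)^{-1} g.

    F   : estimated quantum Fisher information matrix (nu x nu, symmetric PSD)
    g   : estimated gradient vector (length nu)
    lam : step size; eta : Tikhonov regularisation parameter
    """
    nu = len(theta)
    # Solve the regularised linear system rather than explicitly inverting.
    v = np.linalg.solve(F + eta * np.eye(nu), g)
    return theta - lam * v

# Toy example: an ill-conditioned metric tensor that would be (numerically)
# singular without the eta*Id regularisation.
F = np.diag([1.0, 1e-6])
g = np.array([0.5, 0.5])
theta = np.zeros(2)
theta_new = natural_gradient_step(theta, F, g)
```

Using `np.linalg.solve` on the regularised matrix keeps the step well defined even when F_Q has (near-)zero singular values, mirroring the role of η discussed above.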

II. UPPER BOUNDS ON THE MEASUREMENT COST
To motivate our approach, we illustrate in Fig. 1(a) (green) how naively using the same number of measurements for estimating each matrix and vector entry, such as in [41], can result in impractical sampling costs.
In particular, we aim to reduce the error due to shot noise (finite sampling) of the vector v := F̃_Q^{-1} g in the update rule in Eq. (1). We first express how the error in the matrix and vector entries propagates to the parameter-update rule in Eq. (1), quantifying it as the expected squared Euclidean distance E‖Δv‖² = ε²; this translates to a condition on the variance of the individual vector entries v_k.
We derive an analytical formula in Lemma 2 in the Appendix, expressing this error in terms of the variances Var{[F_Q]_kl} and Var[g_k] of the measurements used to estimate the matrix and vector entries, respectively, as

ε² = Σ_{kl} a_kl Var{[F_Q]_kl} + Σ_k b_k Var[g_k].  (2)

The coefficients a_kl and b_k describe how the error of [F_Q]_kl and g_k propagates through matrix inversion and subsequent vector multiplication into the precision ε. We remark that these results are completely general and can be applied to any quantum algorithm that requires the estimation of both an inverse matrix and a vector object, such as a Hessian-based optimisation. We derive general upper bounds on the variances Var{[F_Q]_kl} and Var[g_k] for different experimental strategies in the Appendix. The error ε² in Eq. (2) is reduced proportionally when repeating measurements. In the following, we assume that N_F measurements are assigned to estimate the full matrix F_Q while N_g measurements are used to estimate the gradient vector g [49]. We now state an upper bound on them in terms of the precision ε.
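The proportional reduction of ε² with repeated measurements can be checked numerically. The following Monte Carlo sketch (with assumed toy values for F_Q, g and η; per-entry noise of variance 1/n_shots stands in for shot noise) confirms that quadrupling the shot count roughly quarters the expected squared error of v = F̃_Q^{-1} g:

```python
import numpy as np

rng = np.random.default_rng(0)
nu, eta = 4, 0.1
F = np.diag([1.0, 0.5, 0.1, 0.01])       # assumed noiseless metric tensor
g = np.array([0.3, -0.2, 0.1, 0.05])     # assumed noiseless gradient
v_exact = np.linalg.solve(F + eta * np.eye(nu), g)

def mean_sq_error(n_shots, trials=3000):
    """Expected ||Delta v||^2 when each entry carries variance 1/n_shots."""
    err = 0.0
    for _ in range(trials):
        F_noisy = F + rng.normal(0.0, 1 / np.sqrt(n_shots), (nu, nu))
        F_noisy = (F_noisy + F_noisy.T) / 2   # keep the estimate symmetric
        g_noisy = g + rng.normal(0.0, 1 / np.sqrt(n_shots), nu)
        v = np.linalg.solve(F_noisy + eta * np.eye(nu), g_noisy)
        err += np.sum((v - v_exact) ** 2)
    return err / trials

# In the linear-propagation regime, e1 / e4 should be close to 4.
e1, e4 = mean_sq_error(10_000), mean_sq_error(40_000)
```

This is only a consistency check of the error propagation picture, not a simulation of any particular measurement protocol.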
Theorem 1. To reduce the uncertainty of the vector v = F̃_Q^{-1} g due to shot noise to a precision ε, the number of samples N_F needed to estimate the matrix F_Q in Eq. (1) is upper bounded by a term proportional to ε^{-2} f_F Spc[F̃_Q^{-1}]² ‖g‖_∞², while sampling the gradient has a cost N_g upper bounded by a term proportional to ε^{-2} Spc[F̃_Q^{-1}] Spc[H] f_g (the explicit bounds, including the polynomial factors in ν, are derived in the Appendix). The overall measurement cost of determining the natural gradient vector is N_F + N_g. Here Spc[A] denotes the average squared singular value of a matrix A ∈ C^{d×d} via its Hilbert-Schmidt (Frobenius) norm as Spc[A] := ‖A‖²/d, and ‖g‖_∞ is the absolute largest entry of the gradient vector.
The constant factors f_F and f_g in Theorem 1 are specific to the experimental setup used to estimate the matrix or vector entries. For example, for r_g = 1 the factor simplifies as f_F ≤ 2. The upper bounds in Theorem 1 crucially depend on the regularisation and we prove that Spc[F̃_Q^{-1}] ≤ η^{-2}, refer to Lemma 3 in the Appendix. The product Spc[H] f_g is a constant that reflects the complexity of estimating the expected value of the fixed H (and can be reduced with advanced techniques that simultaneously estimate commuting terms [34,36,50-53]). It is interesting to note that the sampling cost N_g of the gradient vector depends on the metric tensor via Spc[F̃_Q^{-1}] (and vice versa). Let us illustrate this point with an example where one of the entries of F̃_Q^{-1} is extremely large in absolute value and therefore, via the matrix-vector product, magnifies both the mean and the variance of the gradient entries. Indeed, reducing such a magnified variance to our fixed precision ε requires an increased number of measurements of the gradient vector. We finally remark that Theorem 1 is quite general and the upper bounds apply to all metric-aware quantum algorithms [37,39,42-44] up to minor modifications.
FIG. 1. (a) The overhead κ at every iteration step t of the natural gradient evolution; it quantifies how much more it costs to estimate the natural gradient vector v(t) than it would cost to estimate the gradient vector g(t) assuming the same precision ε. κ converges to its constant (black) asymptotic approximation. Optimally distributing measurements (blue) via Result 3 significantly reduces sampling costs; however, naively (green [49]) using a fixed number of measurements for estimating each matrix and vector element results in a substantial overhead. (b) Multiplying g(t) (red) with the inverse of F_Q results in v(t) (blue) whose norm might be orders of magnitude larger. (c) In practice a relative precision is required, such that ε is proportional to the vector norms, refer to text. Carefully setting the regularisation parameter η significantly improves the practical applicability: solid lines with η = 10^{-1} result in a sampling cost of v(t) comparable to (green shows κ = 1) or even smaller than that of g(t). Refer to the main text for a remark about mitigating the initial high overheads seen in graphs (a) and (c).

We will establish in the following that in many cases sampling the gradient vector, N_g, dominates the overall cost of the natural gradient approach, i.e., N_F + N_g ≈ N_g. Before doing so, let us first bound the sampling cost of the natural gradient vector relative to the sampling cost N_smpl of the gradient vector that would be used in simple gradient descent optimisations. Note that the difference between N_g and N_smpl is that the latter corresponds to the scenario where we fix the metric tensor as the identity matrix F_Q := Id_ν and thus the precision is ε² := E‖Δg‖².
Theorem 2. Determining the natural gradient vector to the same precision ε as the gradient vector requires a sampling overhead κ := (N_F + N_g)/N_smpl. This overhead is upper bounded in general as κ ≤ y + Spc[F̃_Q^{-1}] ≤ y + η^{-2}, up to the potentially vanishing term y := N_F/N_smpl, refer to Results 1 and 2. Here η is either a regularisation parameter or the smallest singular value of F_Q. The approximation κ ≈ Spc[F̃_Q^{-1}] then establishes a constant factor which is valid, e.g., when the evolution is close to the optimal point.

III. SCALING AS A FUNCTION OF THE ITERATIONS
Theorem 1 establishes that the sampling cost N F of the matrix F Q depends on the norm of the gradient vector, which is expected to decrease polynomially during an optimisation. In a typical scenario we expect that, even if initially estimating the matrix dominates the sampling costs, asymptotically sampling the vector g dominates the costs.

Result 1. The upper bound in Theorem 1 implies that the relative sampling cost N_F/N_smpl of the matrix vanishes when the gradient norm decreases polynomially as ‖g(t)‖_∞ = O(t^{-c}) with c > 0; the natural gradient vector then requires only a constant sampling overhead asymptotically, κ → Spc[F̃_Q^{-1}], when compared to the gradient vector via Theorem 2. We remark that convergence is guaranteed under mild continuity conditions [31].
We have numerically simulated the natural gradient evolution from Eq. (1) and determined its overhead κ. This quantifies how much more it costs at every iteration step t to estimate the natural gradient vector v(t) than it would cost to estimate the gradient vector g(t) assuming the same precision ε. In fact, carefully increasing the regularisation parameter (to η = 10^{-1}) reduces the sampling cost by several orders of magnitude without significantly affecting the performance: both evolutions decrease the gradient norm at a similar rate, compare solid and dashed red lines in Fig. 1(b).
It is striking that the overhead plotted in Fig. 1(a) can be very high initially; while the focus of the present paper is on the asymptotic costs with respect to time and size, it is worth noting that this high initial cost could be straightforwardly mitigated by, e.g., only occasionally updating a low-rank approximation of the metric tensor. This may be expected to have little impact on the convergence rate, since in the early phase the advantage of using natural gradient is typically less pronounced.
Recall that Fig. 1(a) via Result 1 assumes a constant precision ε throughout the evolution, which is not practical. In fact, one would require a relative precision such that ε = ε_0 ‖g(t)‖ in the case of the gradient vector and ε = ε_0 ‖v(t)‖ in the case of the natural gradient vector, for some fixed ε_0. In particular, using a moderate regularisation of the inverse, η = 0.1, the cost of estimating v(t) is comparable to or even smaller than that of estimating g(t), refer to Fig. 1(c). We finally stress that in Fig. 1(a) we do not actually compare the overall performance of the simple and natural gradient methods, but only their per-iteration (per-epoch) costs. We therefore conclude that the natural gradient optimisation requires overall fewer samples to converge (i.e., asymptotically constant overhead but faster convergence rate) when compared to simple gradient descent, see also [37,39,41]. Moreover, we prove in the following that even the significant initial overheads in Fig. 1(a-c) do in many practical applications asymptotically vanish for an increasing number of qubits.

IV. SCALING AS A FUNCTION OF THE NUMBER OF QUBITS

We assume that the number r_h of Pauli terms in the Hamiltonian grows polynomially in the number N of qubits as r_h = Ω(N^b) with some b ≥ 1, and we obtain the corresponding growth rates of the quantities in Theorem 1 in the Appendix. Note that the vector norm ‖g‖_∞² might in general also depend on the number of qubits, e.g., via exponentially vanishing gradients in the case of barren plateaus [54-56], which would result in an exponentially decreasing relative sampling cost of the metric tensor. One may also think of scenarios where the gradient norm grows; however, one could then in practice decrease the inverse precision ε^{-1} proportionally, as is typical at the initial stages of an optimisation. To simplify our discussion, we assume that the gradient norm ‖g‖_∞ is fixed (bounded), e.g., the evolution is initialised in a close vicinity of the optimal parameters because a good classical guess is known. This also encompasses scenarios where the optimisation is near termination, approaching a fixed, e.g., chemical, precision. We summarise the resulting measurement cost in the following.
Result 2. Assume that the number of Pauli terms in the Hamiltonian grows polynomially as r_h = Ω(N^b), implying that the product Spc[H] f_g grows at least as fast. The relative sampling cost of the matrix F_Q then vanishes for general polylog(N)-depth circuits when b > (2−s), where s characterises the growth rate Spc[F̃_Q^{-1}] = O(ν^s) as derived in the Appendix; following Theorem 2, determining the natural gradient vector then requires at most a constant overhead asymptotically when compared to the gradient vector.
Note that Result 2 guarantees a vanishing sampling cost of the matrix F_Q when the number of terms in the Hamiltonian grows faster than quadratically, i.e., b > 2. We have explicitly calculated the growth rates b in the case of three example Hamiltonians in the Appendix as b = 1, 2, 3, respectively, and plot the relative sampling costs N_F/N_g in Fig. 2. We remark that this result can be applied to the general class of metric-aware quantum algorithms [37,39,42-44].

V. OPTIMAL MEASUREMENT DISTRIBUTION
So far we have assumed that the N_F (N_g) measurements are distributed uniformly among the ν² (ν) matrix (vector) entries [49]. However, the overall number of samples N_F + N_g (from Theorem 1) needed to obtain the vector v = F̃_Q^{-1} g to a precision ε can be minimised by distributing samples between the elements of F_Q and g optimally [34]. We denote by [N_F]_kl and [N_g]_k the number of measurements assigned to the individual elements of F_Q and of g, respectively. The number of samples required is then reduced to N_opt = Σ²/ε² with N_opt ≤ N_F + N_g. We now state explicit expressions for determining Σ, [N_F]_kl and [N_g]_k.
Result 3. Measurements are distributed optimally when the numbers of samples for determining the individual elements of the matrix and gradient are given by

[N_F]_kl = N_opt √(a_kl Var{[F_Q]_kl}) / Σ  and  [N_g]_k = N_opt √(b_k Var[g_k]) / Σ,

respectively. Here Var[·] is the variance of a single measurement of the corresponding element and we explicitly define Σ via the coefficients a_kl and b_k as

Σ := Σ_{kl} √(a_kl Var{[F_Q]_kl}) + Σ_k √(b_k Var[g_k]).

Furthermore, the symmetry of the Fisher matrix can be explicitly included just by modifying the coefficients a_kl, as discussed in the Appendix.
We remark that this result is completely general and can be applied to any of the metric-aware quantum algorithms [37,39,[42][43][44].
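This distribution rule is a Neyman-type allocation: shots are assigned proportionally to the square roots of the weighted single-shot variances. The following sketch (with entirely made-up weighted variances a_kl Var{[F_Q]_kl} and b_k Var[g_k], flattened into one array) compares it against the uniform strategy:

```python
import numpy as np

def optimal_allocation(weighted_vars, n_total):
    """Distribute n_total shots proportionally to sqrt(weight * variance)."""
    roots = np.sqrt(weighted_vars)
    return n_total * roots / roots.sum()

def propagated_error(weighted_vars, shots):
    """epsilon^2 = sum_i weight_i * Var_i / shots_i for a given distribution."""
    return np.sum(weighted_vars / shots)

# Made-up weighted variances for the matrix and vector elements together:
w = np.array([4.0, 1.0, 0.25, 0.25, 1.0, 9.0])
n_total = 6000

uniform = np.full(len(w), n_total / len(w))
optimal = optimal_allocation(w, n_total)

err_uniform = propagated_error(w, uniform)
err_optimal = propagated_error(w, optimal)  # equals (sum_i sqrt(w_i))^2 / n_total
```

By construction the optimal allocation achieves ε² = Σ²/N_opt, i.e., `err_optimal` equals `(np.sqrt(w).sum()**2) / n_total`, which is never larger than the uniform-allocation error.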
Fig. 1(a/c) (blue) shows how the optimal distribution of samples reduces the measurement overhead across the entire evolution, most significantly for small regularisation parameters [Fig. 1(a/c), η = 10^{-5}], in which case some matrix elements might be substantially larger than others. Moreover, Result 3 automatically takes into account the decreasing sampling cost of the matrix as established in Results 1-2. This is illustrated in Fig. 2(b): for the first few iterations, far from convergence, the bulk of the measurements is directed to the matrix and comparatively few go to the elements of the gradient [Fig. 2(b), t = 1]. However, close to convergence, consistent with Result 1, the gradient takes the majority of the measurements [Fig. 2(b), t = 20].

VI. DISCUSSION AND CONCLUSION
In this work we established general upper bounds on the sampling cost of metric-aware variational quantum algorithms (e.g., natural gradient). We analysed how this sampling cost scales for increasing iterations in Result 1 and for increasing qubit numbers in Result 2. The latter establishes that the relative measurement cost of the matrix object F_Q is asymptotically negligible in many practically relevant scenarios, such as in the case of quantum chemistry applications.
Natural gradient has been shown to outperform other optimisation approaches in numerical simulations [37,39,41]. We proved in this work that for both an increasing number of iterations and number of qubits the sampling overhead per-iteration (per-epoch) of the natural gradient approach is constant asymptotically when compared to simple gradient descent. The most important implication of our results is therefore that the overall cost of natural gradient is lower since it converges to the optimum faster.
We finally established a general technique that optimally distributes measurements when estimating matrix and vector entries, further reducing the cost of general metric-aware quantum algorithms. Let us finally remark on the generality of our results: our techniques are immediately applicable to other problems beyond metric-aware approaches, for example, to Hessian-based optimisations via Eq. (2) as detailed in the Appendix.
Appendix A: Estimating the matrix and vector entries

1. Ansatz circuits

We consider Hamiltonians that decompose into Pauli strings as

H = Σ_{l=1}^{r_h} h_l P_l,  (A1)

where P_l ∈ Herm[C^{d×d}] are tensor products of single-qubit Pauli operators that act on an N-qubit system and form an orthonormal basis of the Hilbert-Schmidt operator space, and d = 2^N is the dimensionality. We denote by r_h ∈ N the Pauli rank, i.e., the number of non-zero Pauli components in the Hamiltonian; note that in general r_h ≤ 4^N. In the following derivations we assume for simplicity that ansatz circuits U_c are unitary and decompose into a product of individual gates,

U_c := U_ν(θ_ν) · · · U_2(θ_2) U_1(θ_1),  (A2)

which typically act on a small subset of the system, e.g., one- and two-qubit gates. We assume in Eq. (A2) for ease of notation that each quantum gate depends on an individual parameter θ_i with i ∈ {1, 2, . . . , ν}. Individual gates U_k(θ_k) ∈ SU(d) of the quantum circuit from Eq. (A2) are in general of the form U_k(θ_k) := exp[−iθ_k G_k] and their generators G_k ∈ Herm[C^{d×d}] decompose into a sum of Pauli strings as G_k = Σ_l g_kl P_l, where r_g^{(k)} ∈ N is the Pauli rank of the generator G_k, i.e., its number of non-zero components g_kl. We additionally assume that g_kl ≤ 1/2 for simplicity, but any other upper bound could be specified. It follows in general that the derivative ∂_k U_k(θ_k) decomposes into a sum of r_g^{(k)} unitary operators as

∂_k U_k(θ_k) = −i Σ_l g_kl P_l U_k(θ_k).  (A3)

For ease of notation, in the following we consider circuits via Eq. (A2) which decompose into gates U_k(θ_k) with Pauli rank r_g = 1. This is naturally the case for a wide variety of ansatz circuits, e.g., circuits that consist of single-qubit rotations and two-qubit ZZ or XX evolution gates as depicted in Fig. 3. This assumption results in the simplified gate structure U_k(θ_k) := exp[−iθ_k P_k/2] and the derivatives ∂_k U_k(θ_k) = −(i/2) P_k U_k(θ_k), where P_k is the Pauli generator of the gate U_k(θ_k). This construction simplifies our following derivations; the generalisation to arbitrary parametrised gates straightforwardly follows from linearity of Eq. (A3). We finally define the partial derivative of the circuit in Eq. (A2) using our simplified ansatz as

∂_k U_c = −(i/2) D_k  with  D_k := U_ν(θ_ν) · · · U_{k+1}(θ_{k+1}) P_k U_k(θ_k) · · · U_1(θ_1),  (A4)

where D_k itself is unitary via [D_k]† = [D_k]^{-1} (and we omit its explicit dependence on the parameters θ) and P_l P_l† = Id_d. We remark that in the case of non-unitary parametrisations one would need to consider the general mapping ρ(θ) := Φ(θ)[ρ_0]; the circuit derivative then decomposes into Pauli terms as

∂_k ρ(θ) = Σ_{mn} p_kmn P_m ρ(θ) P_n.  (A5)

Upper bound on the quantum Fisher information
We now derive a general upper bound on the quantum Fisher information for unitary parametrisations.
Lemma 1. In the case of unitary ansatz circuits that act on arbitrary quantum states ρ via quantum gates that decompose into at most r_g Pauli terms, entries of the quantum Fisher information matrix are upper bounded as [F_Q]_kl ≤ r_g².

Proof. When the ansatz circuit consists of unitary gates, the quantum Fisher information assumes its maximum for pure states. Considering the pure state ρ = |ψ⟩⟨ψ|, it follows from [37] that

[F_Q]_kl = 4 Re[⟨∂_k ψ|∂_l ψ⟩ − ⟨∂_k ψ|ψ⟩⟨ψ|∂_l ψ⟩].

Applying the Cauchy-Schwarz inequality yields |[F_Q]_kl| ≤ ([F_Q]_kk [F_Q]_ll)^{1/2} ≤ F_max, where F_max is a bound on the scalar quantum Fisher information, i.e., on the diagonal entries of the matrix F_Q. Let us determine this bound via F_max ≤ 4⟨∂_k ψ|∂_k ψ⟩ for an arbitrary |ψ⟩. It follows from Eq. (A3) that |∂_k ψ⟩ = −i Σ_m g_km |ψ_m⟩ with |ψ_m⟩ := P_m |ψ⟩, where the |ψ_m⟩ are valid, normalised states and therefore |⟨ψ_l|ψ_m⟩| ≤ 1, and we used that g_kl ≤ 1/2. This yields ⟨∂_k ψ|∂_k ψ⟩ = Σ_{lm} g_kl g_km ⟨ψ_l|ψ_m⟩ ≤ r_g²/4 and finally establishes the general upper bound for unitary ansatz circuits whose gates decompose into at most r_g Pauli terms as [F_Q]_kl ≤ r_g²; in the case of simplified ansätze with r_g = 1 from Sec. A 1 one obtains [F_Q]_kl ≤ 1.
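As a numerical sanity check of Lemma 1 (a sketch only, not part of the proof), the pure-state formula F = 4[⟨∂ψ|∂ψ⟩ − |⟨ψ|∂ψ⟩|²] can be evaluated for a single-qubit Pauli-X rotation ansatz, where r_g = 1 and g_kl = 1/2; the bound [F_Q]_kk ≤ r_g² = 1 then holds (and is here saturated) for all parameter values:

```python
import numpy as np

def qfi(theta, dtheta=1e-6):
    """Pure-state QFI F = 4[<dpsi|dpsi> - |<psi|dpsi>|^2] via finite differences."""
    def psi(t):
        # Single Pauli-X rotation exp(-i t X / 2) acting on |0>.
        c, s = np.cos(t / 2), np.sin(t / 2)
        U = np.array([[c, -1j * s], [-1j * s, c]])
        return U @ np.array([1, 0], dtype=complex)
    dpsi = (psi(theta + dtheta) - psi(theta - dtheta)) / (2 * dtheta)
    p = psi(theta)
    return 4 * (np.vdot(dpsi, dpsi) - abs(np.vdot(p, dpsi)) ** 2).real

# The diagonal QFI entry never exceeds r_g^2 = 1 for this ansatz:
values = [qfi(t) for t in np.linspace(0, np.pi, 7)]
```

The finite-difference derivative is a numerical stand-in for the analytic derivative circuits of Eq. (A4).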

Components of the gradient
Components of the gradient vector can be measured via the Hadamard test. We discuss this for the example of the simplified ansätze from Sec. A 1, while the generalisation follows from linearity. Let us first express the gradient components g_k := ∂_k E(θ) in terms of the derivative circuits from Eq. (A4) as

g_k = −Σ_{l=1}^{r_h} h_l M_kl,

where we used the decomposition of the Hamiltonian into Pauli operators from Eq. (A1) and denoted the matrix elements M_kl := Im⟨0|[D_k]† P_l U_c|0⟩. These matrix elements can be estimated by using an ancilla qubit via the circuits in Fig. 2 of reference [8]; the corresponding proof can be found in footnote [53] of [8], refer also to [43]. The probability p_kl of measuring this ancilla qubit in the |±⟩ basis with outcome +1 determines the matrix elements via (2p_kl − 1) = M_kl for every Pauli component P_l in the Hamiltonian. This finally yields the explicit form of the gradient vector in terms of the measurement probabilities 0 ≤ p_kl ≤ 1. Note that each probability p_kl is estimated by sampling a binomial distribution which has a variance σ²_kl = p_kl(1 − p_kl). It follows that the variance of the gradient components is determined by these individual variances via Var[g_k] = Σ_l 4 h_l² σ²_kl. Re-expressing this variance in terms of the matrix elements via p_kl = (M_kl + 1)/2 yields the simplified form Var[g_k] = Σ_l h_l² (1 − M_kl²). This expression is related directly to the parametrised quantum state |ψ(θ)⟩ via the expectation value M_kl = −2 Re⟨∂_k ψ(θ)|P_l|ψ(θ)⟩. In complete generality, i.e., when gates decompose into a linear combination of at most r_g Pauli terms, the variance of the gradient entries is upper bounded (via Eq. (A9)) in terms of Spc[H] = Σ_l h_l², which follows via Eq. (A1) and Tr[P_k P_l] = d δ_kl, where δ_kl is the Kronecker delta and d = 2^N. So far we have assumed that each term in the Hamiltonian is estimated separately from outcomes of independent ancilla measurements, and the above variance therefore corresponds to overall r_h measurements.
Indeed, advanced techniques could be used for simultaneously measuring commuting terms in the Hamiltonian (possibly without an ancilla qubit), reducing the overall number of shots [34,36,50-53], and we would like to take this into account in our final result. We conclude by stating the upper bound on the variance Var[g_k] of a single measurement to estimate the gradient entry g_k as Var[g_k] ≤ Spc[H] f_g. Here we have introduced the constant factor f_g, for which we can generally state the bounds 1 ≤ f_g ≤ r_g r_h, as f_g depends on the system (type of gates via r_g) and on the measurement technique used for estimating terms in the Hamiltonian (number of commuting groups). The lower bound (best-case scenario f_g = 1) is saturated for Pauli gates (r_g = 1) and Hamiltonians from Eq. (A1) in which all terms commute and are measured simultaneously. The upper bound (worst-case scenario) is saturated by Hamiltonians from Eq. (A1) in which all r_h terms are estimated from separate measurements (all terms are non-commuting) and all terms have comparable strengths (optimally distributing samples does not reduce Var[g_k]). The factor f_g interpolates between these two extremal cases and will correspond to a value in the bounded range 1 ≤ f_g ≤ r_g r_h. In most of this work we assume a fixed H and therefore we can treat f_g and Spc[H] as constants; the only exception is our derivation in Result 2, where we make the mild, general assumption that the number r_h of terms in the Hamiltonian grows polynomially and therefore necessarily the product Spc[H] f_g grows too. Let us finally remark on mixed states ρ = Σ_n p_n |ψ_n(θ)⟩⟨ψ_n(θ)|: the above protocol then estimates the convex combination g_k = Σ_n p_n [g_k]_n, where [g_k]_n is the gradient that would be measured by the above protocol for the pure eigenstate |ψ_n(θ)⟩. The above discussed protocol therefore estimates the correct gradient for mixed states, as long as the parametrisation is approximately unitary, such as in the case of noisy gates. The same upper bound holds for the variances via Σ_n p_n = 1 and 0 ≤ p_n ≤ 1, and the bound is only saturated by pure states.
In summary, the variance of the gradient entries is upper bounded as Var[g_k] ≤ Spc[H] f_g, where f_g is a constant factor that depends only on the ansatz structure, on the particular quantum algorithm that is used to estimate the entries, and on the Hamiltonian. We remark that the above discussed protocol is used in other metric-aware quantum algorithms, and our bounds therefore apply to other vector objects used in these algorithms [37,39,42-44].
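The binomial sampling step above can be sketched numerically (with an assumed matrix element M_kl; no quantum circuit is simulated, only the ancilla statistics): the single-shot estimator 2·(outcome) − 1 of M_kl has variance 4 p_kl(1 − p_kl) = 1 − M_kl², which the empirical variance reproduces:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_matrix_element(M_kl, n_shots, size=1):
    """Estimate M_kl = 2*p_kl - 1 from n_shots ancilla outcomes, `size` times."""
    p_kl = (M_kl + 1) / 2                  # ancilla outcome-(+1) probability
    hits = rng.binomial(n_shots, p_kl, size=size)
    return 2 * hits / n_shots - 1

# Single-shot variance of the estimator is 4 p (1 - p) = 1 - M^2:
M = 0.6
samples = sample_matrix_element(M, n_shots=1, size=200_000)
empirical_var = samples.var()
predicted_var = 1 - M**2
```

Averaging n_shots outcomes divides this single-shot variance by n_shots, which is the proportional error reduction assumed throughout the main text.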

Components of the quantum Fisher information matrix
We will now focus on determining the variances of the quantum Fisher information entries [F_Q]_kl. For pure states ρ = |ψ⟩⟨ψ|, entries of the quantum Fisher information can be expressed via the state-vector scalar products [37]

[F_Q]_kl = 4 Re[⟨∂_k ψ|∂_l ψ⟩ − ⟨∂_k ψ|ψ⟩⟨ψ|∂_l ψ⟩].  (A14)

The second term in the above equation vanishes when the global phase evolution of |ψ⟩ is zero [43], and an experimental protocol for measuring the remaining component Re⟨∂_k ψ|∂_l ψ⟩ was used in [39] for simulating imaginary time evolution. We now propose a protocol that determines both terms in Eq. (A14). Assuming the simplified ansatz from Sec. A 1, our protocol allows one to evaluate the required coefficients A_kl, B_k and C_k by measuring an ancilla qubit using the circuits in Fig. 2 of reference [8], refer to footnote [53] of [8] for a proof. These circuits allow for estimating the probabilities p_a, p_b and p_c by sampling the ancilla qubit as a binomial distribution; the quantum Fisher information is then obtained from the estimated quantities A_kl, B_k and C_k. Since the probabilities p_a, p_b and p_c are determined from binomial distributions, their variances are given by, e.g., [σ_a²]_kl = [p_a]_kl (1 − [p_a]_kl), and we can express the variances Var{[F_Q]_kl} in terms of the estimated quantities A_kl, B_k and C_k, where we used the expressions, e.g., (A_kl + 1)/2 = [p_a]_kl. Note that the inequality (1 − [B_k]²)B_l² ≤ 1/4 is saturated when B_k = 1/√2, and in general |A_kl|, |B_l|, |C_l| ≤ 1. Using this inequality we can establish a general upper bound on Var{[F_Q]_kl} when gates decompose into a linear combination of at most r_g Pauli terms.
When assuming noisy unitary circuits, Result 3 in [37] establishes that [F_Q]_kl ≈ 2 Tr[(∂_k ρ)(∂_l ρ)] and the approximation becomes exact for pure states ρ = |ψ⟩⟨ψ|. The Hilbert-Schmidt scalar products Tr[(∂_k ρ)(∂_l ρ)] can be measured using the circuit based on SWAP tests from [43], and one can directly estimate the quantity [F_Q]_kl = (2p_kl − 1) by measuring the probability p_kl of an ancilla qubit when using the simplified ansatz from Sec. A 1, i.e., when gates decompose into single Pauli terms. We remark that this implementation requires more qubits when compared to the above introduced pure-state approach. However, it is preferable as it results in negligible approximation errors when gates are imperfect, refer to [37]. The variance follows as Var{[F_Q]_kl} = 4 p_kl(1 − p_kl) = 1 − [F_Q]_kl² ≤ 1 in the case of the simplified ansatz from Sec. A 1, where [F_Q]_kl ≤ 1 from Lemma 1 guarantees a valid probability 0 ≤ p_kl ≤ 1.
In complete generality, i.e., when gates decompose into a linear combination of at most r_g Pauli terms, the variance of the matrix entries is upper bounded by a constant that depends on r_g. In summary, the variance of the matrix entries is upper bounded as Var{[F_Q]_kl} ≤ f_F, where f_F is a constant factor that depends only on the ansatz structure and the approach used to estimate the matrix entries. We remark that the above discussed two protocols are used in other metric-aware quantum algorithms and our bounds therefore apply to other matrix objects estimated by these algorithms [37,39,42-44].
Appendix B: Upper bounds on the measurement cost

Recall that ‖g‖_∞ is the absolute largest element of the gradient vector. We assume that every matrix and vector element is assigned measurements uniformly as N_F/ν² and N_g/ν, where N_F and N_g are the overall numbers of measurements used to estimate the matrix and vector objects such that the vector v is obtained to a precision ε. Using the upper bounds on the variances of individual gradient vector entries from Eq. (A11) and individual matrix entries from Eqs. (A16) and (A15), we derive an explicit bound in terms of ‖H‖, the Hilbert-Schmidt (Frobenius) norm of the Hamiltonian, and the constant factors f_F and f_g that depend on the ansatz structure and on the approach used to estimate the gradient/Fisher matrix, refer to Sec. A 1. For example, for the simplified ansatz (r_g = 1) in Sec. A 1 we obtain f_F ≤ 2 and 1 ≤ f_g ≤ r_h. Here the lower bound f_g = 1 is saturated when all terms in the Hamiltonian from Eq. (A1) commute and can be measured simultaneously, while the upper bound f_g = r_h is saturated when all r_h terms in the Hamiltonian need to be estimated independently (because they do not commute) and their strengths are comparable, refer to Sec. A 3. Using the above derived upper bounds, we obtain a bound in terms of ‖F̃_Q^{-1}‖, the Hilbert-Schmidt (Frobenius) norm of the inverse matrix F̃_Q^{-1}. We now require that ε²/2 =: ε_F² and ε²/2 =: ε_g² as a possible choice to satisfy ε² = ε_F² + ε_g²; this results in the explicit bound on the number of measurements stated in Theorem 1. We introduce the notation Spc[F̃_Q^{-1}] := ‖F̃_Q^{-1}‖²/ν = (1/ν) Σ_{k=1}^{ν} σ_k²(F̃_Q^{-1}) to denote the average of the squared singular values of F̃_Q^{-1}. Note that, for example, the identity operator yields Spc[Id] = 1, and we derive upper and lower bounds on Spc[F̃_Q^{-1}] in general in Lemma 3.
We can now establish the bound Spc[F̃_Q^{-1}] ≤ η^{-2} via Lemma 3, and we can therefore bound the growth rate of the quantity Spc[F̃_Q^{-1}] as Spc[F̃_Q^{-1}] = O(ν^s) with −2 ≤ s ≤ 0.
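The bound Spc[F̃_Q^{-1}] ≤ η^{-2} follows because each singular value of [F_Q + η Id]^{-1} equals 1/(σ_k + η) ≤ 1/η; it can be checked numerically on assumed random positive-semidefinite instances:

```python
import numpy as np

rng = np.random.default_rng(2)

def spc(A):
    """Average squared singular value: Spc[A] = ||A||_F^2 / dim."""
    return np.linalg.norm(A, 'fro') ** 2 / A.shape[0]

eta, nu = 0.1, 8
for _ in range(100):
    G = rng.normal(size=(nu, nu))
    F = G @ G.T                               # random PSD "metric tensor"
    F_inv_reg = np.linalg.inv(F + eta * np.eye(nu))
    assert spc(F_inv_reg) <= eta**-2 + 1e-9   # the Lemma 3 upper bound
```

The identity operator gives Spc[Id] = 1, matching the normalisation used above.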
Appendix C: Optimal Measurements

Lemma 4. Measurements are distributed optimally when the numbers of samples for determining the individual elements of the matrix and gradient are given by

[N_F]_kl = N_opt √(a_kl Var{[F_Q]_kl}) / Σ  and  [N_g]_k = N_opt √(b_k Var[g_k]) / Σ,

respectively. Here Var[·] is the variance of a single measurement of the corresponding element and we explicitly define Σ via the coefficients a_kl and b_k from Appendix B as

Σ := Σ_{kl} √(a_kl Var{[F_Q]_kl}) + Σ_k √(b_k Var[g_k]).

Proof. Minimising the error measure over the measurement distribution at a fixed total number of measurements N_opt (e.g., via a Lagrange multiplier), we find that the optimal fraction of measurements to be assigned to each element is proportional to the square root of the corresponding weighted variance, with the normalisation Σ as defined above. By substituting this result into the error measure we can remove the dependence on the total number of measurements N_opt, which yields the required result.
The symmetry of the Fisher matrix can be included explicitly, so that the error measure can be written with the coefficients a_kl replaced by their symmetrised counterparts ā_kl. Using the error measure written in this form as a starting point for the derivation in the proof of Lemma 4, we trivially obtain the same results with the elements a_kl replaced with ā_kl.