Nearly Optimal Quantum Algorithm for Estimating Multiple Expectation Values

Many quantum algorithms involve the evaluation of expectation values. Optimal strategies for estimating a single expectation value are known, requiring a number of state preparations that scales with the target error ε as O(1/ε). In this paper, we address the task of estimating the expectation values of M different observables, each to within additive error ε, with the same 1/ε dependence. We describe an approach that leverages Gilyén et al.'s quantum gradient estimation algorithm to achieve O(√M/ε) scaling up to logarithmic factors, regardless of the commutation properties of the M observables. We prove that this scaling is worst-case optimal in the high-precision regime if the state preparation is treated as a black box, even when the operators are mutually commuting. We highlight the flexibility of our approach by presenting several generalizations, including a strategy for accelerating the estimation of a collection of dynamic correlation functions.


Introduction
A fundamental task of quantum simulation is to perform an experiment in silico. Like traditional experimentalists, researchers using quantum computers will often be interested in efficiently measuring a collection of properties. For example, the electronic ground state problem is frequently cited as a motivation for quantum simulation of chemistry, but determining the ground state energy is only a starting point in most chemical applications. Depending on context, it may be essential to measure the dipole moment and polarizability, the electron density, the forces experienced by the classical nuclei, or various other quantities [1, 2]. Similarly, in condensed matter physics and beyond, correlation functions play a central role in the theory of quantum many-body phenomena due to their interpretability and measurability in the lab [3, 4].
In this letter, we consider the problem of accurately and efficiently estimating multiple properties from a quantum computation. We focus on evaluating the expectation values of a collection of M Hermitian operators {O_j} with respect to a pure state |ψ⟩. We aim to evaluate each expectation value to within additive error ε using as few calls as possible to a state preparation oracle for |ψ⟩ (or its inverse). One simple approach is to repeatedly prepare |ψ⟩ and projectively measure mutually commuting subsets of {O_j}. Alternatively, strategies based on amplitude estimation achieve a quadratic speedup with respect to ε but entail measuring each observable separately [5-7]. A range of newer "shadow tomography" techniques use joint measurements of multiple copies of |ψ⟩ to achieve polylogarithmic scaling with respect to M at the expense of an unfavorable 1/ε^4 scaling [8-11]. In certain situations, randomized methods based on the idea of "classical shadows" of the state obtain 1/ε^2 scaling while improving upon sampling protocols with deterministic measurement settings [12, 13]. We review these existing approaches in Appendix A and compare them to our new strategy in Table I and Appendix B.
Our main contribution is an algorithm that achieves the same 1/ε scaling as methods based on amplitude estimation, but also improves the scaling with respect to M from O(M) to Õ(√M), where the tilde in Õ(·) hides logarithmic factors. Our approach is to construct a function f whose gradient yields the expectation values of interest and encode f in a parameterized quantum circuit. We can then apply Gilyén et al.'s quantum algorithm for gradient estimation [14] to obtain the desired scaling. The following theorem formalizes our result.
Theorem 1. Let {O_j} be a set of M Hermitian operators on N qubits, with spectral norms ‖O_j‖ ≤ 1 for all j. There exists a quantum algorithm that, for any N-qubit quantum state |ψ⟩ prepared by a unitary U_ψ, outputs estimates o_j such that |o_j − ⟨ψ|O_j|ψ⟩| ≤ ε for all j with probability at least 2/3, using O(√M/ε) queries to U_ψ and U_ψ†, along with O(√M/ε) gates of the form controlled-e^{−ixO_j} for each j, for various values of x with |x| ∈ O(1/√M).
As we show in Corollary 3, this query complexity is worst-case optimal (up to logarithmic factors) in the high-precision regime where ε ∈ (0, 1/(3√M)).

[Table I appears here; its columns report complexities for commuting observables, non-commuting observables, and the k-RDM.] Table I. A comparison of the (worst-case) complexities, in terms of state preparation oracle queries, of different approaches for measuring multiple observables. We consider three applications: estimating the expectation values of M commuting or non-commuting observables, and determining the fermionic k-RDM of an N-mode system. Here, ε denotes the additive error to which each quantity is estimated.
We compare strategies based on naive sampling, amplitude estimation, and shadow tomography to our gradient-based approach. We cite the specific works used to determine these complexities, including the Pauli-specific shadow protocol of Ref. 11. Note that methods based on sampling and shadow tomography also work under a weaker input model where only copies of the state are provided.

Lower Bounds
In Ref. 15, van Apeldoorn proved a lower bound for a task that is essentially a special case of our quantum expectation value problem. We explain how a lower bound for our problem can be obtained as a corollary. Their results are expressed in terms of a particular quantum access model for classical probability distributions:

Definition 1 (Sample oracle for a probability distribution). Let p be a probability distribution over M outcomes, i.e., p ∈ [0, 1]^M with ‖p‖_1 = 1. A sample oracle U_p for p is a unitary operator that acts as

U_p |0⟩ = Σ_{j=1}^M √(p_j) |j⟩|φ_j⟩,    (1)

where the |φ_j⟩ are arbitrary normalized quantum states.
We rephrase Lemma 13 of Ref. 15 below. Here and throughout this paper, we count queries to a unitary oracle U and to its inverse U† as equivalent in cost.
Theorem 2 (Lemma 13, Ref. 15 (rephrased)). Let M be a positive integer power of 2 and let ε ∈ (0, 1/(3√M)). There exists a known matrix A ∈ {−1, +1}^{M×M} such that the following is true. Suppose A is an algorithm that, for every probability distribution p, accessed via a sample oracle U_p, outputs (with probability at least 2/3) a q such that ‖Ap − q‖_∞ ≤ ε. Then A must use Ω(√M/ε) queries to U_p in the worst case.
We can use this theorem to derive the following corollary, establishing the near-optimality of the algorithm in Theorem 1 in certain regimes.
Corollary 3. Let M be a positive integer power of 2 and let ε ∈ (0, 1/(3√M)). Let A be any algorithm that takes as an input an arbitrary set of M observables {O_j}. Suppose that, for every quantum state |ψ⟩, accessed via a state preparation oracle U_ψ, A outputs estimates of each ⟨ψ|O_j|ψ⟩ to within additive error ε (with probability at least 2/3). Then, there exists a set of observables {O_j} such that A applied to {O_j} must use Ω(√M/ε) queries to U_ψ.
Proof. Assume for the sake of contradiction that for any {O_j} and U_ψ, the algorithm A uses o(√M/ε) queries to U_ψ to estimate every ⟨ψ|O_j|ψ⟩ to within error ε (with success probability at least 2/3). For any sample oracle U_p of the form in Eq. (1), consider the state

|ψ(U_p)⟩ := Σ_{j=1}^M √(p_j) |a_j⟩|j⟩|φ_j⟩,

where the M-qubit basis state |a_j⟩ encodes the j-th column of A (the i-th qubit of |a_j⟩ is |0⟩ if A_{ij} = +1 and |1⟩ if A_{ij} = −1). A quick computation verifies that the i-th entry of the vector Ap is equal to ⟨ψ(U_p)|Z_i|ψ(U_p)⟩, where Z_i denotes the Pauli Z operator acting on the i-th qubit. Since the matrix A is known, it is clear that |ψ(U_p)⟩ = U_A(I ⊗ U_p)|0⟩ for a known unitary U_A that writes |a_j⟩ into the first register controlled on |j⟩. Therefore, we can apply algorithm A with O_j = Z_j for j ∈ {1, ..., M} and U_ψ = U_A(I ⊗ U_p). By our assumption, this constitutes an algorithm that, for every U_p, estimates each entry of Ap to within error ε using o(√M/ε) queries to U_p, contradicting Theorem 2 and completing the proof.
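As a sanity check on this reduction, the identity (Ap)_i = ⟨ψ(U_p)|Z_i|ψ(U_p)⟩ can be verified numerically for a small instance. The Hadamard sign matrix and the basis-state encoding of the columns of A used below are our own illustrative choices, not necessarily those of Ref. 15:

```python
import numpy as np

M = 4  # number of outcomes / observables (a power of 2)

# A known {-1, +1} matrix; a Hadamard sign matrix is a natural stand-in
# for the matrix A of Theorem 2.
A = np.array([[1]])
while A.shape[0] < M:
    A = np.block([[A, A], [A, -A]])

rng = np.random.default_rng(0)
p = rng.random(M)
p /= p.sum()  # probability distribution over M outcomes

# |psi(U_p)> = sum_j sqrt(p_j) |a_j>|j>, where the bit string a_j encodes
# column j of A (qubit i is |0> if A[i, j] == +1, else |1>).
dim = 2**M
state = np.zeros((dim, M))  # second axis plays the role of the |j> register
for j in range(M):
    bits = [(0 if A[i, j] == 1 else 1) for i in range(M)]
    idx = int("".join(map(str, bits)), 2)
    state[idx, j] = np.sqrt(p[j])

# <Z_i>: weight +1 where bit i of the string is 0, weight -1 where it is 1
probs = (state**2).sum(axis=1)  # probability of each M-bit string
z_exp = np.zeros(M)
for i in range(M):
    signs = np.array([1 if ((idx >> (M - 1 - i)) & 1) == 0 else -1
                      for idx in range(dim)])
    z_exp[i] = signs @ probs

assert np.allclose(z_exp, A @ p)  # (Ap)_i = <psi|Z_i|psi>
```

Because the |a_j⟩ are computational basis states and the |j⟩ register keeps the branches orthogonal, each Z_i expectation reduces to Σ_j p_j A_{ij}, which is exactly (Ap)_i.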

Background on Gilyén et al.'s gradient algorithm
Our framework for simultaneously estimating multiple expectation values uses the improved quantum algorithm for gradient estimation of Gilyén, Arunachalam, and Wiebe (henceforth, Gilyén et al.) [14]. Gilyén et al. built on earlier work by Jordan [16], which demonstrated an exponential quantum speedup for computing the gradient in a particular black-box access model. Specifically, Jordan's algorithm uses one query to a binary oracle (see Appendix C) for a function f, along with phase kickback and the quantum Fourier transform, to obtain an approximation of the gradient ∇f.
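The phase kickback ingredient mentioned above can be illustrated classically: a register prepared in a Fourier basis state turns modular addition into a global phase, which is how a binary oracle's output is converted into phase information. This toy sketch (our own illustration, not Jordan's full algorithm) shows the mechanism:

```python
import numpy as np

m = 5
N = 2**m
y = np.arange(N)
# A Fourier basis state (QFT^dag applied to |1>, in one common convention)
phi = np.exp(-2j * np.pi * y / N) / np.sqrt(N)

def add_mod(state, a):
    # The permutation |y> -> |y + a mod N>, acting on amplitude vectors
    return np.roll(state, a)

a = 11
kicked = add_mod(phi, a)
# Addition into the Fourier-basis register acts as a global phase
assert np.allclose(kicked, np.exp(2j * np.pi * a / N) * phi)
```

The added value a never appears in the register itself; it is "kicked back" as the phase e^{2πi a/N}, which a subsequent Fourier transform can read out.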
While we defer a technical discussion of Gilyén et al.'s algorithm to Appendix C (and refer the reader also to Ref. 14), we give a brief, colloquial description of their algorithm here. It is helpful to review their definition for a probability oracle:

Definition 2 (Probability oracle). Consider a function f : R^M → [0, 1]. A probability oracle U_f for f is a unitary operator that acts as

U_f |x⟩|0⟩ = |x⟩ ( √(f(x)) |1⟩|φ_1(x)⟩ + √(1 − f(x)) |0⟩|φ_0(x)⟩ ),

where |x⟩ denotes a discretization of the variable x encoded into a register of qubits, |0⟩ denotes the all-zeros state of a register of ancilla qubits, and |φ_0(x)⟩ and |φ_1(x)⟩ are arbitrary quantum states.
Gilyén et al. show how such a probability oracle can be used to encode a finite-difference approximation to a directional derivative of f in the phase of an ancilla register. As in Jordan's original algorithm, a quantum Fourier transform can then be used to extract an approximate gradient from the phases accumulated on an appropriate superposition of basis states. By using higher-order finite-difference formulas, Gilyén et al. are able to estimate the gradient with a scaling that is optimal (up to logarithmic factors) for a particular family of smooth functions. We restate the formal properties of their algorithm in the theorem below.
Theorem 4 (Theorem 25, Ref. 14 (rephrased)). Let ε, c ∈ R_+ be fixed constants, with ε ≤ c. Let M ∈ Z_+ and x ∈ R^M. Suppose that f : R^M → R is an analytic function such that for every k ∈ Z_+, the following bound holds for all k-th order partial derivatives of f at x (denoted by ∂_α f(x) for index sequences α ∈ {1, ..., M}^k):

|∂_α f(x)| ≤ c^k k^{k/2}.

Then, there is a quantum algorithm that outputs an estimate g ∈ R^M such that ‖∇f(x) − g‖_∞ ≤ ε, with probability at least 1 − δ. This algorithm makes O(c√M log(M/δ)/ε) queries to a probability oracle for f.
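To make this concrete, here is a classical one-dimensional simulation of the phase-then-Fourier-transform step: linear phases e^{2πi N f(x_k)} concentrate the Fourier spectrum at an index proportional to the slope of f. The grid size, scaling, and test function are our own illustrative choices:

```python
import numpy as np

N = 256                                  # grid points (2^n for an n-qubit register)
slope = 80 / N                           # true gradient, chosen to sit on the grid
f = lambda x: slope * x + 0.002 * x**2   # weakly nonlinear test function

k = np.arange(N)
x = (k - N // 2) / N                     # discretized inputs in [-1/2, 1/2)
# Phase kickback writes e^{2 pi i N f(x_k)} onto the index register
amps = np.exp(2j * np.pi * N * f(x)) / np.sqrt(N)

# The Fourier transform concentrates the amplitude near index N * f'(0)
spectrum = np.abs(np.fft.fft(amps))**2
g = int(np.argmax(spectrum))
est = g / N if g < N // 2 else (g - N) / N   # map FFT index back to a slope

assert abs(est - slope) <= 1 / N
```

For an exactly linear f the spectrum is a single spike; the small quadratic term broadens it slightly, which is why Gilyén et al.'s higher-order finite-difference formulas are needed to control the error for general smooth functions.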

Expectation values via the gradient algorithm
To construct our algorithm and prove Theorem 1, we build a probability oracle for a function whose gradient encodes the expectation values of interest and apply the quantum algorithm for the gradient.
Proof of Theorem 1. We begin by defining the parameterized unitary

U(x) := Π_{j=1}^M e^{−2i x_j O_j}

for x ∈ R^M. The derivative of this unitary with respect to x_j, evaluated at the origin, is ∂_j U(0) = −2i O_j. We are interested in the expectation of the O_j with respect to the state |ψ⟩, so we define the following function f:

f(x) := (1 − Im⟨ψ|U(x)|ψ⟩)/2.    (7)

Using Eq. (7), we have ∂_j f(0) = −(1/2) Im(−2i⟨ψ|O_j|ψ⟩) = ⟨ψ|O_j|ψ⟩. Therefore, the gradient ∇f(0) is precisely the collection of expectation values of interest. Now, we verify that f satisfies the conditions of Theorem 4. Observe that f is analytic and that the k-th order partial derivative of f with respect to any collection of indices α ∈ {1, ..., M}^k takes the form

∂_α f(x) = −(1/2) Im( (−2i)^k ⟨ψ| V(x, α) |ψ⟩ )

for some operator V(x, α) which depends on both α and x. Note that V is a product of terms which are either unitary or from {O_j}. Since ‖O_j‖ ≤ 1 for all j, we have ‖V‖ ≤ 1, and therefore |∂_α f(0)| ≤ 2^{k−1} for all k and α.
By setting c = 2, we satisfy the derivative conditions of Theorem 4.
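The key identity, namely that the gradient of f at the origin recovers the expectation values, can be checked directly with dense linear algebra. We adopt the convention U(x) = Π_j e^{-2i x_j O_j} and f(x) = (1 - Im⟨ψ|U(x)|ψ⟩)/2 for this check (the signs and scaling are our choice), under which ∂_j f(0) = ⟨ψ|O_j|ψ⟩. The random observables and state are placeholders:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
n_qubits, M = 3, 4
dim = 2**n_qubits

def rand_obs(d):
    X = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    H = (X + X.conj().T) / 2
    return H / np.linalg.norm(H, 2)          # spectral norm at most 1

Os = [rand_obs(dim) for _ in range(M)]
psi = rng.normal(size=dim) + 1j * rng.normal(size=dim)
psi /= np.linalg.norm(psi)

def f(x):
    U = np.eye(dim, dtype=complex)
    for xj, Oj in zip(x, Os):
        U = U @ expm(-2j * xj * Oj)          # U(x) = prod_j e^{-2i x_j O_j}
    return (1 - (psi.conj() @ U @ psi).imag) / 2

h = 1e-5
for j, Oj in enumerate(Os):
    e = np.zeros(M); e[j] = h
    fd = (f(e) - f(-e)) / (2 * h)            # central finite difference
    expectation = (psi.conj() @ Oj @ psi).real
    assert abs(fd - expectation) < 1e-6      # d f / d x_j at 0 equals <O_j>
```

The quantum algorithm evaluates this gradient coherently rather than by finite differences, but the identity being exploited is the same.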
To construct a probability oracle for f (see Definition 2), we need a quantum circuit that encodes f(x) into the amplitudes of an ancilla. We construct such a circuit using the Hadamard test for the imaginary component of ⟨ψ|U(x)|ψ⟩ [17, 18]. Let

F(x) := (H ⊗ I)(S† ⊗ I)(c-U(x))(H ⊗ I)(I ⊗ U_ψ),

where H denotes the Hadamard gate, c-U(x) the U(x) gate controlled on the first qubit, and S := |0⟩⟨0| + i|1⟩⟨1| the phase gate. Applied to |0⟩ ⊗ |0⟩, this circuit encodes f(x) in the amplitudes with respect to the computational basis states of the first qubit:

F(x) |0⟩|0⟩ = √(1 − f(x)) |0⟩|φ_0(x)⟩ + √(f(x)) |1⟩|φ_1(x)⟩

for some normalized states |φ_0(x)⟩ and |φ_1(x)⟩ (see Appendix D for more details). Note that F(x) uses a single call to the oracle U_ψ. All that remains is to add quantum controls to the rotations in F(x), so that F(x) is controlled on a register encoding x. Specifically, we consider the unitary

U_f := Σ_{k ∈ G_n^M} |k⟩⟨k| ⊗ F(x_max k),

where G_n^M is a set of 2^{nM} points distributed in an M-dimensional unit hypercube, with n = O(log(1/ε)), and x_max is a rescaling factor. The values of x_max and n are chosen to satisfy the requirements of the gradient algorithm (see Appendix C). Here, |k⟩ = |k_1⟩ ... |k_M⟩ for k ∈ G_n^M denotes the basis state storing the binary representation of k in M n-qubit index registers. The controlled time evolution operator for each O_j can be implemented efficiently as a product of n controlled-e^{−ixO_j} gates with exponentially spaced values of x, each controlled on the appropriate qubit of the j-th index register. We illustrate an example of such a U_f in Figure 1.
U_f is a probability oracle for the function f, and each call to U_f involves a single call to the state preparation oracle U_ψ. Theorem 4 then implies that, with probability at least 2/3, every component of the gradient of f, and hence all of the expectation values ⟨ψ|O_j|ψ⟩, can be estimated to within an error ε using O(√M/ε) queries to U_f. The complexity in terms of the controlled time evolutions follows from multiplying the number of controlled time evolutions required for each query to U_f, i.e., O(log(M/ε)) per observable, by the total number of queries, i.e., O(√M/ε). As discussed in Appendix C, we have x_max ∈ O(1/√M) as a consequence of the details of the proof of Theorem 4 in Ref. 14. This completes the proof of Theorem 1.
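The decomposition into exponentially spaced controlled evolutions works because the factors commute: applying e^{-i x_max 2^{-(l+1)} O_j} conditioned on bit l of the index register implements e^{-i x_max (k/2^n) O_j} for any register value k. A quick numerical check with an arbitrary test observable:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 4))
O = (X + X.T) / 2
O /= np.linalg.norm(O, 2)      # spectral norm at most 1

n = 6                          # qubits in the index register
x_max = 0.25
k = 0b101101                   # value stored in the register
bits = [(k >> (n - 1 - l)) & 1 for l in range(n)]

# Bit l (most significant first) controls evolution for time x_max * 2^{-(l+1)}
U = np.eye(4, dtype=complex)
for l, b in enumerate(bits):
    if b:
        U = U @ expm(-1j * x_max * 2.0**-(l + 1) * O)

target = expm(-1j * x_max * (k / 2**n) * O)
assert np.allclose(U, target)
```

Since all the factors involve the same O_j, the product of exponentials equals the exponential of the summed times, and the sum of the selected 2^{-(l+1)} weights is exactly the fixed-point value k/2^n.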
Furthermore (see Appendix C), the space complexity of the gradient algorithm is the same as that of the probability oracle up to an additive logarithmic factor [19]. Therefore, our algorithm uses O(M log(1/ε) + N) qubits.

Discussion
In this letter, we considered the problem of simultaneously estimating the expectation values of multiple observables with respect to a pure state |ψ⟩. We presented an algorithm that uses O(√M/ε) applications of U_ψ and its inverse, where M denotes the number of observables, ε the target error, and U_ψ a unitary that prepares |ψ⟩. We explained how a lower bound on a closely related problem posed in Ref. 15 implies that, for algorithms given black-box access to U_ψ, this query complexity is worst-case optimal up to logarithmic factors when ε ∈ (0, 1/(3√M)). In fact, our algorithm affirmatively resolves an open question from Ref. 15 regarding the achievability of this bound for the simultaneous estimation of classical random variables [20]. These results imply that the optimal cost for expectation value estimation can become exponentially worse with respect to M when one demands a scaling that goes as ε^{−1} instead of ε^{−2}. Furthermore, the instances used in establishing our lower bounds involve a set of mutually commuting observables, implying that commutativity is not necessarily helpful when one demands ε^{−1} scaling.
We presented a comparison with other approaches for the estimation of expectation values in Table I, which we elaborate on in Appendix A and Appendix B. For example, we find that our algorithm is capable of estimating each element of the k-body fermionic reduced density matrix (k-RDM) of an N-mode system to within error ε using O(N^k/ε) state preparation queries. This offers an unconditional asymptotic speedup compared to existing methods when ε = o(N^{−k/3}). This may be particularly useful in practical applications where we wish to achieve a fixed error in extensive quantities by measuring the 1- or 2-RDM and summing Ω(N) elements.
Our gradient-based approach to estimating expectation values can be extended to other properties. For example, consider the task of evaluating a collection of two-point dynamic correlation functions. These functions take the form ⟨ψ| U(t, t′)† A U(t, t′) B |ψ⟩, where A and B are some simple operators and U(t, t′) is the time evolution operator that maps the system from time t′ to time t. These correlation functions are often directly accessible in experiment, as in the case of angle-resolved photoemission spectroscopy [3], and are also central to hybrid quantum-classical methods based on dynamical mean-field theory [21-23]. In Appendix E, we explain how a generalization of our approach can reduce the number of state preparations required for estimating a collection of these correlation functions.
Although we focused on quantifying the number of state preparation oracle queries, we also considered two other complexity measures. Our approach requires time evolution by each of the M observables. The total duration of time evolution required scales as O(M/ε). We also need an additional O(M log(1/ε)) qubits, although we can modify our approach to trade off between space and query complexities (see Appendix F). When we are interested in simultaneously estimating O(N) expectation values, the asymptotic scaling of the space complexity is only logarithmically larger than that of storing the system itself. This is the case in a variety of contexts, for example, in the evaluation of the momentum distribution [24]. In other situations, the space overhead may be more substantial, though the capability of modern simulation algorithms to use so-called "dirty ancillae" (temporarily borrowing qubits in an arbitrary state) may offset this challenge in some contexts [25-27]. As a concrete example, we consider the double-factorized simulation of the electronic structure Hamiltonian proposed in Ref. 26. Von Burg et al. find that the time complexity of their simulation algorithm can be minimized by using O(N^{3/2}) qubits for data lookup. These same qubits could be used by our algorithm for expectation value estimation to parallelize the measurement of O(N^{3/2}) observables, offering an O(N^{3/4}) asymptotic speedup without any additional qubit overhead.
Another potential reason for modifying our approach arises when the observables of interest have different norms, or when the desired precision varies. In Appendix G, we consider addressing this situation by measuring certain observables using our strategy and measuring others using a sampling-based method. In Appendix H, we take a different approach, and generalize Gilyén et al.'s gradient estimation algorithm to accommodate functions whose gradient components are not necessarily uniformly bounded. This allows us to simultaneously estimate the expectation values of observables {O_j} with arbitrary norms ‖O_j‖ (possibly greater than 1) using O(√(Σ_j ‖O_j‖²)/ε) queries. By rescaling the individual observables we can then also vary how precisely we estimate each expectation value, thereby extending Theorem 1 to the most general setting.
Our focus has been on the asymptotic scaling of our approach, but it will also be desirable to understand the actual costs. Performing a fault-tolerant resource estimate and a comparison against other measurement strategies in the context of a practical application would be a useful line of future work. It is possible that our approach could be modified to obtain a further speedup by taking advantage of the structure of the states and/or observables for particular problems of interest. Another potentially fruitful direction would be to explore extensions of the gradient algorithm to yield quantum algorithms for the Hessian or even higher-order derivatives.
Extracting useful information from a quantum computation, especially a quantum simulation, is a bottleneck for many applications. This is especially true in fields such as quantum chemistry and materials science, where it may be necessary to couple high-level quantum calculations with coarser approximations at other length scales in order to describe macroscopic physical phenomena. We expect that our gradient-based approach to the estimation of expectation values will be a useful tool and a starting point for related approaches to other problems.

Appendix A: Prior work on expectation value estimation
This letter focuses on the task of estimating the expectation values of multiple observables with respect to a pure state |ψ⟩. Motivated by settings where the cost of state preparation is the dominant factor, we mainly quantify the resources required in an oracle model where we count the number of calls to the state preparation unitary and its inverse. To provide concrete motivation for this cost model and for the task in general, consider the example where our state of interest is the unknown ground state of some second-quantized electronic structure Hamiltonian under the Jordan-Wigner transformation. In this case, the state preparation step is expected to be tractable under certain assumptions but relatively expensive, even using modern methods (e.g., by applying the ground state preparation algorithms of Refs. 28 and 29 in conjunction with state-of-the-art techniques for block-encoding the Hamiltonian [25, 26]). At the same time, the observables of interest may be especially simple (e.g., the elements of a fermionic reduced density matrix). We discuss the situation where the cost of state preparation does not necessarily dominate, and the possible trade-offs available in the context of our approach, in Appendix F.

Let U_ψ denote the unitary which prepares |ψ⟩ from the |0⟩ state, and let {O_j} be a collection of M Hermitian operators. For the sake of simplifying the comparison with existing approaches, we make the additional assumption in this section that the O_j are also unitary, though this requirement could be relaxed by using techniques based on block-encodings [7]. As in the main text, our goal is to minimize the number of calls to U_ψ and U_ψ† required to obtain estimates o_j of the M expectation values ⟨ψ|O_j|ψ⟩ such that |o_j − ⟨ψ|O_j|ψ⟩| ≤ ε for all j ∈ {1, ..., M} with probability at least 2/3. We note that some of the methods we compare against are usable under a weaker input model, where the algorithm is merely provided copies of |ψ⟩ rather than access to a state preparation oracle.

A straightforward approach is to repeatedly prepare the state |ψ⟩ and simultaneously measure as many of the operators as possible on each copy. To this end, consider dividing the M operators into G groups of mutually commuting terms. Then O(G log(M)/ε²) calls to U_ψ suffice to estimate the M expectation values. This can be accomplished by simultaneously diagonalizing the operators within each group and measuring in the common eigenbasis when this is tractable, or by performing phase estimation on the operators within each group using the same copy of |ψ⟩. The outcomes are then averaged over repeated iterations. One key advantage of this approach is that, although it has poor scaling with the target error ε, it does not scale polynomially with M, and instead scales with G, which could be considerably smaller than M. In practice, optimally grouping and sampling the observables may be challenging and can introduce substantial overheads not captured by the query complexity (e.g., the classical cost of finding optimal groupings [30, 31] and of simultaneous diagonalization, and the quantum gate complexity of implementing basis change unitaries corresponding to the common eigenbasis, or of phase estimation).

Alternatively, using a strategy based on amplitude estimation [5], we can estimate expectation values with a scaling proportional to ε^{−1} [6, 7]. Amplitude estimation, as originally implemented in Ref. 6, allows for the estimation of the expectation value ⟨ψ|U|ψ⟩ of a unitary operator U.
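The grouping step in the sampling approach discussed above can be carried out greedily. Below is a minimal first-fit sketch over Pauli strings; the commutation test and the example operators are our own illustration:

```python
# Greedy first-fit grouping of Pauli strings into mutually commuting groups.
def commutes(p, q):
    # Two Pauli strings commute iff they differ (with both factors
    # non-identity) on an even number of tensor positions.
    anti = sum(1 for a, b in zip(p, q) if a != "I" and b != "I" and a != b)
    return anti % 2 == 0

paulis = ["XXI", "YYI", "ZZI", "IXX", "IZZ", "XIX", "ZIZ"]

groups = []
for p in paulis:
    for g in groups:
        if all(commutes(p, q) for q in g):
            g.append(p)
            break
    else:
        groups.append([p])  # p commutes with no existing group; open a new one
```

Greedy first-fit is not optimal (the underlying minimum clique cover problem is NP-hard in general), but it illustrates the reduction from M operators to G measurement settings; here G = 3 groups cover M = 7 operators.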
The amplitude estimation algorithm of Ref. 6 works by performing phase estimation on the Szegedy walk operator

S := (2|ψ⟩⟨ψ| − I)(2U|ψ⟩⟨ψ|U† − I).

One can verify that the operator S preserves the subspace spanned by |ψ⟩ and U|ψ⟩, on which it has eigenvalues e^{±ix} for x = cos^{−1}(2|⟨ψ|U|ψ⟩|² − 1). Phase estimation on S therefore allows for the determination of |⟨ψ|U|ψ⟩|² up to an accuracy ε with a cost that scales as 1/ε. Information about the phase may be regained by repeating this for different variants of U controlled by an ancilla qubit. One can generalize this approach to the estimation of the expectation value of an arbitrary observable O by providing a U that block-encodes O [7]. Unfortunately, this algorithm does not generalize in a straightforward way to the task of estimating multiple expectation values, even when the operators commute (beyond the strategy of treating each one separately). As a consequence, estimating all M expectation values using amplitude estimation requires O(M/ε) calls to U_ψ and U_ψ†. While this leads to an asymptotic advantage over naive sampling in some settings, it fails to do so in cases where G/ε = o(M).
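The eigenphase relation x = cos^{-1}(2|⟨ψ|U|ψ⟩|^2 - 1) can be confirmed numerically by diagonalizing the product of the two reflections about |ψ⟩ and U|ψ⟩ (writing the walk operator explicitly as (2|ψ⟩⟨ψ| - I)(2U|ψ⟩⟨ψ|U† - I) is our reading; Ref. 6 gives the general construction):

```python
import numpy as np

rng = np.random.default_rng(3)
dim = 8
X = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
U, _ = np.linalg.qr(X)                     # a random unitary via QR

psi = rng.normal(size=dim) + 1j * rng.normal(size=dim)
psi /= np.linalg.norm(psi)
phi = U @ psi

R_psi = 2 * np.outer(psi, psi.conj()) - np.eye(dim)
R_phi = 2 * np.outer(phi, phi.conj()) - np.eye(dim)
S = R_psi @ R_phi                          # product of the two reflections

a = psi.conj() @ U @ psi
x = np.arccos(np.clip(2 * abs(a)**2 - 1, -1.0, 1.0))

eigs = np.linalg.eigvals(S)
# S acts as the identity off span{|psi>, U|psi>}; inside that plane its
# eigenvalues are e^{+ix} and e^{-ix}
assert np.isclose(np.abs(eigs - np.exp(1j * x)).min(), 0, atol=1e-8)
assert np.isclose(np.abs(eigs - np.exp(-1j * x)).min(), 0, atol=1e-8)
```

The product of two reflections is a rotation by twice the angle between the reflection axes, which is where the doubled-angle identity cos x = 2|⟨ψ|U|ψ⟩|² − 1 comes from.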
These two well-known strategies are complemented by the more recent body of work that began with Aaronson's definition of the "shadow tomography" problem, the problem of estimating the expectation value of many two-outcome measurements given multiple copies of some input state [8]. Ref. 8 proposes a computationally expensive protocol for this task that achieves scaling logarithmic in the number of two-outcome measurements M and proportional to ε^{−4}, up to logarithmic factors. Ref. 12 put forward an alternative protocol based on randomized measurement that also scales logarithmically in M, and improves the scaling with ε from O(ε^{−4}) to O(ε^{−2}) at the expense of limiting the types of observables that can be treated efficiently (i.e., without introducing a scaling polynomial in the Hilbert space dimension). A series of additional works have offered improvements and variations on both the randomized measurement approach [13, 32, 33] and the more general approaches based on gentle measurements [9-11].
In this work, we are primarily concerned with the high-precision regime, and aim to achieve O(ε^{−1}) scaling. It is natural to ask whether it is possible to simultaneously achieve an asymptotic cost that is logarithmic in the number of observables M and linear (up to logarithmic factors) in ε^{−1} using our input model, where the input state is unknown and accessed via a black-box state preparation unitary. In Corollary 3 in the main text, we point out that recent results preclude this, showing that the desired ε^{−1} scaling in the precision necessarily comes at the cost of a square root dependence on M for certain collections of observables. On the other hand, Ref. 15 provides an example of a collection of operators, namely projectors onto orthogonal states, where this scaling is achievable.

Appendix B: Applications
In this appendix, we apply our method to three illustrative cases and consider the potential for asymptotic speedups over alternative approaches. As in the main text, we focus on quantifying the cost with respect to the number of state preparation oracle queries. We note that some of the approaches we compare against (namely, methods based on sampling and shadow tomography) are usable under a weaker access model, where copies of the state are provided rather than queries to the state preparation oracle.
A major application of these ideas is the measurement of the fermionic k-body reduced density matrices (k-RDMs) of a particular pure state. The k-RDM of a pure state |χ⟩ with support on N fermionic modes is a tensor specified by the N^{2k} "matrix elements" of the form ⟨χ| a†_{p_1} ⋯ a†_{p_k} a_{p_{k+1}} ⋯ a_{p_{2k}} |χ⟩, where the 2k indices p_j take values ranging over the N modes. The 1- and 2-RDMs are particularly important, being sufficient to determine the expected energy of a state and many other properties of interest [34].
The O(N^{2k}) terms in the k-RDM can be divided into Ω(N^k) groups of O(N^k) mutually commuting operators [35]. This allows for each of the terms to be estimated to within a precision ε by a simple sampling procedure with a query complexity of O(N^k/ε^2) [13]. The asymptotic scaling with ε can be quadratically improved (see Appendix A) by applying amplitude estimation to learn each component individually with the requisite error, leading to a query complexity of O(N^{2k}/ε). Some shadow tomography protocols scale better with respect to N and k than either of these approaches, at the expense of scaling with ε^{−4} [8-11]. In particular, the technique described in Ref. 11 can estimate the k-RDM at a cost that scales as O(k log(N)/ε^4) by estimating the expectation values of all degree-2k Majorana operators under the Jordan-Wigner transformation to within a precision ε and using these values to reconstruct the k-RDM. Our gradient-based algorithm's scaling of O(N^k/ε) for this application follows directly from Theorem 1 in the main text. In terms of the number of state preparation oracle queries, our method thus strictly improves upon sampling and amplitude estimation for this application, and provides an asymptotic advantage over all prior approaches for learning the k-RDM when ε ∈ o(N^{−k/3}). This might naturally occur when we are interested in obtaining estimates of some extensive quantities to within a fixed precision by summing estimates of O(N) local observables.
For the case where we wish to apply our ideas to compute the expectation values of M mutually commuting observables with respect to a state |ψ⟩, the potential benefit still exists, but the trade-offs are less favorable. In this case, commutation allows every expectation value to be measured on a single copy of |ψ⟩. Sampling then yields O(log(M)/ε^2) scaling, where the logarithmic dependence on M arises from the application of concentration inequalities together with the union bound to guarantee that all of the M expectation values are estimated to within ε simultaneously (with some constant probability of success). This is in contrast to the O(√M/ε) scaling of Theorem 1 in the main text. This does not contradict the optimality of our approach in the high-precision regime, due to the requirement that ε < 1/(3√M) in Theorem 2 and Corollary 3 in the main text. Inside this region of applicability, sampling has at best the same scaling as our algorithm. The trade-offs between sampling and quantum gradient approaches, as well as algorithms that blend the two, are discussed in Appendix F.

Another simple case to consider is one where we wish to measure M observables that all fail to commute. Then it is clear that our approach offers an unconditional speedup when compared with approaches based on sampling and amplitude estimation (in terms of the number of state preparation oracle queries). It is still possible in this case to obtain a better scaling with respect to M by employing shadow tomography [8-11]. We provided a brief discussion of these approaches in Appendix A. To summarize, these strategies require a number of state preparation oracle calls (actually, copies of the state) that scales logarithmically (or poly-logarithmically) with M at the cost of scaling with ε^{−4}. When ε = o(M^{−1/6}), our approach has a favorable asymptotic scaling in terms of the state preparation oracle query complexity.
We note that these shadow tomography protocols are limited in other ways that may ultimately lead to a gate complexity advantage for our method in particular applications, even when the query complexity advantage is absent. Some such protocols, such as the one in Ref. 8, have gate complexities that scale exponentially in one or more relevant parameters. Others, such as the one in Ref. 11, are computationally efficient but only apply to limited sets of operators (Pauli observables in the case of Ref. 11). The proposal of Ref. 9 avoids both of these obstacles but returns a representation of the state in terms of a quantum circuit that must itself be prepared and measured to obtain the expectation values of interest. In some cases, it might be fruitful to apply our measurement techniques to the state whose preparation circuit is learned by this latter proposal.

Appendix C: Further background on Gilyén et al.'s quantum algorithm for gradient estimation
In this appendix, we restate a few of the important definitions used in the main theorem summarizing the performance of the quantum algorithm for the gradient (Theorem 4 in the main text of this work, Theorem 25 in Ref. 14). We also point out some details of the gradient algorithm relevant to our consideration of costs beyond the phase oracle complexity that are not included in the main statement of the theorem. We refer the interested reader to Ref. 14 for a rigorous analysis and further information about the implementation.
In Theorem 4 of the main text, we referred to the probability oracle access model for f .We recall this definition here, as well as the definitions for the phase oracle access model from Ref. 14 and the binary oracle access model used in Jordan's original gradient algorithm [16].
Definition 3 (Probability oracle). Consider a function f : ℝ^M → [0, 1]. A probability oracle U_f for f is a unitary operator that acts as

U_f |x⟩|0⟩ = √(1 − f(x)) |x⟩|φ₀(x)⟩|0⟩ + √(f(x)) |x⟩|φ₁(x)⟩|1⟩,

where |x⟩ denotes a discretization of the variable x encoded into a register of qubits, |0⟩ denotes the all-zeros state of a register of ancilla qubits, and |φ₀(x)⟩ and |φ₁(x)⟩ are arbitrary quantum states.
Definition 4 (Phase oracle). Consider a function f : ℝ^M → ℝ. A phase oracle A_f for f is a unitary operator that acts as

A_f |x⟩|0⟩ = e^{i f(x)} |x⟩|0⟩,

where |x⟩ denotes a discretization of the variable x encoded into a register of qubits and |0⟩ denotes the all-zeros state of a register of ancilla qubits.
Definition 5 (Binary oracle). Consider a function f : ℝ^M → ℝ. For some precision parameter ε > 0, an ε-accurate binary oracle B_f for f(x) is a unitary operator that acts as

B_f |x⟩|0⟩ = |x⟩|f̃(x)⟩,

where |x⟩ denotes a discretization of the variable x encoded into a register of qubits, |0⟩ denotes the all-zeros state of a register of ancilla qubits, and f̃(x) denotes a fixed-point binary number such that |f̃(x) − f(x)| ≤ ε.
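As an illustration (and not the construction used in this work), the probability and phase oracle models can be realized as explicit matrices for a toy function f on a small grid. Here the probability oracle is built from single-qubit rotations acting on one ancilla, with trivial garbage states |φ₀(x)⟩, |φ₁(x)⟩:

```python
import numpy as np

# Toy function f: {0,1,2,3} -> [0,1], indexing a 2-qubit |x> register.
f = np.array([0.1, 0.4, 0.7, 0.95])
d = len(f)

# Probability oracle U_f: |x>|0> -> |x>( sqrt(1-f(x))|0> + sqrt(f(x))|1> ),
# realized as a block-diagonal matrix of single-qubit rotations.
blocks = [np.array([[np.sqrt(1 - p), -np.sqrt(p)],
                    [np.sqrt(p),      np.sqrt(1 - p)]]) for p in f]
U_f = np.zeros((2 * d, 2 * d))
for x, b in enumerate(blocks):
    U_f[2 * x:2 * x + 2, 2 * x:2 * x + 2] = b

# Phase oracle A_f: |x> -> exp(i f(x)) |x>, a diagonal unitary.
A_f = np.diag(np.exp(1j * f))

# Check unitarity, and that the ancilla of U_f|x>|0> is |1> with probability f(x).
assert np.allclose(U_f @ U_f.T, np.eye(2 * d))
for x in range(d):
    state = np.zeros(2 * d)
    state[2 * x] = 1.0                            # the basis state |x>|0>
    out = U_f @ state
    assert np.isclose(out[2 * x + 1] ** 2, f[x])  # amplitude^2 on |x>|1>
assert np.allclose(A_f.conj().T @ A_f, np.eye(d))
print("oracle checks passed")
```

A binary oracle would instead be a permutation matrix writing a fixed-point approximation of f(x) into the ancilla register; the quantum advantage of the gradient algorithm comes from working with the probability/phase models directly.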
While we referred to the probability oracle access model in our Theorem 4 (in the main text), the original Theorem 25 of Ref. 14 described the gradient algorithm purely in terms of the phase oracle access model. However, Ref. 14 also shows how we can efficiently obtain a phase oracle from a probability oracle.
Theorem 5 (Theorem 14, Ref. 14). Let U_f be a probability oracle for a function f(x). For any ε ∈ (0, 1/3), we can implement an ε-approximate phase oracle Ã_f such that

‖Ã_f |ψ⟩ − A_f |ψ⟩‖ ≤ ε

for all input states |ψ⟩, where A_f denotes an exact phase oracle for f. This implementation uses O(log(1/ε)) invocations of the probability oracle U_f (or its inverse) and O(log log(1/ε)) additional ancilla qubits beyond those required by U_f.
Therefore, we can regard calls to a phase oracle for f as equivalent to calls to a probability oracle for f, up to logarithmic factors. More technically, the actual implementation of the gradient algorithm requires the use of a modified phase oracle known as a fractional query phase oracle. A fractional query phase oracle is defined in the same way as the phase oracle of Definition 4, except that it has an additional parameter s ∈ [−1, 1] which rescales the argument of the exponential. Fortunately, Ref. 14 explains how a fractional phase oracle can be naturally arrived at from a probability oracle in a theorem closely related to Theorem 5 (essentially by applying a rotation to rescale the amplitude √(f(x)) of |1⟩ in Eq. (C1) to √(s f(x)), before converting to a phase oracle).
The statement of Theorem 1 in the main text gives the cost of the gradient estimation algorithm in terms of the number of calls to a probability oracle. Similarly, for our expectation value estimation algorithm, we mainly focus on quantifying the cost in terms of oracle complexity (in particular, the number of queries to the state preparation oracle). However, we also wish to describe some of the secondary costs that we encounter in our algorithm, so it is useful to note a few additional details about the gradient estimation algorithm.
A secondary cost we might consider is the amount of time evolution required for each observable. From the proof of Theorem 25 in Ref. 14, we know that the phase oracle is queried at uniform superpositions of points within a series of M-dimensional boxes, the largest of which has a side length set by a rescaling parameter r (defined implicitly in Ref. 14) and a positive integer m. Here, c is the same fixed constant as that introduced in Theorem 4 in the main text for bounding the partial derivatives of f. As a consequence of these expressions, the largest side length shrinks as M increases; specifically, it is O(1/√M). Calls to the phase oracle are generated using calls to the probability oracle and its inverse over the same input parameters (see the proof of Theorem 14 in Ref. 14, stated above as Theorem 5 for convenience), and the box size for the probability oracle directly determines the amount of time evolution by each observable. Therefore, for each probability oracle query, we require at most O(1/√M) units of time evolution by each observable.

The other substantial secondary cost to consider is the space complexity. The probability oracle U_f for our expectation value algorithm can be implemented using N + 1 + Mn qubits, where n = O(log(1/ε)). The additive cost of N + 1 comes from the system register and the ancilla for the Hadamard test. The Mn term is due to the need for M n-bit ancilla registers to prepare a superposition over states indexing the 2^{nM} points in the hypercube G_n^M. To be specific, this hypercube is the Cartesian product of M copies of the set G_n := {j/2^n − 1/2 + 1/2^{n+1} : j ∈ {0, ..., 2^n − 1}}. The logarithmic scaling of n with 1/ε comes from the precision requirements of the gradient algorithm (see the definition of n in Theorem 21 in Ref. 14, and note that the factors of ℓ ∈ {−m, −m + 1, ..., m} that appear in the argument of the oracle in Theorem 25 can be accounted for by compiling a family of 2m + 1 related phase/probability oracles on grids of varying size). The conversion from a probability oracle to an ε′-approximate phase oracle requires only O(log log(1/ε′)) additional ancilla qubits. The gradient algorithm nominally requires storing O(log(M/δ)) copies of each value and performing a coherent median-finding step, but this can be performed classically for our purposes, eliminating the need for an additional multiplicative factor of O(log M) in the number of qubits. We therefore have that the space complexity of our estimation algorithm is N + 1 + Mn plus the ancilla overhead above, where Q is the number of queries we make to the phase oracle. The oracle, gate, and qubit complexities of our algorithm for estimating the expectation values of a general collection of observables (with arbitrary norms) are analyzed in Appendix H; see SI Theorem 6 for an explicit statement.

We assume that the costs of implementing the operators involved, in units of time evolution by these operators, are all negligible. This is particularly reasonable for dynamic correlation functions, where B and the A_j's are frequently some simple local operators. As with the case of expectation value estimation, we are interested in the cost (up to logarithmic factors) required to estimate each quantity to within some additive error ε with a success probability of at least 2/3. The naive approach is to use amplitude estimation to evaluate each quantity separately, resulting in a resource cost of O(M/ε) calls to the state preparation oracle U_ψ and its inverse, along with O(Σ_{j=1}^M t_j / ε) units of time evolution under the system Hamiltonian.
Our alternative approach achieves an unconditional advantage in the number of state preparation calls and may achieve an advantage with respect to the total duration of time evolution, depending on the choice of the time points t_j. For convenience, we define t_0 := 0. We proceed as in the expectation value estimation case, constructing a parameterized unitary U(x) for use with the Hadamard test. Taking the derivative of U(x) with respect to x and evaluating the resulting expression at x = 0, we see that the M different partial derivatives of U(x) with respect to the x_i are the M different unitary operators whose matrix elements we would like to estimate. Just as we did in Equation 11 in the main text, we can apply a Hadamard test to U(x). We can then add quantum controls to the rotation angles x to construct a probability oracle for a function whose gradient yields the real parts of the matrix elements of interest. We could likewise use the Hadamard test for the imaginary component of U(x) to obtain the imaginary components of the matrix elements of interest. It is simple to show that the resulting functions satisfy the technical conditions of Theorem 4 from the main text. We can then apply the quantum algorithm for the gradient.

We now analyze the asymptotic scaling of this approach. Each application of the Hadamard test circuit for U(x) requires a single call to the state preparation oracle U_ψ and its inverse, plus 2 t_M units of time evolution (the evolutions between successive time points telescope, since t_0 := 0). We are interested in estimating M different quantities to within a precision ε, and so we require O(√M/ε) calls to our probability oracle. Therefore, we require O(√M/ε) calls to U_ψ and U_ψ†, along with O(√M t_M / ε) units of time evolution under the system Hamiltonian. Regardless of the chosen time points, the scaling in the number of state preparation oracle calls is a factor of √M smaller than for a scheme based on amplitude estimation. If t_M = o(Σ_{j=1}^M t_j / √M), then using our approach also scales more favorably in terms
of the total time evolution. For example, consider the case where the time points are evenly spaced in increments of Δ. Then our approach requires O(M^{1.5} Δ / ε) units of time evolution, whereas the approach based on estimating each value independently using amplitude estimation requires O(M² Δ / ε) units.
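A quick numerical sanity check of the evenly spaced case, using the leading-order scalings above with all constants dropped (a simplification of ours, not a precise resource count):

```python
import math

def gradient_total_time(M, delta, eps):
    # Gradient method: O(sqrt(M)/eps) calls, total evolution O(M**1.5 * delta / eps).
    return M ** 1.5 * delta / eps

def ae_total_time(M, delta, eps):
    # Amplitude estimation on each value separately: O(1/eps) calls of
    # duration t_j = j*delta each, totaling sum_j t_j / eps = O(M**2 * delta / eps).
    return sum(j * delta for j in range(1, M + 1)) / eps

M, delta, eps = 100, 0.1, 1e-3
print(gradient_total_time(M, delta, eps))  # 1.0e5 units
print(ae_total_time(M, delta, eps))        # ~5.05e5 units
assert gradient_total_time(M, delta, eps) < ae_total_time(M, delta, eps)

# The advantage condition from the text: t_M = o(sum_j t_j / sqrt(M)).
ts = [j * delta for j in range(1, M + 1)]
assert max(ts) < sum(ts) / math.sqrt(M)
```

For evenly spaced time points the condition t_M = o(Σ_j t_j / √M) is automatically satisfied, so the gradient-based scheme wins a factor of roughly √M in total evolution time as well as in state preparations.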
Substituting the result gives C = O(log(M)/(ε²α)). This suggests that for the case of exponentially shrinking group sizes, the asymptotic scaling is identical to that of the sampling method alone; whereas if ε√M ≪ 1, no positive optimal K exists, and the extreme value theorem suggests that the scaling abruptly shifts to the O(√M/ε) scaling predicted by our gradient method.

Polynomially shrinking group sizes
These trade-offs become more visible in cases where the size of each group of commuting terms shrinks polynomially with k. Specifically, let us assume that for some α > 1, Γ > 0, and any K ≥ 1, the group sizes decay polynomially in k with exponent α and prefactor Γ. From this, and from the fact that 1/k^α is a convex function of k, we have that the cumulative size of the first K groups approaches M as K → ∞ for α > 1. This implies a bound on the total cost C whose first term shrinks monotonically with K, whereas the second grows monotonically. This suggests that if a local optimum exists, then it is a minimum for C. This local optimum can be found by differentiating to find the best K, and substituting this value into our expression for C yields the intermediate scaling. This shows that, depending on the falloff rate of the cumulative sum of the sizes of the groups of commuting terms, intermediate scaling between the Heisenberg-limited scaling of the gradient-based algorithm and the shot-noise-limited scaling of the grouping can be observed. Specifically, shot-noise scaling is optimal as α → ∞ and Heisenberg-limited scaling occurs as α → 1.

Like the algorithm of Theorem 1 in the main text, a central idea is to encode the expectation values in the gradient of the function defined in SI Eq. (H1), a probability oracle for which can be implemented as described in the main text. However, the gradient estimation algorithm of Ref. 14 requires a uniform upper bound on the gradient components in order to guarantee the same additive error ε for each component. This would lead to the suboptimal O(B_max √M / ε) scaling. Therefore, to prove SI Theorem 6, we must first generalize the gradient estimation algorithm of Ref. 14 to allow for non-uniform bounds on the gradient components. We then analyze a different condition on the higher-order derivatives (from that of Theorem 4 in the main text) that is more directly relevant to our function f.
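To make the grouping-versus-gradient trade-off concrete, here is a toy numerical minimization over the cutoff K. We stress that the cost model below — sampling the first K commuting groups at cost K/ε² and applying the gradient method to the remaining observables at cost √(M − S_K)/ε, with group sizes proportional to 1/k^α — is our own illustrative assumption, not the exact expressions of this appendix:

```python
import math

def blended_cost(K, sizes, eps):
    # Hypothetical model: sample the K largest commuting groups (one 1/eps^2
    # estimation per group), use the gradient method on the remaining terms.
    remaining = sum(sizes[K:])
    return K / eps**2 + math.sqrt(remaining) / eps

M, alpha, eps = 4096, 2.0, 0.2
norm = sum(1 / k**alpha for k in range(1, M + 1))
sizes = [M / (k**alpha * norm) for k in range(1, M + 1)]  # polynomially shrinking

costs = [blended_cost(K, sizes, eps) for K in range(M + 1)]
K_star = min(range(M + 1), key=lambda K: costs[K])

# At this intermediate precision the blend beats both pure strategies.
assert 0 < K_star < M
assert costs[K_star] <= costs[0]   # pure gradient (K = 0)
assert costs[K_star] <= costs[M]   # pure sampling (K = M)
print(K_star, costs[K_star])
```

As ε shrinks the optimal K collapses to 0 (pure gradient, Heisenberg-limited), while for larger ε or faster-decaying group sizes the sampling term dominates, mirroring the α → 1 and α → ∞ limits discussed above.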

Generalized gradient estimation algorithm
We recall some notation from Ref. 14. For any n ∈ ℕ, they define the one-dimensional grid G_n := {j/2^n − 1/2 + 1/2^{n+1} : j ∈ {0, ..., 2^n − 1}}, and use |x⟩ for x ∈ G_n to implicitly denote the state storing the binary representation of the integer j, where x = j/2^n − 1/2 + 1/2^{n+1}. For a function f : ℝ^M → ℝ, a phase oracle O_f for f is any unitary that acts as |x⟩ → e^{i f(x)} |x⟩, where |x⟩ = |x_1⟩ ... |x_M⟩ for x ∈ G_n^M. Our main modification to Algorithm 2 of Ref. 14 (i.e., Jordan's gradient estimation algorithm [16]) is to allocate a possibly different number of qubits n_i to each |x_i⟩ register. We then let G := G_{n_1} × ··· × G_{n_M} denote this hyper-rectangular grid. We state the modified algorithm for completeness. For now, the n_i and the parameter S, which determines the number of (fractional) queries to the phase oracle, are all free parameters, to be determined later.
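For concreteness, the grid G_n can be generated with exact rational arithmetic; the checks below confirm the properties used implicitly throughout (2^n uniformly spaced points, symmetric about 0, contained in (−1/2, 1/2)):

```python
from fractions import Fraction

def grid(n):
    # G_n := { j/2^n - 1/2 + 1/2^(n+1) : j = 0, ..., 2^n - 1 }
    return [Fraction(j, 2**n) - Fraction(1, 2) + Fraction(1, 2**(n + 1))
            for j in range(2**n)]

G3 = grid(3)
assert len(G3) == 8
assert all(-Fraction(1, 2) < x < Fraction(1, 2) for x in G3)  # inside (-1/2, 1/2)
assert sorted(-x for x in G3) == G3                           # symmetric about 0
assert G3[1] - G3[0] == Fraction(1, 8)                        # uniform spacing 1/2^n
print(G3)  # [-7/16, -5/16, ..., 5/16, 7/16]
```

The symmetry of G_n about the origin is what makes the central-difference constructions later in this appendix natural on this grid.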

Algorithm 1: Gradient estimation algorithm with variable fixed-point precision for each gradient component
for all but a 1/b fraction of the points x ∈ G, then the output k of Algorithm 1 satisfies the error bound of SI Eq. (H2).

Proof sketch. This lemma is an extension of Lemma 20 of Ref. 14 (modified to remove the assumption that ‖g‖_∞ ≤ 1/3), and can be proven by straightforwardly adapting the proof therein, so we highlight only the main differences. The "ideal" state after the inverse Fourier transforms in Step 4 of Algorithm 1 is, up to a global phase, a product of phase estimation output states, so the analysis of phase estimation in [37] shows that for any κ > 1, the output k satisfies SI Eq. (H2).

We can now derive the analogue of Theorem 21 in Ref. 14.

Proof sketch. We set n_i = ⌈log(12 z_i / ε)⌉ for each i, S = 4/(εr), and h(x) = f(rx + y) in Algorithm 1. Then this choice, together with the other conditions of SI Lemma 7, is satisfied, so the output k of Algorithm 1 is ε-accurate with probability at least 2/3, for each i. Therefore, repeating Algorithm 1 O(log(M/δ)) times and taking the median [38] of the outputs gives a g̃ satisfying SI Eq. (H4). The gate complexity is dominated by that of the inverse Fourier transforms.
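The standard counting behind the O(log(M/δ)) repetitions and the median step can be sketched as follows; the particular Chernoff–Hoeffding constant below is ours, not the one used in Ref. 14:

```python
import math

def repetitions_for_median(M, delta, p_fail=1/3):
    # Each run of Algorithm 1 fails on a given component with probability <= 1/3.
    # The median of R runs is wrong only if at least R/2 runs fail; by the
    # Hoeffding bound this happens with probability <= exp(-2*R*(1/2 - p_fail)**2).
    # Choosing R = O(log(M/delta)) drives this below delta/M, so a union bound
    # over all M components gives overall failure probability <= delta.
    gap = 0.5 - p_fail
    return math.ceil(math.log(M / delta) / (2 * gap * gap))

M, delta = 1000, 0.01
R = repetitions_for_median(M, delta)
per_component_fail = math.exp(-2 * R * (0.5 - 1/3) ** 2)
assert per_component_fail <= delta / M
print(R, per_component_fail)
```

Because the median can be taken classically over the measured outputs (as noted earlier in this appendix), this amplification multiplies the query count but not the qubit count.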

Condition on the higher-order derivatives
With SI Lemma 8 in hand, it remains to find a function satisfying SI Eq. (H3), where g is the gradient of the function f that we are interested in. To satisfy the analogous condition in their setting, Ref. 14 takes this function to be f_(2m), a degree-2m central difference formula. They then make the assumption that for all k ∈ ℕ, the kth derivatives of f satisfy |∂_α f(0)| ≤ c^k k^{k/2}. Of course, the |∂_α f(0)| ≤ c^k k^{k/2} assumption is not suitable for our purposes, where the gradient components are allowed to have different magnitudes. Motivated by the function f constructed for the expectation value estimation algorithm (SI Eq. (H1)), we instead make the following assumption: there exists some z ∈ ℝ^M such that for all k ∈ ℕ and α ∈ [M]^k, |∂_α f(0)| ≤ z_α := Π_{i=1}^k z_{α_i}.

Proof outline. The key step is to modify Proposition 13 of Ref. 14 to apply to the derivative condition SI Eq. (H5).
In particular, consider drawing y ∈ G uniformly at random. Then the components y_1, ..., y_M are i.i.d. symmetric random variables bounded in [−1/2, 1/2], and the tail probability of ⟨y, z⟩ can be controlled by a chain of inequalities, the second of which follows from Hoeffding's inequality, using the fact that y_i z_i ∈ [−z_i/2, z_i/2]. Hence, by Markov's inequality, we obtain SI Eq. (H6). Proceeding through the rest of the proof in Ref. 14 with the appropriate modifications leads to the stated query complexity. The gate and qubit complexities follow directly from SI Lemma 8, observing that the assumption on the derivatives implies that |∂_i f(x)| ≤ z_i for every i ∈ [M].
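The Hoeffding step above can be sanity-checked numerically. The snippet below draws y from the continuous hypercube [−1/2, 1/2]^M (a stand-in for the discrete grid G) and compares the empirical tail of ⟨y, z⟩ against the Hoeffding bound 2 exp(−2t²/Σ_i z_i²) at threshold t = ‖z‖; the parameter choices are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 50
z = rng.uniform(0.1, 2.0, size=M)    # arbitrary positive component bounds z_i
t = np.linalg.norm(z)                # threshold scale ||z||

# Each y_i is uniform on [-1/2, 1/2], so y_i * z_i lies in [-z_i/2, z_i/2].
ys = rng.uniform(-0.5, 0.5, size=(50_000, M))
dots = ys @ z

# Hoeffding: Pr[|<y, z>| >= t] <= 2*exp(-2*t**2 / sum_i z_i**2) = 2*exp(-2).
bound = 2 * np.exp(-2.0)
empirical = np.mean(np.abs(dots) >= t)
assert empirical <= bound
print(empirical, bound)
```

In practice the empirical tail is far smaller than the bound, which is why the subsequent Markov step still leaves plenty of room for the 1/4^{2k} fraction used in the proof.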
We can now prove our general theorem, SI Theorem 6, for estimating the expectation values of arbitrary observables, by showing that the particular function f in SI Eq. (H1) whose gradient we are interested in satisfies the condition of SI Theorem 10.
Proof of SI Theorem 6 (sketch). Let f be the function defined in SI Eq. (H1). As shown in the main text, the components of ∇f(0) are exactly the expectation values ⟨ψ|O_j|ψ⟩, and a probability oracle for f can be constructed using one query to each of U_ψ and U_ψ†. Note that for any k ∈ ℕ and α ∈ [M]^k, the mixed partial derivative ∂_α f(0) is given by a matrix element of a product of the O_{α_i}'s (where the order of the O_{α_i}'s in this expression depends on the values of the α_i's), so we can take z = 2(B_1, ..., B_M) in SI Theorem 10 to obtain the gate and qubit complexities, as well as the number of queries Q to a phase oracle for f. By Theorem 14 of Ref. 14, an ε′-approximate phase oracle for f can be implemented using O(log(1/ε′)) queries to a probability oracle for f, O(log(1/ε′) log log(1/ε′)) gates, and O(log log(1/ε′)) ancillas. Setting ε′ = Θ(ε/Q) gives the result.
This holds for at most a 1/4^{2k} fraction of points y ∈ G. Now, comparing SI Eq. (H6) to Equation 52 of Ref. 14, we see that the rest of the proof of Theorem 24 in Ref. 14 goes through with c√M replaced by ‖z‖/√k, leading to the claimed result. The remaining step is to choose r so that f_(2m) satisfies the conditions required in SI Lemma 8. This gives our analogue of Theorem 25 of Ref. 14 (see also Theorem 4 in the main text).

Theorem 10. Suppose that f : ℝ^M → ℝ is analytic and that there exists a z ∈ ℝ^M such that for all x ∈ ℝ^M, k ∈ ℕ, and α ∈ [M]^k, we have |∂_α f(x)| ≤ z_α := Π_{i=1}^k z_{α_i}. Then, for any 0 < ε < ‖z‖, we can compute a g̃ such that Pr[‖∇f(0) − g̃‖_∞ ≤ ε] ≥ 1 − δ using O((‖z‖/ε) log(M/δ)) queries to a phase oracle O_f for f, O(log(M/δ) Σ_{i∈[M]} log(z_i/ε) log log(z_i/ε)) gates, and O(Σ_{i∈[M]} log(z_i/ε)) qubits.

Proof sketch. This can be proven using the same arguments as in the proof of Theorem 25 in Ref. 14. The main difference is that we have a different bound on |f_(2m)(y) − y · ∇f(0)| from SI Lemma 9. Consequently, instead of setting r as in Ref. 14, we choose r so that r^{−1} = 9‖z‖√(m/2) (64 · 8aπ‖z‖√(m/2)/ε)^{1/(2m)}.
Figure 1. Schematic depiction of the quantum circuit for U_f, the probability oracle for the function f(x) defined in Eq. (13). The top registers encode the (n = 3 bit, in this case) binary representations of x_1, x_2, ..., x_M. The ancilla qubit whose amplitude encodes f(x) (cf. Eq. (11)) is indicated below the x registers. The final line represents the N-qubit system register. The gates that act on the system register with colored circles represent the doubly-controlled time evolution by the various observables. Estimating the expectation values of the M observables {O_j} requires executing this circuit and its inverse O(√M/ε) times.
to U_ψ and U_ψ†, where B := √(Σ_{j=1}^M B_j²). The algorithm also uses O(Q log(B_j/ε)) gates of the form controlled-e^{−itO_j} for each j ∈ {1, ..., M}, for various values of t with |t| ∈ O(B^{−1} log(B/ε)), as well as O(log(M/δ)) classical repetitions for the median step.

The rest of the proof of Lemma 20 in Ref. 14 follows through with simple modifications. (One minor technical point is that in Ref. 14, the bound in SI Eq. (H2) was mistakenly stated with 1/[2(κ − 1)] on the right-hand side, which does not take into account the fixed-point approximation error. For this reason, in Ref. 14 the parameters a and b are set to 42 and 1000, respectively, but with the corrected bound we find that a and b need to satisfy 1/a² + a/b ≤ 1/(4 · 24²) = 1/2304.) Finally, note from the standard analysis of phase estimation that the measurement outcome can be unambiguously interpreted to determine the (approximate) value of g_i provided that the range of S g_i/N_i is less than 1. For this, it suffices to have S|g_i|/N_i < 1/2, i.e., N_i > 2S|g_i|.