Efficient quantum computation of molecular forces and other energy gradients

While most work on the quantum simulation of chemistry has focused on computing energy surfaces, a similarly important application requiring subtly different algorithms is the computation of energy derivatives. Almost all molecular properties can be expressed an energy derivative, including molecular forces, which are essential for applications such as molecular dynamics simulations. Here, we introduce new quantum algorithms for computing molecular energy derivatives with significantly lower complexity than prior methods. Under cost models appropriate for noisy-intermediate scale quantum devices we demonstrate how low rank factorizations and other tomography schemes can be optimized for energy derivative calculations. We perform numerics revealing that our techniques reduce the number of circuit repetitions required by many orders of magnitude for even modest systems. In the context of fault-tolerant algorithms, we develop new methods of estimating energy derivatives with Heisenberg limited scaling incorporating state-of-the-art techniques for block encoding fermionic operators. Our results suggest that the calculation of forces on a single nucleus may be of similar cost to estimating energies of chemical systems, but that further developments are needed for quantum computers to meaningfully assist with molecular dynamics simulations.

While most work on the quantum simulation of chemistry has focused on computing energy surfaces, a similarly important application requiring subtly different algorithms is the computation of energy derivatives. Almost all molecular properties can be expressed an energy derivative, including molecular forces, which are essential for applications such as molecular dynamics simulations. Here, we introduce new quantum algorithms for computing molecular energy derivatives with significantly lower complexity than prior methods. Under cost models appropriate for noisy-intermediate scale quantum devices we demonstrate how low rank factorizations and other tomography schemes can be optimized for energy derivative calculations. We perform numerics revealing that our techniques reduce the number of circuit repetitions required by many orders of magnitude for even modest systems. In the context of fault-tolerant algorithms, we develop new methods of estimating energy derivatives with Heisenberg limited scaling incorporating state-of-the-art techniques for block encoding fermionic operators. Our results suggest that the calculation of forces on a single nucleus may be of similar cost to estimating energies of chemical systems, but that further developments are needed for quantum computers to meaningfully assist with molecular dynamics simulations. Quantum chemistry is widely regarded as one of the most promising areas of application for quantum computers. This is due to the relative ease of mapping the electronic structure problem onto a quantum device [1][2][3], its difficulty in simulating classically, and its high relevance to industry. Interest in such applications has been steadily increasing following initial beyond-classical quantum computing demonstrations [4,5] and experimental [6][7][8] demonstrations of hardware near the fault-tolerant threshold for quantum error-correction [9]. A significant body of work has emerged in recent years optimizing quantum algorithms for chemistry, both for fault-tolerant quantum computers [10][11][12][13][14][15][16] and current NISQ devices [17][18][19][20], including various experimental implementations [21][22][23][24][25][26][27]. Predominantly this work has focused on estimating energies of ground states of the electronic structure problem, perhaps the most natural property to extract from a quantum chemistry simulation. However, ground state energies are not a quantity typically measured in the lab, and further processing of energy data is required to obtain properties of relevance to industry. Thus, quantum algorithms to estimate properties other than ground state energies are of high interest as we progress towards larger NISQ or future fault-tolerant devices.
The calculation of forces (the derivative of energies with respect to nuclear positions) is a subroutine in most modern computational approaches to navigate molecular potential energy surfaces. Beyond identifying minima and other stationary points, reaction path following is also based on the determination of gradients [28]. The knowledge of low energy stationary points allows the generation of conformational Boltzmann ensembles, calculation of reaction rates, and prediction of tautomer equilibria. Forces are also an essential ingredient of molecular dynamics (MD) simulations, which are invaluable for studying macroscopic thermodynamic properties. This covers highly diverse applications such as the description of heterogeneous processes on surfaces including catalysis [29], observation of phase transitions, such as nucleation processes for water [30], and maybe of highest importance for pharmaceutical research, the interaction of drugs with their targets in the human body [31]. By free energy calculations based on MD simulations, these interactions can be quantified, allowing the prediction of compound affinities [32,33], which are eventually linked to therapeutic doses. Beyond that, MD simulations of drug-target systems enable the observation of conformational changes of the target. The corresponding drug-induced active and inactive states or ligand bias is a ligand-dependent selective signaling pattern [34], which is especially useful to avoid drug-induced side effects [35]. Other decisive parameters such as drug residence time can nowadays be determined through specialized MD simulations [36]. These powerful techniques render MD simulations one of the most broadly used and powerful tools in drug design, and hence make forces a clear target for quantum computing.
Though some research on quantum algorithms for force and gradient estimation has been performed previously, efforts to accurately cost algorithms in a fault-tolerant or NISQ setting have been limited. The suggestion to estimate nuclear forces on a quantum device was first suggested by [37], which studied estimation via the Hellman-Feynman theorem and via the quantum gradient estimation algorithm of [38]. This topic was then relatively untouched by the quantum community for a decade until it was revived by [39][40][41]. Ref. [39] studied force estimation in both a NISQ and FT framework and performed the first experimental force calculation, but only found loose asymptotic bounds of N 7 − N 15 to estimate a single force component. Ref. [40] put the mathematical formulation of force estimation in NISQ on a significantly stronger footing, combined this with gradient estimation for the optimization of variational quantum eigensolvers, but only considered the cost of estimating all N 4 terms in the fermionic 2-reduced density matrix to constant precision. Ref. [41] firmed the theoretical chemistry behind force estimation on a quantum device, presenting a detailed derivation in a Lagrangian formalism focusing on an ab initio exciton model, and stressed the importance of including full response. The paper presented explicit formulas and circuits based on the parameter-shift rule [42,43] but did not provide asymptotic costs for the estimation of forces on a quantum device. These works were followed by small experimental demonstrations of molecular dynamics simulations for various applications [44][45][46], and theoretical studies extending gradient calculations to the derivatives of energies beyond the ground state [47][48][49]. However, many possible optimizations remain for both NISQ and fault-tolerant algorithms to estimate forces. Furthermore, little work has been done to estimate the magnitude of force operator quantities that are relevant for quantum algorithm resource requirements (e.g. induced 1-norms).
In this work we optimize and cost methods for estimating forces and other first-order energy gradients for NISQ and fault-tolerant quantum computers. We study the required tolerance on the error in a force estimation for molecular dynamics and geometry optimization, finding a relevant figure for accurate estimation of the pair correlation function of a moderate-sized water simulation being a root mean square (RMS) error of no more than 6.4 mHa/Å per single derivative component. We optimize tomography methods for NISQ quantum devices, where the relevant cost model is the number of repeated experiments required to achieve a target 2-norm in the error vector. We find that all methods have similar or even slightly better asymptotic costs to estimate an entire force vector to a given accuracy compared to the cost of estimating energies to a similar accuracy. We study methods for block-encoding force operators for future fault-tolerant algorithms, and present the first investigation of the induced 1-norm of a force operator with the system size (a critical property for fault-tolerant quantum algorithms). We find that for state-of-the-art techniques block encodings of derivatives are at most constant or polylog factors more costly than block-encodings of the corresponding Hamiltonian, and that in practice they may be significantly cheaper. We finally detail and cost three separate Heisenberg-limited fault-tolerant algorithms for force estimation: a semi-classical higher-order finite difference algorithm drawing energy estimates at different configurations from the quantum device, an application of the overlap estimation algorithm to gradient estimation, and an extension of the new gradient-based expectation value estimation algorithm of [50]. We determine asymptotic costings for these three algorithms on hydrogen chains and water clusters, and for plane wave systems in first quantization. Surprisingly, due to difficulties to parallelize the overlap estimation algorithm and the need to perform Hamiltonian simulation as part of the reflection subroutine, we find that in some cases the finite difference method will be preferable (or at worst competitive) compared to the gradient-based expectation value estimation algorithm, which strictly asymptotically dominates the overlap estimation algorithm. Our results suggest that while force estimation in NISQ may be somewhat cheaper than energy estimation (ignoring the overhead of needing to optimize the variational preparation of a quantum state), force estimation in FT is at best asymptotically the same cost, and in some cases significantly worse. Ultimately, we do not see a useful beyond-classical molecular dynamics simulation to be tractable in a NISQ or FT quantum computing setting, as such a calculation would require many millions of force estimations to be performed [51], each of which would be at least as costly as estimating the energy of a system. However, for applications such as geometry optimization, coupling parameter estimation or spectral prediction, which do not require such a high number of repeat derivative estimations, our methods appear feasible for early fault-tolerant devices.

A. Outline
We begin this work in Sec. II with a review of ab initio electronic structure theory and how energy derivatives may be estimated as the expectation value of a derivative operator through the Hellman-Feynman theorem. Though we focus on atomic forces (i.e. derivatives with respect to nuclei positions) for the majority of this work, we briefly detail here how these methods may be immediately extended to other first-order properties of a molecular system. In Sec. II A we present a simple, calculable derivation of the force operator in second quantization for an atomiccentered basis orbital set, based on the orbital connection theory of Helgaker and Almlöf [52]. Then, in Sec. II B we derive the exact form of the force operator in a plane wave basis in first-and second-quantization, and demonstrate that this operator is diagonalized by the quantum Fourier transform with aliased frequencies and the fermionic fast Fourier transform respectively. In Sec. II C, we estimate the error tolerance on a force vector required for geometry optimization and molecular dynamics simulations. Based on radial distribution function calculations for a system of 216 water molecules, we estimate that it is relevant for molecular dynamics and geometry optimization to target a RMS of the error in one force component below 0.6 mHa/Å.
We then turn in Sec. III to the optimization and costing of state tomography for the estimation of force vectors in molecular systems. We overview a general scheme for low-cost NISQ tomography methods of arbitrary operators, and in Sec. III A we review previous work on choices of basis rotation to implement this scheme. In Sec. III B, we extend previous work on importance sampling to directly target the 2-norm error in a force vector, and demonstrate the importance of parallelization of measurements where possible. In Sec. III C, we review the fermionic shadow tomography scheme of Refs. [53,54], and calculate the relevant bound on the cost of the number of measurements to estimate a constant 2-norm error here as well. Then, in Sec. III D, we find bounds on the costs of the different methods above, both analytically and numerically for hydrogen chains, and estimate the asymptotic costing of each.
A critical piece of fault-tolerant quantum computation is block encoding, so before giving fault-tolerant algorithms for force estimation we study the cost of block-encoding a force operator. We review general block encodings in Sec. IV A. We give explicit methods to block-encode Hamiltonians and force operators in second quantization for factorized methods (Sec. IV B), and in first quantization for plane waves (Sec. IV C). In molecular systems the cost of simulating block-encoded derivative operators is found to be at most a constant factor worse (as one may differentiate the Hamiltonian in its factorized form), while in plane-wave systems the rescaling factor of the block encodings is found to be identical and the circuit cost only poly(log(N )) worse. In Sec. IV D we study the rescaling factors for block encoding in atomic orbital bases, finding that when using sparse simulation methods the cost of simulating forces is similar to the cost of simulating Hamiltonians, but for factorized methods the cost is clearly asymptotically lower.
We finish this work in Sec. V by designing three new algorithms for force estimation on fault-tolerant quantum computers, and estimating their asymptotic costs on various chemical systems using results from previous sections. In Sec. V A we use higher-order difference formulas to estimate gradients with a fault-tolerant quantum computer as a subroutine to estimate the energy at different atomic configurations. We optimize the importance sampling, choice of finite difference order and step size, and consider the efficiency of reusing the state register on the quantum device between different calls to the subroutine. In Sec. V B, we use the overlap estimation algorithm of [55] to estimate gradients via the Hellman-Feynman theorem at the Heisenberg limit. We optimize this algorithm for general block-encoded Hermitian operators by a factor 4, and optimize importance sampling over the gradient terms. We further optimize the choice of reflection operator using techniques from [56], and demonstrate the ability to perfectly recycle the state register on the quantum device between calls to the amplitude estimation subroutine. However, due to the need to reflect about the ground state (which requires Hamiltonian simulation), we find that the overlap estimation algorithm can only outperform finite difference estimation when state preparation is the dominant cost of estimation in both routines. Finally, we implement force estimation using the new gradient estimation technique of [50], and compare it to both previous methods. We find it achieves a strict asymptotic improvement over the overlap estimation algorithm, which implies in turn that it may often be better than a semi-classical finite difference method. We conclude in Sec. VI, where we summarize our results, discuss the implications for the field of molecular dynamics, and suggest paths for further improvement.

II. ENERGY DERIVATIVE CALCULATION IN ELECTRONIC STRUCTURE
The goal of ab initio electronic structure theory is to solve the time-independent Schrödinger equation, for a given molecular system. In most applications, it is sufficient to consider the non-relativistic, time-independent molecular Hamiltonian H tot , as we do here. Moreover, within the context of the Born-Oppenheimer approximation, one only needs to solve for the electronic Hamiltonian H, given by which defines the electronic Hamiltonian for a molecular system with N a atomic nuclei and η electrons. The first term in Eq. (2), h(r) = η i h i (r), is the one-electron term which is the sum of the electronic kinetic energy and the interaction energy of the electrons with the nuclei with r being a position vector in real space, while the second term is the Coulomb two-electron interaction energy. Here ∇ 2 is the Laplacian with respect to the electronic coordinates, Z A is the A th nuclear charge, and r iA = |r i − R A | and r ij = |r i − r j | are the Euclidean distances between the i th electron and the A th nucleus, and between the i th and j th electrons, respectively.
The molecular electronic Schrödinger equation is a function of the electronic coordinates r with a parametric dependence on the nuclear coordinates R, e.g., The total energy within the Born-Oppenheimer approximation for fixed nuclear positions is given as (for simplicity we omit the explicit position dependence), where V nuc is the nuclear-nuclear repulsion energy with R AB = |R A − R B | the Euclidean distance between the A th and B th nucleus.
Although the above provides a basis for molecular quantum mechanics and is sufficient for computing molecular energies, it is desirable to also be able to compute different molecular properties. Time-independent molecular properties can be expressed as gradients of the ab initio electronic energy E with respect to a suitable perturbation. For example, the first derivative of the energy with respect to an external electric or magnetic field evaluated at zero field strength yields the electric and magnetic dipole moments, respectively. Further, the first derivative of the energy with respect to the nuclear spin (internal magnetic field) yields the hyperfine coupling constants, which are important for multiple spectroscopy techniques such as nuclear magnetic resonance (NMR). Similarly, molecular forces are computed as gradients of the energy with respect to nuclear displacements. These molecular properties are summarized in Table I. Although higher-order and mixed derivatives of the energy lead to additional properties, herein we will focus our attention on first order derivatives of the total energy. Obtaining analytic formulas for these gradients is a rich research area in classical quantum chemistry and we refer the interested reader to Refs. [57][58][59] for more background.
In this work, we will analyze two separate classes of methods for computing energy gradients; computing via the Hellmann-Feynman theorem [60,61], and computing via higher-order finite difference techniques. The Hellmann-Feynman theorem relates the energy derivative to the expectation value of the derivative of the Hamiltonian with respect to that same parameter Here, Ψ is a normalized eigenstate of the Hamiltonian H, and the lower case x represents a general parameter with respect to which derivatives are taken (e.g. a single nuclear coordinate R i , an electric field E, or another quantity in Table I). In practice, the process of calculating the correct total derivative of the Hamiltonian H can be challenging.
All explicit and implicit dependencies for the derivative have to be accounted for. In the remainder of this section we detail the analytic form of these operators in second quantized atomic-centered basis sets, and in arbitrary plane wave basis sets. However, neither of these methods are necessary to implement finite difference calculations.

A. Force operators in second quantization for atomic-centered basis orbitals
To obtain the force operators the Schrödinger equation, Eq. (3), needs to be solved. From the atomic-centered basis orbitals (AO) the Hartree-Fock approximation is typically invoked to obtain first a set of molecular orbitals (MO). However, these MOs yield no analytic form and depend on the set of AOs. Therefore, it is not straight forward to calculate a total derivative operator dH dx to allow force calculations through Eq. (5). However, through the relations of the AOs and MOs to each other, the derivatives can be calculated by the orbital connection theory of Helgaker [52].
A Hamiltonian represented on a quantum computer is conceptually different from the Hamiltonian on a classical computer. The overlap integrals are all pre-computed beforehand in the given MO basis. The wavefunction on the quantum computer only gives the coefficients of all possible determinants. On a classical computer the MO basis is typically part of the wavefunction and not part of the Hamiltonian itself.
We use the following notational conventions. Lower case italics {p, q, r, s} index general (either occupied or virtual) are used for the MOs. Lower case Greek letters {µ, ν, λ, σ} index are used for the AOs. To distinguish vectors and tensors from their elements, they will be written in a bold typeface.
After a Hartree-Fock computation, the η-electron wave function is represented as a single Slater determinant, that is, an anti-symmetric product of spin orbitals {φ p (r)}. These spin orbitals are discretized over a set of basis functions {χ µ (r)}, commonly Gaussian atomic orbitals or plane waves. The spin orbitals are expanded as where C µp denotes an element of the molecular orbital (MO) coefficient matrix. Without loss of generality, we will only consider real-valued MO coefficients. The electronic Hamiltonian from Eq. (2) can be cast in matrix form in the atomic orbital basis, with one-body integrals represented as and two-body integrals In the general case, the set of AO basis functions is not orthogonal. It is therefore necessary to consider their overlap matrix, These three integrals and their total derivatives (the so-called "skeleton" or "core" derivative integrals) are the fundamental building blocks of the molecular gradients. Expressions for the total derivatives of these integrals with respect to an arbitrary parameter x have been derived elsewhere and may be easily computed with most electronic structure software packages. In the orthonormal MO basis, it is useful to introduce the second quantization formalism, which is developed in terms of fermionic creation (annihilation) operators a † p (a q ) that satisfy the anti-commutation relations, {a † p , a † q } = {a p , a q } = 0, and {a † p , a q } = δ pq . In second quantization the electronic structure Hamiltonian Eq. (2) in the MO basis is then given by, with one-and two-body terms in the MO basis At times it is useful to consider the overlap matrix also in the MO basis, which is given by where the use of lower case italic and Greek subscripts distinguishes between the MO and AO representations, respectively. We note that the typical overlap matrix relation S pq = δ pq only holds at the reference configuration. For the molecular electronic Hamiltonian, the energy is given by where γ pq and Γ pqrs are the matrix elements of the one-and two-body reduced density matrices (RDMs) respectively. Energies from ab initio calculations depend on several parameters: the one-and two-body AO integrals h and g, the molecular orbital rotation matrix Θ, the set of determinant amplitudes c, and any other parameters, which we denote as Ω. Given this, and using the chain rule, a general first derivative of the energy E with respect to from any ab initio calculation can be written as The remaining challenge is to fill in explicit expressions for the above elements. As the exact energy is independent of the orbital rotational parameters Θ and CI coefficients c, the corresponding partial derivatives are identically zero, Several of the partial derivatives of the exact energy are trivially evaluated Because there is no dependence on other parameters, e.g. Ω, the only remaining partial derivative of the energy is the one with respect to the overlap of the AO basis functions, where the only terms that depend on S are the MO coefficients C.
With the density matrices γ pq and Γ pqrs being given, the first two terms of Eq. (17) are easy to evaluate: they require the evaluation of atomic orbital core derivatives. The last term of Eq. (17) is a little more involved as we need to find an expression for ∂E/∂S in terms of the one-and two-body reduced density matrices. The core derivative overlap integrals dS/dx can be computed by most electronic structure packages. We obtain for the last term The quantities in the above expression depend on the MO coefficients C. Because the MO coefficients depend on S we need to derive the explicit expressions for ∂C/∂S. The step-by-step derivations are presented in Appendix A, and we find With these, after a derivation presented in Appendix A, we find for derivatives of the one-and two-body terms with respect to the overlap matrix S The final expression for the energy derivative, after reindexing, is given by To calculate the force on the A th nucleus, F A = − dE dR A , we use the above expression to find the energy gradient at nuclear position R A . Following the Hellmann-Feynman theorem Eq. (5), we then convert the problem of calculating the energy gradient to the problem of calculating the expectation value of a derivative operator. We find that in second quantization the derivative operator is given by In this equation, the second terms in each bracket which include the derivative of the overlap matrix S correspond to the Pulay force [52]. For later reference, we write coefficients of this operator in the same form as the Hamiltonian

B. Force operators in plane wave bases
Plane waves are one of the most common basis sets used to model condensed matter systems. They are a natural basis for periodic systems and are independent of the atomic positions. However, their drawback is that many plane waves are typically needed to describe the wavefunctions accurately. When defined on a cubic reciprocal lattice, the plane wave basis functions take the form, where Ω is the computational cell volume and the reciprocal lattice vector in three dimensions is defined as with N being the number of plane waves. The molecular integrals can be evaluated analytically for the case of plane wave basis functions, leading to the following representation of the second-quantized electronic structure Hamiltonian in Eq. (10), Here and in what follows, we have omitted the electron spin for simplicity. An equivalent expression in first quantization can be written down [62], where |p q| j is a shorthand for I 1 ⊗ · · · ⊗ |p q| j ⊗ · · · ⊗ I η and G 0 is G from Eq. (26) excluding the zero mode. While a number of papers have analyzed the viability of quantum algorithms for simulating chemistry in second quantization with plane waves [63][64][65][66][67], that approach faces some significant challenges. In particular, in second quantization the number of qubits required scales as the number of plane waves. This is a problem because often hundreds of thousands of plane waves might be required to obtain a suitable wavefunction accuracy. However, there have been proposals for fault-tolerant algorithms using plane waves in first quantization [62,68]. In first quantization the number of qubits required scales only as the logarithm of the number of plane waves N and linearly in the number of electrons, η. Algorithms have been demonstrated [62] for time-evolution or state preparation of molecular systems that scale only as Due to the sublinear dependence on N , with these approaches one can conceivably perform simulations with millions of plane waves. An additional advantage of the plane wave basis is that the overlap matrix elements in the plane wave representation are reduced to and thus the overlap matrix contributions to the derivative operator from Eq. (22) are identically zero. This suggests that representing the electronic structure Hamiltonian in first-quantized plane waves basis is a promising avenue for calculating energy derivatives of chemical systems. In a plane wave basis, the only dependence of H on the nuclear positions R A is in the one-body term, which implies that (as expected for a non-atomic centered basis set) the force operator in plane waves is a strictly one-body operator. (The same is true for other first-order derivatives that do not affect the electron-electron Coulomb, such as an applied electric or magnetic field.) This operator may be further simply diagonalized by the fermionic fast Fourier transform (FFFT) [63,69,70], in a similar manner to the potential term of the original Hamiltonian. This is simplest to demonstrate in second quantization, so we will perform the calculation there first and then transform to our target first-quantized form. Differentiating Eq. (27) with respect to the A th nuclear co-ordinate R A gives us Note that this is a vector-valued derivative, here R A is the 3-dimensional nuclear position vector (individual components of this vector may be obtained by taking individual components of the wavevector k s in f (s, A)). In second quantization, the FFFT performs the following single-particle rotation, where r p = p(Ω/N ) 1/3 . Under this transformation, the gradient of the electronic structure Hamiltonian becomes Recognizing that (q − p) spans the full set of momentum vectors in our system due to aliasing, we can replace the sum over q = p and the indices q − p and q with a sum over s = 0 and p. Following this reindexing, our gradient operator diagonalizes immediately, where we have used the fact that the summation grouped on the right side of the first equation is equal to zero unless p = q . This is because the negative modes of k q will have exactly the opposite phase as the positive modes of k q . We now transform the derivative operator into a first-quantized representation. In first quantization, we store our wavefunction by having a computational basis that encodes configurations of the electrons in N basis functions such that a configuration is specified as |φ 1 φ 2 · · · φ η where each φ j encodes the index of an occupied basis function. Each φ j may be specified in binary, making the space complexity only O(η log N ). We can translate Eq. (35) into first quantization to give where the QFT is the quantum Fourier transform with aliased frequencies (the first quantized version of the FFFT from Eq. (35)). The QFT can be implemented with Toffoli gate complexity O(η) [63]. Note that this guarantees that all force operators are mutually diagonal under the FFFT/QFT, which in turn implies that all force operators commute.

C. Error tolerance for applications
The most widely-used energy derivatives of a molecular system are nuclear forces. The first application we consider is geometry optimization, where nuclear derivatives are used to find the geometry of the molecule with the lowest energy on the potential energy surface. The second application is molecular dynamics, where the nuclear positions are propagated through time by a classical differential equation within the Born-Oppenheimer approximation. At each time, the forces on the nuclei determine their next position. Both of these applications rely on the nuclear derivatives of the energy to repeatedly update the positions of the nuclei. This is a process where a small error in each step can quickly accumulate. The tolerable error on the forces is an important parameter in the scaling of the quantum algorithms to calculate them. In the next two subsections we investigate the error level that is acceptable.

Geometry optimization
The error tolerance of the forces for the geometric relaxation of a structure depends strongly on the system. The geometries of systems with a rather steep potential energy surface can be determined with relative low accuracy of the forces. For example, Gaussian sets the default thresholds for convergence of the maximum force to 0.9 mHa/Å and the RMS of the error of single force component to 0.6 mHa/Å [71]. However, for systems where forces are smaller because the potential energy surface is shallow, typically the geometries need to be determined by relaxing the atomic positions until the forces are one magnitude smaller [72].

Molecular dynamics
Error bounds on forces required for MD simulations will again depend strongly on the system studied. To find a simple baseline for a target accuracy for force components in this work, we focus on the required error tolerance for performing semi-classical MD simulations of a water system. A quantum device would be used in this situation as a   subroutine to provide accurate estimates of the classical potential, employing the TIP3P water model [73]. Here, the MD simulations were performed with Atomic Simulation Environment software package [74].
As a proxy for simulation convergence, we study the 2-particle radial distribution function of a N a = 3 × 216-atom system in a periodic box of volume V . Here, Z Na is the partition function of the N a -particles, and β = 1/299 K is the inverse temperature of the system. As the radial distribution function g (2) is independent of translations and rotations of R 1 and R 2 about the origin, the data it contains can be found solely in the radial term: To obtain a physically-relevant quantity, we isolate the pair distribution of the N a /3 = 216 oxygen atoms and ignore the hydrogen atoms. The 2-particle radial distribution function is a macroscopic quantity and is often used to benchmark different water models. The deviation from the ideal radial distribution function is a measure of the quality of the run. We test for which size of errors in the force we can still reproduce the error free radial distribution function. To achieve this we performed micro-canonical MD simulations with 36000 time steps, and at each 1 fs time step we added a random error term to the forces of water molecules. This error was sampled from a Gaussian distribution with a given RMS of the force error, with separate error terms drawn independently. The radial distribution function of Eq. (37) is averaged over all time steps of the simulation. From these MD runs we can determine the size of errors for which we can still reproduce the error-free radial distribution function which we take as the ground truth. Fig. 1 shows that the radial distribution function rapidly converges to the error-free radial distribution function. As a metric for convergence, we focus on the largest feature in the system (the peak around 2.8Å, and plot the error in the peak position ( Fig. 1 inset). From this we conclude that for water the RMS error for molecular dynamics needs to be smaller than 6.4 mHa/Å to reproduce macroscopic properties as the radial distribution function. Our findings are in agreement with studies where ab initio MD simulations based on quantum Monte Carlo calculations were employed [75]. Beyond radial distribution functions, quantities such as vibrational density of states are expected to require higher precision of the force evaluation. [76] The geometry optimization seems to be more stringent by a order of magnitude on required accuracy of the forces than MD simulations. In MD simulations errors can average out. Later in this paper we consider the 2-norm as the parameter for accuracy of the forces. For the 2-norm the sum is taken over all forces and its components. Therefore, the 2-norm is a extensive quantity and depends on the number of atoms. We can convert the RMS error to the 2-norm by multiplying RMS error with √ 3N a . For example, for the 216 water molecules we obtain a 2-norm of the error of 282.2 mHa/Å.  [51,77]. (b) Example of a single update of the nuclei coordinates R of two water molecules with red balls and grey balls denoting oxygen and hydrogen atoms respectively. Fi denotes the three-dimensional force vector on the i-th atom.

III. COMPUTATION OF FORCE VECTORS IN NISQ
To optimize quantum algorithms for NISQ quantum computers, we must reduce quantum circuit depth wherever practical. Near-term proposals for quantum chemistry achieve this by preparing approximate ground states [17,78,79], from which energies may be extracted by state tomography. As long as the approximate state is variationally optimized within the active space considered on the device, the energy derivatives yielded by the Hellman-Feynman theorem are accurate for the variational energy; no further corrections need to be made to the methods outlined in Sec. II. In this work, we assume access to the preparation of an initial state |ψ that satisfies the Hellman-Feynman theorem Eq. (5) ( dE dRi = ψ| dH dRi |ψ ), and focus on the optimization of the measurement of this state to extract an estimate of ψ| dH dRi |ψ . While many approaches exist for reconstructing, the expectation, tomographic approaches have a major role in recent quantum computing works [19,20,53,54,80]. However, to the best of our knowledge no-one has optimized estimation methods for the measurement of vectors of operators prior to now.
State tomography in NISQ is complicated by the fact that simultaneous direct measurement of multiple operators is only possible in quantum mechanics when all operators mutually commute. More broadly, the parameter estimation or partial tomographic protocols used to estimate a gradient consist of three key steps: 1. Define a set of basis rotations {Q j } for which low-depth quantum circuits are known.
2. For each basis rotation Q j , prepare the state |ψ M j times, apply the quantum circuit, and then destructively measures the system in the computational basis.
3. Estimate the set { ψ| dH dRi |ψ } from the observed measurement data. In a NISQ cost model, the target is to reduce the total number M = j M j of preparations of |ψ , or experiment 'shots', while targeting some error bound on the set of energy gradient estimates.
If the expectation value estimation (step 3 above) is linear and unbiased, the error on individual estimates ψ| dH dRi |ψ may be calculated by variance propagation. Such an estimation corresponds to the decomposition of the gradient operator as a linear combination where the D i,j are operators that are diagonalized by the jth basis rotation Q j . If Q j diagonalizes D i,j , ψ|D i,j |ψ may be estimated by averaging over the M j destructive measurements taken in the Q j basis. The variance in this estimation is given by As expectation values are linear, we have Then, as each D i,j is measured independently, the variance of the estimation propagates in the usual way to the variance in an estimation of dE These errors may be captured within a 3N a -dimensional error vector . In practice estimates of σ 2 i,j are not known in advance, making exact estimation of i and subsequent parameter optimization difficult. Instead, bounds on σ i,j are often substituted; we will introduce various such methods throughout this section. Some of these bounds are in practice quite weak, which implies that fair comparison of the results described in this section may not be possible.
The general state tomography method described above leaves open a large number of parameters for optimization: the rotations Q j , the shot allocation M j , and the choice of operators D i,j in the decomposition of dH dRi . In the following sections, we will describe and compare various methods that attempt to optimize these choices. Complete optimization of each of these choices is not practical due to the sheer number of parameters and the (classical) cost of evaluating cost functions. Optimizing basis rotations Q j to diagonalize multiple operators is in general an NP-hard problem [81]. Moreover, the lack of precise knowledge of σ i,j implies that the cost function may be difficult to estimate for the purposes of optimization. However, various heuristic techniques are widely known, and many of these can be shown to achieve asymptotically optimal results.

A. Basis rotation choices
When choosing the set of basis rotations Q j for a state tomography protocol, one must try find operators D i,j that are diagonal in the Q j basis. Calculating such operators is typically as difficult as simulating the circuit, so rotations Q j are typically chosen to be classically easy to simulate. One must take further care that the D i,j are not exponentially difficult to express in order for step 3 in the general method above to be computationally feasible. Drawing Q j from one of a few well-known sets of quantum circuits defined below typically solves this problem.
The commonly used Clifford circuits form the first example. These circuits preserve the Pauli group P N = {I, X, Y, Z} N modulo complex phases. If Q j is a Clifford circuit, so is Q † j and the algebra formed by the set Q † j Z n Q j yields all possible operators D i,j that are diagonal in this basis. Moreover, Clifford circuits can in principle be constructed to simultaneously diagonalize any set of mutually commuting elements of P N . It is relatively easy to design a set of Clifford basis rotations Q j in this manner: 1. Decompose all operators dH dRi into a linear combination of Pauli operators following the Jordan-Wigner, Bravyi-Kitaev, or alternative fermion-to-qubit transformation.
2. Subdivide the set S of Pauli operators that appear in at least one linear combination into commuting subsets S j (i.e. so that all Pauli operators within each S j commute).
3. For each subset, find an appropriate basis rotation Q j .
A disadvantage to the above is that Clifford circuits to diagonalize mutually-commuting operators can be relatively deep [82,83]. This can be simplified by adding the requirement that the subsets S j contain not commuting Pauli operators, but amenable Pauli operators. Two Pauli operators are amenable if, on each qubit the tensor factor I, X, Y, Z of both operators is the same or the tensor factor of at least one operator is the identity. (For example, Y Y and Y I are amenable, but Y Y and XX are not.) Amenable Pauli operators can be mutually diagonalized by single-qubit basis rotations Q j , making this a practical subdivision S j . In general subdividing S into the minimum number of S j is a known NP-hard problem [81], though relative success has been found in heuristics [20] or brute-force optimization methods [83]. To date these methods have focused on minimizing the number of subsets that contain all Pauli elements that make up the fermionic 1-and 2-RDM, for which an O(N 2 ) bound is known and has been achieved [20]. However, most of these methods have not considered the subsequent allocation of measurements M j and the subsequent cost in wall-clock time to estimate one or more expectation values to a given accuracy. A second set of circuits relevant for diagonalizing Hamiltonians and force operators in chemistry are Givens rotation circuits. These correspond to evolution by a one-body fermionic operator and are classically tractable to calculate as they map single creation and annihilation operators to each other.
where q j is the N × N Hermitian matrix with elements taken from Eq. (43). Givens rotation circuits are relatively low depth; an arbitrary Givens rotation may be implemented in depth 2N on a linear array using precisely N 2 two qubit gates [26,84]. The above may be slightly generalized to the set of fermionic Gaussian unitaries [54] Q j = e p,q g p,q j γpγq , where γ m and γ n are anti-commuting Majorana operators This strictly contains the set of Givens rotations, and also allows for Bogoliubov-style rotations between a n and a † n . A low-cost method for constructing Givens rotation circuits Q j to target a two-body fermionic operator is to factorize the operator [11,19]. Starting from the operator in its chemist formulation we reshape the 4-rank tensor V pqrs into a 2-rank tensor M (pq),(rs) . A direct diagonalization (or a Cholseky decomposition) of the flattened version of V pqrs yields with g ( ) pq and w representing the eigenvectors and eigenvalues respectively. Further diagonalization of the squared single-body operators yields [15] with f ( ) p the eigenvalues of W ( ) pq and U the unitaries performing the diagonalization, which can be expressed as a single-particle change of basis unitary and the κ p,q are obtained from the Givens rotation procedure in [84]. Similary, the one-body fermionic operator may be diagonalized by a single Givens rotation Q j , as one simply takes the rotation q j that diagonalizes the corresponding N × N one-body matrix (following Eq. (44)).
If the operator A given in Eq. (47) is the electronic structure Hamiltonian in a generic second-quantized basis, we have that L = O(N ) and M < N and in some special cases M l = O(log N ) [11]. This implies that one may estimate the expectation value of a Hamiltonian with O(N ) basis rotations Q j . As we show in Sec. IV B, this extends to a bound on the number of Givens rotations required to factorize a single derivative operator However, factorizations do not typically parallelize; the set of Givens rotation circuits that measure dH dRi will not typically allow estimation of dH dR i . Moreover, it was found recently that the L ∼ O(N ) scaling is relatively delicate; subtracting operators from one derivative will tend to yield an operator that is no longer low-rank [85]. This implies that it is likely not possible to significantly parallelize factorized methods, and the number of Givens rotations required to measure 3N a derivative operators likely scales as O(N a N ).
An obvious question to ask is whether the set of fermionic Gaussian unitaries and Clifford circuits intersect. The answer to this question is yes: the intersection of these operators are the fermionic Gaussian Clifford unitaries, which are generated by the set of Majorana swap operators and correspond to the symmetric permutation group Sym(2N ) (being permutations of indices of Majoranas). For the sake of measurement, the effect of a Majorana permutation Q j is to pair the set of 2N Majorana operators: to choose a set of disjoint pairs {(γ p , γ q )} containing all operators, and permute p → 2n, q → 2n + 1 for some n. This maps the Hermitian operator iγ p γ q → Z n , which implies that any linear combination of products of the pairs are diagonalized by Q j . This allows simultaneous measurement of 2N linearly-independent 2-body fermionic terms, which is optimal, and the basis for the best-known measurement schemes for the estimation of arbitrary 2-body fermionic operators [20,54].

B. Parallelized importance sampling
Once an optimal set of basis rotations Q j and operators D i,j have been chosen, it remains to allocate the number of shots M j to each Q j . Here, we target minimizing a given cost function f ({M j }, {σ i,j }) while keeping the total number of measurements M = j M j constant (or vice-versa). This may be achieved by Lagrangian methods [86,87], this methodology being a form of importance sampling over the expectation values D i,j . Such methods entail adding the total number of measurements as a constraint to the cost function with a Lagrangian multiplier λ, giving a Lagrangian The solution to the problem is then achieved by minimizing L with respect to all free parameters: M j and λ. (See Appendix B 1 for more details and explicit calculations of the optimizations used in the text.) Crucially, σ i,j is typically not known a priori. In principle σ i,j can be estimated during the expectation value estimation procedure, which could be used to adaptively optimize the distribution of the M j . However, typically in the literature a range of boundsσ i,j > σ i,j are used instead. We will discuss the known bounds in detail in the next section. In order to perform the above minimization procedure, we must define the cost function f ({M j }, {σ i,j }). This is complicated by the fact that we estimate 3N a force components, and must combine the error on each into a single cost function. This can be achieved by defining a norm on the error vector = E( dH dR − dH dR ). In Sec. II C, we saw that the 2−norm is a reasonable proxy to bound the error in molecular dynamics simulations. An additional issue presents itself as we should take into account the covariance between different force components (assuming that we do not measure these independently). However, this may be circumvented if we take the 2-norm squared as our cost function: We finally write f ({M j }, {σ i,j }) = 2 , where is the RMS error in our final force vector. The advantage of targeting the norm of the error vector for importance sampling is not just that we can allocate different numbers of shots to different gradient components depending on their relative need, but that we can account for basis rotations Q j that allow for multiple measurements. Substituting Eq. (54) into Eq. (53) and replacing the true deviation σ i,j with our estimateσ i,j yields the Lagrangian Minimizing with respect to M j and solving for M j yields a shot allocation with respect to λ, which may be simplified by enforcing our Re-substituting this into our definition of 2 and solving for M then achieves a relatively compact result, In Appendix C 4 we repeat this calculation to find a bound on the measurement count required to estimate the error vector to constant 1-norm instead of 2-norm. It is instructive here to consider the effect of parallelization; what do we gain from the ability to use one basis rotation Q j to measure components D i,j of multiple force operators? This is important as this ability is lost in schemes such as low-rank factorization, where basis rotations to diagonalize factors from dH dRi and dH dR i cannot be made to easily overlap while keeping all operators low-rank [85]. This can be studied by replacing M j → M i,j , and performing the same Lagrangian minimization as before. The effect of this minimization can be immediately written down, as we are effectively losing the i index from the second sum in Eq. (57) and replacing the j index by a pair (i, j). Thus, we can write As j x j ≥ j x 2 j , parallelization is clearly always favorable when possible (which we expect). The gain in efficiency going from Eq. (58) to Eq. (57) depends on how well parallel measurements can be grouped. The case with the largest difference in efficiency is when a set of basis rotations can be chosen for all 3N a force operators such that the magnitude of all errors in each component are roughly equal for each rotation, i.e.σ i,j ∼σ j . In this case, we have However, in a real setting the asymptotic gain may be significantly smaller. We can also consider the gain obtained from importance sampling in the parallel estimation case. This will be useful to predict the improvement that might be gained from importance sampling in methods where this is not natively performed. In the absence of importance sampling, we replace M j → M Nr , where N r is the total number of basis rotations (N r = j 1). The 2-norm of the error in Eq. (54) then becomes and rearranging yields As one would expect, in the limit thatσ i,j =σ Eq. (61) and Eq. (57) are identical. However, when iσ 2 i,j varies significantly as a function of j, the gain can be up to a factor of N r ; the number of basis rotations used. As full tomography of the fermionic 2-RDM requires N r ∼ N 2 a (and naive tomography N r ∼ N 4 a ), this can be a significant gain.

C. Fermionic shadow tomography
An alternative method for choosing basis rotations Q j and allocating shots M j is to choose them at random. This idea has been recently formalized by the notion of classical shadows [53]. Here, one considers the action of randomly drawing a basis rotation Q j from an ensemble Q, measuring in the computational basis, and observing basis state |b j . One could in principle now invert Q j on the measured state |b j to give a new state Q † j |b j b j |Q j . Assuming that the ensemble Q is informationally / tomographically complete (i.e. that every marginal of ρ is measured by at least one element of Q), the map is invertible. Moreover, the expectation value of the inverse map M −1 across sampled rotations Q j and consequently measured states |b j must be the initial state Given a finite set of basis rotations Q j and subsequent measurements |b j , this gives an estimator for dE dRi = Trace |ψ ψ| dH dRi [53,54] The key advantages to this method are that the ensemble Q may be easier to design than a specific set of rotations R j , and by averaging over the entire ensemble we may reduce the covariance between different terms. To suppress the tails on the distribution of the estimator dE dRi and achieve optimal scaling a median-of-means technique was originally used in [53], but it was shown in [54] that this is unnecessary for fermionic systems. A provably optimal choice of Q to estimate arbitrary 2-RDM elements is the ensemble Q FGU of fermionic Clifford Gaussian unitaries described in Sec. III A. We label the corresponding channel M FGU .
The variance of the above estimator may be calculated by representing the force operator in the algebra generated by the Majorana operators γ p (Eq. (46)) where C(2N, 2k) is the set of all possible combinations of 2k elements drawn from {1, . . . , 2N }. With this defined, the variance on the estimator constructed from a single choice of basis rotation Q j and observation of |b j is calculated in [54] to be where here · FGU is the shadow norm [53] under the fermionic Gaussian Clifford ensemble The shadow norm of a product of 2k Majorana operators in this ensemble was calculated in [54] to be 2n 2k / n k . Thus, by the central limit theorem, the variance of the estimator dE dRi after M different rotations Q j and measurements is Note that unlike other methods where one must take a bound on the variance of individual estimators, Eq. (68) is exact. It is also significantly easier to calculate than the variance on other estimation methods, as it does not require access to higher-order correlators that come with expectation values of H 2 .
We now extend the above analysis to estimate the error in the 2-norm 2 (see Eq. (54)) of the force vector dE dR . As we are drawing M basis rotations from our distribution at random, this time we do not need to optimize the distribution of our measurements via importance sampling. (Importance sampling over shadow tomography may be introduced by locally biasing the classical shadows [88], which has been seen to yield significant improvements for Hamiltonian tomography [54].) As we do not encounter covariances between terms when calculating the 2-norm (see Sec. III B), we have immediately that and so to bound 2 2 < 2 requires that we set

D. Numerical results
We now attempt to summarize and estimate the cost of the different optimizations made in this section for NISQ tomography of force vectors. The cost of estimating force vectors depends critically on the studied system. In this section we consider two different scenarios; one where we make the assumption that our force operators are relatively uniformly distributed across our system (allowing us to make general asymptotic estimates), and a set of numerical cost estimates for hydrogen chains of varying length, calculated in STO-6G using localized orbitals, see App B 2. Although near optimal methods of grouping fermionic operators are known [20], to simplify the results in this work we choose a naive grouping to compare the parallelized vs serial importance sampling discussed in Sec. III B. That is, we consider a set of rotations Q j that measure independent Pauli terms D i,j = h i,j P j , where dH dRi = j h i,j P j following a Jordan-Wigner transformation. We compare this in turn to results obtained for fermionic shadow tomography and to results for a low-rank factorization dH dRi = l W (i, )2 (Eq. (51)). To avoid the need to calculate expectation values of four-body terms (which are required for exact variance estimates using the methods of Sec. III B), we use standard methods for upper bounding the variance contributions σ 2 i,j ≤σ 2 i,j instead. Our variance bound for naive measurements is very simple; σ i,j ≤ h i,j . Note that this implies Γ (sep) 2 is just the square of the induced 1-norm of the force operator (in a qubit representation). By comparison, for the basis rotation grouping method, we take the worst-case bound on the variance for each given operator where | is the range of the spectrum of the corresponding factor σ,σ pq f q n p,σ n q,σ constrained to the appropriate particle number sector. In Fig. 3, we plot the resulting scaling coefficients Γ for these techniques, alongside the classical fermionic shadow scaling coefficient from Sec. III C. As the bounds used for different methods differ in their tightness, it is not possible to make a direct comparison between the lines in different plots. Instead, for comparison we plot the equivalent scaling factors for the Hamiltonians of the same system, allowing us to compare the cost of energy and derivative estimation. As each line is approximately straight on a log-log plot, we may extract scaling coefficients by a linear fit, which we report in Tab. II. We contrast this in the same table with an asymptotic analysis under the assumption that all forces are the same magnitude. We see that, the shift from separate to parallel force measurement in the naive case (left plot) decreases the cost of estimation by a factor N 0.2 a << N a . When parallelized, we further observe that the cost to estimate the force operator to a target precision (in Ha/Å) is roughly the same the cost to estimate energies to a target precision (in Ha). Similar results are found for the shadow tomography case (b). As the error required on forces in molecular dynamics (in the given units) is roughly 20 times larger than chemical accuracy (Sec. II C), we suggest that a single-shot force estimation using shadow tomography may be already a factor 100 cheaper than the corresponding energy estimation on a reasonablysized system. However, semi-classical molecular dynamics simulations typically requires millions of such estimations, which presents a significant additional multiplicative cost. Similarly, we see that the scaling of the cost of estimating forces via basis rotation grouping (c) is smaller for force operators as compared to the Hamiltonian. We note however that the here presented bounds are not tight (see [19] for a comparison) and that it is an open question how these numbers change when using the explicit variances σ i,j of the ground state.
Previous results [39,40,46] have focused their analysis on the scalings of the quantum circuits required to compute gradients without considering the contribution coming from the properties of the gradient operator and without Empirical scaling (H chain) performing any optimization of the measurements. In contrast, here we have computed bounds on the number of measurements required while also optimizing the measurement number with three different methods. For this reason, performing a direct quantitative comparison between this work and previous results is not possible.

IV. BLOCK ENCODING HAMILTONIANS AND FORCE OPERATORS
A. General block encoding of the Hamiltonian and force operator In the next section, we will present several methods to calculate energy gradients in a fault-tolerant quantum computing section. These methods all require access to block encodings of the Hamiltonian and derivative operators. Block encoding is a method of encoding a non-unitary operator on a quantum device; one block-encodes an operator O by adding additional qubits to a system and finding a unitary U that contains O in the block corresponding to the |0 state of the additional qubits. That is, if where the subscript on u i denotes the computational basis state of the additional qubits, we require This encoding typically requires the rescaling factor λ O 1, as sub-blocks of a unitary matrix must have eigenvalues with norm ≤ 1. These rescaling factors typically then appear as multiplicative costs for implementing a given algorithm. For example, one can calculate immediately given Eq. (73) that In this work, we use λ H and λ Fi respectively to refer to the rescaling factors for a Hamiltonian and force operator (or λ F to refer to the rescaling factor averaged across all forces). The cost of implementing these block encodings as quantum circuits is stated as T H and T F,i respectively (with T F the cost averaged over all force components again).
In this section, we give circuit implementations of these block encodings, and the corresponding T H , T F,i , λ H and λ F,i costs. These differ depending on the type of basis set used to solve the electronic structure problem, and whether the problem is solved in first or second quantization; we will give results for multiple commonly considered systems. The most commonly used methods to block-encode an arbitrary operator O are linear combination of unitaries (LCU) methods [89]. We give a brief overview of this class of techniques here. LCU methods involve writing O as a linear combination of unitary operators The condition that h i be real and positive can be accounted for by placing any complex phase on the operator P i (as P i e iφ is unitary whenever P i is unitary). The LCU block encoding then requires two oracles; a PREPARE oracle that constructs the state on an additional LCU register, and the SELECT oracle defined by One can then confirm that given states |ψ 1 , |ψ 2 on the system register, as required for this to be a block encoding, with a rescaling factor One has a great degree of freedom in choosing the decomposition (Eq. (75)) and constructing the PREPARE and SELECT oracles. Optimizing these is a critical requirement to lower fault-tolerant quantum costs.

B. Block encoding in second quantization with local basis sets
The critical factor in block-encoding Hamiltonians or derivative operators with a local basis is the requirement to transfer all of the data about the operator onto the device. A generic two-body fermionic operator requires O(N 4 ) pieces of information to describe, and in the absence of additional structure this bounds the cost of block-encoding below. Luckily, the electronic structure problem has additional structure that can be used to reduce this cost. As discussed briefly in Sec. III A, the two-body electron tensor may be factorized as the product of lower-order tensors that may be expressed with fewer indices. Two of the most well-known factorization methods are low-rank factorization [12] and tensor hypercontraction [13]. Both of these methods incur additional truncation errors, but numerically these have been shown to converge at a much lower cost than the O(N 4 ) bound above.
Tensor hypercontraction holds the current record as the best scaling method for block encoding Hamiltonians. We can write the THC form of the Coulomb operator as where χ is an M × N dimension matrix and ζ is an M × M dimension matrix. Crucially, for the electronic structure problem it is possible to truncate M = O(N polylog(1/ THC )) while achieving a truncation error THC . Ref. [13] demonstrated a method for block encoding H/λ H using a quantum walk with gate complexity We now argue that the same method can be used to block encode the force operator, with the same scaling. One can represent the derivative of the THC Hamiltonian with respect to nuclear coordinates as s , We see that the dimension of these operators is not increased however there are now four types of tensors in the expression (χ, ζ, dχ dRi and dζ dRi ) rather than just two (χ, ζ). However, in order to pursue the same strategy for realizing the block encoding that is employed by [13], it is necessary to yet define two more types of tensor χ (+) = dχ dR + χ and With these definitions we can rewrite the expression as We see that this expression also involves five tensors, (χ, ζ, There are now five terms instead of one but we see that within each of these terms, the two tensors with the µ index are always the same and the two tensors with the ν index are always the same. This is a critical requirement for the following step so that creation and annihilation operators transform in the same way. Associated with each of the three tensors related to χ, is the following projection of the fermionic ladder operators into three different larger auxiliary bases: where these creation and annihilation operators act on a larger space of 2M spin-orbitals rather than N spin-orbitals. Without loss of generality, we are taking the χ (µ) , (χ (µ) ) (+) and (χ (µ) ) (−) to be normalized vectors for each µ (because constant factors can be absorbed into ζ). Using this, we rewrite the full force operator as where are the number operators in the auxillary bases defined earlier. Having reduced the Hamiltonian to this diagonal form, it is now clear that one can use the same block encoding strategy as that defined in [13] with minimal additional modification. In particular, because there are now five different terms, we will need several additional ancilla bits that flag which term we mean to block encode. We will need a slightly larger QROM because we now need to access five tensors rather than just four. However, the ζ tensors which are dimension M × M tend to dominate the cost since the χ tensors only have dimension M × N . Furthermore, the Toffoli cost of the QROM increases only as the square root of the database size. Thus, the quantum walk will have a Toffoli cost that is increased by roughly a factor of T F ∼ √ 2T H . The rescaling factor should now be defined as Using this method, λ Fi will be larger than the original λ H by at least a factor of two. We do not believe λ Fi will be asymptotically larger than λ H because we would expect the d dRi ζ values to generally be smaller than the ζ values. Thus, we have shown that the block encodings of Ref. [13] also work for the force operator, with the same asymptotic complexity.
Our expectation is that this procedure would be suboptimal and that a better approach would be to use exactly the same procedure as [13] in order to block encode the force operator. By this, we mean that one could apply THC directly to the coefficients G pqrs , rather than take the derivatives of the THC tensors. Then, one ends up with a single χ and a single ζ which compress the force operator, rather than a mix of the χ and ζ compressing the original Hamiltonian and their derivatives. Doing this will likely lead to even smaller values of λ Fi than for the Hamiltonian simulation (perhaps even asymptotically so). However, it is more difficult to show analytically that M would be as small or smaller than the original M . We expect this to be the case due to the sparseness of the derivative operator; studying this numerically (or analytically) would be a clear target for future work. One additional complication that presents itself here is that the truncation error from factorizing the derivative operator is different to the truncation error in the Hamiltonian, which implies that the forces estimated by this approach do not quite match the energy manifold of the THC-factorized Hamiltonian. (This error is avoided when differentiating the THC-factorized Hamiltonian, as the truncation here is identical.) This implies that a lower tolerance for the truncation error on both the Hamiltonian and force operator may be necessary to make the systematic error here negligible; this would need to be properly bounded in any future work.
Although the THC algorithm is expected to be more efficient, a very similar argument can be made for the low rank block encodings of Ref. [12]. The form of the low-rank Hamiltonian is given in Sec. III A, in Eq. (49). Ref. [12] shows that one can use this form of the Hamiltonian in conjunction with qubitization [90] to block encode H/λ H with complexity scaling as T H ∼ O( √ N L) and in this case Let us now extend this factorization to the derivative. We define V as in Eq. (49); V = L l=1 W (l) W (l) † . Then, we can write the derivative as a difference of squares: The W (l) (Eq. 51) and their derivatives are one-body operators, so their sum is a one-body operator and can be block diagonalized following Ref. [12]. This gives a constant factor overhead and requires a single additional qubit to flag between two choices of basis rotations (in place of Eq. (50)) The corresponding scaling factor of the block encoding is in this case Assuming that dW (l) dRi is a similar scale to W (l) , we can expect that f p , and that λ F ∼ 4λ H . However, as noted before for the THC derivative factorization, we expect these scalings to be improved significantly by a direct factorization of the derivative operator instead of differentiating the factorization. This may yet lead to an asymptotic improvement, but with the same systematic bias as described for the THC method that comes from a different truncation error between the Hamiltonian and force operators.

C. Block encoding in first quantization in a plane wave basis
The relatively simple structure of the electronic structure Hamiltonian in a plane wave basis implies that the cost to block encode it is far below the O(N 4 ) bound for an arbitrary two-body operator. Block-encodings of the first-quantized plane wave Hamiltonian are already well-known [68], with a number of gates required scaling as where η is the number of electrons in the system, and N is the number of plane waves. The corresponding scaling factor is similarly known to be [62] λ H ∈ O(ηN 2/3 /Ω 2/3 ).
We now consider how to block-encode the force operator in its first-quantized plane wave form. We first rewrite Eq. (36) slightly to place it in a linear combination of unitaries form Here, we split our indexing of the 3N a -dimensional nuclear co-ordinate vector R into the nuclei index l and the position index α, and write R l = (R l,x , R l,y , R l,z ) for the 3-dimensional position of nuclei l. Then, we write k α ν for the projection of the 3-dimensional vector k ν onto the unit vector in the direction of R l,α . To see that the operator inside each bracket is unitary, recall that |p p| j projects the jth electron register onto the pth basis state, but acts as the identity on all other electron registers. This implies that |p p| † j |q q| j = δ p,q |p p| j , and p |p p| j = I, as required. This implies that our PREPARE operator requires us to prepare the state with a rescaling constant In Appendix D, we describe a circuit for the PREPARE operator that generates an initial approximation via a nested boxes approach, using inequality testing and a single amplitude amplification round to achieve precision at a cost O(log N log(N/ )). The SELECT operator for our block encoding takes the form Normally the QFT operation would be performed on each of the η electron registers. However, we only need to act on the momentum register j. We therefore first swap the jth electron register into the 0th electron register conditional on |j . That can be achieved with η Toffolis for unary iteration [64], together with η log N Toffolis for controlled swaps. At the end we invert the procedure to replace the register. The QFT on this register can be implemented with gate complexity O((log N ) 2 ). To implement the phase e ikν ·(R l −rp) , one would first subtract r p from the classically given value R l . The number of bits needed for R l can be found in the case of energy estimation to be O(log(N/ )), using Lemma 2 of [68]. Here we are estimating force rather than energy, but the problem is sufficiently similar that the same scaling is needed. The complexity of the subtraction is therefore O(log(N/ )). The components of both R l and r p would be given in two's complement, then one would convert the difference to signed integers.
The phasing can then be performed by using a phase gradient state [91,92]. Let us assume that the |ν register contains three separated components |ν x , ν y , ν z written in binary form; k α ν = 2πνα Ω 1/3 for α = x, y, z. Then each bit of each component of ν can be used to control addition of a component of R l − r p into the phase gradient register. There are sign bits of both, since they are given as signed integers. The product of the sign bits can be used to give an overall sign, and this sign can be used to control addition versus subtraction of R l − r p without further non-Clifford cost. Since there are O(log N ) bits of ν and O(log(N/ )) bits of R l − r p , the overall cost is O(log N log(N/ )) for the phasing. This is similar to the complexity of the state preparation.
Taking into account the O((log N ) 2 ) complexity of the QFT and the O(η log N ) complexity of swapping the momentum register, the complexity to block encode the derivative operator is D. Numerical studies of the force operator in molecular systems The Hamiltonian simulation cost depends on the representation of the Hamiltonian and can be, among other things, quantified by its spectral bandwidth, which we upper bound with a value λ. For electronic structure Hamiltonians there exist numerical insights into the upper bounds for various systems and representation techniques and with respect to their scaling when increasing the system size or the number of orbitals [12,13,15]. For the force operator, see Eq. (22), such results do not exist. In the following, we compare the values of λ obtained for the force operator and the Hamiltonian using two methods, the sparse method [12] and the double low-rank factorization [15]. The formula for the rescaling constant to block-encode an operator via the sparse method is where the coefficients T pq and V pqrs are taken from from Eq. (47) for the Hamiltonian, and using T from Eq. (24) for the force operators. These constants are induced 1-norms of the Hamiltonian (or force) operator, which makes them highly dependent on the choice of molecular orbital basis. As it has been observed that localizing the basis orbitals reduces the scaling for Hamiltonians in Refs. 93, 13, we also use localized orbitals for all of our calculations, see also App. B 2. In Fig. 9, we provide the same results for canonical molecular orbitals. Instead of directly following the method used in Sec. IV B to differentiate the double-low-rank-factorized Hamiltonian to prove a bound on the cost of block-encoding the derivative operator, in this section we directly block-encode the derivative operator instead. This is practically easier, and should in principle yield a lower rescaling factor λ, however care should be taken in practice to suppress the systematic error this produces (as described in Sec. IV B). Within the double low-rank factorization the lambdas are given by with t i the eigenvalues of T pq + r V pqrr , cf. Eq. (49) for the Hamiltonian, and the eigenvalues of T (24)) for the force operators.
As the force is a vector with 3N a elements for a system with N a atoms, we find 3N a lambdas. To compare the simulation cost of the Hamiltonian and the force operator, we use the mean of the lambdas of the force operators, with λ Fi the lambda of the i-th force operator. We note that, in contrast to electronic structure Hamiltonians, performing a double low-rank factorization of the force operator yields negative eigenvalues w l in Eq. (48). This can be accounted for in a block encoding by shifting the sign from the eigenvalue w l to the encoding itself, so we replace the eigenvalues by their absolute values |w l |. For all shown plots, we use localized orbitals (LMOs) from a Hartee-Fock calculation, see Sec. B 2.
In the following we provide numerical results on one-dimensional chains of hydrogen atoms with spacings of 1.4 Bohr radii (0.74084Å), calculated in STO-6G, a system often used to infer the scaling in the thermodynamic limit [12,13]. Moreover, to provide a benchmark for systems closer to relevance for biological systems, we investigate the scaling of λ (sparse) and λ (DF) for water clusters of increasing size. Water clusters are prototypes of hydrogen bonded networks and therefore of fundamental interest, and thoroughly investigated benchmarks by experimental and theoretical methods [94,95]. Hydrogen bonds are regarded as the key to life in general, due to the directionality, cooperativity and versatility [96], being crucially involved in the chemical basis of the genetic code and all catalytic processes. The strength of hydrogen bonds is highly dependent on the local surrounding and requires accurate quantum treatment [97]. As input for the calculations, we use the data from [98], presented also in Fig. 8, which include the geometries of water clusters which were optimized by using a TTM2.1-F ab-initio based interaction potential together with subsequent MP2 calculations in the aug-cc-pVTZ basis.
To study the scaling of λ (DF) in the continuum limit, i.e. when increasing the basis set size, we follow [12,13] and use H 4 on a square plaquette with a side length of 2 Bohr radii. We first calculate the Hamiltonian and force operators using the cc-pvtz basis, which yields N = 56 spatial orbitals. To study the effect of an increasing basis set we artificially truncate the two-body coefficients V pqrs , In Fig. 4, we show λ (sparse) and λ (DF) of the force operators and Hamiltonian for the hydrogen chains of length between N H = 10 and N H = 100. For both methods, we recognize that the force operator shows a better scaling compared to the scaling of the Hamiltonian, and indeed tends towards a constant or log(N H ) cost. This makes sense, as we expect the force on a given nuclei in a 1D chain to scale logarithmically in the system size.
In Fig. 5(a), we show the scaling of λ (sparse) and λ (DF) for the water clusters in STO-3G with increasing system size. For technical details on the calculation we refer to Appendix E 1. Similarly to the hydrogen chain, we find a significant better scaling of the average λ of the force operators in comparison to the scaling of the Hamiltonian.
In Fig. 5(b), we show the scaling of λ (DF) when increasing the basis set size for H 4 . In this case, the scaling of λ (DF) for the Hamiltonian and the force is comparable, which is contrary to the results in the previous section. This makes sense: increasing the basis set increases the range of local energy densities on each atom, which suggests the spectrum of local operators such as derivatives should be increased also.

V. ALGORITHMS TO COMPUTE FORCES IN FAULT TOLERANCE
We detail three different methods in this section by which one might estimate forces in a fault-tolerant setting. We first study a semi-classical finite difference calculation, where the quantum computer is called as a subroutine to estimate energies at different positions R. We then study the direct estimation of forces as the expectation value of the derivative of the Hamiltonian via the Hellman-Feynman theorem, using the overlap estimation algorithm to make this estimation at the Heisenberg limit. We finally study expectation value estimation of the Hamiltonian derivative by the new expectation value estimation algorithm of [50], which re-encodes this as a separate derivative estimation problem that can be solved by the quantum derivative estimation algorithm of [99]. We estimate the asymptotic cost for each method (up to logarithmic factors) for plane waves, hydrogen chains, and small water cluster systems.
A. Numerical differentiation of the energy by higher order finite differences Finite difference methods estimate gradients as a linear combination of the energies E(R) from different atomic configurations R. Central finite differences formulas offer a quadratic advantage in the discretization error compared to backward and forward differences, so here we will focus only on the central difference forms. The simplest central finite difference formula is the first-order form, which requires energy calculations at two points R: Here, f d ∼ dR 3 is the error due to the finite difference approximation, and v i is the unit vector along the component R i . This can be generalized to the degree-2m (central) finite difference formula [99]: where the coefficients a (m) l are given by: and (m) f d ∼ dR 2m+1 is the degree-2m finite difference error. The easiest way to implement higher-order finite difference methods on a quantum device is to use the device as a subroutine that calculates the required energies E(R + l dR · v i ). This then sets a competition between the finitedifference error and the error PE from phase estimation, as the latter scales inversely in dR. Specifically, PE ∼ dR −1 due to the division in Eq. (106). Optimizing higher-order finite difference methods on a quantum device then requires balancing these two sources of error by optimizing our choices of dR and the degree m of the finite-difference method used. It also requires that we optimize our quantum circuits for finite difference estimation: as we are repeatedly estimating eigenvalues of similar (but not identical) quantum states, under certain conditions it may be preferable to reuse our state register in between phase estimation routines. In this section we will optimize all the above, and calculate bounds on the asymptotical scaling of the resulting algorithm.
In Appendix C 2 we calculate the error PE due to quantum phase estimation on a single component dE dRi . We draw on results from [64,100,101], and then propagate this error through Eq. (106). Then, in Appendix C 4, we optimize our resource allocation to the estimation of different E(R + l dR · v i ) via Lagrangian techniques. This yields a bound where λ H is the cost of block-encoding the Hamiltonian (Sec. IV A). In order for this bound to hold, we require that λ H be the worst-case block-encoding cost over all points R + ldR · v i . Equation (108) gives the first part of the total error for estimating the gradient with higher order finite differences. To complete the estimation of the error we need also to upper-bound the second component, (m) fd . To do this, we invoke Theorem 24 of [99]. This requires imposing the condition that the energy and its derivatives are bounded, with c a constant with units length −1 and e a constant with the units of energy. When this condition is satisfied, the discretization error m fd , for one component of the gradient can be upper bounded by: (m) As the error shrinks exponentially with m, the value of dR needed to ensure that the error is at most O( ) can be taken to be much larger as m increases. Since the cost of phase estimation scales inversely with dR, we will see below that using higher order difference formulas will actually reduce the cost of phase estimation.

Optimization of the finite difference step size
It remains to optimize the remaining free parameters in our finite difference method. Substituting Eq. (110) and (108) f d ) 2 ,we obtain: If we simply make the assumption that 8 m c dR ≤ 1/2 and we combine the two errors quadratically, then we have the following upper bound on the total error of the numerically estimated gradient: We can also find a second upper bound on the value of dR, by imposing that the 2 f d is at most 2 /2 in the upper bound of Eq. (112) and obtain whereas the second requirement is a direct consequence of the requirement that 8 dR cm ≤ 1/2. Substituting this into Eq. (108) will yield a comparable bound for the time.
One remaining point that needs to be addressed is the sensitivity of the estimation protocol on the ground state preparation. In particular, the above discussion holds if the ground state preparation success probability per point is . This is because we assume in the above analysis that the probability of the algorithm projecting out of the groundspace at each of the 3N a estimations is asymptotically negligible. From the union bound our previous claim on the success probability implies this because (1 − O(1/N a )) 3Na ∈ Ω(1).
While this may appear to be a substantial drawback of the finite difference method, the ground state will sometimes only need to be prepared once for the FD gradient estimation process. This is because if the eigenvalue gap is large relative to the perturbation in the energy needed over our finite sized mesh then the ground states at two near by configurations will be nearly identical. We show this fact in in Appendix C 3 and specifically claim that for any ) (where the max over R is taken over a ball of radius max( ∆ 1 , ∆ 2 ) around R), the difference between the approximate ground states ψ 0 (R + ∆) and the true ground state at ∆ = 0 is where E i (R) refers to the i th eigenvalue of H(R) and We will choose δ to be in Θ( ) since the logarithmic costs of making the state preparation accurate are negligible relative to the costs of the phase estimation step. Under these assumptions we have that the δ contribution is negligible compared to the other two terms and so we can use the following lower bound In order to ensure that the probability of failing to project is at most 1/3 during the protocol it suffices to take the failure probability of each step to be 1/9mN a . An appropriate probability can be met if we choose Then, as our choices of y in our finite difference algorithm are at max mdR from R, a sufficient choice of |dR| to guarantee this scaling is This potentially creates a problem since in order to ensure that the ground state remains stable over all the measurements (and thereby making the ground state projection cost additive rather than multiplicative) we need to have ∆ i constant as a function of . We can therefore choose ∆ i to be a constant by taking

Optimization of the finite difference order
By substituting this expression for dR in Eq. (108), and inverting in T we have that the total amount of time (in number of simulations) needed to compute a component of the gradient with error is bounded by: After the above choices have been made, the only variable that we have access to, impacting the number of simulation steps needed in the phase estimation is m. This is clearly optimized in Eq. (120) by setting the two terms in the curly brackets equal The solution to this equation is given by a Lambert W -function and the asymptotics of the function reveals that it suffices to choose This leads to the conclusion that the query complexity of computing one of the components of the gradient within error is at most

Gradient vector estimation
In the above we have considered the cost to estimate single components of the gradient vector. It remains to convert our resulting expression for the query complexity for these single components (Eq. (123)) into one for the error on the gradient vector. This involves repeating this calculation for each of the 3N a components of the gradient vector. If we want to ensure that the error in the gradient calculation (as quantified by the 2-norm of the error vector) is at most it suffices to take (In principle this could be improved by importance sampling, however we do not believe this will provide a significant gain here.) We then choose the error in individual gradient component estimations to be ≤ / √ 3N a . The asymptotic cost of the overall number of queries made in the simulation, excluding the cost of initial state preparation, is then If we define the initial state preparation process to be an algorithm that prepares a state using no queries (such as a Hartree-Fock state) and has a probability of success a 2 0 then the cost of implementing an approximate ground state projector using [102] is in Recall that state preparation is only needed on average O(1) times, due to the choice of ∆ i made in Eq. (117). This implies that the total query complexity of using the finite difference method is The above cost is given in the number of oracle calls to a block encoding of the Hamiltonian (Sec. IV A). To convert this into wall-clock time, we have to multiply this by the corresponding oracle cost in first or second quantization. The results of [15] show that the PREPARE operation for second quantized simulations of systems in a non-planewave basis can be implemented using O(N 2 ) Toffoli gates; however, with the use of tensor hypercontraction this becomes O(N ) asymptotically [13]. The SELECT operation can be implemented in O(N ) operations, where N is the number of spin orbitals used in each simulation [13]. Thus the overall cost of the walk operator is in O(N ), and we have Numerical estimates follow from the data in Sec. IV D. The analytic results for plane waves follows from taking the volume of the unit cell Ω to be proportional to the number of plane waves N . Note that for the Hydrogen chains and water clusters the we choose N ∝ Na because a minimal basis is used within these simulations. The cost of state preparation is ignored in these scalings, which is true if |a0|γ .
The scaling of this algorithm depends on the scaling of the value of λ H . In first quantization, the cost of the simulation using qubitization is based on the discussion in Section IV C is in [68] ToffCount Here the result arises from the fact that the select operation incurs a cost that is in O(η log(N )) and the prepare operation has a cost that is in O(η log 2 (N ) + log 3 (N )). As a final point, we can consider the case where we make no attempt to reuse the quantum state evaluated for the finite difference formulas. The advantage of this approach is that we do not need to reduce dR to ensure that the groundspace is sufficiently stable under perturbations of the Hamiltonian across the different points of interest. However, it will often be less efficient due to the many state preparation steps needed and also it will require a set of different state. We assume below that each such state has success probability at least |a i | 2 . Repeating the same analysis given above leads to a number of oracle queries that scales as Combining Eq. (131) with Eq. (128) yields a joint expression We examine this scaling in Sec. IV D and the results of the scalings given either empirically or analytically is given in Table III. In all cases we assume that min i |a i |γ . While these choices are made for simplicity, they do privilege the state preparation method given here because it means that we do not need to artificially reduce the value of dR to ensure that the state preparation costs are additive. This point is important to consider when comparing these costs with the methods in the following sections.
The analytic results given in Table III are straightforward to verify. In the plane-wave basis, it is difficult to reason about ionic systems because of the periodic nature of the basis functions leading to infinite energy. For this reason we focus on the case where the number of electrons in the system is equal to the nuclear charge. This implies that if we define Z i to be the charge of nucleus i then the number of electrons can be bounded above by Next the value of λ H for such systems is given by Eq. 25 of [68] to be in O(N 2/3 η 1/3 ) in the limit where the number of plane-waves is much greater than the number of electrons. Thus This implies that the cost of performing this process within the required error (under the assumptions on dH dRi , c, e made above) is The case of second quantized planewave simulations is much simpler to analyze. From Eq. (53) of [103] we find that λ H ∈ O(N 2 ) when we take Ω ∝ N . Eq. (129) then gives us that the Toffoli count scales as O(N 3 N 3/2 a / ). This completes our justification of the analytic scalings given in Table III.

Considerations on system dependent constants
One of the biggest disadvantages about using high-order finite difference formula is that their performance depends on the higher-order derivatives of the energy. Specifically, Eq. (109) gives the requirements that these constants much satisfy for each of the derivatives. While a simple expression for c and e cannot be easily extracted from such an expression, lower bounds can be extracted from the requirements. Using Eq. (109) we have from substituting k = 1 into the expression that if |ψ 0 is the true ground state (after choosing the global phases of the eigenstates such that their derivatives are orthogonal to themselves) we then have that From this expression we see that while the product of the two constants is lower bounded by the expected derivative in the ground state, that does not determine either of the two constants that constitute it. However, we can see that dimensionally e can be thought of as an energy scale and c can be thought of an inverse length scale for the system. A reasonable approximation may be made for the constant c in the case of first-quantized plane waves. If we assume that the eigenvalue gap is large and the high-order derivatives of the Hamiltonian dominate the expectation value of the energy then it is straight forward to see from perturbative arguments that Thus we have in this restricted case that If we choose e = ψ 0 | H |ψ 0 to be the ground state energy then it follows that any such scaling that satisfies this must obey While the derivative with respect to the coordinate of the Hamiltonian in general is difficult to compute, we can argue about it in plane wave bases which have an analytical expression as illustrated by Eq. (35). Specifically, we have that if R i = R l,α is one of the position coordinates of a nuclei with charge Z l we have (using the same notation as Sec. IV C) that Terms in this result scale like O(N k/3 ), but as the phase e iks·(R l +rp) oscillates rapidly as we change p and s, the final sum over both variables can be as small as o(1). This suggests that physical circumstances might exist where c ∈ o(1), but it could in other cases scale as c ∈ O(N 1/3 ) or potentially higher. Numerical explorations will likely be needed to better understand the scaling of e and c.
B. Expectation value estimation of the force operator at the Heisenberg limit with overlap estimation On a fault-tolerant quantum device one can estimate expectation values at the Heisenberg limit scaling as −1 with the unbiased error (which is tight [101,104]). This allows us to estimate gradients via the Hellman-Feynman theorem, similarly to how one would achieve this in a NISQ setting (Sec. III). The Heisenberg limit was originally achieved by the overlap estimation algorithm (OEA) of [55]. In our case the expectation values that we wish to measure are real-valued, enabling a slight optimization over the original algorithm. In this section we perform an asymptotic cost analysis of our implementation of the OEA on a block-encoded derivative operator, including optimizing the preparation and reflection subroutines for the purposes of derivative estimation, and compare it to previous methods.
The OEA aims to perform unbiased estimation of ψ|U |ψ for an arbitrary unitary operator U and state |ψ . It does so by calling a subroutine, the amplitude estimation algorithm (AEA), which estimates | ψ|U |ψ | via phase estimation of the Szegedy walk unitary The eigenvalues of this walk operator, constrained to the subspace spanned by |ψ and U |ψ provide us with all the information we need to estimate the amplitude. To see this, let We have from Jordan's Lemma that the action of S can be broken into a direct sum of irreducible two-dimensional subspaces. Therefore P ψ SP ψ = SP ψ . It then follows that for any |φ such that P ψ |φ = |φ that phase estimation will return an eigenvalue of the form e ±iθ for some angle θ ∈ [−π, π]. Further analysis in [105] shows that this eigenphase satisfies cos(θ) = 2| U | 2 − 1. This implies that ifθ is our estimate of theta, then where ∆ =θ − θ and V(θ) = E(∆ 2 ) using the phase estimation method reviewed in Appendix C 2 which is unbiased. Under the assumption that V(θ)/θ 1, the fourth moment of the distribution ofθ is negligible and that the distribution ofθ has negligible support over the branch cut V( | U |) = cos 2 (θ/2)V(cos(∆/2)) + sin 2 (θ/2)V(sin(∆/2)) V(sin((∆/2))|) Under the above assumptions, an estimatorθ of the eigenphases of S in the support of the input state |ψ that is approximately unbiased and has variance of 2 θ can be achieved using the techniques of Appendix C 2 in approximately π 2 θ calls to S. This variance inθ then propagates to an estimate of | U | with a standard deviation that is approximately bounded by θ 2 from Eq. (144). To learn the sign of U , we use the fact that We can then estimate the expectation value using the fact that, in our context ψ| U |ψ is real valued. (Note that this would not be the case if one encoded U = e −itdH/dRi as was originally proposed in [55].) That ψ| U |ψ is real implies | +| ψ|c−U |ψ |+ | = +| ψ|c−U |ψ |+ , and thus and we can find an estimator for the overlap as Note that one could in principle estimate ψ|U |ψ directly from amplitude estimation of c − U .
The variance of the estimate of U is then found from the additive property of variance of independent variables: (148) We have from Jensen's inequality that for any convex function Φ and random variable X with variance σ 2 , Since Φ(X) = X 2 is a convex function with second derivative 1 for all X ∈ [−1, 1] we have from Eq. (149) that Specifically, we will choose V( U ) ≤ 2 OEA . This motivates the following choices of the variance targets for the two operations We now estimate the complexity of computing the requisite estimators within the variance requirements of Eq. (151). Our circuit primitives and phase estimation routine for the amplitude estimation algorithm (including control) are given in Fig. 6. We note that control can be added to S by either controlling the implementations of U and U † or by controlling the implementation of R. We assume that the cost to implement controlled-S compared to the cost to implement S without control is negligible in the number of Toffoli gates required (which is typically the case). We make one further improvement to our phase estimation routine in the same vein as [64]: instead of performing phase estimation by controlling S 2 n by the nth qubit in the QPE register, we implement the unitary This may be achieved by realizing that and so one need only append S 2 n by an application of R on either side, controlled by the nth qubit in the QPE register being in the |0 state to implement Eq. (152). This technique halves the number of applications of S required, and significantly reduces the control overhead. With this implemented, using the variance targets in Eq. (151) for our two separate rounds of amplitude estimation will achieve an estimate of ψ|U |ψ with standard deviation approximately bounded by OEA , using calls to a (controlled) circuit implementation of S.
To convert the cost of the OEA to the cost of gradient estimation, we need to account for the fact that dH dRi is not a unitary operator. This is an issue, as the OEA explicitly requires unitary U . This issue may be resolved by block encoding dH dRi (Sec. IV A): adding additional qubits and finding a unitary operator U i on the larger Hilbert space such that where BE is the block-encoding error. This block encoding of dH dRi requires rescaling by some constant λ Fi for U i to be unitary. We may propagate this rescaling to the error in estimating dE dRi ; an error OEA in the estimation of ψ|U i |ψ . However, one complication involved in doing so arises from the fact that many of the estimation errors given above are  [64]. Black dots denote a controlled operation implemented when the control qubit is in the |1 state, while white dots denote a controlled operation implemented when the control qubit is in the |0 state. stated in terms of the variance of the estimate. In contrast, errors such as the block-encoding errors are deterministic errors. To solve this, we can use the Chebyshev inequality to say that for an estimate y with variance σ 2 Thus, up to a constant factor, we can with probability of failure in O(1), replace the uncertainty in our estimates with their standard deviation. From the above discussion, we then yields an estimate of dE dRi , denoted dE dRi such that with constant probability of failure ≤ 1/4 (i.e. k = 2 above) in the estimation of dE dRi . Here, R is the error in implementing the reflection operator R = I − 2|ψ ψ|, and we index the errors in the overlap estimation and block encoding by i (as these will be different for different gradient operators). To bound the error in dE dRi by some i , we can require the error in both the block encoding and overlap estimation be bounded as BE , OEA , R ≤ i 3λF i . The bound on OEA then gives us the cost of executing the gradient estimation algorithm where T R is the cost of implementing the reflection about |ψ , and T P,i is the cost of preparing |ψ for the ith gradient estimation. (We will re-use our system register between gradient estimation applications to significantly reduce the cost of T P,i for i > 1.) Note that from the Chernoff bound, the probability of failure of estimating the component of the gradient within the required error tolerance can be reduced from 1/4 to δ at cost that is in O(log(1/δ)). We allow here for the fact that when performing multiple different gradient estimations we may have different block encodings (see Sec. IV A) and state preparation methods (see Sec. V B 2). However, we assume that the reflection oracle (see Sec. V B 2) is implemented the same for each estimation.

Parallel estimation and importance sampling
The cost estimation in Eq. (158) is the cost of estimating a single gradient element. To expand this cost to a vector of gradients dE dRi , we sum individual terms Following the methods developed in Sec. B, we may select i to optimize the 1-or 2-norm of the gradient vector by importance sampling, where the resources are allocated optimally, spending more time on the components that require a higher accuracy. The final term in Eq. (158) is negligible and the second term is not affected by our choice of i (we will investigate the cost of state preparation in the following section). Thus, we consider the problem of minimizing the sum of the first terms: with either a fixed 1-norm i i or 2-norm i 2 i . This can be solved by the same Lagrangian methods as in previous sections. Solving for a fixed 1-norm yields the condition The Cauchy-Schwarz inequality and the Chernoff bound can then be used to bound the total cost of the gradient evaluation within error grad−OEA with probability of failure 1/4 using a total cost of If we promise that λ Fi ≤ λ F and T F,i ≤ T F , this becomes We can repeat the same arguments for the case where error grad−OEA is desired in the 2-norm.
for a total cost and in the case where all forces are equal cost and λ Fi ≤ λ F , The cost of implementing the overlap estimation algorithm now depends on the circuit costs T F , T R , and T P,i , and on the rescaling factor λ F . Asymptotic costs for T F and λ F were calculated in Sec. IV A. It remains to calculate T R and T P,i in order to obtain a final costing of the algorithm.

Implementation of preparation and reflection unitaries
As part of the expectation value estimation algorithm, we need to implement a reflection I − 2|ψ ψ| about the ground state |ψ . In lieu of other access to |ψ , this reflection can be constructed by an inequality test on its energy E [56]. If we have some µ such that E < µ < E + γ, where γ is the gap between the ground and first excited state with E denoting the ground state energy, and we assume the ground state is non-degenerate, we have that we can define an operator of the form One can approximate the sign function above by approximating it by as a sum of exponentials t c t exp(iHt) and block encoding this using LCU techniques [102], or by making a polynomial approximation and applying the quantum singular value transformation [56]. In this work we consider the latter. Quantum signal processing requires a block encoding of H − µ where λ H is an appropriate rescaling factor; this is at most µ larger than a corresponding block-encoding of H. The technique also requires an approximation to the sign function that is within on the eigenvalues of 1 λH (H − µ). Near µ this must break down for a finite polynomial series, so it is crucial that µ be chosen near the middle of the gap (E, E + γ). Known polynomial approximations to the sign function that achieve these requirements exist, with polynomial degree d = O( λH γ log( 1 )) [106]. Let us specify the smallest such constant by defining c sgn such that Given this sign function approximation and an exact block encoding of H − µ, we can construct a block encoding of sign[H − µ] via quantum phase estimation with exactly d repetitions of the block encoding (or its inverse), d single-qubit z-rotations and 2d N -qubit Toffoli gates [56,107]. For this to have an error less than r , we require the block encoding of H to be implemented with error less than r /d. Let the number of Toffoli gates needed to implement the block encoding of the H to an error BE ≥ r csgnγ −1 log( −1 )λH be T H (For many block encodings T H will be logarithmically dependent on BE , and so the rescaling here will not affect any O scaling results). Further, assume that we are provided a phase gradient state of the form 1 2π δz 2π δz −1 j=0 e 2πik/ 2π δz |k where δ z is the required precision for the single-qubit z-rotations. The number of Toffoli gates needed to perform these using controlled swaps on the above phase gradient state is Here we resort to O(·) notation to simplify the result given the doubly-logarithmic terms that arise from using the lower bound on BE . We construct the reflection operator 1 − 2|ψ ψ| to prepare the ground state |ψ from an initial state |φ through the use of fixed-point amplitude amplification [107,108]. This requires we have access to a preparation unitary U φ that prepares |φ , and a bound 0 < a 0 ≤ | φ|ψ |. The fixed point amplitude amplification algorithm then prepares an approximation to |ψ with error A with log(2/ A ) a0 calls to U p and 1 − 2|φ φ|. A call to the reflection 1 − 2|φ φ| can be implemented by which requires two queries to U φ and a single application of a reflection about |0 which can be implemented using 2N Toffoli gates. Therefore, we have where T φ is the cost of implementing U φ . This does not require an accurate estimate of | φ|ψ |, but just that a 0 ≤ | φ|ψ |. In many cases a bound a 0 will be known that is relatively tight, in which case the scaling of this algorithm is optimal [108]. When the bound is loose, it may be improved by performing initial rounds of amplitude amplification followed by direct estimation of a 0 on the device with little overhead [109].
As we are repeating the overlap estimation algorithm multiple times on the same initial state, we can recycle the system register between each iteration, reducing the dependence of our asymptotic cost on the initial state overlap. Let us define the action of our block encoding U i of dH dRi on our ground state |ψ as and then we have the well-known result that the Szegedy walk operator S i in Eq. (141) is block-diagonal on the two-dimensional subspace spanned by |ψ and |χ i , where it takes the form Regardless of the value of 1 λF i dE dRi , the eigenstates of S i are simply |s ± i = 1 √ 2 (|ψ ± i|χ i ). This implies that following each implementation of the overlap estimation algorithm, the overlap of the system register with the ground state is always s ± i |ψ = 1 √ 2 . This probability can be raised to 1 by a round of amplitude amplification, but this is not practical here because we do not have easy access to a reflection about |s ± i . This is especially true when 1 λF i dE dRi is very small (implying the gap between the eigenvalues of S i is quite small), which we expect to be the case. Instead, one may measure the reflection operator 1 − 2|ψ ψ| via a Hadamard test, which entails performing this operator conditional on a control qubit prepared in the |+ state, and measuring the control qubit in the X-basis. In terms of Toffoli gates, this costs exactly T R . With 50% probability, the control qubit is flipped to the |− state and the system register is prepared in |ψ (up to the error given by the reflection operator). When the control qubit remains in the |+ state, we know that our system register is now in the state |χ i . To avoid throwing away this state, we consider the operation of U † i on |χ i . If we define we have Thus, if our initial measurement of the reflection operator fails, we may iterate performing either U † i or U i to the system register, followed by re-measuring the reflection operator 1 − 2|ψ ψ| till it reports a success. The probability of this failing till the nth round (for j > 1) and then succeeding is , and so the average Toffoli gate cost of the operation can be calculated to be This cost approaches its minimal value of 3 2 T R as dE dRi → 0, and diverges as dE dRi → λ Fi . This is curious as it is the opposite behaviour that one finds when considering the cost of estimating expectation values in NISQ. The divergence can be avoided by artificially increasing λ Fi , though we expect in practice that this will never be required. Assuming 1 λ dE dRi ≤ 0.5, we have T P,i>1 < 2T R . This yields a total preparation cost of The dependence on a −1 0 in this equation is additive to the dependence on N a . For practical applications, we expect T R >> T φ + N , and so the last term can be neglected.

Total costing of the overlap estimation algorithm
The full cost of the overlap estimation algorithm can be found by substituting Eq. (178) into Eq. (166). This yields an expression in terms of the costs T R and T F to implement the reflection algorithm and block-encoded force operator respectively, as well as the rescaling factor λ F for the block encoding of the force operator, and some other physical parameters. In general the first term of Eq. (178) will dominate the second (we assume T R T φ + N , so we drop it for simplicity, yielding The last two terms in the square brackets come from the cost of state preparation. We see that the middle term is completely dominated by the first term in the square brackets. In practice, we expect the final term to be similarly dominated. Then, in Eq. 170 we expect T H >> 2N, log(δ −1 z ). Ignoring constants that do not scale with the system size, we can then write which yields a simplified asymptotic estimate As our methods for implementing the block-encoded derivative operator are either similar to or better than our methods for implementing the block-encoded Hamiltonian, we assume This implies the second term in Eq. (181) will be the dominant contribution to an asymptotic resource estimate due the additional multiplicative factors. We tabulate the estimated scalings for the overlap estimation algorithm in Tab. IV. Let us now compare the asymptotic cost of gradient estimation via the overlap estimation algorithm to the result of the finite difference method, Eq. (128) and Eq. (131), including terms assumed to be constant in Tab. III and Tab. IV. As we do not have tight lower bounds on either method, a direct universal comparison is not possible. In place of this we compare the asymptotic upper bounds. Given the assumption in Eq. (182), all terms in these bounds depend linearly on λ H and T H so it is relevant to compare against the three different potentially dominant terms in the finite difference method Eq. (132) Note that all of the above terms have units of inverse energy (as we have dropped the universal dependence on λ H ). Eq. (132) chooses between two different algorithms depending on which is faster, so we should compare τ OEA to the best-scaling algorithm. The appropriate quantity to compare to is We have from Sec.
The comparison between τ OEA and τ FD,1 is stark. Here, the relative comparison is between γ −1 λ F and c log(e) respectively. We expect c ∈ o(1) (Sec. V A 4), and we have only a logarithmic dependence on e. By contrast, λ F can be quite large and scale badly in the system size (see Sec. IV A, in particular Eqs. (88), (93) and (98)). The gap γ to the first excited state varies in molecular and condensed matter systems, but it can also be quite small. This implies that the best-scaling τ OEA (e.g. for double-factorized force operators in systems with large gaps) scale at best equally with the corresponding τ FD,1 . This is a somewhat surprising result: τ FD,1 contains the cost of performing phase estimation which is typically expected to dominate the finite difference algorithm, while τ FD,2 and τ FD,3 detail the cost of state recycling or state re-preparation. The reason behind this is that the overlap estimation algorithm contains a sub-routine (the reflection operation) that uses exactly the same circuitry as phase estimation, and the OEA requires then repeating this sub-routine −1 times. This multiplicative factor is interesting as for NISQ methods the cost of derivative estimation is completely independent of the Hamiltonian 1-norm. The comparison between τ FD,2 and τ OEA is more complex: here the rescaling factor λ F competes with N a max R ,i For a fixed R, the spectral norm dH(R) dRi is a lower bound for λ Fi(R) (being the rescaling factor for a block encoding of dH dRi at point R), though many block encodings are not this efficient. For example, in Hydrogen chains we expect dH(R) dRi to be roughly independent of the system size N a . This is matched by a double-factorized encoding, but a sparse block encoding has λ F ∼ N 2 a (Fig. 4). On the other hand, the finite difference algorithm requires that we take the maximum dH(R) dRi across the range of points used for finite difference, which significantly grows this cost.
Assuming that an efficient block encoding is chosen so that λ F ∼ dH(R) dRi , we have τ FD,2 N a τ OEA . This implies that when the cost of state re-preparation is bad enough that the recycling algorithm becomes preferred, the overlap estimation algorithm will likely become favourable again.
We finally compare τ FD,3 to τ OEA . Ignoring logarithmic factors, the comparison here is between −1 λ F and (min i |a i |) −1 . This is highly dependent on the initial state preparation, of which only few studies have been made (e.g. [110]). In lieu of exact statements that can be made here, we note that the above comparison suggests Then, investigating Eq. (185), whenever τ FD,3 < τ FD,1 the finite difference algorithm will likely be asymptotically faster, while whenever τ FD,3 > τ FD,1 the overlap estimation algorithm may have an advantage. This can be summarized in the following statement: whenever repeat state preparation becomes the dominant cost for estimating energies at any point of the finite difference method, the overlap estimation algorithm likely becomes a more efficient algorithm. This is because as the overlap estimation algorithm can efficiently recycle states with unit probability, while the finite difference method has to pay a significant cost in its choice of step size dR to guarantee the same. For applications in chemistry it is typical to assume that state preparation is not the dominant cost of the calculation (due to the relative size of the target error to the gap γ). This suggests that for typical use-cases the finite difference method may indeed be an optimal choice.

C. Simultaneous estimation of force operators using a gradient-based expectation value estimation
Previously in this work we have considered using a quantum computer as an energy-estimation subroutine to evaluate forces via higher-order finite difference methods. One might wonder whether a speedup can be gained from performing the difference estimation on the quantum device itself. This was originally considered in [38], and later significantly improved by [99], which proposed a quantum algorithm for evaluating the gradient of a function encoded as an expectation value. However, in order to apply this algorithm to estimate the gradient, the coherent evaluation of the function at a superposition of points is required. The cost of applying this approach directly to estimate the energy gradients as a function of nuclear co-ordinates appears prohibitive, especially for second-quantized Hamiltonians employing basis sets that are, themselves, dependent on the positions of the nuclei. Recently however, in [50] some of the authors of this work proposed a strategy for measuring a collection of expectation values that applies the gradient algorithm of [99] to a simple and easy to implement auxiliary function where the |θ encode M binary numbers representing coordinates in a M -dimensional box centered on the origin. This function can be encoded in a parameterized quantum circuit by performing a Hadamard test for the imaginary component of the unitary Substituting O j := 1 λF j dH dRj into Eq. (187), and assuming that |ψ is the ground state of H (at fixed R) yields a function f (θ) that is different from the energy -f (θ) = E(R + θ) -but that satisfies Crucially, implementing a quantumly-controlled version of this auxiliary function is more straightforward than a quantumly-controlled energy as a function of the nuclear coordinates, essentially because the auxiliary function is only required to reproduce the correct gradient at a single point. Implementing this as an oracle in the gradient-based approach of [99] allows for multiple expectation values to be simultaneously estimated at the Heisenberg limit with some degree of parallelization. In this section we will estimate the asymptotic Toffoli cost of implementing this oracle in the gradient based algorithm, and compare it to the methods developed in previous sections. We begin by reviewing the essential features of the gradient-based expectation value estimation algorithm. We do not present a complete review of the quantum algorithm for the gradient presented in [99], nor the extension of this to estimate expectation values given in [50]. Instead, we highlight those details that are necessary to describe the modifications required for our purposes and refer the reader to [99] and [50] for a comprehensive description. The gradient-based expectation value estimation algorithm targets the evaluation of the expectation values of M operators with respect to the state |ψ to within a precision grad in the infinitynorm. It does this by invoking the gradient estimation algorithm of [99] as a subroutine, which requires constructing a quantum oracle of function f . This is done so in terms of a probability oracle U p , where |x denotes a collection of registers encoding binary approximations to the inputs to f and |ψ good (x) and |ψ bad (x) are arbitrary normalized states. In our case, x = θ, which requires a number of ancilla qubits scaling as O(M log( −1 grad )) for storage. One of the main results in [50] is the construction of a probability oracle U p for the function f using a single query to the state preparation unitary U ψ . This probability oracle acts on the input register (the register containing the input state |x above), the system register (the Hilbert space containing |ψ ), and one additional ancilla qubit. The result of [99] allows us to solve our estimation using O( √ M / grad ) calls to the resulting phase oracle U p . It was further shown in [50] that this may be achieved while upper-bounding the maximum time evolution required to implement a call to the probability oracle ( θ ∞ ) in the following way, The failure probability of the algorithm of [99] is bounded only by 1/3, but this can be reduced to δ using the Chernoff bound at a cost that is logarithmic in δ.
There are several details to address in order to optimally apply this gradient-based approach to the estimation of forces, and to estimate the asymptotic cost of the final result. Firstly, it is desirable to take advantage of the fact that we can implement the reflection operator I − 2 |ψ ψ| more efficiently than U ψ (by a factor a 0 ). We will modify the parallel expectation value algorithm to remove all but one of the calls to the state preparation unitary U ψ , replacing this component of the algorithm with a comparable number of reflections about the ground state. This allows us to obtain a benefit similar to the one we found for the serial approach to overlap estimation presented in Section V B (see specifically Section V B 2 for the implementation of the reflection operator). Secondly, we need to account for the cost of implementing the initial state preparation step and the controlled time-evolution by the force operators in terms of the number of calls to the appropriate block-encodings. Thirdly, we need to translate the error bounds from [50] to our case, accounting for the normalization of the block-encoded force operators, the various sources of error in the state preparation, reflection, and time evolution subroutines, and the fact that we desire a error in the 2-norm rather than the infinity norm. We now aim to explain how we can replace all but one of the calls to the state preparation unitary U ψ and its conjugate by the more affordable reflection about the ground state, thereby avoiding most of the dependence on the overlap of the initial state with the ground state (a 0 ). To do so we require some intermediary details of the gradientbased expectation value algorithm, which we recall from [99] and [50]. The quantum algorithm for the gradient makes use of the probability oracle of Eq. (190) by using the walk operator from amplitude amplification to convert the probability of a good state into a phase, where 2Π 1 − I is a reflection about the zero state of the system register and the ancilla qubits of the probability oracle, and (2Π 2 − I) is a reflection about the zero state of the specific ancilla qubit that flags the good and bad states in the probability oracle. The eigenvalues of G U are of the form exp(±i sin −1 ( f (x))), where f (x) is the probability of observing a good outcome as given by (190).
The gradient algorithm of [99] uses a quantum singular value transformation to convert from G U to a block encoding of the form ( 0| ⊗ I)QSVT(G U )(|0 ⊗ I) = C wherein C (within some target error ) has eigenvalues e −irf (x) within the eigenspace of G U that is in the support of |x . The block encoded operation can be viewed as a fractional query to the phase oracle, and the block encoding procedure can be implemented using either Linear Combination of Unitaries methods or the more ancilla efficient methods of [107] (which use controlled queries to G U interspersed with single qubit rotations).
By querying such fractional phase oracles at a superposition of carefully chosen points, the gradient algorithm uses phase kickback and the quantum Fourier transform to obtain a high-order finite difference approximation to the gradient. For our purposes, the crucial detail is that the only interaction that the gradient algorithm has with the system register is through the action of (one of several) walk operators (or their conjugates) that take the same form as the one from Eq. (191) (with some varying number of ancilla qubits or additional single-qubit gates acting on the ancilla). In the rest of this section, we will show how one may shift all calls to the preparation unitary U ψ within a circuit-based implementation of G U to the edges of the unitary, such that G U takes the form The reader may confirm that all other sub-routines G A used in the gradient estimation algorithm of [99] take the same form, making it possible to write where G A may be implemented without calls to U ψ . This implies that we may concatenate any combination of these sub-routines and eliminate the product U ψ U † ψ that forms, i.e.
This will reduce the total number of preparations required to 1 (the final call to U † ψ may be discarded as we do not use the system register afterwards).
The queries to U ψ used in each iteration of G U can be similarly removed using a basis transformation (at the price of requiring two queries to transform in and out of the basis of |ψ ). Specifically, we consider the walk operator as a product of two components, 2Π 1 − I and U † p (2Π 2 − I)U p . We may define the projector so that 2Π 1 − I is a reflection operator about the |ψ state of the system register together with the |0 state of the ancilla qubits associated with Π 1 . Then, the reader may confirm that the probability oracle U p constructed by [50] takes the form where U p contains no calls to U ψ , but instead queries the force unitary U θ . Now we can rewrite our expression for the reflections by adding the following resolutions of the identity to the two components of G U : This yields our new form of G U , and in the above notation, as required. More generally, the reader may see by a careful inspection of [50] that the gradient-based expectation value estimation algorithm only interacts with the system register via the application of the components 2Π 1 − I and U † p (2Π 2 − I)U p , interleaving them with other operators that act as the identity on the system register. Therefore, if we consider the whole sequence of steps that comprise the algorithm, we can see that all of the intermediate state preparation unitary calls can be removed by cancellation with their adjacent conjugates. Furthermore, the system register is discarded at the end of the expectation value algorithm, so the final call to U † ψ can also be removed. Next with these pieces in place we can discuss the cost of the algorithm. We take the cost here, as in previous sections, to be the total number of non-Clifford operations needed for each of the operations carried out. Noting that the cost of the reflection (2Π 2 − I) is negligible (being a simple Pauli-Z gate), we can calculate a total cost for the gradient-based phase estimation algorithm of where Q denotes the number of repetitions of G U and any related unitary, and T P , T R , and T time denote the cost of implementing U ψ , (2Π 1 − I), and U p respectively; we will estimate these in the next sections.

Asymptotic costs of reflection, force evolution, and preparation
We now provide asymptotic upper bounds on the scaling of T P , T R , T time defined in the previous section. The cost T R of reflecting about our ground state |ψ depends on the precision to which we need to reflect since the a measurement of the groundspace is not generally known. To calculate the cost of the reflection, let R be a desired bound on the total contribution to the error from implementing the reflection operators, and let R prep denote the number of calls to the reflection operator used by the initial state preparation circuit. We can achieved the desired overall bound on the error by implementing the Q/2 − 1 + R prep reflection operators each to within a precision R Q/2−1+Rprep in the spectral norm. The Toffoli count for block encoding the Hamiltonian (required by the construction of the reflection operator) and implementing the reflections both scale logarithmically with the desired precision. Assuming that Q and R prep both scale at most polynomially with the system size and any other relevant parameters of the problem (we shall later determine that they satisfy this assumption), we can implement each reflection about |ψ for a cost T R which scales identically (up to logarithmic factors) to the case of serial amplitude estimation (see Eq. (180)): where γ is a lower bound on the eigenvalue gap between the groundstate and the rest of the spectrum. We now address the cost of the initial state preparation T P . We can consider the cost for this step given in Eq. (172) of Section V B 2, where it was addressed in the context of serial expectation value estimation with the overlap estimation algorithm. We can also make the same simplifying assumption later used in Section V B 3, namely that T R T φ + N . Here T φ is the Toffoli-count for implementing the state preparation unitary U φ , and will often be negligible in cases where an elementary ansatz such as a Hartree-Fock state is used for state preparation.
The contribution to the overall error from imperfectly implementing the reflection operators is already accounted for in the section above, provided that the initial state preparation step requires a number of calls to the reflection operator I − |ψ ψ| that grows at most polynomially with the problem parameters. This conditions hold, as discussed in Section V B 2. The only remaining source of error is the error from amplitude estimation, which only appears logarithmically in the cost of the state preparation step. Making these simplifications and substitutions, and neglecting all logarithmic factors, we find that (similarly to the initial preparation in Sec. V B 2), the initial state preparation step requires a number of Toffoli gates scaling as The cost T time of implementing U p is dominated by the cost of implementing the subroutine U θ (Eq. (188)). Interestingly, it suffices here to take a naive implementation of U θ to demonstrate a speedup; i.e. we aim to implement the e 2iθiOi in series for each O j = 1 λ F i dH dRi . As in Section IV A, for the ith force operator, we let T F,i denote the cost of implementing the circuit for the block encoding and we let λ Fi denote the associated rescaling factor. For simplicity, we actually consider the case where the operators are all block-encoded with the same rescaling factor λ F , where λ Fi ≤ λ F for all i. These block encodings are also block encodings of the operators 1 λF dH dRi , and we have 1 λF dH dRi ≤ 1 for all i. As discussed in [50], for each of the Q calls to U p , we perform time evolution by each of the observables for O(log( −1 grad )) different lengths of time, with a maximum length of time for any such evolution θ ∞ ≤ t max =Õ(log( −1 grad )/ √ 3N a ). We can use optimal Hamiltonian simulation algorithms to accomplish each of these evolutions to within an accuracy using O(t max + log(1/ )) calls to the block encoding for the appropriate force operator [106,107]. Let T F be an upper bound on the cost of implementing any of the block-encoded force operators to within a precision . Now we wish to determine the asymptotic scaling of the Toffoli count for performing the time evolution steps while contributing an error of at most time to the final output of the gradient-based estimation algorithm. The complexity of optimal Hamiltonian simulation depends logarithmically on and we make the further assumption that the complexity of implementing the block-encoded force operator depends logarithmically on . The total number of time evolution steps performed, and the total number of calls to the block encoding for each force operator, both scale linearly with QN a if we neglect logarithmic factors. Therefore, if we wish to bound the total error that arises from implementing time evolution by the force operators to time it suffices to set and polynomially smaller than time . When we neglect the logarithmic factors, we then see that we can implement all of the time evolution steps required by a single query to U p using a number of Toffoli gates which that as

Total cost for gradient-based estimation
Now we are ready to determine the overall cost for the force estimation using the gradient-based approach to measurement. We aim to measure the 3N a force operators to within a precision in the 2-norm. We can achieve this by setting our desired precision in the infinity norm to / √ 3N a . For simplicity, we assume that each of the force operators is block-encoded with a rescaling factor λ Fi ≤ λ F . Then we satisfy the normalization conditions of the gradient-based estimation algorithm by considering the task of estimating of the 3N a operators O i = 1 λF dH dRi to within a precision λF √ 3Na . We need to consider four different sources of error. Let grad denote the error from the gradient-based estimation algorithm itself. Let R denote the error from the reflection operator about the ground state that we shall use as a subroutine. Let P denote the error from the initial ground state preparations step. Finally, let time denote the total error contributed by our implementations of the various time evolution operators for each of the forces. In order to achieve our desired error in the 2-norm, we require that each of these components of the overall error be bounded above by 4λF √ 3Na . (The cost of R , P , and time are already accounted for in our choice of T R , T P , and T time respectively.) Taking the above bound grad ≤ 4λF √ 3Na , and substituting this into the above analysis, we find that we need queries to the probability oracle U p . The gradient-based estimation algorithm requires a number of Toffoli gates which is dominated by the cost of the initial state preparation, the Q queries to U p , and the associated reflection operators. Subsituting Eqs. (205), (202), (203), and (204) into Eq. (201), we find that As in the case of serial expectation value estimation, we expect that the cost of the initial state preparation will be negligible (as it does not depend on −1 ) in practice and we can therefore drop the first term. We also assume that T F = O(T H ), i.e., that the cost of implementing the block-encoded derivative operators scales no worse than the cost of implementing the block-encoded Hamiltonian. Making these simplifications, we have The norm of the Hamiltonian should scale at least linearly with each added nuclei, therefore we expect that N a = O(λ H ). We can use this to perform a final simplification, yielding This improves strictly over the bound for the overlap estimation algorithm (Eq. (181)) by an asymptotic factor of N 1/2 a . This can be understood as a direct square root speedup from the parallelization of the estimation of multiple gradients. It was proven in [50] that, for certain cases, gradient-based estimation achieves the optimal (up to logarithmic factors) scaling in the number of calls to the state preparation unitary U ψ . This suggests that further improvements in our case may be difficult and would necessarily come from exploiting additional structure not present in the general case. The comparison to the finite difference method (Sec. V A) is slightly more complex: following the comparison made in Sec. V B 3, the relevant comparison is between τ FD,1 = −1 c log 5/2 (e) and It was argued in Sec. V B 3 that there exist systems for which γ −1 and λ F are both constant (or scale logarithmically) in the system size (the latter based on the numerical analysis in Sec. IV A). In this case, the gradient-based expectation value estimation algorithm would be asymptotically faster by a factor N 1/2 a , even assuming the higher order gradient bound c is constant. However, for systems with a particularly small gap, or for which the block encoding of the force operator is large, the finite difference algorithm may be optimal. We give asymptotic scalings of the systems that we studied in this work in Tab. V, following the analysis used to generate Sec. III.

VI. CONCLUSION
In this work we performed an in-depth study of the cost of estimating forces using current NISQ and future faulttolerant devices. We optimized strategies for estimating forces on a NISQ device, and determined that for some state-of-the-art methods, estimating the entire force vector to a fixed error (as measured by the 2-norm of the error vector) may be roughly equal or a lower cost than estimating the energy of the system. This suggests that force vectors would effectively come 'for free' in a NISQ electronic structure calculation. We note however that as NISQ methods tend to be variational, which requires calculating the electronic energy repeatedly to minimize it with respect to variational parameters, it does not immediately follow that force estimation is easier than energy estimation for a system.
We further present, optimize and cost several algorithms for calculating forces within a fault-tolerant quantum computing framework. Specifically, we propose a method based on finite difference algorithms for computing gradients as well as another method based on estimating expectation values of force operators. We find that in situations where the expected cost of state preparation is low, the finite difference algorithms provide asymptotically better scaling. In contrast, the gradient operator approach provides better scaling if the cost of state preparation is high. In the best case scenario, the scaling of these algorithms for first quantized plane waves are a factor of N 3/2 a worse than the cost of the best first quantized simulations.
The cost estimates for block-encoding forces for atomic-centered basis orbitals in second quantization is more complex because it requires a numerical study. We find that in the examples considered, block encodings of the gradient operators does not translate into a relative reduction in cost for estimating forces compared to estimating energies for the electronic structure problem. This is because all methods studied are dominated by the cost of Hamiltonian simulation, either directly or as part of a reflection around the ground state. As such a reflection is a crucial part of targeting the ground state of an electronic structure problem, we do not expect this problem to be easily circumvented for direct force estimation on a quantum device.
Although we estimate that the cost of reaching chemical accuracy for forces (6.4 mHa/Å error per force component) is comparable to that of energies (1.6 mHa), the bottleneck of practical MD simulations is that they typically require on the order of 10 6 − 10 9 unique force calculations [51]. Wall-clock time estimates for single point energy calculations on fault-tolerant devices in the beyond-classical regime are typically on the order of multiple hours; repeating this millions of times to perform a MD simulation is not practical. This suggests that to enhance MD simulations with quantum computers, new approaches need to be found. However, our methods look promising for applications that do not require so many repeated force or derivative calculations, e.g. geometry optimization and dipole estimation appear more practical following this work.
We suggest some directions that we believe may be fruitful for future research. A clear target for further research in force optimization is to improve the bounds on the variance that we obtained in the NISQ section (e.g. by implementing locally-biased classical shadow techniques [88]), though we do not believe that this will be a sufficient improvement to achieve beyond-classical molecular dynamics simulations in the NISQ era. It is also of critical importance to study how generalizable error tolerances are for forces; it is unclear whether our result of 6.4 mHa/Å applies to systems other than water. In a similar vein, it would be of interest to determine the error tolerances required for other first-order derivative quantities, for example see Table I; our methods extend immediately to these systems, and easier targets than molecular dynamics are likely to exist.
Moreover, instead of performing a semi-classical simulation where one extracts force information from a quantum device to perform a molecular dynamical step, it should be possible to perform a fully-quantum simulation -which is performed entirely on the device. This is relevant because one of the major drivers of complexity is theÕ(1/ ) cost of computing the gradient. A more practical approach would entail either simulating the differential equation governing the dynamics on the quantum computer itself (while retaining a classical description of the nuclear motion within the Born-Oppenheimer approximation), or by a fully-quantum simulation that treats the nuclei and electrons on a similar footing such as a first quantized simulation. Alternatively, one can consider taking a small number of measurements from a quantum device and fitting a semi-empirical model to the results. This could be done using a semi-classical molecular mechanical model, for example machine-learning techniques that take a few high-accuracy and many lowaccuracy data points to generate high-accuracy force-fields [111]. Equivalently, one could consider reusing points from finite difference calculations, or sparsely sampling gradients to reduce these constant factor overheads overhead.
The larger point from this work is that it has identified a new problem within chemistry simulation that needs further research. In this sense, our work should be seen similar to early work in simulating groundstate energies, such as [14,112]. These earlier works may not have yielded results that are as practical as later methods such as [12,13] but they identified the existence of an important problem that needs to be optimized to realize the full potential of quantum computation within that domain. From that perspective, this work represents a first step towards the discovery of the first practical methods for simulating MD on quantum computers.
Because only the MO coefficients C carry the dependence on S, to derive explicit expressions for these terms, we will need to derive expressions for ∂C/∂S. Herein, we will assume that the MO coefficients are real-valued. We begin by writing a general parameterization [113] of the molecular orbitals where C 0 are the (unperturbed) reference orbitals-here taken to be from the solution of the Hartree-Fock equations, M ensures the orthogonalization of the metric S, and the Hartree-Fock energy is minimized by the unitary matrix U, which is a function of orbital rotation parameters Θ. Because U is unitary, it may take the form provided that T is anti-symmetric. The elements of T(Θ) are given by which, again, is anti-symmetric and therefore ensures unitarity of U. Therefore, we can write the expression for U more clearly as Θ is a (virt × occ)-dimensional matrix where the elements Θ ai are the orbital rotation angles that minimize the Hartree-Fock energy. When canonical Hartree-Fock orbitals are used as the reference C 0 , the rotation angles Θ ai are zero-and U(Θ) is the identity matrix-by virtue of the variational optimization of the orbitals.
An expression for M can be determined from the Hartree-Fock orthogonality condition U is unitary and may be removed by left and right multiplication by U and U † , respectively. Requiring that M is both symmetric and invertible allows us to end up with the expression which is by no means a unique solution for M. Putting the previous results together, the final parameterization of the molecular orbitals yields the expression where the elements C µp are given as For a reference set of orbitals, C 0 , the parameterization of the orbitals depends on S and Θ. We will want to consider partial derivatives of the orbitals with respect to these quantities. First we will consider derivatives of the orbitals with respect to the atomic orbital overlap matrix S Without loss of generality, we can consider the case where the MO coefficients are small perturbations of the canonical Hartree-Fock orbitals. Therefore we consider the limit where C 0 → C and Θ → 0, such that U → I. This allows us to set M = I and U = I, respectively, when evaluating expressions after differentiation. Therefore, the above expression simplifies to ∂C µp ∂S λσ C 0 =C,Θ=0 = − 1 2 q C µq C λr C σr δ qr δ rp (A10) with these, we are ready for handling general derivatives of ab initio energies involving the overlap, as in Eq. (18). For the one body terms: The two body terms are slightly more complicated. For the two body terms, we can rearrange to isolate the terms depending on the overlap αβ ∂ ∂S αβ pqrs Γ pqrs g pqrs dS αβ dx = αβ pqrs Γ pqrs   ∂ ∂S αβ µνλσ C µp C νq C λr C σs g µνλσ   dS αβ dx (A14) next, we can focus on evaluating the terms in brackets above: ∂ ∂S αβ µνλσ ∂ ∂S αβ C µp C νq C λr C σs g µνλσ = µνλσ ∂C µp ∂S αβ C νq C λr C σs g µνλσ + µνλσ C µp ∂C νq ∂S αβ C λr C σs g µνλσ + µνλσ C µp C νq ∂C λr ∂S αβ C σs g µνλσ + µνλσ C µp C νq C λr ∂C σs ∂S αβ g µνλσ = − 1 2 µt C µt C αt C βp g µqrs − 1 2 νt C νt C αt C βq g pνrs − 1 2 λt C λt C αt C βr g pqλs − 1 2 σt C σt C αt C βs g pqrσ = − 1 2 t C αt C βp g tqrs − 1 2 t C αt C βq g ptrs − 1 2 t C αt C βr g pqts − 1 2 t C αt C βs g pqrt (A15)

Localization of molecular orbitals
Canonical molecular orbitals (CMOs) from a Hartree-Fock calculation are only one of many possibilities to express the Hamiltonian and the force operators -any other orthonormal molecular orbitals might be used too. A prominemt example for another possible choice are localized molecular orbitals (LMOs). In contrast to CMOs, which typically have support over the entire molecule, LMOs are restricted to a certain spatial region of the molecule [114][115][116]. Given a set of CMOs, one finds LMOs by applying a unitary transformation U , φ q (r) = p φ p (r)U pq .
The unitary U is altered with the goal to optimize the locality of the molecular orbitals. Different schemes with different cost functions exists, such as Edmiston-Ruedenberg (ER) [114], Pipek-Mezey (PM) [115] or Foster-Boys (FB) [116]. In [93], it was observed that LMOs in comparison to CMOs have a positive, lowering, effect on the 1-norm of the qubit Hamiltonian. As the force operators force operators include terms with O(1/r 2 ) while the Coloumb potential is O(1/r), we expect that using LMOs will also lower the 1-norms of the force operators. As different localization schemes have similar effects on the 1-norm of the Hamiltonian [93], we restrict ourselves to the ER localization scheme in the following. Using ER, we aim to maximize L ER = p |φ p (r 1 )| 2 1 r 12 |φ p (r 2 )| 2 dr 1 dr 2 (B12) within the occupied/virtual molecular orbitals in order to find localized molecular orbitals. We then use the new set of LMOs to express the Hamiltonian and the force operators. We note that the fermionic shadow tomography scheme and the basis rotation grouping strategy, see Sec. III, are not affected by any localization. The basis rotation grouping scheme is based on a diagonalization, which is independent from the basis used, while the variance of the estimator of the fermionic shadow tomography depends on the 2-norm, see Eq. (68), which is preserved under any unitary transformation.
It is equivalent to take Eq. (23) from [64]: neglect the contribution to the errors prep and QFT and then propagate assuming the covariances are zero to obtain Eq. (C12). prep and QFT are expected to be considerably smaller than the 1/T l contribution, though they cannot be neglected in general, their consideration would require to find numerical solutions for estimating the scaling of the error.

Robustness of phase estimation success probability
The aim of this section is to provide a result that can be used in order to argue how the success probability for phase estimation. This result is especially important for the finite difference algorithm discussed above because this result will show that if the eigenvalue gap is sufficiently large then with high probability all the values used for the state preparation can be taken from the same ground state. If this is not true, then in the worst case scenario the ground state preparation will need to be applied for each perturbed Hamiltonian evaluated. The result is stated below in the following Lemma, and its proof is a direct consequence of matrix calculus and elementary norm inequalities. Lemma 1. Let H : R → C 2 n ×2 n be a parameterized family of Hamiltonians and let |ψ i (R) be the i th eigenstate of H(R) evaluated at parameters R ∈ R 3Na and let E i (R) be the corresponding eigenvalue. If ψ 0 (R) is known such that ψ 0 (R) − |ψ 0 (R) ≤ δ and ∆ ∈ R 3Na such that ∆ < (E 1 (R) − E 0 (R))/(4 max R ,i Here, the maximum is taken over all points in a ball of radius ∆ around R and over all derivative components d dRi . Proof. First we have from the fundamental theorem of calculus, the assumption that E j (R + s) > E 0 (R + s) that |ψ(R + ∆) = |ψ(R) + ψ(R)|ψ j (R + s∆) ψ j (R + s∆)| dH(R+s∆) ds |ψ(R + s∆) E j (R + s∆) − E 0 (R + s∆) ds ds |ψ j (R + s∆) ψ j (R + s∆)| dH(R+s∆) ds |ψ(R + s∆) (E 1 (R + s∆) − E 0 (R + s∆)) 2 ds ≥ 1 − The number of boxes scales only like O(log N ), so a superposition over all nested boxes can be prepared efficiently. Meanwhile, the momenta within each nested box do not vary significantly in size, which means that the coefficients are nearly uniformly distributed and a superposition over those momenta can also be efficiently prepared. However, we now have the additional weighting of √ ν x for block-encoding the derivative operator, whose value changes dramatically within each box. As a result, the preparation subroutine would succeed with a very small probability which would need to be amplified with significant overhead. To address this issue, we consider grouping the momenta into nested boxes for (ν x , ν y ) alone: In order to avoid divisions, this inequality test can be rewritten as 2 µ−1 |ν x |M > m(ν 2 x + ν 2 y + ν 2 z ).
This inequality test can be evaluated using only additions and multiplications. The Toffoli complexity of the multiplications is the product of the number of bits, and we have O(log N ) bits for the components of ν. Using Lemma 3 of [68] we can give the number of bits for m as O(log(N/ )) for the case of energy estimation. The case of force estimation is sufficiently similar that the scaling is the same. The squares of components of ν will have complexity O((log N ) 2 ), and the multiplication by m will have complexity O(log N log(N/ )). The additions have linear complexity in the number of bits, so the total complexity is O(log N log(N/ )) Toffolis for the complete inequality test.
The resulting state, with success of the inequality test, can be written as with For M sufficiently large, one can verify that the above state is proportional to the momentum state we want to prepare. Indeed, for a fixed choice of µ and ν, the corresponding amplitude is proportional to up to the error due to discretization. Now, the probability of success has the asymptotic scaling of The integral can be computed to arbitrary precision numerically, giving To boost the success probability close to unity, we can use just one step of amplitude amplification because the probability is greater than 1/4. The dominant complexity is involved in the squares and multiplications in the inequality test, and the amplitude amplification only gives a multiplicative factor, for a total complexity of O(log N log(N/ )) Toffolis. When calculating the Hamiltonian and the force operators of the water clusters, shown in Fig. 8, in STO-3G, we recognize that the matricized two-body coefficients M (pq),(rs) have degenereted eigenspaces. The resulting eigenvectors are thus only defined up to some gauge. While this is not a practical issue for applying any of the methods, repeated calculations will lead to slightly different lambdas. To avoid this ambiguity, we break the symmetry by shifting the positions of the atoms, x i → x i + N (0, σ) with σ = 10 −6 .
For the largest water cluster we study here, with N H2O = 20 water molecules, the spatial two-body-integrals of the Hamiltonian and the force operators have a dimension of (140, 140, 140, 140) and consequently the matrix M (pq),(rs) is of dimension (19600,19600). As a direct diagonalization of a matrix of such size is getting computationally expensive, we instead use Scipy's implementation of Lanczos algorithm to find the most important eigenvalues and eigenvectors. To find the error of this approximation, we calculate the expectation value of the Hamiltonian and the force operator with respect to the CISD wavefunction found under the reconstructed ERIs. In Fig. 7(a), we show the error on the CISD energy per atom CISD E = (E full − E k lanczos )/(3N H2O ). We find that keeping the 4N most important eigenvalues and eigenvecotors yields an error below 50µHa per atom. In Fig. 7(b), we show the error on the expectation value of the force operator with respect to the CISD wavefunction. As the force is a vector of dimension 3 × 3 × N H2O , we use CISD F = || F full − F Lanczos || as error. As visible from the plots, keeping 4N eigenvalues, as for the Hamiltonian, is a reasonable choice for the error on the force vector.

Structure of individual force operators
Any fermionic operator can be represented after a Jordan-Wigner transformation by an operator defined in qubit space, A = c i P i , where P i is a n-qubit Pauli operator and c i its coefficient. In Fig. 10, we show the distribution of the Pauli coefficients c i after a Jordan-Wigner transformation of the Hamiltonian and the force operators for a hydrogen chain with 6 hydrogen and a water molecule. For both system, the operators were obtained from canonical Hartree-Fock orbitals in the STO-6G basis.

Additional calculations using canonical molecular orbitals
In Fig. 9, we provide additional results on λ (sparse) for the Hamiltonian and the force operators using CMOs instead of LMOs. For both systems, we find a worse scaling compared to the scaling obtained by using LMOs. Geometries from [98]. For visualization, the visual molecular dynamics software package (VMD) was used [118].