Heisenberg-limited ground state energy estimation for early fault-tolerant quantum computers

Under suitable assumptions, the quantum phase estimation (QPE) algorithm is able to achieve Heisenberg-limited precision scaling in estimating the ground state energy. However, QPE requires a large number of ancilla qubits and large circuit depth, as well as the ability to perform inverse quantum Fourier transform, making it expensive to implement on an early fault-tolerant quantum computer. We propose an alternative method to estimate the ground state energy of a Hamiltonian with Heisenberg-limited precision scaling, which employs a simple quantum circuit with one ancilla qubit, and a classical post-processing procedure. Besides the ground state energy, our algorithm also produces an approximate cumulative distribution function of the spectral measure, which can be used to compute other spectral properties of the Hamiltonian.


Introduction
Estimating the ground state energy of a quantum Hamiltonian is of immense importance in condensed matter physics, quantum chemistry, and quantum information. The problem can be described as follows: we have a Hamiltonian H, acting on n qubits, with the eigendecomposition H = Σ_k λ_k Π_k, where Π_k is the projection operator onto the λ_k-eigenspace, and the λ_k's are in ascending order. Each eigenvalue may be degenerate, i.e. the rank of Π_k can be greater than one. We assume we can access the Hamiltonian H through the time evolution operator e^{-iτH} for some fixed τ. Our goal is to estimate the ground state energy λ_0 to within additive error ε.
Some assumptions are needed, as otherwise this problem is QMA-hard [2,32,34,50]. We assume we are given a state described by its density matrix ρ. Let p_k = Tr[ρΠ_k]. Then if p_0 (i.e. the overlap between the initial state and the ground state) is reasonably large, we can solve the ground state energy estimation problem efficiently. This assumption is reasonable in many practical settings. For example, in quantum chemistry, the Hartree-Fock method usually yields an approximate ground state that is easy to prepare on a quantum computer. At least for relatively small molecular systems, the Hartree-Fock state can often have a large overlap with the exact ground state [68]. Therefore we may use the Hartree-Fock solution as ρ in this setting. Other candidates for ρ that can be relatively easily prepared on quantum computers have been discussed in Refs. [6,65,68], and an overview of methods to choose ρ can be found in [44, Section V.A.2].
The computational complexity of this task depends on the desired precision ε. Even in the ideal case where the exact ground state is given, this dependence cannot be better than linear in ε^{-1} for generic Hamiltonians [5]. This limit is called the Heisenberg limit [27,28,72,73] in quantum metrology, and it is closely related to the time-energy uncertainty principle [3,4,5,20]. This optimal scaling can be achieved using the quantum phase estimation (QPE) algorithm [33], which we will discuss in detail later.
Much work has been done to develop algorithms for ground state energy estimation, both for near-term quantum devices [31,45,52,54] and for fully fault-tolerant quantum computers [1,24,39,55]. Relatively little work has been done for early fault-tolerant quantum computers [8,14,16,38], which we expect to be able to accomplish much more complicated tasks than current and near-term devices, but which still place significant limitations on the suitable algorithms. Refs. [16,36] carried out careful resource estimates of performing QPE for the Hubbard model using the surface code for quantum error correction. These are, to the best of our knowledge, the only works that address ground state energy estimation in the context of early fault-tolerant quantum computers.
To be specific, we expect such early fault-tolerant quantum computers to have the following characteristics: (1) The number of logical qubits is limited. (2) It is undesirable to have a large number of controlled operations. (3) It is a priority to reduce the circuit depth; e.g., it is better to run a circuit of depth O(D) for O(M) times than to run a circuit of depth O(DM) a constant number of times, even if using the shorter circuit entails some additional poly-logarithmic factors in the total runtime.
In this context, the textbook version of QPE (see e.g. Refs. [22,48]), which uses multiple ancilla qubits to store the phase and relies on the inverse quantum Fourier transform (QFT), has features that are undesirable on early fault-tolerant quantum computers. Some variants of QPE have been developed to achieve a high confidence level [37,47,56], which can be important in many applications. However, such modifications require even more ancilla qubits to store multiple estimates of the phase, as well as an additional coherent circuit to perform logical operations. Another possible way to achieve a high confidence level is to utilize a resource state ([7, Section II B]) to implement a Kaiser window filter [60]. This approach requires the same number of ancilla qubits as the textbook version of QPE.
Due to the above considerations, we focus on variants of QPE that use only a few ancilla qubits (in fact, all algorithms below use only one ancilla qubit). Kitaev's algorithm (see e.g. [34]) uses a simple quantum circuit with one control qubit to determine each bit of the phase individually. However, this method and many other algorithms based on it [69,70] are designed for phase estimation with an exactly given eigenstate, which is different from our goal. The semi-classical Fourier transform [29] can simulate QFT+measurement (meaning all qubits are measured in the end) with only one-qubit gates, classical control, and post-processing, thus trading the expensive quantum resource for inexpensive classical operations. One can replace the inverse QFT with the semi-classical Fourier transform, and this results in a phase estimation algorithm that uses only one ancilla qubit [9,30]. This approach can be seen as a simulation of the multi-ancilla-qubit version of QPE, and is therefore applicable to the case where ρ is not exactly the ground state. Because of these attractive features, this is the version of QPE used in Refs. [16,36]. However, as we will explain below in Section 1.1, this type of QPE requires running coherent time evolution for time O(p_0^{-1} ε^{-1}), which leads to a large circuit depth when p_0 is small. Moreover, this approach cannot be combined with the resource state discussed earlier because the resource state is not a product state.
In this work, the complexity is measured by the time for which we need to perform time evolution with the target Hamiltonian H. We use two metrics: (1) the maximal evolution time, the maximum length of time for which we need to perform (controlled) coherent time evolution, and (2) the total evolution time, the sum of all the lengths of time for which we need to perform (controlled) coherent time evolution. They describe, respectively, the circuit depth and the total runtime. Moreover, we are primarily concerned with how they depend on the initial overlap p_0 and the precision ε. The dependence on the system size n mainly enters indirectly, through p_0 and through the conversion between the total evolution time and the runtime, which we discuss in more detail later. We present an algorithm that achieves the following goals: (1) it achieves the Heisenberg-limited precision scaling, i.e. the total time for which we run time evolution is O(ε^{-1} poly(p_0^{-1})); (2) it uses at most one ancilla qubit; (3) its maximal evolution time is at most O(ε^{-1} polylog(ε^{-1} p_0^{-1})).
To the best of our knowledge, our algorithm is the first to satisfy all three requirements. In our algorithm, we sample from a simple quantum circuit, and use the samples to approximately reconstruct the cumulative distribution function (CDF) of the spectral measure associated with the Hamiltonian. We then use classical post-processing to estimate the ground state energy with high confidence. Besides the ground state energy, our algorithm also produces the approximate CDF, which may be of independent interest. In the discussion above we assumed the controlled time evolution can be performed efficiently. If controlled time evolution is costly to implement, then based on ideas in Refs. [31,43,49,59], we offer an alternative circuit in Appendix E that uses two ancilla qubits, under some additional assumptions.

The problem of ground state energy estimation is closely related to that of ground state preparation, but there are important differences. First, having access to a good initial state ρ (with large overlap with the ground state) does not make energy estimation a trivial task: even if we have access to the exact ground state, the quantum resources required to perform phase estimation can still be significant. Second, ground state energy estimation algorithms do not necessarily involve ground state preparation. This is true for the algorithm in this work as well as those in Refs. [24,39]. Consequently, even though ground state preparation algorithms generally have a runtime that depends on the spectral gap between the two lowest eigenvalues of the Hamiltonian, the cost of ground state energy estimation algorithms need not depend on the spectral gap.
We remark that although we characterize the scaling as depending on the overlap p_0, in practice we need to know a lower bound on p_0, which we denote by η, and the dependence on p_0 should more accurately be replaced by a dependence on η. To the best of our knowledge, in order to obtain a rigorous performance guarantee, knowledge of η (and that η is not too small) is needed in all previous algorithms related to QPE, because η provides the stopping criterion. We briefly explain this with a simple example. Suppose we have a Hamiltonian H on n qubits with eigenvalues λ_k (arranged in ascending order) and eigenstates |ψ_k⟩, and |φ_0⟩ is an initial guess for the ground state. Furthermore, assume p_0 = |⟨φ_0|ψ_0⟩|^2 = 0.01 and p_1 = |⟨φ_0|ψ_1⟩|^2 = 0.5. We may idealize QPE as an exact energy measurement to simplify the discussion. If we have no a priori knowledge of p_0, then performing QPE on the state |φ_0⟩ gives us λ_1 with probability 1/2. If we repeat this only a few times, most likely all the energies we obtain will be ≥ λ_1; only with on the order of p_0^{-1} = 100 measurements can we expect to observe the correct ground state energy λ_0. Hence, if we do not know a lower bound on p_0, we can never be sure that we have not stopped the algorithm prematurely.
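The stopping-criterion issue can be made quantitative: after N idealized QPE measurements, the ground state energy has been observed at least once with probability 1 − (1 − p_0)^N, so without a lower bound η ≤ p_0 there is no principled choice of N. A minimal sketch, using the hypothetical overlap p_0 = 0.01 from the example above:

```python
import numpy as np

rng = np.random.default_rng(0)
p0 = 0.01  # hypothetical overlap from the example above

def p_success(num_measurements, p0):
    """Probability that at least one of the idealized QPE outcomes
    collapses onto the ground state."""
    return 1.0 - (1.0 - p0) ** num_measurements

def saw_ground_state(num_measurements, p0, rng, trials=20_000):
    """Empirical estimate of the same probability via simulation."""
    hits = rng.random((trials, num_measurements)) < p0
    return hits.any(axis=1).mean()

# With 10 repetitions we almost surely miss the ground state, and even
# p_0^{-1} = 100 repetitions still fail about 1/e of the time.
p10, p100 = p_success(10, p0), p_success(100, p0)
emp = saw_ground_state(100, p0, rng)
```

Driving the failure probability below a target ϑ requires roughly p_0^{-1} log(ϑ^{-1}) repetitions, a number that cannot be chosen without knowing η.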
The main idea of our algorithm is to use a binary search procedure to gradually narrow down the interval in which the ground state energy is located. The key component is a subroutine CERTIFY (Algorithm 2) that decides whether the ground state energy is approximately to the left or to the right of some given value. This, however, can only be performed up to a certain precision, and can fail with non-zero probability. Therefore our search algorithm needs to account for this fuzzy outcome to produce a final result that is correct with probability arbitrarily close to 1. In the CERTIFY procedure, we use a stochastic method to evaluate the cumulative distribution function associated with the spectral density, and this is the key to achieving the Heisenberg scaling. This stochastic method is described in detail in Section 3.

Figure 1: To use QPE, either with fixed or O(p_0^{-1}) maximal evolution time, to estimate the ground state energy, we run QPE O(p_0^{-1}) times and take the minimum of the energy measurement outcomes as the ground state energy estimate. The error is averaged over multiple runs, and the failure rate is the percentage of runs that yield an estimate with error larger than the tolerance 0.04. The Hamiltonian H is the Hubbard Hamiltonian defined in Eq. (40) with U = 10, and the overlap p_0 is artificially tuned.

Related works
We first briefly analyze the cost of the textbook version of QPE using multiple ancilla qubits. Although this method has features that are not desirable on early fault-tolerant quantum computers, the analysis will nevertheless be helpful for understanding the cost of other variants of QPE. For simplicity we assume ρ = |φ⟩⟨φ| is a pure state and the ground state |ψ_0⟩ is non-degenerate. Approximately, QPE performs a projective measurement in the eigenbasis of H. With probability p_0, |φ⟩ collapses to the ground state |ψ_0⟩, in which case the energy register gives the ground state energy λ_0 to precision ε. Therefore we run phase estimation a total of O(p_0^{-1}) times, and take the instance with the minimum value in the energy register; with high probability this value will be close to λ_0. Each single run takes time O(ε^{-1}), so the total runtime is O(p_0^{-1} ε^{-1}). For simplicity we do not consider the runtime needed to prepare |φ⟩. The above analysis, however, is overly optimistic. Since we repeat the phase estimation procedure O(p_0^{-1}) times, an event that has only O(p_0) probability of happening in a single run occurs at least once over the O(p_0^{-1}) repetitions with probability O(1), which means we cannot ensure that the error happens with sufficiently low probability. In our setting, suppose the maximal evolution time is T; then each time we measure the energy register, there is an O(T^{-1} ε̃^{-1}) probability that the output will be smaller than λ_0 − ε̃. If we choose T = O(ε^{-1}) as discussed above and let ε̃ = ε/p_0, then the probability of the minimum of the O(p_0^{-1}) energy register measurement outcomes being smaller than λ_0 − ε/p_0 is only upper bounded by O(1), and we can no longer control the probability of the error being larger than ε.
This means there may be a high probability that the error of the final ground state energy estimate will be of order ε/p_0 instead of ε. For a more formal analysis see [24, Appendix A]. We numerically demonstrate that this is indeed the case in Figure 1, which shows that the error increases as p_0 decreases, and that there is a larger probability of the estimate deviating beyond a prescribed tolerance if the maximal evolution time, or equivalently the circuit depth, for QPE is fixed.
To avoid this, one can instead choose the maximal evolution time to be T = O(p_0^{-1} ε^{-1}). After repeating O(p_0^{-1}) times, the total runtime then becomes O(p_0^{-2} ε^{-1}). The increase in maximal evolution time can prevent the increase of the error (see Figure 1). However, the extra p_0^{-1} factor increases the circuit depth and is undesirable.
There are several other algorithms based on phase estimation using a single ancilla qubit [51,69,70] that are designed for settings different from ours: they assume the availability of an exact eigenstate, or are designed for obtaining the entire spectrum and thus only work for small systems. Ref. [61] proposes a method for estimating the eigenvalues by first estimating Tr[ρe^{-itH}] and then performing a classical Fourier transform, but no runtime scaling is provided. The semi-classical Fourier transform [29] simulates the QFT in a classical manner, and QPE using a single ancilla qubit and the semi-classical Fourier transform has the same scaling in terms of the maximal evolution time and the total evolution time.
In order to improve the dependence on p_0, we may use the high-confidence versions of the phase estimation algorithm [37,47,56]. In these methods, the maximal evolution time required can be reduced to O(ε^{-1} log(p_0^{-1})) by taking the median of several copies of the energy register in a coherent manner. However, this requires multiple copies of the energy register, together with an additional quantum circuit to compute the median coherently, which can be difficult to implement. Note that the semi-classical Fourier transform can only simulate the measurement outcome and does not preserve coherence; therefore, to our knowledge, the high-confidence version of phase estimation cannot be modified to use only a single ancilla qubit. In Ref. [24], the authors used a method called minimum label finding to improve the runtime to O(p_0^{-3/2} ε^{-1}), but the implementation of minimum label finding with limited quantum resources is again difficult.
Besides these algorithms based on phase estimation, several other algorithms have been developed to solve the ground state energy estimation problem. Ref. [24] proposed a method based on the linear combination of unitaries (LCU) technique that requires running time evolution for duration O(p_0^{-1/2} ε^{-3/2}) and preparing the initial state O(p_0^{-1/2} ε^{-1/2}) times. Assuming the Hamiltonian H is available in its block-encoding [17,42], Ref. [39] uses quantum signal processing [26,41] with a binary search procedure, which queries the block-encoding O(p_0^{-1/2} ε^{-1}) times and prepares the initial state O(p_0^{-1/2} log(ε^{-1})) times. To our knowledge, this is the best complexity that has been achieved. However, the block-encoding of a quantum Hamiltonian of interest, LCU, and the amplitude estimation techniques (used in [39]) are expensive in terms of the number of ancilla qubits, controlled operations, and logical operations needed.
A very different type of algorithm for ground state energy estimation is the variational quantum eigensolver (VQE) [45,52,54], a near-term algorithm that has been demonstrated on real quantum computers. The accuracy of VQE is limited both by the representation power of the variational ansatz and by the capabilities of classical optimization algorithms for the associated non-convex optimization problem. Hence, unlike the aforementioned algorithms, there are no provable performance guarantees for VQE-type methods; in fact, some recent results show that solving the non-convex optimization problem can be NP-hard [12]. Furthermore, each evaluation of the energy expectation value to precision ε requires O(ε^{-2}) samples due to Monte Carlo sampling. This can to some extent be remedied using the methods in [37,69], at the expense of a larger circuit depth requirement.
There are also a few options that can be viewed as lying in between VQE and QPE. The quantum imaginary time evolution (QITE) algorithm [46] uses state tomography to turn an imaginary time evolution into a series of real-time Hamiltonian evolution problems. Inspired by the classical Krylov subspace method, Refs. [31,53,63] propose to solve the ground state energy problem by restricting the Hilbert space to a low-dimensional subspace spanned by states that are accessible with time evolution. Similar to VQE, no provable complexity upper bound is known for these algorithms, and they all suffer from the ε^{-2} scaling due to Monte Carlo sampling. In fact, the stability of these algorithms in the presence of sampling errors remains unclear.
A more ambitious goal than ground state energy estimation is to estimate the distribution of all eigenvalues weighted by a given initial state ρ [23,51,62]. Using a quantum circuit similar to that in Kitaev's algorithm, together with classical post-processing, Ref. [62] proposed an algorithm to solve the quantum eigenvalue estimation problem (QEEP); we henceforth refer to it as the quantum eigenvalue estimation algorithm (QEEA). Suppose ‖H‖ ≤ 1/2, and the interval [−1/2, 1/2] is divided into M bins of equal size, denoted by B_j = [−1/2 + j/M, −1/2 + (j + 1)/M]. QEEA then estimates the quantities q_j = Σ_{k: λ_k ∈ B_j} p_k. Although QEEA was not designed for ground state energy estimation, one can use it to find the leftmost bin in which q_j ≥ p_0/2, and thereby locate the ground state energy within a bin of size M^{-1}. While the maximal evolution time required scales as O(ε^{-1}), the total evolution time of the original QEEA scales as O(ε^{-6}). We analyze the cost of QEEA in Appendix C, and show that the total runtime can be reduced to O(ε^{-4}) for ground state energy estimation in a straightforward way, yet this is still costly if high precision is required.
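For concreteness, the readout step described above can be sketched as follows. The bin weights q_j are taken as exact here (the actual QEEA estimates them from measurement data), and the spectral data λ_k, p_k are made-up toy values:

```python
import numpy as np

# Toy spectral data with ||H|| <= 1/2: eigenvalues and overlaps p_k.
lams = np.array([-0.31, -0.05, 0.12, 0.4])
ps = np.array([0.1, 0.5, 0.3, 0.1])
p0 = ps[0]

M = 50  # number of bins covering [-1/2, 1/2]
edges = -0.5 + np.arange(M + 1) / M

# Bin weights q_j = sum of p_k over eigenvalues in bin B_j.
# (QEEA estimates these from measurements; here we use exact values.)
q = np.array([ps[(lams >= edges[j]) & (lams < edges[j + 1])].sum()
              for j in range(M)])

# The leftmost bin with q_j >= p0/2 localizes the ground state energy
# to an interval of width 1/M.
j_star = int(np.argmax(q >= p0 / 2))
lo, hi = edges[j_star], edges[j_star + 1]
```

Since the bin size is M^{-1}, reaching precision ε requires M = O(ε^{-1}), which is where the ε-dependence of the QEEA cost enters.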
To the best of our knowledge, none of the existing algorithms achieves all three goals listed on Page 3. Some can have a better maximal evolution time or total evolution time, but the advantage always comes at the expense of some other aspect. In Table 1 we list the quantum algorithms discussed in this work and whether they satisfy each of the requirements.
Table 1: Quantum algorithms for estimating the ground state energy and whether they satisfy each of the three requirements on Page 3. We recall that the requirements are (1) achieving the Heisenberg-limited precision scaling, (2) using at most one ancilla qubit, and (3) the maximal evolution time being at most O(ε^{-1} polylog(ε^{-1} p_0^{-1})).
In Table 2, we compare the maximal evolution time, the number of repetitions (the number of times we need to run the quantum circuit), and the total evolution time needed, using the three qubit-efficient methods that require only one ancilla qubit.
Table 2: Comparison of the maximal evolution time, the number of repetitions (the number of times we need to run the quantum circuit), and the total evolution time needed for estimating the ground state energy to within error ε, using the three methods that require only one ancilla qubit: the method in this work, QPE with the semi-classical Fourier transform, and the QEEA in Ref. [62]. The overlap between the initial state and the ground state is assumed to be p_0. The number of repetitions is also the number of times we need to prepare the initial state. An analysis of the QEEA in Ref. [62] can be found in Appendix C.

Finally, in a gate-based setting, the exact relations between the maximal evolution time and the circuit depth, and between the total evolution time and the total runtime, can be affected by the method we use to perform time evolution. Suppose we have access to a unitary circuit that performs e^{-iτH} exactly for some fixed τ. Then in order to run coherent time evolution for time T we only need a circuit of depth O(T), so the circuit depth scales linearly with the maximal evolution time; similarly, the total runtime scales linearly with the total evolution time. However, if we can only perform time evolution through Hamiltonian simulation, these relations become more complicated. If advanced Hamiltonian simulation methods [10,41,42] can be used, the additional cost is asymptotically negligible, since ensuring an error ε for time evolution of duration T costs O(T polylog(T ε^{-1})); the cost is then only worse than the ideal case by a poly-logarithmic factor. However, for early fault-tolerant quantum computers, as discussed in Refs. [16,36], Trotter formulas [66] are generally favored. Running time evolution for time T with error at most ε using a p-th order Trotter formula entails a runtime of O(T^{1+1/p} ε^{-1/p}). This additional cost prevents us from strictly reaching the Heisenberg limit, though high-order Trotter formulas (i.e. with a large p) can bring us arbitrarily close to it. If one does not insist on a Heisenberg-limited scaling, then randomized algorithms [11,15,18] may lead to lower gate counts when only low precision is required.
In Appendix D we analyze the circuit depth and the total runtime of our algorithm with time evolution performed using Trotter formulas, and compare with QPE based on Trotter formulas. We find that when using Trotter formulas, our method has an additional advantage over QPE, achieving a polynomially better dependence on p_0 (i.e. η in Appendix D) in the total runtime. The total runtime scales like ε^{-1-o(1)} using our algorithm with Trotter formulas, which only approximately reaches the Heisenberg limit ε^{-1}. However, it is worth noting that none of the other methods can strictly reach the Heisenberg limit using Trotter formulas either; otherwise, one could instead perform Hamiltonian simulation with the exponentially accurate methods and go below the Heisenberg limit, which is impossible. Despite the sub-optimal asymptotic scaling, with tight error analysis [19,21,67,71] Trotter formulas may outperform the advanced Hamiltonian simulation techniques discussed above in terms of gate complexity, especially when only moderate accuracy is needed.

Organization
The rest of the paper is organized as follows. In Section 2 we introduce the quantum circuit we use and the CDF, which plays an important role in our algorithm, and give an overview of the ground state energy estimation algorithm. In Section 3 we discuss how to approximate the CDF. In Section 4 we show that the ground state energy can be estimated by inverting the CDF, and present the complexity of our algorithm (Corollary 3). In Section 5 we present the details of our algorithm for post-processing the measurement data and analyze its complexity.

Overview of the method
We want to keep the quantum circuit we use as simple as possible. In this work we use the circuit (1), in which H is the Hadamard gate. We choose W = I or W = S†, where S is the phase gate, depending on the quantity we want to estimate. The quantum circuit is simple and uses only one ancilla qubit, as required. The circuit itself has been used in previous methods [34,62]; however, our algorithm uses a different strategy for querying the circuit and for classical post-processing, resulting in a lower total evolution time and/or maximal evolution time and achieving goals (1) and (3) listed on Page 3. This circuit requires controlled time evolution, which can be non-trivial to implement. The idea of removing controlled operations in phase estimation has also been considered in [13]. Here we can use ideas from Refs. [31,43,49,59] to remove the need for controlled time evolution, but this type of approach requires an eigenstate of H with known eigenvalue that is easy to prepare; in a second-quantized setting we can simply use the vacuum state. We discuss this in detail in Appendix E.
Using the circuit in (1), in order to estimate Re Tr[ρe^{-ijτH}], where j is an arbitrary integer and τ is a real number, we set W = I. We introduce a random variable X_j and set it to 1 when the measurement outcome is 0, and to −1 when the measurement outcome is 1. Then

E[X_j] = Re Tr[ρe^{-ijτH}]. (2)

Similarly, for Im Tr[ρe^{-ijτH}] we set W = S†, and introduce a random variable Y_j that depends in the same way on the measurement outcome. We have

E[Y_j] = Im Tr[ρe^{-ijτH}]. (3)

The parameter τ is chosen to normalize the Hamiltonian. Specifically, we choose τ so that ‖τH‖ < π/3. We remark that τ should be chosen to be O(‖H‖^{-1}), and to avoid unnecessary overheads we want its scaling to be as close to Θ(‖H‖^{-1}) as possible.
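A hedged simulation of the circuit's measurement statistics: for W = I the ancilla reads 0 with probability (1 + Re Tr[ρe^{-ijτH}])/2, and analogously with Im in place of Re for W = S†. The spectral data (τλ_k, p_k) below are made-up toy values standing in for an actual circuit:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy spectral data: tau*H has eigenvalues in (-pi/3, pi/3), with
# overlaps p_k = Tr[rho Pi_k].
tau_lams = np.array([-0.8, -0.2, 0.5])
ps = np.array([0.4, 0.35, 0.25])

def exact_moment(j):
    """Tr[rho e^{-ij tau H}] computed directly from the spectral data."""
    return np.sum(ps * np.exp(-1j * j * tau_lams))

def sample_XY(j, shots, rng):
    """Simulate the ancilla statistics: X_j = +1/-1 for outcome 0/1
    with W = I, and Y_j likewise with W = S^dagger."""
    m = exact_moment(j)
    x = np.where(rng.random(shots) < (1 + m.real) / 2, 1, -1)
    y = np.where(rng.random(shots) < (1 + m.imag) / 2, 1, -1)
    return x, y

x, y = sample_XY(3, shots=200_000, rng=rng)
est = x.mean() + 1j * y.mean()  # unbiased estimate of Tr[rho e^{-i3 tau H}]
```

Averaging the ±1 outcomes recovers the real and imaginary parts of the moment, with statistical error decaying as one over the square root of the number of shots.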
We can define a spectral measure of τH associated with ρ:

p(x) = Σ_{k=0}^{K-1} p_k δ(x − τλ_k), (4)

where δ(·) is the Dirac delta function, K is the number of distinct eigenvalues, the λ_k's are the distinct eigenvalues arranged in ascending order, and each p_k is the corresponding overlap, as defined in the Introduction. We extend p to a 2π-periodic function by p(x + 2π) = p(x), so that the Fourier transform can be performed on the interval [0, 2π] instead of the whole real line, which leads to a discrete Fourier spectrum. Note that because of the assumption ‖τH‖ < π/3, within the interval [−π, π], p(x) is supported in (−π/3, π/3).
Next we consider the cumulative distribution function (CDF) associated with this measure. We define the 2π-periodic Heaviside function H(x) by

H(x) = 1 for x ∈ [2kπ, (2k + 1)π), H(x) = 0 for x ∈ [(2k − 1)π, 2kπ),

where k ∈ Z. The CDF is usually defined by C(x) = Σ_{k: τλ_k ≤ x} p_k. This is however not a 2π-periodic function and thus would create technical difficulties in later discussions. Therefore, instead of the usual definition, we define

C(x) = (H ∗ p)(x) = ∫_{−π}^{π} H(x − y) p(y) dy,

where ∗ denotes convolution. There is ambiguity at the jump discontinuities, and we define the values of C(x) at these points by requiring C(x) to be right-continuous. We check that this definition agrees with the usual definition when x ∈ (−π/3, π/3), which is the interval that contains all the eigenvalues of τH: for such x we have H(x − y) = 1 exactly when y ∈ (x − π, x], so that

C(x) = Σ_{k: τλ_k ≤ x} p_k, x ∈ (−π/3, π/3).

Consequently C(x) is a right-continuous non-decreasing function in (−π/3, π/3).
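Since the spectral measure is atomic, the convolution reduces to C(x) = Σ_k p_k H(x − τλ_k), which is easy to check numerically. The spectral data below are made-up toy values:

```python
import numpy as np

def heaviside_2pi(x):
    """2pi-periodic Heaviside: 1 on [2k*pi, (2k+1)*pi), 0 otherwise."""
    return (np.mod(x, 2 * np.pi) < np.pi).astype(float)

# Toy spectral measure p(x) = sum_k p_k delta(x - tau*lambda_k),
# supported in (-pi/3, pi/3) as required by the normalization.
tau_lams = np.array([-0.8, -0.2, 0.5])
ps = np.array([0.4, 0.35, 0.25])

def cdf(x):
    """C(x) = (H * p)(x) = sum_k p_k H(x - tau*lambda_k)."""
    return float(np.sum(ps * heaviside_2pi(x - tau_lams)))

# Inside (-pi/3, pi/3) this agrees with the usual right-continuous CDF.
```

Note that `np.mod` returns values in [0, 2π) even for negative arguments, which is exactly the periodization needed here.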
If we could evaluate the CDF, then we would be able to locate the ground state energy: the CDF is a piecewise constant function, and each of its jumps in the interval (−π/3, π/3) corresponds to an eigenvalue of τH. In order to find the ground state energy we only need to find where C(x) jumps from zero to a non-zero value. However, in practice we cannot evaluate the CDF exactly. We will see that we are able to approximate the CDF, in a sense that will be made clear later, using a function we call the approximate CDF (ACDF). To this end we first define an approximate Heaviside function F(x) = Σ_{|j|≤d} F̂_j e^{ijx} such that

|F(x) − H(x)| ≤ ε for all x ∈ [−π + δ, −δ] ∪ [δ, π − δ].

The construction of this function is provided in Lemma 6, where F̂_j is written as F̂_{d,δ,j}. Here the parameters d and δ need to be chosen to control the accuracy of this approximation, and their choices will be discussed later; we omit the d and δ dependence in the subscripts for simplicity. With this F(x) we define the ACDF by

C̃(x) = (F ∗ p)(x).

In Section 3 we will discuss how to evaluate this ACDF using the circuit in (1). The ACDF and CDF are related through the following inequalities:

C(x − δ) − ε ≤ C̃(x) ≤ C(x + δ) + ε, (9)

for any |x| ≤ π/3, 0 < δ < π/6 and ε > 0. We prove these inequalities in Appendix B. Given a statistical estimate of the ACDF C̃(x), these inequalities enable us to estimate where the jumps of the CDF occur, which leads to an estimate of the ground state energy. By approximately evaluating the ACDF C̃(x) for certain chosen x, and through Eq. (9), we can perform a binary search to locate the ground state energy in smaller and smaller intervals. The algorithm to do this and the total computational cost required to estimate the ground state energy to precision ε at confidence level 1 − ϑ are discussed in Sections 4 and 5.

Figure 2: Overview of the algorithm. The ground state energy estimate can be obtained through post-processing as discussed in Section 4. Only Step (2) needs to be performed on a quantum computer.
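The construction of F̂_{d,δ,j} in Lemma 6 is not reproduced here; as an illustrative stand-in, the sketch below damps the Fourier coefficients of the periodic Heaviside function with Lanczos sigma factors, which already gives a good approximation away from the jumps and exhibits the |F̂_j| = O(|j|^{-1}) decay used in Section 3. The spectral data are toy values:

```python
import numpy as np

d = 200  # cutoff degree

def fhat(j):
    """Illustrative stand-in for the Lemma 6 coefficients: Fourier
    coefficients of the 2pi-periodic Heaviside (nonzero for j = 0 and
    odd j), damped by Lanczos sigma factors to suppress Gibbs ringing."""
    if j == 0:
        return 0.5
    if j % 2 == 0:
        return 0.0
    # np.sinc(x) = sin(pi*x)/(pi*x)
    return np.sinc(j / (d + 1)) * (-1j / (np.pi * j))

js = np.arange(-d, d + 1)
Fj = np.array([fhat(j) for j in js])

def approx_heaviside(x):
    return float(np.real(np.sum(Fj * np.exp(1j * js * x))))

# Toy spectral data; the moments Tr[rho e^{-ij tau H}] are computed
# exactly here (in the algorithm they come from the circuit).
tau_lams = np.array([-0.8, -0.2, 0.5])
ps = np.array([0.4, 0.35, 0.25])
moments = np.array([np.sum(ps * np.exp(-1j * j * tau_lams)) for j in js])

def acdf(x):
    """ACDF: (F * p)(x) = sum_j Fhat_j e^{ijx} Tr[rho e^{-ij tau H}]."""
    return float(np.real(np.sum(Fj * np.exp(1j * js * x) * moments)))
```

With the actual Lemma 6 coefficients, the approximation error is controlled rigorously on [−π + δ, −δ] ∪ [δ, π − δ]; the Lanczos smoothing above only illustrates the same qualitative behavior.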

Evaluating the ACDF
In this section we discuss how to evaluate the ACDF C̃(x). We first expand it in the following way:

C̃(x) = (F ∗ p)(x) = Σ_{|j|≤d} F̂_j e^{ijx} ∫ p(y) e^{-ijy} dy = Σ_{|j|≤d} F̂_j Tr[ρe^{-ijτH}] e^{ijx}, (10)

where the spectral measure p(x) is defined in (4). In obtaining the last expression we have used the fact that

∫ p(y) e^{-ijy} dy = Σ_k p_k e^{-ijτλ_k} = Tr[ρe^{-ijτH}].

One might want to evaluate each Tr[ρe^{-ijτH}] using Monte Carlo sampling, since this quantity is equal to E[X_j + iY_j]. However, if we want to evaluate all Tr[ρe^{-ijτH}] to any accuracy at all, we need to sample each X_j and Y_j at least once, so the total evolution time is at least τ Σ_{|j|≤d} |j| = Ω(τd^2). Later we will see that we need to choose d = O(ε^{-1} polylog(ε^{-1} p_0^{-1})) to ensure the ground state energy estimate has an additive error smaller than ε. Hence this total evolution time would give rise to an ε^{-2} dependence in the runtime.
In order to avoid this ε^{-2} dependence, instead of evaluating all the terms we stochastically evaluate (10) as a whole. The idea we are going to describe is inspired by the unbiased version of the multi-level Monte Carlo method [57,58]. We define a random variable J that is drawn from {−d, −d + 1, . . . , d} with probability

Pr[J = j] = |F̂_j| / F,

where the normalization factor F = Σ_{|j|≤d} |F̂_j|. We let θ_j be the argument of F̂_j, i.e. F̂_j = |F̂_j| e^{iθ_j}. Then

E[F e^{i(θ_J + Jx)} (X_J + iY_J)] = Σ_{|j|≤d} (|F̂_j|/F) F e^{i(θ_j + jx)} E[X_j + iY_j] = Σ_{|j|≤d} F̂_j Tr[ρe^{-ijτH}] e^{ijx} = C̃(x),

where we have used (2) and (3). For simplicity we combine X_J and Y_J into a complex random variable

Z = X_J + iY_J.

Therefore we can use

G(x; J, Z) = F e^{i(θ_J + Jx)} Z

as an unbiased estimate of C̃(x). The variance can be bounded by

Var[G] ≤ E[|G|^2] = F^2 E[|Z|^2] ≤ 2F^2,

where we have used the fact that |X_j|, |Y_j| ≤ 1.
From the above analysis, we can generate N_s independent samples of (J, Z), denoted by (J_k, Z_k), k = 1, 2, . . . , N_s, and then take the average

Ḡ(x) = (F / N_s) Σ_{k=1}^{N_s} e^{i(θ_{J_k} + J_k x)} Z_k,

which is an unbiased estimate of C̃(x). The variance is upper bounded by 2F^2/N_s, so in order to make the variance at most a given σ^2 we need N_s = O(F^2/σ^2). The expected total evolution time is

N_s τ E[|J|] = N_s τ Σ_{|j|≤d} |j| |F̂_j| / F.

Furthermore, by Lemma 6 (iii) we have |F̂_j| ≤ c|j|^{-1} for some constant c. Therefore

F = Σ_{|j|≤d} |F̂_j| = O(log d), Σ_{|j|≤d} |j| |F̂_j| = O(d).

The number of samples and the expected total evolution time are therefore

N_s = O(σ^{-2} log^2(d)), N_s τ E[|J|] = O(σ^{-2} τ d log(d)),

respectively. We can see that in this way we have avoided the d^2 dependence, which shows up in a term-by-term evaluation.
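A compact simulation of this sampling procedure. The damped Heaviside Fourier coefficients below are an illustrative stand-in for the F̂_{d,δ,j} of Lemma 6, and the circuit shot Z = X_J + iY_J is simulated from exact moments of a toy spectrum:

```python
import numpy as np

rng = np.random.default_rng(2)

d = 200
js = np.arange(-d, d + 1)
# Stand-in coefficients: damped Fourier series of the periodic Heaviside.
F_hat = np.array([0.5 if j == 0 else
                  (0.0 if j % 2 == 0 else
                   np.sinc(j / (d + 1)) * (-1j / (np.pi * j)))
                  for j in js])
Fnorm = np.abs(F_hat).sum()     # the normalization F
probs = np.abs(F_hat) / Fnorm   # Pr[J = j]
theta = np.angle(F_hat)         # arguments theta_j

# Toy spectral data for simulating the circuit shot Z = X_J + i Y_J.
tau_lams = np.array([-0.8, -0.2, 0.5])
ps = np.array([0.4, 0.35, 0.25])

def one_shot(j, rng):
    m = np.sum(ps * np.exp(-1j * j * tau_lams))  # Tr[rho e^{-ij tau H}]
    x = 1 if rng.random() < (1 + m.real) / 2 else -1
    y = 1 if rng.random() < (1 + m.imag) / 2 else -1
    return x + 1j * y

def estimate_acdf(x, n_samples, rng):
    """Average of the unbiased estimator G = F e^{i(theta_J + J x)} Z."""
    idx = rng.choice(len(js), size=n_samples, p=probs)
    total = 0j
    for i in idx:
        total += Fnorm * np.exp(1j * (theta[i] + js[i] * x)) * one_shot(js[i], rng)
    return (total / n_samples).real

est = estimate_acdf(0.0, 50_000, rng)  # exact ACDF value here is ~0.75
```

Each sample costs evolution time τ|J|, and since Pr[J = j] ∝ |F̂_j| decays like |j|^{-1}, long evolution times are sampled rarely; this is what replaces the Ω(τd^2) cost of the term-by-term evaluation.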
In Figure 3 we show the ACDF obtained through our method for the Fermi-Hubbard model. The details of this numerical experiment can be found in Appendix F. We can estimate the ground state energy from the ACDF in a heuristic manner: we let x⋆ = inf{x : Ḡ(x) ≥ η/2}, and x⋆/τ is an estimate of the ground state energy λ_0. Here η is chosen so that p_0 ≥ η. In Section 5 we describe a more elaborate method to achieve a prescribed accuracy and confidence level; however, this heuristic method seems to work reasonably well in practice. In Figure 4 we show the scaling of the ground state energy estimation error, the total evolution time, and the maximal evolution time with respect to δ = ετ (δ here is the parameter needed to construct {F̂_j} using Lemma 6), where ε is the allowed error. Both the total evolution time and the maximal evolution time are proportional to ε^{−1}. The details of this numerical experiment can also be found in Appendix F.

Estimating the ground state energy
In this section we discuss how to estimate the ground state energy, with a guaranteed error bound and confidence level, from the samples generated on the classical and quantum circuits discussed in Sections 2 and 3. First we note that the CDF satisfies C(x) = 0 for all −π/3 < x < τλ_0, and C(x) > 0 for all τλ_0 ≤ x < π/3. Therefore obtaining the ground state energy from the CDF can be seen as inverting the CDF: we only need to find the smallest x such that C(x) > 0. One might consider performing a binary search to find such a point, but we immediately run into a problem: we only have access to estimates of C(x) corrupted by statistical noise, and we cannot tell whether an estimate greater than zero is due to C(x) > 0 or merely due to the noise. We therefore need to make the search criterion more robust to noise. Note that the CDF cannot take values strictly between 0 and p_0: C(x) ≥ p_0 for τλ_0 ≤ x < π/3 and C(x) = 0 for −π/3 < x < τλ_0. Now suppose we know p_0 ≥ η. Then for any x, rather than distinguishing between C(x) = 0 and C(x) > 0, we instead distinguish between C(x) = 0 and C(x) ≥ η/2 (the factor 1/2 is chosen for consistency with later discussion; it can be any constant strictly between 0 and 1). In this setting, if the estimate of C(x) is larger than η/4 then we tend to believe that C(x) ≥ η/2, and if the estimate is smaller than η/4 then we tend to believe that C(x) = 0. Thus we can tolerate statistical error smaller than η/4.
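The gap argument above amounts to a simple decision rule: compare the noisy estimate against the midpoint threshold η/4, which is immune to any statistical error of magnitude below η/4. A minimal illustration (the numerical values here are arbitrary):

```python
eta = 0.4

def decide(estimate):
    # threshold at eta/4, halfway between the two possible regimes
    return 1 if estimate > eta / 4 else 0   # 1: C(x) >= eta/2, 0: C(x) = 0

# any noise of magnitude strictly below eta/4 = 0.1 cannot flip the decision
for true_val, want in ((0.0, 0), (eta / 2, 1), (0.9, 1)):
    for noise in (-0.099, 0.0, 0.099):
        assert decide(true_val + noise) == want
```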
It may appear that we can find the ground state energy by performing a binary search for the point at which C(x) first becomes larger than η/2. However, we can only estimate the continuous function C̃(x), which cannot uniformly approximate C(x): C(x) has many jump discontinuities (each corresponding to an eigenvalue), and no continuous function can approximate it uniformly across a jump. As a result, we cannot perform this binary search procedure directly.
Using the samples {J_k} and {Z_k}, generated on classical and quantum circuits respectively, we are able to solve Problem 1.
Theorem 2 (Inverting the CDF). With samples {J_k}_{k=1}^M satisfying |J_k| ≤ d and {Z_k}_{k=1}^M, generated according to (11) and (13) respectively, we can solve Problem 1 on a classical computer with probability at least 1 − ϑ, for d = O(δ^{−1} log(δ^{−1} η^{−1})) and M = O(η^{−2} log²(d)(log log(δ^{−1}) + log(ϑ^{−1}))). The classical post-processing cost is given in (19). To generate the samples {Z_k}_{k=1}^M on a quantum circuit, the expected total evolution time and the maximal evolution time are given in (20) and (21), respectively.
Usually the Heisenberg limit is defined in terms of the root-mean-square error (RMSE) of the estimate. In this paper we instead focus on ensuring that the error of the ground state energy is below a threshold ε with probability at least 1 − ϑ. By Corollary 3, our algorithm has only a logarithmic dependence on ϑ^{−1}, and since the error can be at most 2‖H‖, we can easily ensure that the RMSE is O(ε) by choosing ϑ = O(ε²‖H‖^{−2}). The total evolution time scaling with respect to ε is then still O(ε^{−1}).
Remark 4 (System size dependence). One might notice the absence of an explicit system size dependence in the evolution time scaling in Theorem 2 and Corollary 3. This is because, as mentioned in the Introduction, the total evolution time depends on the system size indirectly through the two parameters τ and η. Moreover, if we consider the dependence of the total runtime on the system size, we also need to account for the overhead of performing Hamiltonian simulation. This overhead and the scaling of η with respect to the system size are highly problem-specific and independent of the tasks considered in this paper, hence we will not discuss them in more detail. Because the Hamiltonian norm can generally be upper bounded by a polynomial of the system size, and the total evolution time depends on τ^{−1} only poly-logarithmically, τ contributes a poly-logarithmic overhead to the system size dependence.

Inverting the CDF
In this section we prove Theorem 2 by constructing the classical post-processing algorithm to solve Problem 1 using samples from a quantum circuit. Since we want to search for an x satisfying the requirement (18), a natural idea is to use binary search. Our setting is somewhat different from the usual binary search setting, but we will show that a similar approach still works. The current setting differs from that of binary search mainly in two ways. First, any x ∈ [τλ_0 − δ, τλ_0 + δ] satisfies the requirement (18) and can therefore be a target. When performing binary search we want to be able to tell whether the target is to the left or right of a given x, but here targets may lie on both sides of x; when this happens there is some uncertainty as to how the algorithm proceeds next. We will show that this does not present a problem for our algorithm. Second, because the algorithm is based on random samples, each search step has some failure probability. We will use a majority voting procedure to suppress the failure probability, so that in the end the algorithm produces a correct answer with probability arbitrarily close to 1.
We suppose we are given M = N_s N_b independent samples of (J, Z) defined in (11) and (13), generated from a quantum circuit. We denote these samples by {(J_k, Z_k)}_{k=1}^M, divided into N_b batches of N_s samples each. This division is for the majority voting procedure mentioned above. The maximal evolution time needed to generate these samples is proportional to max_k |J_k| ≤ d. The expected total evolution time we will need is proportional to M E[|J|].
We first reduce Problem 1 to a decision problem. For any x ∈ (−π/3, π/3), at least one of the following must be true: C(x + δ) > η/2, or C(x − δ) < η. If there is a subroutine that tells us which of the two is correct, or randomly picks one when both are correct, then we can use it to find x⋆. We assume such a subroutine, which uses {(J_k, Z_k)}_{k=1}^M, exists, and denote it by CERTIFY(x, δ, η, {(J_k, Z_k)}). The subroutine returns either 0 or 1: 0 for C(x + δ) > η/2 being true, and 1 for C(x − δ) < η being true.
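As a sanity check of this reduction, the following sketch runs the binary search driven by such a two-sided subroutine, instantiated here with a noiseless stand-in for CERTIFY built from an exact two-atom CDF (the atom locations, weights, and stopping tolerance are illustrative assumptions, not the paper's parameters):

```python
import numpy as np

def invert_cdf(certify, delta, lo=-np.pi / 3, hi=np.pi / 3):
    # Binary search with the two-sided oracle: certify(x) = 0 certifies
    # C(x + delta) > eta/2 (target left of x + delta); certify(x) = 1
    # certifies C(x - delta) < eta (target right of x - delta).
    while hi - lo > 4 * delta:
        mid = 0.5 * (lo + hi)
        if certify(mid) == 0:
            hi = mid + delta
        else:
            lo = mid - delta
    return 0.5 * (lo + hi)

lam0, eta, delta = -0.3, 0.4, 1e-4
def C(x):                        # exact CDF with atoms at lam0 and 0.5
    return 0.4 * (x >= lam0) + 0.6 * (x >= 0.5)
def certify(x):                  # noiseless stand-in for CERTIFY
    return 0 if C(x + delta) > eta / 2 else 1

x_star = invert_cdf(certify, delta)
assert abs(x_star - lam0) <= 4 * delta
```

The invariant is that the target stays inside [lo, hi] while the interval shrinks toward a fixed width of a few δ, which is exactly the two-sided tolerance Problem 1 allows.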

Algorithm 1 INVERT CDF
Note that for any x, at least one of C(x) > (5/8)η and C(x) < (7/8)η holds; the former implies C(x + δ) > η/2 and the latter implies C(x − δ) < η. Therefore the CERTIFY subroutine only needs to decide which of the two holds, or output a random choice when both do. As discussed in Section 3, Ḡ(x) is an unbiased estimate of the ACDF C̃(x). We use {J_k} and {Z_k} to obtain N_b batch estimates of C̃(x), denoted by Ḡ_r(x) = (1/N_s) Σ_{k=1}^{N_s} G(x; J_{(r−1)N_s+k}, Z_{(r−1)N_s+k}) for r = 1, 2, . . . , N_b, where G(x; J, Z) is defined in (14). For each r, we compare Ḡ_r(x) with (3/4)η. If Ḡ_r(x) > (3/4)η for a majority of the batches, then we tend to believe C(x) > (5/8)η and output 0, for C(x + δ) > η/2. Otherwise, we tend to believe C(x) < (7/8)η and output 1, for C(x − δ) < η. This is the majority voting procedure mentioned earlier. For the pseudocode of the subroutine see Algorithm 2.

Algorithm 2 CERTIFY
In the CERTIFY subroutine, an error occurs when C(x) > (5/8)η yet a majority of the estimates Ḡ_r(x) are smaller than (3/4)η, or when C(x) < (7/8)η yet a majority of the estimates Ḡ_r(x) are larger than (3/4)η. We need the probability of such an error occurring to be upper bounded by ν.
First we assume C(x) > (5/8)η. Then for each r, by Markov's inequality applied to |Ḡ_r(x) − E[Ḡ_r(x)]|², the probability that Ḡ_r(x) deviates from its mean by more than η/8 is at most 64 var[Ḡ_r(x)]/η². We want to make this probability at most 1/4, and therefore need var[Ḡ_r(x)] ≤ η²/256. To ensure this, by (17) in which we let σ² = η²/256, we can choose N_s as in (25). Then by the Chernoff bound, the probability that a majority of the estimates Ḡ_r(x) are smaller than (3/4)η is at most e^{−C′N_b} for some constant C′. In order to make this probability bounded by ν we only need N_b = O(log(ν^{−1})).
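The exponential suppression from majority voting is easy to verify numerically. Below, each batch is assumed to fail independently with probability 1/4 (the per-batch bound derived above), and we estimate the probability that a majority of N_b batches fail; the batch counts are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
p_err, trials = 0.25, 200_000
rates = []
for Nb in (1, 11, 21):
    # simulate: each of Nb batches errs independently with prob p_err
    fails = (rng.random((trials, Nb)) < p_err).sum(axis=1)
    rates.append((fails > Nb / 2).mean())   # majority-vote failure rate

# failure probability decays exponentially in Nb (Chernoff bound)
assert rates[0] > rates[1] > rates[2]
assert rates[2] < 0.01
```

With N_b = 21 voters the empirical majority-failure rate is already below one percent, even though each individual vote fails a quarter of the time.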
In the algorithm INVERT CDF, the subroutine CERTIFY is used L = O(log(δ^{−1})) times. If an error occurs in a single run of CERTIFY with probability at most ν, then over the L uses of the subroutine the probability of an error occurring is at most Lν. Therefore, in order to ensure that an error occurs with probability at most ϑ in INVERT CDF, we set ν = ϑ/L, which gives N_b = O(log(ν^{−1})) = O(log log(δ^{−1}) + log(ϑ^{−1})). The above analysis shows that in order to solve Problem 1, the expected total evolution time per batch is given by (17) with σ² = η²/256, as discussed when estimating how large N_s needs to be in (25). Multiplying this by N_b we obtain (20). Note that we do not need to multiply by L, because each CERTIFY subroutine can reuse the same {J_k}, {Z_k}. The maximal evolution time required is τd, which leads to (21). The main cost in classical post-processing comes from evaluating Ḡ_r(x). This needs to be done LN_b times, and each evaluation involves O(N_s) = O(η^{−2} log²(d)) arithmetic operations. The total runtime for classical post-processing is therefore O(LN_b N_s) = O(LM), which leads to (19). Thus we have obtained all the cost estimates in Theorem 2, and the theorem is proved.

Discussions
In this paper we presented an algorithm to estimate the ground state energy with Heisenberg-limited precision scaling. The quantum circuit we use requires only one ancilla qubit, and the maximal evolution time needed per run depends only poly-logarithmically on the overlap p_0. This dependence on p_0 is exponentially better than that required by QPE with a similarly structured circuit using the semi-classical Fourier transform, as discussed in Section 1.1. We validate the correctness and efficiency of our algorithm through both rigorous analysis and numerical experiments.
Although our algorithm has a near-optimal dependence on the precision, the dependence on p_0 (more precisely, on its lower bound η), which scales as p_0^{−2} in Corollary 3, is far from optimal compared to the p_0^{−1/2} scaling in Refs. [24,39]. Whether one can achieve this p_0^{−1/2} scaling without using a quantum circuit with substantially larger maximal evolution time, and without techniques such as LCU or block-encoding, remains an open question.
The probabilistic choice of the simulation time according to Eq. (11) plays an important role in reducing the total evolution time. However, we may be able to partially derandomize the algorithm in the spirit of the multilevel Monte Carlo (MLMC) method [25] in the classical setting. The method we developed for computing the approximate CDF in Section 3 is in fact a fairly general approach to evaluating expectation values of matrix functions. It can serve as a substitute for the LCU method in many cases, especially in the near-term setting. Using it to compute other properties of the spectrum, such as the spectral density, is a direction for future work.

A Constructing the approximate Heaviside function
In this appendix we construct the approximate Heaviside function satisfying the requirement in (7). We first need to construct a smeared Dirac delta function, which we will use as a mollifier in constructing the approximate Heaviside function. To the best of our knowledge, this particular version of the smeared Dirac delta function has not been proposed in previous works.
where T_d(x) is the d-th Chebyshev polynomial of the first kind, and C_1 and C_2 are constants that do not depend on d or δ.

Lemma 6.
Let H(x) be the periodic Heaviside function defined in (5). For any δ ∈ (0, π/2) such that tan(δ/2) ≤ 1 − 1/√2, and any ε > 0, there exists d = O(δ^{−1} log(δ^{−1} ε^{−1})) and a 2π-periodic function F_{d,δ}(x) satisfying properties (i)-(iii). Proof. We first construct the function F_{d,δ}(x). Let M_{d,δ}(x) be the mollifier in Lemma 5. Because of Lemma 5 (i) and (ii), M_{d,δ}(x) can be used to mollify non-smooth functions. Also, because T_d(x) is a polynomial of degree d, the Fourier coefficients M̂_{d,δ,j} of the mollifier vanish for |j| > d. We construct F_{d,δ} by mollifying the Heaviside function with M_{d,δ}(x), i.e. by the convolution F_{d,δ} = H ∗ M_{d,δ}. We then show that we can choose d = O(δ^{−1} log(δ^{−1} ε^{−1})) to satisfy (ii). For any x such that |x| ∈ [δ, π − δ], first consider the case |x′| < δ: here H(x − x′) = H(x), and therefore the integrand M_{d,δ}(x′)|H(x − x′) − H(x)| = 0. Then consider the case |x′| ≥ δ: by Lemma 5 (i) we have M_{d,δ}(x′) ≤ 2/N_{d,δ}, and |H(x − x′) − H(x)| ≤ 1. Thus, for any x such that |x| ∈ [δ, π − δ], the approximation error |F_{d,δ}(x) − H(x)| is controlled by the tail mass of the mollifier outside [−δ, δ]. If we want to keep the approximation error for x ∈ [−π + δ, −δ] ∪ [δ, π − δ] below ε, we will need, by Lemma 5 (i) and (30), a sufficiently large normalization factor N_{d,δ}. It can be checked that choosing d = O(δ^{−1} log(ε^{−1} δ^{−1})) achieves this. We then show that this choice of d ensures (i) as well, from Eq. (26) and the second inequality in Lemma 5 (i). Finally we prove that our construction satisfies (iii). Because F_{d,δ}(x) is defined through a convolution, its Fourier coefficients factorize as F̂_{d,δ,j} = Ĥ_j M̂_{d,δ,j}. Since Ĥ_j = O(|j|^{−1}) for j ≠ 0, property (iii) follows.
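The qualitative content of the lemma, that convolving the periodic Heaviside function with a nonnegative, normalized, band-limited kernel yields a function in [0, 1] that matches H away from the jumps, can be checked numerically. The sketch below uses a Fejér kernel as a stand-in for M_{d,δ} (an assumption; the paper's Chebyshev-based mollifier has sharper localization) and assumes H(x) = 1 for x ∈ [0, π) and 0 for x ∈ [−π, 0) as the form of the periodic Heaviside function:

```python
import numpy as np

N, dd = 4096, 400
x = -np.pi + 2 * np.pi * np.arange(N) / N          # periodic grid on [-pi, pi)

# Fejer kernel of order dd: nonnegative, band-limited to degree dd
num = np.sin((dd + 1) * x / 2) ** 2
den = np.sin(x / 2) ** 2
k = np.where(den < 1e-12, dd + 1.0, num / np.maximum(den, 1e-12) / (dd + 1))
k /= k.sum()                                       # normalize to a probability kernel

H = (x >= 0).astype(float)                         # assumed periodic Heaviside

# circular convolution F = H * k, with the kernel re-centered at offset 0
kc = np.roll(k, -N // 2)
F = np.real(np.fft.ifft(np.fft.fft(H) * np.fft.fft(kc)))

assert F.min() > -1e-9 and F.max() < 1 + 1e-9      # mollified values stay in [0, 1]
away = (np.abs(x) > 0.3) & (np.abs(x) < np.pi - 0.3)
assert np.max(np.abs(F[away] - H[away])) < 0.05    # matches H away from the jumps
```

Because the kernel is nonnegative and sums to one, the mollified function is a convex combination of the values of H, which is exactly why property (i) (values in [0, 1]) holds for any such construction.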

B The relation between the CDF and the approximate CDF
In this appendix we prove (9). Let 0 < δ < π/6. First, from Lemma 6 we have a 2π-periodic function F(x) satisfying the approximation properties stated there, with F(x) ∈ [0, 1] for all real x. We further define F_L(x) = F(x − δ) and F_R(x) = F(x + δ), which satisfy the correspondingly shifted properties. We then define some functions related to the ACDF, among them C_L(x) and C_R(x), which can be used to bound C(x). Because of (31), the fact that p(x) is supported in (−π/3, π/3) within [−π, π], that δ < π/6, and that H(y) and F_L(y) both take values in [0, 1], for x ∈ (−π/3, π/3) we obtain a lower bound on C(x); similarly we obtain an upper bound. Combining these two inequalities with (33) proves (9).
C Obtaining the ground state energy by solving the QEEP
Here we discuss how to obtain the ground state energy using the algorithm in Ref. [62] to solve the quantum eigenvalue estimation problem (QEEP). The cost of solving the QEEP as analyzed in Ref. [62] scales as ε^{−6}; however, the cost can be much reduced for the problem of ground state energy estimation. For simplicity we assume ‖H‖ < π/3 and τ is chosen to be 1. In order to find an interval of size 2ε containing the ground state energy, we first divide the interval [−π/3, π/3] into M bins of equal size smaller than 2ε. We then define the indicator function associated with an interval [a, b] to be 1_{[a,b]}(x). We omit polylogarithmic factors in the complexity.
However, if the analysis is done more carefully, the dependence on ε can be improved. First, one should notice that the errors for the different Tr[ρe^{−ijH}] are independent, and each estimate is unbiased (if we do not consider the Fourier approximation error), as is the case in our algorithm (Section 3). Therefore the total error in estimating Tr[ρ 1_{[a,b]}(H)] accumulates sublinearly. More precisely, let the error in estimating Tr[ρe^{−ijH}] be ε_j with variance σ_j², and let the coefficient of Tr[ρe^{−ijH}] be A_j. Then the total error Σ_j A_j ε_j has variance Σ_j A_j² σ_j². Therefore the total error is roughly (Σ_j A_j² σ_j²)^{1/2} instead of the linearly accumulated Σ_j |A_j| σ_j. These two can have different asymptotic scalings depending on the magnitude of the A_j. Because of this, one can in fact estimate each Tr[ρe^{−ijH}] to within error O(η/√N_term) = O(ηε^{1/2}), which saves an ε^{−1} factor in the total runtime. Furthermore, one can choose to evaluate the approximate indicator function in a stochastic way, as we did in Section 3. By taking into account the decay of the Fourier coefficients, similar to Lemma 6 (iii), it is possible to further reduce the complexity.
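The sublinear accumulation claim is a standard variance identity for independent, unbiased errors, which the following sketch verifies with illustrative coefficients A_j ~ 1/j and a common σ_j (both are assumptions made for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(2)
A = 1.0 / np.arange(1, 51)                  # illustrative decaying coefficients
sigma = np.full(A.size, 0.1)                # per-term standard deviations

eps = rng.normal(0.0, sigma, size=(100_000, A.size))   # independent errors eps_j
total = (A * eps).sum(axis=1)               # total error  sum_j A_j eps_j

pred = (A**2 * sigma**2).sum()              # variance  sum_j A_j^2 sigma_j^2
assert abs(total.var() - pred) / pred < 0.05

# much smaller than the square of the linearly accumulated bound
assert pred < (np.abs(A) * sigma).sum() ** 2
```

For decaying A_j the quadratic sum converges while the linear sum grows logarithmically, which is the source of the saved ε^{−1} factor.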

D Complexity analysis for using Trotter formulas
In this appendix, instead of using the maximal evolution time and the total evolution time to quantify the complexity, we directly analyze the circuit depth and the total runtime when the time evolution is simulated using Trotter formulas. We suppose the Hamiltonian can be decomposed as H = Σ_γ H_γ, where each H_γ can be efficiently exponentiated. A p-th order Trotter formula applied to e^{−iτH} with r Trotter steps gives a unitary operator U_HS with error ‖U_HS − e^{−iτH}‖ = O(C_Trotter τ^{p+1}/r^p), where C_Trotter is a prefactor, for which the simplest bound is C_Trotter = O((Σ_γ ‖H_γ‖)^{p+1}). Tighter bounds in the form of sums of commutators are proved in Refs. [21,64].
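The r^{−p} error scaling can be checked directly for the first-order formula (p = 1) on a toy two-term Hamiltonian; the Hermitian matrices below are arbitrary examples, not a Hamiltonian from the paper:

```python
import numpy as np

def expmH(H, t):
    # e^{-i t H} for Hermitian H via eigendecomposition
    w, V = np.linalg.eigh(H)
    return (V * np.exp(-1j * t * w)) @ V.conj().T

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
A = 0.1 * (M + M.conj().T)                  # toy Hermitian terms, H = A + B
M = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
B = 0.1 * (M + M.conj().T)

tau = 1.0
exact = expmH(A + B, tau)

def trotter_error(r):
    step = expmH(A, tau / r) @ expmH(B, tau / r)      # first-order splitting
    return np.linalg.norm(np.linalg.matrix_power(step, r) - exact, 2)

errs = [trotter_error(r) for r in (4, 8, 16)]
assert errs[0] > errs[1] > errs[2]                    # error decreases with r
assert 1.5 < errs[1] / errs[2] < 2.5                  # roughly halves: O(1/r)
```

Doubling r roughly halves the error, consistent with the O(C_Trotter τ²/r) bound at p = 1; higher-order formulas would show a steeper r^{−p} decay.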

D.1 The algorithm in this work
Our algorithm requires approximating Eq. (10) to precision η (as in Corollary 3, η is a lower bound of p_0/2) using Trotter formulas. Suppose we use a p-th order Trotter formula; then we want d‖U_HS − e^{−iτH}‖ = O(η), which by the error bound above requires r = max{1, O(d^{1/p} η^{−1/p} C_Trotter^{1/p} τ^{1+1/p})} Trotter steps per application of e^{−iτH}. The maximal evolution time in Corollary 3 tells us how many times we need to apply the operator U_HS (after dividing by the factor τ). Multiplying this by r, we obtain the required maximal circuit depth. Similarly we obtain the total runtime. If we fix H and let ε, η → 0, this gives an extra ε^{−1/p} η^{−1/p} factor in the circuit depth and total runtime, compared to the maximal evolution time and the total evolution time, respectively.

D.2 Quantum phase estimation
We then analyze the circuit depth and total runtime requirement for estimating the ground state energy with QPE, where the time evolution is performed using Trotter formulas. We analyze the multi-ancilla qubit version of QPE and the result is equally valid for the single-ancilla qubit version using semi-classical Fourier transform.
In QPE, when we replace all exact time evolution with U HS , we would like to ensure that the probability of obtaining an energy measurement close to the ground state energy remains bounded away from 0 by Ω(η). Therefore the probability distribution of the final measurement outcome should be at most O(η) away from the original distribution in terms of the total variation distance.
Because the only part of QPE that depends on the time evolution operator is the multiply-controlled unitary Σ_{j=0}^{J−1} |j⟩⟨j| ⊗ e^{−ijτH}, which is replaced by Σ_{j=0}^{J−1} |j⟩⟨j| ⊗ U_HS^j when we use Trotter formulas, we only need to ensure that the difference between the two operators is upper bounded by O(η) in operator norm. Therefore we need J‖e^{−iτH} − U_HS‖ = O(η).
As discussed in Section 1.1, we need to choose J = O(τ^{−1} ε^{−1} η^{−1}) (the τ^{−1} factor accounts for rescaling H, and p_0 in Section 1.1 is replaced by η). Following the same analysis as in the previous section, we need to choose the number of Trotter steps for approximating e^{−iτH} to be r = max{1, O(J^{1/p} η^{−1/p} C_Trotter^{1/p} τ^{1+1/p})}. Therefore the circuit depth needed is O(max{τ^{−1} ε^{−1} η^{−1}, ε^{−1−1/p} η^{−1−2/p} C_Trotter^{1/p}}), and the total runtime is O(max{τ^{−1} ε^{−1} η^{−2}, ε^{−1−1/p} η^{−2−2/p} C_Trotter^{1/p}}).
Again, if we fix H and let ε, η → 0, this gives an extra ε^{−1/p} η^{−2/p} factor in the circuit depth and total runtime, compared to the maximal evolution time and the total evolution time, respectively. This is worse by a factor of η^{−1/p} than the cost of our algorithm.
Based on the above analysis, we construct the random variable Z in the following way: we first run the circuit with K = I, and denote the measurement outcomes of the first two qubits by (b_1, b_2). If the third register returns all 0 when measured, we let X = (−1)^{b_1+b_2}; otherwise we let X = 0. Similarly we define a random variable Y for K = S. Therefore we can define Z = 2e^{−iλ_R t}(X − iY). We can see that this new random variable Z satisfies (38). Compared with the Z in the main text, this new random variable has a slightly larger variance; this, however, does not change the asymptotic complexity.

F Details on the numerical experiments
In Figure 3, we apply the procedure described in Section 3 to approximate the CDF of the Fermi-Hubbard model, described by the Hamiltonian H = −t Σ_{⟨i,j⟩,σ} (c†_{i,σ} c_{j,σ} + h.c.) + U Σ_j n_{j,↑} n_{j,↓}, where c_{j,σ} (c†_{j,σ}) denotes the fermionic annihilation (creation) operator on site j with spin σ ∈ {↑, ↓}, ⟨·,·⟩ denotes pairs of adjacent sites, and n_{j,σ} = c†_{j,σ} c_{j,σ} is the number operator. The sites are arranged into a one-dimensional chain with open boundary conditions.
We first evaluateḠ(x) defined in (16), and the result is shown in Figure 3. We use a classical computer to simulate the sampling from the quantum circuit. The initial state ρ is chosen to be the Hartree-Fock solution, which has an overlap of around 0.4 with the exact ground state. We can see thatḠ(x) closely follows the CDF, and even though there is significant noise from Monte Carlo sampling, the jump corresponding to the ground state energy is clearly resolved.
Then we consider estimating the ground state energy from Ḡ(x). In this numerical experiment we use a heuristic approach; the rigorous approach with a provable error bound and confidence level is discussed in Sections 4 and 5. We obtain the estimate via x⋆ = inf{x : Ḡ(x) ≥ η/2}, and x⋆/τ is the estimate of the ground state energy λ_0. We expect x⋆ ∈ [τλ_0 − δ, τλ_0 + δ]. Here η is chosen so that p_0 ≥ η.
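This heuristic can be emulated in a few lines; here a toy two-atom CDF plays the role of the exact spectrum and Gaussian noise stands in for the Monte Carlo sampling error (both are assumptions of the sketch, not the Hubbard-model data):

```python
import numpy as np

rng = np.random.default_rng(3)
xs = np.linspace(-1.0, 1.0, 2001)
lam0, eta = -0.3, 0.4                     # ground state atom with weight p_0 = 0.4
C = 0.4 * (xs >= lam0) + 0.6 * (xs >= 0.5)
G = C + rng.normal(0.0, 0.02, xs.size)    # noisy ACDF samples standing in for G(x)

# heuristic estimate: first grid point where the noisy CDF reaches eta/2
x_star = xs[np.argmax(G >= eta / 2)]
assert abs(x_star - lam0) < 0.05
```

Because the noise level (0.02) is far below the threshold η/2 = 0.2, the first crossing is located at the ground state jump despite the visible sampling noise, mirroring the behavior seen in Figure 3.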
The error of the estimated ground state energy, the total evolution time, and the maximal evolution time are shown in Figure 4, in which we have chosen U/t = 4 for the Hubbard model. In the right panel of Figure 4 we can see that the line for the total evolution time runs parallel to the line for the maximal evolution time. Because the maximal evolution time scales linearly with δ^{−1}, and the plot uses logarithmic scales on both axes, the total evolution time also has a δ^{−1} scaling, and is therefore inversely proportional to the allowed error of the ground state energy estimate.

G Frequently used symbols
Symbol Meaning
H: The Hamiltonian for which we want to estimate the ground state energy.
ρ: The initial state from which we perform time evolution and measurement.
p_k: The overlap between ρ and the k-th lowest eigensubspace.
p(x): The spectral density associated with τH and ρ.
C(x): The cumulative distribution function (CDF) defined in (6).
C̃(x): The approximate CDF (ACDF) defined in (8).
G(x): An unbiased estimate of the ACDF C̃(x), defined in (14).
Ḡ(x): The average of multiple samples of G(x), defined in (16).
J_k: An integer drawn from the distribution (11), signifying the number of steps in the time evolution; |J_k| ≤ d.
Z_k: A sample generated on a quantum circuit from two measurement outcomes, defined in (13); can only take the values ±1 ± i.
d: The maximal possible value of |J_k|.
δ: In the context of Corollary 3 we choose δ = ετ, where ε is the allowed error of the ground state energy.
ϑ: The allowed failure probability.