Even shorter quantum circuit for phase estimation on early fault-tolerant quantum computers with applications to ground-state energy estimation

We develop a phase estimation method with a distinct feature: its maximal runtime (which determines the circuit depth) is $\delta/\epsilon$, where $\epsilon$ is the target precision, and the preconstant $\delta$ can be arbitrarily close to $0$ as the initial state approaches the target eigenstate. The total cost of the algorithm satisfies the Heisenberg-limited scaling $\widetilde{\mathcal{O}}(\epsilon^{-1})$. As a result, our algorithm may significantly reduce the circuit depth for performing phase estimation tasks on early fault-tolerant quantum computers. The key technique is a simple subroutine called quantum complex exponential least squares (QCELS). Our algorithm can be readily applied to reduce the circuit depth for estimating the ground-state energy of a quantum Hamiltonian, when the overlap between the initial state and the ground state is large. If this initial overlap is small, we can combine our method with the Fourier filtering method developed in [Lin, Tong, PRX Quantum 3, 010318, 2022], and the resulting algorithm provably reduces the circuit depth in the presence of a large relative overlap compared to $\epsilon$. The relative overlap condition is similar to a spectral gap assumption, but it is aware of the information in the initial state and is therefore applicable to certain Hamiltonians with small spectral gaps. We observe that the circuit depth can be reduced by around two orders of magnitude in numerical experiments under various settings.

Phase estimation is one of the most important quantum primitives. The problem of phase estimation can be equivalently stated as estimating an eigenvalue of a quantum Hamiltonian H, under the assumption that we can query H via the Hamiltonian evolution operator U = e^{−iτH} for some real number τ. There are two important performance metrics for phase estimation: the maximal runtime, denoted by T_max, and the total runtime T_total, which is the sum over all circuits in the algorithm of the runtime multiplied by the number of repetitions. T_max and T_total approximately measure the circuit depth and the total cost of the algorithm, respectively, in a way that is independent of the details of implementing U. If we are also given an eigenvector |ψ⟩ associated with an eigenvalue e^{−iτλ}, the Hadamard test is arguably the simplest algorithm for estimating the phase λ ∈ [−π/τ, π/τ). It uses only one ancilla qubit and a single query to U controlled by the ancilla qubit, i.e., T_max = τ. This makes the Hadamard test ideally suited for early fault-tolerant quantum devices, which are expected to have a very limited number of logical qubits and may have difficulty handling circuits beyond a certain maximal depth. The Hadamard test has drawbacks too: it requires |ψ⟩ to be an exact eigenstate, a stringent condition that cannot be satisfied in most scenarios. It also requires N_s = O(ϵ^{−2}) repetitions to estimate λ to precision ϵ, and hence the total runtime is O(ϵ^{−2}).
Both problems can be addressed by quantum phase estimation (QPE) and its many variants [4,16,21,22,29,34]. Generically, estimating the phase to ϵ accuracy with high success probability requires T_max to be at least π/ϵ for QPE [30, Section 5.2.1]¹. Additionally, the total runtime of QPE is O(ϵ^{−1}), which achieves the Heisenberg-limited scaling [13,44,45], the optimal scaling permitted by quantum mechanics. The standard version of QPE (see e.g. [30, Chapter 5]) uses at least log₂(π(τϵ)^{−1}) ancilla qubits and is not suitable for early fault-tolerant devices, but the semi-classical version of QPE [4,14,16] can achieve the same task using only one ancilla qubit.
To our knowledge, in all existing works on QPE satisfying the Heisenberg-limited scaling, the maximal runtime π/ϵ is non-negotiable, in the sense that the preconstant in front of ϵ^{−1} cannot be significantly reduced in general. This is because QPE-type methods construct, directly or indirectly, a filtering function that transitions from 1 to 0 on an interval of width ϵ. This can be a severe limitation in practice, since estimating λ to precision 0.001 means that U needs to be coherently queried approximately 3000 times in the quantum circuit (assuming τ = 1). It is therefore desirable to have a phase estimation method that satisfies the following properties: (1) it allows |ψ⟩ to be an inexact eigenstate and uses one ancilla qubit; (2) its maximal runtime T_max can be made smaller than π/ϵ; (3) its total runtime satisfies the Heisenberg-limited scaling.
When |ψ⟩ is an exact eigenstate, the maximal runtime in QPE-type methods may be reduced by means of a tradeoff between the circuit depth and the number of repetitions. However, if the initial state is not an exact eigenstate, this strategy is no longer directly applicable. In this paper,

¹ The analysis in [30, Section 5.2.1] assumes the phase λ̃ ∈ [0, 1). For a general H whose eigenvalues are contained in [−∥H∥₂, ∥H∥₂), we first define a scaled Hamiltonian H̃ = ½(H/∥H∥₂ + I) and apply QPE with a forward Hamiltonian evolution (i.e., applying e^{−itH̃} with t > 0). In order to estimate λ to precision ϵ, we need to estimate λ̃ = ½(λ/∥H∥₂ + 1) to precision ϵ̃ = ϵ/(2∥H∥₂) with high success probability, and the scaled maximal evolution time satisfies T̃_max ≥ 2π/ϵ̃. Therefore the unscaled maximal evolution time (with respect to U = e^{−itH}) is T_max = T̃_max/(2∥H∥₂) ≥ 2π/ϵ. This analysis also implies that if we use QPE with both forward and backward Hamiltonian evolution (i.e., applying U = e^{−itH} with t ∈ ℝ), then the maximal evolution time can be halved, i.e., T_max ≥ π/ϵ.
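The rescaling in the footnote can be sanity-checked with a few lines of arithmetic. This is a sketch with made-up values for the spectral norm and the eigenvalue; it confirms that estimating λ̃ to precision ϵ̃ = ϵ/(2∥H∥₂) recovers λ to precision ϵ after inverting the affine map.

```python
# Check of the footnote's rescaling: H_tilde = (H/||H|| + I)/2, so an
# eigenvalue maps as lam_tilde = (lam/norm_H + 1)/2, and a worst-case error
# eps_tilde = eps/(2*norm_H) on lam_tilde becomes exactly eps on lam.
norm_H = 3.7                      # hypothetical spectral norm ||H||_2
lam = -1.2                        # hypothetical eigenvalue of H
eps = 1e-3                        # target precision for lam

lam_t = 0.5 * (lam / norm_H + 1)  # corresponding eigenvalue of H_tilde
eps_t = eps / (2 * norm_H)        # precision needed for lam_tilde
lam_t_est = lam_t + eps_t         # worst-case estimate of lam_tilde
lam_est = (2 * lam_t_est - 1) * norm_H  # invert the affine map
print(abs(lam_est - lam))         # equals eps (up to rounding)
```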
we introduce algorithms that can satisfy the properties (2) and (3), without assuming that |ψ⟩ is an exact eigenstate.

I.1. Main idea
To achieve this, our quantum circuit (Fig. 1) is the same as that used in the Hadamard test, but replaces U = e^{−iτH} by U^n = e^{−inτH} for a sequence of integers n. This is a simple circuit: it uses only one ancilla qubit and is suitable for early fault-tolerant quantum computers. The most challenging component may be the implementation of the controlled time evolution. Under additional assumptions, the controlled time evolution for certain unitaries may be replaced by uncontrolled time evolution (see e.g. [9,18,24,25,31]).
Let t_n = nτ for n = 0, ..., N − 1, and for simplicity we refer to T_max := Nτ as the maximal running time (the actual maximal running time is (N − 1)τ). The circuit provides an estimate of ⟨ψ|e^{−it_nH}|ψ⟩ by measuring the success probability of the first qubit. Repeated measurements at different n provide a (complex) time series, where Z_n is a complex-valued random variable such that E(Z_n) = ⟨ψ|exp(−it_nH)|ψ⟩. In the intuitive analysis, we may assume Z_n ≈ ⟨ψ|exp(−it_nH)|ψ⟩. We give the detailed construction of Z_n in Section II.1. Without loss of generality, let us denote the target eigenstate by |ψ_0⟩, the target eigenvalue by λ_0 (in this context, λ_0 need not be the smallest eigenvalue of H), and the overlap between the initial state |ψ⟩ and the target eigenstate by p_0 = |⟨ψ|ψ_0⟩|². We also assume λ_0 ∈ [−π, π). If p_0 = 1 (i.e., |ψ⟩ is the target eigenstate) and the number of samples for each n is sufficiently large, we have Z_n ≈ e^{−it_nλ_0}, which is an exponential function. If p_0 < 1, we may still fit the input data provided by Eq. (1) with a complex exponential r exp(−it_nθ), where r ∈ ℂ and θ ∈ ℝ.
A main step of our method is a subroutine that solves the following nonlinear least squares problem:

$(r^*, \theta^*) = \arg\min_{r\in\mathbb{C},\,\theta\in\mathbb{R}} L(r,\theta), \qquad L(r,\theta) = \frac{1}{N}\sum_{n=0}^{N-1} \left|Z_n - r e^{-i\theta t_n}\right|^2,$

and θ* gives the approximation to the phase λ_0. Note that once we obtain the data set from the quantum circuit in Fig. 1, minimizing L(r, θ) only requires classical computation. This subroutine is dubbed quantum complex exponential least squares (QCELS). The minimization problem can be efficiently solved on classical computers (see Section II.2). An illustrative example of QCELS using the spectrum of the TFIM model (see Section V.1 for details) is shown in Fig. 2a. In this figure, the initial overlap is p_0 = 0.8, the scatter points are the data points Z_n, and the curve represents the fitting function r exp(−itθ).
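A minimal, self-contained sketch of this subroutine on synthetic noiseless data (the spectrum, overlaps, and parameters below are made up for illustration; they are not the paper's TFIM example). Since the optimal r has a closed form for fixed θ, the fit reduces to a one-dimensional search over θ:

```python
import numpy as np

# QCELS sketch on synthetic data: fabricate a 3-level spectrum, form
# Z_n = <psi|e^{-i t_n H}|psi> = sum_m p_m e^{-i t_n lambda_m}, and fit a
# single complex exponential r*exp(-i theta t_n) by least squares.
lams = np.array([-0.7, 0.1, 0.9])   # hypothetical eigenvalues
p = np.array([0.8, 0.15, 0.05])     # overlaps; p_0 = 0.8 dominates
tau, N = 0.5, 40
t = tau * np.arange(N)
Z = (p[None, :] * np.exp(-1j * np.outer(t, lams))).sum(axis=1)

# For fixed theta the optimal r is a mean, so minimizing L(r, theta) is
# equivalent to maximizing f(theta) = |(1/N) sum_n Z_n e^{i theta t_n}|^2.
def f(theta):
    return np.abs(np.mean(Z * np.exp(1j * theta * t))) ** 2

grid = np.linspace(-np.pi / tau, np.pi / tau, 2000)
theta_star = grid[np.argmax([f(th) for th in grid])]
print(abs(theta_star - lams[0]))  # small even though p_0 < 1
```

Despite the mismatch between the single-exponential model and the three-component data, the recovered θ* lands close to λ_0 = −0.7, which is the behavior the text describes.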
If p_0 = 1 and N = 2, the behavior of QCELS is very similar to that of the Hadamard test: we can estimate θ ≈ λ_0 mod [−π/τ, π/τ) to any precision, and the circuit depth is independent of ϵ. However, there are some immediate issues with this approach: (a) It is not clear whether Eq. (2) can estimate λ_0 accurately with a short-depth circuit, even if p_0 is close to but not equal to 1. Moreover, this method clearly fails if there exists some eigenstate |ψ_i⟩ such that |⟨ψ|ψ_i⟩|² > p_0, so p_0 should be larger than some minimal threshold.
Figure 2. (a) Fitting the noisy input data with p_0 = 0.8 using a complex exponential function. The mismatch between the data and the best fit reflects that the input data is more complex than a single complex exponential function. Despite this mismatch, even in the absence of any Monte Carlo sampling error, QCELS is able to accurately estimate the phase under proper conditions. (b) Comparison of the theoretical upper bound of δ = T_max ϵ for QCELS (T_max is the maximal runtime) with the lower bound of δ for QPE-type methods when p_0 ≥ 0.71. The Hadamard test is only applicable when p_0 = 1, in which case δ can be chosen to be arbitrarily small.
(b) For each n, the number of measurements is at least 1. If N = Θ(ϵ^{−1}), the total runtime is at least Σ_{n=0}^{N−1} nτ = N(N − 1)τ/2 = Θ(ϵ^{−2}), and the method does not satisfy the Heisenberg-limited scaling.
The main body of our work addresses these issues and develops an efficient algorithm for post-processing the input time series generated by quantum computers.
The answer to Question (a) is given by Theorem 1. Roughly speaking, when p_0 > 0.71, we may choose a proper δ > 0 so that the maximal runtime is T_max = Nτ = δ/ϵ, and the global minimizer of Eq. (2) estimates λ_0 to precision δ/T_max = ϵ (mod [−π/τ, π/τ)). Moreover, when Z_n is sufficiently concentrated around its expectation and as p_0 → 1, δ can be chosen to be arbitrarily small. Therefore the maximal runtime (and the circuit depth) can be continuously reduced as the input state approaches an exact eigenstate, and QCELS maintains the desirable behavior of the Hadamard test even when p_0 < 1.
To address Question (b), we can start from a small value of τ, which allows us to estimate λ_0 to precision δ/(Nτ). If δ and N are fixed, this estimate can only reach limited precision. Similar to the binary search strategy for refining eigenvalue estimates [9,23,24], we can refine this estimate by increasing the maximal runtime. Specifically, we multiply τ by a constant and repeat the process with fixed δ, N. We only need to repeat the process J = log₂(δ/(Nϵ)) times. At the last step, we have τ_J = δ/(Nϵ) and the maximal circuit depth is T_max = Nτ_J = δ/ϵ. This procedure is called multi-level QCELS and is described in Algorithm 1. According to Theorem 2, when p_0 ≈ 1, we may choose δ = Θ(√(1 − p_0)) ≪ 1 and estimate λ_0 to precision ϵ. The maximal runtime is T_max = Nτ_J = δ/ϵ, and the total cost is O(δ^{−1}ϵ^{−1}). Both Theorem 2 and numerical results verify that Algorithm 1 satisfies the desired properties (1)-(3) stated at the beginning of the paper. In particular, the circuit depth can be continuously adjusted via the parameter δ (see a comparison of the theoretical circuit depth of different methods in Fig. 2b), and the algorithm satisfies the Heisenberg-limited scaling for all choices of δ within the allowed range determined by p_0 and the noise level due to measurements.
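The multi-level refinement can be sketched in a few lines. This toy version (all parameters and the window-shrinking rule are made up; the paper's Algorithm 1 specifies the precise choices) doubles the time step at each level while shrinking the search window around the current estimate, so that the 2π/τ_j-periodic aliases of the loss never re-enter the search:

```python
import numpy as np

# Toy multi-level refinement: each level doubles tau_j, which sharpens the
# peak of f(theta), and restricts the search to one kernel width around the
# previous estimate so the correct candidate interval is retained.
lams = np.array([-0.7, 0.1, 0.9])
p = np.array([0.9, 0.07, 0.03])     # p_0 = 0.9
N = 20

def f(theta, tau):
    t = tau * np.arange(N)
    Z = (p[None, :] * np.exp(-1j * np.outer(t, lams))).sum(axis=1)  # noiseless Z_n
    return np.abs(np.mean(Z * np.exp(1j * theta * t))) ** 2

est, tau = 0.0, 0.2
for j in range(8):
    hw = 2 * np.pi / (N * tau)            # one kernel width around the estimate
    grid = est + np.linspace(-hw, hw, 401)
    est = grid[np.argmax([f(th, tau) for th in grid])]
    tau *= 2.0                            # tau_{j+1} = 2 tau_j
print(est)  # approaches lambda_0 = -0.7 as the levels refine
```

Each level contributes only N evaluations per grid point at a fixed depth, which is how the geometric schedule keeps the total cost near Heisenberg-limited.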

I.2. Ground-state energy estimation
As an application, we consider the problem of estimating the ground-state energy (the algebraically smallest eigenvalue) of an n-qubit quantum Hamiltonian H. Here we assume the ground-state energy satisfies λ_0 ∈ [−π, π), and |ψ_0⟩ is the associated eigenvector. In the absence of additional assumptions, the task can be QMA-hard [2,19,21]. Hence we assume that an initial quantum state |ψ⟩ = U_I|0^n⟩ can be prepared via a unitary U_I, and that the overlap p_0 = |⟨ψ|ψ_0⟩|² > 0. If p_0 ≥ 0.71, we can readily apply Theorem 2 to estimate λ_0 using a short-depth circuit.
If p_0 is small, we propose an algorithm that combines the multi-level QCELS algorithm with the Fourier filtering technique developed in Ref. [24] to estimate λ_0. To demonstrate the efficiency of the algorithm, we assume that there is an interval I containing λ_0 and a slightly larger interval I′ ⊃ I with a positive distance D separating I and (I′)^c (see Eq. (36)). We introduce the relative overlap of the initial vector |ψ⟩ with the ground state with respect to the intervals I, I′, defined in Eq. (3); the denominator there is assumed to be non-vanishing. The most straightforward scenario is that the system has a spectral gap ∆ = λ_1 − λ_0. We can then choose I = [−π, λ_prior + ∆/4], I′ = [−π, λ_prior + 3∆/4], and D = ∆/2, where λ_prior is a rough estimate of λ_0 such that |λ_prior − λ_0| ≤ ∆/4. The relative overlap in this case is 1. It should be noted that the preceding discussion considers a worst-case scenario. In a real application, even if the spectral gap is very small, it may be feasible to choose intervals I and I′ that result in a distance D significantly larger than the spectral gap, while still achieving a large relative overlap p_r(I, I′). Theorem 3 states that as long as the relative overlap is larger than 0.71, we can estimate the ground-state energy to precision ϵ, where the maximal runtime is T_max = Θ(D^{−1}) + δ/ϵ and the total runtime T_total is approximately O(p_0^{−2}δ^{−2}(D^{−1} + δ/ϵ)). Hence this algorithm is particularly useful when D ≫ ϵ. As the relative overlap approaches 1, δ can be chosen to be arbitrarily small.
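A toy numerical illustration of why the relative overlap can be large even when p_0 and the spectral gap are both small. All numbers below are made up, and we read the relative overlap as the ground-state weight divided by the total spectral weight inside I′ (a simplified stand-in for the elided Eq. (3); it reproduces the value 1 in the gapped example above, where I′ contains only λ_0):

```python
import numpy as np

# Hypothetical spectrum: tiny gap 1e-4 above lambda_0 = 0, a little weight
# between I and I', and the bulk of the weight far above I'.
lams = np.concatenate(([0.0, 1e-4, 0.3], np.linspace(0.5, 1.0, 50)))
p = np.zeros_like(lams)
p[0] = 0.30          # ground-state weight: p_0 = 0.3 is below 0.71
p[1] = 0.005         # tiny weight just above the extremely small gap
p[2] = 0.01          # a little weight between I and I'
p[3:] = 0.685 / 50   # the rest lies far above I'

Ip_hi = 0.4          # right edge of I'; the distance D = 0.2 >> gap = 1e-4
p_r = p[0] / p[lams <= Ip_hi].sum()
print(p_r)  # ~0.95: large relative overlap despite small p_0 and tiny gap
```

This is exactly the regime the text targets: D can be chosen much larger than the gap while p_r stays above the 0.71 threshold.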

I.3. Related works
Based on the generalized uncertainty relation [6], there exists a uniform complexity lower bound for the problem of phase estimation [12]: the square of the error is always Ω(N_s^{−1}N^{−2}) in expectation, where N is the query depth (with τ = 1) and N_s is the number of repetitions. In our case, to estimate the ground-state energy to precision ϵ, we need N_s T_max² = Ω(ϵ^{−2}), where T_max is the maximal runtime. From this perspective, the Hadamard test (with N_s = O(ϵ^{−2}) and T_max = O(1)) and QPE (with N_s = O(1) and T_max = O(ϵ^{−1})) are at the two ends of the spectrum. It is possible to achieve a measurement-depth trade-off by setting N_s = O(ϵ^{−2(1−α)}), T_max = O(ϵ^{−α}) for some 0 < α < 1 [40]. However, the total cost will then be at least N_s T_max = O(ϵ^{α−2}). Hence when α < 1, this strategy does not satisfy the Heisenberg-limited scaling.
Our work is related to robust phase estimation (RPE) in the context of quantum metrology for single-qubit systems, first proposed in Ref. [17]. RPE satisfies the Heisenberg-limited scaling and allows the input state to be an inexact eigenstate, as long as the overlap with the desired eigenstate is larger than a certain constant [20]. Due to these advantages, RPE has been applied in quantum metrology, as well as to other systems that can be viewed as effective single-qubit systems [3,20,26,35,36]. Empirical observations also suggest that the maximal runtime of RPE may be smaller than π/ϵ, but we are not aware of a theoretical analysis of this aspect of the algorithm. It is possible to generalize RPE to perform phase estimation of general n-qubit systems. In turn, QCELS and multi-level QCELS can also be applied to parameter estimation in quantum metrology, with provably short circuit depth.
There are a few other phase estimation algorithms that also use a single ancilla qubit; their efficiency has so far been demonstrated numerically. Ref. [32] develops a post-processing technique to extract eigenvalues from phase estimation data based on classical time-series (or frequency) analysis. Ref. [37] proposes a method that first estimates ⟨ψ|exp(−itH)|ψ⟩ and then performs a classical Fourier transform to estimate the eigenvalues. A very different type of algorithm for ground-state energy estimation is the variational quantum eigensolver (VQE) [27,33,40], which constructs a variational ansatz |ψ(θ)⟩ to approximate the lowest eigenvector |ψ_0⟩; the parameter θ of the ansatz is adjusted to minimize the energy ⟨ψ(θ)|H|ψ(θ)⟩. The advantage of VQE is that the quantum circuit is very simple: short-depth circuits (often without ancilla qubits) suffice to estimate ⟨ψ(θ)|H|ψ(θ)⟩. However, the efficiency and accuracy of VQE largely depend on the representation power of the variational ansatz |ψ(θ)⟩ and on the solver for the resulting non-convex optimization problem. Similar to VQE, there are other algorithms that try to perform phase estimation using the quantum states generated along the time evolution, such as the quantum imaginary time evolution (QITE) algorithm [28] and methods based on the classical Krylov subspace method, such as quantum subspace diagonalization [18,38]. However, these methods also lack provable complexity upper bounds, and existing theoretical analysis of quantum subspace diagonalization methods [10] has not been able to reveal an advantage of such methods over standard QPE methods.
A number of quantum algorithms [1,9,11,24,39,41] have been developed for ground-state energy estimation using the Hamiltonian evolution input model. However, the maximal runtime of all existing works satisfying the Heisenberg-limited scaling is at least C/ϵ for some constant C = Ω(1) that is independent of the overlap p_0. Take the method in Ref. [24] for instance, which uses the same quantum circuit as in Fig. 1 to generate the input data and can estimate λ_0 with Heisenberg-limited scaling for any p_0 > 0. The method uses a Fourier filter to approximate a shifted sign function. To resolve the ground-state energy to precision ϵ, the shifted sign function defined on [−π, π) should make a transition from 1 to 0 within a small interval of size ϵ/2. The maximal runtime is O(ϵ^{−1}), and the preconstant is larger than π (see [24, Appendix A]). A similar mechanism of constructing filtering functions is used in the near-optimal ground-state preparation and ground-state energy estimation algorithm based on the block-encoding input model [23], the quantum eigenvalue transformation of unitary matrices (QETU) using the Hamiltonian evolution input model [9], and the statistical approach with a randomized implementation of Hamiltonian evolution [39].
More recently, Ref. [41] introduced a method that uses the Fourier filtering techniques from [24] to generate a rough estimate λ̃_0 of λ_0 in the first step; it then uses a derivative Gaussian filter around λ̃_0 to refine the estimate of λ_0. The main result [41, Corollary 1.3] is that if the system has a spectral gap ∆, then for any α ∈ [0, 1] the maximal runtime can be chosen to be O(ϵ^{−α}∆^{−1+α}), and the total cost is O(∆^{1−α}ϵ^{−2+α}). When α = 1, this reduces to the previous result in [24], i.e., both the maximal runtime and the total cost scale as O(ϵ^{−1}). When α = 0, the maximal runtime becomes O(∆^{−1}), which can be much smaller than O(ϵ^{−1}) when ∆ ≫ ϵ; this is compensated by an increase in the total cost to O(ϵ^{−2}∆). We show in Corollary 4 that under the same assumption, we may choose δ = ϵ/∆ in Theorem 3, so that the maximal runtime of our method is also O(∆^{−1}) and the total runtime is O(ϵ^{−2}∆). While the maximal runtime allowed by [41, Corollary 1.3] is at least of order ∆^{−1}, our Theorem 3 allows an even shorter maximal runtime of O(1) under proper conditions. For example, in many quantum systems, although the spectral gap ∆ is very small, the relative contribution of the ground state to the initial state is significant in some large interval [λ_0, λ_0 + D) with D = Ω(1) ≫ ∆. Applying the result of Theorem 3 to such a system, the maximal runtime is O(D^{−1}) = O(1). It may still be difficult to estimate this relative overlap in practice. But unlike the spectral gap, the relative overlap condition is aware of the information in the initial state, and this relaxed condition may significantly increase the applicability of our algorithm in practice, especially for certain Hamiltonians with a small spectral gap (see numerical examples in Section V.2).

I.4. Organization
The rest of the paper is organized as follows. In Section II, we introduce QCELS and multi-level QCELS, assuming p_0 is larger than a certain constant threshold; we also provide an intuitive analysis of why QCELS can reduce the maximal runtime when p_0 is close to 1. In Section III, we analyze the maximal runtime and the total cost of our methods. We then extend the method to any p_0 > 0 in Section IV. Numerical simulations of our method are provided in Section V, where we mainly compare our method with QPE, followed by discussions and future directions in Section VI.

II.1. Generating input data from quantum circuit
In Fig. 1, we may (1) set W = I, measure the ancilla qubit, and define a random variable X_n such that X_n = 1 if the outcome is 0 and X_n = −1 if the outcome is 1; or (2) set W = S†, measure the ancilla qubit, and define a random variable Y_n such that Y_n = 1 if the outcome is 0 and Y_n = −1 if the outcome is 1, so that X_n and Y_n satisfy Eqs. (4) and (5), respectively. Given two preset parameters N, N_s > 0 and a time step τ > 0, we use the quantum circuit in Fig. 1 to prepare the data set in Eq. (6), where each Z_n is calculated by running the quantum circuit (Fig. 1) N_s times. More specifically,

$Z_n = \frac{1}{N_s}\sum_{k=1}^{N_s} \left(X_{k,n} + i\, Y_{k,n}\right).$

Here X_{k,n}, Y_{k,n} are independently generated by the quantum circuit (Fig. 1) with different W and satisfy Eqs. (4) and (5), respectively. Hence in the limit N_s → ∞, we have Z_n → ⟨ψ|exp(−it_nH)|ψ⟩. To prepare the data set in Eq. (6), the maximal simulation time is T_max = (N − 1)τ and the total simulation time is N(N − 1)N_s τ/2 = N N_s T_max/2. To reduce the complexity of our algorithm, it suffices to find an efficient way to post-process the data set (6) so that a good approximation to λ_0 can be constructed with a proper choice of N, N_s, and τ.
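The data-generation step can be simulated classically. The sketch below uses a made-up spectrum and the standard Hadamard-test convention Pr(outcome 0) = (1 + Re⟨ψ|e^{−it_nH}|ψ⟩)/2 for W = I and (1 + Im⟨ψ|e^{−it_nH}|ψ⟩)/2 for W = S† (sign conventions for the imaginary part can differ between references):

```python
import numpy as np

# Monte Carlo construction of Z_n from Hadamard-test outcomes: X, Y are the
# means of Ns samples in {+1, -1}, and Z_n = X_n + i*Y_n estimates
# <psi|e^{-i t_n H}|psi> with O(1/sqrt(Ns)) statistical error.
rng = np.random.default_rng(0)
lams = np.array([-0.7, 0.1, 0.9])
p = np.array([0.8, 0.15, 0.05])
tau, N, Ns = 0.5, 20, 4000
t = tau * np.arange(N)
exact = (p[None, :] * np.exp(-1j * np.outer(t, lams))).sum(axis=1)

px = (1 + exact.real) / 2            # Pr(outcome 0) with W = I
py = (1 + exact.imag) / 2            # Pr(outcome 0) with W = S^dagger
X = 2 * rng.binomial(Ns, px) / Ns - 1  # mean of Ns +/-1 outcomes
Y = 2 * rng.binomial(Ns, py) / Ns - 1
Z = X + 1j * Y
print(np.max(np.abs(Z - exact)))     # O(1/sqrt(Ns)) sampling error
```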

II.2. QCELS and its intuitive analysis
Using the data set (6), we define the mean squared error

$L(r,\theta) = \frac{1}{N}\sum_{n=0}^{N-1}\left|Z_n - r e^{-i\theta t_n}\right|^2,$

where r ∈ ℂ and θ ∈ ℝ. The approximation to λ_0 is constructed by minimizing the loss function L(r, θ):

$(r^*,\theta^*) = \arg\min_{r\in\mathbb{C},\,\theta\in\mathbb{R}} L(r,\theta);$

then θ* is an approximation to λ_0, and this defines the QCELS algorithm. Note that once we obtain the data set from the quantum circuit, minimizing L(r, θ) only requires classical computation.
The mean squared error L(r, θ) is a quadratic function with respect to r. For a fixed value of θ, minimization with respect to r gives

$r^*(\theta) = \frac{1}{N}\sum_{n=0}^{N-1} Z_n e^{i\theta t_n}$

and

$\min_{r} L(r,\theta) = \frac{1}{N}\sum_{n=0}^{N-1}|Z_n|^2 - |r^*(\theta)|^2.$

Therefore minimizing L(r, θ) is equivalent to maximizing

$f(\theta) := |r^*(\theta)|^2 = \Big|\frac{1}{N}\sum_{n=0}^{N-1} Z_n e^{i\theta t_n}\Big|^2,$

which is a nonlinear function with respect to θ. According to the definition of Z_n in (7) (see Appendix A for the rigorous statement), E(Z_n) = ⟨ψ|exp(−it_nH)|ψ⟩ = Σ_m p_m e^{−it_nλ_m}. Thus, when N_s ≫ 1 and p_0 ≈ 1, intuitively we have

$f(\theta) \approx p_0^2 \left|\frac{\sin\left(N(\lambda_0-\theta)\tau/2\right)}{N\sin\left((\lambda_0-\theta)\tau/2\right)}\right|^2.$

Figure 3. Landscape of the objective function f(θ) in Eq. (13) and a number of possible choices of the initial guess θ_0, with T_max = 80 and the eight-site TFIM model (see details in Section V.1). Here p_0 = 0.8, and the landscape for other values of p_0 is similar. The BFGS algorithm is used to maximize the objective function f(θ) from different initial guesses θ_0. The error threshold is set to 0.01, meaning the optimization is deemed successful if the error of the resulting estimate is below this threshold.

Recall T_max = Nτ. When N is large enough, the maximum of f(θ) occurs at θ = λ_0, and the closest other local maximum lies at distance at least π/T_max from λ_0. Therefore, to find the maximal value of f(θ) on the interval [−π, π), we may choose a uniform grid of size up to ⌈T_max⌉ and perform gradient ascent from each grid point. By maximizing over the values from all the local maxima, we can robustly find the global maximum of f (and hence the global minimum of the loss function L). As an illustration, Figure 3 shows the landscape of the objective function and compares the optimization results for different initial guesses.
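As a numerical sanity check of this reduction (with arbitrary synthetic data, not data from the paper's experiments), the closed-form r*(θ) agrees with the minimizer of L(·, θ) and reproduces the minimum value:

```python
import numpy as np

# Check: L(r, theta) is quadratic in r, so for fixed theta the optimum is
# r* = (1/N) sum_n Z_n e^{i theta t_n}, with
# min_r L = (1/N) sum_n |Z_n|^2 - |r*|^2.
rng = np.random.default_rng(1)
N, tau, theta = 16, 0.3, 0.42
t = tau * np.arange(N)
Z = rng.normal(size=N) + 1j * rng.normal(size=N)   # arbitrary data

r_star = np.mean(Z * np.exp(1j * theta * t))
L_closed = np.mean(np.abs(Z) ** 2) - np.abs(r_star) ** 2

def L(r):
    return np.mean(np.abs(Z - r * np.exp(-1j * theta * t)) ** 2)

# r* beats nearby perturbations, and the closed-form value matches L(r*).
assert L(r_star) <= L(r_star + 0.01)
assert L(r_star) <= L(r_star + 0.01j)
print(abs(L(r_star) - L_closed))  # ~0 (machine precision)
```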
When N, N_s ≫ 1 and p_0 is sufficiently large, we can show that θ* is a good approximation to λ_0 with a circuit of relatively small depth. The rigorous theoretical results are presented in Section III. Here, we briefly introduce the intuition and leave the details to the next section.
Let {(λ_m, |ψ_m⟩)}_{m=0}^{M−1} be the set of eigenpairs of the Hamiltonian H, and let R_m denote the distance between θ* and λ_m. Our goal is to prove that R_0 is small. When N_s ≫ 1, intuitively we have Z_n ≈ Σ_{m=0}^{M−1} p_m e^{−it_nλ_m}, where p_m = |⟨ψ_m|ψ⟩|² is the overlap between the initial quantum state and the m-th eigenvector.
Hence in the limit N_s → ∞, the loss function can be evaluated with Z_n replaced by Σ_m p_m e^{−it_nλ_m}. Similar to the computation above, we find that minimizing the right-hand side is equivalent to maximizing the function in Eq. (18), which leads to the two equalities in Eq. (19). When the overlap between the initial state and |ψ_0⟩ dominates, i.e., p_0 > Σ_{m=1}^{M−1} p_m, we can use the first equality in Eq. (19) to obtain Eq. (21). Combining Eq. (22) with the second equality of Eq. (19) gives Eq. (23). As a result, Eq. (24) implies that there must exist some m* such that R_{m*} ≤ π/N (up to a factor depending on p_0 − Σ_{m=1}^{M−1} p_m), i.e., θ* must be close to one of the eigenvalues. Since p_0 > Σ_{m=1}^{M−1} p_m, it is reasonable to expect this eigenvalue to be λ_0 (m* = 0), and we first obtain the bound (25). When p_0 is very close to 1, we can further improve the bound (25). Since θ* is the maximizer, Eq. (26) holds, where the last inequality comes from the first equality of (22). Defining δ = R_0 N, this implies

$\frac{\sin(\delta/2)}{\sin(\delta/(2N))} \ge (3p_0 - 2)N.$

When δ < π, a Taylor expansion gives sin(δ/2)/(N sin(δ/(2N))) ≈ 1 − δ²(1 − 1/N²)/24, and combining this with the inequality above yields δ = O(√(1 − p_0)). In other words, it suffices to choose the maximal runtime T_max = δ/ϵ. The intuitive analysis above summarizes why QCELS can estimate λ_0 with a short-depth circuit. The precise statement is given in Theorem 1, and the behavior of the preconstant δ is demonstrated in Fig. 2b.
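The final scaling can be checked numerically: under the kernel inequality sin(δ/2)/sin(δ/(2N)) ≥ (3p_0 − 2)N, the admissible δ indeed scales as √(1 − p_0). A small sketch (the values of N and p_0 are arbitrary):

```python
import numpy as np

# g(delta) = sin(delta/2) / (N sin(delta/(2N))) decreases from 1; requiring
# g(delta) >= 3 p_0 - 2 caps delta, and the cap scales like sqrt(1 - p_0)
# (the Taylor bound predicts delta_max ~ sqrt(72 (1 - p_0)) ~ 8.5 sqrt(1 - p_0)).
N = 200

def g(delta):
    return np.sin(delta / 2) / (N * np.sin(delta / (2 * N)))

ratios = []
for p0 in [0.99, 0.95, 0.9]:
    deltas = np.linspace(1e-4, np.pi, 200000)
    delta_max = deltas[g(deltas) >= 3 * p0 - 2].max()
    ratios.append(delta_max / np.sqrt(1 - p0))
print(ratios)  # roughly constant, consistent with delta = O(sqrt(1 - p_0))
```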

II.3. Multi-level QCELS
Even though QCELS can reduce the maximal runtime, by itself it does not satisfy the Heisenberg-limited scaling. To see this, note that T_max = (N − 1)τ = O(ϵ^{−1}). If τ is a constant, then the total simulation time T_total is Ω(ϵ^{−2}). We may also attempt to choose N to be a constant and let τ = O(ϵ^{−1}). However, the loss function is periodic in θ with period 2π/τ, so we can only determine λ_0 modulo 2π/τ.

In this section, we provide a multi-level QCELS algorithm, which maintains the reduced maximal runtime and satisfies the Heisenberg-limited scaling. Roughly speaking, we will construct a sequence of data sets {D_{H,j}} using an increasing sequence {τ_j}. The maximal simulation time of the algorithm is T_max = Nτ_J, and the total simulation time accumulates over all levels. The parameters in the algorithm should be chosen properly. In particular, the rate at which τ_j increases matters: if τ_j increases too slowly, we need more iteration steps, which increases the total cost; if τ_j increases too rapidly, there may exist more than one candidate estimation interval for λ_0 in each iteration. We propose the choice τ_{j+1} = 2τ_j (see Eq. (35) for the precise choice of {τ_j}); this procedure is similar to Kitaev's algorithm [21]. Each solution of the optimization problem based on D_{H,j} helps us shrink the estimation interval, and finally we obtain a small estimation interval for λ_0. The pseudocode of the multi-level QCELS algorithm is given in Algorithm 1.
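The periodicity obstruction above is easy to verify directly (a tiny sketch with arbitrary values; here the data is an exact eigenstate signal):

```python
import numpy as np

# f(theta) built from t_n = n*tau is 2*pi/tau-periodic in theta, so a single
# level with large tau only determines lambda_0 modulo 2*pi/tau.
lam0, tau, N = 0.3, 8.0, 10
t = tau * np.arange(N)
Z = np.exp(-1j * t * lam0)          # noiseless signal, p_0 = 1

def f(theta):
    return np.abs(np.mean(Z * np.exp(1j * theta * t))) ** 2

period = 2 * np.pi / tau
print(f(lam0), f(lam0 + period), f(lam0 - period))  # all three equal 1
```

Every alias λ_0 + 2πk/τ maximizes f equally well, which is exactly the ambiguity the multi-level schedule resolves by carrying a shrinking estimation interval across levels.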
Algorithm 1 Multi-level quantum complex exponential least squares (excerpt): at each level j, generate the data set in Eqs. (6) and (7) using the circuit in Fig. 1 with t_n = nτ_j.

III. COMPLEXITY ANALYSIS WITH A LARGE OVERLAP
To analyze the complexity of the multi-level QCELS method in Algorithm 1, we need to find an upper bound on the maximal/total simulation time for finding an ϵ-approximation to λ_0. In this section we assume the initial overlap is large, i.e., p_0 > 0.71. The extension to the small-p_0 regime is discussed in the next section.
For each 0 ≤ n ≤ N − 1, we define E_n = Z_n − ⟨ψ|exp(−it_nH)|ψ⟩, which corresponds to the error that occurs in the expectation estimation (7). Note that {E_n} are independent complex random variables with zero expectation and bounded magnitude. Using classical probability theory, we can give a sharp tail bound for E_n with respect to N, N_s, which is important for deriving the choice of N, N_s in our algorithm. The detailed discussion and the tail bounds for E_n can be found in Appendix A. Using a proper tail bound for E_n, we can analyze the performance of QCELS in Theorem 1, which also corresponds to a single iteration of Algorithm 1.
Theorem 1 (Complexity of QCELS, informal). Let θ* be the solution of QCELS in Eq. (10). Choose δ according to Eq. (32) and N, N_s according to Eq. (33); then the error bound in Eq. (34) holds with high probability.

The precise statement of Theorem 1 and the proof are given in Appendix B. To prove Eq. (34), two parts of the error need to be controlled. First, as discussed before, we should control E_n by increasing the number of samples N_s so that it does not change the loss function too much; this is particularly important as p_0 → 1. When condition (33) is satisfied, the probability of a large deviation is small. The second part of the error comes from the pollution by eigenvalues other than λ_0 when p_0 < 1. As a result, δ cannot be arbitrarily small and needs to satisfy the relation in Eq. (32) as p_0 → 1.
Theorem 1 can be used to describe the maximal runtime of Algorithm 1. Using Eq. (34), we obtain several candidate estimation intervals for λ_0 after solving each minimization problem. On the other hand, we can choose τ_j properly so that only one of these candidates survives each iteration. After eliminating the other candidates, the estimate in Eq. (34) can be written directly as |θ* − λ_0| < δ/T_max, which implies that T_max = δ/ϵ is enough to obtain ϵ precision in our algorithm. We are now ready to introduce the choice of the parameters and the main complexity result for Algorithm 1:

Theorem 2 (Complexity of multi-level QCELS, informal). Let θ* be the output of Algorithm 1. Given p_0 > 0.71, 0 < η < 1/2, and 0 < ϵ < 1/2, we can choose δ according to Eq. (32) and the remaining parameters N, N_s, J, τ_j accordingly; then θ* approximates λ_0 to precision ϵ with probability at least 1 − η.

The precise statement of Theorem 2 and the proof are given in Appendix C. Theorem 2 shows that as p_0 → 1, the multi-level QCELS algorithm satisfies the Heisenberg-limited scaling, and the maximal runtime can be much smaller than π/ϵ. On the other hand, there is a trade-off between the maximal simulation time and the total simulation time. In particular, N_s N = Θ(δ^{−2}) diverges as δ → 0. This implies that, although T_total achieves the Heisenberg-limited scaling, the preconstant may become too large if the circuit depth is forced to be very small.

IV. GROUND-STATE ENERGY ESTIMATION WITH A SMALL INITIAL OVERLAP
When p_0 is smaller than the threshold value of 0.71, our strategy is to find a way to "increase p_0" in the input data. If the system has a spectral gap ∆ = λ_1 − λ_0 ≫ ϵ, we can use the algorithm from a previous work [24] to construct an eigenvalue filter that effectively filters out the contribution above λ_0 + ∆/2 in the initial state, using a circuit with maximal runtime Θ(∆^{−1}). The effective value of p_0 in the filtered data can then be approximately 1, and the multi-level QCELS algorithm becomes applicable.
The spectral gap is a property of the Hamiltonian, and for many quantum systems of interest it can be very small. Fortunately, since QCELS can accurately estimate eigenvalues starting from an inexact eigenstate, the filtering step does not need to be perfect even when p_0 is small. Consider an interval I containing λ_0 and a larger interval I′ ⊃ I, and define the distance D separating I and (I′)^c as in Eq. (36). Then the relative overlap of the initial vector |ψ⟩ with the ground state (as defined in Eq. (3)), denoted p_r(I, I′), plays the role of the effective value of p_0. Specifically, if p_r(I, I′) ≥ 0.71, we can effectively filter out the contribution from (I′)^c in the initial state using the algorithm in [24] with maximal runtime Θ(D^{−1}). After the filtering operation, the relative overlap plays the role of p_0 in the previous section, and we can apply the multi-level QCELS algorithm to estimate λ_0 with respect to the filtered data. The concept of the relative overlap may allow us to estimate the ground-state energy of certain small-gap systems with a short-depth circuit, especially when p_r(I, I′) ≈ 1 and D is much larger than the spectral gap ∆. Furthermore, unlike the spectral gap assumption, which is a property of the Hamiltonian, the relative overlap is a property of the initial state. This introduces flexibility in the initial state design that may be useful for future explorations.
2. Eigenvalue filtering to remove the high-energy contribution: Define a polynomial $F_q(x) = \sum_{l=-d}^{d} \hat{F}_{l,q}\exp(ilx)$ satisfying Eq. (37). We use [24, Lemma 6] to construct $F_q$ that satisfies (37) with $d = \Theta(D^{-1}\,\mathrm{polylog}(q^{-1}))$ and $|\hat{F}_{l,q}| = \Theta(|l|^{-1})$.

3. Refined estimation of $\lambda_0$ with multi-level QCELS: We apply Algorithm 1 with the filtered data set (see Algorithm 2 for details) to obtain an accurate estimate of the ground-state energy.

Define $\hat{F}_{l,q} = |\hat{F}_{l,q}|e^{i\phi_{l,q}}$ and $\beta_l = |\hat{F}_{l,q}|$. The main algorithm is summarized in Algorithm 2. Here the "DataGenerator" is used to filter out the high-energy contribution. According to the construction of $F_q$ and $Z_{n,q}$, the new data set successfully removes the high-energy contribution when $q \ll 1$ and $p_r(I, I') \approx 1$, and the solution $\theta^*$ to the optimization problem (39) should then be a good approximation to $\lambda_0$. We also need a sequence of data sets $\{(n\tau_j, Z_{n,q})\}_{n=0}^{N-1}$ with an increasing sequence $\{\tau_j\}_{j=1}^{J}$ to shrink the estimation interval while keeping the correct candidate. As in Algorithm 1, we choose $\tau_{j+1} = 2\tau_j$ (see Theorem 3 for the precise choice of $\tau_j$ and $J$).
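The action of the "DataGenerator" can be emulated classically: since $F_q(x) = \sum_l \hat{F}_{l,q} e^{ilx}$, taking the corresponding linear combination of Hadamard-test data at shifted times multiplies each spectral weight $p_k$ by $F_q(\lambda_k)$. The sketch below uses a Gaussian low-pass as a stand-in for the filter of [24, Lemma 6]; the spectrum and all constants are illustrative, not the paper's construction.

```python
import numpy as np

# toy spectrum with small p0 = 0.3 on lam0 = -0.9
evals = np.array([-0.9, 0.2, 0.7]); pks = np.array([0.3, 0.5, 0.2])
tau, N, d = 1.0, 5, 40

def Z(t):  # noiseless Hadamard-test signal <psi| exp(-i t H) |psi>
    return np.sum(pks * np.exp(-1j * evals * t))

# Fourier coefficients of a smooth low-pass F(x) ~ 1 near lam0, ~ 0 at higher
# energies (a Gaussian stand-in for the filter of [24, Lemma 6])
xs = np.linspace(-np.pi, np.pi, 4001)
Fx = np.exp(-((xs + 0.9) ** 2) / (2 * 0.15 ** 2))
ls = np.arange(-d, d + 1)
dx = xs[1] - xs[0]
Fhat = (Fx[None, :] * np.exp(-1j * ls[:, None] * xs)).sum(axis=1) * dx / (2 * np.pi)

# filtered data: since F(x) = sum_l Fhat_l e^{ilx}, combining shifted-time data
# gives Z_{n,q} = sum_l Fhat_l Z(n tau - l) = sum_k p_k F(lam_k) e^{-i lam_k n tau}
Zq = np.array([np.sum(Fhat * np.array([Z(n * tau - l) for l in ls]))
               for n in range(N)])
Zq = Zq / np.abs(Zq[0])   # dominant mode is now lam0 despite p0 = 0.3
print(np.angle(Zq[1] / Zq[0]))  # close to -lam0 * tau = 0.9
```

The key identity in the comment shows why the filtered data behaves like data generated from a state with effective overlap $\approx p_r(I, I')$.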

5: Run the quantum circuit (Figure 1) with $t_n = r + n\tau$ and $W = I$ to obtain $X_{k,n}$.

6: Run the quantum circuit (Figure 1) with $t_n = r + n\tau$ and $W = S^\dagger$ to obtain $Y_{k,n}$.

IV.2. Complexity analysis
In this section, we analyze the complexity of Algorithm 2 and show that it reduces the circuit depth while maintaining the Heisenberg-limited scaling. Define the expectation estimation error $E_{n,q}$ and the total error $G_{n,q}$. Using (37), we obtain a bound on $|G_{n,q}|$ in terms of $|E_{n,q}|$, $q$, and $1 - p_r(I, I')$. By choosing a large $N_s$ to reduce the expectation error $|E_{n,q}|$, increasing the quality of the filter to reduce the approximation error $q$, and assuming $p_r(I, I') \approx 1$, we can effectively reduce the error $G_{n,q}$. These choices let us find a good approximation of $\lambda_0$ by solving the optimization problem. In Algorithm 2, constructing the loss function involves two steps. First, to construct the eigenvalue filter, the required circuit depth is $d = \Theta(D^{-1})$. We then combine the eigenvalue filter with Algorithm 1 to construct the filtered data set, which increases the circuit depth to $T_{\max} = d + \delta/\epsilon = \Theta(D^{-1}) + \delta/\epsilon$. To construct a sufficiently accurate loss function, the total number of repetitions satisfies $N_s N = \Theta\big(p_0^{-2}\delta^{-(2+o(1))}\big)$. This implies the total evolution time $T_{\mathrm{total}} = \Theta\big(p_0^{-2}\delta^{-(2+o(1))}(D^{-1} + \delta/\epsilon)\big)$. The result is summarized in the following theorem:

Theorem 3 (Complexity of Algorithm 2, informal). Given any failure probability $0 < \eta < 1$, target precision $0 < \epsilon < 1/2$, and knowledge of the relative overlap $p_r(I, I')$, we can choose $d$, $N$, $N_s$, and $J$, $\tau_j$ according to Theorem 2 with $T = N\tau_J = \delta/\epsilon$. Then $|\theta^* - \lambda_0| \le \epsilon$ with probability at least $1 - \eta$, where $\theta^*$ is the output of Algorithm 2. In particular, the maximal evolution time is $T_{\max} = d + \delta/\epsilon = \Theta(D^{-1}) + \delta/\epsilon$ and the total evolution time is $T_{\mathrm{total}} = \Theta\big(p_0^{-2}\delta^{-(2+o(1))}(D^{-1} + \delta/\epsilon)\big)$.

V. NUMERICAL SIMULATION
In this section, we numerically demonstrate the efficiency of our method using two different models. In Section V.1, we assume a large initial overlap and compare the performance of Algorithm 1 with QPE for a transverse-field Ising model. In Section V.2, we assume the initial overlap is small and compare the performance of Algorithm 2 with QPE for a Hubbard model. The Hamiltonians are constructed using the QuSpin package [42]. Our numerical experiments are available on GitHub (https://github.com/zhiyanding/QCELS).
In our numerical experiments, we normalize the spectrum of the original Hamiltonian $H$ so that its eigenvalues belong to $[-\pi/4, \pi/4]$; the normalized Hamiltonian is defined in Eq. (41). We then use the QCELS-based Algorithm 1 or Algorithm 2, as well as QPE, to estimate the smallest eigenvalue of $H$ and measure the error accordingly.
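One standard shift-and-scale normalization mapping the spectrum onto $[-\pi/4, \pi/4]$ is sketched below (the paper's exact Eq. (41) may use different constants or spectral bounds; in practice one would use upper and lower bounds on the spectrum rather than exact eigenvalues).

```python
import numpy as np

def normalize_hamiltonian(H):
    """Shift and scale H so its spectrum lies in [-pi/4, pi/4].
    Assumed form (the paper's Eq. (41) may differ): uses exact spectral
    endpoints for illustration; real workflows use norm bounds instead."""
    evals = np.linalg.eigvalsh(H)
    lo, hi = evals[0], evals[-1]
    return (np.pi / 4) * (2 * H - (lo + hi) * np.eye(len(H))) / (hi - lo)

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
H = (A + A.T) / 2                     # random symmetric test Hamiltonian
Hn = normalize_hamiltonian(H)
print(np.linalg.eigvalsh(Hn)[[0, -1]])  # endpoints map to -pi/4 and pi/4
```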

V.1. Ising model
Consider the one-dimensional transverse-field Ising model (TFIM) defined on $L$ sites with periodic boundary conditions:
$$H = -\sum_{i=1}^{L} Z_i Z_{i+1} - g\sum_{i=1}^{L} X_i,$$
where $g$ is the coupling coefficient, $Z_i, X_i$ are Pauli operators for the $i$-th site, and the dimension of $H$ is $2^L$. We choose $L = 8$, $g = 4$. We apply Algorithm 1 (referred to as QCELS for simplicity in this subsection) and QPE to estimate $\lambda_0$ of the normalized Hamiltonian $H$ (see Eq. (41)). In the test, we set $p_0 = 0.6, 0.8$ and run QCELS (with $N = 5$, $N_s = 100$) and QPE 10 times each to compare the averaged error. The comparison of the results is shown in Fig. 4. The errors of both QPE and QCELS are proportional to the inverse of the maximal evolution time ($T_{\max}$), but the constant factor $\delta = T\epsilon$ of QCELS is much smaller than that of QPE. Fig. 4 shows that QCELS reduces the maximal evolution time by two orders of magnitude, even in the case $p_0 = 0.6$, which is smaller than the theoretical threshold $0.71$. This suggests that the numerical performance of QCELS can be significantly better than the theoretical prediction of Theorem 2. The error of QPE is observed to scale as $6\pi/T$. Moreover, the total evolution time ($T_{\mathrm{total}}$) of QCELS is also smaller than that of QPE by nearly an order of magnitude.

Figure 4. QPE vs QCELS for the TFIM model with 8 sites and a large initial overlap ($p_0 = 0.6, 0.8$). Left: depth ($T_{\max}$); right: cost ($T_{\mathrm{total}}$). For QCELS, we choose $N = 5$, $N_s = 100$; $J$ and $\tau_j$ are chosen according to Theorem 2. Both methods have errors scaling linearly in $1/T_{\max}$, but the constant factor $\delta = T\epsilon$ of QCELS is much smaller than that of QPE.
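For reference, the TFIM Hamiltonian can be built directly with Kronecker products and diagonalized exactly at this size (the common sign convention $H = -\sum_i Z_iZ_{i+1} - g\sum_i X_i$ with periodic boundaries is assumed here; the paper constructs $H$ with QuSpin).

```python
import numpy as np
from functools import reduce

I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.diag([1.0, -1.0])

def op(single, site, L):
    """Embed a single-qubit operator at `site` in an L-qubit chain."""
    return reduce(np.kron, [single if j == site else I2 for j in range(L)])

def tfim(L, g):
    """H = -sum_i Z_i Z_{i+1} - g sum_i X_i with periodic boundary
    (sign convention assumed; dimension 2^L)."""
    return sum(-op(Z, i, L) @ op(Z, (i + 1) % L, L) - g * op(X, i, L)
               for i in range(L))

H = tfim(8, 4.0)                 # L = 8, g = 4 as in the experiment
lam = np.linalg.eigvalsh(H)
print(lam[0])                    # ground-state energy before normalization
```

At $g = 4$ the system is deep in the paramagnetic phase, so $\lambda_0 \approx -gL - L/(4g)$ by second-order perturbation theory, about $-32.5$ here.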

Consider the one-dimensional Hubbard model defined on $L$ spinful sites with open boundary conditions:
$$H = -t\sum_{\langle i,j\rangle,\sigma}\left(c^\dagger_{i,\sigma}c_{j,\sigma} + \mathrm{h.c.}\right) + U\sum_{j=1}^{L}\left(n_{j,\uparrow}-\tfrac12\right)\left(n_{j,\downarrow}-\tfrac12\right).
$$
Here $c_{j,\sigma}$ ($c^\dagger_{j,\sigma}$) denotes the fermionic annihilation (creation) operator on site $j$ with spin $\sigma$, $\langle\cdot,\cdot\rangle$ denotes pairs of adjacent sites, and $n_{j,\sigma} = c^\dagger_{j,\sigma}c_{j,\sigma}$ is the number operator. We choose $L = 4, 8$, $t = 1$, $U = 10$. To implement Algorithm 2 (also referred to as QCELS for simplicity in this subsection) and QPE, we normalize $H$ according to Eq. (41) and choose a small initial overlap ($p_0 = 0.1, 0.4$). Following the method in Section IV, we first use the algorithm in [24] to find a rough estimate $\lambda_{\mathrm{prior}}$ of $\lambda_0$ such that $|\lambda_{\mathrm{prior}} - \lambda_0| \le D/2$, where $D$ is chosen properly so that the relative overlap satisfies $p_r(I, I') > 0.75$ with intervals $I = [-\pi, \lambda_{\mathrm{prior}} + D/2]$ and $I' = [-\pi, \lambda_{\mathrm{prior}} + 3D/2]$. In our test, we set $D = (\lambda_K - \lambda_0)/4$, where $K$ is the smallest index such that $\sum_{k=1}^{K} p_k > p_0/3$. We find that the (normalized) relative gap $D$ is $0.63$ and $0.26$ for $L = 4, 8$, respectively. This is significantly larger than the spectral gap, which is $0.018$ and $0.005$ for $L = 4, 8$, respectively.
After obtaining the rough estimate $\lambda_{\mathrm{prior}}$, we construct the eigenvalue filter $F_q$ according to [24, Lemma 6] to separate $I$ and $I'$. Noticing $\operatorname{dist}(I, (I')^c) = D$, we set $d = \lfloor 15/D \rfloor$ to ensure a small enough approximation error $q$. We run QCELS with $N = 5$ and $N_s = 15\,p_0^{-2}\log(d)$, and QPE, 5 times each to compare the averaged error. In both figures, it can be seen that the maximal evolution time of QCELS is almost two orders of magnitude smaller than that of QPE. The total costs of the two methods are comparable when $p_0 = 0.4$, and the total cost of QCELS is larger than that of QPE when $p_0 = 0.1$, mainly due to the increase in the number of repetitions $N_s$. We note that, for small $p_0$, since we first construct the eigenvalue filter $F_q$, the circuit depth of QCELS is at least $d = \lfloor 15/D \rfloor$. Thus, it is reasonable to choose $\tau_J \ge d$, which directly ensures a relatively small error ($\epsilon \le 10^{-2}$) in our case. In other cases, when the gap $D$ is large and only low accuracy is needed, it may be possible to further reduce the circuit depth.

VI. DISCUSSION
Due to its relatively transparent circuit structure and minimal number of required ancilla qubits, the quantum circuit in Fig. 1 is suitable for early fault-tolerant quantum devices, and has received significant attention for performing a variety of tasks on quantum computers. Note that all algorithms in this paper (including the filtering algorithm in [24]) use the same circuit; the only difference is in the post-processing procedure. This paper finds that the circuit in Fig. 1 is even more powerful than previously thought for phase estimation and ground-state energy estimation, especially when the initial overlap $p_0$ or the relative overlap $p_r$ is large. The advantage of our method can be theoretically justified when $p_0$ or $p_r$ approaches $1$. Numerical results show that even when $p_0$ or $p_r$ is away from $1$ (e.g., $0.8$), our algorithms can still outperform QPE and reduce the maximal runtime (and hence the circuit depth) by around two orders of magnitude.
Viewed more broadly, post-processing the quantum data from the circuit in Fig. 1 is a signal processing problem with a simple (complex) exponential fitting function. Many methods have been developed in the context of classical signal processing for similar purposes (see, e.g., [5,7,8,15]). We think at least two features distinguish the quantum setting from the classical counterpart: (1) it is a priority to reduce the maximal runtime, and (2) each data point in the signal is inherently noisy, and the total number of measurements needs to be carefully controlled. While these classical data processing methods can be applied to the phase estimation problem, we are not yet aware of analytical results demonstrating the efficiency of such methods in the quantum setting. Such connections could be an interesting direction to explore in the future.
When the initial overlap $p_0$ is small, we have combined QCELS with the Fourier filtering algorithm in [24] to effectively amplify this overlap, as shown in Algorithm 2. Another natural choice is the quantum eigenvalue transformation of unitary matrices (QETU) [9], a more powerful and slightly more complex circuit than that in Fig. 1, to amplify the overlap with the ground state. While we have demonstrated applications of QCELS-based algorithms for estimating ground-state energies, such algorithms may be useful in a much wider context, such as estimating excited-state energies and other observables [43]. Simultaneous estimation of multiple eigenvalues using the same circuit is another interesting topic, which may open the door to efficient algorithms for a broader class of quantum systems with small spectral gaps.
• (Lemma 6, Eq. (A6)): Using the boundedness and independence of $\{E_{k,n}\}_{k,n}$, we first show the following lemma. Since the $|E_n|$ are bounded by $2$ and independent of each other, Hoeffding's inequality yields a tail bound; combining this inequality with $\mathbb{E}(|E_n|) \le 2$ gives (A6).
We note that the magnitude bound in Lemma 5 (A4) is stronger than what we need in the analysis of the optimization problem (10). Intuitively, for fixed $\theta$, with high probability we obtain a bound of order $(N N_s)^{-1/2}$ (up to logarithmic factors), which is much better than (A4). However, we cannot directly use sub-Gaussian properties in this case, since $\theta$ is not fixed during the optimization process. On the other hand, we expect that when $N, N_s$ are chosen properly, $\theta^*$ belongs to a tiny interval around $\lambda_0$, meaning that it is not necessary to give a uniform bound on $E_\theta$ for all $\theta$. In particular, to control the effect of the error term, it suffices to bound $E_\theta$ when $\theta$ is close to $\lambda_0$. This bound is stated in the following lemma:

Lemma 6. Given $0 < \eta < 1/2$ and $\rho > 0$, the bounds (A5) and (A6) hold with probability at least $1 - \eta$.

It is straightforward to see that the $\{a_{n,m}\}$ are independent random variables with zero expectation and $|a_{n,m}| \le 2$. Then, by sub-Gaussian concentration, for any $\xi > 0$ we obtain a tail bound, and a similar bound also holds for $\{b_{n,m}\}$. Thus, choosing $N_s N$ large enough proves (A5). Next, to prove (A6), we first consider the tail bound of the corresponding sum. Note that the $\{a_{n,m}\}$ here are independent random variables with zero expectation and $|a_{n,m}| \le \frac{2n\rho}{N}$. Then, according to Hoeffding's inequality, for any $\xi > 0$ we obtain the analogous tail bound, and as before, choosing $N_s N$ large enough proves (A6).

B.1. Rough estimate
The rough complexity estimate is stated in the following proposition:

Proposition 8. Let $\theta^*$ be the solution of Eq. (10), $\alpha$ be defined in Eq. (B1), $T = N\tau$, and $p_0 > 0.71$. Given the depth constant $0 < \delta \le 4$ and the failure probability $0 < \eta < 1/2$, suppose there exists a small enough number $\xi > 0$ satisfying (B3), and that $N, N_s$ satisfy (B4); then the error bound of the proposition holds.

According to the proposition, we can choose $\delta$, $N$, and $N_s$ according to the inequality (B3). First, we use the parameter $\xi$ in (B3) to represent an upper bound on $|E_n|$ (defined in Eq. (31)), and $N, N_s$ are chosen according to the results in Appendix A so that $E_n$ satisfies this upper bound. Second, when $\delta$ is very small, to make sure $p_0$ satisfies (B3), we need $p_0 = 1 - O(\delta^2)$, which implies (32) in Theorem 1. Finally, the lower bound $p_0 > 0.71$ comes from Eq. (B8) later in the proof of Proposition 8. More specifically, to obtain (B8), we need $\frac{p_0}{(1+\alpha)p_0 - \alpha - 10^{-3}} \le 2$, which implies $p_0 > 0.71$.
We first rewrite the loss function (9). Notice that for any fixed $\theta$, minimizing $L(r, \theta)$ over $r$ is equivalent to maximizing the magnitude of the following function (cf. Eq. (13)):
$$f(\theta) = \frac{1}{N}\sum_{n=0}^{N-1} Z_n \exp(i\theta n\tau).$$
Step 1: Lower bound for $|f(\lambda_0)|$. Using Eq. (E4) in Appendix E.2, we obtain the desired lower bound.
Step 2: Loose upper bound for $R_0$.
We claim that (B7) holds for $\alpha$ in Eq. (B1). Suppose the claim does not hold; then, combining this with Eq. (E1) in Appendix E.1, we would obtain $|f(\theta^*)| < |f(\lambda_0)|$ in the second-to-last inequality, contradicting the fact that $|f(\theta^*)|$ is the maximum. Thus, we must have (B7).
The main idea is to bound the expectation error $E_n$ in a different way. To achieve a better bound for the error term, instead of bounding $E_{\theta^*}$ and $E_{\lambda_0}$ separately, we bound the difference of these two error terms using (A6). Intuitively, when $\theta^*$ and $\lambda_0$ are close to each other, the two error terms are likely to cancel when we compare $|f(\theta^*)|$ and $|f(\lambda_0)|$. This intuition is justified by Lemma 6 (A6). Under a suitable a priori bound, it suffices to control $E_{\theta^*} - E_{\lambda_0}$ at level $\xi$ with high probability. Formally, comparing this requirement with the first inequality of (B4), we can reduce the blow-up rate to $O(\delta^{-2})$ as $\delta \to 0$, which matches the condition in Theorem 7. However, the above calculation assumes knowledge of $T$, which is unknown to us a priori. To overcome this difficulty, we use an iteration argument to achieve the desired order. We first show the following lemma to start the iteration:

Lemma 9. Given $0 < \delta \le 4$ and $0 < \eta < 1/2$, define $T = N\tau$ and assume the conditions of Proposition 8 are satisfied. If there exists $0 < \xi_1 < \xi$ such that (B13) and (B14) hold, then there exists $\delta_{\mathrm{new}} \in (0, \delta)$ satisfying the improved bound, where $\delta_{\mathrm{new}}$ is the unique solution to Eq. (B15).

Combining Lemma 6 (A6) (setting $\rho = \delta$) with (B14) and $p_0 > 0.71$, we obtain the stated bound with probability $1 - \eta$. From (B16), it suffices to prove $|(\theta^* - \lambda_0) \bmod [-\pi/\tau, \pi/\tau]| < \frac{\delta_{\mathrm{new}}}{T}$. Because $\theta^*$ is the maximal point, using (B5) and the result in Appendix E.2 we obtain one inequality; using (B16), we obtain another, where we use $E_{\theta^*} - E_{\lambda_0} < p_0\xi_1$ in the second inequality and $|\exp(iR_0(N-1)/2) - 1| \le \delta/2$ according to the second inequality of (B16). The estimates in Eqs. (B17) and (B18) can then be combined; since the relevant function $h_2$ is linearly decreasing with $0 < h_2(0) < 1$, $\delta_{\mathrm{new}}$ is the unique solution satisfying (B15).
Eq. (B14) suggests a direction for reducing the scaling of $N, N_s$ with respect to $\delta$. If we ignore (B3), then to guarantee a short depth $\delta_{\mathrm{new}}/\epsilon$, we can choose $\delta \approx \delta_{\mathrm{new}}$, $\xi \sim \delta_{\mathrm{new}}$, and $\xi_1 \sim \delta_{\mathrm{new}}^2$, and it then suffices to choose $NN_s \sim \Theta(\delta_{\mathrm{new}}^{-2})$. This gives the desired order. However, this argument cannot be applied directly, because we also need $\xi = O(\delta^2)$ and $NN_s = \Omega(\xi^{-2})$ according to (B3) and (B4); these two requirements would bring back the original $\delta_{\mathrm{new}}^{-4}$ dependence of $NN_s$.
Even though the previous argument cannot be applied directly, we can still use Lemma 9 to improve the scaling with respect to $\delta$ in the following way. According to Lemma 9 and (B13), for fixed small $\delta_{\mathrm{new}}$, to make the depth smaller than $\delta_{\mathrm{new}}/\epsilon$, we should choose the parameters appropriately for all $1 \le i \le M$. Then, according to the first inequality of (B4) and (B14), we set $NN_s$ accordingly. Minimizing this over $\delta, \xi, \xi_1$ with fixed $\delta_1$ yields a proper choice of these parameters in terms of $\delta_{\mathrm{new}}$.
This choice of parameters reduces the blow-up rate of $NN_s$ as $\delta_{\mathrm{new}} \to 0$. Using a similar argument, we can apply Lemma 9 repeatedly: in each iteration we slightly reduce the scaling of $NN_s$, and the final scaling is $\mathcal{O}(\delta^{-2+o(1)})$. This iteration process is carried out using the following lemma:

Lemma 10. Given $0 < \delta \le 4$ and $0 < \eta < 1/2$, define $T = N\tau$. Given an integer $M > 1$, a decreasing sequence $\{\xi_i\}_{i=0}^{M}$ with small enough $\xi_0$, and a decreasing sequence $\{\delta_i\}_{i=0}^{M}$, assume the conditions of Proposition 8 are satisfied with $\xi = \xi_0$ and $\delta = \delta_0$, and that (B21) and (B22) hold.

Proof of Lemma 10. First, according to Lemma 9 and Proposition 8, conditions (B21) and (B22) ensure the base case of the iteration. Combining the second inequality of (B21) with Lemma 6 (A6), we also obtain the corresponding bound.
where $\delta_{\mathrm{new}}$ is the unique solution to the corresponding equation (cf. (B15)).

and
$$|G_{n,q}| \le |E_{n,q}| + q + \frac{\big(1 - p_r(I,I')\big)\,p_0}{p_r(I,I')}. \tag{D8}$$
Similar to the discussion in Appendix B, we cannot directly prove Theorem 13 using Lemma 15; we need a different way of bounding the expectation error $E_{n,q}$ to reduce the scaling of $N, N_s$ with respect to $\delta$. First, we show the following lemma, analogous to Lemma 9, where we use $q \le p_0\delta^2/1600$ and $\frac{1-p_r(I,I')}{p_r(I,I')} \le \delta^2/800$ in the last inequality. Also, using (D12), we obtain the analogous bound.

Figure 1. Quantum circuit used for collecting the input data. H is the Hadamard gate, $t_n = n\tau$. Choosing $W = I$ or $W = S^\dagger$ ($S$ is the phase gate) allows us to estimate the real or the imaginary part of $\langle\psi|\exp(-it_nH)|\psi\rangle$.
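The statistics of this circuit can be emulated classically for small systems: each shot returns $\pm 1$ with mean $\mathrm{Re}\,\langle\psi|e^{-it_nH}|\psi\rangle$ (for $W = I$) or $\mathrm{Im}\,\langle\psi|e^{-it_nH}|\psi\rangle$ (for $W = S^\dagger$, up to a sign convention). The sketch below, with names and the toy Hamiltonian of our choosing, shows the $O(1/\sqrt{N_s})$ statistical error per data point.

```python
import numpy as np

rng = np.random.default_rng(1)

def hadamard_test_samples(H, psi, t, Ns, imag=False):
    """Emulate Ns shots of the Hadamard test: outcome +/-1 with
    P(+1) = (1 + Re s)/2 for W = I, or (1 + Im s)/2 for W = S^dagger
    (sign convention assumed), where s = <psi| exp(-i t H) |psi>."""
    evals, V = np.linalg.eigh(H)
    amps = V.conj().T @ psi
    s = np.sum(np.abs(amps) ** 2 * np.exp(-1j * evals * t))
    mean = s.imag if imag else s.real
    outcomes = rng.choice([1.0, -1.0], size=Ns,
                          p=[(1 + mean) / 2, (1 - mean) / 2])
    return outcomes.mean(), s

# two-level toy Hamiltonian and initial state
H = np.array([[0.3, 0.1], [0.1, -0.5]])
psi = np.array([1.0, 0.0])
X, s = hadamard_test_samples(H, psi, t=2.0, Ns=20000)            # Re part
Y, _ = hadamard_test_samples(H, psi, t=2.0, Ns=20000, imag=True)  # Im part
print(abs((X + 1j * Y) - s))   # statistical error, O(1/sqrt(Ns))
```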

Figure 3. Landscape of the objective function $f(\theta)$ in Eq. (13), and a number of possible choices of the initial guess $\theta_0$, with $T_{\max} = 80$, for the eight-site TFIM model (see Section V.1 for details). Here $p_0 = 0.8$; the landscape for other values of $p_0$ is similar. The BFGS algorithm is used to maximize the objective function $f(\theta)$ for different initial guesses $\theta_0$. The error threshold is set to $0.01$, meaning the optimization problem is considered successfully solved if $|\theta^* - \operatorname{argmax}_\theta f(\theta)| < 0.01$.
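This optimization can be reproduced in miniature: with a toy two-mode signal at $T_{\max} = N\tau = 80$, BFGS started inside the dominant basin recovers $\lambda_0$ well below the $0.01$ threshold, while a distant initial guess may be trapped in a spurious local maximum. The signal and initial guesses below are our illustrative choices; since scipy minimizes, we negate $|f|$.

```python
import numpy as np
from scipy.optimize import minimize

lam0, p0, tau, N = -0.8, 0.8, 1.0, 80     # Tmax = N * tau = 80 as in Fig. 3
n = np.arange(N)
Z = p0 * np.exp(-1j * lam0 * n * tau) + (1 - p0) * np.exp(-1j * 0.5 * n * tau)

def neg_abs_f(theta):
    """-|f(theta)| with f(theta) = (1/N) sum_n Z_n e^{i theta n tau} (Eq. (13))."""
    return -np.abs(np.mean(Z * np.exp(1j * theta * n * tau)))

# an initial guess inside the main lobe around lam0 converges to lam0 ...
theta_star = minimize(neg_abs_f, x0=-0.75, method="BFGS").x[0]
print(theta_star)
# ... while a distant guess may be trapped in a spurious local maximum
theta_bad = minimize(neg_abs_f, x0=0.4, method="BFGS").x[0]
```

The main lobe of $|f|$ has width $\sim \pi/T_{\max}$, which is why the choice of $\theta_0$ matters at large $T_{\max}$, exactly the phenomenon shown in Figure 3.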

The results are shown in Fig. 5 (4 sites) and Fig. 6 (8 sites).