Even more efficient quantum computations of chemistry through tensor hypercontraction

We describe quantum circuits with only $\widetilde{\cal O}(N)$ Toffoli complexity that block encode the spectra of quantum chemistry Hamiltonians in a basis of $N$ arbitrary (e.g., molecular) orbitals. With ${\cal O}(\lambda / \epsilon)$ repetitions of these circuits one can use phase estimation to sample in the molecular eigenbasis, where $\lambda$ is the 1-norm of Hamiltonian coefficients and $\epsilon$ is the target precision. This is the lowest complexity that has been shown for quantum computations of chemistry within an arbitrary basis. Furthermore, up to logarithmic factors, this matches the scaling of the most efficient prior block encodings that can only work with orthogonal basis functions diagonalizing the Coulomb operator (e.g., the plane wave dual basis). Our key insight is to factorize the Hamiltonian using a method known as tensor hypercontraction (THC) and then to transform the Coulomb operator into an isospectral diagonal form with a non-orthogonal basis defined by the THC factors. We then use qubitization to simulate the non-orthogonal THC Hamiltonian, in a fashion that avoids most complications of the non-orthogonal basis. We also reanalyze and reduce the cost of several of the best prior algorithms for these simulations in order to facilitate a clear comparison to the present work. In addition to having a lower asymptotic spacetime volume scaling, compilation of our algorithm for challenging finite-sized molecules such as FeMoCo reveals that our method requires the least fault-tolerant resources of any known approach. By laying out and optimizing the surface code resources required of our approach we show that FeMoCo can be simulated using about four million physical qubits and under four days of runtime, assuming $1\,\mu$s cycle times and physical gate error rates no worse than $0.1\%$.

The quantum computation of quantum chemistry is commonly regarded as one of the most promising applications of quantum computers [1][2][3].
This is because there are many applications of quantum chemistry that pertain to the development of practical technologies and, for at least some of these applications, quantum algorithms appear to provide an exponential scaling advantage relative to the best known classical approaches [4]. Specifically, most algorithmic work in this area focuses on the construction of quantum circuits that precisely sample in the eigenbasis of the electronic Hamiltonian. This enables the very precise preparation of any electronic eigenstate that can be reasonably approximated with a classically tractable theory, such as the ground states of many molecules. The ability to arbitrarily refine the accuracy of an approximation to molecular eigenstates is valuable because high precision is often required to predict important properties of these systems, such as the rates of chemical reactions, excitation energies, barrier heights, and non-covalent molecular interaction energies [5]. The "holy grail" of quantum chemistry is to have a generally applicable electronic structure method that yields relative energies with errors less than 1.6 milliHartrees (so-called "chemical accuracy" [6]), which is still a long-standing challenge in the field.

The most efficient rigorous approaches to this problem in quantum computing use the quantum phase estimation algorithm [7,8] to sample in the eigenbasis of the molecular Hamiltonian by measuring the phase accumulated on an initial state under the application of a unitary operator with eigenvalues that are related to those of the electronic Hamiltonian. The electronic Hamiltonian in an arbitrary second-quantized basis can be expressed as $H = T + V$, with

$$T = \sum_{\sigma \in \{\uparrow,\downarrow\}} \sum_{p,q=1}^{N/2} T_{pq}\, a^\dagger_{p,\sigma} a_{q,\sigma}, \qquad V = \frac{1}{2} \sum_{\alpha,\beta \in \{\uparrow,\downarrow\}} \sum_{p,q,r,s=1}^{N/2} V_{pqrs}\, a^\dagger_{p,\alpha} a_{q,\alpha} a^\dagger_{r,\beta} a_{s,\beta}, \quad (1)$$

where $N$ is the number of spin-orbital basis functions used to discretize the Hamiltonian, $T_{pq}$ represents a matrix element of the one-body operator, $V_{pqrs}$ is a matrix element of the two-body Coulomb operator, and $a^\dagger_{p,\sigma}$ and $a_{p,\sigma}$ are fermionic raising and lowering operators for the $p$-th single-particle orbital with spin $\sigma$. Note that unlike the convention in [9], we do not absorb the factor of $1/2$ into $V_{pqrs}$ (this is consistent with the conventions of [10], to avoid confusion). $V_{pqrs}$ is the central tensor in this work, and is defined as

$$V_{pqrs} = \int \mathrm{d}r_1\, \mathrm{d}r_2\, \frac{\phi_p(r_1)\, \phi_q(r_1)\, \phi_r(r_2)\, \phi_s(r_2)}{|r_1 - r_2|}, \quad (2)$$

where $\{\phi_i(r)\}_{i=1}^{N/2}$ is a set of real-valued single-particle spatial-orbital basis functions. While most work has focused on encoding the eigenspectra of this Hamiltonian in a unitary for phase estimation by synthesizing the time-evolution operator $e^{-iHt}$ [11], several recent papers (including this one) have demonstrated increased efficiency by instead synthesizing a qubitized quantum walk [12] with eigenvalues proportional to $e^{\pm i \arccos(H/\lambda)}$, where $\lambda$ is a parameter related to the norm of the Hamiltonian. By repeating this quantum walk a number of times scaling as $O(\lambda/\epsilon)$ one is able to prepare eigenstates of $H$ and estimate the associated eigenvalue such that the error in the estimated eigenvalue is no greater than $\epsilon$ [13,14].
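Because the orbitals in Eq. (2) are real-valued, the tensor $V_{pqrs}$ carries an eight-fold permutational symmetry ($V_{pqrs} = V_{qprs} = V_{pqsr} = V_{rspq}$, etc.). A minimal numpy sketch (all names hypothetical) that builds a random tensor with exactly this symmetry group and verifies it:

```python
import numpy as np

# Build a random two-electron tensor with the 8-fold permutational
# symmetry implied by Eq. (2) for real orbitals, by symmetrizing a
# random (N/2)^4 array over the group of index permutations that
# leave the integrand invariant.
rng = np.random.default_rng(0)
n = 4  # number of spatial orbitals, i.e. N/2
A = rng.standard_normal((n, n, n, n))
V = np.zeros_like(A)
for perm in [(0, 1, 2, 3), (1, 0, 2, 3), (0, 1, 3, 2), (1, 0, 3, 2),
             (2, 3, 0, 1), (3, 2, 0, 1), (2, 3, 1, 0), (3, 2, 1, 0)]:
    V += np.transpose(A, perm)

# Representative symmetries: V_pqrs = V_qprs = V_pqsr = V_rspq.
assert np.allclose(V, np.transpose(V, (1, 0, 2, 3)))
assert np.allclose(V, np.transpose(V, (0, 1, 3, 2)))
assert np.allclose(V, np.transpose(V, (2, 3, 0, 1)))
```

Because the eight permutations form a group, averaging over them produces a tensor invariant under each element, which is what the assertions check.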
Throughout this paper, all discussion of error in the eigenspectra refers to error relative to the finite-sized basis of N spin-orbitals that we have used to discretize our system. The finite-sized basis we choose also introduces a discretization error with respect to the representation of the system in the continuum limit. While this basis set discretization error can be asymptotically refined as $\epsilon = O(1/N)$ for common choices of the single-particle basis such as plane waves or molecular orbitals, the precise constant factors in this scaling depend on the particular basis functions used (which also determine the coefficients $T_{pq}$ and $V_{pqrs}$ in Eq. (1)). In the last few years, several papers have demonstrated quantum algorithms for chemistry with reduced complexity [10,[15][16][17][18][19]. Some algorithms achieved a lower scaling by performing simulation in a basis [15,19,20] that diagonalizes the Coulomb operator V, such that $V_{pqrs} = 0$ unless $p = q$ and $r = s$. However, those representations typically require a significantly larger N in order to model molecular systems within a target accuracy of the continuum limit. While such representations might prove practical for molecules when used in first quantization [21,22], in second quantization the requirement of a significantly larger basis translates into needing significantly more qubits, and thus those approaches seem impractical for simulating molecular systems (as opposed to, e.g., crystalline solid-state systems).
Given this, there is a need for efficient quantum algorithms that are compatible with arbitrary basis functions and not limited to those that diagonalize the Coulomb operator. Such quantum algorithms can exploit the molecular orbital basis directly and thereby significantly reduce the required number of basis functions for a target accuracy towards the continuum limit. Only three prior papers have fully determined the cost of performing such a quantum algorithm for chemistry within an error-correcting code [9,10,23].
The first of these papers, by Reiher et al. [23], deployed a Trotter-based approach to study the quantum simulation of the FeMo cofactor of the nitrogenase enzyme, also known as "FeMoCo" (Fe$_7$MoS$_9$C), a molecule that is important for understanding the mechanism of biological nitrogen fixation [24]. The algorithm used for that work had T gate complexity scaling as approximately $O(N^2 S/\epsilon^{3/2})$, where S is the Hamiltonian sparsity ($S = O(N^4)$ in an arbitrary basis, but with a sufficiently large, localized basis sometimes $S = O(N^2)$ [25]). The work of Reiher et al. required more than $10^{14}$ T gates to simulate an active space model of FeMoCo. These papers focus on counting (and reducing) the required number of T gates (or Toffoli gates) because, within practical error-correcting codes such as the surface code [26], these gates require significantly more time to implement than any other gate and also require a very large number of physical qubits for their implementation. If implemented in the surface code using gates with $10^{-3}$ error rates, the most efficient protocols for implementing T gates require roughly 15 qubit-seconds [27,28] of spacetime volume. At those rates, just distilling the magic states needed for the Reiher et al. FeMoCo calculation would require four million qubit-decades (e.g., one million qubits running for four decades or one billion qubits running for two weeks).
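As a sanity check on the back-of-the-envelope arithmetic above (the $10^{14}$ T-gate count and 15 qubit-second figures come from the text; everything else is unit conversion):

```python
# Rough spacetime-volume arithmetic: ~1e14 T gates, each costing
# ~15 qubit-seconds of distillation spacetime volume.
t_gates = 1e14
qubit_seconds_per_t = 15.0
total_qubit_seconds = t_gates * qubit_seconds_per_t

seconds_per_decade = 10 * 365.25 * 24 * 3600
qubit_decades = total_qubit_seconds / seconds_per_decade
print(f"{qubit_decades:.2e} qubit-decades")  # a few million, same order as quoted above
```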
This work and the two papers [9,10] employ a qubitization-based approach [12], and improve considerably over the Trotter-based methods of [23]. The primary difference between these more recent algorithms is that each adapts qubitization to a different type of tensor factorization of the Coulomb operator. The first of these papers, and the first to suggest combining qubitization with tensor factorizations of the Coulomb operator, was by Berry et al. [9]. That work presents two approaches (one with lower asymptotic complexity, and one with lower gate counts for molecules such as the FeMoCo active space). The approach with the lower finite-sized gate counts is based on using qubitization to exploit sparsity structures in the electronic Hamiltonian. We discuss that method (referred to throughout as the "sparse" method) and its cost in Appendix A and show that in [9], an error led us to overestimate the complexity. We have also recalculated the number of nonzero entries that must be retained in the Hamiltonian, and made a number of minor improvements to the algorithm, so that the sparse approach improves over the Toffoli complexity of the Reiher et al. result by roughly a factor of 1,100× rather than the factor of 700× reported in [9,10]. In terms of asymptotic complexity this approach has Toffoli complexity $O((N + \sqrt{S})\lambda/\epsilon)$ and space complexity $O(N + \sqrt{S})$, where $\lambda$ is a parameter related to the Hamiltonian norm (we discuss $\lambda$ further towards the end of the introduction).
The lower scaling method (but with slightly higher constant factors) presented by Berry et al. [9] combines qubitization with the first step of a tensor factorization originally discussed for quantum computing in [29]. Here we refer to that representation as the "single low rank" factorization, or simply "SF". The SF uses an eigendecomposition of the two-electron integral tensor, similar to the Cholesky decomposition [30] or density fitting [31][32][33] commonly used in the electronic structure literature. The single low rank factorization algorithm obtains Toffoli and space complexities of $O(N^{3/2}\lambda/\epsilon)$ and $O(N^{3/2})$, respectively. We discuss that method and its cost in Appendix B.
The most recent paper, by von Burg et al. [10], adapts qubitization to a second tensor factorization applied on top of the SF used by Berry et al. This second tensor factorization was also first described for quantum computing in [29], and corresponds to a diagonalization of the squared one-body operators, which dates back to [34] in the electronic structure literature. We refer to this as the "double low rank" factorization, or simply DF. We review the method and its cost in Appendix C. Von Burg et al. claim that their DF algorithm gives more than an order of magnitude complexity improvement relative to the SF algorithm of Berry et al.; however, the numbers they compare to are actually those for the sparse algorithm of Berry et al. (which is more efficient than the SF method for the molecules they consider). Furthermore, for the more accurate FeMoCo Hamiltonian introduced by Li et al. [35], the sparse algorithm requires half as many logical qubits and 2/3 the number of Toffoli gates as the DF algorithm (so in that case, the sparse algorithm is considerably cheaper). The Toffoli complexity of the DF approach is $O(N\lambda\sqrt{\Xi}/\epsilon)$ with space complexity $O(N\sqrt{\Xi})$, where $\Xi$ is the average rank of the second tensor factorization discussed in [29]. In most cases (including the regimes that are usually of interest for small quantum computers) $\Xi$ will scale as roughly $O(N)$, giving scaling similar to the SF method. There is some evidence that when N is growing towards the thermodynamic limit (the number of atoms is very large and increasing while fixing the ratio of spin-orbitals to atoms) $\Xi$ can scale as $O(\log N)$ [34]; however, this is not the case in our numerics on hydrogen systems, which reveal scaling of $O(N)$.
There are several other commonalities between the algorithms of [9,10] and the ones developed here which are worth discussing before introducing the main techniques of this paper. In addition to combining qubitization [12] and tensor factorizations, all three works involve the use of unary iteration, QROM, coherent alias sampling, and qubitized phase estimation bounds from [16], as well as a more advanced version of QROM with fanout first developed in [36] and then optimized for use in our context [9] (where it is referred to as "QROAM", a portmanteau of QROM and QRAM). Using these tools one can often construct algorithms to realize qubitized quantum walks with gate complexity that scales as $O(\sqrt{\Gamma})$ using $O(\sqrt{\Gamma})$ ancilla, where $\Gamma$ is the amount of information required to specify the Hamiltonian within a particular tensor factorization. Then, by performing phase estimation on the resultant qubitized quantum walks, one is able to sample in the Hamiltonian eigenbasis with Toffoli complexity $O(\sqrt{\Gamma}\lambda/\epsilon)$. For the sparse algorithm of [9], no tensor factorization is employed, and thus the Hamiltonian (the same as in Eq. (1)) contains a number of integrals scaling as $\Gamma = O(S)$. Thus, accounting for a complexity proportional to N required to perform controlled operations, we end up with an algorithm having Toffoli complexity $O((N + \sqrt{S})\lambda/\epsilon)$ and space complexity $O(N + \sqrt{S})$. For the SF algorithm of [9] we rely on the single low rank factorized Hamiltonian shown in Appendix B, which is specified with an amount of information scaling as $\Gamma = O(N^3)$. This leads to an algorithm with Toffoli complexity $O(N^{3/2}\lambda/\epsilon)$ and space complexity $O(N^{3/2})$. Finally, the DF algorithm of [10] relies on the factorization shown in Appendix C, which compresses the Hamiltonian to $\Gamma = O(N^2\,\Xi)$ pieces of information. Accordingly, this approach yields Toffoli complexity $O(N\lambda\sqrt{\Xi}/\epsilon)$ and space complexity $O(N\sqrt{\Xi})$.
The last element of the complexity of these algorithms that we should discuss is the scaling of $\lambda$. The quantity $\lambda$ is a type of norm of the Hamiltonian that is being simulated, and depends on how the Hamiltonian is expressed. For example, for the "sparse" algorithm of [9] the value of $\lambda$ is simply the sum of the absolute values of the Hamiltonian coefficients (i.e., the 1-norm) appearing in Eq. (1). Numerical studies reveal that $\lambda$ usually scales somewhere in between $O(N)$ and $O(N^3)$, depending on details of the particular system as well as the algorithm and on how one is increasing N (e.g., the scaling is lower if N is growing because the number of basis functions per atom is fixed but the number of atoms is increasing, and the scaling is higher if N is growing because the number of atoms is fixed but the basis size is growing). The largest $\lambda$ tends to be that associated with the SF used in [9], followed by the sparse representation used in [9], with the smallest $\lambda$ of the three approaches discussed so far being that associated with the DF used in [10]. In most contexts the main advantage of the DF algorithm relative to the SF algorithm is not a substantially smaller $\Gamma$ (or less complex quantum walk) but rather a smaller $\lambda$. Thus, it is critical to consider how the algorithmic choices that one makes will affect the value of $\lambda$.
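For intuition, the sparse-method $\lambda$ is just the 1-norm of the coefficients in Eq. (1). A minimal numpy sketch (function name hypothetical; spin-summation multiplicity factors are deliberately ignored here, whereas the paper's definition tracks those constants carefully):

```python
import numpy as np

def lambda_sparse(T, V):
    """Schematic 1-norm of the Hamiltonian coefficients in Eq. (1):
    sum of |T_pq| plus the two-body terms with their 1/2 prefactor.
    Spin-summation multiplicity factors are omitted for brevity."""
    return np.abs(T).sum() + 0.5 * np.abs(V).sum()

rng = np.random.default_rng(1)
n = 3
T = rng.standard_normal((n, n))
V = rng.standard_normal((n, n, n, n))
lam = lambda_sparse(T, V)
assert lam >= np.abs(T).sum()  # the two-body term only adds to the norm
```

The point of the sketch is that $\lambda$ is a classical preprocessing quantity: it is fixed by the representation one chooses, before any quantum circuit is built.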

B. Overview of results
We now discuss the main results of this paper. We will present a new qubitization-based algorithm that results from exploiting structure in the molecular Hamiltonian that emerges from a tensor factorization of the Coulomb operator known as tensor hypercontraction (THC) [37][38][39]. THC is a very compact representation of the Hamiltonian which gives $\Gamma = O(N^2)$ regardless of how one increases N. It is relatively straightforward to apply qubitization directly to the THC representation, and a method for doing so is presented in Appendix E. This results in an approach with Toffoli complexity $O(N\lambda/\epsilon)$ and space complexity $O(N)$. However, it turns out that directly applying qubitization to the THC representation is not particularly efficient, because doing so causes $\lambda$ to become even larger than the single low rank $\lambda$ of [9] (which was already orders of magnitude larger than the double low rank $\lambda$ of [10] for systems like FeMoCo). To remedy this, we discuss a different approach to utilizing the THC representation of the Hamiltonian.
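The compactness claim is easy to see concretely: the THC factors reconstruct the full two-electron tensor from only $O(N^2)$ numbers. A toy numpy sketch (shapes and names hypothetical, with `chi` of shape (M, n) and symmetric `zeta` of shape (M, M)):

```python
import numpy as np

# Toy THC reconstruction: V_pqrs = sum_{mu,nu}
#   chi[mu,p] chi[mu,q] zeta[mu,nu] chi[nu,r] chi[nu,s].
rng = np.random.default_rng(2)
n, M = 3, 6  # n spatial orbitals, M THC rank (typically M = O(n))
chi = rng.standard_normal((M, n))
zeta = rng.standard_normal((M, M))
zeta = 0.5 * (zeta + zeta.T)  # symmetric by construction

V = np.einsum('mp,mq,mn,nr,ns->pqrs', chi, chi, zeta, chi, chi)

# The THC data is only O(n*M + M^2) numbers, i.e. Gamma = O(N^2)
# when M = O(N), versus the O(n^4) entries of V itself.
assert V.shape == (n, n, n, n)
# Symmetries inherited from real chi and symmetric zeta:
assert np.allclose(V, V.transpose(1, 0, 2, 3))
assert np.allclose(V, V.transpose(2, 3, 0, 1))
```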
We show that one can use the THC tensors to define a larger orbital basis with exactly the same resolution as the original basis while diagonalizing the Coulomb operator. This form of the Hamiltonian would then be amenable to simulation using the efficient strategies of [10,[16][17][18], except for the fact that the orbitals are non-orthogonal. However, by separately rotating into this non-orthogonal basis before operating on each tensor factor of the Hamiltonian, we are able to avoid many of the complications that would usually arise from simulating a non-orthogonal operator. To achieve this, we use a strategy for storing and implementing non-orthogonal Givens rotation based orbital basis transformations [40] with rotation angles loaded from QROM, similar to techniques developed for qubitizing orthogonal basis rotations in [10]. As these rotations add significant cost to the method, we also introduce a strategy for coarse graining the angles of the basis transformation that allows us to precisely control a tradeoff between the complexity of these rotations and the accuracy of the Hamiltonian (a similar technique would also reduce the cost of the algorithm by von Burg et al.). Ultimately, our method does not require additional system qubits (beyond those for QROM and other minor ancillary functions) and results in asymptotic Toffoli complexity $O(N\lambda/\epsilon)$ and space complexity $O(N)$, while essentially matching the small $\lambda$ values obtained in the double low rank algorithm.
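The coarse-graining idea can be illustrated classically by quantizing Givens rotation angles to a fixed number of bits and measuring the resulting error in the composed rotation. This toy numpy sketch is not the paper's circuit construction; all names and the specific rotation ladder are illustrative assumptions:

```python
import numpy as np

def givens(theta, i, j, n):
    """n x n Givens rotation acting in the (i, j) coordinate plane."""
    G = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    G[i, i] = G[j, j] = c
    G[i, j], G[j, i] = -s, s
    return G

def rotation(angles, n):
    """Compose a ladder of Givens rotations into one orthogonal matrix."""
    U = np.eye(n)
    for k, th in enumerate(angles):
        U = givens(th, k, k + 1, n) @ U
    return U

def quantize(theta, bits):
    """Round each angle to the nearest multiple of 2*pi / 2**bits."""
    step = 2 * np.pi / 2**bits
    return np.round(theta / step) * step

rng = np.random.default_rng(3)
n = 4
thetas = rng.uniform(-np.pi, np.pi, size=n - 1)
U_exact = rotation(thetas, n)

# Fewer angle bits means less QROM data to load, at the price of a
# controllable rotation (and hence Hamiltonian) error.
errs = [np.linalg.norm(U_exact - rotation(quantize(thetas, b), n))
        for b in (4, 8, 12)]
assert errs[-1] < errs[0]  # error shrinks roughly as 2**(-bits)
```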
As can be seen in Table I, our algorithm improves over the spacetime volume of the approaches in [9,10] by a factor of about $O(N)$ in most contexts. In fact, the scaling is often even better than that, owing to the lower scaling of $\lambda$ associated with our representation, as can be seen from numerics on hydrogen systems in Table II. Surprisingly, the $\lambda$ value associated with our algorithm sometimes scales better than $O(N^2)$, which is the scaling of $\lambda$ for the lowest scaling qubitization approach requiring orthogonal basis functions that diagonalize the Coulomb operator [16].
In addition to having the best asymptotic scaling compared to prior approaches, our algorithm also achieves the lowest finite spacetime volume for all molecules studied here, including hydrogen chains and FeMoCo. We study active space models of FeMoCo proposed by Reiher et al. [23] as well as by Li et al. [35]. The Reiher Hamiltonian was found to be qualitatively incorrect in describing the ground state of FeMoCo, as it does not capture the open-shell nature of the system [35]. The Li Hamiltonian was then proposed and shown to capture the open-shell nature of the ground state properly [35]. We focus on the FeMoCo Hamiltonians primarily because they are regarded as a standard benchmark for quantum computing. For the FeMoCo Hamiltonians of Reiher et al. and Li et al. we find a reduction in spacetime volume of about 3× and 6× (respectively) compared to [10]. The results for FeMoCo are summarized in Table III.

TABLE I. History of the lowest asymptotic scaling quantum algorithms for simulating quantum chemistry in an arbitrary (e.g., molecular orbital) basis. Only the following rows and cells survive extraction; lost cells are marked "—":

Year | Reference             | Primary advance                              | Space        | Toffoli/T complexity
—    | Babbush et al. [47]   | Use of Taylor series (on-the-fly method)     | O(N)         | O(N^5/ε)
2015 | Babbush et al. [48]   | Use of Taylor series with first quantization | —            | —
—    | Reiher et al. [23]    | First T count and tighter Trotter bounds     | —            | —
—    | Motta et al. [29]     | Use of low rank factorization with Trotter   | —            | —
—    | Campbell [49]         | Use of randomized compiling with Trotter     | —            | —
—    | Berry et al. [9]      | Use of qubitization (sparse method)          | —            | —
—    | Berry et al. [9]      | Use of qubitization (single factorization)   | O(N^{3/2})   | O(N^{3/2} λ_SF/ε)
2019 | Kivlichan et al. [50] | Better randomized compiled phase estimation  | O(N)         | O(λ_V^2/ε^2)
2020 | von Burg et al. [10]  | Use of qubitization (double factorization)   | —            | —

N is the number of arbitrary orbital basis functions, η is the number of electrons (relevant only in first-quantized simulations) and ε is the target precision to which we estimate the Hamiltonian eigenvalues using phase estimation. S is the sparsity of the electronic Hamiltonian; usually $S = O(N^4)$ when using an arbitrary basis, but sometimes the scaling can be lower. The λ parameters (discussed extensively in this paper) are roughly the 1-norm of the Coulomb operator associated with the representation in which we simulate the system. In general, the precise scaling is difficult to report since it depends on the specific molecule and on how N is growing. Roughly, the λ values have scaling that is typically between O(N) and O(N^3). Here, Ξ is the average rank of the second tensor factorization discussed in Motta et al. [29], which is also important for the complexity of the work of von Burg et al. [10]. In general (for example, when scaling towards the continuum limit) we would expect that Ξ = O(N). But for large systems that are growing because we are adding more atoms while keeping the basis-to-atom ratio fixed, Ξ can be smaller; for the hydrogen chains studied in this work it seems that Ξ = O(N). See Table II for a better sense of how algorithms that depend on the S, λ and Ξ parameters scale for hydrogen benchmarks. The work of Motta et al. [29] does not determine the scaling of the Trotter error for the Trotter steps compiled therein, and so this table assumes (rather speculatively) that the Trotter error scaling for those Trotter steps is the same as for the Trotter steps of Reiher et al. [23]. This table omits methods that require special basis functions or non-Galerkin representations. All such representations are less compact for molecules than molecular orbitals and thus require more qubits to reach the same level of accuracy. Notable examples include the grid bases used in [21,51], the discontinuous Galerkin techniques of [19], the plane waves required by [10,17,22] and the basis sets diagonalizing the Coulomb operator required by [15,18,40].
We also carefully analyze the surface code resources required to simulate the FeMoCo Hamiltonian of Li et al. [35] using our THC approach. Rather than focusing only on the cost of distillation, we fully lay out the surface code computation in spacetime and optimize resource usage. We determine that the computation could execute using approximately four million physical qubits running for under four days, assuming surface code cycle times of 1 microsecond and gate error rates of about 0.1%. We find that if error rates were reduced to 0.01%, the computation could complete using about one million physical qubits in under two days.

TABLE II. Scaling of the algorithms from Table I for two benchmark chemical series. Only the caption survives extraction; the table body (per-algorithm space and Toffoli complexities for the H4 continuum limit and the hydrogen chain thermodynamic limit) is lost. N is the number of spin-orbital basis functions and ε is the target precision to which we would aim to realize phase estimation. This table summarizes the findings of numerics discussed in Section IV, which reveal the scaling with N that one might observe in practice for these algorithms when the system size is growing towards either the continuum or thermodynamic limits. For the continuum limit we focus on H4 Hamiltonians with each hydrogen placed on the corners of a square plaquette with side length of 2.0 Bohr radii. We then increase the number of molecular orbitals used to represent the system and determine the scaling of the associated algorithms. For scalings towards the thermodynamic limit we fix the ratio of basis functions to atoms and increase the number of hydrogens in a one-dimensional hydrogen chain, with atom spacings again at 1.

C. Paper organization
Our paper is organized as follows. In Section II we describe the THC factorization of the electronic Hamiltonian. We give some background on how the factorization can be obtained, and discuss how it has been shown to scale. We outline how one can directly apply qubitization to the THC Hamiltonian, although with large constant factors in the scaling. Then, we introduce an elegant use of the THC Hamiltonian that corresponds to a diagonal Coulomb operator in a larger non-orthogonal basis.
In Section III we give a complete description of how qubitization can be combined with the novel (diagonal and non-orthogonal) THC representation to give the most efficient known quantum algorithm for simulating electronic structure in an arbitrary basis. We compile our approach all the way to Clifford + Toffoli gates and report the constant factors in the leading order scaling for both the total Toffoli complexity and total ancilla required.
In Section IV we analyze the finite resources required to perform the algorithm of Section III for the simulation of several real systems: FeMoCo and hydrogen chains of various sizes. These numerics demonstrate the effectiveness of the THC representation, help to elucidate certain aspects of the scaling of our approach, and allow us to compare the cost of our approach to prior methods. The second part of this section discusses the layout of the Li FeMoCo Hamiltonian simulation in the surface code. Our analysis considers the costs of routing and distillation, and determines how many physical qubits are required at all points in the computation. Finally, we reflect on the significance of these results and suggest future directions for research in Section V.
Appendix A, Appendix B and Appendix C contain extensive and self-contained descriptions of the prior methods from [9] and [10], adapted to the notation and conventions of this paper, and in some cases with improved bounds on the complexity of those methods. They also contain numerical details associated with each method that are not presented in the main text. Appendix D covers the randomized compiled methods of [49], which are not particularly related to the other methods here but are simple to analyze and also have scaling that depends on λ, so we are able to analyze the exact resources required by that approach using numerics that were already available to us. Then, in Appendix E we perform a detailed analysis of how one can directly use the THC representation without projecting into the non-orthogonal basis. This leads to an exceptionally simple-to-understand algorithm (more straightforward than the other approach of this paper or those of [9,10]), but with worse constant factors in the scaling compared to the primary approach of this paper, due to a larger value of λ. The remaining two appendices discuss technical details pertaining to the implementation of important circuit primitives used throughout this work. In Appendix F we discuss a technique and costings for computing contiguous registers. Finally, in Appendix G we describe and analyze the cost of modifying the standard QROM procedure [16] to output two registers at a time.
The structure of the tensor factorization in Eq. (3) is rather different from either the SF or the DF used in [9,10]. For instance, unlike with the low rank decompositions, it is unclear how one might combine the THC factorization with reduced scaling product formulas for time-evolution [29] or better methods of performing energy measurements for variational algorithms [73]. Similar to how the work of [9,10] combines qubitization with the tensor factorizations described in Appendix B and Appendix C, here we will discuss how the THC factorization leads to an advantage when combined with qubitization; however, our approach will require different qubitization oracles (and thus different algorithms) from those in [9,10].

B. Numerical computation of the tensor hypercontraction factorization
The problem of obtaining the factorization of Eq. (3) is often numerically ill-conditioned and has been the subject of research for many years in quantum chemistry [37][38][39][53][54][55][56][57][58][59][60][61][62][63][64][65][66][67][68][69][70][71][72]. We write the THC factorization problem as an $L^2$ norm minimization in which we seek minimizers $\chi$ and $\zeta$ for

$$\mathcal{L}(\chi, \zeta) = \sum_{pqrs} \Big| V_{pqrs} - \sum_{\mu\nu} \chi_p^{(\mu)} \chi_q^{(\mu)} \zeta_{\mu\nu} \chi_r^{(\nu)} \chi_s^{(\nu)} \Big|^2 . \quad (5)$$

The THC ansatz appearing inside this objective is quartic in $\chi$ and linear in $\zeta$, and the objective generally exhibits multiple minima as well as a flat optimization landscape, which makes finding global minimizers challenging. We developed a strategy to cope with these numerical difficulties, which was found to be effective for the systems considered in this work; we provide details of the strategy below. All numerical results in this paper were obtained from the following protocol. The resulting THC tensors are available in [74].
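The $L^2$ objective described above is a few lines of numpy. This sketch uses hypothetical shapes `chi: (M, n)` and `zeta: (M, M)` and is meant only to make the objective concrete:

```python
import numpy as np

def thc_loss(chi, zeta, V):
    """L2 objective of Eq. (5): squared Frobenius distance between the
    exact two-electron tensor V and its THC reconstruction."""
    V_thc = np.einsum('mp,mq,mn,nr,ns->pqrs', chi, chi, zeta, chi, chi)
    return np.sum((V - V_thc) ** 2)

rng = np.random.default_rng(4)
n, M = 3, 5
chi = rng.standard_normal((M, n))
zeta = rng.standard_normal((M, M))
# A tensor that is exactly THC-representable by construction:
V_exact = np.einsum('mp,mq,mn,nr,ns->pqrs', chi, chi, zeta, chi, chi)

assert np.isclose(thc_loss(chi, zeta, V_exact), 0.0)  # exact factors give zero loss
assert thc_loss(chi + 0.1, zeta, V_exact) > 0.0       # perturbed factors do not
```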

Initial guess
Due to the nonlinear nature of Eq. (5), good initial guesses are often critical to obtaining accurate THC factorizations. We generate random guesses for χ and ζ when the real-space molecular orbital representation is unavailable. This is the case for the FeMoCo Hamiltonians, since only the Hamiltonian matrix elements V_pqrs are reported in the literature. In this work, we tried 20 random guesses and picked the best-performing one.
On the other hand, when a real-space molecular orbital representation is available (which is the case for the hydrogen systems, and for chemical systems in general) we utilize the interpolative separable density fitting (ISDF) technique to generate an accurate initial guess for χ and ζ. The ISDF approach was originally proposed by Lu and Ying [61]. In terms of the classical precomputation required, this approach is more efficient than the direct gradient descent approach. It is based on the intuition that the THC factorization is an interpolative decomposition of the electronic pair density in real space,

$$\phi_p(r)\, \phi_q(r) \approx \sum_{\mu} \xi_\mu(r)\, \phi_p(r_\mu)\, \phi_q(r_\mu),$$

where $\phi_p(r)$ is the $p$-th single-particle orbital represented on a grid $\{r\}$, $r_\mu$ denotes the interpolation points, and $\xi_\mu(r)$ is called an interpolation vector, which can be obtained via a simple least-squares fit. Once $r_\mu$ and $\xi_\mu(r)$ are found, the THC factorization is then obtained via

$$\chi_p^{(\mu)} = \phi_p(r_\mu), \qquad \zeta_{\mu\nu} = \int \mathrm{d}r_1\, \mathrm{d}r_2\, \frac{\xi_\mu(r_1)\, \xi_\nu(r_2)}{|r_1 - r_2|}.$$

ISDF involves no nonlinear optimization; it is therefore numerically robust and provides a straightforward initial guess for subsequent direct minimizations. Finding the interpolation points can also be performed with linear complexity [71] via the centroidal Voronoi tessellation (CVT) approach [68]. Determining $\xi_\mu(r)$ scales as O(N) and is ultimately the bottleneck of this algorithm. While this is more economical than also performing further optimization (see the next subsection), the resulting THC factorization is not as compact for a given accuracy. Therefore, in this work we use the ISDF/CVT approach to generate a good initial guess from which we perform further optimization.
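The least-squares fit for the interpolation vectors can be sketched in a few lines of numpy. Everything here is a toy stand-in (random "orbitals" on a random grid, randomly chosen interpolation points instead of CVT); the one real feature it demonstrates is that, with enough interpolation points to span the space of orbital pair products, the fit becomes exact:

```python
import numpy as np

rng = np.random.default_rng(5)
G, n = 50, 3            # grid points and orbitals (toy sizes)
M = n * (n + 1) // 2    # pair products span a space of this dimension

phi = rng.standard_normal((G, n))          # orbitals tabulated on the grid
mu = rng.choice(G, size=M, replace=False)  # toy interpolation points (not CVT)

# Pair densities phi_p(r) phi_q(r) on the full grid and at the points r_mu.
rho = np.einsum('gp,gq->gpq', phi, phi).reshape(G, n * n)
rho_mu = rho[mu]                           # shape (M, n*n)

# Least-squares fit: rho(r) ~ sum_mu xi(r, mu) * rho(r_mu).
xi, *_ = np.linalg.lstsq(rho_mu.T, rho.T, rcond=None)
xi = xi.T                                  # interpolation vectors, shape (G, M)

residual = np.linalg.norm(rho - xi @ rho_mu) / np.linalg.norm(rho)
# With M = n(n+1)/2 generic points the fit is exact; fewer points
# would give a genuine approximation.
assert residual < 1e-6
```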

Optimization
While ISDF initial guesses are quite accurate, random guesses are sometimes very far from any local minimum. Therefore, when starting from random initial guesses, we found that a quasi-Newton method such as the limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm with box constraints (L-BFGS-B) [75] is quite robust in the warm-up stage. We note that quasi-Newton methods were previously studied for obtaining THC factorizations in [67]. The gradient of Eq. (5) needed for L-BFGS-B is obtained via automatic differentiation in JAX [76]. Unfortunately, L-BFGS-B alone could not produce an accurate factorization for all of the systems considered in this work, so in some cases we also had to perform another optimization with a different solver.
After an L-BFGS-B run, we employ a popular machine learning optimizer called AdaGrad [77], as implemented in JAX [76], to finalize the factorization. We found that L-BFGS-B tends to get stuck in unwanted local minima; subsequent optimization via AdaGrad helped escape these minima and improved the accuracy of our optimization by more than an order of magnitude.
C. Diagonal Coulomb operators from projecting into the auxiliary tensor hypercontraction basis

One can attempt to apply qubitization directly to the Hamiltonian representation of Eq. (3). In fact, doing this is relatively straightforward, and in Appendix E we describe an algorithm based on exactly that approach. We show that this results in an algorithm with Toffoli complexity O(N λ_thc/ε), which, prima facie, is relatively low. The method of Appendix E is also exceptionally simple as far as arbitrary-basis quantum chemistry qubitizations go; thus, that approach might be valuable for pedagogical purposes. However, the problem with directly applying qubitization to this form of the THC Hamiltonian is that we end up with a λ_thc in which the sum over µ and ν appears outside of the absolute value. As a result, λ_thc is much larger than even λ_V, the 1-norm of the original Hamiltonian. Note that λ_V is already larger than λ_DF (the λ value associated with the von Burg et al. [10] algorithm), so this larger λ value should be avoided. To avoid this blow-up in λ, the main approach of this paper will focus on simulating a different Hamiltonian that we derive from the THC representation.
As discussed in the introduction, a number of recent papers have demonstrated very high efficiency quantum algorithms for simulating the electronic Hamiltonian in a basis that diagonalizes the Coulomb operator [15][16][17][18]78]. The downside of these approaches is that the required orthogonal basis sets often require many more qubits to reach chemical accuracy for molecules, compared to arbitrary basis functions such as molecular orbitals, which lead to the Coulomb operator in Eq. (1). By contrast, the THC representation of the Coulomb operator in Eq. (3) may be more compact, but it is not diagonal. However, as we show below, it is possible to use the χ tensor from the THC factorization to define a larger "auxiliary" basis in which the Coulomb operator is diagonal.
We can achieve this diagonal representation by first defining a non-unitary (and non-orthogonal) basis rotation corresponding to a projection of the original fermion ladder operators into a larger basis,

c_{µ,σ} = Σ_{p=1}^{N/2} χ_p^{(µ)} a_{p,σ},

where the creation/annihilation operators c†_µ / c_µ act on a larger space of 2M spin-orbitals rather than N spin-orbitals. Without loss of generality, we take χ^{(µ)} to be a normalized vector for each µ (because constant factors can be absorbed into ζ_µν). Using this, we can rewrite the two-body part of Eq. (3) in a diagonal form in the expanded basis,

(1/2) Σ_{σ,τ∈{↑,↓}} Σ_{µ,ν=1}^{M} ζ_µν n_{µ,σ} n_{ν,τ},

where n_{µ,σ} = c†_{µ,σ} c_{µ,σ} is the number operator in the larger basis. To our knowledge, we are the first to derive this representation of the Hamiltonian (it does not seem to appear in any prior literature, even for classical electronic structure). Even though the basis size (and thus the number of qubits) has been increased by a factor of roughly 4 to 10 in most contexts, the appeal of the highly efficient algorithms presented in [15][16][17][18]78] for diagonal Coulomb operators makes this representation interesting. Unfortunately, this larger basis is not orthogonal, which complicates our approach to directly simulating this Hamiltonian. For example, one cannot use methods such as those in [16,18] that were developed to simulate diagonal electronic Hamiltonians in orthogonal basis sets. Instead, we will pursue a qubitization-based approach that involves rotating into the underlying non-orthogonal basis one tensor factor at a time, avoiding complications that would arise if the rotations were done globally. The Hamiltonian in Eq. (11) will be referred to as the non-orthogonal THC Hamiltonian.

D. Deriving the λ value associated with the non-orthogonal THC Hamiltonian representation

We will now give a very high level overview of the main approach of this paper, which involves qubitizing Eq. (11), and we will derive the λ value associated with that representation.
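This diagonal rewriting can be checked numerically on a tiny spinless toy model (all sizes and tensors below are illustrative, not from the paper), using dense Jordan-Wigner matrices for the ladder operators:

```python
import numpy as np
from functools import reduce

# Jordan-Wigner ladder operators for a small spinless example with N modes.
N, M = 3, 4
I2 = np.eye(2); Z = np.diag([1.0, -1.0]); sm = np.array([[0.0, 1.0], [0.0, 0.0]])

def annihilate(p):
    ops = [Z] * p + [sm] + [I2] * (N - p - 1)
    return reduce(np.kron, ops)

a = [annihilate(p) for p in range(N)]        # real matrices, so a^dag = a.T

rng = np.random.default_rng(1)
chi = rng.standard_normal((N, M))
chi /= np.linalg.norm(chi, axis=0)           # normalize each chi^(mu)
zeta = rng.standard_normal((M, M)); zeta = 0.5 * (zeta + zeta.T)

# Two-body operator built directly from the THC tensor V_pqrs.
V = np.einsum('pm,qm,mn,rn,sn->pqrs', chi, chi, zeta, chi, chi)
H_thc = 0.5 * sum(V[p, q, r, s] * a[p].T @ a[q] @ a[r].T @ a[s]
                  for p in range(N) for q in range(N)
                  for r in range(N) for s in range(N))

# The same operator written diagonally with n_mu = c_mu^dag c_mu, where
# c_mu = sum_p chi_p^(mu) a_p is the non-orthogonal projection.
c = [sum(chi[p, mu] * a[p] for p in range(N)) for mu in range(M)]
n = [c[mu].T @ c[mu] for mu in range(M)]
H_diag = 0.5 * sum(zeta[mu, nu] * n[mu] @ n[nu]
                   for mu in range(M) for nu in range(M))
```

The two dense matrices agree, confirming that the THC two-body operator is diagonal in the number operators of the expanded non-orthogonal basis.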
In [10] it was shown that when qubitizing an operator in a different basis, the basis rotation does not need to rotate all the bases at once. Instead, because it is controlled by a register, it is sufficient to perform basis rotations independently for each number operator controlled by the register. As shown in Eq. (51) of [10], only N/2 Givens rotations are needed, instead of O(N²) if all N basis vectors were being rotated at once. So in our case, we can control on µ to rotate the basis to that described by n_{µ,α} using N Givens rotations in the same way as shown in Eq. (51) of [10]. Since χ^{(µ)} is taken to be a normalized vector for each µ, this is a valid rotation for each individual µ. It does not matter that the basis is not orthogonal, because the rotations are done individually for each µ separately, rather than jointly for all µ together.
It is also possible to apply other methods from [10] associated with implementing these rotations.
1. Use a QROM to output the N/2 rotation angles controlled on the register µ which takes M values.
2. Since the number operators can be represented as (𝟙 − Z)/2, we can take the identity parts out and combine them with the one-body terms; the remaining two-body terms then have a λ value that is divided by 4.
Note also that the procedure to rotate to and from the new basis must be done twice: once for the operators dependent on ν and again for those dependent on µ. The net result is that the complexity for a single step is O(N), and the value of λ has a contribution from the two-body terms of

λ_ζ = (1/2) Σ_{µ,ν=1}^{M} |ζ_µν|.

To determine the λ value more carefully, G can be given as

G = (1/2) Σ_{σ,τ} Σ_{µ,ν=1}^{M} ζ_µν (U†_µ n_{1,σ} U_µ)(U†_ν n_{1,τ} U_ν).

In actually implementing this, we would use α or β to control swaps between the qubits representing spin-up and spin-down orbitals, so U_µ would only need to act on N/2 qubits, but for simplicity this is not shown explicitly here. Then, by taking n_{1,α} = (𝟙 − Z_{1,α})/2, the product expands into three kinds of terms. The first corresponds to an overall energy shift that can be ignored. The third is the two-body term that gives the contribution to λ given above. The middle term is an effective one-body operator, which we denote T^(2→1). Combining T^(2→1) with T gives a new one-body operator, which can be diagonalized with eigenvalues t_p and individual rotations U_{T,p} for T similar to the U_µ for G. In that diagonal form, the first term is again an overall energy shift, and the second is the one-body term that contributes towards λ. The net result is

λ = λ_T + λ_ζ, with λ_T = Σ_p |t_p|.

As we will discuss later on, λ_ζ actually scales even better than the λ values associated with any prior algorithm in the literature, including the λ_DF associated with the double factorized algorithm of von Burg et al. [10].
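As a small numeric sketch of this accounting (with made-up inputs; `T_eff` stands in for the effective one-body matrix after folding in the one-body correction from the two-body terms), one can compute λ_T and λ_ζ as:

```python
import numpy as np

# Illustrative inputs: a symmetric effective one-body matrix and a THC core tensor.
T_eff = np.array([[1.0, 0.2],
                  [0.2, -0.5]])
zeta = np.array([[1.0, -2.0],
                 [-2.0, 3.0]])

t = np.linalg.eigvalsh(T_eff)              # eigenvalues t_p of the one-body operator
lambda_T = np.sum(np.abs(t))               # one-body contribution to lambda
lambda_zeta = 0.5 * np.sum(np.abs(zeta))   # two-body contribution, (1/2) sum |zeta|
lambda_total = lambda_T + lambda_zeta
```

For these inputs λ_ζ = 4.0 exactly, and λ_T is the sum of the absolute eigenvalues of `T_eff`; the total is what multiplies the number of phase-estimation repetitions.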

III. QUBITIZING THE NON-ORTHOGONAL TENSOR HYPERCONTRACTION HAMILTONIAN
A. Approach to qubitization

Our approach to encoding the eigenspectra of the THC representation of the electronic Hamiltonian in a unitary for phase estimation will use the linear combination of unitaries (LCU) query model [79]. Specifically, we will use qubitization [12] to block encode [80] the Hamiltonian eigenspectra as a Szegedy quantum walk [81]. What all LCU methods have in common is that they involve simulating or block encoding the Hamiltonian from a representation where it can be accessed as a linear combination of unitaries,

H = Σ_ℓ ω_ℓ U_ℓ,

where the U_ℓ are unitary operators and the ω_ℓ are scalars. LCU methods are defined in terms of queries to two oracle circuits that are commonly defined as

select |ℓ⟩|ψ⟩ = |ℓ⟩ U_ℓ |ψ⟩,  prepare |0⟩ = Σ_ℓ √(|ω_ℓ|/λ) |ℓ⟩,  λ = Σ_ℓ |ω_ℓ|,

where |ψ⟩ is the system register, |ℓ⟩ is an ancilla register which usually indexes the terms in the LCU decomposition in binary, and λ is a parameter that turns out to be important for the complexity of the approach. Currently, the most practical fault-tolerant approaches for simulating quantum chemistry are based on qubitization [9,10,12,16]. Following the analysis and techniques of [16], one can use phase estimation based on qubitized quantum walks to sample in the eigenbasis of a Hamiltonian with error in the sampled eigenvalue bounded from above by ε_pea, with Toffoli complexity scaling as

⌈πλ/(2ε_pea)⌉ (C_S + C_P + C_P†),   (22)

where C_S is the gate complexity of select, C_P is the gate complexity of prepare, C_P† ≤ C_P is the gate complexity of uncomputing prepare, and ε_pea is the allowable error in the phase estimation.
The multiplying factor of πλ/(2ε_pea) corresponds to the number of repetitions of the LCU step used in the phase estimation. In [16] the number of repetitions was taken to be a power of 2, because that makes the phase estimation particularly simple, with each qubit of the control register controlling a number of repetitions that is a power of 2. It is also possible to use a number of repetitions that is an arbitrary integer. This was assumed by von Burg et al. [10], although doing so requires more sophisticated control which those authors do not show how to perform. We will explain how this can be accomplished. The general principle is to control each step using the unary iteration procedure introduced in [16].
To explain the procedure in more detail, one should combine the select operation together with a reflection R = 2|L⟩⟨L| ⊗ 𝟙 − 𝟙 based on the prepare operation (where |L⟩ is the state produced by prepare), to create a step of a quantum walk W which has eigenvalues e^{±i arccos(E_n/λ)}, where H|n⟩ = E_n|n⟩ [13,14]. In order for this procedure to work, select needs to be self-inverse. It would be possible to obtain the same complexity with controlled application of select or select†, but we avoid that because it doubles the complexity, and instead construct select so it is self-inverse.
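The eigenphase relation can be verified numerically on a toy one-qubit LCU example (this illustrative example is ours, not from the paper):

```python
import numpy as np

# One-qubit LCU: H = 0.3 Z + 0.4 X, so lambda = 0.7 and E_n = +/- 0.5.
Z = np.diag([1.0, -1.0]); X = np.array([[0.0, 1.0], [1.0, 0.0]])
omegas, unitaries = [0.3, 0.4], [Z, X]
lam = sum(omegas)
H = sum(w * U for w, U in zip(omegas, unitaries))

# prepare |L> on the ancilla; select = sum_l |l><l| (x) U_l is self-inverse
# because Z and X are Hermitian unitaries.
L = np.sqrt(np.array(omegas) / lam)
select = sum(np.kron(np.outer(e, e), U) for e, U in zip(np.eye(2), unitaries))
R = np.kron(2.0 * np.outer(L, L) - np.eye(2), np.eye(2))  # reflection about |L>

W = R @ select                     # one step of the qubitized walk
phases = np.sort(np.angle(np.linalg.eigvals(W)))
expected = np.sort([s * np.arccos(E / lam)
                    for E in np.linalg.eigvalsh(H) for s in (+1.0, -1.0)])
```

The walk eigenphases come out as ±arccos(E_n/λ) for each eigenvalue E_n of H, which is exactly what phase estimation on W exploits.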
Simplistically, one could make W controlled as shown in Figure 1, but for the purpose of obtaining the complexity shown in Eq. (22), one needs to control application of W and its inverse. Given that select is self-inverse, one can obtain W† simply by performing the reflection before select instead of after. That means one could control between W and W† by performing a controlled reflection before select as well as after (and not making select controlled). But it is not necessary to perform two controlled reflections, only one. To see why, consider the case where the control is selecting four applications of W, with the rest of the operations being W†. Then we need to perform a sequence of alternating operations U and R, where U denotes select and R denotes the reflection (i.e., the combined inverse preparation, reflection about the zero state, and preparation). We will only use U or R with this meaning here and in Figure 2.
Here, after the fourth U , we simply perform the identity rather than two reflections, because the reflections cancel. This means that we always perform the reflection in between each successive select, except when the number of times we have performed select matches the number in the control register. That means we only need one controlled reflection in between each select, which corresponds to removing the control on select in Figure 1.
Next we explain how to address cases where πλ/(2ε_pea) is not a power of 2. To achieve that, one can simply prepare the optimal control state as described in [16], except using a superposition over a number of basis states that is smaller than a power of 2. Then, instead of controlling on each successive qubit of this state, use the unary iteration procedure of [16] to control the reflection. The unary iteration procedure introduces a trivial additional cost of one Toffoli per step, and doubles the ancilla cost of the control register due to the temporary ancillas for the unary iteration. An explicit example of this control is shown in Figure 2.
FIG. 1. A circuit realizing the qubitized quantum walk operator W controlled on an ancilla qubit. Note that the Z gate with the 0-control is actually controlled on the zero state of the entire |ℓ⟩ register, not just a single qubit. Accordingly, implementation of that controlled-Z has gate complexity log L + O(1), where log L is the size of the |ℓ⟩ register.
In our algorithm there are four sources of error:

1. The error due to measurement in phase estimation, ε_pea, which first appears in Eq. (22).
2. The approximation of the Hamiltonian coefficients as part of the coherent alias sampling procedure from [16] that is used as part of our prepare strategy.
3. The approximation of the individual Givens rotations needed for the basis rotations.

4. The error in the THC factorization itself, i.e., the difference between the THC-approximated and exact Hamiltonian coefficients.
To bound the overall error, one could take these individual errors and simply add them together. However, the sources of error 2 to 4 all contribute to error in the approximation of the Hamiltonian. That is, they together give an approximate Hamiltonian, and one can simply estimate the error due to using this approximate Hamiltonian.
FIG. 2. A unary iteration circuit for selecting W^{2t−10} for t from 0 to 10. Here R indicates the inverse preparation, reflection, and preparation. As in Figure 1, only the reflection need be made controlled.
One approach would be to determine the root-mean-square sum of the errors of each of the coefficients in the Hamiltonian, as is done in [10]. Based on classical simulations we found that this overestimates the error, so we will instead perform classical calculations of the error in the energy to estimate the allowable truncations. We include all error from approximation of the Hamiltonian into ε_thc, and require that

ε_pea + ε_thc ≤ ∆E,

where ∆E is our target accuracy, which we take to be 0.0016 Hartree (chemical accuracy) for the resource estimates of this paper. We take ε_pea ≤ 0.001 Hartree and ε_thc ≤ 0.0006 Hartree. The allowable error for phase estimation in [10] was 0.0009 Hartree, but we will recalculate the cost with ε_pea ≤ 0.001 Hartree for the DF method of [10] for a fair comparison. We also recalculate the costs using the same parameters for the sparse and SF approaches.

Note that in quantum chemistry it is typical to require the difference in energy between two configurations to be determined to chemical accuracy, which naively would require accuracy of 0.0008 Hartree on each estimate of each energy. That precision is expected to be unnecessary, because the approximations made in the Hamiltonian for the two configurations have correlated errors. For the phase estimation, the errors would add if the computation were performed independently for the two configurations, but it is possible to use the control register to control a forward evolution for one Hamiltonian on one target system, and a reverse evolution on a second target system with the Hamiltonian for the second configuration. This can be combined with the improved phase estimation techniques of [14], which help ensure one is projecting into the correct states at lower cost than the entire phase estimation. The phase estimation will then provide an estimate of the energy difference.
We therefore continue to assume that the accuracy required is overall 0.0016 Hartree, which also provides consistency with prior work.
We will now discuss how to implement and compile prepare and select for our algorithm.

B. State preparation for the non-orthogonal tensor hypercontraction Hamiltonian
The method to perform the prepare step is based on expressing the Hamiltonian in a non-orthogonal basis as in Eq. (11). We rewrite the Hamiltonian as a sum of two parts, the first from T in Eq. (18) and the second from G in Eq. (14). The state we need to prepare is a superposition whose squared amplitudes are proportional to the magnitudes of the corresponding coefficients. Here the first and second registers give the superpositions over spins, and the third and fourth registers store µ and ν.
The last register takes the value M + 1 to flag that this is the first term in the Hamiltonian. For simplicity we adopt the convention that these registers are labelled starting from 1, though 1 would be stored as binary 0 in the register.
For the state preparation, we have a three-step procedure: we first prepare an equal superposition over the µ and ν registers, then perform coherent alias sampling [16], then swap the µ and ν registers controlled by an ancilla in a |+⟩ state. This means we only need to initially prepare the range µ ≤ ν, and the number of coefficients needed in the state preparation is

d = M(M+1)/2 + N/2.

In the first step we need to create an equal superposition over µ ≤ ν ≤ M+1 for registers 3 and 4, with µ ≤ N/2 for ν = M+1. The procedure needed is depicted in Figure 3.
The method is to perform Hadamards on all qubits for these registers, then perform the inequality tests ν ≤ M+1, µ ≤ ν, and (controlled on ν = M+1) µ ≤ N/2. These inequality tests have cost 4(n_M − 1), with n_M = ⌈log(M+1)⌉. We then rotate an ancilla qubit by adding a constant into a phase gradient register (as described in Appendix A of [82]), to obtain an overall amplitude for success on this qubit and the three inequality tests of approximately 1/2. This needs b_r − 3 Toffolis, using b_r bits of precision, which can typically be taken to be about 7. We can flag failure on an ancilla and perform the identity (instead of select) in the case of failure. Then we need to reflect on five registers, which we do using three Toffolis. Next, we can invert the inequality tests (which may be done without further Toffoli cost using the out-of-place adders of [83]) and invert the rotation with cost b_r − 3. Inverting the Hadamards and reflecting about zero has cost 2n_M − 1. Then we perform the Hadamards and inequality tests again with cost 4n_M − 3. Checking that the inequality tests are satisfied has cost 3, giving a total cost of 10n_M + 2b_r − 9 Toffolis.
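The stated Toffoli count is easy to tabulate; the following small calculator (function name and default are ours) evaluates it:

```python
from math import ceil, log2

def superposition_prep_toffolis(M, b_r=7):
    """Toffoli count 10*ceil(log2(M+1)) + 2*b_r - 9, as stated in the text,
    for preparing the equal superposition over the mu and nu registers."""
    n_M = ceil(log2(M + 1))
    return 10 * n_M + 2 * b_r - 9
```

For example, with M = 350 and b_r = 7 this gives 10·9 + 14 − 9 = 95 Toffolis, illustrating how weakly the cost depends on M.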
FIG. 3. The circuit for creating an equal superposition over the µ and ν registers, with n_M = ⌈log(M+1)⌉, and R_{b_r} being a Y rotation to b_r bits of precision. The state is prepared on the first two registers, the last register flags success of the entire procedure, and the fourth-last register records whether ν was equal to M+1. We use |µ⟩ and |ν⟩ to label the outputs on the top two registers, though these are in a superposition state.
Once we have prepared the equal superposition over the µ and ν registers, we can then prepare the state as shown in Figure 4. In more detail, the steps needed are as follows.
1. Use Hadamards to create the |+ states on the first two registers.

2. Create a new contiguous register from registers 3 and 4. For µ ≤ ν ≤ M this contiguous register can be given as s = ν(ν−1)/2 + µ, and for ν = M+1 as s = M(M+1)/2 + µ, where we are using the convention in these expressions that numbering starts from 1. Because we are using the convention that ν = M+1 flags the first term where µ ≤ N/2, the allowed range of values for µ and ν gives a contiguous range of values for s. The complexity of computing s is n_M² + n_M − 1, which is shown in Appendix F.

3. Use the new contiguous register to output alternate values for registers 4 and 5, a sign qubit and an alternate sign qubit (to account for the signs in the linear combination of unitaries), and keep values. The size of the output is then m = 2n_M + 2 + ℵ, where ℵ is the number of bits for the keep register. The cost of the QROM is then ⌈d/k_{s1}⌉ + m(k_{s1} − 1) for a power of two k_{s1}.

4. Perform an inequality test between the keep register and a register in an equal superposition, with cost ℵ.
5. Perform a swap controlled on the result of the inequality test, with cost 2n M . We can simply perform controlled phase gates on the sign qubits instead of explicitly swapping them, which is why we simply have the cost of swapping the µ and ν registers with the alternate registers.
6. Controlled on a |+ state, and the result of the test for ν = M + 1 from step 2, swap the µ and ν registers, with cost n M + 1. Here the +1 is the Toffoli needed to perform the AND operation on two qubits for the control.
In inverting the state preparation, the cost of the inverse QROM is reduced to ⌈d/k_{s2}⌉ + k_{s2} for a power of two k_{s2}, but the other costs are unchanged. Adding all these costs together gives the total state preparation cost, with m = 2n_M + 2 + ℵ and n_M = ⌈log(M+1)⌉. The number of bits used for the keep register was ℵ = 2.5 + log(10λ/∆E) in [10].

The error in the keep probability translates to error in the state in the following way. The squared amplitudes of the state needed are |t_p|/λ and |ζ_µν|/λ for µ ≠ ν, but |ζ_µµ|/2λ for µ = ν. These amplitudes are compared to initial squared amplitudes of 1/d in the equal superposition. Taking ℵ as the number of bits for the keep probability is equivalent to discretizing the squared amplitudes to the nearest 1/(2^ℵ d) (see Eq. (35) of [16]). One would at first expect this to lead to an error no larger than λ/(2^{ℵ+1} d) in |t_p| or |ζ_µν| for µ ≠ ν, or λ/(2^ℵ d) for |ζ_µµ|, with a factor of 2 reduction in the error because rounding gives error no larger than half the discretization size. The subtlety is that if we round all the squared amplitudes, the result will no longer be normalized. To maintain normalization it is necessary to discretize some of the amplitudes in a different way than rounding, resulting in an error larger than λ/(2^{ℵ+1} d), but still no larger than λ/(2^ℵ d). That is why the factor of 2 is not included in [16]. However, it is possible to make the typical errors smaller than λ/(2^{ℵ+1} d) by adjusting the rounding threshold away from 1/2 to maintain normalization. In practice only a very small adjustment is needed.
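The normalization-preserving rounding described here can be sketched as follows (a simple illustration of the idea, not the paper's implementation): discretize the probabilities to k_i/(2^ℵ d) by flooring, then round up the entries with the largest fractional parts until the total is restored.

```python
import numpy as np

def keep_values(p, aleph):
    """Discretize probabilities p_i to k_i / (2^aleph * d), preserving
    normalization by raising the rounding threshold for the entries with the
    largest fractional parts (the adjusted-threshold rounding in the text)."""
    d = len(p)
    T = d * 2 ** aleph                     # total number of discretization units
    scaled = np.asarray(p) * T
    k = np.floor(scaled).astype(int)
    frac = scaled - k
    shortfall = int(round(T - k.sum()))    # units lost by flooring everything
    # Round up the `shortfall` entries with the largest fractional parts.
    for i in np.argsort(frac)[::-1][:shortfall]:
        k[i] += 1
    return k, T

rng = np.random.default_rng(3)
p = rng.random(20); p /= p.sum()
k, T = keep_values(p, aleph=8)
```

The result is exactly normalized while each probability is perturbed by at most 1/(2^ℵ d), matching the error bound discussed above.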
FIG. 4. The state preparation after preparation of the equal superposition state in Figure 3. Here n_d = ⌈log d⌉, and we use |µ⟩ and |ν⟩ to label the input registers that are in a superposition. The modifications to this state preparation relative to that in [16] are that a contiguous register is calculated at the beginning, and at the end the µ and ν registers are swapped controlled on a |+⟩ state and on ν not being equal to M+1, which is contained in a register output from the procedure for creating the equal superposition. Since we simply need to perform a Z gate on the sign qubit θ, instead of explicitly performing a controlled swap with the alt value, we can perform two controlled phase operations, thereby eliminating one non-Clifford gate. These controlled phase gates need not be performed when inverting the state preparation.

C. Hamiltonian selection oracle for the non-orthogonal tensor hypercontraction Hamiltonian
For the select operation, we have two steps. In the first we need to perform the operation U_µ or U_{T,µ} for the two-electron or one-electron terms, respectively. In the second we only need to perform U_ν. To perform these operations, we can use a QROM to generate a sequence of N/2 rotations, then perform the rotations using the method given in [10]. The costs are as follows.
1. There is a cost of N/2 to perform the swaps controlled on the spin qubit.
2. The dominant cost is the QROM. The output size is large, so it is most efficient to perform the QROM as in [16]. In the first step (for µ) we have M + N/2 sets of data to output, which has complexity M + N/2 − 2. Note that we need to output one set of rotations if the ν register is in the state |M+1⟩, or a different set if the ν register is not in that state. The value ν = M+1 is still flagged in an ancilla. In the second step (for ν) we only need to generate the rotations for the two-electron terms, so the cost is M − 2.
3. We can perform the rotations with cost N(ℶ − 2) by adding the rotations into the phase gradient state. Here ℶ is the number of bits of precision used for the rotation angles.
4. The Z operation has no Toffoli cost. It needs to be controlled on the qubit which is the result of testing whether ν = M + 1 for the case where we apply the operations for ν.
5. Invert the rotations, again with cost N(ℶ − 2), by subtracting them from the phase gradient state.

6. To erase the QROM, in the case of µ we need to subdivide the bits into the more- and less-significant bits, where the more-significant bits include that distinguishing the case ν = M+1. We have a cost of ⌈M/k_{r1}⌉ for the case ν < M+1, a cost of ⌈N/(2k_{r1})⌉ for the case ν = M+1, and a cost of k_{r1} for the remaining k_{r1} less-significant bits, for a total of ⌈M/k_{r1}⌉ + ⌈N/(2k_{r1})⌉ + k_{r1}. For the case of ν, we only have to perform the QROM on M possible values, giving a cost of ⌈M/k_{r2}⌉ + k_{r2}.

7. Perform the swaps controlled by the ancillas (for spin) again, with cost N/2.
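The trade-off in the erasure cost ⌈M/k⌉ + k can be illustrated with a small calculator (function names are ours); the optimum sits near k = √M:

```python
from math import ceil

def qrom_erase_cost(M, k):
    """Toffoli cost ceil(M/k) + k for erasing a QROM over M items with
    k-way output splitting, as in the per-case counts above."""
    return ceil(M / k) + k

def best_power_of_two_k(M):
    """Pick the power-of-two k minimizing the erasure cost."""
    ks = [2 ** j for j in range(M.bit_length())]
    return min(ks, key=lambda k: qrom_erase_cost(M, k))
```

For M = 1024 the best power-of-two choice is k = 32, giving a cost of 64 rather than the O(M) cost of the forward QROM; this square-root behavior is why the erasure terms are subdominant.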
This procedure needs to be performed twice, once for µ and once for ν. The complete procedure is shown in Figure 5. A subtlety is that the operation needs to be controlled by success of the state preparation, which can be achieved simply by making the Z operation controlled, which is a Clifford gate, so does not add to the Toffoli cost. Another subtlety is that for ν, we perform no operation if ν = M + 1. That can be achieved simply by making the Z operation controlled by the qubit output by the result of the equality test as well, requiring just one more Toffoli gate.
A major subtlety is that the select operation needs to be made self-inverse, and the operation as depicted is not. This problem can be averted by using a NOT operation on the register controlling the swap of the µ and ν registers. To see why this is so, consider the prepared state (ignoring the one-electron term for simplicity), with amplitudes on |µ⟩|ν⟩ given by the coefficients needed for the asymmetric state. Introduce the ancilla in the |+⟩ state and perform a controlled swap of the µ and ν registers. Then, denoting V_µ = U†_µ Z_{1,α} U_µ and similarly for ν, performing the controlled operation on the target system in state |ψ⟩ applies V_µ on one branch of the ancilla and V_ν on the other.

FIG. 5. The circuit for performing the controlled operations (select) for the linear combination of unitaries. The registers from top to bottom are the success flag register, the µ and ν control registers, the qubit flagging that ν = M+1, a blank ancilla to put the database of rotations in, two |+⟩ states to provide the superposition over the operations on the spin-up and spin-down components of the state, and the qubits representing the spin-down and spin-up orbitals. The controlled R indicates the basis change obtained from the sequence of rotations with angles provided in the database, equivalent to U_{u,0} U_{u,1} in [10], as depicted in Eqs. (58) and (59) of that work.

Now perform an X on the ancilla qubit and swap the µ and ν registers. Then, again performing the controlled operation on the target system, performing the controlled swap again, and finally projecting onto |+⟩ for the ancilla qubit, gives the desired sum of operations V_µ V_ν + V_ν V_µ, and the construction is self-inverse. As shown in Figure 6, the form of the circuit we have just described can be simplified so that there are just controlled swaps performed on the µ and ν registers at the beginning and the end.
Those controlled swaps correspond to the controlled swap at the end of Figure 4, with the controlled swap at the end corresponding to that done when inverting the state preparation.
To be more specific, for the complete select operation we can include the controlled swaps and perform the operation as shown in Figure 7. For that form of the select operation, the controlled swap at the end of state preparation in Figure 4 would not be performed, because it is bundled into the select operation. As can be seen from Figure 7, that form of the operation is more clearly self-inverse, though there is the subtlety that we have the qubit flagging ν = M+1 for the one-electron component of the Hamiltonian. In the case ν ≠ M+1, the circuit is completely symmetric. In the case ν = M+1, only the left half of the circuit (shown in the dotted box) is performed, which is itself clearly self-inverse due to symmetry. The swap of the µ and ν registers is shown here as controlled by the register flagging ν = M+1, which would require more Toffoli gates. However, this form is just used to illustrate that the circuit is self-inverse, and that controlled swap can be moved to the end and combined with the other controlled swap, resulting in no more Toffolis being needed. There is one more Toffoli needed in order to apply the controlled swap between the two ancilla qubits in the |+⟩ state which give the spin. As a result, the total complexity for the select operation is only two Toffolis more than described in the list above (the second Z operation must be made controlled).

FIG. 6. This shows that the procedure for generating the two-electron terms is self-inverse. The self-inverse controlled operations V are equivalent to V_µ = U†_µ Z_{1,α} U_µ, or V_ν, but we omit the dependence on µ and ν because it is in a superposition of being dependent on both. On the left is the form of the circuit as we described it in Eqs. (35) to (41), which is obviously self-inverse because of the symmetry of the circuit. On the right is a simplified form of the circuit.
FIG. 7. The circuit for performing the controlled operations (select) for the linear combination of unitaries in a form that is more clearly self-inverse. The registers from top to bottom are the success flag register, the µ and ν control registers, the qubit in the |+⟩ state to control swapping the µ and ν registers, the qubit flagging that ν = M+1, a blank ancilla to put the database of rotations in, two |+⟩ states to provide the superposition over the operations on the spin-up and spin-down components of the state, and the qubits representing the spin-down and spin-up orbitals. The dotted box shows the only part that is applied when ν = M+1. The controlled R is the rotation of the Majorana basis, equivalent to U_{u,0} U_{u,1} in [10]. The registers labelled |ψ_↓⟩ and |ψ_↑⟩ correspond to the qubits for the spin-down and spin-up orbitals. The operation Z_1 acts on only one of the qubits of the |ψ_↓⟩ register, similar to the circuit diagram labelled (57) in the Supplementary Material of [10].

D. Overall costs of the non-orthogonal tensor hypercontraction Hamiltonian simulation
In order to construct the entire step of the quantum walk, we also need a reflection on the ancillas used for the control state. We need 2n M + ℵ + 4 qubits, which are as follows.
1. We need 2n M for the µ and ν registers.
2. The equal superposition state for the coherent alias sampling needs ℵ qubits.
3. One qubit that controls the swap of the µ and ν registers.

4. One ancilla that is rotated to ensure the amplitude amplification for the equal superposition state preparation works.
5. Two qubits encoding the superposition over spins.
The complexity needed to perform a reflection on this many qubits is 2n M + ℵ + 2. Another cost needed for each step is to perform a step of the unary iteration on the control register. This has a cost of another single Toffoli for each step. The output of that unary iteration needs to be used to control the reflection about the ancilla as well, which adds another single Toffoli. Another single Toffoli cost is the controlled swap of the spin registers in Figure 7.
Altogether, that gives an additional cost of 2n_M + 5 for each step, in addition to the cost of the prepare and select operations. The total complexity for a single step is then obtained by adding these contributions, with m = 2n_M + 2 + ℵ and n_M = ⌈log(M+1)⌉. For the total cost, this needs to be multiplied by the number of iterations needed for the phase estimation, I = ⌈πλ/(2ε_pea)⌉. Next we consider the costs for the number of logical qubits.
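The iteration count can be sketched as follows (the λ and ε_pea values below are illustrative placeholders, not the paper's FeMoCo numbers):

```python
from math import ceil, pi

def pea_iterations(lam, eps_pea):
    """Number of walk-step repetitions ceil(pi * lambda / (2 * eps_pea))
    required by the phase estimation."""
    return ceil(pi * lam / (2 * eps_pea))

# Hypothetical example: lambda = 300 Hartree, eps_pea = 0.001 Hartree.
iters = pea_iterations(300.0, 0.001)
# Total Toffolis would then be iters times the per-step cost derived above.
```

This makes explicit why reducing λ (the central goal of the THC representation) directly reduces the total Toffoli count, since the per-step cost is only logarithmically sensitive to the other parameters.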
1. The control register for the phase estimation needs ⌈log(I + 1)⌉ qubits. The unary iteration to control on this register needs another ⌈log(I + 1)⌉ − 1 qubits.
2. N qubits are used for the system register.
3. There are 2n_M qubits used for the µ and ν registers.
4. There is an ancilla register with ℵ bits that is placed in an equal superposition.
5. We need another qubit for the |+⟩ state that controls the swap of the µ and ν registers.
6. In preparing the equal superposition state we also need to perform a rotation on a single ancilla.
7. There are 2 qubits used for the spin registers in the prepared state.
8. One qubit is used for flagging success of the inequality tests in the state preparation.
9. One qubit is used to flag whether ν = M + 1.
10. There are qubits for the phase gradient state, one for each bit of precision used for the rotations.
11. The contiguous register has size ⌈log d⌉.
12. The QROM has a qubit cost (including the output) of mk_{s1} + ⌈log(d/k_{s1})⌉. Of these, m(k_{s1} − 1) + ⌈log(d/k_{s1})⌉ are temporary ancillas which can be reused later. That is, there are ⌈log(d/k_{s1})⌉ temporary ancillas for the QROM, and m(k_{s1} − 1) extra outputs that may be erased by measuring in the X basis. If the costs in the following steps are smaller, then we can ignore them and use the cost of the QROM here instead.
13. The size of the data output for the rotations is N/2, and there will be ⌈log M⌉ temporary ancillas as well.
14. When adding into the phase gradient state for the rotations, the number of temporary qubits is two less than the number of bits used for the rotations; this will be larger than ⌈log M⌉ in the previous step for the systems of interest.
15. The ancilla costs for erasing the QROM can be ignored because they can reuse ancillas that were previously erased.
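The enumerated qubit costs can likewise be totalled programmatically. The sketch below (our own helper, not code from the paper) sums items 1-13; items 14-15 reuse ancillas already counted. The argument names, and in particular the parameter b_grad for the size of the phase gradient register, are our labels rather than notation from the text:

```python
import math

def logical_qubit_tally(N, M, I, d, aleph, b_grad, k_s1):
    """Upper-bound tally of the logical qubits enumerated above."""
    n_M = math.ceil(math.log2(M + 1))
    m = 2 * n_M + 2 + aleph                      # QROM output size per entry
    total = math.ceil(math.log2(I + 1))          # 1. phase estimation control...
    total += math.ceil(math.log2(I + 1)) - 1     #    ...and its unary iteration
    total += N                                   # 2. system register
    total += 2 * n_M                             # 3. mu and nu registers
    total += aleph                               # 4. equal-superposition ancilla
    total += 1 + 1 + 2 + 1 + 1                   # 5-9. swap ctrl, rotation ancilla,
                                                 #      spins, success flag, nu flag
    total += b_grad                              # 10. phase gradient state
    total += math.ceil(math.log2(d))             # 11. contiguous register
    total += m * k_s1 + math.ceil(math.log2(d / k_s1))  # 12. QROM output + temporaries
    total += N // 2 + math.ceil(math.log2(M))    # 13. rotation data + temporaries
    return total
```

Because temporary ancillas overlap in time, this is a simple upper-bound tally rather than a peak-usage analysis.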

E. Error metrics for approximate tensors
The evaluation of Hamiltonian approximation errors, i.e., ϵ_THC, is critical for precisely estimating the resource requirements of any quantum simulation method based on tensor factorizations, including tensor hypercontraction and the low-rank methods. Moreover, to fairly compare against other methods such as the single factorization (SF), double factorization (DF), and sparse methods, one should use a consistent error metric for all. We take the perspective that one should estimate the error in the exact ground state energy incurred by approximating the Hamiltonian matrix elements, since preparing ground states is very often the goal. Obviously, estimating the error in the exact ground state energy is not feasible for problems whose ground state computations are not classically tractable. We list below the desired properties that a good error metric should satisfy:
1. An error metric should be classically tractable to evaluate, so that computing it for a handful of medium-sized systems is relatively straightforward.
2. An error metric should be size-extensive and size-consistent. Size-extensivity and size-consistency are important properties of the exact ground state wavefunction and energy [84,85]. Violating these properties often results in poor approximations of the exact ground state wavefunctions, so they are important properties to preserve. Size-extensivity asserts that the asymptotic scaling of the energy is linear in system size, and size-consistency asserts the product separability of wavefunctions and the additivity of energies when two subsystems are infinitely far apart. Therefore, any error metric that attempts to bound the ground state energy should satisfy these two properties.
3. An error metric should provide a reliable bound on the exact ground state energy error and also be well correlated with the actual error in the energy. By well correlated, we mean that improving the error metric should lead to a corresponding improvement in the ground state energy error. This criterion is generally very difficult to meet and also to verify because, in general, we do not know the exact ground state energy.
We summarize existing error metrics here and discuss which one is most suitable for our purposes. We refer to the approximate integral tensor as Ṽ and keep the discussion of error metrics as general as possible so that it can be applied to any form of approximation method. In von Burg et al. [10], two error metrics were introduced and advocated: (1) the coherent error (ϵ_co) and (2) the incoherent error (ϵ_in). These are the 1-norm and 2-norm measures of the difference between the exact and approximate integral tensors; i.e.,

$\epsilon_{\rm co} = \sum_{pqrs} |V_{pqrs} - \tilde{V}_{pqrs}|, \qquad \epsilon_{\rm in} = \Big(\sum_{pqrs} (V_{pqrs} - \tilde{V}_{pqrs})^2\Big)^{1/2}.$

The incoherent scheme provides a rigorous bound on the shift in the ground state energy, but we find these bounds to be too loose. In the examples considered here, they are not well correlated with the error in the ground state energy despite bounding that error. For instance, a slight improvement in ϵ_in can lead to a drastic improvement in the ground state energy estimate. More importantly, both ϵ_co and ϵ_in scale up to quartically with system size (at the very least super-linearly in general), which grows far too quickly with system size and violates the size-extensivity criterion. Furthermore, ϵ_in is not size-consistent. We believe that the violation of size-extensivity and size-consistency is what makes these error metrics often poorly correlated with the error in the exact ground state energy. Berry et al. [9] advocated error metrics based on the classical quantum chemistry methods known as second-order Møller-Plesset perturbation theory (MP2) and configuration interaction with singles and doubles (CISD). The MP2 correlation energy error metric can be thought of as a weighted signed difference between the two tensors (Eq. (47)), where w_{pqrs} is a positive weight defined from the orbital energy differences for each quartet (p, q, r, s). It can be rigorously shown that Eq. (47) satisfies size-consistency and size-extensivity since the MP2 correlation energy satisfies these [85].
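Since the coherent and incoherent errors are simply the 1- and 2-norms of the integral tensor difference, they are straightforward to compute. A minimal NumPy sketch (our illustration, not code from the paper):

```python
import numpy as np

def coherent_error(V, V_approx):
    """eps_co: 1-norm of the difference between integral tensors."""
    return np.abs(V - V_approx).sum()

def incoherent_error(V, V_approx):
    """eps_in: 2-norm (Frobenius norm) of the difference."""
    return np.sqrt(((V - V_approx) ** 2).sum())
```

Both reduce the full four-index difference tensor to a single scalar, which is why they can scale so steeply (up to quartically) with system size.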
We note that the correlation energy in a finite basis is defined as the energy difference between an approximate method and a mean-field approach (Hartree-Fock) in the same basis [84]. Unfortunately, the MP2 correlation energy behaves quite erratically in many cases [86] so the error bound based on the MP2 correlation energy may not be well correlated with the actual error in the exact ground state energy. Since there are other classical approaches that go beyond MP2, one may consider other options as well. The CISD correlation energy error metric is the option that Berry et al. [9] chose. Unfortunately, the CISD correlation energy is neither size-extensive nor size-consistent. Consequently, in the limit of infinite system size, the CISD correlation energy approaches zero.
In this work, we advocate the use of an error metric based on the correlation energy error in the coupled-cluster with singles, doubles, and perturbative triples (CCSD(T)) approach [87]. CCSD(T) is the so-called "gold standard" method in classical simulations of quantum chemistry. While its quantitative accuracy on strongly correlated systems is often doubtful, it is still the best classical method that satisfies criteria (1) and (2) above. Whether it meets criterion (3) is again difficult to assess, but for Hamiltonians for which CCSD(T) is qualitatively accurate, it is reasonable to expect that (3) is satisfied. Furthermore, CCSD(T) contains the MP2 wavefunction contribution, in the sense that some diagrams contained in CCSD(T) exactly correspond to those of MP2. Therefore, we think that measuring the errors made by approximating the Hamiltonian elements based on CCSD(T) is representative of classical simulation methods. Ultimately, it is possible that the errors in the CCSD(T) correlation energy are not well correlated with the errors in the exact ground state energy. We leave a more precise understanding of these error metrics for future study and focus on comparing different approaches based on the CCSD(T) correlation energy error. For strongly correlated Hamiltonians, one can consider high-spin states, for which single determinant wavefunctions provide a good description. In this case, the CCSD(T) energy should be well correlated with the energies of the exact high-spin eigenstates. As long as the underlying integral approximation does not make assumptions about the spin state (which is the case for all approximations considered here), this facilitates a fair comparison between the methods discussed in this work.
In addition to the inherent factorization error, it is useful to further account for the error made by approximating the Hamiltonian coefficients (ζ) as part of coherent alias sampling, and the individual qubit rotations via χ. This is achieved by first making approximate representations of χ and ζ for fixed numbers of bits used to represent these tensors. That is, χ is represented by a sequence of rotations with angles {θ_p^{(µ)}}, and these angles are approximated with the limited precision available. More specifically, a rotation vector χ^{(µ)} is parametrized by a set of angles {θ_p^{(µ)}} which are defined recursively [10]. Next, we define units for θ and ζ, where u_θ is the unit for θ, u_ζ^o is the unit for off-diagonal elements of ζ, and u_ζ^d is the unit for diagonal elements of ζ. Finally, we write the approximate χ and ζ as χ̃ and ζ̃, respectively, where x is a small constant introduced to ensure the normalization of |ζ̃⟩. We employed a sextic polynomial fit for x ∈ [−1, 1] to find the optimal x that normalizes |ζ̃⟩. The resulting λ of |ζ̃⟩ was normalized to 1 with a small error on the order of 10⁻⁶. These can be used to build an approximate representation G̃ for G. In addition to evaluating the CCSD(T) correlation energy error using G, we also evaluate the error using G̃ to provide more precise resource estimates.
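The effect of truncating coefficients to a fixed number of bits can be mimicked with a simple fixed-point rounding. The sketch below is our illustration, assuming a hypothetical unit of 2^(1−bits) on [−1, 1]; the paper's actual units u_θ, u_ζ^o, and u_ζ^d differ per tensor, but the quantization step is of this form:

```python
import numpy as np

def quantize(values, bits):
    """Round values in [-1, 1] to a fixed-point grid with `bits` bits.

    The unit (grid spacing) is 2**(1 - bits); this mirrors representing
    each coefficient by a signed integer multiple of a fixed unit.
    """
    unit = 2.0 ** (1 - bits)
    return np.round(np.asarray(values) / unit) * unit
```

Evaluating the CCSD(T) error metric with quantized angles and coefficients is what determines the 10-20 bit precision requirements quoted below.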

IV. RESOURCE ESTIMATES FOR REAL SYSTEMS
A. Resource estimates for simulating active space models of the FeMoCo molecule

In keeping with the tradition of the work by Reiher et al. [23], Berry et al. [9] and von Burg et al. [10], here we benchmark our methods on the problem of simulating the ground state of active space models of FeMoCo. While this is not the only quantum chemistry problem worth studying on a quantum computer, it is easier to compare to prior methods if we study the same molecule. We look forward to also deploying our method on some of the catalysis Hamiltonians studied by von Burg et al. [10] as soon as they are made available by those authors. Unfortunately, the first of these papers, by Reiher et al., used an active space for FeMoCo that was later found to be problematic for ground state simulation by Li et al. [35]. The ground state of the active space proposed by Reiher et al. does not capture the open-shell nature (i.e., strong correlation) of the FeMoCo model cluster, and therefore the gold standard CCSD(T) calculation was found to be off by only 5 milliHartree from the near-exact density matrix renormalization group (DMRG) energy [35]. Therefore, in their work, Li et al. proposed a new active space model of FeMoCo that does capture this strong correlation; we consider both Hamiltonians here.

In Figure 8 we analyze the dependence of the shift in the ground state energy on the THC rank M. For the Reiher et al. Hamiltonian, as mentioned above, the gold standard CCSD(T) is quantitatively accurate, so it is quite sensible to assess the error of approximate integral tensors based on the error in the CCSD(T) correlation energy. The ground state computation involves N = 108 qubits and 54 electrons, and the target spin state is S = 0 (singlet). We employ spin-restricted Hartree-Fock orbitals for this Hamiltonian. As we can see, we achieve chemical accuracy for M ≥ 250.
For the Li Hamiltonian, there are 152 spin orbitals and 113 electrons in this active space model. For this Hamiltonian, CCSD(T) is no longer quantitatively accurate for low-spin states (such as the ground state S = 3/2) due to its inability to capture the antiferromagnetically-coupled open-shell nature. However, if we consider a high-spin state such as S = 35/2, CCSD(T) with spin-unrestricted Hartree-Fock orbitals should be quantitatively accurate compared to the exact energy of the S = 35/2 state. Furthermore, there is no aspect in the THC factorization (or other methods compared in this work) specialized for a specific spin state. Therefore, we believe that the CCSD(T) error metric for the S = 35/2 is well correlated with the actual error made by integral approximations in the exact S = 3/2 ground state energy. As shown in Figure 8, THC achieves chemical accuracy for M ≥ 350.
For both Hamiltonians, we assess the CCSD(T) correlation energy error with approximate rotation angles and ζ. We found that 10 bits for state preparation and 16 bits for rotations were enough for the Reiher Hamiltonian, whereas 10 bits for state preparation and 20 bits for rotations were needed for the Li Hamiltonian. The resulting CCSD(T) correlation energy error, λ, Toffoli count, and logical qubit count are available in Table IV for the Reiher Hamiltonian and Table V for the Li Hamiltonian.

In order to study the scaling of these methods, in this section we analyze a chemical series consisting of hydrogen chains. We use the same chemical series as the one studied in [9] in order to facilitate a direct comparison with that work: one-dimensional hydrogen chains with a spacing of 1.4 Bohr. For the finite size scaling, we use the STO-6G minimal basis set and grow the system by adding more hydrogens. Denoting the number of hydrogens in a chain by N_H, we study the chemical series from N_H = 10 to N_H = 100. When estimating thermodynamic asymptotes, it is often convenient to fix the error per particle as opposed to the total error for every system size [88]. Indeed, the most widely used integral factorization, the resolution-of-the-identity approximation, typically yields an error per atom of 50-60 µHartree [88]. We follow a similar standard in this section. For the hydrogen chain calculations, we used a threshold of 5 × 10⁻⁵ for the sparse method, 0.1 for the single factorization method using the modified Cholesky factorization [89] (which yields roughly 2 Cholesky vectors per H atom), and 0.01 for the double factorization method (see Eq. (C41)). For THC, we fixed the THC rank to be 7 × N_H. These thresholds were enough to keep the CCSD(T) correlation energy error per atom below 50 µHartree, as shown in Table VI.
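From such a chemical series, the asymptotic scaling exponent of a quantity like λ can be extracted by a linear fit on log-log axes. A minimal sketch (our code, not the paper's fitting script):

```python
import numpy as np

def power_law_fit(sizes, values):
    """Fit values ~ c * sizes**alpha by linear regression in log-log space.

    Returns (alpha, c); alpha is the slope of the log-log fit.
    """
    slope, intercept = np.polyfit(np.log(sizes), np.log(values), 1)
    return slope, np.exp(intercept)
```

For example, data generated exactly as 2·N^1.5 recovers alpha = 1.5 and c = 2 to numerical precision.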
For H₄, we used the cc-pVTZ basis set [90] to obtain V and considered truncated Hamiltonians in the molecular orbital basis with the number of orbitals varying from N/2 = 10 to 56. The number of Cholesky vectors used in the SF method was fixed at 300 and the THC rank was held at 576. This was enough to maintain an accuracy of 1-10 µHartree for N/2 greater than 35 in all methods. With respect to the basis set size, the data size for state preparation scales cubically with basis set size for the SF method and quadratically for THC [71]. Therefore, we focus on the scaling of λ for the SF and THC methods. We used a threshold of 5 × 10⁻⁵ for the sparse method and 1 × 10⁻⁴ for the DF method. On the other hand, the data size of the sparse and DF methods varies with system, so we obtain a representation for each instance of the truncated Hamiltonian. This gives us the scaling of data size with respect to the basis set size. In all systems, the sparse method λ values were evaluated with localized orbitals obtained using the Edmiston-Ruedenberg method [91] to exploit spatial locality. All other methods used canonical Hartree-Fock orbitals.
Values of λ for each method are presented in Figure 9. In (a), we have the hydrogen chain data, from which we can infer the thermodynamic size scaling of the different approaches. In (b), we have the H₄ two-body λ data, from which we can find the continuum limit scaling of the different approaches. Since H₄ is a very small system, the one-body contribution is comparable to the two-body contribution to λ. However, as we are only interested in asymptotes, we focus just on the two-body contribution because asymptotically the two-body term dominates λ. These scalings of λ are used to generate the Toffoli complexity scaling data in Table II. (We aimed to reach 10-50 µHartree for all methods but SF; for the SF method, the largest error below 50 µHartree per H atom that we could find was about 9 µHartree.) For the space complexity scaling, we need additional information for the sparse method (i.e., the number of non-zero elements) and the DF method (i.e., the average number of eigenvectors in the second factorization). These are provided in Appendix A and Appendix C, respectively. We perform linear fits to obtain the asymptotic scaling of each method. The results of these linear fits are summarized in Table VII and Table VIII, and those slopes are used to compute the scalings listed in Table II.

We now come to the problem of determining how much physical space and time will be used by our computations if we plan to realize them using quantum error-correction. Producing these estimates requires making assumptions about the performance characteristics of the physical qubits and the classical control system making up the quantum computer. We will assume a physical gate error rate of 0.1%, a control system reaction time of 10 microseconds, and a surface code cycle time of 1 microsecond. We will also consider the more optimistic gate error rate of 0.01% to give a point of comparison.

Space and time constraints
In surface code quantum computations, space-time tradeoffs are ubiquitous. A good first approximation is that a quantum computation is like a liquid that can be poured into any desired spacetime volume. For example, often one can (roughly) halve the amount of time taken by a computation by doubling the number of magic state factories. Of course, the "incompressible liquid computation volume" approximation does break down when pushed enough. For example, as the space available is reduced, the contortions needed to fit the computation become more and more challenging, and the time has to increase disproportionately to accommodate them. There is also a minimum number of qubits needed to store the system being simulated, which no amount of contortion will get below. There are also constraints that prevent the time of a quantum computation from being reduced arbitrarily far, such as the code distance d and the reaction time t. To give the reader a better sense of what is driving the layout decisions we have made, we will discuss these two constraints in more detail.
The code distance d dictates how many times a local physical stabilizer has to be measured before it is possible to error correct those measurements to a desired level of reliability. We will call the amount of time it takes to run d rounds of the surface code cycle a "beat". Any surface code construction that involves changing which stabilizers are being measured will (usually) take at least one beat to complete, since that is how long it takes to become confident about the values of the new stabilizers.
The reaction time t measures how long it takes for the classical control system to receive the raw physical measurements corresponding to a logical measurement, error correct them to recover the logical measurement, and trigger a following logical measurement that is either in the X or Z basis depending on the value of the previous logical measurement. This quantity becomes relevant when performing non-Clifford operations via magic state distillation and gate teleportation. The gate teleportation will apply the desired operation, but will also randomly apply other Clifford operations that need to be undone by corrective Clifford operations. If the corrective Clifford operations needed to finish a gate teleportation are applied "in place", by waiting for the correction to be known and then applying operations to the target qubits, then the correction process would have a time measured in beats. This is a strategy we have used in the past when laying out surface code computations [16]. However, when using this strategy, it is difficult for the computation consuming magic states to keep pace with even a single magic state factory. A more time efficient strategy is to precompute multiple world-lines, one for each possible Clifford correction, and selectively teleport the affected qubits through the appropriate world line [27,92,93]. The selective teleportation can be controlled by measuring certain qubits in either the X or Z basis. Because logical X and Z measurements are implemented transversally, the time they take does not depend on the code distance, and so they are not beat limited. Instead, the limiting time is how quickly the control system can process the deluge of data coming at it and iteratively figure out what the next basis to measure in is.
It is not always feasible to execute one non-Clifford operation per reaction time. For example, there might not be enough magic state factories to produce one magic state per reaction time. To account for this, we will introduce another time unit: the "tick". A tick is either the reaction time of the control system or the average period between the production of magic states; whichever is slower.
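Under the definitions above, a beat is d surface code cycles, and a tick is whichever is slower of the reaction time and the average period between magic states. A sketch of these time units (our own helpers, with illustrative parameter values):

```python
def beat_seconds(d, cycle_time):
    """One beat: d rounds of the surface code cycle."""
    return d * cycle_time

def tick_seconds(reaction_time, factory_period, n_factories):
    """One tick: the slower of the reaction time and the average
    period between magic states from n parallel factories."""
    return max(reaction_time, factory_period / n_factories)
```

With the parameters used later (d = 31, 1 µs cycles, a factory depth of 5d cycles, four factories, and a 10 µs reaction time), a beat is 31 µs and a tick is 155/4 ≈ 38.8 µs; the computation is then factory limited rather than reaction limited.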
Optimally packing a quantum computation often involves balancing tick-based constraints and beat-based constraints. For example, the driving force of a QROM read is unary iteration sequentially preparing qubits flagging whether or not the address register was equal to each possible address value [16]. Unary iteration is made up of a series of interdependent Toffolis; it is primarily tick-constrained. On the other hand, the flag qubits prepared by unary iteration are used as controls for huge multi-target CNOTs reaching into the target data registers. These CNOTs are done via lattice surgery, which involves changing which stabilizers are being measured, and so the CNOTs are primarily beat-constrained. Optimally packing a QROM computation into spacetime requires balancing the beat-based constraints from the multi-target CNOTs and the tick-based constraints from the unary iteration.

QROM tradeoffs
Suppose we lay out data qubits into columns with every third column left empty as an access hallway. In this layout, each data qubit has a single side exposed to a hallway. We will call this layout "single side storage". When using single side storage, a multi-target CNOT will monopolize the single side of its target qubits for one beat. A basic QROM read performs one multi-target CNOT per possible address value, and therefore when using single side storage a basic QROM read with n addresses will take at least n beats to complete.
If n beats is too slow (e.g. if it is significantly slower than the tick-limited unary iteration), we could instead use a storage layout where every second column was left empty as an access hallway. In this "double side storage" layout, every qubit has two exposed sides. This allows a second CNOT to start before the first has finished, by targeting the second side, lowering the minimum time required for an n-address QROM read from n beats to n/2 beats. If n/2 beats is still too slow, alternative QROM circuits can reduce the number of multi-target CNOTs by an "output expansion factor" k < √ n by using k times more workspace qubits [9,36]. These alternative circuits also lower the number of Toffoli gates required, reducing the tick-based lower bound from the unary iteration.
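These beat-based lower bounds are easy to express programmatically. The helper below is our illustration of the tradeoffs just described, not a formula from the paper:

```python
import math

def qrom_min_beats(n_addresses, k=1, double_sided=False):
    """Beat-based lower bound for a QROM read (illustrative).

    An output expansion factor k < sqrt(n) cuts the number of
    multi-target CNOTs to about ceil(n/k), at the price of k times
    more workspace qubits; double side storage lets two CNOTs
    overlap, halving the bound again.
    """
    cnots = math.ceil(n_addresses / k)
    return math.ceil(cnots / 2) if double_sided else cnots
```

For a 1000-address read, single side storage needs at least 1000 beats, double side storage 500, and an expansion factor of 16 brings the single-sided bound down to 63.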
There are other possibilities for optimizing the packing of a QROM read, but for the physical parameter regime we are interested in there are three relevant design choices. (1) Should we use single or double sided storage? (2) How many magic state factories should there be? (3) What should the expansion factor k be, for each QROM read? After some experimentation and iteration, we settled on using single side storage, four magic state factories, and an expansion factor of 16 for the QROM read during the prepare subroutine and 1 for the QROM reads during the select subroutine. These decisions were based on looking at the algorithm's resource utilization at a global level, for various possible choices. We will now describe how that was analyzed.

Diagram driven decisions
When laying out a quantum algorithm in the surface code, there are two diagrams we recommend making. The first diagram is a floorplan, or footprint, of where the various parts of the computation will live. Our final floorplan diagram is in Figure 10. The goal of the floorplan diagram is not to find a perfectly optimal layout, but rather to get a rough idea of how things will fit together and how much space will be needed to ensure it is possible to route data and magic states into operating areas quickly enough to keep the computation going.
The second useful diagram to make is an inventory of allocated qubits over time, which shows how many qubits are allocated as the algorithm progresses and also what they are being used for. Our final diagram of this type is in Figure 11. This diagram can reveal spikes where too many qubits are being used, and holes where a lot of qubits are available for use (e.g. as additional magic state factories).
The reason these diagrams are useful to make is that understanding resource utilization allows one to make optimizations. For example, elsewhere in this paper we note that the phasing operations during the select subroutine will use angles loaded from QROM. Originally, we intended for all the angles to be loaded simultaneously from one QROM read. This avoids redundant QROM reads from the same address register and correspondingly minimizes the number of Toffolis. However, when plotting allocated qubits over time as in Figure 12, it became obvious that loading the angles is much faster than using them, and so loading half of the angles at a time would not cost much more overall. The benefit of loading half of the angles at a time is that half as many output qubits are needed. This is a substantial space savings for an acceptable time loss.

FIG. 10. Floorplan diagram [35]. Covers 53 × 36 = 1908 logical qubits. The code distance is 31. The number of physical qubits is 1908 · 2 · 32² ≈ 4 · 10⁶. Red areas are the CCZ factory from [93], with a level 1 code distance of 19 and a level 2 code distance of 31. Green areas are for the AutoCCZ fixup box from [93]. Purple areas are for data qubit storage. White areas are for routing, performing Clifford gates, and teleporting magic states into data qubits.
Because of the smaller number of qubits available, we also had to adjust the output expansion factor of the QROM read during the prepare subroutine. We reduced it from 32 to 16, which again increases the Toffoli count slightly but substantially reduces the workspace needed. Contrast the original allocation plan in Figure 12 with the more space-efficient allocation plan in Figure 11. In short, because we made a diagram showing allocated qubits over time, giving us a global view of our resource usage, we spotted a reasonably simple optimization that reduced the number of data qubits by 40%. This is why we recommend making these diagrams.

Routing and distilling
We now consider the routing of data during the phasing operations within the select subroutine. These phasing operations are implemented by adding qubits output from a QROM read into a phase gradient register, controlled by a target qubit. This process creates an amount of phase kickback proportional to the amount read from QROM, and this phase kickback acts on the target qubit due to it being used as a control. Because the phase gradient register is used for every single phasing operation, the phase gradient register should be moved out of the storage area and kept in the operating area. The data being added into the phase gradient register then needs to be gradually streamed out of the storage area as the phasing operations are applied. The majority of this data is the data that was produced by the QROM read. In order to avoid traffic jams, it is important that data that will be needed at the same time be placed down separate hallways. The QROM circuit construction gives us full control over where the target data lives, so this is not a difficult constraint to meet.
In the past, we have had trouble laying out quantum computations in a way that could consume incoming magic states quickly enough [16]. This is because the layout strategy we were using involved performing the probabilistic Clifford operations resulting from this process directly on the qubits being acted on by the magic states. We are now using a different strategy, based on using AutoCCZ states which attach the corrections to the magic state instead of to the target qubits [27,92,93]. When using AutoCCZs, the corrective operations can be packed into spacetime far away from the data qubits being operated on, allowing the computation to progress at a reaction-limited rate instead of a beat-limited rate. This removes a significant constraint we were previously operating under. Instead of struggling to keep pace with one magic state factory, we could now in principle keep pace with ten or more (if we were willing to pay the qubit cost of operating so many factories).
The magic state factory we will be using is the CCZ factory from [93]. Note that results from [94] suggest the factory can be reduced in size, due to the distillation process heralding topological errors. We have not incorporated these results into our factory designs, but intend to do so in the future.
After exploring several cases using the estimation spreadsheet from the ancillary files of [52], tweaked slightly to account for the improvements to the factory made in [93] and for the presence of multiple factories, we decided to use four factories with a level 1 code distance of 19 and a level 2 code distance of 31. According to the "logical factory dimensions" method from the "estimate costs.py" script in the ancillary files of [95], the large level 1 code distance results in the factory having a footprint of 15d × 8d and a depth of 5d where d = 31 is the level 2 code distance. This choice gives a roughly 0.1% chance of any distillation failure occurring when executing ten billion Toffolis, allocates nearly a million physical qubits for use as factories, and results in a Toffoli production rate of 25 kHz. We give each factory a row of routing space to move the produced CCZ states to where they are needed, and two 3 × 4 AutoCCZ fixup areas for correcting the teleportation of the magic states being produced. As shown in Figure 11, the number of data qubits we store when running FeMoCo is less than 700. The number of Toffoli operations is less than ten billion, and we perform Toffolis at a rate of 25 kHz. According to the spreadsheet from [52], a data code distance of 31 is sufficient for this regime while maintaining a 1% total error budget. We summarize this information in Figure 10, which shows one potential way to lay out the factories and the data qubits. With this layout the FeMoCo computation would span four million physical qubits.
The total Toffoli count of the algorithm is approximately 483,000 inner loops times 13,800 Toffolis per inner loop, which equals 6.7 billion Toffolis. It would take 3 days to perform these Toffolis with four CCZ factories distilling at a total rate of 25 kHz.
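These headline numbers follow from a short calculation. The sketch below (our arithmetic, using the factory depth of 5d cycles quoted above) reproduces the physical qubit count, Toffoli rate, and runtime:

```python
def femoco_estimate(logical_qubits=1908, d=31, toffolis=483_000 * 13_800,
                    n_factories=4, cycle_us=1.0):
    """Reproduce the headline FeMoCo surface code estimates."""
    physical = logical_qubits * 2 * (d + 1) ** 2    # ~4 million physical qubits
    ccz_period_us = 5 * d * cycle_us / n_factories  # factory depth is 5d cycles
    rate_khz = 1e3 / ccz_period_us                  # ~25 kHz Toffoli rate
    runtime_days = toffolis * ccz_period_us / 1e6 / 86400
    return physical, rate_khz, runtime_days
```

This yields about 3.9 million physical qubits, a roughly 25.8 kHz Toffoli rate, and about 3 days of runtime, consistent with the figures quoted above.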

Estimate at optimistic error rates
So far in this section, our estimates have been based on conservatively assuming a physical error rate of 0.1% and a corresponding error suppression factor Λ of 10 (meaning we assume the logical error rate per round goes down by a factor of 10 each time we increase the code distance by 2). If we optimistically assume physical error rates of 0.01%, and a corresponding Λ of 100, all of the code distances can be cut in half while achieving the same reliability.
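The suppression factor Λ translates directly into how the logical error rate per round falls with code distance. Schematically (our parametrization, with p_ref a hypothetical rate at a reference distance d_ref):

```python
def logical_error_per_round(d, d_ref, p_ref, suppression=10.0):
    """Logical error rate per round under the stated scaling assumption:
    each increase of the code distance by 2 divides the rate by Lambda."""
    return p_ref / suppression ** ((d - d_ref) / 2)
```

Moving from Λ = 10 (at 0.1% physical error) to Λ = 100 (at 0.01%) squares the suppression gained per distance step, which is why all the code distances can be halved while achieving the same reliability.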
We again refer to the spreadsheet from Ref. [52]. We find that, under a total error budget of 1% and a physical error rate of 0.01%, ten billion Toffolis can be executed using a level 1 distillation code distance of 9, a level 2 distillation code distance of 15, and a data code distance of 15. The resulting CCZ factory has a footprint of 14d × 8d, proportionally slightly better than the 15d × 8d we had with a physical error rate of 0.1%. Therefore we can simply use the floorplan from Figure 10 unchanged, but with d = 15 instead of d = 31. Since the physical qubit count per logical qubit is 2(d + 1)², reducing the code distance from 31 to 15 reduces the physical qubit count by a factor of 32²/16² = 4. The physical qubit count shrinks from four million to one million.
Because the computation is still beat limited at distance 15 when using four factories (as opposed to being reaction limited) the execution time at distance 15 is the execution time at distance 31 multiplied by 15/31. The execution time shrinks from 3.5 days to 1.5 days.
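The factor-of-4 qubit saving follows directly from the 2(d + 1)² physical-qubits-per-logical-qubit figure quoted above; a one-line check (helper name ours):

```python
# Surface-code footprint scaling with code distance, per the 2(d+1)^2
# physical qubits per logical qubit figure used in the text.
def physical_per_logical(d: int) -> int:
    return 2 * (d + 1) ** 2

# Halving the code distance from 31 to 15 shrinks the footprint by 4x.
ratio = physical_per_logical(31) / physical_per_logical(15)
print(ratio)  # 32**2 / 16**2 = 4
```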

V. CONCLUSION
The primary contribution of this paper has been to introduce a new method for simulating electronic Hamiltonians in an arbitrary basis on a quantum computer. In terms of both the asymptotic scaling and the finite resources required for complex benchmark molecules, our method appears to have lower cost for realization on a fault-tolerant quantum computer than any prior algorithm in the literature. We perform the first explicit compilation of an arbitrary basis quantum chemistry simulation into surface code logical gates, which we accomplish using lattice surgery, to arrive at the most accurate estimates yet of what would be required to realize an important and classically intractable quantum computation of chemistry within the surface code.
The method we introduce uses a number of now standard techniques for quantum simulation including phase estimation of quantum walks, qubitization, QROM, coherent alias sampling and unary iteration. Our key innovation was to adapt this set of tools to the tensor hypercontraction representation of quantum chemistry, which had previously gone unexplored in the context of quantum computing. While the standard THC representation compresses the Hamiltonian considerably, enabling highly efficient quantum walks, using that representation directly leads to very large λ values, requiring many repetitions of the quantum walk. To avoid this, we use tensors obtained from THC to transform the standard electronic Hamiltonian into a new representation which has a diagonal Coulomb operator but in a larger and non-orthogonal auxiliary basis. We then develop algorithms for realizing the associated qubitization oracles which work by rotating into the correct basis one tensor factor at a time, thus avoiding the need for a global non-orthogonal basis rotation, and allowing us to apply our algorithm without using any more system qubits than what is required by the original Hamiltonian.
The most efficient prior algorithms for simulating arbitrary basis quantum chemistry were the "sparse" method of Berry et al. [9] and the "double low rank" method of von Burg et al. [10]. In both cases we correct errors in the original work (leading to reduced Toffoli complexity of the former approach). The sparse algorithm has Toffoli complexity O((N + √S)λ_V/ε) and space complexity O(N + √S), where S is the Hamiltonian sparsity. The double low rank algorithm has Toffoli complexity O(N λ_DF √Ξ/ε) and space complexity O(N √Ξ), where Ξ is the rank of the second tensor factorization discussed for quantum computing in [29]. By contrast, the tensor hypercontraction approach introduced here has Toffoli complexity O(N λ_ζ/ε) and space complexity O(N). To contextualize the scaling of these λ values as well as S and Ξ, we analyze the asymptotic scaling of these methods applied to hydrogen systems growing towards both continuum and thermodynamic limits. Of the two prior algorithms, the double low rank method scales better towards the continuum limit, with Toffoli complexity O(N^3.9/ε) and space complexity O(N^1.5). The THC algorithm has more favorable scaling, with Toffoli complexity O(N^3.1/ε) and space complexity O(N). When scaling towards the thermodynamic limit, the sparse algorithm scales considerably better than the double low rank algorithm, having Toffoli complexity O(N^2.3/ε) and space complexity O(N). Again, our THC algorithm has even lower scaling, with the same asymptotic space complexity but Toffoli complexity O(N^2.1/ε). See Table II for details. Next, we compared the finite resources required to implement these methods for popular FeMoCo benchmarks. As can be seen in Table III, the THC algorithm requires both fewer logical qubits and fewer Toffolis than any of the other competitive methods. We note that there might be more compact and accurate THC factors than those we found, which could improve our results even further.
Due to the difficulties associated with non-linear optimization, we expect that our solutions are suboptimal. In this sense, the estimates we report for the cost of the THC approach should be regarded as upper bounds on the cost of the most efficient possible implementations. We discuss a very detailed scheme for efficiently laying out the Li Hamiltonian FeMoCo computation within the surface code. Ultimately, we show that the computation could be realized with about four million physical qubits and under four days of runtime, assuming physical gate error rates of about 0.1%. With the more optimistic assumption of 0.01% per-gate error rates we could realize the same computation with about one million physical qubits and under two days of runtime.
It is interesting to note that, in our physical cost estimates, most qubits are used for routing and distillation. Quantum circuits describing a quantum algorithm typically omit these qubits, and so omit the part making up the majority of the cost. For example, consider modifying a random access algorithm to use sequential access, saving space by reducing routing overheads. If avoiding random access forced a tradeoff, such as a small increase in the number of data qubits, the abstract circuit model would classify this improvement as a downgrade. Because of this, we caution the reader that comparisons derived purely from quantum circuit metrics (such as counting Toffolis and data qubits) can be misleading. They are good approximations, but these approximations can fail in known ways.
Despite the rapid progress detailed in Table I, we expect that we are nearing the end of a series of asymptotic speedups for arbitrary basis quantum chemistry algorithms. At least for second quantized approaches based on linear combinations of unitaries, it would be difficult to imagine an algorithm that improves more than logarithmically on the O(N λ/ε) Toffoli complexity achieved by our approach (although it is possible that one could reduce λ). This is because it would seem that Ω(N) complexity should be required for any nontrivial quantum walk on N qubits, and phase estimation of LCU methods generically requires Ω(λ/ε) repetitions. Indeed, we suspect that the near O(N²/ε) complexity obtained when scaling towards the thermodynamic limit for hydrogen chains would be the lowest possible scaling for simulation of the Coulomb operator as a consequence of its pairwise nature; also, this roughly matches the lowest scaling that has been achieved for simulating the Coulomb operator in special basis sets [96]. However, perhaps there is room to improve over the O(N³/ε) scaling we observe when scaling hydrogen systems towards their continuum limit. Still, there are several ways that we might hope to extend these methods and reduce constant factors.
While the THC results presented in this work show remarkable improvements over previous qubitization approaches, the non-linear optimization associated with obtaining the THC factorization is challenging for broad applications. Future research and ongoing efforts in quantum chemistry may help to resolve this issue [71,72]. One natural question to ask is whether the THC representation (and in particular, our non-orthogonal THC Hamiltonian in Eq. (11)) has utility for quantum computing beyond qubitization based methods; e.g., can one combine the THC representation with Trotter methods? Another question is whether one can leverage hierarchical matrix representations of the Coulomb operator [97] to compress the Hamiltonian in a way that is useful for qubitization, either within the THC framework or within the sparse method. Another area to consider would be developing strategies for choosing active space orbitals with an aim towards reducing the value of λ associated with qubitizing the resultant Hamiltonian. Finally, we should continue to identify more concrete molecular benchmarks beyond the capabilities of classical electronic structure methods that could be solved on a quantum computer to provide insights about chemistry. Noting that by combining qubitization with quantum signal processing [98] one can adapt our approach to perform highly efficient time evolutions rather than phase estimation, it is also worth exploring applications of quantum chemistry that would benefit from time evolution of the electronic Hamiltonian.

Data and code availability
To maximize reproducibility, we share data and code used in this work on a public Zenodo repository [74]. The repository includes all of the integral tensors (i.e., both exact and approximate) for systems studied in this work, Python code for generating the DF factors and computing λ values, as well as Mathematica notebooks for evaluating the cost of the SF, DF, sparse, and THC methods. Since the THC factorization done in this work involves challenging non-linear optimization, it may be difficult for others to reproduce the THC numerical data presented here. Therefore, we recommend that any interested readers use the THC factors available at [74] for future study.
Here we review the method of [9], with some further optimizations. The first step in simulating a Hamiltonian using qubitization is to represent it as a linear combination of unitaries. How one chooses this linear combination of unitaries has significant ramifications for the λ factor, which scales the cost of the qubitized simulation. We start by expressing the electronic Hamiltonian in an arbitrary second-quantized basis as in Eq. (1). Under the Jordan-Wigner transform, the hopping terms map to qubits as

$$T_{pq}\left(a^\dagger_{p,\sigma} a_{q,\sigma} + a^\dagger_{q,\sigma} a_{p,\sigma}\right) \mapsto \frac{T_{pq}}{2}\left(X_{p,\sigma}\vec{Z}X_{q,\sigma} + Y_{p,\sigma}\vec{Z}Y_{q,\sigma}\right),$$

where X, Y and Z are the Pauli operators, the subscripts indicate the qubits these operators act on, and $A_p \vec{Z} A_q$ is shorthand for $A_p Z_{p+1} \cdots Z_{q-1} A_q$. Note that we get a factor of 1/8 in front of Eq. (A2) from multiplying the original factor of 1/2 by a factor of 1/4 that comes from expanding the pq and rs terms as the sum of their Hermitian conjugate for each index. Note that we have made a correction of a factor of 2 from Q_pqσ as defined in [9]. An improvement we can make is to remove the identity from the case where p = q, so that Q_pqσ has the same weightings on diagonal and off-diagonal terms. The expansion of T can then be written in terms of Q_pqσ, and the expansion of V likewise. In the expansion of V, the middle two terms correspond to one-body terms that can be combined with T, and the fourth is a constant offset that can be omitted. We can therefore write the Hamiltonian in terms of a new T̄ and V. Then H = T̄ + V plus a term proportional to the identity, which can be omitted because it gives a constant shift to the eigenvalues. Because the operators Q_pqσ, as well as the products Q_pqα Q_rsβ, are all unitary, we now see that H = T̄ + V is a linear combination of unitaries. In this representation the associated λ values are

$$\lambda_T = \sum_{pq} |\overline{T}_{pq}|, \qquad \lambda_V = \frac{1}{2}\sum_{pqrs} |V_{pqrs}|.$$

For λ_T, the factor of 1/2 in the mapping above cancels with the factor of 2 from summing over the spin degree of freedom.
The factor of 1/2 in front of the V term comes from multiplying the factor of 1/8 by 4 for summing the spin degrees of freedom. Note that these λ definitions differ from those given in [9]. At first glance λ_V appears to be a factor of 8 smaller. The first reason for this is that we deviated from the convention of absorbing the factor of 1/2 into V (since this is just a difference in convention, it is consistent with prior work). However, there is roughly another factor of 4 difference that comes in because the work of [9] accidentally left out a factor of 1/2 in Eq. (A3). This leads to a λ_V that is reduced by a factor of 4. For the Reiher Hamiltonian [23], we previously reported λ_T = 1,490 a.u. and λ_V = 8,373 a.u., so λ = 9,863 a.u., with a truncation threshold of 2 × 10^−4. This should be updated to λ_T = 90 a.u. and λ_V = 2,045 a.u., so λ = 2,135 a.u., with a truncation threshold of 7.5 × 10^−5. For the Li Hamiltonian [35] we previously gave λ_T = 3,446 a.u. and λ_V = 4,168 a.u., so λ = 7,614 a.u., with a truncation threshold of 1 × 10^−4. This should be updated to λ_T = 561 a.u. and λ_V = 986 a.u., so λ = 1,547 a.u., with a truncation threshold of 3.5 × 10^−5. The truncation threshold became tighter in this work because the metric we chose (i.e., the CCSD(T) correlation energy error) is more conservative than what was used in our previous work [9]. (Table caption: Here d is the number of permutation-unique non-zero elements above a given threshold, and λ_T is 561.5 Hartree. The entry in blue corresponds to the threshold used in our resource estimates.)
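Given the (possibly truncated) integral tensors, the λ values used in this appendix, λ_T = Σ_pq |T̄_pq| and λ_V = (1/2) Σ_pqrs |V_pqrs|, can be computed directly. A minimal numpy sketch (function and argument names are ours, not from the paper's released code):

```python
import numpy as np

def sparse_lambdas(T_bar, V, threshold=0.0):
    """lambda_T = sum_pq |T_bar_pq|, lambda_V = (1/2) sum_pqrs |V_pqrs|,
    with entries at or below the truncation threshold dropped."""
    T_kept = np.where(np.abs(T_bar) > threshold, T_bar, 0.0)
    V_kept = np.where(np.abs(V) > threshold, V, 0.0)
    lam_T = np.abs(T_kept).sum()
    lam_V = 0.5 * np.abs(V_kept).sum()
    return lam_T, lam_V, lam_T + lam_V
```

Sweeping `threshold` over candidate truncation values reproduces the trade-off between λ and accuracy discussed above.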

The cost of qubitization of the sparse chemistry Hamiltonian
The state to be prepared is similar to that in Eq. (48) of [9], except T_pq is replaced with T̄_pq and V_pqrs is replaced with V_pqrs/2 (due to the factor of 2 in the definition of V), so the state to be prepared is (up to normalization)

$$\sum_{\sigma}\sum_{pq}\sqrt{|\overline{T}_{pq}|}\,|\theta^{T}_{pq}\rangle\,|p,q,\alpha\rangle + \sum_{\alpha\beta}\sum_{pqrs}\sqrt{|V_{pqrs}|/2}\,|\theta^{V}_{pqrs}\rangle\,|p,q,\alpha\rangle\,|r,s,\beta\rangle. \quad \text{(A11)}$$

This state can be prepared as described in [9], using controlled swaps to generate the symmetries of the state.
Because we are here not including the identity in Q_pqσ, the controlled unitaries can be performed in a simpler way than shown in Figure 1 of [9]. In that work there are inequality tests between the p and q registers which are needed to produce the sum of the identity and Z operations. When the identity is not included, the circuit needed simplifies to just two applications of the circuit for Majorana operators shown in Figure 9 of [16] and Figure 1 of [99]. The simplified circuit is shown in Figure 13. When the select is controlled as shown, the complexity is 2(N − 1). If it does not need to be controlled, the complexity is only 2(N − 2). We will require only one of these two select operations to be controlled, because for the case of the one-body term in the Hamiltonian only one select should be applied, so one of the two select operations needs to be controlled on the qubit flagging one- and two-body terms in the Hamiltonian. The total complexity of these select operations is therefore 4N − 6.

FIG. 13. The circuit needed to perform a controlled select operation. This is similar to that in [9], except it is not necessary to perform the equality test p = q. The unitaries labeled $\vec{Z}A_j$ apply the operation $Z_0 \cdots Z_{j-1} A_j$ to the target register, depending on the value from the input register, using the technique shown in Figure 9 of [16]. The sign is shown as being obtained via a Z gate on the sign qubit, but in practice this would be obtained as part of the state preparation.
In the sparse simulation method, the relevant parameters are the number of orbitals N, the λ value, and the number of unique nonzero entries d. If we allow error ε_prep in the energy due to the state preparation, the number of bits ℵ used for the keep probabilities of the QROM is set accordingly. There are eight registers of size n_N = ⌈log(N/2)⌉, because the sparse preparation scheme needs to output ind values and alt values of p, q, r, s. There are also 2 qubits needed for the two output values of θ (one for the ind and one for the alt values of p, q, r, s), as well as 2 qubits used for the ind and alt values of the first register, which distinguishes between T̄ and V. As a result the QROM output size is m = ℵ + 8n_N + 4. The cost of the preparation is then ⌈d/k₁⌉ + m(k₁ − 1) Toffolis for a power-of-two parameter k₁, and of the inverse preparation ⌈d/k₂⌉ + k₂. To begin the state preparation, we need to prepare an equal superposition state over d basis states. Given that 2^η is a factor of d, the procedure and its costs are as follows.
1. Apply Hadamard gates to the ⌈log d⌉ qubits and perform an inequality test of the top ⌈log d⌉ − η qubits against d/2^η (the lowest η qubits are already in an exact uniform superposition).
2. Rotate an ancilla qubit to obtain an overall amplitude for success of 1/2, using b_r bits of precision. This has cost b_r − 3 Toffolis (since the rotation angle is given classically).

3. Reflect on success for both, which may be performed via a controlled phase gate (no Toffolis).
4. Invert the rotation with cost b_r − 3.

5. Invert the inequality test, which may be performed with Cliffords provided the first inequality test was performed using the out-of-place adder.
The total cost is then 3⌈log d⌉ − 3η + 2b_r − 9. This cost is paid both for the preparation and for the inverse preparation.
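The cost formula above is easy to tabulate. A small sketch (helper name ours), choosing η as the largest power of two dividing d, which minimizes the cost:

```python
import math

def equal_superposition_cost(d: int, b_r: int) -> int:
    """Toffoli cost 3*ceil(log2 d) - 3*eta + 2*b_r - 9 from the text,
    taking eta as the largest integer with 2^eta dividing d."""
    eta = (d & -d).bit_length() - 1       # largest power of two dividing d
    n_d = math.ceil(math.log2(d))
    return 3 * n_d - 3 * eta + 2 * b_r - 9
```

For example, with d = 12 (so η = 2) and b_r = 7 bits of rotation precision, the preparation costs 11 Toffolis.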
Other minor Toffoli costs are as follows. In the following we can use extra ancillas to save cost, because a large number of ancillas were used for the QROM and can be reused here without increasing the maximum number of ancillas used.
1. Perform select as shown in Figure 13 twice, with only one application being controlled. As discussed above, this has complexity 4N − 6.
2. The state preparation needs an inequality test on ℵ qubits, as well as controlled swaps. The cost of the inequality test on ℵ qubits is ℵ, and by performing an inequality test with the out-of-place adder we can invert it in the inverse preparation with no additional Toffoli cost. The controlled swaps are on 4n_N + 1 qubits. Here the 4n_N qubits are for the values of p, q, r, and s, and the +1 is for the qubit which distinguishes between the one- and two-electron terms. It is not necessary to perform a controlled swap on the sign qubits output, because the correct phase can be applied with Clifford gates. It is possible to invert the controlled swaps in the inverse preparation with Cliffords. The method is to copy all values being swapped before they are swapped. Then, to invert the controlled swap, perform measurements on the swapped values in the X basis. We can perform phase fixups using controlled-phase operations, where the control is the control qubit for the controlled swaps, and the targets are the copies of the registers. That means we can eliminate the non-Clifford cost of the inverse preparation, giving a Toffoli cost of ℵ + 4n_N + 1.
3. The controlled swaps used to generate the symmetries have a cost of 4n N . Again these controlled swaps can be inverted for the inverse preparation with measurements and Clifford gates.

4. A reflection on the ancilla is needed as well. The qubits that need to be reflected on are the ⌈log d⌉ qubits needed for preparing the state, the ℵ qubits used for the equal superposition state in the coherent alias sampling, the three qubits that are used for controlled swaps to generate the symmetries of the state, and the ancilla qubit that is rotated to produce the equal superposition state. That gives a Toffoli cost of ⌈log d⌉ + ℵ + 2.
5. For each step one more Toffoli is needed for the unary iteration used for the phase estimation, and one more Toffoli is needed to make the reflection controlled.
The total cost for a single step is then given by summing the above contributions, with m = ℵ + 8n_N + 4, n_N = ⌈log(N/2)⌉, and η an integer such that 2^η is a factor of d.
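The dominant contribution to the per-step cost is the QROM read. As a sketch of the trade-off involved, the standard QROM cost model ⌈d/k⌉ + m(k − 1) Toffolis for output size m, minimized over power-of-two values of k (our assumption of the cost model used in this appendix; the function name is ours), can be explored as follows:

```python
import math

def qrom_cost(d: int, m: int) -> tuple:
    """Minimise ceil(d/k) + m*(k-1) over powers of two k <= d.
    Returns (best_cost, best_k)."""
    best = (math.inf, 1)
    k = 1
    while k <= d:
        cost = math.ceil(d / k) + m * (k - 1)
        if cost < best[0]:
            best = (cost, k)
        k *= 2
    return best
```

The optimum sits near k ≈ √(d/m), so large output sizes m push the optimal k back toward 1 (plain unary-iteration QROM).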
The logical qubits used are as follows.
1. The control register for the phase estimation uses log(I + 1) qubits, and there are log(I + 1) − 1 qubits for the unary iteration.

2. The system uses N qubits.
3. The QROM uses a state with log d qubits.

4. A qubit is needed to flag success of the equal superposition state preparation.
5. An ancilla qubit is rotated in the preparation of the equal superposition state.
6. The phase gradient state uses b r qubits.
7. The equal superposition state used for the coherent alias sampling, which has ℵ qubits.

Counting the number of permutation-unique elements
The two-electron integral tensor V_pqrs exhibits an 8-fold permutational symmetry:

$$V_{pqrs} = V_{qprs} = V_{pqsr} = V_{qpsr} = V_{rspq} = V_{srpq} = V_{rsqp} = V_{srqp}.$$

To evaluate the cost of the sparse method presented above, we need to count the number of permutation-unique elements above a given truncation threshold. We provide more details about how this counting was performed so that others can easily reproduce the numerical data presented here. We note that the number of permutation-unique elements in V_pqrs is given as [84]

$$\frac{1}{8}\left[\left(\frac{N}{2}\right)^{4} + 2\left(\frac{N}{2}\right)^{3} + 3\left(\frac{N}{2}\right)^{2} + 2\left(\frac{N}{2}\right)\right].$$

This expression can be obtained by considering the following four different classes of V_pqrs:
1. p, q, r, and s are all unique indices. We loop over a total of $\binom{N/2}{4}$ unique combinations of quartets (p, q, r, s) and count V_pqrs, V_prqs, and V_psrq in this category.
2. Two indices are redundant (i.e., only three unique indices). We loop over a total of $\binom{N/2}{3}$ unique combinations of triplets (p, q, r) and count V_ppqr, V_pqpr, V_qqpr, V_qrpq, V_rrpq, and V_rpqr in this category.
3. Three indices are redundant (i.e., only two unique indices). We loop over a total of $\binom{N/2}{2}$ unique combinations of doublets (p, q) and count V_ppqq, V_pqpq, V_pppq, and V_qqqp here.
4. All four indices are redundant. We loop over p and count V pppp in this category.
In order to obtain d, we enumerate each category and count the number of elements above a threshold. Adding (1/2)(N/2)(N/2 + 1) to this, to account for the symmetry-unique part of the one-body operator, directly gives us the numerical data in Table IX and Table X.
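The closed-form count above is easy to verify by brute force: enumerate all index quartets, group them into orbits under the 8-fold symmetry, and count orbits. A small self-check script (not from the paper's repository):

```python
import itertools

def count_unique(n: int) -> int:
    """Brute-force count of 8-fold-symmetry-unique quartets over n orbitals."""
    seen = set()
    for p, q, r, s in itertools.product(range(n), repeat=4):
        orbit = {(p, q, r, s), (q, p, r, s), (p, q, s, r), (q, p, s, r),
                 (r, s, p, q), (s, r, p, q), (r, s, q, p), (s, r, q, p)}
        seen.add(min(orbit))  # canonical representative of the orbit
    return len(seen)

def closed_form(n: int) -> int:
    """(1/8)(n^4 + 2n^3 + 3n^2 + 2n) with n = N/2."""
    return (n**4 + 2 * n**3 + 3 * n**2 + 2 * n) // 8

# the brute-force count matches the closed form for small n
assert all(count_unique(n) == closed_form(n) for n in range(1, 7))
```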

Numerical data for hydrogen chains and H 4
In Figure 14, we present the number of non-zero values of V with a truncation threshold of 5 × 10^−5 for hydrogen chains and H_4.
Appendix B: The "single low rank factorization" algorithm of Berry et al.

Representing the single low rank factorized Hamiltonian as a linear combination of unitaries
In Berry et al. the approach focuses on simulating a truncated low rank decomposition of the Coulomb operator, which expresses the two-body terms in Eq. (1) as a sum of squared one-body terms in a form first described for quantum computation in [45]. The first work to suggest exploiting the low rank properties of this decomposition was [29]. Specifically, the factorization of the Coulomb operator used in that work is

$$V_{pqrs} = \sum_{\ell=1}^{L} W^{(\ell)}_{pq} W^{(\ell)}_{rs},$$

where the W^(ℓ)_pq are scalars obtained by performing a Cholesky decomposition on a flattened version of the V_pqrs tensor and L = O(N). Note that the factor of 1/2 becomes a factor of 1/8 because there is a factor of 1/2 that is then squared for adding the Hermitian conjugates. Using the Jordan-Wigner transform as in Eq. (A3), our Hamiltonian can be represented as a linear combination of unitaries (B2). As before, when we separate out the identity for p = q, we can express this in terms of Q_pqσ from Eq. (A4). In the last line of that expression we have used that Σ_ℓ W^(ℓ)_pq W^(ℓ)_rs = V_pqrs, which is exact when L is taken to be sufficiently large. In this expression, the second term can be combined with T to give a one-body operator T̄ that is identical to that in Appendix A. The third term in Eq. (B3) is proportional to the identity, and can be omitted. The fundamental two-body term in W is then just the first term of Eq. (B3). The Hamiltonian to be simulated is T̄ + W. We can now give the expression for λ = λ_T + λ_SF. Because T̄ is the same as in Appendix A, λ_T is as in Eq. (A10). For λ_SF we take the sum of the weightings in W to give

$$\lambda_{SF} = \frac{1}{4} \sum_{\ell=1}^{L} \left( \sum_{pq} \left|W^{(\ell)}_{pq}\right| \right)^{2}. \quad \text{(B6)}$$

There is a factor of 4 due to summing over the spins, which would give a factor of 1/2 out the front. However, this λ can then be effectively divided by 2 if we choose to realize the squared operator by performing oblivious amplitude amplification as in [10]. After oblivious amplitude amplification of an operator A we effectively implement 2A² − 𝟙, where we can ignore the identity.
Since this gives us a factor of 2, we can divide the λ by two, giving λ_SF as in Eq. (B6).
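A minimal numpy sketch of obtaining the factors W^(ℓ) by diagonalizing the flattened Coulomb tensor (an eigendecomposition variant of the Cholesky factorization described above; names are ours):

```python
import numpy as np

def low_rank_factors(V, tol=1e-10):
    """Factorise V_pqrs ≈ sum_l W^(l)_pq W^(l)_rs by diagonalising the
    (pq),(rs)-flattened supermatrix, which is assumed positive semidefinite.
    Eigenvalues below tol are truncated, setting the rank L."""
    n = V.shape[0]
    M = V.reshape(n * n, n * n)             # flattened supermatrix
    vals, vecs = np.linalg.eigh(M)          # symmetric eigendecomposition
    keep = vals > tol
    W = (vecs[:, keep] * np.sqrt(vals[keep])).T.reshape(-1, n, n)
    return W                                 # shape (L, n, n)
```

Sorting the kept eigenvalues by magnitude and truncating L below n² realizes the rank truncation discussed in the text.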

The cost of qubitization of the single low rank factorized Hamiltonian
Next we describe the costing of the implementation of the single low rank factorised Hamiltonian as in [9], except taking into account the amplitude amplification approach for reducing λ. The complete Hamiltonian written as a linear combination of unitaries (ignoring an additive term proportional to the identity) involves T̄_pq as in Eq. (A9). The procedure as given in [9] corresponds to preparing a state, performing controlled operations, then inverting the preparation. Now, we need to perform the state preparation separately on two registers. First, prepare a superposition over the first register, which selects between the ℓ terms in the factorisation, as well as the one-body term. Next, prepare a superposition on the second register with weights corresponding to the square roots of W^(ℓ)_pq and T̄_pq. Perform the select operation by using the circuit shown in Figure 1 of [9]. Then invert the state preparation on the second register, reflect on that register, then perform the preparation, select, and inverse preparation again.

FIG. 15. The circuit for performing the state preparation and controlled operations for the single factorisation approach. The register labelled ℓ is the first control register containing ℓ, and p, q labels the second and third control registers. The registers labelled succ_ℓ and succ_pq are the registers that flag success of the preparation of the equal superposition state for the ℓ and p, q registers. The register labelled ℓ = 0 is a temporary register used to keep the result of an inequality test checking whether ℓ = 0, which is used to ensure that the second half of the circuit has no effect for that value of ℓ, which labels the one-electron term. The register labelled |0⟩ is used to control swaps of p and q. The register labelled α is used to select the spin (and is initialised as |0⟩). The bottom register labelled |ψ⟩ is the target system. The operation labelled sel is the select as shown in Figure 1 of [9].
The steps are shown in Figure 15, and in detail are as follows.
1. We first prepare a state on the first register. This has complexity as follows.
(a) Preparing an equal superposition over L + 1 basis states has complexity 3n_L − 3η + 2b_r − 9, where b_r is the number of bits used for the rotation on the ancilla, and η is a number such that L + 1 is divisible by 2^η.
(b) A QROM is applied with output size b_L = n_L + ℵ_1 + 2, where the extra 2 qubits are for outputting the qubit showing if ℓ = 0 and its alt value; the complexity is that of a QROM over L + 1 entries with this output size.
(c) An inequality test is performed with complexity ℵ_1.
(d) A controlled swap is performed with complexity n_L + 1.
2. Next, we prepare a state on the second register, where θ^(ℓ)_pq are used to obtain the correct signs on the terms, and the |+⟩ states at the end are used to select the spin and control the swap between the p and q registers. The complexity of this state preparation is as follows.
(a) First, an equal superposition over p and q is prepared with p ≤ q ≤ N/2. Letting n_N = ⌈log(N/2)⌉ be the number of bits used for the p and q registers, the cost of preparing this equal superposition is as follows.
i. For testing the two inequalities the cost is n_N − 1 for q ≤ N/2 and n_N for p ≤ q.
ii. An ancilla is rotated with cost b_r − 3 for b_r bits of accuracy.
iii. A reflection is performed controlled on the result of the two inequality tests and the ancilla qubit, with cost 1.
iv. The inequality tests are inverted with Cliffords, and there is a cost of b_r − 3 for another qubit rotation.
v. There is a reflection on the p and q registers and the ancilla with cost 2n_N − 1.
vi. The inequality tests are performed again with cost 2n_N − 1.
vii. Another Toffoli is used to give a single qubit flagging success for both inequality tests.
That gives a cost of 6n_N + 2b_r − 7.
(b) Next, a contiguous register is computed from p and q. In Appendix F we show that the complexity of computing p(p − 1)/2 + q is n_N² + n_N − 1. We are making an improvement over the method in [9] by not computing a contiguous register including ℓ as well. This is because it is possible to apply the QROM to two registers (provided there are no restrictions tying the two register values together), as we show in Appendix G.
(c) Perform the QROM on the two registers as described in Appendix G, with output of size b_p = 2n_N + ℵ_2 + 2; the complexity is given in Appendix G in terms of the parameters k_p1 and k_p2.
(d) Perform the inequality test with cost ℵ_2.
(e) Perform the controlled swap with the alt values with cost 2n N . The sign required for the sign qubits can be implemented with Cliffords as before.
3. Perform a controlled swap between the p and q registers, with cost n_N.
4. Perform select as shown in Figure 1 of [9]. As discussed in Appendix A, the cost is 2(N − 2) Toffolis when it does not need to be controlled.
5. Reverse steps 2 and 3, where the complexities are the same except for the QROM complexity, which is reduced for the inversion.
6. Reflect on the qubits that were prepared in step 2. The qubits we need to reflect on are as follows.
(a) The 2n N qubits for the p and q registers.
(b) We need to reflect on the ℵ_2 qubits that are used for the equal superposition state for the state preparation.
(c) One that is rotated for the amplitude amplification.
(d) One for the spin.
(e) One for controlling the swap between the p and q registers.
That gives a total of 2n_N + ℵ_2 + 3 qubits. The reflection needs to be controlled on the success of the preparation on the ℓ register and on the qubit flagging whether ℓ = 0, making the total cost 2n_N + ℵ_2 + 3 Toffolis.
7. Perform steps 2 to 5 again, but this time L + 1 is replaced with L in Eq. (B13) and Eq. (B14). Also, the select operation needs to be controlled on the qubit flagging ℓ = 0, which labels the one-body term. That requires another Toffoli.
8. Invert the state preparation on the ℓ register, where the complexity of the QROM is reduced for the inversion.
9. To complete the step of the quantum walk, perform a reflection on the ancillas used for the state preparation. There are n_L + 2n_N + ℵ_1 + ℵ_2 + 4 qubits we need to reflect on, as follows.
(a) The n_L qubits for the ℓ register.
(b) The 2n N qubits for the p and q registers.
(c) The ℵ_1 qubits for the equal superposition state used for preparing the state on the ℓ register using the coherent alias sampling.
(d) The ℵ 2 qubits for the equal superposition state for preparing the state on the p, q registers.
(e) Two qubits rotated for the amplitude amplification.
(f) One qubit for the spin.
(g) One qubit for controlling the swap of the p and q registers.
10. The steps of the walk are made controlled by using unary iteration on an ancilla used for the phase estimation. Each step requires another two Toffolis for the unary iteration and making the reflection controlled.
The logical qubits used are as follows.
1. The control register for the phase estimation, and the ancillas for the unary iteration, together need 2⌈log I⌉ − 1 qubits.
2. There are N qubits for the target system.
3. There are n_L + 2 qubits for the ℓ register, the qubit rotated in preparing the equal superposition, and the qubit flagging success of preparing the equal superposition.
4. The state preparation on the ℓ register uses n_L + 2ℵ_1 + 3 qubits. Here n_L is for the alt values, ℵ_1 for the keep values, ℵ_1 for the equal superposition state, 1 for the output of the inequality test, and 2 for the qubit flagging ℓ = 0 and its alternate value.
5. There are 2n N + 2 qubits needed for the p and q registers, a qubit that is rotated for the equal superposition, and a qubit flagging success of preparing the equal superposition.
6. The phase gradient state uses b_r qubits.
7. The equal superposition state used for the second preparation uses ℵ_2 qubits.
The QROM for the state preparation on the second register uses a large number of temporary ancillas, which can be reused by later parts of the algorithm, so those later parts do not need their qubits counted separately. The total number of qubits used is then

$$2\lceil\log I\rceil + N + 2n_L + 2\aleph_1 + \aleph_2 + b_r + 2n_N + 6 + \lceil\log(N^2/8 + N/4)\rceil + b_p k_{p1} k_{p2} + \lceil\log[(L+1)/k_{p1}]\rceil + \lceil\log[(N^2/8 + N/2)/k_{p2}]\rceil, \quad \text{(B17)}$$

with b_p = 2n_N + ℵ_2 + 2, n_N = ⌈log(N/2)⌉, and n_L = ⌈log(L + 1)⌉. This completes the costing of the single low rank factorization method.

Appendix C: The "double low rank factorization" algorithm of von Burg et al.

The idea to combine this second factorization with qubitization was suggested but not implemented in [9]. The idea is to further factorize the low rank operator by diagonalizing each one-body factor,

$$\sum_{pq} W^{(\ell)}_{pq}\, a^\dagger_{p,\sigma} a_{q,\sigma} = U^{(\ell)} \left( \sum_{p} f^{(\ell)}_{p}\, n_{p,\sigma} \right) U^{(\ell)\dagger},$$

where n_{p,σ} = a†_{p,σ} a_{p,σ}, the W^(ℓ)_pq are scalars obtained by performing a Cholesky decomposition on a flattened version of the V_pqrs tensor, the f^(ℓ)_p are scalars obtained from the diagonalization of the squared one-body operator, and the U^(ℓ) are the eigenvector matrices of that diagonalization. The sum over p can be truncated at Ξ^(ℓ) to a good approximation. Generally, we have that L = O(N) and Ξ^(ℓ) < N, but in some rather special cases Ξ^(ℓ) can be asymptotically less than this (e.g., for very large systems scaling towards the thermodynamic limit). For the moment we will keep the sum at N/2 to show how part of this two-body operator can be combined with the one-body operator. We will also assume for the moment that L = N²/4, so there is no approximation involved in summing to L.
Using the Jordan-Wigner transformation of Eq. (A3), one can map the operator F to qubits. The factor of 1/2 becomes a factor of 1/8 because a factor of 1/2 comes from the Jordan-Wigner transformation, which is then squared. This operator may then be expressed using the definitions above. The authors of [10] add the one-body part of this expression into the one-body operator to provide a new one-body operator. The expression for the one-body term can be simplified by noting that the trace is unchanged under U^(ℓ). Thus the additional term simplifies, and the last equality is exact when L is large enough that there is no truncation of the rank. Then the one-body term becomes, with no approximation due to truncation of the rank, an expression with T_pq as in Eq. (A9). Thus the one-body operator can be taken to be T, the same as for the sparse case in Appendix A and the single factorization in Appendix B.
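The Jordan-Wigner bookkeeping above can be checked on a single fermionic mode, where a†a = (I − Z)/2, so that squaring a one-body operator built from such terms yields the stated 1/2 → 1/8 pattern. A minimal numpy illustration (our own check, not the paper's code):

```python
import numpy as np

# single-mode Jordan-Wigner: a = |0><1|, Z = diag(1, -1)
a = np.array([[0, 1], [0, 0]], dtype=complex)
Z = np.diag([1.0, -1.0]).astype(complex)
I = np.eye(2)

n = a.conj().T @ a                   # number operator a^dag a
assert np.allclose(n, (I - Z) / 2)   # the factor of 1/2 from Jordan-Wigner
assert np.allclose(n @ n, n)         # n is a projector, consistent with squaring
```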
For the two-body operator F, we can omit the identity component, which only gives a uniform shift in the energy, so in F only the conjugated squared-Z part U^(ℓ) Z̄² U^(ℓ)† is retained. Moreover, in this two-body operator one can now truncate the sum over p at Ξ^(ℓ). In this expression one can also take a truncation with L < N²/4, so there is a truncation of the rank for the two-body operator but not the one-body operator.
The key insight of [10] is a method for implementing the U^(ℓ) such that one obtains a lambda scaling as λ = λ_T + λ_DF. The factor of 1/8 becomes a factor of 1/4 because we must multiply by 2² = 4 due to the sum over the spin degree of freedom, but we can then divide by two by using the oblivious amplitude amplification (Chebyshev polynomial) technique. The contribution to λ from T can be determined as follows. One can perform a rotation of the basis to diagonalise T as well, so the value of λ_T is the Schatten 1-norm (the sum of the absolute values of the eigenvalues) of the matrix T'_pq = T_pq + Σ_{r=1}^{N/2} V_pqrr. The main reason that this double-factorization method is more efficient than the one in [9] is that it has reduced λ. While the authors of [10] write that their method is more efficient by more than an order of magnitude for various systems studied, this turns out not to be the case when comparing to the most efficient method in [9] and correcting the estimate of λ.

Cost of qubitization of the double low rank factorized Hamiltonian
The primary cost of the double-factorisation method from [10] comes from the circuit shown in Eq. (78) of their Supplementary Material. The costs there are given by the following parts.
1. A state preparation as shown in their Eq. (79). This needs to be performed and inverted twice.
4. There are N/2 controlled swaps performed before and after the rotations, as shown in Eq. (62) of [10]. Since that is repeated, the cost is 2N.
5. The reflection in Eq. (78) of [10] has a complexity depending on how the reflection is performed.
First note that there seems to be a problem in using the circuit exactly as shown in Eq. (79) of [10]. It is using a joint state preparation on two registers, but the qubits before the preparation cannot be cleanly divided into those for the first register and those for the second register. That would be needed in order to reflect on only one, as is needed in the procedure in Eq. (79) of [10]. We will first discuss the costing for the procedure as shown in Eq. (79), then describe how to perform the separate state preparations on the two registers.
In comparing the complexity as presented here to that in [10], it must be taken into account that they define N in a different way. In that work N is the number of orbitals, whereas here it is the number of spin-orbitals, which is twice as large because it includes the spin degree of freedom. Therefore, [10] takes N = 54 for the FeMoCo orbitals of [23], whereas here we take N = 108. Also, in the expression for the number of bits for the rotation angles, ε is the error in synthesising a single step, which is taken to be 0.0001 Hartree in [10].
The dominant costs are from the QROM and the rotations. In using our THC representation we are able to reduce the QROM costs below those in [10], though the rotation costs are unchanged. To understand the QROM costs, we first list the main variables used.
• The main rank L.
• Ξ^(ℓ) is the rank of each term.
• Ξ = (1/L) Σ_{ℓ=1}^{L} Ξ^(ℓ) is the average rank of the second factorization; thus, the total number of coefficients is LΞ. Before performing the state preparation, one needs to prepare an equal superposition over LΞ basis states. That has cost 3⌈log(LΞ)⌉ − 3η + 2b_r − 9 Toffolis, as per the analysis in Appendix A, where 2^η is a factor of LΞ. The QROM used for the state preparation has an output size in which n_Ξ = ⌈log(max_ℓ Ξ^(ℓ))⌉ and n_L = ⌈log(L + 1)⌉ are the bits for the two registers, and ℵ is the bits of precision used for the keep values. We take n_L = ⌈log(L + 1)⌉ rather than n_L = ⌈log L⌉ in order to use an additional value of ℓ to account for the one-electron term. There is also a size for the contiguous register, n_{L,Ξ} = ⌈log(LΞ + N/2)⌉.
Here the N/2 will be to account for the one-electron term in the simulation. In the expression for b the factor of 2 on n Ξ and n L accounts for the ind and alt values in sparse state preparation, and the +2 is for the sign bit and its alternate value. (In [10] only one bit is assumed for the sign.) The complexity for the state preparation is that of the QROM, plus ℵ for the inequality test and n Ξ + n L for the controlled swaps.
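The alt and keep values used in the sparse state preparation come from a classical preprocessing step equivalent to the alias method. The sketch below (our own illustration, with ℵ-bit keep values of range M = 2^ℵ) builds the tables and verifies exactly that the induced distribution reproduces the target integer weights:

```python
def alias_tables(weights, M):
    """Build alt/keep tables for coherent alias sampling.

    weights: integer weights summing to len(weights) * M, where M = 2**aleph.
    Sampling: draw index i and an M-sided value m uniformly; output i if
    m < keep[i], else output alt[i]."""
    D = len(weights)
    assert sum(weights) == D * M
    keep, alt = [0] * D, list(range(D))
    w = list(weights)
    small = [i for i in range(D) if w[i] < M]
    large = [i for i in range(D) if w[i] >= M]
    while small:
        s, l = small.pop(), large.pop()
        keep[s], alt[s] = w[s], l       # bucket s donates its deficit to l
        w[l] -= M - w[s]
        (small if w[l] < M else large).append(l)
    for l in large:
        keep[l] = M                     # exactly full buckets keep themselves
    return keep, alt


# exact check: counts[i] = keep[i] + sum of (M - keep[j]) over j with alt[j] == i
weights, M = [3, 5, 11, 13], 8
keep, alt = alias_tables(weights, M)
counts = [keep[i] + sum(M - keep[j] for j in range(len(weights)) if alt[j] == i)
          for i in range(len(weights))]
assert counts == weights
```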
For the state preparation, since the number of items of data is LΞ, and each has size b, the complexity of the QROM computation follows the standard formula, where k_p must be chosen as a power of 2, with a corresponding uncomputation cost. There is cost n_Ξ + n_L for the controlled swaps and ℵ for the inequality test, which are inverted as well for the inverse preparation, giving cost 2n_Ξ + 2n_L + 2ℵ = b + ℵ − 2. Taking account of the need to perform the state preparation and inverse state preparation twice gives the cost of the state preparation for the two-body term. Before performing the QROM for the rotations, it is necessary to use QROM to generate a register with the offsets in order to produce a contiguous register. This procedure is described in Eq. (39) and Lemma 7 of [10]. The complexities of the QROM for the offsets and of its uncomputation follow similarly. We also need to perform addition of registers twice and invert addition of registers twice, which has cost 4(n_{L,Ξ} − 1). Once the contiguous register is produced, one can apply the QROM to output the rotation angles. Because there are LΞ sets of βN/2 bits for the rotation angles, the costs of the QROM used for the rotation angles and of its inverse follow as well. In [10], β is taken to be β = 5.652 + log(Nλ/(2ε)).
Note that we are using our convention for N, which is double that used in [10]. Accounting for two steps with preparation and inverse preparation gives the corresponding cost. For the total cost of the two-body term in the double-factorisation method of [10], we add the QROM costs in Eq. (C18), Eq. (C21), and Eq. (C25), plus the 4N(β − 2) cost for the controlled rotations, plus the 2N cost for the controlled swaps (for the spin), plus twice the 3n_{L,Ξ} − 3η + 2b_r − 9 cost for preparing the equal superposition state, n_{L,Ξ} − 1 for the reflection, and 2 for the unary iteration for the phase estimation and making the reflection controlled. There is also a cost of n_Ξ − 1 for the reflection on the second register, if this were possible as claimed in [10]. The total cost is therefore the sum of these terms, where b = 2n_Ξ + 2n_L + ℵ + 2, with the numbers of qubits as given in Eq. (C14). The values of k should be chosen as powers of 2 that minimise the cost. As mentioned above, the procedure needs to be made more complicated, because the state preparation should be performed separately on the two registers, rather than jointly on both registers at once, because a reflection needs to be made on one register. An alternative approach is to prepare the state on the first register, then prepare the state on the second register controlled on the first register, rather than preparing the state on both registers together.
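The QROM cost model used throughout (⌈d/k⌉ + b(k − 1) Toffolis to compute, ⌈d/k⌉ + k to erase, with k a power of 2) is easy to tabulate classically; the helper below (our own sketch) picks the best power of 2, which sits near √(d/b):

```python
import math

def qrom_toffoli_cost(d, b, k):
    """Toffoli cost to read one of d items of b bits with power-of-2
    parameter k: ceil(d/k) unary-iteration steps plus b*(k-1) swaps."""
    return math.ceil(d / k) + b * (k - 1)

def qrom_erase_cost(d, k):
    """Measurement-based uncomputation costs ceil(d/k) + k."""
    return math.ceil(d / k) + k

def best_k(d, b):
    # try every power of 2 up to d; the optimum is near sqrt(d/b)
    ks = [2 ** i for i in range(max(1, d.bit_length()))]
    return min(ks, key=lambda k: qrom_toffoli_cost(d, b, k))

assert qrom_toffoli_cost(10000, 30, 16) == 625 + 450
assert best_k(10000, 30) == 16
```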
It is also necessary to account for the one-electron term. It is possible to add that in an explicit way as proposed in [10], but that significantly increases the complexity because of the need to perform high accuracy rotations again. Instead, one can combine it with the two-electron term in a similar way as we have described above for the single low rank factorization. The principle is to use an additional basis state on the first register to flag the one-electron term. Then the second register will have a state preparation corresponding to the one-electron term for this basis state on the first register, but only for the first part, not the second, because we are not applying the oblivious amplitude amplification within a single step for the one-electron term.
The detailed procedure is shown in Figure 16. The steps are as follows.
1. First we need to prepare the appropriate superposition state on the register, which is indicated as the box labelled "prep" in Figure 16. The steps to perform that are as follows.
(a) Prepare an equal superposition over L + 1 basis states, with cost 3n_L − 3η + 2b_r − 9 Toffolis, as per the analysis in Appendix A, where 2^η is a factor of L.
(b) Use a QROM for state preparation on the first register, outputting an alt value of size n_L and a keep value of size ℵ_1; the output size gives the corresponding QROM cost.
(c) Perform an inequality test with cost ℵ_1.
(d) Perform the controlled swap with cost n L , completing the state preparation on the first register.
2. Before we prepare the state on the p register, we need to output data from the ℓ register. This is indicated by "In ℓ" and "data ℓ" in Figure 16. For each value of ℓ on the first register we need to output Ξ^(ℓ) (with n_Ξ bits) and b_r bits for a rotation on an ancilla qubit. It is also convenient to output the offset (with n_{L,Ξ} bits) at this stage, and to determine if ℓ = 0, giving the output size and QROM cost.

FIG. 16. The circuit for performing the state preparation and controlled operations (select) for the double factorisation approach. The register labelled ℓ is the first control register containing ℓ, and p labels the second control register. The registers labelled succ ℓ and succ p are the registers that flag success of the preparation of the equal superposition state for the ℓ and p registers. The register labelled ℓ = 0 is a temporary register used to keep the result of an inequality test checking whether ℓ = 0, which is used to ensure that the second half of the circuit has no effect for that value of ℓ, which is used to label the one-electron term. The registers labelled Ξ^(ℓ), "offset" and "rot" are the outputs of the QROM on the ℓ register that are mainly used for the controlled preparation of the state on the p register. The register labelled "rotations" is the data needed for the basis rotations for implementing Z_{p,σ}. The register labelled |0⟩ is used to select the spin, and |ψ↓⟩ and |ψ↑⟩ are the spin down and up components of the target system.

3. Next we need to prepare the state on the p register controlled on the ℓ register. This step is indicated by "In ℓ" and "prep p" in Figure 16, and uses the data output by the QROM in the previous step. The steps needed for the preparation are as follows.
(a) The controlled preparation of an equal superposition state may be performed in the following way with complexity 7n Ξ + 2b r − 6. The steps are as follows.
i. Use a string of n_Ξ − 1 Toffolis (and Clifford gates) to produce a new n_Ξ-qubit register that has zeros matching the leading zeros in the binary representation of Ξ^(ℓ), and ones after that. These qubits are used to control the Hadamards in the next step.
ii. Perform n_Ξ controlled Hadamards with cost n_Ξ. A controlled Hadamard can be performed with a single Toffoli by catalytically using a T state, as shown in Figure 17. Therefore we count the cost of each controlled Hadamard as 1.
iii. Perform an inequality test with the register containing Ξ^(ℓ) with cost n_Ξ.
iv. Rotate the ancilla qubit with cost b_r − 2 based on the rotation angle given by the output of the QROM.
v. The reflection on the result of the inequality test and ancilla qubit is a controlled phase, which is a Clifford gate.
vi. Invert the rotation with cost b_r − 2 and the inequality test with Cliffords.
vii. Perform the n_Ξ controlled Hadamards again.
viii. Reflect about the zero state on n_Ξ + 1 qubits (the qubits where the state preparation is being performed and the rotated ancilla) with cost n_Ξ − 1.
ix. Perform the n_Ξ controlled Hadamards again. Now the binary-to-unary conversion can be inverted with Cliffords.
x. Perform an inequality test again with cost n_Ξ. Now, flagged on the success of the inequality test, we have prepared an equal superposition state on the second register that can be used for state preparation on the second register.
(b) We now need to add the offset to the second register to provide a contiguous register to apply QROM to for state preparation. This addition has cost n L,Ξ − 1.
(c) Next, apply the QROM to output the alt and keep values, with an output size in which the +2 is for the sign bit and its alt value. In this part there are LΞ + N/2 outputs to give, because the rotations for the one-electron term are needed as well; in the next part there are only LΞ. Therefore the two QROMs have different costs.
(d) The remainder of the state preparation uses an inequality test with cost ℵ_2, and a controlled swap with cost n_Ξ.
4. Apply the number operators via rotations and QROM with the following steps.
(a) To apply the QROM for the rotation angles, we need to again add an offset to the second register to provide a contiguous register, with cost again n L,Ξ − 1. This operation is indicated by the controlled box with + in it in Figure 16, and a box with − in it for its inversion.
(b) Use QROM to output the rotation angles, which is shown as "In p" and "data p" in Figure 16, with the corresponding QROM cost.
(c) Perform controlled swaps with the spin qubit as control, with cost N/2.
(e) Apply the Z_1 operation, controlled on success of the preparation of the ℓ and p registers. This control gives a cost of 1.
(f) Reverse the controlled rotations and controlled swaps, with cost N(β − 2) and N/2.
(g) Reverse the QROM, with the corresponding reduced cost.
(h) Reverse the addition onto the contiguous register, with cost again n_{L,Ξ} − 1.
5. Invert the state preparation on the p register, which is shown as "In ℓ" and "prep†_p" in Figure 16. The cost is the same as given in step 2, except that the cost of the QROM is reduced.
6. The reflection on the second register only (which is what caused the procedure to be this complicated) is performed. The qubits that need to be reflected on are as follows.
(a) The n Ξ qubits that the state is prepared on.
(b) The ℵ 2 that are used for the superposition state for the coherent alias sampling.
(c) One that is rotated for the amplitude amplification.
(d) One for the spin.
That gives a total of n_Ξ + ℵ_2 + 2. This reflection also needs to be controlled on success of the preparation of the ℓ register and on the qubit flagging ℓ = 0, so the complexity of the reflection is n_Ξ + ℵ_2 + 2.
7. Repeat steps 2 to 5. This time, because we do not need to output values for the one-electron terms, the costs of the QROM and inverse QROM for state preparation are reduced, and the costs for the QROM for the rotations are reduced as well. This time Z_1 must be controlled by the qubit flagging ℓ = 0, which introduces a cost of another Toffoli.
8. Invert the QROM in step 2 and the preparation in step 1, where the costs of the QROMs are reduced.
9. There is also a cost for the reflection needed for constructing the quantum walk. The qubits that need to be reflected on are as follows.
(a) The n_L qubits for the ℓ register.
(b) The n_Ξ qubits for the p register.
(c) The ℵ_1 qubits for the equal superposition state for the alias sampling for the ℓ register.
(d) The ℵ_2 qubits for the equal superposition state for the p register.
(e) The two qubits that are rotated.
(f) One qubit for the spin.
10. The steps of the quantum walk are made controlled using unary iteration on an ancilla. This introduces a cost of another two Toffolis for the unary iteration and making the reflection controlled.
Adding together all these costs then gives a total of 2(3n_L − 3η + 2b_r − 9) + L + 1 plus the QROM, rotation, and reflection terms above, with b_p1 = n_L + ℵ_1, b_o = n_Ξ + n_{L,Ξ} + b_r, and b_p2 = n_Ξ + ℵ_2 + 2, and the numbers of qubits as defined in Eq. (C14). Next consider the logical qubit count. The qubits used are as follows.
1. The ⌈log(I + 1)⌉ qubits for the control register for the state preparation and ⌈log(I + 1)⌉ − 1 for the unary iteration on this register.
2. There are N qubits used for the system.
3. The first register that is prepared needs n L qubits, plus one to flag success of the state preparation and one that is rotated as part of the state preparation.
4. The output of the QROM needs n L + ℵ 1 , another ℵ 1 are used for the equal superposition state to take the inequality with, and another qubit is needed as the output of the inequality test. We can ignore temporary ancillas used as more will be needed at a later step.
5. The second QROM on the first register used to output the data needed for the equal superposition state on the second register uses b o output bits.
6. The second register needs n Ξ bits, plus another bit for flagging success of preparing the equal superposition and another that is rotated.

7. A register with size n_{L,Ξ} for outputting the contiguous register for the state preparation on the second register. This is a temporary ancilla that can be erased after the QROM with Cliffords by using the out-of-place adder. It is then computed again for the QROM for the rotation angles. There it can be computed in place, so it does not add to the qubit count.
8. The QROM used for preparing the state on the second register uses b p2 qubits output, as well as a number of temporary qubits that are less than those needed for the rotation angles output later.
9. The state preparation needs another ℵ 2 qubits in a superposition, as well as another qubit flagging success of the inequality test.
10. The angles for the rotations need k_r βN/2 qubits.
11. The phase gradient state, which needs β qubits.
12. Another control qubit is needed for the spin.
13. A T state on a single qubit is used to perform the controlled Hadamards with Toffoli gates.
Adding together these ancilla costs gives

Numerical determination of double low rank factorization
The numerical determination of the double low rank factorization can be ambiguous without further details. In this subsection, we aim to provide full details of how the factorization is obtained in this work. The original approach proposed by Peng and Kowalski [34] worked with separate thresholds for the first and the second factorizations. Instead, von Burg et al. [10] proposed a truncation scheme based on just a single threshold, which we further elaborate here. [Table caption: λ_T is 478.1 Hartree, which is the Schatten norm of T' [35]. A threshold is used to truncate eigenvectors based on the truncation strategy described in [10] and Eq. (C41). The entry in blue is the one used for our resource estimates.]
2. The second factorization was originally proposed to be done with the singular value decomposition of W^(ℓ) [34], but following von Burg et al. [10] we use an eigendecomposition of W^(ℓ).
3. For a given threshold, and looping over the first factorization index ℓ from 1 to L, we discard the m-th eigenvector in the second factorization that satisfies the truncation criterion. We note that if no eigenvectors are kept at some ℓ then we discard the rest of the vectors for larger ℓ without going through them.
This scheme was used to generate the numerical data presented in Table XIII and Table XIV.
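The scheme above can be sketched classically. The following is our own illustrative implementation (not the code used for the paper's numerics); it assumes an exactly factorizable, symmetric two-electron tensor and uses an eigendecomposition of the flattened tensor for the first factorization in place of the Cholesky decomposition:

```python
import numpy as np

def double_factorize(V, threshold=0.0):
    """Sketch of the double low rank factorization.

    Assumes V is an N x N x N x N two-electron tensor whose flattened
    N^2 x N^2 matrix is symmetric positive semidefinite.  The first
    factorization uses an eigendecomposition (equivalent, up to rotations
    among degenerate factors, to the Cholesky factors W^(l) in the text);
    the second is an eigendecomposition of each W^(l).  The threshold is a
    simplistic stand-in for the truncation rule of Eq. (C41)."""
    N = V.shape[0]
    s, u = np.linalg.eigh(V.reshape(N * N, N * N))
    factors = []
    for i in np.argsort(s)[::-1]:
        if s[i] <= 1e-12:
            break
        W = np.sqrt(s[i]) * u[:, i].reshape(N, N)
        f, U = np.linalg.eigh(0.5 * (W + W.T))   # second factorization
        keep = np.abs(f) > threshold
        factors.append((f[keep], U[:, keep]))
    return factors

# check: with no truncation the factors reconstruct V exactly
rng = np.random.default_rng(0)
N, L = 4, 3
Ws = [(lambda A: 0.5 * (A + A.T))(rng.standard_normal((N, N))) for _ in range(L)]
V = sum(np.einsum('pq,rs->pqrs', W, W) for W in Ws)
V_rec = sum(np.einsum('pq,rs->pqrs', U @ np.diag(f) @ U.T, U @ np.diag(f) @ U.T)
            for f, U in double_factorize(V))
assert np.allclose(V, V_rec)
```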

Numerical data for hydrogen chains and H 4
The average rank of the second factorization for the hydrogen chains and H4 is shown in Figure 18. Theoretically, Ξ cannot scale worse than O(N). However, we obtained O(N^1.29) in Figure 18(b). Therefore, in Table II we ...

Analysis of the qDRIFT approach

Here we discuss in more detail a recent randomized approach, often called qDRIFT [49], that approximates time evolution using a randomized compilation of the evolution. The randomized nature of qDRIFT leads to a subtle point, arising between the cases considered here and the analysis provided in [49,50], in that the results in both cases discuss the probability of failure for the algorithm rather than the mean-square error. The analysis below focuses on the mean-square error in phase estimation that arises from qDRIFT, and we see that, because of the possibility of large deviations, the worst case scaling of the complexity can be substantially worse than the O(λ²/ε²) scaling anticipated from such results.
Before jumping into the analysis, we will review the fundamentals of the qDRIFT algorithm. The idea behind the evolution is to implement a channel for the Hamiltonian H = Σ_j H_j, where p_j = ‖H_j‖/Σ_j ‖H_j‖ and ‖·‖ is the Schatten norm. In our case we will assume that each H_j is proportional to a unitary, so then p_j = ‖H_j‖/λ. This channel can be implemented by randomly selecting a term from the Hamiltonian and simulating the evolution of just that term for a duration that scales inversely with the probability of drawing that Hamiltonian term. From [49] we have a bound on the error if we define E(τ, 0) : ρ → e^{−iHτ} ρ e^{iHτ}. Note that this exact same bound holds if we consider a controlled evolution using qDRIFT, which can be thought of as simulating the Hamiltonian H_ctrl = |1⟩⟨1| ⊗ H = Σ_j |1⟩⟨1| ⊗ H_j. For improved performance of the phase estimation, one can also use control between positive and negative Hamiltonian evolution, which corresponds to H_ctrl = −Z ⊗ H and has the same value of λ. This implies that the same bound holds in the controlled and the uncontrolled case. Here the diamond distance bounds the worst case trace distance between outputs that can be attained over all inputs to the channels. Further, it can be seen using the arguments from [49,100] that a corresponding bound holds for any positive integer R, and by the von Neumann trace inequality it follows that a bound holds for any bounded observable O. Thus if we choose τ = t/k and R = kr, then the error in the expected phase obeys the corresponding bound. Here φ is a random variable that describes the estimate of the phase that we would see from the phase estimation procedure in the absence of simulation error, and φ̃ is the exact eigenphase that we would find by diagonalizing the evolution given by qDRIFT. Then the bound follows from the definition of the variance, under the assumption that our estimator of the eigenphase is unbiased; the last inequality comes from bounds on the variance of the estimate of the eigenphase that comes from phase estimation [100].
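The qDRIFT channel just described can be written down exactly for a small system. The sketch below (our own toy example with three Pauli terms, not one of the Hamiltonians studied in the text) builds the averaged channel Σ_j p_j V_j ρ V_j† applied R times and checks that its error relative to e^{−iHt} shrinks as R grows:

```python
import numpy as np

# two-qubit toy Hamiltonian H = sum_j h_j P_j with h_j > 0
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.diag([1.0, -1.0]).astype(complex)
terms = [(0.5, np.kron(Z, I2)), (0.3, np.kron(I2, Z)), (0.2, np.kron(X, X))]
lam = sum(h for h, _ in terms)           # 1-norm of coefficients
H = sum(h * P for h, P in terms)

def exact_evolution(rho, t):
    e, v = np.linalg.eigh(H)
    U = v @ np.diag(np.exp(-1j * e * t)) @ v.conj().T
    return U @ rho @ U.conj().T

def qdrift_channel(rho, t, R):
    """Exact average of the qDRIFT channel: R repetitions of the mixed
    unitary sum_j p_j V_j rho V_j^dag with p_j = h_j/lam and
    V_j = exp(-i (lam t / R) P_j); since P_j^2 = I, expm is cos/sin."""
    th = lam * t / R
    Vs = [(h / lam, np.cos(th) * np.eye(4) - 1j * np.sin(th) * P) for h, P in terms]
    for _ in range(R):
        rho = sum(p * V @ rho @ V.conj().T for p, V in Vs)
    return rho

def trace_dist(a, b):
    return 0.5 * np.abs(np.linalg.eigvalsh(a - b)).sum()

psi = np.zeros(4); psi[1] = 1.0          # |01>
rho0 = np.outer(psi, psi).astype(complex)
errs = [trace_dist(qdrift_channel(rho0, 1.0, R), exact_evolution(rho0, 1.0))
        for R in (4, 64)]
# error shrinks roughly as lam^2 t^2 / R
assert errs[1] < errs[0] and errs[1] < 0.05
```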
The number of exponentials is often used as a surrogate for the cost of a quantum simulation, and in the case of qDRIFT this cost is N_exp = kr. Solving N_exp = kr for k gives k = N_exp/r. Then substituting it into Eq. (D6) and setting the result equal to ε²t² gives

ε²t² ≈ π²/(4r²) + 4π²(rλt)⁴/N²_exp.   (D8)

Solving for N_exp then gives

N_exp = √(4π²(rλt)⁴/(ε²t² − π²/(4r²))) = 4πr³λ²t²/√(4ε²t²r² − π²).   (D9)

Taking the derivative with respect to r gives

dN_exp/dr = 12πr²λ²t²/√(4ε²t²r² − π²) − 16πr⁴ε²λ²t⁴/(4ε²t²r² − π²)^{3/2}.   (D10)

Setting the derivative equal to zero gives 12πr²λ²t²(4ε²t²r² − π²) = 16πr⁴ε²λ²t⁴, so r = √(3/2) π/(2εt).
Substituting into the expression for N_exp then gives the optimized count. This is a monotonically decreasing function of t, so we optimize this result by choosing t to be as large as possible.
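The stationary point r = √(3/2)·π/(2εt) can be verified numerically by scanning N_exp(r) = 4πr³λ²t²/√(4ε²t²r² − π²) over its domain r > π/(2εt), with arbitrary illustrative values of λ, t, ε:

```python
import numpy as np

lam, t, eps = 1.0, 1.0, 0.01

def n_exp(r):
    # Eq. (D9): number of exponentials as a function of the segmentation r
    return 4 * np.pi * r**3 * lam**2 * t**2 / np.sqrt(4 * eps**2 * t**2 * r**2 - np.pi**2)

r_star = np.sqrt(1.5) * np.pi / (2 * eps * t)        # predicted minimizer
rs = np.linspace(1.01 * np.pi / (2 * eps * t), 2 * r_star, 200001)
assert abs(rs[np.argmin(n_exp(rs))] - r_star) < 0.01 * r_star
```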
Without a priori knowledge about the spectrum of the Hamiltonian, the maximum value of t that we can take is t ≤ π/λ, which leads to a scaling as O(λ³/ε³) and so would yield very large gate counts. The reason for the large scaling is that the diamond norm bounds the difference in the probabilities of measurement results, which can allow an increase in probability concentrated at the maximum error, leading to a large mean-square error. However, for many practical purposes it is more reasonable to construct a confidence interval for the measurement result. Then a bound on the diamond norm of 0.01, for example, would change a 96% confidence interval to a 95% confidence interval. It would therefore be reasonable to allow a diamond norm of some fixed size ε_qD, and consider the scaling in the other parameters.
In the following we will take k = 1, so N_exp = r. Bounding the error due to the phase estimation, and choosing the minimum value of r satisfying these inequalities, gives N_exp = r ≈ π²λ²/(2ε_qD ε²_pha). (D16) If one were concerned with constructing a confidence interval of width ε (rather than a root-mean-square error of ε), then one could simply use ε_pha ∝ ε for constant ε_qD. Then we obtain N_exp = O(λ²/ε²), rather than a power of 3 in the scaling. It is this scaling which we use as the basis for that given in Table II. In the remainder of this Appendix we will consider the scaling given in Eq. (D13), and we will use that as the basis for the gate counts given in Table III. The Hamiltonian can be expressed such that each H_j is a weighted Pauli operator. In this case e^{−iH_jτ/p_j} can be implemented using at most O(N) one and two qubit Clifford operations and a single R_z gate [11]. We implement these rotations using a state that is constructed using a (q + 1)-qubit quantum Fourier transform on the state |1⟩. We can then use this as a resource state to implement a single qubit rotation using incrementer gates as described in Appendix A of [82], which requires q − 1 Toffoli gates to implement e^{−iZπ/2^q} using the (q + 1)-qubit resource state in (D17). Next note that for any value of λt/k we can round the desired rotation angle to a multiple of π/2^{⌈log(πk/λt)⌉}. Thus we can choose the evolution time t such that we will, at worst, need to halve the evolution time to ensure that the rotation angle coincides with an angle available from the phase gradient state. This will necessitate choosing t ∈ [π/2λ, π/λ], which corresponds to taking q = ⌈log(2k)⌉ ≈ ⌈log(6π²λ²/(√2 ε²))⌉.
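The phase gradient trick referenced above can be illustrated classically: adding an integer m into the resource register kicks back a phase e^{2πim/2^q}, so a controlled addition enacts a Z-rotation on the control. A small numpy sketch (our own illustration):

```python
import numpy as np

q, m = 6, 5
M = 2 ** q
k = np.arange(M)
phi = np.exp(-2j * np.pi * k / M) / np.sqrt(M)   # phase gradient resource state

# the modular incrementer |k> -> |k + m mod M> acting on the resource state
shifted = np.roll(phi, m)

# the register returns to the same state up to the kicked-back phase, so a
# controlled addition implements a Z-rotation by 2*pi*m/M on the control
assert np.allclose(shifted, np.exp(2j * np.pi * m / M) * phi)
```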
The number of Toffoli gates needed follows from the above. The total number of ancillae needed by the algorithm is given by the sum of the ancillae needed to store the resource state |ψ_{q+1}⟩, the ancillae needed for the phase estimation step using the method of [16], any ancillae needed to implement the carry logic of the incrementer circuit [83], and finally the N target qubits. These spatial overheads sum to

N_qubits = N + (q + 1) + 2⌈log(r + 1)⌉ − 1 + q ≈ N + 2⌈log(2k)⌉ + 2⌈log(√(3/2) λ/ε)⌉ + 2 ≈ N + 2⌈log(3√3 π²λ³/ε³)⌉ + 2.   (D19)

The scaling of this algorithm with respect to both λ and ε is roughly cubically worse than the cost of applying qubitized phase estimation. These comparisons suggest at first glance that randomized simulations are not a competitive technique for simulating such dynamical systems. However, the results in [18] suggest that incorporating randomness to simulate low-importance terms and using conventional methods to simulate high-importance terms may be a much more profitable approach to using randomized simulation methods for challenge problems in chemistry.
The specific numbers for the Reiher et al. and Li et al. Hamiltonians, using double low rank factorization can be found using (D18) and (D19). Since qDRIFT doesn't have quantum costs that depend at all on the number of terms in the Hamiltonian we take the smallest thresholds considered in Tables IX and X (5 × 10 −5 ) for the Reiher et al. and Li et al. Hamiltonians respectively. This corresponds to λ = 2183.6 and λ = 1600.9 respectively. Taking chemical accuracy as 0.0016 Hartree as our target accuracy then yields N Toff = 2.9 × 10 21 for the Reiher Hamiltonian and N Toff = 1.1 × 10 21 for the Li Hamiltonian. Further, the Reiher Hamiltonian requires 334 logical qubits to simulate within this accuracy whereas the Li Hamiltonian would require approximately 374 logical qubits.
Here θ^T_pq gives the sign of T_pq as before, θ^(µ)_p gives the sign of χ^(µ)_p, and θ^ζ_µν gives the sign of ζ_µν. The key thing to note is that the second part of this state, corresponding to the two-body terms, can be factorised. This factorisation is the key feature of the state that makes it easier to prepare. One may therefore start by preparing a state on µ and ν, then, controlled on µ and ν, prepare the states in brackets.
2. Using the structure of tensor hypercontraction to implement the qubitization state preparation

In preparing the overall state in Eq. (E4), one can first prepare the state on the first four registers, with amplitudes proportional to |T_pq| for the one-body term and |ζ_µν| for the two-body term. After this preparation, for the two-body term flagged by 1 on the first qubit, we need to perform four applications of the mapping in Eq. (E7). This is a preparation on two registers, but this time there is no symmetry. For the first preparation, the steps to be performed are as follows.
1. Rotate the first qubit to give the appropriate relative weighting between the one-and two-body terms.
2. For 1 on the first qubit, prepare an equal superposition over µ and ν for 1 ≤ µ ≤ ν ≤ M , or for 0 on the first qubit prepare an equal superposition with the restrictions 1 ≤ µ ≤ ν ≤ N/2. That can be performed using inequality tests and amplitude amplification in a similar way to that explained in Appendix B. The difference is that instead of just using the inequality test ν ≤ M , you use a controlled inequality test of ν ≤ N/2 for 0 on the first qubit, or ν ≤ M for 1 on the first qubit. The extra inequality tests increase the Toffoli cost to 8n M + 2b r + O(1), where b r are the bits of precision for rotation on an ancilla and n M = log M .
3. Compute µ(µ − 1)/2 + ν, which can be performed with complexity n 2 M + n M − 1. In the case of 0 on the first qubit (for the one-body term), add M (M + 1)/2 to yield a contiguous register. That addition has complexity 2n M + O(1).
4. Perform a QROM using this contiguous register to output alt values of µ and ν, keep probabilities, and two values (one is the alt value) of the sign θ^ζ_µν. The output size for the QROM depends on ℵ, the bits of precision for ζ_µν. The complexity of the QROM is a function of L_ζ = M(M + 1)/2 + N²/8 + N/4 (E10) and k_ζ, which must be a power of 2.
5. Perform an inequality test between the keep value and another register in an equal superposition, with cost ℵ.
6. Perform a swap between µ, ν and the alt values controlled on the result of the inequality test, with cost 2n M . The signs can be applied with Clifford gates.
7. Apply a swap between the µ and ν registers controlled by a qubit in a |+ state, with cost n M .
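The contiguous index computed in step 3 above is a triangular numbering: with the convention 1 ≤ ν ≤ µ ≤ M (the text's ordering convention may be the mirror image), µ(µ − 1)/2 + ν is a bijection onto {1, …, M(M + 1)/2}, which is what makes the register contiguous. A quick check:

```python
def tri_index(mu, nu):
    # contiguous index for pairs with 1 <= nu <= mu
    return mu * (mu - 1) // 2 + nu

M = 10
images = sorted(tri_index(mu, nu) for mu in range(1, M + 1) for nu in range(1, mu + 1))
# bijection onto a contiguous range of M(M+1)/2 values
assert images == list(range(1, M * (M + 1) // 2 + 1))
```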
The total complexity is then as above, where ε_rot is the accuracy required for the rotation on the first qubit. In inverting the state preparation, all costs are the same except the QROM cost, which is reduced, giving the corresponding total cost. For the preparation given in Eq. (E7), the considerations are similar, except we do not take advantage of symmetry, so L_χ = MN/2. There the preparation needs a QROM to run through all values of p and µ, and it is convenient to use the QROM for two registers as in Appendix G. The total Toffoli costs are as follows.
1. Prepare an equal superposition over N/2 basis states in the p register, with cost 3n_N − 3η + 2b_r − 9, where η is the largest number such that 2^η is a factor of N/2, and b_r is the number of bits of precision for the rotation on an ancilla.
2. Because we only need to output alternate values of p (and not µ), the output size for the QROM is reduced, and we can apply the QROM with the corresponding cost (using the method for separate registers in Appendix G).
3. Perform the inequality test with cost ℵ.
4. The controlled swap does not need to touch the register with µ, because we are just preparing a superposition over p for a given µ. Therefore the controlled swap has cost n N .
The total costs are then for preparation and for inverse preparation. It is also possible to combine the preparation of some of the equal superposition states together, which would give slightly different costs. We will not analyse that here, because the method that is best tends to depend on the particular example.
We can now combine all of the Toffoli costs. Clearly, we will need to prepare and unprepare the χ state four times and prepare and unprepare the ζ state once, so the overall cost of preparation and unpreparation is four times the χ cost plus the ζ cost. For the total cost, there will also be 4N for the cost of implementing the select operation, and 2n_M + 4n_N + 5ℵ + O(1) for reflections on the control registers to construct the overall step of the quantum walk. There are 2n_M + 4n_N + O(1) qubits that the state is prepared on, and we need to reflect on the 5ℵ qubits used for equal superposition states as well.
The logical qubit costs are as follows, where we omit a number of single ancillas (about 20) for simplicity.

While iterating through the ⌈N₁/k₁⌉ values on x, it is used as a control for iterating through the ⌈N₂/k₂⌉ values on the most significant bits of y. Since that iteration is done in a controlled way, its cost is ⌈N₂/k₂⌉ − 1. Since it is done ⌈N₁/k₁⌉ times, the overall cost is ⌈N₁/k₁⌉(⌈N₂/k₂⌉ − 1). Adding to that the cost of the iteration on x gives a cost of ⌈N₁/k₁⌉⌈N₂/k₂⌉ − 2, where the −2 is if the overall QROM does not need to be made controlled. We will omit the −2 for simplicity, and for consistency with the way these costs for QROM are usually quoted. Then the cost of the controlled swaps at the end is identical to what it usually is for the QROM, so for output size b it will be b(k₁k₂ − 1). That gives a total cost (omitting the −2) of ⌈N₁/k₁⌉⌈N₂/k₂⌉ + b(k₁k₂ − 1). This should be compared to the cost for the case of a contiguous register with k = k₁k₂, namely ⌈N₁N₂/k⌉ + b(k − 1). The cost using a contiguous register will be slightly less, but the difference will often be less than the cost of computing the contiguous register. For example, say N₁ = 350, N₂ = 72, and b = 20. Then with k₁ = 8 and k₂ = 4, the cost of the QROM is decreased by only 4 using a contiguous register, but the cost of computing the contiguous register is 17. The number of ancillas is increased to ⌈log(N₁/k₁)⌉ + ⌈log(N₂/k₂)⌉ + bk₁k₂.
That is because ⌈log(N₁/k₁)⌉ qubits are required for iteration on the first register and ⌈log(N₂/k₂)⌉ for iteration on the second register, as well as bk₁k₂ for the outputs. In practice this usually only needs one more ancilla than using a contiguous register, which is less than would be needed to store the contiguous register itself.
In the same way, in uncomputing the QROM where it is necessary to perform a phase fixup, there is the same change to the cost for outputting the data, so the cost becomes ⌈N₁/k₁⌉⌈N₂/k₂⌉ + k₁k₂. In the example with N₁ = 350, N₂ = 72, but taking k₁ = 16, k₂ = 8, the cost is only increased by one Toffoli over the cost for a contiguous register. The number of ancillas needed is ⌈log(N₁/k₁)⌉ + ⌈log(N₂/k₂)⌉ + k₁k₂. (G7)
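The worked numbers in this appendix can be checked directly. The sketch below implements the two-register and contiguous-register cost formulas for both computing and uncomputing the QROM, and reproduces the N₁ = 350, N₂ = 72 example:

```python
import math

def qrom_cost_two_registers(n1, n2, k1, k2, b):
    """Compute-cost of a QROM over two registers of sizes n1 and n2,
    iterated k1 and k2 entries at a time, with b output bits."""
    return math.ceil(n1 / k1) * math.ceil(n2 / k2) + b * (k1 * k2 - 1)

def qrom_cost_contiguous(L, k, b):
    """Compute-cost of a QROM over a single contiguous register of L items."""
    return math.ceil(L / k) + b * (k - 1)

def uncompute_two_registers(n1, n2, k1, k2):
    """Uncomputation (phase-fixup) cost for the two-register QROM."""
    return math.ceil(n1 / k1) * math.ceil(n2 / k2) + k1 * k2

def uncompute_contiguous(L, k):
    """Uncomputation cost for the contiguous-register QROM."""
    return math.ceil(L / k) + k

# Example from the text: N1 = 350, N2 = 72, b = 20.
print(qrom_cost_two_registers(350, 72, 8, 4, 20)
      - qrom_cost_contiguous(350 * 72, 32, 20))   # compute: difference of 4
print(uncompute_two_registers(350, 72, 16, 8)
      - uncompute_contiguous(350 * 72, 128))      # uncompute: difference of 1
```

As the text notes, the 4-Toffoli saving of the contiguous register is outweighed by the 17 Toffolis needed to compute the contiguous index in the first place.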