Encoding Electronic Spectra in Quantum Circuits with Linear T Complexity

We construct quantum circuits which exactly encode the spectra of correlated electron models up to errors from rotation synthesis. By invoking these circuits as oracles within the recently introduced "qubitization" framework, one can use quantum phase estimation to sample states in the Hamiltonian eigenbasis with optimal query complexity $O(\lambda/\epsilon)$, where $\lambda$ is an absolute sum of Hamiltonian coefficients and $\epsilon$ is the target precision. For both the Hubbard model and the electronic structure Hamiltonian in a second-quantized basis diagonalizing the Coulomb operator, our circuits have T gate complexity $O(N + \log(1/\epsilon))$, where $N$ is the number of orbitals in the basis. This enables sampling in the eigenbasis of electronic structure Hamiltonians with T complexity $O(N^3/\epsilon + N^2 \log(1/\epsilon)/\epsilon)$. Compared to prior approaches, our algorithms are asymptotically more efficient in gate complexity and require fewer T gates near the classically intractable regime. Compiling to surface code fault-tolerant gates and assuming per-gate error rates of one part in a thousand reveals that one can error correct phase estimation on interesting instances of these problems beyond the current capabilities of classical methods using only about a million superconducting qubits in a matter of hours.


I. INTRODUCTION
The ubiquitous problem of predicting material properties and chemical reactions from ab initio quantum mechanics is among the most anticipated applications of quantum computing. The limitation of most classical algorithms for modeling the physics of superconductivity and molecular electronic structure arises from the seemingly exponential growth of entanglement required to accurately capture strong correlation in systems of interacting electrons. This apparent classical intractability was referenced by Feynman in his seminal work as a key motivation for why we need quantum computers [1,2]. Fourteen years later, Lloyd formalized the concept of a universal quantum simulator [3] and demonstrated an extension for treating systems of interacting electrons in second quantization [4].
Since then, most work developing fermionic quantum simulation methods has focused on time evolution as a means of estimating Hamiltonian spectra and preparing eigenstates [5] via the quantum phase estimation algorithm [6]. Beginning with the proposal of [7], the idea that one should use phase estimation and adiabatic state preparation [8][9][10] to extract quantum chemistry ground state energies became especially popular. More recently, experimental demonstrations [11][12][13][14][15] have focused research on the development of variational algorithms [16,17] which are often [18,19], but not always [20,21], inspired by time-evolution primitives.
Performing quantum phase estimation to sample Hamiltonian spectra requires a quantum circuit to implement an operation $W(H)$ which has eigenvalues that are a known (and efficient-to-compute) function of the eigenvalues of $H$. Most past work has analyzed phase estimation of circuits corresponding to dynamical Hamiltonian simulation, i.e. $W(H) \approx e^{-iH\tau}$ for some duration $\tau$ [6]. We denote by $f$ the cost of implementing a primitive circuit that is repeated to realize $W(H)$; e.g., a Trotter step [22] or Taylor series segment [23]. We further define $g(\epsilon)$ as the number of times that one must repeat that primitive to ensure that error in the spectrum of $H$ encoded in the eigenphases of $W(H)$ is at most $O(\epsilon)$. Then, the cost of phase estimation is bounded by
$$O\!\left(\frac{f \cdot g(\epsilon)}{\epsilon\, \|W'(H)\|}\right), \qquad (1)$$
where $\|\cdot\|$ denotes the spectral norm and the operation $W'(H)$ is that where we have taken the derivative of the function of the eigenvalues. That is, $W'(H)$ has eigenvalues which are a function of the eigenvalues of $H$, and that function is the derivative of the function which gives the eigenvalues of $W(H)$. For the case of dynamical time-evolution,
$$\left\|W'(H)\right\| = \tau, \qquad (2)$$
implying that the cost of phase estimation is $O(f \cdot g(\epsilon)/(\epsilon\tau))$ in this context. Modern Hamiltonian simulation methods such as the signal processing algorithm [24] and qubitization [25] have saturated the asymptotic lower bound that is possible for $g(\epsilon)$ within a query model that aims to synthesize $e^{-iH\tau}$:
$$g(\epsilon) \in \Omega\!\left(\lambda\tau + \frac{\log(1/\epsilon)}{\log\log(1/\epsilon)}\right). \qquad (3)$$
The definition of $\lambda$ depends on the query model; e.g., in models for which the Hamiltonian is given as a weighted sum of unitaries, $\lambda$ is the sum of the absolute values of the weightings [25]. However, $W(H) \approx e^{-iH\tau}$ is not the only encoding from which one may sample the spectrum of $H$ via phase estimation. Recent papers [26,27] have advocated performing phase estimation on a quantum walk operator corresponding to $W(H) = e^{i\arccos(H/\lambda)}$, which can be realized exactly as a quantum circuit without approximations beyond those required for rotation synthesis [25]. (This quantum walk operator also produces eigenvalues corresponding to $e^{-i\arccos(H/\lambda)}$, but we will ignore those for simplicity of the exposition here.) Even within a black box query model one can achieve $g(\epsilon) = O(1)$ if the goal is to implement $e^{i\arccos(H/\lambda)}$ rather than $e^{-iH/\lambda}$. Performing phase estimation on either circuit would provide the same information since the spectra of these operators are isomorphic. In this case, the cost of phase estimation is $O(f \cdot \lambda/\epsilon)$, which follows from Eq. (1), $g(\epsilon) = O(1)$, and
$$\left\|W'(H)\right\| \geq \frac{1}{\lambda}. \qquad (4)$$
This work develops methods with such scaling for modeling systems of correlated electrons so that $f = O(N + \log(1/\epsilon))$. Our focus on synthesizing unitaries for phase estimation, rather than time-evolution operators that could be used as variational primitives, will result in quantum circuits with millions of gates. Hence, we shall need quantum error correction. We focus on planar nearest-neighbor coupled arrays of qubits, which are being developed experimentally by multiple groups [28][29][30]. We shall use the surface code [31][32][33][34][35], as it has the highest gate threshold error rate for this geometry. Within this model of fault-tolerant quantum computation, the physical resources required for error correcting a quantum circuit are mostly determined by (i) the number of logical qubits and (ii) the number of T gates.
This is because applying a single T gate usually consumes many logical qubits and takes significantly longer than applying any other operation [36]; e.g., a T gate typically consumes 50-100 times the spacetime volume of a CNOT. Thus, throughout this work we focus on T complexity as the primary cost model. We note, however, that for all algorithms presented or discussed in this work, the T complexity is within logarithmic factors of the gate complexity.
We focus on the two most-studied models of correlated electrons: the Fermi-Hubbard model and the molecular electronic structure Hamiltonian. The Hubbard Hamiltonian is an approximate model of electrons interacting on a planar lattice which some believe may qualitatively capture the behavior of high-temperature superconductivity in cuprates [46]. The molecular electronic structure Hamiltonian is a realistic model of electrons interacting via the Coulomb potential with real kinetic energy in the presence of an external potential (which usually arises from atomic nuclei), in a finite-sized basis [47]. We shall focus on simulating the electronic structure Hamiltonian in a basis diagonalizing the Coulomb potential [43,44,48]. For both the Hubbard model and molecular electronic structure Hamiltonian we are able to provide circuits which exactly simulate e i arccos(H/λ) with T complexity O(N + log(1/ )) where N is the number of orbitals in a second-quantized representation of the system. In Table I and Table II we compare the T complexity of past quantum simulation methods for these problems.
In Theorem 1 and Theorem 2 we concisely state the T complexity and ancilla requirements of our approach to phase estimation for the electronic structure Hamiltonian and the Hubbard model Hamiltonian, respectively. Both of these theorems are established throughout the paper, but especially in Eq. (54), Eq. (55), Eq. (61), and Eq. (62). In addition to bounding the T complexity of our algorithms, we provide explicit circuits for their construction and compile all bottleneck primitives down to surface code fault-tolerant gates (topological braiding diagrams). Therefore, the fault-tolerant aspect of our analysis goes further than prior estimates in the simulation literature [49], the most rigorous of which stopped at estimates of T complexity for Trotter-based electronic structure phase estimation [45] and for a variety of techniques used to effect time-evolution of the one-dimensional Heisenberg model [50]. We show that one can perform fault-tolerant phase estimation on interesting instances of both Fermi-Hubbard and molecular electronic structure beyond the capabilities of known classical algorithms using roughly one million physical qubits in the surface code, assuming an architecture with two-qubit error rates of about one part in a thousand.

Theorem 1. Consider the electronic structure Hamiltonian in a basis of $N$ spin-orbitals which diagonalizes the Coulomb operator,
$$H = \sum_{p,q} T(p-q)\, a_p^\dagger a_q + \sum_p U(p)\, n_p + \sum_{p \neq q} V(p-q)\, n_p n_q,$$
where the $a_p^\dagger$ and $a_q$ are fermionic raising and lowering operators satisfying $\{a_p^\dagger, a_q\} = \delta_{pq}$ and $n_p = a_p^\dagger a_p$. Furthermore, define $\lambda = \sum_{p,q} |T(p-q)| + \sum_p |U(p)| + \sum_{p \neq q} |V(p-q)|$. Then, one can perform phase estimation to sample in the eigenbasis of $H$ with an additive error of at most $\epsilon$ in the eigenvalue using circuits with a number of T gates scaling as $24\sqrt{2}\,\pi N \lambda/\epsilon + O((\lambda/\epsilon)\log(N/\epsilon))$ and a number of ancilla qubits scaling as $\log(\lambda^3 N^5/\epsilon^3) + O(1)$.

Theorem 2.
Consider the square planar Hubbard model with periodic boundary conditions in a basis of $N$ spin-orbitals,
$$H = -t \sum_{\langle p,q \rangle, \sigma} a_{p,\sigma}^\dagger a_{q,\sigma} + \frac{u}{2} \sum_{p,\, \alpha \neq \beta} n_{p,\alpha}\, n_{p,\beta},$$
where the $a_{p,\alpha}^\dagger$ and $a_{q,\beta}$ are fermionic raising and lowering operators satisfying $\{a_{p,\alpha}^\dagger, a_{q,\beta}\} = \delta_{pq}\delta_{\alpha\beta}$, and the $\langle p,q \rangle$ notation implies a summation over nearest-neighboring orbitals on the periodic planar lattice. Furthermore, define $\lambda = 2Nt + Nu/2$. Then, one can perform phase estimation to sample in the eigenbasis of $H$ with an additive error of at most $\epsilon$ in the eigenvalue using circuits with a number of T gates scaling as $10\sqrt{2}\,\pi N \lambda/\epsilon + O(\lambda\log(N/\epsilon)/\epsilon)$ and a number of ancilla qubits scaling as $\log(\lambda N^3/\epsilon) + O(1)$.
In Section II we overview the simulation strategy we will use to encode and sample eigenspectra via phase estimation. Section II A discusses how one can synthesize $e^{i\arccos(H/\lambda)}$ within the linear combinations of unitaries query model, requiring two oracle circuits: select and prepare. Section II B introduces a particularly precise variant of phase estimation which queries the select and prepare oracles to estimate spectra with a precision exceeding that of typical approaches, as quantified by the Holevo variance. Section II C analyzes the various sources of error which we need to consider in this algorithm and then bounds the number of times we must query select and prepare in order to perform phase estimation.
Section III, Section IV and Section V focus on explicit constructions of select and prepare. Section III introduces important primitives for both select and prepare. In Section III A we describe circuits applying controlled unitaries such as the mapping $|\ell\rangle|\psi\rangle \rightarrow |\ell\rangle X_\ell |\psi\rangle$ with T complexity $O(L)$, where $L$ is the number of possible values of $\ell$. In Section III B we show how to selectively apply a Majorana fermion operator, a primitive necessary for our implementation of select in later sections. In Section III C we use the result of Section III A to show a particularly efficient variety of Quantum Read-Only Memory (QROM) which we use for our prepare circuit. In Section III D we describe a general technique for implementing prepare in a fashion that keeps $\lambda$ as small as possible. Section IV A and Section IV B discuss explicit constructions of select and prepare circuits for the electronic structure Hamiltonian. Section V A and Section V B discuss explicit constructions of select and prepare circuits for the Hubbard model Hamiltonian. Section IV C and Section V C focus on quantifying the number of T gates and ancillae required by the algorithms described in Section IV and Section V. These sections include investigations of the finite-size magnitudes of $\lambda$ and the target precisions required to implement our algorithms for interesting problems. In Section V D we discuss how our Hubbard model simulation techniques can be combined with recent results to achieve even lower scaling based on the locality of the Hubbard Hamiltonian.
Finally, Section VI discusses compilation of these routines to surface code fault-tolerant gates and estimates the physical resources required for error-correcting these algorithms under realistic assumptions about hardware. We conclude in Section VII with an outlook on future directions for quantum simulating correlated electron models.

Table I caption: The scalings attributed to work on general methods of Hamiltonian simulation assume that one uses the best oracles for electronic structure available at that point in time; e.g., the scaling attributed to [25] assumes the use of oracles from [42] and the scaling attributed to [26] assumes the use of oracles from [43]. While absent from this table since it did not asymptotically reduce T complexity, [45] was first to explicitly compile a quantum chemistry simulation to Clifford + T gates.

Table II caption: [25] and [26] assume that one uses the select oracles from [42], which also work for the Hubbard model. We assume that the work of [52] would use oracles from [27]. The scalings attributed to this work assume our techniques are combined with those from [52], even though we do not focus on that strategy in our fault-tolerant analysis since those methods seem less effective for finite problem sizes near the classically intractable regime. Nonetheless, we discuss how our methods can be combined to provide the stated complexity in Section V D.

II. PHASE ESTIMATING SPECTRA OF HERMITIAN LINEAR COMBINATIONS OF UNITARIES
The primary contribution of this paper is to demonstrate a particularly efficient method of using quantum computation to sample the spectra of correlated electron Hamiltonians. Though details of our implementation are specialized to electronic systems, our high-level simulation strategy represents a general framework for spectral estimation. While aspects of this approach were introduced recently in [26,27], the techniques involved emerged from a history of advances in Hamiltonian simulation prominently involving Szegedy quantum walks [53], the "linear combination of unitaries" (LCU) query model [54], and the method of Hamiltonian simulation known as "qubitization" [25].
Oracular methods of Hamiltonian simulation assume that information about a Hamiltonian is provided by querying "oracle" circuits [55]. These techniques aim to reduce the number of times one must query these oracles in order to effect the intended simulation to target accuracy. The techniques in this paper implement oracles from the LCU query model introduced in [54]. As the name suggests, this approach begins from the observation that any Hamiltonian can be decomposed as a linear combination of unitaries,
$$H = \sum_{\ell=0}^{L-1} w_\ell\, H_\ell, \qquad (5)$$
where the $w_\ell$ are scalars and the $H_\ell$ are self-inverse operators which act on qubits; e.g., the $H_\ell$ could be strings of Pauli operators. The convention in this paper is that the $w_\ell$ are real and non-negative, with any phases included in the $H_\ell$. LCU simulation techniques are formulated in terms of queries to two oracle circuits. The first oracle circuit, the "preparation oracle", acts on an empty ancilla register of $O(\log L)$ qubits and prepares a particular superposition state related to the notation of Eq. (5),
$$\textrm{prepare}\, |0\rangle^{\otimes \log L} = \sum_{\ell=0}^{L-1} \sqrt{\frac{w_\ell}{\lambda}}\, |\ell\rangle \equiv |\mathcal{L}\rangle, \qquad \lambda = \sum_{\ell=0}^{L-1} w_\ell. \qquad (6)$$
The quantity $\lambda$ is the same as that in Eq. (3), and turns out to have significant ramifications for the overall algorithm complexity. The second oracle circuit we require acts on the ancilla register $|\ell\rangle$ as well as the system register $|\psi\rangle$ and directly applies one of the $H_\ell$ to the system, controlled on the ancilla register. For this reason, we refer to the ancilla register $|\ell\rangle$ as the "selection register" and name the second oracle the "Hamiltonian selection oracle",
$$\textrm{select}\, |\ell\rangle |\psi\rangle = |\ell\rangle\, H_\ell\, |\psi\rangle. \qquad (7)$$
Note that the self-inverse nature of the $H_\ell$ operators implies that they are both Hermitian and unitary, which means they can be applied directly to a quantum state.
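To make the oracle definitions concrete, here is a minimal Python sketch (illustrative only; the two-term, one-qubit Hamiltonian $H = 0.6\,Z + 0.2\,X$ and its coefficients are a toy example of ours, not one of the constructions in this paper). It builds the prepare amplitudes $\sqrt{w_\ell/\lambda}$ and a block-diagonal select matrix, then checks that projecting select onto the prepared state recovers $H/\lambda$:

```python
import math

# Toy LCU decomposition (illustrative values, not from the paper):
# H = w0*Z + w1*X on one system qubit, so L = 2 and lambda = w0 + w1.
w = [0.6, 0.2]
lam = sum(w)
Z = [[1.0, 0.0], [0.0, -1.0]]
X = [[0.0, 1.0], [1.0, 0.0]]
terms = [Z, X]

# prepare|0> = sum_l sqrt(w_l/lam)|l>, i.e. the state |L> on the selection qubit.
L_state = [math.sqrt(wl / lam) for wl in w]

# select = sum_l |l><l| (x) H_l: a 4x4 block-diagonal matrix, with the
# selection qubit as the high-order qubit (basis ordering |l, psi>).
select = [[0.0] * 4 for _ in range(4)]
for l, Hl in enumerate(terms):
    for i in range(2):
        for j in range(2):
            select[2 * l + i][2 * l + j] = Hl[i][j]

# Projecting select onto |L> on the selection register should give H/lambda:
# (<L| (x) 1) select (|L> (x) 1) = H / lambda.
H_over_lam = [[sum(L_state[l] * L_state[lp] * select[2 * l + i][2 * lp + j]
                   for l in range(2) for lp in range(2))
               for j in range(2)] for i in range(2)]

H = [[w[0] * Z[i][j] + w[1] * X[i][j] for j in range(2)] for i in range(2)]
for i in range(2):
    for j in range(2):
        assert abs(H_over_lam[i][j] - H[i][j] / lam) < 1e-12
print("<L|select|L> = H/lambda verified")
```

Note that only the diagonal blocks of select contribute to the projection, which is why the result is simply $\sum_\ell (w_\ell/\lambda) H_\ell = H/\lambda$.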

A. Encoding Spectra in Szegedy Quantum Walks using Qubitization Oracles
The essential simulation primitive deployed here (a quantum walk operator based on select and prepare) was first introduced as a subroutine to the qubitization approach for Hamiltonian time-evolution [25]. However, the direct use of this primitive for phase estimation was first suggested more recently in [26]. In Section II B and Appendix A, we go beyond existing work and prove that so long as the eigenphase of the walk operator is bounded away from zero (so that the Hamiltonian is not frustration-free) then this algorithm can in principle learn as quickly as traditional phase estimation using the Cramér-Rao bound.
We begin our discussion with the observation that the state $|\mathcal{L}\rangle$ from Eq. (6) encodes $H$ as a projection of select onto $|\mathcal{L}\rangle$,
$$\left(\langle\mathcal{L}| \otimes \mathbb{1}\right) \textrm{select} \left(|\mathcal{L}\rangle \otimes \mathbb{1}\right) = \frac{H}{\lambda}. \qquad (8)$$
This encoding is a general condition for qubitization [25], but the LCU oracles select and prepare, as defined in Eq. (7) and Eq. (6), are not necessarily the only constructions that meet this criterion; we refer to the broader family of circuits satisfying Eq. (8) as "qubitization oracles". With this in mind, we discuss a walk operator W that encodes the spectrum of $H$ as a function of the eigenphases of W, although the spectrum of W differs from that of the propagator $e^{-iHt}$. One such walk operator W may be constructed as
$$W = R_\mathcal{L} \cdot \textrm{select}, \qquad R_\mathcal{L} = \left(2\,|\mathcal{L}\rangle\langle\mathcal{L}| - \mathbb{1}\right) \otimes \mathbb{1}. \qquad (9)$$
This construction takes the form of a Szegedy walk [53] since it is composed from a product of two reflection operations. The operation $R_\mathcal{L}$ is manifestly a reflection operation, and it can be seen that select is a reflection operation because $\textrm{select}^2 = \mathbb{1}$, which follows from the self-inverse nature of the $H_\ell$. Figure 1 shows a circuit which implements W controlled on an ancilla.
FIG. 1. A circuit realizing the Szegedy quantum walk operator W controlled on an ancilla qubit. The last three gates in the circuit on the right constitute the reflection $R_\mathcal{L}$ controlled on an ancilla. Note that the Z gate with the 0-control is actually controlled on the zero state of the entire $|\ell\rangle$ register and not just a single qubit. Accordingly, implementation of that controlled-Z has T complexity $O(\log L)$, where $\log L$ is the size of the $|\ell\rangle$ register. However, that overhead is always negligible compared to the cost of the prepare and select operators in the constructions of this paper. To implement W or $R_\mathcal{L}$ without the control on the ancilla, one should replace the controlled-Z gate with a $-\mathbb{1}$ gate controlled on the zero state of the $|\ell\rangle$ register.
The action of W partitions the Hilbert space into a direct sum of two-dimensional irreducible vector spaces. Through reasoning about these eigenspaces we can deduce the spectrum of W as well as the eigenvectors. In particular, we claim that for an arbitrary eigenstate $|k\rangle$ of $H$ with eigenvalue $E_k$, the state $|\mathcal{L}\rangle|k\rangle$ and an orthogonal state $|\phi_k\rangle$ span the irreducible two-dimensional space that $|\mathcal{L}\rangle|k\rangle$ is in under the action of W. This state $|\phi_k\rangle$ is formally defined to be the component of $W|\mathcal{L}\rangle|k\rangle$ that is orthogonal to $|\mathcal{L}\rangle|k\rangle$,
$$|\phi_k\rangle \propto \left(\mathbb{1} - |\mathcal{L}\rangle\langle\mathcal{L}| \otimes \mathbb{1}\right) W\, |\mathcal{L}\rangle|k\rangle, \qquad (10)$$
which can be simplified using Eq. (8) to
$$|\phi_k\rangle = \frac{W\,|\mathcal{L}\rangle|k\rangle - (E_k/\lambda)\,|\mathcal{L}\rangle|k\rangle}{\sqrt{1 - E_k^2/\lambda^2}}. \qquad (11)$$
The matrix elements of W can be computed for this state. The upper diagonal matrix element follows from Eq. (8),
$$\langle k|\langle\mathcal{L}|\, W\, |\mathcal{L}\rangle|k\rangle = \frac{E_k}{\lambda}. \qquad (12)$$
The upper transition matrix element between $\langle k|\langle\mathcal{L}|$ and $|\phi_k\rangle$ is given from Eq. (10) and Eq. (11) as
$$\langle k|\langle\mathcal{L}|\, W\, |\phi_k\rangle = -\sqrt{1 - \frac{E_k^2}{\lambda^2}}. \qquad (13)$$
Note that because phase estimation on W projects the system to an eigenstate of W, and because W and H share an eigenbasis, we are only concerned with the action of this operator for eigenstates. Eq. (12) and Eq. (13) give the first row of the action of W. The remaining entries can be calculated in a similar way, giving the action of W on this irreducible two-dimensional subspace as
$$W \simeq \begin{pmatrix} E_k/\lambda & -\sqrt{1 - E_k^2/\lambda^2} \\ \sqrt{1 - E_k^2/\lambda^2} & E_k/\lambda \end{pmatrix} = e^{-i \arccos(E_k/\lambda)\, Y}, \qquad (14)$$
where Y is the Pauli-Y operator constrained to this two-dimensional space spanned by $|\mathcal{L}\rangle|k\rangle$ and $|\phi_k\rangle$. Finally, we can see that the phases of the eigenvalues of W in this subspace are $\pm\arccos(E_k/\lambda)$. Whereas the work of [25] focused on transforming the evolution under $\arccos(H)$ into evolution under $H$, the more recent work of [26,27] made the simple observation that by performing phase estimation directly on W one can obtain the spectrum of H as
$$\textrm{spectrum}(H) = \lambda \cos\left(\arg\left(\textrm{spectrum}(W)\right)\right), \qquad (15)$$
where arg is the argument function, $\arg(e^{i\phi}) = \phi$.
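The two-dimensional block structure can be checked numerically. The Python sketch below (a toy example with assumed coefficients: $H = 0.6\,Z + 0.2\,X$, so $\lambda = 0.8$) builds $W = R_\mathcal{L}\cdot\textrm{select}$, applies it to $|\mathcal{L}\rangle|k\rangle$ for an eigenstate $|k\rangle$, and verifies both the diagonal matrix element $E_k/\lambda$ and the rotation identity $W^2 v = 2(E_k/\lambda)\,W v - v$ implied by eigenphases $\pm\arccos(E_k/\lambda)$:

```python
import math

# Toy model (assumed for illustration): H = 0.6 Z + 0.2 X, lambda = 0.8.
a, b, lam = 0.6, 0.2, 0.8
# select = |0><0| (x) Z + |1><1| (x) X  (4x4, selection qubit high-order).
S = [[1.0, 0.0, 0.0, 0.0],
     [0.0, -1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0, 0.0]]
# |L> = sqrt(0.6/0.8)|0> + sqrt(0.2/0.8)|1> on the selection qubit.
Ls = [math.sqrt(a / lam), math.sqrt(b / lam)]
# R_L = 2|L><L| (x) 1 - 1.
R = [[2.0 * Ls[i // 2] * Ls[j // 2] * (1 if i % 2 == j % 2 else 0)
      - (1 if i == j else 0) for j in range(4)] for i in range(4)]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(4)) for i in range(4)]

def W(v):  # W = R_L . select
    return matvec(R, matvec(S, v))

# Eigenstate |k> of H = [[a, b], [b, -a]] with eigenvalue E = +sqrt(a^2 + b^2).
E = math.sqrt(a * a + b * b)
nrm = math.hypot(b, E - a)
k = [b / nrm, (E - a) / nrm]
v = [Ls[l] * k[i] for l in range(2) for i in range(2)]  # the state |L>|k>

Wv, WWv = W(v), W(W(v))
# Diagonal matrix element <L,k| W |L,k> = E_k / lambda:
overlap = sum(v[i] * Wv[i] for i in range(4))
assert abs(overlap - E / lam) < 1e-12
# A rotation by arccos(E_k/lambda) in the invariant plane satisfies
# W^2 v = 2 (E_k/lambda) W v - v (Cayley-Hamilton for the 2x2 block):
for i in range(4):
    assert abs(WWv[i] - (2.0 * (E / lam) * Wv[i] - v[i])) < 1e-12
print("walk eigenphases are +/- arccos(E_k/lambda)")
```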

B. Heisenberg-Limited Phase Estimation of the Qubitized Quantum Walk
Since the original work of [6], many approaches have been proposed for estimating eigenphases of a unitary operator. Whereas in the past iterative phase estimation approaches have been more popular in quantum simulation, here we propose using an entanglement-based approach. This has the virtue of requiring a number of applications of the unitary that saturates the Heisenberg limit. The ultimate precision that can be reached when one applies phase estimation by controlling a unitary when an ancilla is in $|1\rangle$ and applying the identity gate when the ancilla is in $|0\rangle$ is a Holevo variance of $\tan^2(\pi/(2^{m+1}+1))$, where the total number of applications of the unitary is $2^{m+1} - 1$ and $m$ is the number of control qubits used. The Holevo variance is $\langle\cos(\phi - \hat\phi)\rangle^{-2} - 1$, where $\phi$ is the phase and $\hat\phi$ is the estimate of the phase given by the measurement. It is a convenient measure of variance for phase because it enables simple analytic results and is close to the mean-square error for narrowly peaked distributions. The states for these optimal phase measurements were given in [56]. To apply them to phase estimation of a unitary, one can take the control qubits to be in this superposition state, rather than in a uniform superposition of computational basis states.
We will perform a slight optimization of that approach by applying the inverse unitary, instead of the identity, for the ancilla in the $|0\rangle$ state. Taking $|\phi\rangle$ to be an eigenstate of the unitary with eigenvalue $e^{i\phi}$, this means that instead of applying $|0\rangle|\phi\rangle \rightarrow |0\rangle|\phi\rangle$ and $|1\rangle|\phi\rangle \rightarrow e^{i\phi}|1\rangle|\phi\rangle$, we apply $|0\rangle|\phi\rangle \rightarrow e^{-i\phi}|0\rangle|\phi\rangle$ and $|1\rangle|\phi\rangle \rightarrow e^{i\phi}|1\rangle|\phi\rangle$. This doubles the effective phase difference and turns out to have the same complexity. As shown in Figure 2, we accomplish the controlled inverse by removing controls from $W^n$ and inserting controlled reflection operators $R_\mathcal{L}$ into the circuit, which will cause us to apply either $(W^\dagger)^n$ or $W^n$ depending on the state of the ancilla. We can see why this works by examining the relation
$$R_\mathcal{L}\, W^n R_\mathcal{L} = \left(W^\dagger\right)^n, \qquad (16)$$
which holds for any integer $n$ as a consequence of the self-inverse nature of $R_\mathcal{L}$ and select. Moreover, because $W^n$ always ends with an $R_\mathcal{L}$ operation, each controlled $R_\mathcal{L}$ can be combined in the circuit with the $W^n$, to yield a complexity no greater than the complexity of just performing the $W^n$ operations. This trick will result in measuring the phase modulo $\pi$. To eliminate the $\pi$ ambiguity, an additional controlled W can be performed without this trick. That is shown as the first controlled operation in Figure 2. For $m$ control qubits the Holevo variance is still $\tan^2(\pi/(2^{m+1}+1))$, but the complexity is reduced by approximately a half, to $2^m$ applications of the unitary W.
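The identity behind this trick only uses the fact that W is a product of two reflections, so it can be sanity-checked on generic matrices. The Python sketch below (real $3\times 3$ reflections built from arbitrary assumed vectors, standing in for $R_\mathcal{L}$ and select) verifies $R\, W^n R = (W^\dagger)^n$:

```python
# Generic check: if R and S are both reflections (R^2 = S^2 = 1) and
# W = R*S, then R W^n R = (W^T)^n = (W^dagger)^n for real orthogonal W.
def reflect(u):
    # Reflection 2 u u^T / |u|^2 - 1, which squares to the identity.
    n2 = sum(x * x for x in u)
    return [[2.0 * u[i] * u[j] / n2 - (1 if i == j else 0) for j in range(3)]
            for i in range(3)]

def mm(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def T(A):
    return [[A[j][i] for j in range(3)] for i in range(3)]

R = reflect([1.0, 2.0, -1.0])   # plays the role of R_L (vector is arbitrary)
S = reflect([0.5, -1.0, 3.0])   # plays the role of select (also arbitrary)
W = mm(R, S)

n = 5
Wn, Wd_n = W, T(W)
for _ in range(n - 1):
    Wn = mm(Wn, W)
    Wd_n = mm(Wd_n, T(W))

lhs = mm(mm(R, Wn), R)          # R W^n R
for i in range(3):
    for j in range(3):
        assert abs(lhs[i][j] - Wd_n[i][j]) < 1e-9
print("R W^n R = (W^dagger)^n verified for n =", n)
```

The one-line proof mirrors the text: $R W R = R (R S) R = S R = W^\dagger$, and conjugating $n$ times gives the power identity.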
FIG. 2. Heisenberg-limited phase estimation circuit for learning the eigenphase of W with $m$ bits of accuracy, with Holevo variance $\pi^2/2^{2(m+1)}$, where $R_\mathcal{L}$ is $2\,|\mathcal{L}\rangle\langle\mathcal{L}| \otimes \mathbb{1} - \mathbb{1}$ and $\chi_m$ prepares the resource state from Eq. (17), which was shown to be optimal in [56]. Both $\chi_m$ and the inverse quantum Fourier transform (QFT$^\dagger$) have gate complexity $O(m)$, which is completely negligible compared to the overall gate complexity of phase estimation, which scales as $O(2^m)$. The controlled $R_\mathcal{L}$ and $W = R_\mathcal{L} \cdot \textrm{select}$ gates are implemented as shown in Figure 1. As a consequence of Eq. (16), this circuit involves only $2^m - 1$ applications of $R_\mathcal{L}$ and as many applications of select. Note that the first unit of $R_\mathcal{L} \cdot W \cdot R_\mathcal{L}$ is replaced by W controlled on the zero state of an ancilla in order to help disambiguate the outcomes $\arccos(E_k/\lambda)$ and $\arccos(E_k/\lambda) + \pi$.
As seen in Figure 2, our modified phase estimation algorithm begins with a unitary $\chi_m$ which prepares the state
$$|\chi_m\rangle = \sqrt{\frac{2}{2^m+1}} \sum_{k=0}^{2^m-1} \sin\!\left(\frac{\pi(k+1)}{2^m+1}\right) |k\rangle. \qquad (17)$$
To prepare this state with cost $O(m)$, we first perform Hadamards on $m+1$ qubits (initially in the $|0\rangle$ state) to give
$$\frac{1}{\sqrt{2^{m+1}}} \sum_{k=0}^{2^m-1} |k\rangle \left(|0\rangle + |1\rangle\right). \qquad (18)$$
Next, we perform a series of $m$ controlled rotations with each of the first $m$ qubits as control and qubit $m+1$ as target. For control qubit $k$, the rotation on the target qubit $m+1$ is $e^{i\pi 2^k Z/(2^m+1)}$. Performing a further rotation of $e^{i\pi Z/(2^m+1)}$ on qubit $m+1$, the resulting state is
$$\frac{1}{\sqrt{2^{m+1}}} \sum_{k=0}^{2^m-1} |k\rangle \left(e^{i\pi(k+1)/(2^m+1)}\,|0\rangle + e^{-i\pi(k+1)/(2^m+1)}\,|1\rangle\right). \qquad (19)$$
We then perform a Hadamard on qubit $m+1$ and measure it in the computational basis. Measuring $|1\rangle$ gives the state $|\chi_m\rangle$ up to an irrelevant global phase. The probability of success is given by the normalization: $(1+2^{-m})/2$. The scheme can be made unitary and deterministic via a single step of amplitude amplification. Clearly, this preparation scheme scaling as $O(m)$ will not dominate the cost of our overall phase estimation, which scales as $O(2^m)$, as we will discuss in the next section.
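The preparation steps above can be traced through directly. The following Python sketch (with $m=3$ as an assumed illustrative size) accumulates the controlled-rotation phases per basis state, applies the final Hadamard in closed form, and confirms both the success probability $(1+2^{-m})/2$ and the amplitudes of $|\chi_m\rangle$:

```python
import math

# Trace the chi_m preparation steps for m = 3 (an illustrative size).
m = 3
M = 2 ** m

def accumulated_phase(k):
    # Controlled rotations e^{i pi 2^j Z/(2^m+1)} fire for each set bit j of
    # the first m qubits, followed by the uncontrolled e^{i pi Z/(2^m+1)}.
    ph = math.pi / (M + 1)
    for j in range(m):
        if (k >> j) & 1:
            ph += math.pi * 2 ** j / (M + 1)
    return ph  # equals pi (k+1) / (2^m + 1)

# After the final Hadamard on qubit m+1, the |k>|1> amplitude magnitude is
# |e^{i ph} - e^{-i ph}| / (2 sqrt(2^m)) = sin(ph) / sqrt(2^m).
amp1 = [math.sin(accumulated_phase(k)) / math.sqrt(M) for k in range(M)]

p_success = sum(a * a for a in amp1)
assert abs(p_success - (1 + 2 ** -m) / 2) < 1e-12  # = 0.5625 for m = 3

# Post-selecting on |1> yields the optimal phase-measurement state:
# chi_m = sqrt(2/(2^m+1)) * sum_k sin(pi (k+1)/(2^m+1)) |k>
chi = [a / math.sqrt(p_success) for a in amp1]
for k in range(M):
    target = math.sqrt(2 / (M + 1)) * math.sin(math.pi * (k + 1) / (M + 1))
    assert abs(chi[k] - target) < 1e-12
print("success probability:", p_success)
```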

C. Error Scaling and Query Complexity
Three sources of error will enter our simulation: error due to performing phase estimation to finite precision, $\epsilon_{\rm PEA}$; error due to approximate preparation of the Hamiltonian terms within the implementation of the prepare oracle, $\epsilon_{\rm prep}$; and the error in synthesizing the inverse QFT, $\epsilon_{\rm QFT}$. We will choose to measure error through the root-mean-square error of the estimator used within phase estimation, i.e.
$$\Delta E = \lambda\, \sqrt{\mathbb{E}\left[\left(\phi_{\rm est} - \phi_{\rm true}\right)^2\right]}, \qquad (21)$$
where the distance considered above is the angular distance between the estimated phase and the actual phase. Provided phase estimation is performed on a unitary operation, the error in the estimate of the energy is at most the error in implementing the unitary [45]. We will break up the estimated phase as the sum of two contributions, $\phi_{\rm est} = \phi + \epsilon_{\rm prep} + \phi_{\rm true}$. Here $\phi$ is a random variable with zero mean, $\mathbb{E}(\phi) = 0$, and Holevo variance $V_H(\phi)$, describing the output of phase estimation, and $\epsilon_{\rm prep}$ represents the systematic errors in the phase that arise because of gate synthesis. In the limit of small variance, we can express this with high probability over the true phase as
$$\Delta E \approx \lambda \sqrt{\tan^2\!\left(\frac{\pi}{2^{m+1}+1}\right) + \epsilon_{\rm prep}^2}, \qquad (22)$$
where $m$ ancillary qubits are used within the phase estimation algorithm. Note that such a division of the error is suboptimal, since the cost involved in reducing the error for phase estimation is exponentially larger than that involved in increasing the accuracy of the circuit synthesis [45]; however, we take the two errors to be equal for simplicity. Equation (15) implies that the error in the energy is at most $\lambda$ times the error in the phase,
$$\Delta E \leq \lambda\, \Delta\phi. \qquad (23)$$
That suggests the number of bits of phase we are to estimate, $m$, can be chosen as
$$m = \left\lceil \log_2\!\left(\frac{\sqrt{2}\,\pi\lambda}{\Delta E}\right) \right\rceil, \qquad (24)$$
and the target errors can be chosen as
$$\lambda\, \epsilon_{\rm prep} = \lambda \sqrt{V_H(\phi)} = \frac{\Delta E}{\sqrt{2}}. \qquad (25)$$
Thus, using the phase estimation procedure from Section II B we will need at most
$$\left\lceil \frac{\sqrt{2}\,\pi\lambda}{\Delta E} \right\rceil \qquad (26)$$
queries to the select oracle, and at most twice as many queries to the prepare oracle, in order to estimate spectra to within error $\Delta E$. Supposing that the circuit prepare can be applied at gate complexity $P$ and the circuit select can be applied at gate complexity $S$, the gate complexity of our simulation (ignoring for now the cost of $\chi_m$ and the cost of the QFT$^\dagger$, since they scale as $O(m)$) is then approximately bounded from above by
$$\frac{\sqrt{2}\,\pi\lambda\,(S + 2P)}{\Delta E}. \qquad (27)$$
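For a rough sense of scale, the bound $\sqrt{2}\,\pi\lambda(S+2P)/\Delta E$ can be evaluated numerically. All numbers in the sketch below are hypothetical placeholders ($\lambda$, the per-call T costs $S$ and $P$, and the target precision are assumed for illustration, not taken from the paper's resource estimates):

```python
import math

# Illustrative numbers only (not from the paper's resource estimates):
lam = 4000.0   # sum of Hamiltonian coefficients (Hartree), assumed
S = 700        # T gates per select call, assumed
P = 350        # T gates per prepare call, assumed
dE = 0.0016    # target precision (~chemical accuracy, Hartree)

queries = math.sqrt(2) * math.pi * lam / dE   # select queries, Eq.-style bound
t_count = queries * (S + 2 * P)               # total T gates, sqrt(2) pi lam (S+2P)/dE
print(f"~{queries:.3e} select queries, ~{t_count:.3e} T gates")
```

The point of the exercise is that the query count depends only on $\lambda/\Delta E$, while the circuit constructions in this paper control the per-query factor $S + 2P$.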
This paper will discuss implementations of select and prepare that minimize S and P without increasing λ.
To implement the inverse QFT which appears in Figure 2, we will use the semiclassical algorithm described in [57]. This version of the QFT requires just $m-1$ rotation gates and $m$ Hadamards when implemented on $m$ qubits. Thus, the error in each rotation must be at most $\epsilon_{\rm QFT}/(\pi m)$, which implies that the inverse QFT will have T complexity scaling as $O(m \log(m/\epsilon_{\rm QFT}))$. As this is an additive cost on top of other parts of our phase estimation algorithm with T complexity scaling as $O(2^m)$, the cost of performing the QFT within the required error budget can be safely neglected.
How errors in the coefficients of the implemented Hamiltonian propagate into $\epsilon_{\rm prep}$ is slightly harder to bound, owing to the fact that the error in the eigenphase is a nonlinear function of the error in the Hamiltonian implementation. In particular, the error can diverge for frustration-free Hamiltonians owing to the singularity of arccos. The main result, shown in Appendix A, is that prepare should be implemented so that, if $\tilde w_\ell$ is the effective coefficient of $H_\ell$ in the approximately implemented Hamiltonian, then (up to constant factors set by how far the eigenphase is bounded away from zero)
$$\sum_{\ell=0}^{L-1} \left|\tilde w_\ell - w_\ell\right| \in O\!\left(\lambda\, \epsilon_{\rm prep}\right). \qquad (28)$$

III. LOW T COMPLEXITY PRIMITIVES FOR LCU ORACLES
In this section we will introduce three circuit primitives which are helpful for implementing select and prepare oracles with low T gate complexity. We use these primitives for electronic structure simulation, but expect them to be useful more generally. These primitives enable black box implementations of select and prepare for any problem with lower asymptotic complexity than prior constructions in the literature. They also have low T counts at finite size. We will use these constructions extensively in Section IV and Section V of this paper.
In Section III A, we will introduce a technique for "streaming" bits of an iterator running over a unary register. One application is that this technique can be used to coherently apply operations controlled on a register of $\log L$ qubits in superposition (e.g. the selection register in Eq. (6) and Eq. (7)) using a number of T gates scaling as $O(L)$, as opposed to the $O(L \log L)$ one might normally expect. What is even more important, however, is the versatile way in which these constructions can be applied.
In Section III B we show how one can use the results of Section III A to implement a primitive corresponding to controlled application of a Majorana fermion operator. This primitive is used directly in our implementation of select in Section IV A and Section V A.
In Section III C, we show a straightforward application of the techniques in Section III A which allows us to develop a particularly efficient quantum data lookup, which we refer to as "Quantum Read-Only Memory" (QROM). In particular, for coherently querying a database with $L$ words, our implementation of QROM has a T complexity of $4L - 4$ with no dependence on the word length, which is an asymptotic and constant-factor improvement over all prior literature. We have another paper in preparation which discusses QROM in more detail [58].
In Section III D, we discuss a technique for initializing a state with $L$ unique coefficients (provided by a classical database) with a number of T gates scaling as $4L + O(\log(1/\epsilon))$, where $\epsilon$ is the largest absolute error that one can tolerate in the prepared amplitudes. This routine improves asymptotically over the gate complexity of prior constructions for a black box prepare. It also has the advantage of implementing prepare without increasing the value of $\lambda$ (from Eq. (6)), which has been a frequent problem with other implementations of prepare [42,43,59] that attempted to obtain scaling sublinear in the number of terms in the linear combination of unitaries decomposition.

A. Unary Iteration and Indexed Operations
Many of the circuits in this paper rely heavily on a technique we refer to as "unary iteration". The unary iteration process gradually produces, and then uncomputes, qubits indicating whether an index register is storing specific values (with respect to the computational basis). We call the process "unary iteration" because the indicator qubits are made available one by one (iteration), and they correspond to the one-hot (unary) encoding of the index register value. For an index register storing an index in the interval [0, L), the space overhead of converting the index register into a unary register (as in [27]) would normally be L qubits. By comparison, our unary iteration technique is exponentially more efficient in space without any increased T complexity, requiring only log L ancillae. Our unary iteration has a T-count of 4L − 4 and can be parallelized if needed without increasing T-count. Despite its efficiency, unary iteration is still the dominant source of T complexity in our algorithms. We use it for indexed operations, Majorana operators, reversible preparation of states, and database lookup; all of which have T-counts that scale like O(L).
To explain unary iteration, we will first focus on how it is used to implement a controlled indexed not operation:

|c⟩ |ℓ⟩ |ψ⟩ → |c⟩ |ℓ⟩ X_ℓ^c |ψ⟩,   (29)

where |c⟩ is the control register, |ℓ⟩ is the selection register, |ψ⟩ is the system register, and the subscript on X indicates that the not operation acts on qubit ℓ of the system register. From Eq. (7) it should not be surprising that this primitive is helpful for our constructions of select.
FIG. 3. Example total control circuit for performing a controlled indexed X operation, with 0 ≤ ℓ < L = 11. This is the (naive) starting point for producing a unary iteration circuit, before optimizations which asymptotically improve the T complexity. When indices outside the specified range do not occur, the highlighted runs of off-type controls reaching the right hand side of the circuit can be removed without affecting the circuit. (There are also other controls which could be removed, but for our purposes this would be counterproductive due to interfering with later optimizations.)

A simple (but suboptimal) way to implement Eq. (29) would be to totally control the application of X_ℓ on all possible values that could occur in the register |ℓ⟩, as shown in Figure 3. For instance, in order to apply X_158 when |ℓ⟩ = |158⟩, the total-control approach would place a not gate targeting qubit 158 of the system register |ψ⟩, but with a control on each index bit. Each control's type (on or off) would be determined by the binary representation of 158 (158₁₀ = 10011110₂), so there would be a must-be-off control on the low bit of the index register (because the low bit of 158 in binary is 0), a must-be-on control on the next bit (because the next bit of 158 in binary is 1), and so forth. In order to cover every case, a separate not gate with corresponding controls would be generated for every integer from 0 up to L − 1. This would produce L different not operations, each targeting a different qubit in the target register and each having a number of controls equal to the size of the index register (i.e. log L). Thus, it takes O(L log L) T gates to apply Eq. (29) using this approach. Unary iteration will improve this T-count to 4L − 4.
FIG. 4. Computing and uncomputing and operations, defined in terms of Toffoli gates and in terms of Clifford+T gates [60]. Computing an and consumes 4 |T⟩ states, and is equivalent to applying a Toffoli gate to a target qubit known to be |0⟩. Uncomputing an and consumes no |T⟩ states, and is equivalent to applying a Toffoli gate to a target qubit guaranteed to end up in the |0⟩ state. Drawing and operations as "corners" instead of as ⊕ symbols is a visual cue that the target qubit will be (or was) off after (before) the operation. This is worth highlighting because it affects the T-count of synthesizing the operation and whether the target is available for reuse as an ancilla in later operations.
Consider that the controls for the operation targeting the qubit at offset ℓ = 158 are almost identical to the controls for the operation targeting the qubit at offset ℓ = 159. They differ only on the low bit of the index register, where 158 requires the bit to be off whereas 159 requires the bit to be on. If we combined the log L − 1 other qubits of the index register into a single representative qubit that was set if and only if those controls were met, we could use this representative qubit once for the ℓ = 158 case and again for the ℓ = 159 case. Using the representative qubit twice, instead of computing it twice, decreases the total amount of work done. Unary iteration is the result of taking this kind of representative-reuse idea to its natural limit.
FIG. 5. The "sawtooth" circuit resulting from removing unnecessary bits from Figure 3 and then adding and operations from Figure 4 to combine the controls for performing a controlled indexed X operation with L = 11 possible targets.
We define our unary iteration construction by starting with a total-control circuit, and then applying a fixed set of simple transformations and optimizations. Figure 3 shows an example starting point, a total-control circuit for L = 11. For unary iteration, we require that the index register never store an out-of-range value ℓ ≥ L. For example, consider what occurs when the X_10 operation from Figure 3 is not conditioned on ℓ_0 (the least significant bit of the index register). This would cause an X_10 to be applied to the target when ℓ = 11, but this is fine since we know ℓ ≠ 11. We use the ℓ < L condition to omit several controls from the circuit. For each possible ℓ, we will look at the X_ℓ operation and remove the control on the b'th index bit when the following two conditions are true: (i) the b'th bit of L − 1 isn't set and (ii) setting the b'th bit of ℓ would change ℓ into a value larger than L − 1. Visually, this removes the "runs" of must-be-off controls that manage to reach the right side of the circuit, as highlighted in Figure 3. After removing the specified controls, we expand the remaining controls into nested and operations (the and operation is defined in Figure 4), always nesting so that lower controls are inside higher controls. For clarity, we consistently place the ancillae associated with an and operation just below its lowest input qubit. The result is the "sawtooth" circuit shown in Figure 5. By iteratively optimizing adjacent and operations as shown in Figure 6, the sawtooth circuit from Figure 5 is optimized into the circuit shown in Figure 7. This is our unary iteration circuit for L = 11. The optimized circuit always ends up with L − 1 and computations (even when L is not a power of 2), each and takes 4 T gates to compute, and we have no other T-consuming operations in the circuit. Thus, the T-count of this construction is 4L − 4.

FIG. 6. When two and operations are adjacent, the uncomputation-and-recomputation can be replaced by cnot and not operations. Each such merger saves 4 T gates.

FIG. 7. An L = 11 unary iteration circuit which applies X_ℓ to qubit ℓ of the system register |ψ⟩, where ℓ is the value stored in the selection register. The circuit is obtained by merging and operations from Figure 5 using the method shown in Figure 6. It computes 10 and operations, and so has a T-count of 10 × 4 = 40 = 4L − 4.

B. Selective Application of Majorana Fermion Operators
Now that we have described unary iteration, we can begin using it to construct primitives relevant for the select oracle. As discussed in detail in Section IV and Section V below, our approach for implementing select will require that we have a circuit capable of selectively applying the Majorana fermion operator

|ℓ⟩ |ψ⟩ → |ℓ⟩ i(a†_ℓ − a_ℓ) |ψ⟩ = |ℓ⟩ Y_ℓ · Z_{ℓ−1} ⋯ Z_0 |ψ⟩,   (30)

where the last equality holds under the Jordan-Wigner transformation [61]. In this section we describe explicit circuits which accomplish the mapping of Eq. (30).
In Section III A we discussed selectively applying X operations as a representative example of how one might use unary iteration. However, nothing intrinsic to the unary iteration construction requires that the indexed operation be so simple. For example, we could switch from applying X to applying Z halfway through the circuit. Or each X could be replaced by multiple Pauli operations targeting multiple qubits. In general, each index could be associated with its own unique set of Pauli operators to be applied to various target qubits.
We can also apply transformations to our quantum unary iterators (analogous to transformations of classical iterators). Iterators can be mapped, filtered, zipped, aggregated, batched, flattened, grouped, etc. For instance, given a classical stream of bits, one can aggregate over it with the ⊕ operation. This produces a new iterator, which iterates over bits equal to the parity of the values-so-far from the original iterator. It is possible to apply this xor-aggregation idea to the quantum unary iteration process. Introduce an "accumulator" qubit and, as each iterated unary qubit is produced, cnot it into the accumulator. In effect, if the index register is storing ℓ then the accumulator will stay off until the ℓ'th unary qubit toggles it on. The accumulator will then stay on until the end of the iteration process, where it is uncomputed by a cnot from the control qubit. By conditioning the applied operations on the accumulator qubit, instead of on the original unary qubits, efficient ranged operations such as |ℓ⟩ |ψ⟩ → |ℓ⟩ G_ℓ · G_{ℓ+1} ⋯ G_{L−1} |ψ⟩ are produced. We show an example of an accumulator-based ranged operation in Figure 8.

FIG. 8. Applies the G operation to a range of values, instead of to a single value, by using an accumulator. The accumulator is guaranteed to be cleared after the final cnot targeting it (drawn as a line merging into an ancilla qubit). This occurs because exactly one of the unary bits must have been set (unless the control is not set, in which case the accumulator simply stays unset), and we targeted the accumulator with cnots controlled by each of those bits in turn. G_{[p,q)} refers to G being applied to every qubit index k satisfying p ≤ k < q.
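The effect of the accumulator can be mirrored classically (a sketch of ours, not the paper's circuit): sweeping over the unary indicators, the accumulator is the running parity, which turns on at step ℓ and stays on, so a gate conditioned on it hits exactly the targets ℓ through L − 1.

```python
def ranged_targets(ell, L):
    """Targets receiving G when the accumulator trick is used: the
    accumulator turns on at step ell and stays on, so G hits every
    target from ell through L - 1, i.e. the transformation
    |ell>|psi> -> |ell> G_ell G_{ell+1} ... G_{L-1} |psi>."""
    acc = False
    hit = []
    for k in range(L):
        if k == ell:       # the k'th unary indicator fires exactly once
            acc = not acc  # a cnot from the indicator toggles the accumulator
        if acc:
            hit.append(k)  # G is applied conditioned on the accumulator
    return hit

assert ranged_targets(3, 11) == [3, 4, 5, 6, 7, 8, 9, 10]
```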
By using both the accumulator qubit and the original unary qubits, we can apply a ranged indexed operation and an indexed operation in a single unary iteration which gradually sweeps over the possible target qubits. The resulting combined operation, shown in Figure 9, is a crucial part of our select circuit, effecting the transformation of Eq. (30).
FIG. 9. Application of a selected Majorana fermion operator, |ℓ⟩ |ψ⟩ → |ℓ⟩ Y_ℓ · Z_{ℓ−1} ⋯ Z_0 |ψ⟩, as described in Eq. (30). This is accomplished by performing a ranged operation (as shown in Figure 8) and an indexed operation (similar to what is shown in Figure 7) with a single pass through the selection register |ℓ⟩. It has a T-count of 4L − 4, where L is the number of integer values that can be held by the selection register |ℓ⟩.

C. Quantum Read-Only Memory (QROM) for Low T Complexity Data Lookup
In this section we explain how one can use the techniques of Section III A in order to implement a particularly efficient form of what we call Quantum Read-Only Memory (QROM), which is useful in the context of the subprepare routine, a subroutine of the prepare circuit described in Section IV B (in Figure 16). Many quantum algorithms assume the existence of a hypothetical peripheral called "Quantum Random-Access Memory" (QRAM) [62] which allows classical or quantum data to be accessed via an index under superposition. The purpose of QROM is to read classical data indexed by a quantum register, i.e. to perform the following transformation:

∑_ℓ α_ℓ |ℓ⟩ |0⟩ → ∑_ℓ α_ℓ |ℓ⟩ |d_ℓ⟩,

where ℓ is the index to read, α_ℓ is the amplitude of |ℓ⟩, and d_ℓ is the word associated with index ℓ in a classical list d containing L words. Our implementation of QROM is shown in Figure 10. Note that our notion of QROM is unrelated to the discussion of ROM on a quantum computer in Ref. [63].
The read-only aspect of QROM makes it distinctly different from QRAM in that one can read from QROM but cannot write to it during the course of a computation. A few algorithms, such as the procedure introduced in [64], actually do require that one write to QRAM; thus, for such use cases QROM would not be appropriate. A notable difference between this paper and most previous work on QRAM [62,65–67] is that we describe the cost of QROM in terms of a fault-tolerant cost model: the number of T gates performed and the number of ancilla qubits required. Under such cost models, the "bucket brigade" QRAM design of Giovannetti et al. [62,65] has T complexity (and thus, also time complexity under reasonable error-correction models) of O(L) despite having depth O(log L), because implementing it as an error-corrected circuit consumes O(L) T gates and O(L) ancilla qubits. Our implementation of QROM consumes only 4L T gates and log L ancillae, which is a constant-factor improvement in T-count and an exponential improvement in space usage over the construction of Giovannetti et al.
FIG. 10. Finite-sized example of the "QROM" database loading scheme used in our implementation of subprepare. If the index register contains ℓ, the output register ends up storing d_ℓ, where d is some precomputed data vector used when constructing the circuit. The top part of the circuit is performing unary iteration as described in Section III A. The bottom part of the circuit is loading classical data associated with each possible index. The classical data is encoded into the presence or absence of CNOTs on the data lines. The "?" marks in the diagram indicate that one should decide whether or not to have a cnot gate on each line, depending on the value of the data to load. This circuit has a T-count of 4L − 4, which is due entirely to the unary iteration. The T gate cost of this circuit is independent of the number of bits used to store each element of the database.
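The data-dependent CNOT pattern of Figure 10 can be sketched classically (our illustration; the function names are ours): bit b of word d_ℓ determines whether a cnot sits at position (ℓ, b), and reading out index ℓ XORs that pattern into a zeroed output register, which is why the word length never touches the T-count.

```python
def qrom_cnot_pattern(data, word_bits):
    """The '?' marks in Figure 10 become a cnot on output bit b exactly
    when bit b of data[ell] is 1. The cnots themselves are Clifford, so
    the T-count stays 4L - 4 regardless of word_bits."""
    return [[(d >> b) & 1 for b in range(word_bits)] for d in data]

def qrom_lookup(data, word_bits, ell):
    # Classical readout: XOR the selected cnot pattern into a zeroed register.
    out = [0] * word_bits
    for b, on in enumerate(qrom_cnot_pattern(data, word_bits)[ell]):
        out[b] ^= on
    return sum(bit << b for b, bit in enumerate(out))

data = [5, 0, 7, 2, 6]
assert all(qrom_lookup(data, 3, ell) == data[ell] for ell in range(len(data)))
```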

D. Subsampling the Coefficient Oracle
In this section we introduce a technique for initializing a state with L unique coefficients (provided by a classical database) with a number of T gates scaling as 4L + O(log(1/ε)), where ε is the largest absolute error that one can tolerate in the prepared amplitudes. This result constitutes a general procedure for implementing prepare such that the cost of circuit synthesis is additive, rather than multiplicative (as in most prior schemes). In particular, it improves on the database scheme from [42] (based on the procedure of [68]) which requires a number of T gates scaling as O(L log(L/ε)). Importantly, our scheme does not increase the value of L or λ, which would usually be the case for most "on-the-fly" strategies for implementing prepare [23,42,59].
Generalizing the requirements of Eq. (6), we begin with the observation that it would be acceptable to have a prepare circuit which initializes the state

prepare |0⟩ = ∑_ℓ √(w_ℓ/λ) |ℓ⟩ |temp_ℓ⟩,   (33)

where |temp_ℓ⟩ is an unspecified junk register entangled with |ℓ⟩. Equivalently, any pure state |𝓛⟩ would suffice if ⟨𝓛| (|ℓ⟩⟨ℓ| ⊗ 1) |𝓛⟩ = w_ℓ/λ for every ℓ. Because select only uses |ℓ⟩ to control operations, phase error (including entanglement with the junk register) in the state produced by prepare will commute across select and be corrected by prepare†. However, select itself necessarily introduces entanglement between its target register |ψ⟩ and the index register |ℓ⟩ (plus associated junk register). So prepare† will not exactly restore |ℓ⟩ or the junk register. Although we only specify the action of prepare on the |0⟩ state, prepare will be applied to other states due to this imperfect uncomputation effect. This is accounted for by (i) requiring qubits coming out of prepare† to be kept and fed back into the next prepare operation and (ii) the reflection step between prepare† and prepare only affecting the |0⟩ state. Given the observation that the existence of an entangled junk register is acceptable, we will seek to implement a circuit which effects the transformation

|0⟩ → ∑_ℓ √ρ̃_ℓ |ℓ⟩ |temp_ℓ⟩,   (34)

where the ρ̃_ℓ are probabilities characterizing the approximate Hamiltonian we are encoding. Whereas the exact Hamiltonian would be associated with probabilities ρ_ℓ ≡ w_ℓ/λ, the value ρ̃_ℓ is a µ-bit binary approximation to ρ_ℓ whose deviation is bounded by the tolerance δ, where the expression for δ comes from Eq. (A10) of Appendix A and bounds the largest acceptable deviation in the coefficients of the terms in a Hamiltonian approximating the one we mean to implement. The second log in Eq. (36) is O(1), because we do not take ∆E larger than λ. The Hamiltonians we will be considering will be frustrated, so ‖H‖/λ is no larger than a constant (less than 1), and the third log in Eq. (36) is O(1) as well.
FIG. 11. Circuit which prepares the state ∑_ℓ √(w_ℓ/λ) |ℓ⟩ |temp_ℓ⟩, where "temp" is temporary garbage data that will be uncomputed when uncomputing the preparation of this state after application of select. The data loading parts of this circuit use the QROM implementation described in Figure 10 in Section III C. The uniform_L circuit is used to initialize the initial superposition over ℓ.
The idea behind our scheme will be to create the superposition in an indirect fashion which involves starting in a uniform superposition over an initial index ℓ, then using a precomputed binary representation of a probability (loaded from QROM), keep_ℓ, to decide whether we should keep ℓ or swap it with a classically precomputed alternate index alt_ℓ which is also loaded from QROM (see Figure 11). Specifically, our procedure creates a uniform superposition in |ℓ⟩ over L values and then uses QROM (see Figure 10 and Section III C) to load |alt_ℓ⟩ and |keep_ℓ⟩. Note that if L is not a binary power one can prepare the initial superposition using the amplitude amplification circuit discussed later in Figure 12. The procedure described thus far prepares the state:

(1/√L) ∑_ℓ |ℓ⟩ |alt_ℓ⟩ |keep_ℓ⟩.   (37)

We will then construct a circuit which coherently swaps the registers |ℓ⟩ and |alt_ℓ⟩, conditioned on a comparison involving keep_ℓ, to create the state in Eq. (34). In order to create the state in Eq. (34) from Eq. (37) we will need to introduce one additional register of size µ, which we refer to as |σ⟩. We will put this entire register into a uniform superposition, and then compare it to the probability represented by keep_ℓ. If keep_ℓ ≤ σ, we will swap registers |ℓ⟩ and |alt_ℓ⟩. Thus, after the procedure is finished, i.e. in Eq. (34), the garbage register will be in a state (Eq. (38)) whose rightmost qubit is the result of a comparison between keep_ℓ and σ. For Eq. (34) to give the correct state, we need |temp_ℓ⟩ to be normalized, which means that we require

ρ̃_ℓ = (1/L)(keep_ℓ/2^µ) + ∑_{k : alt_k = ℓ} (1/L)(1 − keep_k/2^µ).   (39)

FIG. 12. A circuit that uses amplitude amplification [69,70] to conditionally and reversibly prepare the uniform superposition over 2^k L basis states, where L is odd, starting from the |0⟩ state. The circuit spans k + 2 log L + O(1) qubits and has a T-count of 2k + 10 log L + O(log(1/ε)). If the control is omitted, the T-count drops to 8 log L + O(log(1/ε)). If L = 1, the Rz rotations are not needed and the T count drops to 2k. If L = 1 and the control is omitted, the T count is zero.
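Comparing keep_ℓ against a uniformly random µ-bit value σ retains ℓ with probability exactly keep_ℓ/2^µ; this can be checked by exhaustive counting (a small sketch of ours):

```python
def keep_fraction(keep, mu):
    """Probability that ell survives the conditional swap: we swap when
    keep <= sigma, so the ell branch survives for exactly the `keep`
    values sigma = 0, ..., keep - 1 out of 2**mu possibilities."""
    survivors = sum(1 for sigma in range(2 ** mu) if keep > sigma)
    return survivors / 2 ** mu

assert keep_fraction(5, 3) == 5 / 8  # keep = 5 out of 2^3 values of sigma
```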
We can find the values of keep_ℓ and alt_ℓ from Eq. (39) in a sequential way. At any step, let 𝓛 denote the set of ℓ for which we have already found these values. Then we can rewrite Eq. (39) as

(1/L)(keep_ℓ/2^µ) + ∑_{k∉𝓛 : alt_k = ℓ} (1/L)(1 − keep_k/2^µ) = ρ̃_ℓ − ∑_{k∈𝓛 : alt_k = ℓ} (1/L)(1 − keep_k/2^µ).   (40)

This expression involves only known quantities on the right-hand side, which we call ρ̂_ℓ for short. We will show by induction that the average of ρ̂_ℓ for ℓ ∉ 𝓛 is 1/L, and the ρ̂_ℓ are non-negative. These are clearly true initially, because then ρ̂_ℓ = ρ̃_ℓ. Now assume that these conditions are true at some step. If the values ρ̂_ℓ are all equal for ℓ ∉ 𝓛, then we can just take alt_ℓ = ℓ and any value of keep_ℓ, and we will satisfy Eq. (40) for all remaining ℓ. Otherwise, there will be one value, ℓ_0, where ρ̂_{ℓ_0} is below the average 1/L and another, ℓ_1, where ρ̂_{ℓ_1} is above 1/L. For ℓ_0, we choose keep_{ℓ_0} = 2^µ L ρ̂_{ℓ_0} and alt_{ℓ_0} = ℓ_1. Then ℓ_0 is added to the set 𝓛 and the values of ρ̂ are updated. According to Eq. (40), the only value of ρ̂ that is updated is that for ℓ = ℓ_1, where we replace it with ρ̂_{ℓ_1} + ρ̂_{ℓ_0} − 1/L. That ensures the average value of ρ̂ for ℓ ∉ 𝓛 is still 1/L, and since we had ρ̂_{ℓ_1} > 1/L, the new value is non-negative. A more intuitive way to understand our approach to preparation is that it is equivalent to generating ℓ with the probability distribution ρ̃_ℓ by the following classical procedure:
1. Select ℓ uniformly at random from [0, L).
2. Look up alt_ℓ and keep_ℓ.
3. Return ℓ with probability keep_ℓ/2^µ; otherwise return alt_ℓ.
The procedure for determining the alt_ℓ and keep_ℓ values is then to work backwards, starting from the distribution ρ̃_ℓ and updating this distribution by shifting probability from ℓ_1 to ℓ_0 until we obtain a uniform distribution.
This procedure is illustrated in Figure 13. One starts with a histogram of the desired distribution and looks for a bar that is too small, fixes this by transferring probability from a bar that is too high, and so on until all bars have the correct height. Each probability transfer permanently solves the bar that was too low, and the remaining bars form a smaller instance of the same problem. Thus, it is not possible to get stuck in a loop or a dead-end. See also the module utils/_lcu_util.py in version 0.6 of OpenFermion (www.openfermion.org/) [71] for open source python code that performs this iterative matching process (and also handles discretizing the distribution) in O(L) time.
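The iterative matching can be sketched in a few lines of Python (our illustrative reimplementation, working with integer weights that sum to L·2^µ; the reference implementation is OpenFermion's utils/_lcu_util.py):

```python
def build_alt_keep(weights, mu):
    """Build the alt/keep tables for the subsampling scheme.

    `weights` are mu-bit-discretized probabilities scaled so that they
    sum to L * 2**mu (average 2**mu per entry). Each pass transfers
    probability from an above-average bar to a below-average one, as in
    the histogram picture of Figure 13."""
    L, unit = len(weights), 2 ** mu
    assert sum(weights) == L * unit
    alt, keep, w = list(range(L)), [unit] * L, list(weights)
    lo = [i for i in range(L) if w[i] < unit]
    hi = [i for i in range(L) if w[i] > unit]
    while lo:
        i = lo.pop()
        j = hi[-1]
        keep[i], alt[i] = w[i], j  # entry i is now permanently fixed
        w[j] += w[i] - unit        # transfer i's deficit to j
        if w[j] == unit:
            hi.pop()
        elif w[j] < unit:
            hi.pop()
            lo.append(j)
    return alt, keep

def induced_weight(alt, keep, mu, ell):
    """Weight the classical procedure assigns to ell: kept directly with
    probability keep[ell]/2**mu, or reached as someone's alternate."""
    unit = 2 ** mu
    return keep[ell] + sum(unit - keep[i]
                           for i in range(len(alt)) if alt[i] == ell)

alt, keep = build_alt_keep([8, 2, 14, 6, 10], mu=3)
assert all(induced_weight(alt, keep, 3, l) == w
           for l, w in enumerate([8, 2, 14, 6, 10]))
```

Each pass permanently fixes one below-average entry by borrowing from an above-average one, so the loop runs at most L times, consistent with the O(L) classical cost quoted above.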

IV. CONSTRUCTIONS FOR THE ELECTRONIC STRUCTURE HAMILTONIAN
Using the appropriate discretization into a basis of N spin-orbitals, the electronic structure Hamiltonian can be written as

H = ∑_{p,q,σ} T(p − q) a†_{p,σ} a_{q,σ} + ∑_{p,σ} U(p) n_{p,σ} + ∑_{(p,α)≠(q,β)} V(p − q) n_{p,α} n_{q,β},   (41)

where a†_{p,σ} and a_{p,σ} are fermionic creation and annihilation operators on spatial orbital p ∈ {0, · · · , N/2 − 1} with spin σ ∈ {↑, ↓}, and n_{p,σ} = a†_{p,σ} a_{p,σ} is the number operator. These operators satisfy the canonical fermionic anticommutation relations {a†_{p,α}, a†_{q,β}} = {a_{p,α}, a_{q,β}} = 0 and {a†_{p,α}, a_{q,β}} = δ_{p,q} δ_{α,β}. Mapping to qubits under the Jordan-Wigner transformation [61,72], Eq. (41) becomes a linear combination (Eq. (42)) of Pauli strings of the forms Z_p, Z_p Z_q, X_p →Z X_q and Y_p →Z Y_q, where we have introduced the notation →Z, which will be used throughout the paper and which we now explain. The tensor factors on which Pauli operators act can always be interpreted as some integer. For instance, (p, σ) is mappable to an integer under a particular choice of canonical ordering in the Jordan-Wigner transformation. When placed between two Pauli operators, the notation A_j →Z A_k denotes the operator A_j Z_{j+1} · · · Z_{k−1} A_k. The exact mapping between a spin-orbital indexed by (p, σ) and a qubit indexed by an integer is discussed later on.
The forms of Eq. (41) and Eq. (42) encompass a wide range of fermionic Hamiltonians including the molecular electronic structure (aka "quantum chemistry") Hamiltonian in any basis that diagonalizes the Coulomb potential [43]. The particular coefficients will depend on the discretization scheme and basis functions chosen to represent the system. One such representation, derived for use in quantum simulations in [43], prescribes the coefficients of Eq. (43), in which each spatial orbital p is associated with an orbital centroid r_p = p (2Ω/N)^{1/3}, and Ω is the computational cell volume. The momentum modes are defined as k_ν = 2πν/Ω^{1/3} with ν ∈ [−(N/2)^{1/3}, (N/2)^{1/3}]^{⊗3}. When dealing with molecular potentials, R_j and ζ_j are the position and charge of the j'th nucleus, respectively. As discussed in [43], the Hamiltonian of Eq. (43) corresponds to discretization in a basis composed of rotated plane waves known as the "plane wave dual" basis. The basis set discretization error associated with the dual basis is asymptotically equivalent to a Galerkin discretization using any other single-particle basis functions, including Gaussian orbitals [43]. Thus, Eq. (43) is a general expression of the electronic structure problem that is asymptotically equivalent to any other representation. While well suited for simulating periodic materials, despite asymptotic equivalence, this basis set is not particularly compact for the simulation of molecules. Another basis set compatible with Eq. (41) and Eq. (42) while being much more appropriate for molecules is the so-called "Gausslet" basis [48]. Gausslets are derived from a ternary wavelet transformation [73] of Gaussian orbitals and have similar intrinsic basis set discretization error to standard Gaussian orbitals [48].
The simulation procedures here will make use of the structure in Eq. (42). Specifically, our algorithm will make use of the fact that the Hamiltonian consists of only four types of terms, Z_p, Z_p Z_q, X_p →Z X_q and Y_p →Z Y_q, and that there are only 3N/2 unique values of the coefficients. Our algorithms do not utilize any particular structure in the dual basis Hamiltonian in Eq. (43) beyond the fact that it satisfies the form of Eq. (41). This is important since it implies that the techniques of this paper are compatible with other representations of the electronic structure Hamiltonian, such as the finite difference discretization [43], finite element methods, and Gausslet basis sets [48], which produce Hamiltonians consistent with Eq. (41) but not Eq. (43).

A. Electronic Structure Hamiltonian Selection Oracle
In order to implement the select and prepare oracles for the electronic structure Hamiltonian of Eq. (41), one must first define a scheme for indexing all of the terms. For the case of the general electronic structure Hamiltonian in Eq. (41) we will index terms with the registers |θ , |U , |V , |p , |α , |q and |β . The |p and |q registers are little-endian binary encodings of integers going from 0 to N/2 − 1, thus using log N − 1 qubits each; the other registers are each a single bit which we use to specify the unitary that select should apply to the system register |ψ .
The |α⟩ and |β⟩ bits are used to specify the spins {↑, ↓}, which together with the spatial orbital specifications p and q, index a spin-orbital. Thus, a register set as |p⟩ |α⟩ |q⟩ |β⟩ will be indexing a Hamiltonian term that involves action on the spin-orbitals indexed by (p, α) and (q, β). Next, whenever |U⟩ = |1⟩, it will be the case (by construction of our circuits) that (p, α) = (q, β) and we will apply the Z_{p,α} terms. If |V⟩ = |1⟩, we will apply the Z_{p,α} Z_{q,β} terms. If |U⟩ |V⟩ = |0⟩ |0⟩ and p < q, it will also be the case that α = β and we will apply the X_{p,α} →Z X_{q,α} terms; if |U⟩ |V⟩ = |0⟩ |0⟩ and p > q, it will again be the case that α = β and we apply the Y_{q,α} →Z Y_{p,α} terms. Finally, the |θ⟩ register encodes whether the unitary should have a negative phase (if |θ⟩ = |1⟩). Thus, our select circuit will meet the following specification (where "undefined" means this case should not occur):

select_chem |θ, U, V, p, α, q, β⟩ |ψ⟩ = (−1)^θ |θ, U, V, p, α, q, β⟩ ⊗
  Z_{p,α} |ψ⟩              if U = 1 and (p, α) = (q, β),
  Z_{p,α} Z_{q,β} |ψ⟩      if U = 0 and V = 1,
  X_{p,α} →Z X_{q,α} |ψ⟩   if U = V = 0, α = β and p < q,
  Y_{q,α} →Z Y_{p,α} |ψ⟩   if U = V = 0, α = β and p > q,
  undefined                otherwise.   (44)

We present our implementation of select_chem in Figure 14. Our circuit relies on subroutines we describe in Section III A which provide a method for selectively applying strings of Pauli operators to a system register of size N, with controls on log N qubits. Important notation for these subroutines is also defined in Section III A, and thus that section is necessary for understanding the details of Figure 14.
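The case analysis of Eq. (44) can be mirrored by a small classical dispatcher (our sketch; it returns Pauli strings keyed by spin-orbital tuples rather than by the qubit numbering f(p, σ)):

```python
def select_chem_pauli(theta, U, V, p, alpha, q, beta):
    """Classical dispatch mirroring Eq. (44): returns the sign and the
    Pauli string (a dict mapping spin-orbitals (index, spin) to 'X', 'Y'
    or 'Z') that select applies for a given index-register value."""
    sign = -1 if theta else 1
    if U:
        assert (p, alpha) == (q, beta)  # guaranteed by construction
        return sign, {(p, alpha): 'Z'}
    if V:
        return sign, {(p, alpha): 'Z', (q, beta): 'Z'}
    assert alpha == beta and p != q     # remaining hopping-term cases
    lo, hi = min(p, q), max(p, q)
    op = 'X' if p < q else 'Y'
    string = {(k, alpha): 'Z' for k in range(lo + 1, hi)}
    string[(lo, alpha)] = op
    string[(hi, alpha)] = op
    return sign, string

# p = 1 < q = 3 with U = V = 0 gives the X (Z-string) X term.
assert select_chem_pauli(0, 0, 0, 1, 0, 3, 0) == (
    1, {(1, 0): 'X', (2, 0): 'Z', (3, 0): 'X'})
```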
Since p and q are actually three-dimensional vectors with elements taking integer values in [0, (N/2)^{1/3} − 1] and σ ∈ {↑, ↓}, we should clarify how the spin-orbitals (p, σ) are mapped to integers representing qubits. For ease of exposition, we will define a mapping function f(p, σ) for a D-dimensional system, where for chemistry D = 3 and for the Hubbard model D = 2. The δ function behaves as one might expect: δ_{↑,↓} = 0 and δ_{↓,↓} = 1. Thus it should be understood that X_{p,σ} implies the X operator acting on qubit f(p, σ).
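As an illustration (not necessarily the exact convention of f in the text), one concrete choice consistent with the description flattens the D-dimensional orbital vector row-major and lets the spin bit contribute δ_{σ,↓} · N/2:

```python
def f(p, sigma_down, N, D):
    """Illustrative spin-orbital -> qubit index map (our assumed
    convention): row-major flattening of the D-dimensional orbital
    vector p, plus delta_{sigma, down} * N/2 for the spin."""
    M = round((N // 2) ** (1 / D))  # orbitals per dimension, M**D = N/2
    idx = 0
    for coord in p:
        idx = idx * M + coord
    return idx + (N // 2) * sigma_down

# N = 32 spin-orbitals in D = 2: a 4x4 grid of spatial orbitals per spin.
assert f((1, 2), 0, 32, 2) == 6
assert f((1, 2), 1, 32, 2) == 22
```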
FIG. 14. The select_chem circuit, with a T-count of 12N + 8 log N + O(1), which implements the functionality specified by Eq. (44), conditioned on a "control" qubit. The unitaries performing Majorana and indexed operations (each requiring 4N T gates) are explicitly constructed in Section III A. As described in Figure 9, the unitaries labeled as →Z A_j apply the operation Z_0 · · · Z_{j−1} A_j to the target register, where j depends on the value in the input |p⟩ register. These operations will require an extra log N ancillae, so the overall circuit spans N + 3 log N + O(1) qubits. The operation which targets the system register with Z_q is a variant of Figure 7 with the X_ℓ gate replaced by Z_ℓ. All indexed operations reuse the same ancillae.

B. Electronic Structure Coefficient Preparation Oracle
We see from Eq. (44) that prepare must initialize a superposition over the registers |θ⟩, |U⟩, |V⟩, |p⟩, |α⟩, |q⟩ and |β⟩ whose amplitudes are square roots of the normalized Hamiltonian coefficients (Eq. (46)), where the values of the coefficients and the state of |θ⟩, related to the coefficients in Eq. (42), are defined in Eq. (47). The T(p) coefficient inside the square root in Eq. (47) differs from the coefficient in Eq. (42) by a factor of two, since it occurs for each type of term only once depending on whether p < q or p > q.
To implement prepare we first synthesize a unitary referred to as subprepare, which acts as specified in Eq. (48). Since in this step we initialize a state on O(log N) qubits, the techniques of [68] would allow one to implement subprepare with a T-count of O(N log(1/ε)). However, in Figure 15 we show an even more efficient method for synthesizing subprepare with T gate complexity O(N + log(1/ε)), based on the techniques introduced in Section III D.
Using subprepare, we can implement the entire prepare circuit with the same asymptotic T complexity. In our subprepare circuit, the loaded index data is really a vector of integers; thus, we use "modular vector indices" such that if v is a 3-dimensional vector within a rectangular space with each dimension having M values, then function application at v is understood componentwise modulo M. Although we only specify the behavior of the circuit when the U, V, and p qubits start in the |0⟩ state, the circuit will also be invoked in contexts where this is not the case.
While applying subprepare to create the state in Eq. (48), we also initialize the |α⟩ qubit in the |+⟩ state with a Hadamard. We then use the uniform_L circuit from Figure 12 to initialize the |q⟩ register in an equal superposition, in a way that is controlled on the |U⟩ ancilla qubit being in the state |0⟩_U. Subsequent to this step the state becomes that of Eq. (49). The register labeled as |d⟩ in Eq. (48) will ultimately become our |p⟩ register, but immediately after subprepare it is more appropriate to think of it as encoding a value |p − q⟩. As we can see in Eq. (46), when |V⟩ = |1⟩ and p = q, it is necessarily the case that α ≠ β. The middle part of our prepare circuit is dedicated to correctly initializing this tricky part of the superposition. To do it, we use an ancilla to apply a Hadamard gate to |0⟩_β only when |V⟩ = |1⟩ and |d⟩ ≠ |0⟩^⊗(log N − 1). In the event that |V⟩ = |1⟩ and |d⟩ = |0⟩^⊗(log N − 1), we apply a CNOT gate with an open control on |α⟩ which targets |0⟩_β, thus ensuring that |β⟩ ≠ |α⟩ when p − q = 0. Then, we set |β⟩ = |α⟩ for the U and T parts of the superposition by applying a Toffoli gate with regular control on |α⟩ and open control on |V⟩, targeting |β⟩. After these operations the state can be expressed as Eq. (50). The final step consists of converting the |d⟩ register to values representing |p⟩. To do this, we must add the |q⟩ register into the |d⟩ register when |U⟩ = |0⟩ so that |d + q⟩ = |p − q + q⟩ = |p⟩. However, we also want to copy the |d⟩ register into the |q⟩ register when |U⟩ = |1⟩; thus, prior to this operation we also implement a Fredkin gate which swaps |d⟩ and |q⟩, conditioned on |U⟩ = |1⟩. After the Fredkin gate and the addition, the state is that of Eq. (51). Then, simply by relabeling d = p − q whenever |U⟩ = |0⟩ and d = p whenever |U⟩ = |1⟩, we see that our state is identical to the desired one (from Eq. (46)). We show how to use subprepare to implement prepare_chem in Figure 16.
The gate complexity of subprepare is O(N + log(1/ε)) and the gate complexity of all other components of this circuit is O(log N). Thus, the overall gate complexity of prepare is O(N + log(1/ε)).

FIG. 16. The prepare circuit for the electronic structure Hamiltonian. Implements the unitary in Eq. (46) with a T count of 6N + O(µ + log N), where µ is defined in Eq. (36). The subprepare subroutine is the dominant cost, and is implemented as in Figure 15. When U is set by subprepare, the Fredkin gates and the addition copy p's value over q. When V is set, uniform superpositions over q and β are prepared except, if p − q = 0, then β is instead set to be opposite to α in order to guarantee (p, α) ≠ (q, β). Beware that this conditional preparation of β doubles the weight of p = q cases (relative to p ≠ q cases) and must be accounted for in the LCU coefficients given to subprepare. When neither U nor V is set, α is copied into β and a uniform superposition over q is prepared. The remaining operations contribute a negligible O(µ + log N) T gates. The uniform_M^⊗D operation prepares D registers in a uniform superposition of basis states going up to M, as in Figure 12. Controls on multi-qubit lines are conditioned on every qubit within the line, e.g. what appears to be a Toffoli gate is actually a not gate with 1 + D log M controls. The action of our prepare_chem circuit is only specified in the case where the inputs are all |0⟩. During actual execution this is not the case, because the effects of the select operation will prevent prepare†_chem from exactly uncomputing the U, V, p, q, α, and β qubits. This is expected behavior and it is accounted for by requiring that the potentially-not-uncomputed qubits be kept and used as inputs for the next prepare_chem circuit.

C. Resources Required for Electronic Structure Simulation
The parameter λ from Eq. (6) has significant implications for the complexity of our algorithm; as seen in Eq. (27), our circuit size will scale linearly in λ. For the case of general electronic structure, we can see from Eq. (41) that λ is the sum of the absolute values of the coefficients over all terms (Eq. (52)). This expression and the extremely naive assumption that all coefficients are O(1) would imply that λ ∈ O(N²). For the case of quantum chemistry in the dual basis, i.e. Eq. (43), the work of [43] obtains the same bound (Eq. (53)), where the last relation holds when studying electronic structure systems that grow with fixed density N ∝ Ω, which is the usual situation. For encoding the electronic structure Hamiltonian we also determined that P = 6N + O(log(N/ε)) and S = 12N + O(log N) in terms of T complexity. Thus, from Eq. (27) we can conclude that the overall T complexity of our procedure is roughly that of Eq. (54), which for the electronic structure Hamiltonian is rigorously bounded by O(N³/∆E). Ancillae required for our electronic structure simulation come from three sources: qubits required for our entanglement-based phase estimation (given by Eq. (24)), qubits required to store coefficient values in QROM (given by Eq. (36)), and ancillae actually required for our implementation of prepare and select, which for the electronic structure Hamiltonian simulation is 5 log N + O(1). Putting these sources together, the total number of ancillae required is given by Eq. (55), where the additive constant is small and can usually be neglected for problem sizes of interest. This expression gives the ancilla count in Theorem 1.

We now estimate resources required for specific problem instances. In practice, we find that λ scales better than O(N²), but exactly how much better is system dependent. For a particular material the value of λ can be influenced by a number of factors. These factors include the particulars of the bases used, the geometry and atomic composition of the material, and whether one scales toward continuum or thermodynamic limits.
Perhaps the simplest chemistry system which is classically intractable is a molecule without nuclei: the uniform electron gas, also known as jellium. Jellium is a system of η electrons with kinetic energy and Coulomb interactions, confined to a box of finite volume Ω with periodic boundary conditions. Plane waves are a near-ideal basis for the simulation of jellium; the system is naturally expressed using the discretization of Eq. (43) with a constant external potential, i.e. ζ_j = 0. Jellium is an interesting system to simulate on early quantum computers due to its simplicity, classical intractability [43], historical significance tied to breakthroughs in density functional theory [74] as well as the fractional quantum Hall effect [75], and tradition as a benchmark for classical electronic structure calculations.
The phase diagram of jellium is typically parameterized in terms of the Wigner-Seitz radius, which characterizes the electron density in three dimensions as r_s = (3Ω/(4πη))^{1/3}, where η is the number of electrons. Although the ground state of jellium at high densities (metallic, r_s ∼ 1 Bohr radius per particle) and at very low densities (insulating, r_s ∼ 100 Bohr radii per particle) is well known, the phase diagram in the intermediate-density regime is less certain [76-81]. Whereas perturbation theory performs well in the high-density regime [82,83], quantum Monte Carlo has been the most competitive simulation tool in the low- to intermediate-density regimes [84-87].

[TABLE III. The number of logical ancillae is computed as in Eq. (55) and the number of T gates is computed using Eq. (54). These estimates assume (rather conservatively in comparison to classical limitations) that we should target additive chemical accuracy of ΔE = 0.0016 Hartree. These problem sizes are large enough that contemporary classical methods cannot reliably provide unbiased estimates with low enough systematic error to resolve competing phases within the fixed-size basis.]

For systems with more than fifty electrons, quantum Monte Carlo simulations of jellium typically introduce a bias to control the sign problem, such as the fixed-node approximation, full configuration interaction quantum Monte Carlo with initiators, or auxiliary-field quantum Monte Carlo with a constrained-phase bias. The systematic error from these biases is thought to be as large as half a percent in the energy [77,84], which is on a similar scale to the energy difference between competing phases in the intermediate-density regime. Even for modest system sizes such as fifty electrons and twice as many spin orbitals, quantum simulations can offer bias-free results that cannot be obtained by quantum Monte Carlo. The numerics in Figure 17 empirically estimate a tighter bound on λ for jellium in the classically challenging regime corresponding to r_s = 10 Bohr radii at half filling, N = 2η. These numerics indicate an empirical scaling of λ ∈ O(N^{5/3}). If we target chemical accuracy of ΔE = 0.0016 Hartree, then from Eq. (54) we see that roughly twenty million T gates would be required for jellium with 54 orbitals, two hundred million T gates for jellium with 128 orbitals, and about a billion T gates for jellium with 250 orbitals. While these numbers are promising, for small sizes these simulations require a number of ancillae comparable to N. T counts and ancilla resources are tabulated for several jellium problem instances in Table III.
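The Wigner-Seitz parameterization above is simple enough to check numerically; a minimal sketch (the η = 27, r_s = 10 values below are illustrative choices, matching half filling of N = 54 spin orbitals):

```python
import math

def wigner_seitz_radius(volume, eta):
    """r_s = (3*Omega/(4*pi*eta))**(1/3): the radius of a sphere whose
    volume equals the average volume per electron."""
    return (3.0 * volume / (4.0 * math.pi * eta)) ** (1.0 / 3.0)

def volume_for_rs(rs, eta):
    """Inverse relation: the box volume Omega at which eta electrons sit
    at Wigner-Seitz radius rs."""
    return (4.0 / 3.0) * math.pi * eta * rs ** 3
```

For example, 27 electrons at r_s = 10 Bohr radii require a simulation cell of volume Ω = 36π × 10^3 ≈ 1.13 × 10^5 cubic Bohr radii.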
The dual basis of Eq. (43) is also a natural choice for periodic condensed phase systems (e.g. solids) besides jellium. Considering only this basis, there are two parameters that determine the accuracy of the simulation with respect to the true material. The first is the number of plane waves used to discretize the cell, which determines the spacing of the quasi-points in the dual basis [43]; that is, more plane waves equate to a finer grid and a more accurate discretization. The second parameter is the size of the supercell, which determines the error one incurs by representing an infinite system with a finite, periodic one, also known as the finite-size error. There are different ways of reducing the finite-size error for a physical system. One common method used in density functional theory utilizes Bloch's theorem to divide the sampling problem into so-called "k-points" within the first Brillouin zone [88]. The smoothness of the energy with respect to the k-points and the additional symmetry provided can offer advantages in certain approaches, at the cost of increased complexity, often resulting in a complex-valued Hamiltonian representation at nonzero k-points. The origin k-point, also called the gamma point, maintains a real Hamiltonian for the appropriate basis functions. An alternative to k-point sampling is increasing the size of the supercell, which increases the relevance of the gamma point. For simplicity, here we only consider the gamma point, so the natural parameter to change is the number of repetitions of the unit cell, which fixes the size of the supercell being simulated. A larger supercell tends to incur less finite-size error as the system is scaled to the thermodynamic limit.
It is clear that these two parameters are not entirely independent with respect to the accuracy of representation of the true system. For example, a much larger supercell with the same number of grid points clearly offers a coarser and less accurate representation of the true system. Moreover, in classical methods it is common practice to extrapolate along both parameters to increase accuracy for a given computational cost [88]. We do not introduce such complexities here, but rather show empirically how the choice of these parameters influences the parameter λ which determines the cost of our algorithms, leaving optimizations such as extrapolation schemes to future work. Figure 18 shows the value of λ as a function of the number of qubits being used to discretize the material cell at a fixed supercell size for several real materials. The number of qubits here is equal to twice the number of plane waves since spin is being treated explicitly. Equal numbers of plane waves along each of the reciprocal lattice vectors of the supercell are used, as opposed to the more common spherical energy cutoff schemes [89], as this enables the use of the plane wave dual basis [43]. The exponent (slope on the log-log plot) of a least-squares linear regression to the data is listed alongside the material, and we see empirically that the value of λ scales just under O(N^2) in the number of basis functions while keeping the size of the supercell fixed, which matches the analytical bound rather closely. From the form of the Hamiltonian in Eq. (43), one can see that at a fixed number of plane waves, increasing the volume of the supercell Ω tends to decrease λ such that λ ∈ O(Ω^{-1/2}), due to representing lower frequency modes with respect to the plane wave representation of the kinetic energy, despite increasing the total number of nuclei present. We show this effect empirically in the center panel of Figure 18.
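The empirical exponents quoted here come from a least-squares fit on a log-log plot; a minimal sketch of that fit (the synthetic data below is illustrative):

```python
import numpy as np

def loglog_slope(ns, lams):
    """Fit lam ~ c * N**k by least squares in log-log space; return k,
    the slope on the log-log plot."""
    slope, _intercept = np.polyfit(np.log(np.asarray(ns, dtype=float)),
                                   np.log(np.asarray(lams, dtype=float)), 1)
    return slope
```

For synthetic data obeying λ = 3N^2 exactly, the recovered exponent is 2; applied to measured λ values, the slope gives the empirical scaling exponent listed alongside each material in Figure 18.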
The first two panels of Figure 18 leave open the question of the impact on λ of increasing the supercell size while maintaining a constant density of dual quasi-points or a constant density of plane waves, as we expect these two effects to compete in some way. Empirically this is shown in the right panel of Figure 18, which plots the values of λ for a fixed density of points in increasing supercell sizes. Note that this scaling is most comparable to past studies on single molecules, since molecular volume tends to grow as one adds electrons. We observe that the scaling in this case is more favorable as a function of the number of qubits than simply refining the grid alone, and in all cases is better than O(N^{3/2}). From Eq. (54), this numerical data would suggest that the T complexity of our overall algorithm is empirically bounded by O(N^{5/2}/ΔE) when the goal is simulation of real materials. To treat materials properly one should further consider pseudopotentials [90], methods of extrapolation to continuum and thermodynamic limits, and embedding methods [91,92]. We will leave such a thorough comparison of fault-tolerant resources required for specific instances of real materials other than jellium to future work. However, by comparing Figure 17 and the right panel of Figure 18 it is apparent that jellium is a reasonable proxy for other materials in that the λ values are comparable. As the rest of the simulation circuit is identical up to the particular angles of certain single-qubit rotations, one can estimate the cost of simulating these materials from our analysis of the fault-tolerant overheads required to simulate jellium in Section VI.

V. CONSTRUCTIONS FOR THE HUBBARD MODEL
In this section we describe specialized implementations of the select and prepare oracles for simulation of the planar Fermi-Hubbard model with repulsive interactions, and then estimate the overall T complexity of simulating such models. The Hubbard model is a canonical model of a many-electron system, often used to model superconductivity in the cuprate superconductors. Despite its simplicity, the Hubbard model exhibits a wide range of correlated electron behavior, including superconductivity, magnetism, and interaction-driven metal-insulator transitions [93].
The Hubbard model is essentially a special case of Eq. (41) in which the terms are restricted to a planar grid. The Hamiltonian can be expressed as

H = −t Σ_{⟨p,q⟩,σ} (a†_{p,σ} a_{q,σ} + a†_{q,σ} a_{p,σ}) + u Σ_p n_{p,↑} n_{p,↓},

where the notation ⟨p, q⟩ indicates that terms exist only between sites that are adjacent on a planar lattice with periodic boundary conditions. Under the Jordan-Wigner transformation, this Hamiltonian can be expressed (up to an additive constant) as

H = −(t/2) Σ_{⟨p,q⟩,σ} (X_{p,σ} Z⋯Z X_{q,σ} + Y_{p,σ} Z⋯Z Y_{q,σ}) − (u/4) Σ_{p,σ} Z_{p,σ} + (u/8) Σ_p Z_{p,↑} Z_{p,↓}.    (56)

We will focus on the Hubbard model with periodic boundary conditions, which is a more typical system to study than the Hubbard model with open boundary conditions.
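The adjacency structure ⟨p, q⟩ above can be enumerated directly; a small sketch (our own helper, not a circuit from the paper) that counts the unique nearest-neighbor pairs of an M × M lattice with periodic boundary conditions, of which there are 2M^2 for M ≥ 3:

```python
def periodic_edges(M):
    """Unique nearest-neighbor site pairs on an M x M periodic lattice,
    using the linear index p = x + y*M. Each site contributes its +x and
    +y neighbors; the set removes duplicates."""
    edges = set()
    for x in range(M):
        for y in range(M):
            p = x + y * M
            edges.add(tuple(sorted((p, (x + 1) % M + y * M))))    # +x neighbor
            edges.add(tuple(sorted((p, x + ((y + 1) % M) * M))))  # +y neighbor
    return edges
```

Each of these edges contributes hopping terms for both spins, which is the multiplicity underlying the coefficient counts discussed below.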

A. Hubbard Model Hamiltonian Selection Oracle
We see from Eq. (56) that there are only three unique coefficients in the Hubbard Hamiltonian: the coefficient of the −X Z⋯Z X and −Y Z⋯Z Y terms is t/2, the coefficient of the ZZ terms is u/8, and the coefficient of the local −Z terms is u/4. This makes implementation of the prepare circuit exceptionally simple. Ultimately, we will show that the prepare circuit for the Hubbard model can be implemented at cost O(log(1/ε)). This scaling virtually guarantees that for all problem sizes of interest the cost of the overall algorithm will be dominated by the cost of the select circuit.
We will index terms in the Hubbard Hamiltonian using the registers |U⟩|V⟩|p_x⟩|p_y⟩|α⟩|q_x⟩|q_y⟩|β⟩. Note that it is important for us to explicitly separate p_x and p_y in our construction of the Hubbard model circuits, since this structure is fundamental to the efficiency of our scheme. Our indexing scheme is nearly identical to the scheme used for the arbitrary chemistry Hamiltonian in Eq. (7), but here we will not need the θ parameter since we know the signs of the coefficients in advance. Thus, our select circuit for Hubbard meets a specification analogous to that of Eq. (7), where for ease of exposition p = p_x + p_y M and q = q_x + q_y M, consistent with the convention of Eq. (45) for D = 2. By exploiting translational invariance in the Hubbard model we are able to implement select_hub in a slightly more efficient fashion, achieving a T count of only 10N + O(log N). We show this more efficient implementation in Figure 19.
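The flattening convention p = p_x + p_y M used in this specification is just row-major indexing with periodic wraparound; a minimal sketch (the function names are ours, not the paper's):

```python
def flatten(px, py, M):
    """Linear site index p = px + py*M, the convention of Eq. (45) for D = 2."""
    return px + py * M

def shifted(px, py, M, dx=0, dy=0):
    """Linear index of the site shifted by (dx, dy) on an M x M lattice
    with periodic boundary conditions."""
    return flatten((px + dx) % M, (py + dy) % M, M)
```

For example, on an M = 4 lattice the site (p_x, p_y) = (3, 1) has index 7, and its +x neighbor wraps around to (0, 1), index 4.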

B. Hubbard Model Coefficient Preparation Oracle
Our prepare circuit for the Hubbard model has a three-part specification: the first part corresponds to the −Z and ZZ terms, the second to the −X Z⋯Z X terms, and the third to the −Y Z⋯Z Y terms. Note that we are looking at a Hubbard model with periodic boundary conditions, so wherever something like |p_x + 1⟩ appears we really mean |(p_x + 1) mod M⟩, which we omit for clarity. Our implementation of prepare for the Hubbard model begins by initializing a two-qubit state containing the three distinct coefficients for the U, V, and T terms. This is done with standard circuit synthesis techniques at a T count of O(log(1/ε)) [68]. We then spread these coefficients over the various cases. We depict our implementation in Figure 20. If the control is off, the circuit has no effect. Otherwise, if |U⟩|V⟩ = |1⟩|0⟩, it will be the case that (p, α) = (q, β), and our circuit applies −Z_{p,α}. If |U⟩|V⟩ = |0⟩|1⟩, we again have that p = q, but this time α = 0 and β = 1, so the circuit applies Z_{p,0} Z_{q,1}. If |U⟩|V⟩ = |0⟩|0⟩ and p < q, the circuit performs −X_{p,α} Z⋯Z X_{q,α}, but if p > q the circuit performs −Y_{p,α} Z⋯Z Y_{q,α}. The larger gates in this circuit are Majorana operators described in Figure 9 and an indexed operation explained in Figure 7 (except the X gate is replaced by a Z gate). The Majorana operators each have a T count of 4N, but the indexed operation has no dependence on α and so has a T count of 2N. All other circuit components have T counts in O(log N).

[FIGURE 20. The prepare circuit for the Hubbard model, with T count O(log(N/ε)). The R_y operations are used to prepare the three distinct LCU coefficients which, including multiplicity, are tN/λ (for the 2N T-type terms), Nu/(4λ) (for the N = 2M^2 U-type terms), and Nu/(8λ) (for the N/2 = M^2 V-type terms). We only specify the action of this circuit in the case where the inputs are all |0⟩.]
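The three coefficient groups and the multiplicities listed in the Figure 20 caption determine the LCU normalization; a minimal sketch, with λ defined here simply as the total absolute weight of the three groups (an illustrative convention, not necessarily the paper's exact λ):

```python
def hubbard_lcu_weights(M, t, u):
    """Total absolute coefficient weight of each group for an M x M spinful
    Hubbard lattice (N = 2*M*M spin orbitals), with multiplicities as in
    the Figure 20 caption, normalized into LCU probabilities."""
    N = 2 * M * M
    totals = {
        "T": t * N,        # 2N hopping-type terms
        "U": u * N / 4.0,  # N local -Z terms
        "V": u * N / 8.0,  # N/2 ZZ terms
    }
    lam = sum(totals.values())
    return {k: w / lam for k, w in totals.items()}, lam
```

The returned probabilities are the squared amplitudes that the R_y rotations in Figure 20 must realize; by construction they sum to one.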
During actual execution, the effects of the select operation will prevent prepare† from exactly uncomputing the U, V, p_x, p_y, and α qubits as well as the bottom two ancilla qubits. This is expected behavior: the potentially-not-uncomputed qubits must be kept and used as inputs for the next prepare_hub circuit.

C. Hubbard Model Resources
For the case of the planar Hubbard model in Eq. (56), the value of λ follows immediately by summing the absolute values of the coefficients with their multiplicities, assuming that we are dealing with the spinful model with periodic boundary conditions. We also determined that P ∈ O(log(N/ε)) and that S = 10N + O(log N). Thus, the total T cost of the Hubbard algorithm scales as O(λ(N + log(1/ε))/ΔE). Ancillae required for our Hubbard model simulation come from two sources: qubits required for our entanglement-based phase estimation (see Eq. (24)) and ancillae actually required for our implementation of prepare and select, which for the Hubbard model is 3 log N + O(1). Putting these sources together gives the total ancilla count of Theorem 2, in which the additive constant is small and can usually be neglected for problem sizes of interest.

While numerically exact solutions to the Hubbard model are available for one-dimensional [94] and infinite-dimensional systems [95], no known polynomial-time classical methods can provide reliable solutions to the planar model in all parts of its phase diagram [93]. For state-of-the-art approximate methods, the most challenging low-temperature phase of the model appears to be the intermediate-interaction regime, around u/t = 4, due to the presence of many competing phases [93]. Accordingly, we will focus our analysis on this regime. If u = 4t then λ = 4Nt. An interesting and classically challenging-to-obtain accuracy (beyond the agreement of state-of-the-art numerical methods [93]) for this regime would be in the vicinity of ΔE ≈ t/100 [96]; these choices determine the T complexity via Eq. (63) and the ancilla count via Eq. (64). We summarize these resources for various interesting sizes of Hubbard model simulation in Table IV.

[TABLE IV. Resources for the Hubbard model of Eq. (56). The dimension of the system indicates how many sites (spatial orbitals) are on each side of the square model. The number of system qubits is thus twice the number of spatial orbitals. The number of logical ancillae is computed as Eq. (64). Finally, the number of T gates is computed using Eq. (63), which assumes that u/t = 4 and ΔE = t/100. The first three problem sizes in the table are near the classically intractable regime.]
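The repetition count in this regime follows from simple arithmetic: using the text's λ = 4Nt at u/t = 4 and a target of ΔE = t/100, the number of walk queries scales as λ/ΔE = 400N (multiplicative big-O constants omitted). A sketch of that arithmetic:

```python
def walk_query_scale(n_spin_orbitals, u_over_t=4.0, delta_e_over_t=0.01):
    """lambda/DeltaE with lambda = 4*N*t at u/t = 4 (as quoted in the
    text), expressed in units of 1/t; big-O constants are omitted."""
    assert u_over_t == 4.0, "lambda = 4*N*t is only quoted for u/t = 4"
    lam_over_t = 4.0 * n_spin_orbitals
    return lam_over_t / delta_e_over_t
```

For a 16 × 16 spinful lattice (N = 512 spin orbitals) this gives roughly 2 × 10^5 walk queries, each of which costs S = 10N + O(log N) T gates.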

D. Exploiting Locality in Simulations of Lattice Hamiltonians
Looking forward, another way that our circuits can be applied is to accelerate the recent Lieb-Robinson simulation method of [52]. Lieb-Robinson bounds reveal an intriguing fact about local Hamiltonians: interactions spread out in a light-cone similar in form to the causal diamonds used in relativity to indicate the regions of space-time that can have an impact on an event at a point in spacetime [97]. More specifically, Lieb-Robinson bounds show that information propagates at finite speeds (up to exponentially small error) in systems with nearest-neighbor interactions. The idea behind [52] is to exploit this structure to break up the evolution into sub-pieces that can be independently simulated, thus reducing the cost of simulation.
We formalize this by envisioning that we have a lattice of N sites, Λ, and a Hamiltonian that consists of terms that act upon these sites, H = Σ_{X⊆Λ} h_X. Here each h_X is local in the sense that if h_X and h_Y act on different sites in the lattice then [h_X, h_Y] = 0, and h_X only has support on sites that are within a constant Euclidean distance of each other. Note that this definition of locality also incorporates the terms within the Hubbard model. The final concept that we need to explain the method is that of distance between sites. We assume that for all X, Y ⊆ Λ, dist(X, Y) yields the minimum Euclidean distance between any two lattice points contained within the sets X and Y. For example, given a 1D lattice on 10 sites, dist({3, 4, 5}, {8, 9, 10}) = 3. The following lemma (a restatement of Lemma 6 in [52]) explains the impact that the locality imposed by the Lieb-Robinson bound has on simulation.
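The set distance used here can be written directly; a small sketch reproducing the 1D example above (sites are labeled by their coordinates, and the function name is ours):

```python
import math

def set_dist(X, Y, coords):
    """Minimum Euclidean distance between any site of X and any site of Y;
    coords maps a site label to its coordinate tuple."""
    return min(math.dist(coords[x], coords[y]) for x in X for y in Y)
```

For the 1D lattice on 10 sites, the closest pair between {3, 4, 5} and {8, 9, 10} is sites 5 and 8, giving distance 3.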
Lemma 3 (Patching Lemma). Let Λ be a lattice on N sites with a Hamiltonian H = Σ_{X⊆Λ} h_X where each h_X is a local bounded Hamiltonian for every X ⊆ Λ. Let A, B, C be subsets of Λ and, for integer q ≥ 1 and any sequence P : {1, . . . , q} → {A, B, C}, let H_{P_1···P_q} = Σ_{X⊆P_1∪···∪P_q} h_X (for example, H_{AB} = Σ_{X⊆A∪B} h_X). Then there are constants v ≥ 0, called the Lieb-Robinson velocity, and µ > 0 such that the evolution e^{−iH_{ABC}t} is approximated by the patched product e^{−iH_{AB}t} e^{iH_B t} e^{−iH_{BC}t} with error proportional to the boundary terms and decaying as e^{vt − µ dist(A,C)}.

Note that in the above terminology, (A ∪ B ∪ C) \ (A ∪ B) \ C is the boundary of the sets A∪B and C, meaning the set of all terms within the Hamiltonian that act on sites contained in both A or B and C. Lemma 3 is the core of the simulation method. The central idea behind the proof is to use the patching lemma recursively to break up the evolution into a product of evolution operators, each of which contains terms that act on one or two of the constituent subsets of sites in the problem. This is conceptually similar to a Trotter decomposition; however, as the error in this approximation can be made exponentially small by choosing the patches in Lemma 3 to be linearly far apart, the error can be controlled in a tighter fashion without requiring short time steps (unlike Trotter decompositions [22,98]). For example, given regions A, B, C, D, the evolution can be written as a patched product of evolutions on A∪B, B∪C, and C∪D, with inverse evolutions on the overlapping regions B and C. In order to achieve scaling that is polylogarithmic in 1/ε, the evolution of each patch needs to be simulated using a method with polylogarithmic scaling in 1/ε, such as the truncated Taylor series simulation result [23] or qubitization [52]. Our circuits can be used to optimize this result since qubitization remains the best way to simulate the evolution and our select and prepare circuits meet the requirements of qubitization oracles. This result is formally given as Theorem 1 of [52], a special case of which is restated here as Theorem 4 for convenience. We claim that our approach can be used to achieve O(N/ε) scaling for simulating the Hubbard model with nearest-neighbor interactions.
The Hubbard model satisfies the preconditions because each term in the Hamiltonian is local on the fermion lattice [52]. By using our constructions for the prepare and select circuits, we can reduce constant factors (and some log factors in T complexity) involved in the qubitized simulation, while saturating the O(τN) scaling of Theorem 4. We then choose τ ∈ O(1/λ) and apply phase estimation on the result. In order to estimate the eigenvalue to within error ε with high probability we need O(λ/ε) repetitions of the circuit. Multiplying the two results gives that the overall scaling for simulating such a Hubbard model is O(N/ε), as claimed.
This approach requires some follow-up work in order to determine exact T counts. Specifically, we need to implement a full qubitized simulation (rather than e^{−i arccos(H/λ)}). This transformation is known to be achievable with a polylogarithmically sized circuit [25]. While our work provides a highly optimized method for implementing the oracles needed in this process, more work remains to estimate the constant factors associated with this simulation.

VI. RESOURCE ANALYSIS FOR FAULT-TOLERANT IMPLEMENTATION
Throughout this work, we have focused on the number of T gates as the primary cost model of interest. The reasons for this are our focus on hardware consisting of a 2D nearest-neighbor coupled array of qubits, the intention to use the surface code [31-35], and the high relative overhead of T gates compared to all others in that context. In this section, we shall discuss the overhead of the complete algorithm in detail.
When using the surface code, each T gate is implemented by first preparing a magic state that is consumed during the gate. The gate is probabilistic: 50% of the time T† is actually applied instead of T, and when the gate implemented is not the desired one, an S gate must be inserted to correct it. Preparing T states requires a substantial amount of time and hardware, which we shall make precise below. In an effort to minimize the number of physical qubits required, we shall therefore only prepare a single T state at a time. We shall assume the availability of a correlated-error minimum-weight perfect-matching decoder [99] capable of keeping pace with 1 µs rounds of surface code error detection, and capable of delivering feedforward in 10-20 µs. We shall calculate the qubit and time overhead for physical gate error rates p = 10^-3 and p = 10^-4. The overhead shall be approximated by considering only the overhead of the Majorana operator circuit from Figure 9 and the data lookup circuit from Figure 10; it is expected that these circuits will account for over 90% of the total algorithm overhead. These circuits break down into a number of common pieces: compute ands, uncompute ands, naked CNOTs, and active subcircuits. The number of these pieces, in terms of the number of algorithm target qubits N, is shown in Table V.

[TABLE V. Breakdown of the various elements that make up the Majorana operator circuit from Figure 9 and the data lookup circuit from Figure 10. N is the number of spin orbitals in the system the circuits are being applied to.]

             compute ands | uncompute ands | naked CNOTs | subcircuits
  Figure 9:  N − 1        | N − 1          | 0.5 N       | 0.5 N
  Figure 10: 1            |                |             |
A surface code implementation of compute and (Figure 4) is shown in Figure 21. The regular geometric structure can be decomposed into plumbing pieces, namely cubic volumes each containing a single small light-colored cube. The compressed [36] version has a depth of 15 plumbing pieces. The circumference of each string-like structure (defect) is the surface code distance d, and the minimum separation of defects of the same color is also d. In the temporal direction (left-right), each unit of d is a round of error detection. In the spatial directions (the plane perpendicular to temporal), each unit of d corresponds to two qubits. Note that a single CNOT, after compression, takes a depth of 1 plumbing piece as drawn. The overhead of any algorithm ultimately needs to be expressed as some number of qubits (space) and seconds (time). A plumbing piece is a convenient device-physics- and code-distance-independent measure of space-time volume. As described above, the (5d/4)^3 cubic volume of a plumbing piece can easily be converted to qubits and seconds given a code distance d and a single-round error detection time.

[FIGURE 21. a) Surface code implementation of the compute and (Figure 4) circuit. The bottom pair of white lines (coming in from the back instead of from the left) represents the injection of a |T⟩ state. Each dark ring is a CNOT. Each dark U-shape labeled T* is a random T or T† gate (as determined by a measurement in a location not shown in the diagram connecting to the dark U-shape). The boxes labeled S* immediately following each T* are S or S† gates that are included if the random T or T† gate results in the incorrect gate. The final two boxes are Hadamard and S† operations, respectively. b) Compressed version. The final three boxes can be compressed to a single box as an arbitrary single-qubit Clifford can be performed inside using twists [100] and other techniques [101]. Distinct dark structures can be made to touch provided this occurs in a single place [36].]
A plumbing piece depth 8 surface code implementation of uncompute and (Figure 4) is shown in Figure 22. This could be compressed by performing the measurement differently so no initial Hadamard would be necessary, but the current surface code form is more easily identified with the original abstract form, and further compression is not necessary as the execution time of the algorithm, as we shall see, is limited by our serial preparation of T states.
An effective plumbing piece depth 5 surface code implementation of the Majorana operator active subcircuit is shown in Figure 23. The unusual pair-of-horns structure is to permit an uncompute and circuit to fit in before the final CNOT. The inner loop of the data circuit (Figure 10) is just a pair of single-control multiple-target CNOTs and a single additional CNOT. This can be implemented in plumbing piece depth 4 and is not shown.

[FIGURE 23. Surface code implementation of the Majorana operator active subcircuit from Figure 9, noting that the circuit has been somewhat modified to reduce its surface code spacetime volume. In particular, the controlled-Y operations from Figure 9 have been propagated through the controlled-Z operations, producing CNOT operations which are cheaper to perform. This creates phase error which must be corrected by an S gate on the control of the entire Majorana operator. Also not shown are initial Hadamard gates on every target qubit.]

[TABLE VI. Majorana operators (see Figure 9) must be applied to the entire system or else to half of the system. For electronic structure, there are three Majorana operators of size N per query to W (see Figure 14). For Hubbard, there are two Majorana operators of size N and one of size N/2 per query to W (see Figure 19). Column 6 estimates the number of times QROM lookups (see Figure 10) of size L = 3N/2 are performed. This does not occur in our Hubbard model circuits, but happens twice per query to W in our electronic structure circuits (once in prepare and once in prepare†; see Figure 15). The final column contains the maximum number of data qubits required at any point in the algorithm, which occurs while applying the Majorana operations in select. This number does not include space to prepare T states, which will be discussed separately.]

Preparing a T state is an involved process [102,103], shown for discussion purposes in Figure 24. The important features for our purposes are the fact that this can be tiled vertically (meaning in time) every 6 plumbing pieces, and that the whole structure occupies an area of 160 plumbing pieces. A significant amount of fast classical feedforward is required, as many T gates are potentially followed by S gates, and the paths of the connections from the first (small) level of distillation to the second (large) level must be determined based on which succeed. Our assumption of a 10-20 µs latency decoder is sufficient to make this work. We are interested in the overhead of solving instances of the electronic structure and Hubbard Hamiltonians discussed in prior sections. We must choose a target inaccuracy ε to fix the number of data logical qubits and gates required. To first order, the dependence of the gate count on ε can be ignored. We shall choose ε = 10^-3. Table VI summarizes the circuit input parameters we will study. Given Table V, Table VI, and the plumbing piece depths of the various circuit elements, we can calculate the total number of data plumbing pieces N_PP^data and hence the code distance required to ensure no more than a 1% chance of logical error in any data plumbing piece using p_L(d, p) ≈ 2d(50p)^{(d+1)/2} < 1/(100 N_PP^data). Similarly, knowing that the compute and circuit contains 4 T gates, and that no other part of the data or Majorana operator circuits contains T gates, we can calculate the total number of T gates N_T and hence the target T state error rate from distillation of 1/(100 N_T).
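The code-distance selection just described can be sketched directly from the heuristic p_L(d, p) ≈ 2d(50p)^{(d+1)/2}; the target rates used in the example below are illustrative, not the paper's tabulated instances.

```python
def logical_error_rate(d, p):
    """Heuristic per-plumbing-piece logical error rate p_L(d, p)."""
    return 2.0 * d * (50.0 * p) ** ((d + 1) / 2.0)

def required_distance(p, target, d_max=101):
    """Smallest odd code distance d with p_L(d, p) below the target rate."""
    for d in range(3, d_max + 1, 2):
        if logical_error_rate(d, p) < target:
            return d
    raise ValueError("no distance up to d_max suffices")
```

For instance, at p = 10^-3 a per-piece target of 10^-9 requires d = 17, while at p = 10^-4 the same target needs only d = 9, illustrating how an order of magnitude in physical error rate roughly halves the required distance.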
We also calculate the total number of T distillation plumbing pieces N_PP^T and a code distance to ensure the chance of T plumbing piece error is below 1/(100 N_PP^T). We have elected to keep algorithm error rates low to ensure on average only a few repetitions are required. For both p = 10^-3 and p = 10^-4 and all algorithm instances considered, T state distillation of the form in Figure 24 was sufficient to achieve the target logical error rate. This information is collectively sufficient to calculate the qubit and time overheads, shown for all cases in Table VII. The previous paragraphs described a manual overhead estimation method. There are a number of approximations that go into such an estimate, in particular the assumption that it will always be possible to route gates in 3D spacetime without overhead beyond that of where the data qubits are stored. In order to strengthen the relevance of the presented results, we have also used a software-automated approximation method. The software is an improved version of the tool from [104]. Automated overhead approximation starts from a Clifford+T representation of the circuit to be analyzed (e.g. Figure 8), and ends with a full surface code layout, having each gate translated into a corresponding configuration of plumbing pieces. Thus, the automated estimation work flow is similar to the manual one. However, certain circuit particularities were analyzed differently, and so similarities and differences between the two methods will be discussed.
The Clifford+T circuit is prepared according to a worst-case scenario, based on the available hardware restrictions, plumbing piece layout problems, and T gate correction mechanisms. The preparation of a single distilled T state at a time is a restriction which influences the resulting surface code layout: the Clifford+T gates have to be scheduled (laid out) in such a way that the T gates will be executed as soon as possible, but not earlier than the availability of distilled T states. The state distillation form in Figure 24 implies that a T gate can be executed on average every 6 plumbing pieces along the time axis. Additionally, T gate implementations are probabilistic and S gate corrections may be necessary. Thus, our scenario assumes that all T gates are followed by a corrective S gate, resulting in a synthetic increase of the Clifford+T circuit depth. Circuit preparation is followed by an optimization procedure, in which as many Clifford gates as possible are scheduled between two subsequent T gates. The software simulates the availability of the distilled T states and places T gates whenever their execution is possible. If no T states are available, the T gates are delayed, which later increases the approximated time overhead.
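The serial T-factory constraint described here amounts to a greedy scheduling rule; a minimal sketch (assuming, as the text does, one factory producing a state every 6 plumbing pieces of depth and never stalling):

```python
def schedule_t_gates(ready_times, period=6):
    """Fire the k-th T gate (in dependency order) no earlier than its data
    is ready and no earlier than time k*period, when its distilled state
    becomes available. Times are in plumbing pieces of depth."""
    return [max(ready, k * period)
            for k, ready in enumerate(sorted(ready_times))]
```

Three T gates whose data is ready immediately fire at times 0, 6, and 12 (factory-limited), whereas gates with late data dependencies fire as soon as their data is ready (circuit-limited), which is the delay effect the automated estimator accounts for.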
Finally, the Clifford+T circuit is translated into the surface code layout. From a resource estimation perspective, the complexity of this task is increased because the software currently includes the optimization strategy presented in Figure 21b only partially: the final boxes can be compressed to a single one, but distinct dark structures are not allowed to touch (for verification/debugging purposes). Due to this fact, the automated approximation used a slightly different Clifford+T realization of the compute and gate (cf. Figure 4), which had the advantage of being more suitable for automatic placement in stairway-structured circuits (i.e. the arrangement of and gates in Figure 9). Automatic placement of those Clifford+T subcircuits resulted in a shorter depth of the generated surface code layouts. Overheads of the two basic circuits in units of plumbing pieces can be found in Table VIII. This data, converted into qubits and time, can be found in Table IX. The automatically generated estimates are comparable to the manual ones, though generally slightly higher, exceeding the manual estimates by 10-20%. While the automated method is penalized by missing optimization strategies that are possible when analyzing circuits manually, and by the need to provide explicit communication paths for long-range gates, some of this penalty is canceled by using algorithmic methods too complex to perform manually. The fact that both approximation methods lead to such comparable qubit and time overheads strengthens our confidence in these estimates. The time estimates, in particular, are practically identical.

[TABLE VIII. Automatically generated resource estimates of the Majorana and QROM circuits (Figure 9 and Figure 10). The area width, height, and time columns give the dimensions of the bounding box (e.g. Figure 25) in units of plumbing pieces. The last column is the number of plumbing pieces estimated to be actively used within the bounding box. The volume numbers do not include the volume of the T factory, but do include idle qubits that are present in the algorithm as a whole but not the individual circuits. The "braided volume" refers to the amount of actively used volume, i.e. non-empty space with defects used to encode qubits and operations. QROM circuits are indexed like QROM_{3N/2} because in context the QROM index size L is always 50% larger than the number of orbitals N.]
The data are highly encouraging, with physical qubit counts of order a million and runtimes of hours for all but the largest cases considered. Significant further reduction is expected to be possible. For example, the $N$ qubits in the $|\psi\rangle$ register are operated on only by the Majorana operator circuit, and that circuit targets just one of these qubits at a time. This implies the remainder can be stored more compactly in square surface code patches while not being interacted with. This could easily reduce the overhead of these $N$ qubits, which account for 70-80% of the physical qubits, by a factor of six, and would conservatively lower the overall physical qubit requirements by a factor of two.

TABLE IX. Automatically generated qubit and time overheads of the general chemistry and Hubbard circuits assuming gate error rates of $p = 10^{-3}$ and $p = 10^{-4}$, a 2D array of nearest-neighbor coupled qubits, and a surface code error detection cycle time of 1 µs. The execution time being estimated is the duration of one complete run of the phase estimation process.
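The factor-of-two claim follows from simple arithmetic, which can be checked directly (the numbers below are the illustrative figures from the text, not additional measurements): if a fraction $f$ of the physical qubits is compressed $r$-fold, the overall count shrinks by $1 / ((1-f) + f/r)$.

```python
def overall_reduction(f, r):
    """Overall qubit-count reduction when a fraction f of qubits is compressed r-fold."""
    return 1.0 / ((1.0 - f) + f / r)

# With the |psi> register at 70-80% of the physical qubits and a sixfold
# compression, the overall count drops by between 2.4x and 3.0x, so the
# factor of two quoted in the text is indeed conservative.
low = overall_reduction(0.70, 6)   # = 2.4
high = overall_reduction(0.80, 6)  # = 3.0
```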

VII. CONCLUSION
In this work we introduced especially efficient fault-tolerant quantum circuits for using phase estimation to estimate the spectra of electronic Hamiltonians. Unlike past work, which has focused on realizing phase estimation unitaries encoding $e^{-iH\tau}$, corresponding to time evolution under $H$ for duration $\tau$, we focused on a recent idea that one might more cheaply realize phase estimation unitaries encoding the quantum walk $e^{i \arccos(H/\lambda)}$, where $\lambda$ is a parameter closely related to the induced 1-norm of the system Hamiltonian [26,27]. We constructed explicit quantum circuits realizing this quantum walk with T complexity linear in basis size for both the planar Hubbard model and electronic structure Hamiltonians in second quantization. We showed that phase estimation projects these systems to an eigenstate and estimates the associated eigenvalue to within additive error $\epsilon$ by querying the quantum walk operator an optimal number of times scaling as $O(\lambda/\epsilon)$. To accomplish this we introduced general techniques, which we conjecture are near-optimal, for streaming bits of a unary register and for implementing a quantum read-only memory. We also introduced a new form of Heisenberg-limited phase estimation specialized to simulations based on linear combinations of unitaries, and provided bounds on T complexity and ancilla count which remain tight even at small finite sizes.
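The spectral relation underlying this approach is the standard qubitization identity (see [26,27]); we restate it here only to make the eigenvalue readout explicit. For each eigenpair $H|k\rangle = E_k|k\rangle$, the walk operator $W$ acts on an invariant two-dimensional subspace with eigenvalues

```latex
W \, |\psi_k^{\pm}\rangle
  = e^{\pm i \arccos\!\left(E_k/\lambda\right)} \, |\psi_k^{\pm}\rangle ,
```

so estimating an eigenphase $\phi_k$ of $W$ recovers the energy as $E_k = \lambda \cos\phi_k$, with no Trotter-style approximation of $e^{-iH\tau}$ required.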
In addition to providing explicit Clifford + T circuits we compiled the bottleneck components of these simulations to fault-tolerant surface code gates in order to rigorously determine the resources that would be required for error correcting interesting problem instances. We performed this compilation both by hand and by using automatic tools and found similar overheads in both cases. We found that classically intractable instances of jellium and the Fermi-Hubbard model could be simulated with under one hundred million T gates and would require about one million physical qubits in the surface code with two-qubit error rates on the order of $10^{-3}$. At error rates of $10^{-4}$, about an order of magnitude fewer physical qubits would be required. We also priced out simulations of realistic solid-state materials such as diamond, graphite, silicon, metallic lithium and crystalline lithium hydride and found that only slightly more than one billion T gates and a few million physical qubits would be required. Despite focusing on different systems, our results are most readily comparable to the previous state-of-the-art results from [45]. Even though [45] sought empirical estimates of the T complexity rather than rigorous upper bounds as we did, they estimated that approximately $10^{15}$-$10^{16}$ T gates would be required for a 108-qubit simulation of the FeMoco molecule active space. By comparison, our upper bounds on the T complexity required to solve the classically intractable electronic structure problems studied here were roughly a million times less. Because our simulations require only a few times more physical qubits than is required by a single T factory, it is reasonable to expect that the simulations we outline here will become practical on the first universal fault-tolerant quantum devices, many years before the simulations discussed in [45] would be viable.
Several important directions for future research pertain to the extension of these simulation techniques to representations that would be more effective for single molecules. While the dual basis described in [43] is well suited to treating solid-state materials such as the ones explored here (e.g. jellium, solid-state silicon, graphite, diamond, lithium and lithium hydride), by combining our techniques with the "Gausslet" basis sets of [48] we should also be able to simulate single molecules with resolution similar to that of Gaussian orbitals, thus extending our results to systems such as FeMoco with overheads similar to those observed in this work. However, deploying the Gausslet basis functions for systems with large atomic nuclei such as iron (as in FeMoco) will require further research. If basis errors are a concern, then future work should combine results from this paper, [105] and [26] in order to determine the cost of encoding first-quantized electronic spectra in quantum circuits; in first quantization, basis errors are suppressed exponentially in the number of qubits used to represent the system. Another remaining challenge is to compute a tighter upper bound on the number of physical qubits required by the algorithm. At the first moment that it becomes technologically possible to distill magic states, the number of physical qubits available on one machine will still be extremely limited, and getting a meaningful computation to fit at all will be difficult. Fortunately, the qubit count estimates of this paper were fairly conservative; we only explored surface code constructions which we were able to validate and implement in software, and these constructions are not necessarily optimal. For example, the logical qubit representation used in lattice surgery [106] requires fewer physical qubits than the double-defect representation used in the estimates of this paper.
Furthermore, there are several places in our circuits where we preferred small multiplicative improvements in T count over small additive improvements in logical qubit count. For example, we delay uncomputing QROM lookups in order to avoid recomputation, and, when performing phase estimation, we minimize the number of oracle queries by using a full-size phase register instead of a single phase qubit. Since we have shown that, with error rates of $10^{-3}$, one can solve interesting problems in chemistry using on the order of a million physical qubits within the surface code, a natural next goal would be to further reduce the required resources to the order of a hundred thousand physical qubits.