Fault-tolerant resource estimate for quantum chemical simulations: Case study on Li-ion battery electrolyte molecules

We estimate the resources required in the fusion-based quantum computing scheme to simulate electrolyte molecules in Li-ion batteries on a fault-tolerant, photonic quantum computer. We focus on molecules that can provide practical solutions to industrially relevant problems. Certain fault-tolerant operations require the use of single-qubit "magic states" prepared by dedicated "magic state factories" (MSFs). Producing and consuming magic states in parallel is typically a prohibitively expensive task, resulting in the serial application of fault-tolerant gates. However, for the systems considered, the MSF constitutes a negligible fraction of the total footprint of the quantum computer, allowing for the use of multiple MSFs to produce magic states in parallel. We suggest architectural and algorithmic techniques that can accommodate such a capability. We propose a method to consume multiple magic states simultaneously, which can potentially lead to an order of magnitude reduction in the computational runtime without additional expense in the footprint.


I. INTRODUCTION
Quantum computers are capable of carrying out computational tasks that are intractable for classical computers, such as integer factorization [1], combinatorial optimization [2], simulation of interacting quantum many-body systems [3,4], and quantum chemistry [5].
One of the most anticipated applications of quantum computers is first-principles quantum chemical simulation of complex molecules [6]. While such studies can be performed on classical computers as well, the computational cost typically scales exponentially unless one adopts certain assumptions and approximations [7,8]. Quantum computers are expected to enable quantum chemical simulation from first principles, with fewer assumptions and approximations, and thus to accurately model the properties of various molecules. In turn, such studies may lead to new insights into those molecules that are otherwise difficult to obtain.
Recently, there have been continued innovative developments in quantum algorithms for quantum chemistry [9][10][11][12][13][14][15][16][17][18], which have led to many orders of magnitude reduction in the resources required to simulate molecules at accuracies beyond the reach of classical computation. These developments have been driven by a focus on a few molecular systems that take part in complex reactions that underlie high-impact open problems in chemistry (e.g., elucidating the mechanism of nitrogen fixation at FeMoco active sites [10,19]). Importantly, the cost reduction techniques found in these studies are expected to be applicable to other chemical systems as well.
Thus, it is of interest to understand the cost of these state-of-the-art algorithms when applied to a broader class of industrially relevant quantum chemistry problems and their corresponding molecular systems. In this article, we propose one such concrete application: the study of electrolyte molecules for Li-ion batteries. In conventional Li-ion batteries, the liquid electrolyte plays an important role in determining the performance and the stability of the batteries. Quantum chemical simulation of the underlying constituent molecules can help us better understand electrochemical reactions occurring in the liquid electrolytes, and serve as a useful guide in establishing design principles for better electrolytes.
Realizing the promise of quantum computation to accurately simulate complex chemical systems like Li-ion battery electrolyte chemistries will require devices capable of executing hundreds of billions of operations on thousands of logical qubits. Currently, a huge technological effort is underway to build a fault-tolerant quantum computer. As attention in the field turns toward engineering noise-resilient devices capable of performing such large computations, it is increasingly important to move beyond hardware-agnostic resource estimates in terms of qubits and gates, and to account for the overhead of fault-tolerance.
In that spirit, we first describe resource estimates in terms of architecture-agnostic parameters like the number of qubits and gates, then compile these hardware-agnostic parameters into the standard primitives of our fault-tolerant architecture, culminating in architecture-specific size and runtime estimates. The fully-compiled resource estimate is markedly different from prior studies, in that it makes use of the recently introduced fusion-based quantum computation (FBQC) scheme [20,21].
FBQC is a quantum computing paradigm in which the fundamental building blocks of the computation are resource states; joint, entangling measurements called fusions; and feed-forward operations. For a conceptual overview of FBQC, see Appendix A. A promising technology that can realize these operations in practice is photonics [20]. Physically, the resource states can be created using a resource state generator (RSG), a gadget that emits the resource states encoded in a finite number of photons [20]. The requisite fusion measurements can be performed in a variety of ways using only linear optical elements and photon-detection [20,22].
A salient feature of the photon-based FBQC scheme is the possibility of interleaving [21]. Photons can be stored in an optical delay line of large length without significantly degrading the quality of the photonic qubits. This leads to a compelling ability to easily trade off between the footprint of the device and the runtime of the computation, potentially leading to a more modest engineering overhead in building a large-scale quantum computer. Naturally, the overall cost of the algorithm will greatly depend on the length of the delay line.
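To make the space-time trade-off concrete, the following sketch (with hypothetical baseline numbers, not taken from our estimates) shows how an interleaving ratio rescales footprint and runtime while conserving their product:

```python
import math

def interleave(n_rsg: int, t_hours: float, ratio: int) -> tuple[int, float]:
    """Trade footprint for runtime at a given interleaving ratio.

    A delay line that stores `ratio` resource states lets one RSG play
    the role of `ratio` RSGs, shrinking the footprint by `ratio` while
    stretching the runtime by the same factor (error rates degrade
    negligibly for ratios up to ~1000, per Ref. [21]).
    """
    return math.ceil(n_rsg / ratio), t_hours * ratio

# Hypothetical baseline: 1.2 million RSGs running for 10 hours.
rsg_1, t_1 = interleave(1_200_000, 10.0, 1)        # no interleaving
rsg_100, t_100 = interleave(1_200_000, 10.0, 100)  # 100x smaller, 100x slower
print(rsg_100, t_100)
```

The space-time volume (RSG-hours) is unchanged by the trade-off, which is why the right panel of Fig. 1 shows straight lines.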
Taking into account the distinctive features of FBQC, we perform a detailed analysis of the resources required to simulate a variety of molecules relevant to Li-ion electrolyte chemistry on a fault-tolerant quantum computer. The result of this analysis is succinctly summarized in Fig. 1.
The scope of this article is two-fold: after first providing concrete, fully-compiled resource estimates on the cost of simulating the aforementioned chemical systems, we study the degree to which considerations of parallelization at both the algorithmic and architectural level can reduce the overall resource requirements of these algorithms. Certain fault-tolerant operations require the use of single-qubit "magic states" prepared by dedicated "magic state factories" (MSFs) [23]. According to our resource estimates, having an MSF that produces magic states serially is suboptimal for simulating the systems under consideration. For small quantum algorithms (i.e., few qubits and gates), the footprint of the factory constitutes a substantial fraction of the quantum computer; see Refs. [24][25][26] for recent attempts to optimize the factory. However, as the number of logical qubits needed for the algorithm increases, the relative size of the MSF becomes smaller. In that regime, the (relative) extra cost of adding another magic state factory is small; while it has been known that this happens eventually as the number of logical qubits increases (for instance, recent studies [14,27] have considered multiple MSFs in their estimates), the exact point where this crossing occurs has remained unaddressed in the literature. We fill this gap by performing a detailed study; the answer, which depends on the number of qubits and the T-gate count, is plotted in Fig. 1. Our study suggests that there are quantum algorithms with practical applications that lie outside the oft-assumed regime in which the size of the MSF is relatively large.
For these "intermediate"-sized fault-tolerant quantum computations, it is beneficial to use multiple MSFs, and to use them efficiently. We provide algorithmic and architectural adjustments that can accommodate the injection of multiple distilled magic states whilst keeping the Clifford gate count relatively small. On the algorithmic front, we perform a detailed analysis of the cost of parallelizing some of the key subroutines in quantum chemistry simulation algorithms. While many of the results we summarize in this paper are well-known, some are new. For example, we introduce a novel method to apply a specific sequence of Givens rotations used in fermionic basis changes in logarithmic depth (a proof is provided in Appendix E). On the architectural side, our work builds upon the scheme of Litinski [25], an architecture that can implement an arbitrary quantum algorithm using n_T T-gates in O(n_T d) time, where d is the code distance of the individual logical qubits, independent of the number of Clifford gates in the algorithm. We propose a further optimization of this architecture, leading to a computation time of O(n_T d/m) for a variable parameter m, while maintaining the code distance and without increasing the footprint, assuming that (i) m is small compared to d and (ii) the circuit is structured in such a way that m T-gates can be applied simultaneously in the same time step.
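The runtime scalings above can be sketched in a few lines (a simplified timing model of our own, not the full architecture of Appendix D):

```python
import math

def runtime_cycles(n_t: int, d: int, m: int = 1) -> int:
    """Logical clock cycles to consume n_t T-gates, at d cycles per
    lattice surgery step, with up to m magic states injected per step.

    m = 1 recovers the serial O(n_t * d) cost of Litinski's scheme;
    m > 1 assumes the circuit exposes m simultaneously applicable
    T-gates and that m is small compared to d.
    """
    return math.ceil(n_t / m) * d

# Hypothetical example: 10^9 T-gates at code distance d = 25.
serial = runtime_cycles(10**9, 25)          # one T-gate per step
parallel = runtime_cycles(10**9, 25, m=10)  # ten T-gates per step
print(serial // parallel)  # -> 10x speedup
```

This is the source of the "order of magnitude" runtime reduction quoted in the abstract: it is linear in m for circuits with enough T-gate parallelism.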
We emphasize that while the explicit resource estimates of RSG count and runtime only apply to the FBQC paradigm, the results in other sections (particularly on the size of the magic state factories) apply more generally to any surface code-based architecture. The parallelization techniques detailed in this article may reduce the runtime requirements by an order of magnitude; however, while the numerical results were based on a Monte Carlo simulation, the performance of our new architectural proposal that uses multiple MSFs has not been simulated in detail. We leave this to future studies.

II. MATERIALS AND METHODS
We first describe the Li-ion battery chemistry use-case and the molecules under consideration, and detail the classical computational pipeline required to specify the problem and the quantities of interest. We then describe the quantum algorithms we use to calculate these quantities, and quantify the resources required to implement them using a fault-tolerant cost model. Finally, we detail how we convert these hardware-agnostic estimates into architecture-specific size and runtime estimates.

A. Battery chemistry use-case
FIG. 1. (left) Ratio of magic state distillation (MSD) footprint to total computational footprint for different numbers of logical qubits and T-count. Footprint is measured in terms of the number of RSGs required. We assume a linear data block with two multi-level 15-to-1 factories as depicted in Fig. 3. For comparison, we plot the resource estimates for other algorithms: simulation of the Fermi-Hubbard model [63]; of crystalline materials [64]; of FeMoco [14]; and breaking RSA encryption [27]. Where necessary, resource estimates for quantum chemistry algorithms have been amended to produce eigenenergies to within chemical accuracy. We assume logical error rate parameters A = 0.45, B = 1.35 (i.e., the average of the two regimes considered in Section III C), with magic states initially prepared with a logical Pauli error rate of 0.1%. (right) Footprint and time estimates to perform fault-tolerant quantum computations for various molecules in the cc-pVTZ and cc-pVDZ bases, based on Table IV. Using interleaving, we can linearly trade off space and time resources, as displayed. We plot resource estimates for a range of interleaving ratios between 1 and 1000, as such ratios can be achieved while introducing negligibly higher error rates [21]. We assume logical error rate parameters A = 0.5, B = 1.6, with magic states initially prepared with a logical Pauli error rate of 0.1%.

In conventional Li-ion batteries, the liquid electrolyte plays an important role in determining the performance and stability of the batteries [28][29][30]. Electrolytes consist of three components: solvent molecules like ethylene carbonate (EC), conducting-salt molecules like LiPF6, and additive molecules like fluoroethylene carbonate (FEC).
The solvent is the host for the salt and the additive, and thus needs to have high solvation, low viscosity, inertness, and safety. Conducting-salts dissociate into ions in the solvent, and these ions carry electricity and thus need to have sufficient solubility into solvent and high stability against side reactions with other chemical components. Additives are substances added to electrolytes to enhance specifically targeted properties, and need to provide durable performance in the long term.
Quantum chemical simulation can be used to understand the electrochemical reactions that occur among these constituent molecules, and thus aid in designing better electrolytes [31][32][33]. As an example, suppose we aim to elucidate the effect of additives on Li-ion solvation [34]. Although the entire process involves several intermediate steps and may evolve along different reaction paths, Li-ion solvation ultimately requires Li to dissociate from salt molecules. Quantum chemical simulation can be used to calculate the dissociation energy of the salt molecules in two different environments: one with solvent only, and another with solvent as well as additives. In the classical chemical regime, we are interested in determining the reaction rate, a quantitative proxy for the relative dominance of two chemical reactions. This quantity is proportional to the Boltzmann factor of the change in enthalpy of the two reactions. Thus, the calculated Li-dissociation energies can be used to determine, quantitatively, which environment is more favorable for both the Li-dissociation and the solvation.
While quantum chemical simulation can be carried out from first principles (solving Schrödinger's equation without approximations), it is common to adopt approximations and assumptions because of the high computational expense originating from the exponentially-scaling complexity of a first-principles approach. Although such approximations and assumptions have been applied in practice to a variety of chemical studies at a compromised accuracy, there is a fundamental limit both to the accuracy of simulation results and to the spectrum of chemical systems that can be simulated. In particular, as the focus of chemical studies continues to shift to smaller scales and to include more comprehensive descriptions of the environments that the chemical systems under study inhabit, demand for more precise simulations consistently increases. Quantum computers are believed to mitigate the exponentially-scaling computational complexity of certain problems and thus enable quantum chemical simulations in a full configuration-interaction scheme.
In this study, we take EC, LiPF 6 , and FEC to represent the solvent, conducting-salt, and additive molecules, respectively. We estimate the computational resources required to study their interactions with Li ions via quantum chemical simulation. Our proposed scenario is to perform the geometric optimization using density functional theory (DFT) calculations classically, and then calculate the energy via quantum computation. We consider quantum phase estimation (QPE) for the quantum part of the computation; in particular, we consider recently developed quantum algorithms [13,14].

Chemical accuracy
Here we comment on the required accuracy for the energy calculations. Ultimately, the quantity we are interested in is the reaction rate, which is proportional to e^(−βΔE), where β is the inverse temperature and ΔE = E_1 − E_2 is the difference between two energies E_1 and E_2. When the uncertainty in ΔE is ≈ 1.36 kcal/mol at room temperature (β^(−1) ≈ 0.5922 kcal/mol), computed and experimentally determined reaction rates differ by an order of magnitude. For this reason, the desired so-called "chemical" accuracy for calculating ΔE is historically taken to be approximately 1 kcal/mol [35]. We further note that while one would like to calculate ΔE, we cannot calculate this difference directly; the algorithm at our disposal (QPE) computes the individual energies E_1 and E_2, not ΔE. To achieve chemical accuracy in the calculation of ΔE, given only the ability to calculate individual energies, one must in principle compute E_1 and E_2 to chemical accuracy due to propagation of errors. For instance, suppose E_2,calc = E_2,true + E_2,error at the initial state of a chemical reaction and E_1,calc = E_1,true + E_1,error at the transition state. Then, the energy difference is calculated to be

ΔE_calc = (E_1,true − E_2,true) + (E_1,error − E_2,error) = ΔE_true + (E_1,error − E_2,error).

It is common to assume that E_2,error is similar to E_1,error, so that the errors cancel; this is a reasonable assumption for some molecules (mostly oxides and fluorides), but not for others. While our battery chemistry use-case may not explicitly require the calculation of the individual energies to chemical accuracy, as we mention above, we aim to push the precision achievable with computational methods via quantum computation to match (or even exceed) experimental accuracy, and so we opt to calculate each individual energy to high accuracy, namely 1 mHartree.
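A quick numerical check of these figures (the unit conversion 1 mHartree ≈ 0.6275 kcal/mol is a standard value, not stated in the text):

```python
import math

KT_ROOM = 0.5922   # beta^-1 at room temperature, kcal/mol (from text)
DELTA_E_ERR = 1.36  # uncertainty in Delta E, kcal/mol (from text)

# With rates k ~ exp(-beta * E), an uncertainty of ~1.36 kcal/mol in
# Delta E changes the predicted rate by about one order of magnitude.
rate_ratio = math.exp(DELTA_E_ERR / KT_ROOM)
print(f"rate ratio = {rate_ratio:.2f}")  # close to 10

# Worst-case error propagation: computing E1 and E2 separately, their
# errors add. At 1 mHartree per energy, the worst-case error in
# Delta E is about 2 mHartree, i.e. still near chemical accuracy.
MHARTREE_KCAL = 0.6275
worst_case_delta_err = 2 * MHARTREE_KCAL
print(f"worst-case Delta E error: {worst_case_delta_err:.3f} kcal/mol")
```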

Active spaces
Here we note that unlike most studies on quantum chemical simulation via quantum computation, we do not consider any active spaces; i.e., we consider all electrons, including core electrons. We note that the core electrons are not strictly necessary to capture the chemistry we are interested in. Further, it would not necessarily be appropriate to choose an active space, since these molecules do not exhibit strong correlations. However, our aim is to relax several assumptions and approximations typically made in classical computational methods, and to estimate the cost of QPE when pushing the boundary of quantum chemical simulations of molecules, applying some of the hardest conditions for the simulation of important quantum chemical problems (i.e., including all electrons, even core electrons that one may not typically treat).

B. Molecules and computational description
Here we briefly describe the computational specification of the electrolyte molecules under study. The structures of the EC, PF6−, and FEC molecules mentioned above in Section II A are illustrated in Fig. 2. Note that PF6− is in an anionic state while the others are in a neutral state. In addition to these molecules, their variants with an additional Li attached (LEC, LiPF6, and LFEC, respectively) and an EC variant with Li substituted (LREC) are also examined, considering potential reactions of electrolyte molecules with Li ions. The geometric optimization of these molecules is performed via a DFT calculation to obtain the optimal positions of the atoms. The generalized gradient approximation is applied using the Perdew-Burke-Ernzerhof parametrization [36], as implemented in the Vienna Ab-initio Software Package [37][38][39][40]. A cutoff energy of 520 eV is used, and the k-point mesh was adjusted to ensure convergence of 1 meV per atom. The molecules are placed in an empty 15 × 15 × 15 Å³ supercell, and the volume and shape of the supercell are fixed during the relaxation. See Tables I and II for the atomic coordinates of these molecules after geometric optimization. Note that the plane wave basis set is used only for the purpose of the geometric optimization. When the computational resources required for the quantum computation are estimated, Gaussian-type basis sets are considered.

C. Molecular Hamiltonian
Once the geometric optimization is complete, we obtain a fermionic Hamiltonian in second-quantized form, which can be written as

H = Σ_{ij,σ} t_ij a†_iσ a_jσ + (1/2) Σ_{ijkl,σσ′} h_ijkl a†_iσ a_jσ a†_kσ′ a_lσ′,    (1)

where h_ijkl and t_ij are real numbers, σ, σ′ ∈ {↑, ↓} are spins, and i, j, k, and l are the indices of the molecular orbitals. Here a†_iσ and a_iσ are fermionic creation/annihilation operators, obeying the canonical anticommutation relation {a_iσ, a†_jσ′} = δ_ij δ_σσ′. Note that the molecular orbitals are expressed in a basis set. In this study, the STO-3G, DZ, 6-311G, cc-pVDZ, and cc-pVTZ basis sets are considered to represent various accuracies. We also assume the simulations are carried out in an open-shell scheme if the molecular system to be simulated contains unpaired electrons. In Ref. [13], the two-electron tensor h_ijkl over N orbitals is factorized by performing two levels of eigendecompositions, as follows:

h_ijkl ≈ Σ_{r=1}^{R} L^(r)_ij L^(r)_kl,    L^(r)_ij = Σ_{m=1}^{M_r} λ^(r)_m u^(r)_im u^(r)_jm,

where R is the rank of the first-level decomposition and M_r (with average rank M) is the rank of the second-level decomposition of the r-th term; see Table III. We note that the values of R, M, and the one-norm α calculated for each molecule and basis set considered in this article are larger than those provided in Refs. [13,14] for FeMoco.
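To make the two-level structure concrete, here is a toy numpy sketch of our own (not the optimized truncation procedure of Ref. [13]): the matricized tensor is eigendecomposed to rank R, and each resulting one-body matrix L^(r) is eigendecomposed again.

```python
import numpy as np

rng = np.random.default_rng(7)
N, R_true = 6, 3  # toy number of orbitals and true rank

# Synthesize a two-electron tensor with the symmetries assumed by the
# double factorization: h_ijkl = sum_r S_r[i,j] * S_r[k,l].
S = rng.standard_normal((R_true, N, N))
S = S + S.transpose(0, 2, 1)             # each S_r symmetric
h = np.einsum('rij,rkl->ijkl', S, S)

# First level: eigendecompose the N^2 x N^2 matricization of h.
W = h.reshape(N * N, N * N)
w, V = np.linalg.eigh(W)
order = np.argsort(-np.abs(w))
R = int(np.count_nonzero(np.abs(w) > 1e-10))   # numerical rank (= R_true)
L = [np.sqrt(w[r]) * V[:, r].reshape(N, N) for r in order[:R]]

# Second level: eigendecompose each symmetric L^(r), rank M_r each.
factors = []
for Lr in L:
    lam, U = np.linalg.eigh(Lr)
    keep = np.abs(lam) > 1e-10                 # second-level truncation
    factors.append((lam[keep], U[:, keep]))

# Reconstruct and check the factorization error.
h_rec = np.zeros_like(h)
for lam, U in factors:
    Lr = (U * lam) @ U.T
    h_rec += np.einsum('ij,kl->ijkl', Lr, Lr)

err = np.max(np.abs(h - h_rec))
print(f"rank R = {R}, max reconstruction error = {err:.2e}")
```

With no truncation the reconstruction is exact to numerical precision; in practice the ranks R and M_r are truncated, trading a controlled error in the Hamiltonian for lower quantum resource costs.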

D. Quantum algorithm description
The main goal of the quantum component of the computation is to estimate the ground state energy of Eq. (1) for each molecule and basis set. The dominant approach in the literature is to use quantum phase estimation (QPE), which, given an eigenstate of the Hamiltonian describing a system, H, and a unitary encoding of a function of H, produces the corresponding eigenenergy.
For the unitary input, we choose a modern method known as qubitization [41]. The dominant part of the quantum computation lies in the qubitization subroutine, and as such, optimizing this part of the computation is crucial (see Appendix B). Recently, methods to utilize the sparsity structure of the tensor h ijkl have shown great promise [12][13][14]. These developments have led to several orders of magnitude reduction in the computational cost for studying the FeMoco Hamiltonian, the oft-cited "poster-child" high-impact application of quantum computation [10]. Our work builds upon these important recent developments.
Specifically, we adopt the approach taken in Ref. [13] to estimate the computational cost of estimating the ground state energy of the molecules described above. We note that other methods, e.g., those described in Refs. [12,14], may lead to a lower overall computational cost. We leave those studies for future work.

Ground state approximation
One may be concerned about the necessity of preparing an eigenstate of the Hamiltonian as input to QPE; indeed, in practice, for most molecules of interest it is infeasible to prepare an exact eigenstate of H, and so one must instead prepare a trial ansatz wavefunction with non-negligible overlap with the true eigenstate (and in particular, the ground state). Concerns over the viability of preparing such an ansatz efficiently are encapsulated by the "orthogonality catastrophe", the observation that the overlap between the true ground state and some ansatz wavefunction decreases exponentially as the system size increases [42]. However, it has been shown that there are sophisticated classical methods for preparing trial wavefunctions with sufficient overlap, for states of up to O(100) orbitals, using simple-to-prepare states such as the single Slater determinant obtained from the Hartree-Fock method (alternatively, methods for multideterminant state preparation can be used) [43]. It should be noted that the resource counts for the number of qubits used in the algorithm presented in this article (shown in Table IV) include not only the system qubits, whose state represents the wavefunction of the physical system, but also a plethora of ancillae. This state preparation routine only needs to act on the small subset of qubits that represent the system, whose number equals the number of orbitals (only O(10)−O(100) for all molecules and basis sets; see Table III).

E. Fault-tolerant quantum algorithm cost model
The cost of a quantum computation depends not only on the algorithm used, but also on the gate set used in the quantum computer and the cost of each gate. We consider the Clifford + T gate set. For our purposes, it will suffice to quantify the cost of the algorithm in terms of the following three quantities: (i) T-count, n_T; (ii) T-depth, D_T; and (iii) the number of logical qubits, n_L. T-count is the number of T-gates used in the quantum circuit, T-depth is the number of layers of commuting T-gates, and n_L is the number of logical qubits that appear in the circuit description of the algorithm. We treat the cost of Clifford gates as effectively zero: as detailed in Appendix D 4, after compiling the quantum circuit in a certain way, the requisite cost of the Clifford gates is negligible.
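These three quantities can be read off a circuit description directly. The toy counter below (a hypothetical layered-circuit representation of our own, not the compiled circuits of this work) illustrates the definitions:

```python
# Count n_T (T-count) and D_T (T-depth) from a circuit given as a list
# of layers, each layer a list of gate labels applied in parallel.
def t_costs(layers):
    n_t = sum(g in ("T", "Tdg") for layer in layers for g in layer)
    d_t = sum(any(g in ("T", "Tdg") for g in layer) for layer in layers)
    return n_t, d_t

# Example: three layers; Clifford gates (H, S, CX) are treated as free.
circuit = [["H", "T", "T"], ["CX", "S"], ["Tdg", "T"]]
n_t, d_t = t_costs(circuit)
print(n_t, d_t)  # 4 T-gates spread over 2 T-layers
```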
The parameters used to determine the scaling of these quantities depend greatly on the Hamiltonian preprocessing method used. In our case, using the double factorization procedure (detailed in Section II C), these quantities are determined by four parameters: (i) the number of orbitals, N; (ii) the first-level rank R and (iii) the second-level rank M that appear in the truncation procedure; and (iv) the one-norm of the Hamiltonian after the truncation, α. See Table III for the calculated values of these parameters for the molecules introduced above.
Computational volume When considering the overall cost of an algorithm, it is useful to consider its computational volume; that is, a proxy for the space required to encode the problem of interest, multiplied by the time it takes to execute the algorithm. The space roughly corresponds to the number of logical qubits n_L, whereas the runtime roughly corresponds to the number of time slices needed to perform the gates in the computation. We typically define the overall volume to be V_n := n_L n_T. However, in circumstances where magic state factories occupy a small percentage of the overall footprint, thus allowing us to potentially increase the number of magic state factories on the quantum computer, we may instead consider V_D := n_L D_T. Viewed in this way, V_n sets an upper bound on the volume of the algorithm, as it assumes the serial application of T-gates, and V_D sets a lower bound on the volume, assuming that one can perform multiple T-gates in parallel.
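The gap between the two bounds is just the ratio n_T / D_T, i.e. the average T-gate parallelism of the circuit. A minimal sketch with hypothetical counts (not taken from Table IV):

```python
# Upper and lower bounds on computational volume (hypothetical counts).
n_L = 3_000        # logical qubits
n_T = 5 * 10**9    # T-count
D_T = 10**9        # T-depth (D_T <= n_T by definition)

V_n = n_L * n_T    # serial T-gate application: upper bound
V_D = n_L * D_T    # fully parallel T-layers: lower bound
print(f"volume gap V_n / V_D = {V_n / V_D:.0f}x")
```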

F. Fault-tolerant overheads
Here we detail the method by which we convert the algorithmic-level costs into architecture- and hardware-level costs. We take into account the overhead due to quantum error correction and fault-tolerance, and detail the metrics and primitives specific to our fault-tolerance approach. We utilize the recently-introduced fusion-based quantum computing (FBQC) [20], a universal model of quantum computation particularly suited to photonic quantum computation (for a brief introduction, see Appendix A). For the computations detailed in this article, we use FBQC to implement the more familiar surface-code type topological quantum error correction, and take our fault-tolerant operations to be lattice surgery operations on patches of surface-code encoded qubits (as we note in Section III 1, many of the methods discussed here and the results that follow are general in nature, and are not features native exclusively to the FBQC paradigm).
Fault-tolerance overhead metrics The primitive building blocks of a fusion-based quantum computer are resource-state generators (RSGs), devices that generate resource states at a fixed clock rate f_RSG. For simplicity, we consider resource states made up of 6 qubits in a ring, called the "6-ring" fusion network, which has been shown to be able to perform surface-code type error correction [20]. The relevant metrics for the size of the computation are the number of RSGs n_RSG and the number of RSG cycles n_cycles required to implement the quantum algorithm of interest. From n_cycles we can determine the total algorithm time t_algo by dividing by the clock rate f_RSG.
TABLE IV. T-count n_T and qubit count n_L for each molecule and basis set that minimize the computational volume n_T × n_L. Note that these counts assume the serial application of magic states and do not make use of any of the parallelization techniques discussed in Section III.

To estimate the FT overhead, the algorithmic parameters we require are the T-gate count n_T and the (logical) qubit count n_L. We set a tolerable failure rate for the entire computation, ϵ_total; this is the rate at which the computation will fail due to errors in the device. We assume here that ϵ_total = 1% is sufficient, as the computation can be rerun for increased certainty as required.
For example, since the required logical error rate is well below 10^−10, an order of magnitude change in the target error makes at most a 10% change in the code distance, which does not affect the estimates by much. From ϵ_total we determine the tolerable error rate per gate, ϵ_gate = ϵ_total/(n_T n_L); this is the maximum error per gate that we can tolerate in order to keep the overall computational error rate below ϵ_total.

Noise model and logical operations Logical operations are implemented by specific fusion networks realizing space-time channels for fault-tolerant surface code operations. The fault-tolerant operations we consider are based on lattice surgery [25,44] and noisy magic state preparation [21,45,46], along with magic state distillation [26,47-49]. Each T-gate is realized by a lattice surgery operation on a specified number of target logical qubits along with a distilled encoded |T⟩ = (|0⟩ + e^(iπ/4)|1⟩)/√2 state. Each fusion network must be large enough that the logical error rate per gate is less than ϵ_gate.
The resource states that are produced are in general noisy, with further errors occurring during their propagation and measurement. In addition, fusions between resource states are also subject to noise. One can consider the combined effect of all error processes as resulting in erroneous outcomes on the fusions (for more details, see Ref. [20]). In particular, we characterize the noise on each fusion measurement by two parameters: p_P and p_E, known as the Pauli error rate and erasure error rate, respectively. We assume that for each measurement, there is a probability p_E that the outcome is erased, a probability p_P(1 − p_E) that the outcome is incorrect (i.e., bit-flipped but not erased), and a probability (1 − p_E)(1 − p_P) that the measurement is correct. We refer to p = (p_P, p_E) as the physical error rate. These parameters also determine the error rate with which magic states can be initially prepared. By post-selecting on erasure and syndrome information, one can prepare the initial magic states with logical Pauli error rates comparable to the physical Pauli error rate, provided physical error rates are sufficiently low [50]. Such protocols require only modest overhead.
For a physical error rate p, a generic lattice surgery operation in our scheme has a logical error rate

ϵ(d) = A e^(−Bd),

where A and B are parameters depending on the error rate p that are estimated by numerical simulations, and d is the code distance (the minimal number of elementary errors required to cause a logical fault). We refer to the exponential decay of ϵ(d) (as a function of A, B, and d) as the sub-threshold scaling. Each lattice surgery operation requires 2d² RSGs in footprint and takes d clock cycles (using the doubly-encoded patch lattice surgery method of Ref. [25]). The size of the fusion network is therefore determined by the tolerable gate error rate and the numerically estimated parameters: d = (1/B) log(A/ϵ_gate). Distillation protocol With the code distance d fixed, we can estimate the overhead required to distill encoded |T⟩ states. Distilling encoded |T⟩ states is a highly optimized protocol that depends on the parameters ϵ_gate, A, B, d, and p. It requires n_distill additional RSGs to prepare and route an encoded |T⟩ state every d clock cycles with error rate less than ϵ_gate. We consider distillation factories based on multiple levels of the 15-to-1 protocol [48,51], and in particular the FBQC analogue of the implementation in Sec. 3 of Ref. [26]. Each level of this protocol consists of a distillation block, in which the 15-to-1 distillation circuit takes place, and a connector block, which receives the output magic states of lower-level distillation blocks and performs the required lattice surgery operation. Output magic states from each level are routed to higher-level connector blocks using interleaving fiber. The number of distillation blocks at each level is chosen such that the magic state production rate is not less than the magic state consumption rate of the next higher level. We utilize two such factories, with a total footprint denoted by n_distill.
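A minimal sketch of this overhead calculation, using the illustrative sub-threshold parameters from Fig. 1 (A = 0.5, B = 1.6) and hypothetical algorithm counts n_T and n_L (not taken from Table IV):

```python
import math

A, B = 0.5, 1.6             # sub-threshold scaling parameters (Fig. 1)
eps_total = 0.01            # tolerable failure rate for the whole run
n_T, n_L = 10**10, 4_000    # hypothetical T-count and logical qubit count

# Per-gate error budget.
eps_gate = eps_total / (n_T * n_L)

# Smallest code distance with A * exp(-B * d) <= eps_gate,
# i.e. d = ceil((1/B) * log(A / eps_gate)) with a natural log.
d = math.ceil(math.log(A / eps_gate) / B)
assert A * math.exp(-B * d) <= eps_gate < A * math.exp(-B * (d - 1))

footprint_per_op = 2 * d * d   # RSGs per lattice surgery operation
cycles_per_op = d              # clock cycles per lattice surgery operation
print(d, footprint_per_op)
```

Note the weak (logarithmic) dependence of d on the error budget, which is why an order-of-magnitude change in ϵ_total shifts the code distance by only ~10%, as remarked above.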

III. RESULTS
Here we provide an analysis of the algorithmic and architectural techniques used in this paper. On the algorithmic side, we first show the scaling of the algorithm in terms of qubit count n_L, gate count n_T, and depth D_T, with respect to the metrics introduced in Section II C. Given that the system sizes we consider are large enough that the estimated footprint of magic state distillation factories is relatively small (as mentioned above and detailed later), we then summarize strategies for parallelizing the circuits, and compare the computational costs of the serialized and parallelized circuits.
On the architectural side, we first describe the fault-tolerance overhead of these algorithms in the FBQC model, and provide concrete runtime and footprint estimates. We then comment on the relative footprint of the magic state factory, which leads us to a discussion of a novel method that allows injection of multiple distilled magic states (while keeping the number of Clifford gates comparatively small), and estimate the potential reduction in runtime associated with these techniques (fully detailed in Appendix D).

Generality of results
Here we emphasize that most of the results we present are applicable to a wide variety of architectures and are not specific to the FBQC model; in particular, the estimated size of the magic state factory relative to the data blocks is a consequence of how expensive it is to perform lattice surgery operations when consuming T-states sequentially on large numbers of qubits, and is not a consequence of the FBQC model. This conclusion would hold in any architecture that uses the 15-to-1 distillation protocols we highlight. The only results that are specific to the FBQC model are the ability to use interleaving to perform space-time trade-offs, and the underlying physical implementation of these schemes (i.e., counting RSGs rather than the more familiar notion of "physical" qubits, and the clock speed that is specific to our photonic platform).

A. QPE cost estimate: scaling
Here we discuss the algorithmic resource estimates for performing quantum phase estimation (QPE) for the molecular systems considered in this article. As mentioned in Section II D, the goal of QPE is to produce an estimate of an eigenenergy given unitary access to a Hamiltonian and an eigenstate of said Hamiltonian. The overall number of T gates n_T for QPE can be expressed as the cost n_{T,Q} of embedding the Hamiltonian into a unitary circuit, times the number of queries one must make to this circuit to approximate an eigenenergy up to error ε_P. In general, to approximate an eigenenergy up to precision ε_P, one must query the unitary input to QPE O(1/ε_P) times. We can express the asymptotic T gate count as n_T = n_{T,Q}(ε_Q) · (απ/2)/ε_P, where n_{T,Q}(ε_Q) is the T-count contribution from the qubitization subroutine [41] (i.e., the cost of embedding the Hamiltonian into a unitary circuit; see Appendix B), and (απ/2)/ε_P is the T-count contribution from phase estimation (i.e., the number of times one must query the qubitized unitary to produce an estimate of the eigenenergy). Here the constant prefactor of 1/2 depends on the choice of phase estimation technique, following the analyses presented in Refs. [11] and [52]. Notice that the number of queries scales linearly with the norm α of the Hamiltonian (introduced in Section II C). This is because QPE estimates an eigenphase which lies between −π and π, but the spectrum of the Hamiltonian may not lie in this range. When we block-encode/qubitize the Hamiltonian (as discussed in Appendix B), we rescale the Hamiltonian's eigenspectrum by a factor of α, so we must query the qubitized unitary a number of times proportional to this rescaling factor. Finally, ε_P is the precision chosen in phase estimation, and ε_Q is a catch-all for the error due to approximations in the qubitization procedure.
To achieve the optimal T-count, the above expression must be minimized with respect to ε_P and ε_Q, subject to an imposed constraint on the overall error budget required for chemical accuracy. Similarly, the depth of the circuit can be expressed as D_T = D_{T,Q}(ε_Q) · (απ/2)/ε_P, where D_{T,Q} is the T-depth of the qubitization subroutine, and (απ/2)/ε_P is once again the number of times we must query the qubitization subroutine.
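These expressions combine into a small cost model. A sketch, using the 1/2 phase-estimation constant from the text; the example values of α and ε_P below are toy numbers, not the paper's:

```python
import math

def qpe_queries(alpha, eps_P, c_P=0.5):
    """Number of calls to the qubitized walk operator needed to
    resolve an eigenphase to precision eps_P, for a Hamiltonian
    with one-norm (rescaling factor) alpha; c_P is the
    technique-dependent constant, taken to be 1/2 in the text."""
    return math.ceil(alpha * math.pi * c_P / eps_P)

def qpe_t_count(n_T_Q, alpha, eps_P, c_P=0.5):
    """Total T-count: the T-cost of one block-encoding query times
    the number of queries."""
    return n_T_Q * qpe_queries(alpha, eps_P, c_P)

# Toy numbers: alpha = 1000 Hartree, eps_P = 1e-3 Hartree.
queries = qpe_queries(1000, 1e-3)    # = ceil(pi/2 * 1e6)
```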
Following the analyses and gate-counting arguments presented in Refs. [13,14], these quantities can, to leading order, be expressed in terms of a tunable integer parameter λ and β = 5.652 + log₂(Nα/ε_Q). While the expressions for n_{T,Q} and D_{T,Q} are broadly similar, note that n_{T,Q} depends linearly on N whereas D_{T,Q} depends only logarithmically on N. This difference is explained in the following subsection (III B).

B. Algorithmic parallelization
As mentioned in Section II E, it is useful to consider the computational volume of a computation, which is lower (upper) bounded by V_D (V_n). For the qubitization algorithm described in this paper, there are a handful of techniques and augmentations to subroutines one can exploit to realize a V_D that is significantly smaller than V_n. These techniques range from known, trivial techniques to novel and/or non-trivial optimizations. We detail these techniques in Appendix C and quote the relevant results here.
Gap between V_D and V_n In order to get a sense of the magnitude of the possible computational volume improvements, we consider the resource costs of three scenarios: (i) optimizing for V_n and applying T gates serially; (ii) optimizing for V_n but applying multiple T gates in parallel; and (iii) optimizing for V_D and applying T gates in parallel. We provide the resource estimates for each molecule and basis set for each scenario in Tables IV, V, and VI. Here we quote the order-of-magnitude cost for the smallest and largest classically intractable instances considered: the full configuration interaction (FCI) picture of PF₆⁻ in the cc-pVDZ basis as the smallest instance, and the FCI picture of LFEC in the cc-pVTZ basis as the largest. Across all instances, applying T gates in parallel (whether optimizing for V_n or for V_D) allows for an average order-of-magnitude reduction in resource requirements (given the ability to distill multiple magic states in parallel); see Tables V and VI.

C. Fault-tolerant overheads
We now estimate the resources required to implement the above algorithms in a photonic fusion-based quantum computing (FBQC) architecture [20,21] based on surface codes [53][54][55][56]. The overhead estimates in this section are based on performing T-gates sequentially, one at a time. We will discuss advantages of performing T-gates in parallel in the next subsection.
The total number of RSGs in the computation is simply the sum of the number of RSGs for each block of the overall architectural layout depicted in Fig. 3: the data/ancilla block, where we perform lattice surgery operations on d × d surface code patches, and the magic state distillation (MSD) block, where we distill and inject |T⟩ states. The data/ancilla block requires 2d²n_L RSGs, and we denote the footprint of the MSD block by n_distill, for a total of n_RSG = 2d²n_L + n_distill RSGs. The number of cycles required for the total algorithm is the product of the number of gates and the code distance, n_cycles = n_T · d.
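This accounting is easy to mechanize. A minimal sketch; the qubit and T-gate counts below are the smallest-instance values quoted in the Discussion, while the values of d and n_distill are illustrative stand-ins:

```python
def fbqc_resources(n_L, n_T, d, n_distill, f_rsg_hz=1e9):
    """Total footprint and runtime following the accounting above:
    2*d^2 RSGs per logical qubit in the data/ancilla block plus
    the distillation block, and n_T * d clock cycles overall."""
    n_rsg = 2 * d * d * n_L + n_distill
    n_cycles = n_T * d
    hours = n_cycles / f_rsg_hz / 3600.0
    return n_rsg, n_cycles, hours

# 16,382 logical qubits and 2.32e12 T gates (smallest instance);
# d = 27 and n_distill = 100,000 are arbitrary illustrative picks
# within the moderate-error-rate range.
n_rsg, n_cycles, hours = fbqc_resources(16_382, 2.32e12, 27,
                                        n_distill=100_000)
# ~2.4e7 RSGs and under one day, matching the Discussion's scale.
```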
Input parameters We perform estimates of the fault-tolerant overheads, both the footprint n_RSG and the time taken n_cycles/f_RSG, for a given operating point (specified by a physical error rate p). As noted in Ref. [21], all of the relevant components for resource state generation (for instance sources [57], electronics, and electro-optical modulators [58,59]) operate on GHz timescales, so the relevant clock speed for resource state generation is f_RSG = 1 GHz. For many photonic architectures, loss is a dominant form of error. For photonic encodings such as the 'dual rail' encoding, loss is heralded and, in addition to fusion 'failure', leads to erasures of the fusion outcomes. As such, we consider a simple noise model on the fusion outcomes that is dominated by erasures; see Ref. [20] for more details. We consider a ray in the noise parameter space defined by (p_P(x), p_E(x)) = (x/10, x), x ∈ [0, 1]. The threshold for the 6-ring fusion network along this ray occurs at x* = (4.71 ± 0.02) × 10⁻², such that (p_P(x*), p_E(x*)) = (4.71 × 10⁻³, 4.71 × 10⁻²). Here, the threshold is the maximal tolerable error rate (for this ray) below which information can be protected arbitrarily well in the limit of large network size (see, for example, Ref. [55] for thresholds in the context of toric codes and Ref. [20] for thresholds in the context of FBQC). We give estimates for two regimes: i) a high physical error rate, and ii) a moderate physical error rate, for which the logical error rate satisfies Eq. (5) with A = 0.5 and B = 1.6.
TABLE V. T-count and T-depth comparison when optimizing V_n. The last column shows the potential savings offered by parallelizing the application of T gates (merely swapping out T-count for T-depth). All instances offer a potential savings of over 8x, with the largest instance reducing by over 13x.
In the above, the parameters A and B defined by Eq. (5) are obtained by numerical simulations using the union-find decoder [60] on the bulk 6-ring fusion network. We remark that under this noise model and decoder, the 6-ring fusion network has marginal thresholds of p_P* = 0.94% and p_E* = 12% [20]; the Pauli threshold can be increased using other decoders such as minimum-weight matching [55,61] (at the cost of increased decoding runtime). We assume an algorithmic target error rate of ε_total = 10⁻². We also assume magic states are initially prepared with a logical Pauli error rate of 0.1%.
Algorithmic footprint and runtime We plot footprint and time estimates in Fig. 1; the specific numbers can be found in Table VII. In the above high error rate (moderate error rate) estimates, the code distances required for the data logical qubits range from d = 34 to d = 44 (d = 24 to d = 31), meaning that with trivial interleaving, between 1156 and 1936 (576 and 961) resource state generators are required to produce an idling logical qubit. Additionally, with trivial interleaving in the high error rate (moderate error rate) regime, the number of RSGs required for the full computation is between n_RSG = 6.4 × 10⁶ and n_RSG = 3.9 × 10⁸ (n_RSG = 3.2 × 10⁶ and n_RSG = 2.0 × 10⁸), while the time taken is between t_algo = 0.60 and t_algo = 1.3 × 10³ hours (t_algo = 0.42 and t_algo = 9.1 × 10² hours).
TABLE VI. T-count and T-depth comparison when optimizing V_D. The second-to-last column shows the potential savings offered by parallelizing the application of T gates (merely swapping out T-count for T-depth). The final column shows the overall volume reduction when optimizing for V_D vs. optimizing for V_n. All instances offer a potential savings of over 8x, with the largest instance reducing by over 15x. These savings are greater than the savings demonstrated in Table V.
Interleaving trade-offs Finally, we remark on the possible reductions in footprint due to interleaving, a feature specific to FBQC. As we have mentioned, interleaving ratios of up to several thousand are certainly possible using low-loss optical fiber. One can readily reduce the number of RSGs required by a factor of ∼5 × 10³ while increasing the time taken by the same factor, compared to the estimates performed with trivial interleaving. In this regime, a single RSG is sufficient to generate several logical qubits. This brings the required footprint for the present computations to between ∼10³ and ∼10⁵ RSGs, even in the high error rate case. We remark that the additional fiber loss at these interleaving ratios has very little impact on the photon loss threshold [21]. Thus we have neglected any potential changes to the sub-threshold scaling for these interleaving ratios, as we expect them to be small.
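The trade-off is linear by construction; a one-line sketch, using the smallest high-error-rate estimates quoted above:

```python
def interleave(n_rsg, t_hours, L_intl):
    """L_intl-fold fewer RSGs at the cost of an L_intl-fold longer
    runtime (trivial interleaving is L_intl = 1)."""
    return n_rsg / L_intl, t_hours * L_intl

# Smallest high-error-rate instance (6.4e6 RSGs, 0.60 h) at an
# interleaving ratio of 5000:
rsgs, hours = interleave(6.4e6, 0.60, 5000)   # ~1.3e3 RSGs, 3000 h
```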

D. Relative footprint of magic state factories
We remark (perhaps surprisingly) that the footprint utilized for magic state distillation in the above computations is no more than 2% of the total computational footprint. For the more costly basis sets of cc-pVDZ and cc-pVTZ, magic state distillation utilizes less than 0.3% of the footprint. Indeed, in the above estimates (which use only two levels of distillation, the first level of which is relatively small), distilling and routing a magic state every d clock cycles requires the equivalent RSG footprint of between 78 and 120 full-size logical qubits (i.e., 78d² to 120d² RSGs). As the qubitization approach for the above molecules requires many thousands of logical qubits, the distillation overhead to produce one |T⟩ state per logical clock cycle is comparatively small. However, for other methods that use relatively fewer logical qubits (such as those based on Trotterization), the distillation cost becomes more significant. Moreover, one may utilize higher-rate distillation protocols and perform more T gates in parallel (where possible) to perform faster quantum computations (as we discuss in Appendix D).
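The ratio can be checked directly. A sketch, assuming (our simplification) that both the data block and the MSF cost the same 2d² RSGs per logical-qubit equivalent, so that the code distance cancels; the qubit counts below are illustrative:

```python
def msf_footprint_fraction(n_L, msf_qubit_equiv):
    """Fraction of the total footprint used by magic state
    distillation, with the MSF expressed in logical-qubit
    equivalents; assuming both blocks cost 2*d^2 RSGs per
    logical qubit, d cancels out of the ratio."""
    return msf_qubit_equiv / (n_L + msf_qubit_equiv)

# ~16,000 data/ancilla logical qubits vs. an MSF worth 120
# logical-qubit equivalents (the upper end quoted above):
frac = msf_footprint_fraction(16_000, 120)    # well under 1%
```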
We reiterate that the small relative footprint of magic state factories is a consequence of how expensive lattice surgery operations are in a surface code computation over many logical qubits, and not a consequence of the FBQC paradigm.

E. Constant-time PPMs
The resource estimates above are based on the subthreshold scaling behavior of the FBQC scheme in Ref. [20], making use of one of the quantum computing architectures in Ref. [25]. We propose an alternative architecture in this section, in the hope that it may reduce the overall space-time cost of the algorithm. As we shall later see, we have reasons to believe that the newly proposed architecture can reduce the computation time without incurring a significant amount of extra footprint. However, let us emphasize that, unlike the results above, we have not carried out a Monte Carlo study on the threshold and sub-threshold behavior of this new scheme. A fair comparison between the two can be made only via a rigorous numerical study, which we leave for future work.
We previously observed that if we use a state-of-the-art quantum algorithm to simulate the molecules relevant to battery research, the magic state factory constitutes only up to 2% of the entire quantum computer. Therefore, if magic state distillation is the main bottleneck of the quantum computation, one may be able to reduce the computation time by increasing the number of magic state factories and injecting the distilled magic states appropriately.
In this section, we propose a different scheme, based primarily on Litinski's scheme [25,26]. Litinski's scheme uses Pauli product measurements (PPMs) and preparation of states in the set {|0⟩, |+⟩, |T⟩}. A PPM refers to a non-destructive measurement of an observable of the form P_1 ⊗ P_2 ⊗ ··· ⊗ P_N, where P_n ∈ {I, X, Y, Z} is the logical Pauli operator of the n'th qubit, with n ranging from 1 to N. By consuming multiple magic states in parallel [62], one can speed up the computation. However, in that scheme, a k-fold increase in speed necessitates a k-fold increase in footprint.
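To make the PPM concrete, here is a state-vector sketch of a non-destructive measurement of P_1 ⊗ ··· ⊗ P_N, written in plain numpy on bare amplitudes; the fault-tolerant lattice-surgery implementation in Refs. [25,26] is of course very different:

```python
import numpy as np

PAULI = {"I": np.eye(2),
         "X": np.array([[0, 1], [1, 0]]),
         "Y": np.array([[0, -1j], [1j, 0]]),
         "Z": np.diag([1.0, -1.0])}

def ppm(state, paulis, rng):
    """Non-destructive measurement of P_1 x ... x P_N on a state
    vector: sample an outcome +-1 and project onto the
    corresponding eigenspace."""
    P = np.array([[1.0]])
    for p in paulis:
        P = np.kron(P, PAULI[p])
    proj_plus = (np.eye(len(state)) + P) / 2       # +1 eigenprojector
    amp_plus = (state.conj() @ proj_plus @ state).real
    outcome = 1 if rng.random() < amp_plus else -1
    proj = proj_plus if outcome == 1 else np.eye(len(state)) - proj_plus
    post = proj @ state
    return outcome, post / np.linalg.norm(post)

# Measuring Z (x) Z on the Bell state (|00> + |11>)/sqrt(2) always
# yields +1 and leaves the state unchanged.
rng = np.random.default_rng(0)
bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
outcome, post = ppm(bell, "ZZ", rng)
```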
In Appendix D, we detail an O(1)-time implementation of an arbitrary PPM which does not incur such additional footprint. Moreover, all the physical operations can be made local in two spatial dimensions. Therefore, optimistically speaking, in the regime in which |T⟩-state factories are abundant, so that more than one |T⟩ state can be distilled in a single logical clock cycle, one can expect to be able to inject them all in a single logical clock cycle, provided that there are no more than O(d) of them. First, in Appendix D 2, we discuss an abstract circuit model that implements a PPM; then in Appendix D 3 we explain how this protocol can be implemented on the surface code [54] and show that each step of this protocol can be executed in constant time; then in Appendix D 4 we explain how this method can be used to compile a general quantum algorithm; and finally in Appendix D 5 we discuss the expected speedup from this approach.
Expected speedup using O(1)-time PPMs Importantly, being able to execute an arbitrary PPM in constant time does not mean one can execute an entire sequence of arbitrary PPMs in O(1) time. Two important constraints cap the achievable runtime speedup: our compilation method is sensitive to whether or not successive PPMs in a sequence commute with one another, and the total number of PPMs one can execute in a single logical clock cycle must be smaller than the code distance d. These constraints are discussed in detail in Appendix D 5.
With these constraints in mind, the speedup we can achieve using this approach is limited by the ratio of T-count to T-depth. In our setup we can optimistically expect up to a 15-fold speedup with a negligible change in the footprint of the device. However, whether this is achievable depends on many microscopic details, such as the timescales needed for different physical operations. Moreover, we emphasize that the threshold and the sub-threshold behavior of the code may change when we implement the transversal gates. The exact extent to which we can speed up the computation using our approach can only be determined by carefully inspecting all of these factors. These studies are beyond the scope of this paper and are left for future work.
Keeping these caveats in mind, we estimate the optimistic computation time for a single phase estimation for the molecules described to be on the order of 1 to 4 hours for the cc-pVDZ basis set and 35 to 90 hours for the cc-pVTZ basis set.

IV. DISCUSSION
A great deal of progress on fault-tolerant resource estimates for Hamiltonian simulation algorithms has been driven by a focus on specific molecules, like FeMoco. Such studies have served as benchmarks for the introduction of new algorithmic techniques and improvements [10][11][12][13][14]. Here we propose a new class of commercially interesting benchmark molecules: the constituent chemicals of Li-ion battery electrolytes. We estimate the quantum computational cost of simulating these molecules, first in terms of architecture-agnostic parameters like T-count and qubit count, and then by compiling the logical operations into the primitives of a fusion-based quantum computing architecture [20].
For the smallest instance that is classically intractable (full configuration interaction simulation of PF₆⁻ using the cc-pVDZ basis at 1 mHartree precision), we estimated a total of 16,382 logical qubits and 2.32 × 10¹² T gates (see Table IV); after compilation, this corresponds to an estimated footprint of ∼24 million RSGs and a runtime of less than 1 day in the moderate error rate case (see Table VII). Using the recently-introduced interleaving technique [21], we can perform linear space-time trade-offs between runtime and footprint; e.g., using an interleaving ratio of 24 for the same instance, we can instead have a footprint of ∼1 million RSGs and a runtime of ∼2.5 weeks.
Molecules like FeMoco live in a regime where the size of the magic state factory (MSF) is relatively large, making it infeasible to distill and inject multiple distilled magic states in the same time step; by exploring larger instances of molecules of commercial interest, we challenge this intuition. In this algorithmic regime of 10⁴–10⁵ logical qubits, the relative size of the MSF is small enough that opting for multiple MSFs is no longer a prohibitive cost.
Understanding the cost of MSFs in this regime to be minimal, we explore the degree to which we can parallelize the distillation and consumption of magic states in the algorithm studied here. The drastic gap between the T-depth and T-count of various subroutines in the algorithm suggests an average order-of-magnitude potential reduction in computational runtime (up to ∼15x for the largest instance; see Table V). In order to exploit this potential parallelization, we need not only multiple MSFs, but also fast measurements, so as not to be bottlenecked by the consumption of magic states. With this motivation in mind, we propose a novel method for performing constant-time PPMs. We emphasize that the effect of the transversal gates used in this procedure on the threshold and sub-threshold scaling is not known; we leave this study for future work.
We note the great deal of flexibility in space-time tradeoffs afforded us both algorithmically and architecturally. Algorithmically, similar magnitudes of computational volume reduction (∼ 10x) are possible in a variety of ways, mostly revolving around the choice of λ in the ubiquitous QROM subroutine (discussed in Appendix C 4). Naïvely, these savings translate to runtime savings in the fully-compiled resource estimate; however, interleaving allows us to easily distribute these savings between runtime and footprint. If we could perform the constant time PPMs detailed above and exploit the ∼ 10x reduction in computational volume, the resource estimate for the small instance mentioned above could go from a footprint of ∼ 1 million RSGs and a runtime of ∼ 2.5 weeks to ∼ 0.1 million RSGs and ∼ 2.5 weeks, ∼ 1 million RSGs and ∼ 1.7 days, or ∼ 0.5 million RSGs and ∼ 3.5 days by distributing the savings fully into footprint, fully into runtime, or evenly between the two, respectively.
Future studies in this regime should explore alternative Hamiltonian simulation algorithms that may be amenable to massive parallelization, as well as further algorithmic improvements tailored to depth considerations. The computational costs of the problems studied here would decrease dramatically given the ability to distill and consume multiple magic states in the same time slice. Ultimately, the overall volume savings that can be had may benefit greatly from continued co-development between architectures and algorithms. The methods introduced here serve to motivate further research into algorithmic and architectural techniques in order to realize the potential gains to be had in reducing the computational volume of problems in quantum chemistry.

V. ACKNOWLEDGMENTS
The authors would like to thank Daniel Litinski both for helpful discussions on the method for fast magic state injection and for support in calculating the relative footprint of magic state factories. We would also like to thank Sara Bartolucci. The authors declare that they have no competing interests. Data availability: All data needed to evaluate the conclusions in the paper are present in the paper. Additional data related to this paper may be requested from the corresponding authors. Correspondence and requests for materials should be addressed to William Pol and Eunseok Lee.

Appendix A: Fusion-based quantum computation

Fusion-based quantum computation (FBQC) is a universal model of quantum computation particularly suited to photonic quantum computation (for more details, see Ref. [20]). FBQC shares similarities with the well-known measurement-based model of quantum computation (MBQC) [65][66][67], in the sense that the computation is carried out by a sequence of measurements and feed-forwards instead of unitary gates; however, there are also important differences. In MBQC, one often begins with a single resource state (i.e., a cluster state) over a large number of qubits, and carries out the computation via a sequence of single-qubit measurements and feed-forward operations. In contrast, in FBQC one begins with a set of small resource states, and the computation proceeds by applying certain one- and two-qubit measurements to this set of resource states. In FBQC, quantum information is stored and processed by repeatedly teleporting it between different resource states. The resource states required in FBQC are small, fixed entangled states involving a few qubits (constant in the size of the computation), which are "fused" together using one- and two-qubit measurements. Any (fault-tolerant) computation can be expressed as a sequence of such fusions (i.e., measurements) between a number of resource states arranged in (2+1)D spacetime.
This layout of resource states in spacetime and the measurements between them is known as a fusion network.
For simplicity, we focus on the 6-ring fusion network introduced in Ref. [20], where the resource states are 6-qubit cluster states on a ring, and the necessary fusion measurements consist of Bell basis measurements, i.e., the measurement of both two-qubit Pauli operators X ⊗ X and Z ⊗ Z, along with single-qubit Pauli measurements and (X + Y)/√2 magic state basis measurements. The 6-ring fusion network can be understood as implementing surface-code-type topological quantum error correction.
Resource state generators and interleaving The primitive building block for a fusion-based quantum computer is a resource-state generator (RSG). RSGs are arranged in arrays, each producing resource states at a fixed clock rate f RSG , the qubits of which can be routed to different measurement devices to undergo single-and two-qubit measurements in a manner determined by the fusion network. For fusion networks based on the surface code -such as the 6-ring fusion network -a 2D array of RSGs is sufficient for fault-tolerant universal quantum computation. With access to a quantum memory, such as optical fiber, a single RSG can produce many simultaneously-existing resource states, which can allow for large quantum computations to be performed using much fewer RSGs than one might typically expect. Namely, with optical fiber as a quantum memory, one can utilize interleaving to perform linear space-time tradeoffs in the fault-tolerant quantum computation [21]. Thus in effect, interleaving reduces the number of required RSGs by a factor L intl , while increasing the time required to execute a fusion network by the same factor L intl . We refer to the case with L intl = 1 as trivial interleaving.
Note that optical fiber offers exceptionally low transmission loss rates of 0.2 dB/km (4.5%/km) or less (at 1550 nm wavelengths) [68]. It has been shown that for the 6-ring fusion network, using fibers of length 1000 m or more is readily possible without introducing too much error or significantly degrading the threshold (indeed, only some of the photons in an interleaved 6-ring fusion network must travel through this long fiber) [21]. For RSGs with gigahertz clock rates, roughly 5000 photonic qubits can simultaneously be stored in a 1 km fiber [21]. Thus we presume interleaving ratios L intl of up to 5000 to be achievable in practice.
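The two numbers quoted above follow from simple arithmetic; a sketch, assuming light travels through silica fiber at roughly 2/3 of c (≈ 2 × 10⁸ m/s):

```python
def photons_in_fiber(fiber_m, f_rsg_hz=1e9, v_fiber_m_s=2.0e8):
    """Qubits simultaneously in flight in a fiber delay line: the
    line holds (length / speed) seconds of the pulse train emitted
    at f_rsg_hz."""
    return fiber_m / v_fiber_m_s * f_rsg_hz

def fiber_transmission(fiber_km, loss_db_per_km=0.2):
    """Transmitted fraction for the quoted 0.2 dB/km loss."""
    return 10 ** (-loss_db_per_km * fiber_km / 10)

n = photons_in_fiber(1000.0)      # ~5000 photons in 1 km at 1 GHz
t = fiber_transmission(1.0)       # ~0.955, i.e. ~4.5% loss per km
```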

Appendix B: Summary of block encoding techniques and primitives for double-factorized Hamiltonians
Below we summarize the strategy for qubitizing the double-factorized form of the Hamiltonian. Quantum phase estimation (QPE) requires a unitary as input, from which it extracts eigenvalues. For a Hamiltonian of interest H, the input unitary is some function of the Hamiltonian, for example e^(−iHt). However, an attractive alternative, which has led to the most efficient resource estimates for quantum chemistry simulations, is "qubitization", where one implements alternative functions of the Hamiltonian; up to errors due to rotation gate synthesis, some such unitaries can be implemented exactly [41]. Qubitization relies heavily on "block-encoding", a process by which one encodes a Hamiltonian in a "block" of a unitary B acting on a larger Hilbert space, such that (⟨0|_a ⊗ I) B (|0⟩_a ⊗ I) = H/α, where the subscript a labels an ancilla register. This allows us to embed a non-unitary matrix (such as a Hamiltonian) into a larger unitary circuit, which can then be input into QPE. Block-encoding primitives. Block-encoding requires an input model to access the terms and coefficients of the Hamiltonian. A common input model is the "linear combination of unitaries" (LCU) model, where one expresses the Hamiltonian as a sum of easier-to-implement unitaries (typically Pauli matrices) [69]. In its simplest form, for un-factorized Hamiltonians, this requires only two subroutines: one to load the Hamiltonian coefficients onto the ancillary register (often referred to as prepare [11,12,14]), and another to selectively apply the corresponding Hamiltonian term onto the system register (referred to as select [11,12,14]).
Block-encoding + LCU strategy. The idea is that the ancilla register can be prepared in a superposition state over all possible indices for the terms in the Hamiltonian, weighted by the square root of the coefficient of each term (normalized by the norm α), and then this same index ancilla register can now be used to apply the corresponding Hamiltonian term onto the system register, weighted by the appropriate amplitude. Finally, we uncompute the state preparation routine on the ancilla register; now, in the subspace where the ancillae are in the all-zero state, the Hamiltonian has been applied onto the system qubits.
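The prepare/select pattern can be checked end-to-end on a toy Hamiltonian. A numpy sketch; the two-term, single-qubit example is ours, not from the paper:

```python
import numpy as np

X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.diag([1.0, -1.0])
I2 = np.eye(2)

# Toy LCU: H = 0.3*X + 0.7*Z, so alpha = |0.3| + |0.7| = 1.0.
coeffs = np.array([0.3, 0.7])
terms = [X, Z]
alpha = coeffs.sum()

# prepare: rotate the index ancilla |0> to sum_i sqrt(c_i/alpha)|i>.
a = np.sqrt(coeffs / alpha)
prep = np.array([[a[0], -a[1]], [a[1], a[0]]])      # real unitary

# select: apply term i on the system, controlled on ancilla |i>.
select = np.block([[terms[0], np.zeros((2, 2))],
                   [np.zeros((2, 2)), terms[1]]])

# Block-encoding B = (prep^dag (x) I) select (prep (x) I); its
# top-left (ancilla = |0>) block is H / alpha.
B = np.kron(prep.T, I2) @ select @ np.kron(prep, I2)
H_block = alpha * B[:2, :2]                          # recovers H
```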
Data-loading oracle. Another important and ubiquitous subroutine is the data-lookup oracle, often referred to as QROM [11,12]. Given a list of K b-bit elements a = [a_0, ..., a_{K−1}], QROM performs the mapping |i⟩|0⟩ → |i⟩|a_i⟩. One can use a number of ancillae to trade between space and time, resulting in space-time optimized costs for this unitary [12,70]. The asymptotic T-count of the QROM in Refs. [12,70] is O(K/λ + bλ), where λ is a tunable power-of-two number of copies for the b-bit approximation of the elements in a. We highlight this subroutine for three reasons: (i) it is used throughout the double-factorized qubitization circuit, both as a standalone routine and as a subroutine for prepare; (ii) the plurality of the space-time cost of qubitizing the double-factorization comes from one large data-lookup (as summarized in Table VIII); and (iii) the choice of how to tune λ greatly informs the overall cost of the algorithm, in terms of either T-count or T-depth. Double-factorization. The block-encoding must be modified in order to account for the double-factorization of the Hamiltonian. The high-level alterations follow from the following observation: by adopting the Majorana representation of the fermion operators, one can re-express the Hamiltonian as a sum of products of one-body terms (see Ref. [13], and Ref. [14], Appendix C). The one-body term is expressed as a sum of products of basis-rotated Majorana operators γ_Rm, and the two-body term is expressed as a sum of Chebyshev polynomials T_2(x) = 2x² − 1 of normalized one-body terms L_r/||L_r||_SC, where ||L_r||_SC is the one-norm of the eigenvalues of each matrix L_r and the coefficients Λ_r result from normalizing each L_r. Also, H_DF contains an identity offset that can effectively be ignored, as it contributes only a constant shift in energy.
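The role of λ can be seen by minimizing a simple select-swap cost model over powers of two. A sketch, assuming (our simplification) the ceil(K/λ) + b(λ−1) form; the exact costings in Refs. [12,70] differ in constant factors:

```python
import math

def qrom_t_cost(K, b, lam):
    """Select-swap QROM cost model: a sequential lookup over
    ceil(K/lam) chunks plus a swap network over lam copies of the
    b-bit output register (an assumed simplification of the
    costings in Refs. [12,70])."""
    return math.ceil(K / lam) + b * (lam - 1)

def best_lambda(K, b, max_log2=20):
    """Power-of-two lam minimizing the cost; the optimum sits near
    sqrt(K / b)."""
    return min((2 ** k for k in range(max_log2 + 1)),
               key=lambda lam: qrom_t_cost(K, b, lam))

lam = best_lambda(K=2 ** 16, b=32)   # sqrt(65536/32) ~ 45 -> 32
```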
High-level strategy. The ultimate objective of qubitizing the double-factorized Hamiltonian, as with naïve qubitization, is to selectively apply indexed terms of the Hamiltonian, weighted by the appropriate coefficients in the Hamiltonian sum. However (as noted above), the terms we would like to selectively apply in the double-factorized Hamiltonian are non-trivial basis-rotated Majorana operators. When we perform the Hamiltonian factorization, we are performing an eigendecomposition which leaves us in the eigenbasis of each L_r matrix, hence "basis-rotated" operators. The amendments made to the naïve qubitization circuit are due to the machinery needed to load basis-rotation angles onto an ancilla register, and then selectively apply a series of rotations onto the system register conditioned on this ancillary register. These rotations are necessary in order to rotate the system qubits out of the eigenbasis of each L_r matrix, apply a standard Majorana operator (a Pauli), and then rotate back into the original basis. Loading and then applying these angles requires two ancilla registers, one of size log R and another of size log M, to serve as index registers over the ranges r ∈ R and m ∈ M respectively (and a host of additional workhorse ancillary registers). Following Figure 16 in Ref. [14] (but using the notation of Ref. [13]), one can interpret the initial series of state preparation routines and data-loaders as one large prepare subroutine that loads the appropriate rotation angles, and the sequence of rotations (and its dagger) together with the applied Pauli as one large select.
Dominant subroutines. We note that the plurality of the T-count comes from applying the series of rotations and a data-loader that loads the angles for these rotations, and a vast majority of the total qubits are used in these two steps (summarized for each molecule and basis set in Table VIII). Here we highlight that the brunt of the asymptotic expressions in the overall algorithm's resource estimate comes from the gate and qubit count of the basis-rotation QROM. Specializing the T-count of the QROM in Eq. (B3), the list of elements we must load is a list of M + N angles, each approximated by N·β bits of precision, leading to an asymptotic T-count of O((M + N)/λ + Nβλ). While each QROM in the circuit has its own tunable parameter λ, the λ for the basis-rotation QROM contributes the largest number of ancillae of all of the data-loaders, and can be taken to be the proxy λ appearing in the algorithm's overall asymptotic scaling expression.

Appendix C: Details on algorithmic parallelization techniques
While we typically take the T-count n_T to be a measure of the runtime of the algorithm, following the numerics in Fig. 1 showing the relatively small size of the magic state factory, one could instead take the T-depth D_T as this measure. The T-depth of an algorithm is a proxy for how parallelizable the algorithm is; given the ability to perform multiple T-gates in parallel (see Section D) and having ∼ n_T/D_T magic state factories available, one can perform up to n_T gates in D_T time slices.
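As a toy illustration of this accounting (the numbers below are hypothetical, not taken from our tables):

```python
# Hypothetical algorithm: n_T total T gates with T-depth D_T. With roughly
# n_T / D_T factories producing magic states in parallel, the runtime is set
# by D_T rather than n_T, shrinking the volume from V_n to V_D.
n_T = 2.0e12    # hypothetical T-count
D_T = 8.0e10    # hypothetical T-depth
n_L = 20_000    # hypothetical logical qubit count

factories_needed = n_T / D_T    # average rate of parallel T consumption
V_n = n_T * n_L                 # volume assuming serial T gates
V_D = D_T * n_L                 # volume with parallel injection
speedup = V_n / V_D             # equals n_T / D_T
```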
In these circumstances, where multiple magic states may be injected and consumed in parallel, it may be preferable to take V_D as the measure of computational volume rather than V_n, which assumes the serial application of T gates. For the qubitization algorithm described in this paper, there are a handful of techniques and augmentations to subroutines one can exploit to realize a V_D that is significantly smaller than V_n. These techniques range from known, trivial techniques to novel and/or non-trivial optimizations. First we comment on the two largest-impact non-trivial augmentations, and in the following subsection we comment on the other techniques considered. Together, the two subroutines discussed below constitute the majority of the overall computational volume of the algorithm (see Table VIII for percent contributions per molecule and basis set).

QROM depth vs. QROM count
The data-lookup subroutine is used throughout the qubitized double-factorization circuit; however, a particular QROM used for loading angles for the basis-changing rotations (the next subroutine considered) constitutes a large portion of the overall T-cost of the algorithm. For this particular QROM, we must load a list of M + N angles, each approximated by N·β bits of precision, leading to the asymptotic T-count given in Eq. (B5). While each QROM in the circuit has its own tunable parameter λ, the λ for the basis-rotation QROM contributes the largest number of ancillas of all of the data-loaders, and can be taken to be the proxy λ appearing in the algorithm's overall asymptotic scaling expression.
The T-count of data-loaders has a linear dependence on both the number of items being loaded and the bits of precision used to approximate each element (see Eqs. (B3) and (B5)). On the other hand, the depth of a data-loader has a linear dependence on the number of items being loaded, but no dependence on the bits of precision [70]. For the data-loader that loads the basis-changing rotations, we can therefore compare the asymptotic T-count given in Eq. (B5) with an asymptotic T-depth that scales as ∼ (M + N)/λ, with no dependence on the Nβ bits of precision. This drastic difference results from one of QROM's primitives, the so-called SwapUp network [70,71]. We can use this improvement throughout the circuit in all of the standalone QROMs and the ones inside of each prepare. However, this lack of dependence on the bits of precision is particularly significant in the context of the large data-loaders mentioned above--which make up a plurality of the computational volume (see Table VIII)--because the bits of precision for these data-loaders is Nβ. Removing this dependence can drastically reduce the overall T-depth.

Log-depth basis rotations

Like the previous savings contribution, this savings results from exploiting a drastic difference between the T-count and T-depth of one of the largest percentage-share subroutines in the qubitization circuit. Here we introduce a novel construction of the basis-changing rotations (detailed in Appendix VII.C.1 and VII.C.2 in [13]) that achieves logarithmic depth in the number of target qubits, as opposed to linear depth, without the need for extra ancillae and without altering the T-count. Since this subroutine contributes the next largest amount of computational volume to the overall circuit after the QROM mentioned above (see Table VIII), reducing its T-depth from O(N) to O(log N) aids in drastically reducing the overall computational volume.
The circuit for implementing block-encodings of products of basis-rotated Majorana fermions is described in Lemma 8 and Eqs. (57,58) of [13]. The dominant cost of the circuit is a basis rotation that converts a "rotated" Majorana fermion γ_u = Σ_{j∈[N]} u_j γ_j to the "unrotated" Majorana γ_0, which can be readily implemented by a single Pauli under the Jordan-Wigner encoding. In [13], Lemma 8, it is shown that the fermionic basis rotation can be constructed from a ladder of operators of the form V_p = exp{θ_p γ_p γ_{p+1}}, where each angle in the set {θ_p} is chosen to zero a single term in the basis-rotated Majorana γ_u. Under the Jordan-Wigner encoding, V_p becomes a Givens rotation. A circuit diagram is reproduced in Fig. 4 from [13], Eq. (58).
It is somewhat trivial to note that the basis change circuit requires N − 1 Givens rotations to zero N − 1 parameters in γ_u, and so we cannot reduce the rotation count any further. However, the overlapping Givens rotations in Fig. 4 do not commute, and so the rotation depth in this instance scales identically to the rotation count. This need not be the case. Rather than V_p, we can construct a circuit from the generalized fermionic operator V_{p,q} = exp{θ γ_p γ_q}. Under the Jordan-Wigner encoding, these operators become pairs of Pauli product rotations whose Pauli strings carry a run of Z operators on the qubits between p and q. Given the inserted Zs, we'll refer to this operator as a

FIG. 4. (left) A circuit diagram of the basis rotation circuit from [13] for N = 8. Each Givens rotation is a pair of commuting PPRs with opposite angle, and therefore has rotation count 2 and rotation depth 1. (right) A circuit diagram of the Gizens basis rotation circuit for N = 8. Each Gizens rotation is also a pair of commuting PPRs with opposite angle, and therefore has rotation count 2 and rotation depth 1.
"Gizens" rotation. The approach is similar to that used for state preparation by Grover and Rudolph in [72]. Given the ability to compute partial sums of the amplitudes u p , we apply a binary tree of Gizens rotations with angles chosen such that the correct partial sum is obtained at either side of the bifurcation. An example circuit is shown in Fig  4. The proof that this construction evaluates the correct basis change circuit is the same as in [72]; we include it in Appendix E.

Further parallelization techniques
T-depth vs. T-count of a Toffoli A trivial speedup that can be implemented is in how we account for the T-cost of Toffoli gates. In the resource estimates provided in Refs. [13,14], the cost of the subroutines is given in terms of the number of Toffoli gates, whereas we provide the cost in terms of T-gates. One can implement a Toffoli using 4 T-gates [73], and so all of our gate-count expressions are a factor of 4 larger than those provided in Refs. [13,14]. However, this factor of 4 can be removed by using the T-depth of a Toffoli instead of its count, which is known to be only depth-1 [73,74]. The 4-T-gate, depth-1 Jones Toffoli [73] uses only a single clean ancilla (which can be borrowed from other clean ancillae in the algorithm, and re-used to decompose each Toffoli in the computation). Each Gidney compute-uncompute pair used in the select circuits within QROMs, first introduced in Ref. [11], can similarly be parallelized to depth-1 using a modification of the Jones construction in Ref. [73].
Controlled swaps in depth-1 For the double-factorized qubitization circuit depicted in Figure 16 in Appendix C of Ref. [14], we must perform a series of controlled swaps a total of 4 times, once before and once after each round of the basis-changing rotations detailed in Appendix C 1. Each sequence of controlled swaps requires the application of N Toffoli gates, where N is the number of orbitals; however, a sequence of swaps targeting disjoint sets of qubits but controlled on the same qubit can be implemented in O(1) depth [70,75], taking the T-cost from a Toffoli count of 4N to a constant T-depth independent of N. This low depth of sequences of controlled swaps can also be exploited within each prepare subroutine used in the circuit, as the "alias sampling" version of prepare ends with a sequence of controlled swaps [11]. Achieving this depth requires no additional ancillae.
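A quick matrix-level sanity check of the claim, for a hypothetical 5-qubit example with one shared control and two disjoint target pairs: the two controlled swaps commute, so nothing prevents scheduling them in a single layer.

```python
import numpy as np

I2 = np.eye(2)

def kron(*ops):
    # Kronecker product of a list of operators (first factor = first qubit).
    out = np.array([[1.0]])
    for op in ops:
        out = np.kron(out, op)
    return out

SWAP = np.array([[1, 0, 0, 0],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=float)

# 5 qubits: qubit 0 is the shared control; (1,2) and (3,4) are disjoint targets.
P0 = np.diag([1.0, 0.0])   # |0><0| on the control
P1 = np.diag([0.0, 1.0])   # |1><1| on the control
cswap_12 = kron(P0, I2, I2, I2, I2) + kron(P1, SWAP, I2, I2)
cswap_34 = kron(P0, I2, I2, I2, I2) + kron(P1, I2, I2, SWAP)

commute = np.allclose(cswap_12 @ cswap_34, cswap_34 @ cswap_12)
```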
Log-depth comparators Each "alias sampling" version of prepare also uses a comparator to perform an inequality test between two registers of qubits. Comparators are built from adders; the trade-off between the count and depth of adders is detailed in Ref. [76]. The T-count of most adders has a linear dependence on the size of the addend registers, and depth-efficient adders tend to have worse constants than count-efficient adders. However, in the regime where performing T-gates in parallel is cheap, the time savings from log-depth adders over linear-depth adders dwarf the constant disadvantage in count. Our battery chemistry application falls squarely in this regime, so we opt for low-depth adders in our comparator constructions. The required number of ancillae for either the linear- or logarithmic-depth comparator is linear in the register size [76,77].
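To illustrate the regime argument with a toy cost model (the constants here are made up for illustration and are not the figures of Ref. [76]): a count-efficient adder with linear depth loses to a log-depth adder once depth, not count, sets the runtime.

```python
import math

# Purely illustrative cost model with hypothetical constants.
def count_efficient(n):
    return {"t_count": 4 * n, "t_depth": 2 * n}

def depth_efficient(n):
    return {"t_count": 10 * n, "t_depth": 4 * math.ceil(math.log2(n))}

n = 64  # addend register size in qubits
ce, de = count_efficient(n), depth_efficient(n)

# Serial-T regime: count dominates, so the count-efficient adder wins.
# Parallel-T regime: depth dominates, and the log-depth adder's constant
# disadvantage in count is dwarfed by its depth advantage.
depth_ratio = ce["t_depth"] / de["t_depth"]
```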

Optimizing for V_D vs. V_n
When considering T-depth D_T in addition to T-count n_T and qubit count n_L, one must decide which computational volume, V_n or V_D, to optimize. In order to get a sense of the magnitude of the computational volume improvements possible, it is worth considering the resource costs of three scenarios: optimizing for V_n and applying T gates serially; optimizing for V_n but applying multiple T gates in parallel; and optimizing for V_D and applying T gates in parallel. We provide tables with the resource estimates for each molecule and basis set (Tables IV, V, and VI), but here for each scenario we quote the smallest and largest classically intractable instances considered: the full configuration interaction (FCI) picture of PF6− in the cc-pVDZ basis as the smallest instance, and the FCI picture of LFEC in the cc-pVTZ basis as the largest.
Serial execution of T gates As an initial starting point (and to make a fair comparison with the estimates of prior studies [13,14]), it is informative to first note the algorithmic resource estimates in the case where we apply T gates serially; that is, optimizing for V_n and also taking V_n to be the computational volume. The T-counts and qubit counts for all molecules and basis sets are provided in Table IV. For the smallest (largest) instance we have: n_L = 16,382 (100,139) qubits and n_T = 2.32×10^12 (1.06×10^14) T gates. Thus, the computational volume for serial application of T gates ranges from O(10^16) to O(10^19).
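These order-of-magnitude volumes follow directly from the quoted counts:

```python
import math

# Counts as quoted above for the smallest and largest intractable instances.
smallest = {"n_L": 16_382, "n_T": 2.32e12}    # PF6-, cc-pVDZ (FCI)
largest = {"n_L": 100_139, "n_T": 1.06e14}    # LFEC, cc-pVTZ (FCI)

def volume(inst):
    # V_n = n_T * n_L for serial application of T gates.
    return inst["n_L"] * inst["n_T"]

def order(v):
    # Order of magnitude: floor of log10.
    return math.floor(math.log10(v))

# volume(smallest) ~ 3.8e16, volume(largest) ~ 1.1e19
```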
Swapping count for depth It is also informative to see what kind of volume savings are possible by merely swapping out n_T for D_T in the volume estimate; that is, minimizing V_n but taking the resulting V_D to be the computational volume. Performing this naïve substitution (without further depth optimization) and exploiting the difference between depth and count for the subroutines highlighted above, we can achieve an overall computational volume reduction of ∼8x for the smallest instance, ∼13x for the largest instance, and ∼10x for most instances. The detailed savings for each molecule and basis set are reproduced in Table V. Thus, the computational volume for parallel application of T gates while minimizing the T-count n_T ranges from O(10^15) to O(10^18).
Depth optimization Given that we are in a regime where we can opt to use V_D as our computational volume, it makes sense to see how far we can push the potential savings by optimizing for V_D instead of V_n. Optimizing for V_D (alternatively, performing the minimization prescribed in Eq. (7)), we ultimately find that the potential volume reduction is slightly greater than that offered by simply swapping out count for depth, resulting in savings between ∼9x and ∼15x. This is shown in detail in Table VI. Note that no qubit counts are provided, as (remarkably) the V_D-optimal volume uses the same number of qubits as the V_n-optimal volume.
The optimization of V_D consists of choosing an optimal set of parameters λ for each QROM with respect to this metric. As shown in Table VIII, a plurality of the computational cost of the algorithm comes from performing a large QROM; however, QROMs are also used throughout the qubitization procedure, both as standalone subroutines and as part of each prepare subroutine encountered in the circuit. Each of these data-loaders has its own λ that can be optimized either independently of or dependently on the others.

Various λ optimization strategies
The choice of λ greatly influences the overall resource estimates. Tuning the parameter λ for any specific QROM results in a trade-off between T-count and qubit count [70]; if it is large, the T-count is reduced at the expense of increasing the number of qubits, and vice-versa. However, another trade-off noted in Ref. [70] is trading depth for count: one can continue to increase λ past the point of minimal T-count, and decrease the T-depth at the expense of the number of qubits and the T-count. The choice of λ for each data-loader in the algorithm depends on whether one is trying to minimize the computational volume given as V_n = n_T × n_L or as V_D = D_T × n_L.

Minimizing V_n A natural choice in order to make a fair comparison with the methods of Refs. [11][12][13][14] (thus, assuming the serial application of T gates) is to choose λ to minimize n_T × n_L. The minimum is achieved by λ = O(√(M/(Nβ))), which balances the two terms of the QROM cost and yields a T-count and ancilla count for this data-loader that each scale as O(√(M·Nβ)). The exact numbers for the chemical systems introduced in Sections II A and II B are shown in Table IV.
For clarity, we note two points about these resource estimates. First, the resource estimates for the molecules and basis sets that we consider are generally larger than those provided in Refs. [13,14] (especially for the larger basis sets cc-pVDZ and cc-pVTZ). This is to be expected, as the factorizations of the Hamiltonians of these molecules yield larger values for the ranks R and M, and the norm α, than those for molecules the size of FeMoco; additionally, those references consider the Toffoli count whereas here we consider the T-count (which is a factor of ∼4x larger). Second, we note that the values in Table IV assume the serial application of magic states (and so these estimates do not make use of the algorithmic and architectural improvements we outline in Sections III B and III E). In order to achieve the optimal space-time costs for these computations, it is not obvious that one should opt to minimize n_T × n_L. As shown in Fig. 1, for the T-counts and qubit counts quoted in Table IV, the estimated size of the magic state factories necessary to distill T-gates is remarkably small. In such a regime, it may be preferable to minimize D_T × n_L and parallelize the application of magic states.
Minimizing V_D The optimization of V_D consists of choosing an optimal set of parameters λ with respect to this metric. Recall that a plurality of the computational cost of the algorithm comes from performing a large QROM; however, QROMs are also used throughout the qubitization procedure, both as standalone subroutines and as part of each prepare subroutine encountered in the circuit. Each of these data-loaders has its own λ that can be optimized independently of the others. Previously, this independent optimization was done in order to minimize V_n for each of these subroutines; a natural consideration is to instead choose each λ to minimize V_D for each QROM.
However, this naïve approach does not lead to the smallest computational volume. Instead, to achieve the smallest volumes shown in Table VI, we found that it was most beneficial to optimize each λ contingent on all other λ values. The strategy is to determine which QROM, with λ chosen to minimize Eq. (B3), contributes the largest number of ancillary qubits, Q_max = max_i λ_i b_i. We then assign all other λs by dividing Q_max by the bits of precision b_i of each QROM. By finding the maximum number of ancillae used in a data-loader, Q_max, this ensures that we have enough qubits allocated to borrow from in order to later set λs for the rest of the QROMs that may indeed be larger than their λ_min-count value. It is noted in Ref. [70] that one can increase the value of λ in a data-loader beyond the λ_min-count that minimizes Eq. (B3) in order to minimize depth at the expense of count. This is exactly what we do in order to minimize V_D; note that the counts reproduced in Table VI exceed those reproduced in Table V (hence the enormous ratios in the penultimate column of Table VI).
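A minimal sketch of this dependent assignment strategy, using the illustrative K/λ + bλ cost model from above (the (K, b) values in `qroms` are hypothetical, not taken from our Hamiltonians):

```python
import math

def lambda_min_count(K, b, max_exp=20):
    # Power-of-two lambda minimizing the illustrative QROM cost K/lam + b*lam.
    return min((math.ceil(K / 2**k) + b * 2**k, 2**k)
               for k in range(max_exp + 1))[1]

def assign_lambdas(qroms):
    # Dependent strategy: find the data-loader whose count-minimizing lambda
    # uses the most ancillae (Q_max = lam * b), then give every QROM
    # lambda = Q_max / b so that all of them can borrow from that ancilla pool.
    q_max = max(lambda_min_count(K, b) * b for K, b in qroms)
    return [max(1, q_max // b) for K, b in qroms]

# Hypothetical list of (K items, b bits) per data-loader.
qroms = [(10**6, 32), (10**4, 16), (500, 8)]
lams = assign_lambdas(qroms)
```

The smaller data-loaders end up with λ values well above their own count-minimizing choice, which is precisely the depth-for-count trade described above.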
Flexibility The conclusion of the two preceding subsections is not that one should necessarily optimize V_D instead of V_n. This decision depends not only on the absolute smallest computational volume, but also on a number of other factors. Depending on one's architecture, one may employ alternative optimization strategies.
To give two brief examples, one could instead determine λ_min-depth values that minimize the V_D of each QROM. Overall, this leads to similar decreases in computational volume, but a larger T-depth and approximately an order of magnitude reduction in qubit count compared to the values reproduced in Table VI. As another example, opting for the strategy for minimizing V_n may be preferable to that for optimizing V_D if one is concerned about the overall number of magic state factories on one's device. One would need on average (T-count/T-depth) magic state factories in order to fully exploit the speedups presented in these sections; this ratio is markedly large, between 20 and 50, for the strategy that minimizes V_D (shown in detail for each molecule in the penultimate column of Table VI).
When assessing potential volume reductions by considering T-depth in addition to T-count, there is a good deal of flexibility depending on the relative importance one gives to qubit count, number of magic state factories, etc. All of these potential savings are close in magnitude, so it is not obvious which optimization strategy one should opt for. The underlying architecture of a given approach may best inform which strategy to use. As yet, it is unclear how the transversal gates used for constant-time PPMs will affect the threshold and sub-threshold scaling of the codes used, and the potential runtime speedups will greatly depend on the timescales one expects of physical operations on a specific platform; we leave these studies for future work.

The resource estimate that produced the numbers in Table IV was based on the sub-threshold scaling behavior of the FBQC scheme in Ref. [20], making use of one of the quantum computing architectures in Ref. [25]. We propose an alternative architecture in this section, in the hope that it may reduce the overall space-time cost of the algorithm. As we shall later see, we have reason to believe that the newly proposed architecture can reduce the computation time without incurring a significant amount of extra footprint. However, let us emphasize that, unlike the results above, we have not carried out a Monte Carlo study on the threshold and sub-threshold behavior of this new scheme. A fair comparison between the two can be made only via a rigorous numerical study, which we leave for future work.
We previously observed that if we use a state-of-the-art quantum algorithm to simulate the molecules relevant to battery research, the magic state factory constitutes only up to 2% of the entire quantum computer. Therefore, if magic state distillation is the main bottleneck of the quantum computation, one may be able to reduce the computation time by increasing the number of magic state factories and injecting the distilled magic states appropriately.
Optimal injection of multiple distilled magic states is a nontrivial problem, since what is best can depend on a number of different factors, including the specific magic state distillation protocol used, the routing of qubits, and the way in which Clifford gates are implemented. One possibility is to simply distill the |T⟩ = (|0⟩ + e^{iπ/4}|1⟩)/√2 state and inject it. Alternatively, one can distill a |CCZ⟩ = CCZ|+++⟩ state [24,73,78], which can be readily and efficiently used in the data-loader described in the discussion on parallelization above [13,14].
In this section, we propose a different scheme, based primarily on Litinski's scheme [25,26]. Litinski's scheme uses Pauli Product Measurement (PPM) and preparation of states in the set {|0⟩, |+⟩, |T⟩}. PPM refers to a non-destructive measurement of an observable of the form P = P_1 ⊗ P_2 ⊗ ... ⊗ P_N, where P_n ∈ {I, X, Y, Z} and P_n is the logical Pauli operator of the n'th qubit, with n ranging from 1 to N. By employing the scheme of Ref. [62], one can speed up the computation. However, in this scheme, a k-fold increase in speed necessitates a k-fold increase in footprint.
The main result of this section is an O(1)-time implementation of an arbitrary PPM which does not incur such additional footprint. Moreover, all the physical operations can be made local in two spatial dimensions. Therefore, optimistically speaking, in the regime in which |T⟩-state factories are abundant, so that more than one |T⟩ state can be distilled in a single logical clock cycle, one can expect to be able to inject them all in a single logical clock cycle, provided that there are no more than O(d) of them. The rest of this section is structured as follows. We: (a) discuss an abstract circuit model that implements a PPM; (b) explain how this protocol can be implemented on the surface code [54]; (c) explain how this method can be used to compile a general quantum algorithm; and (d) discuss the expected speedup using this approach. We remark that while the fault-tolerant protocols in this section are described in terms of surface codes (for simplicity and familiarity), they may readily be expressed in FBQC in terms of the 6-ring network of Section II F.

PPM Circuit
To understand our protocol, it is helpful to first consider an abstract circuit model which captures its spirit. Suppose we have N qubits and we would like to implement a PPM. Without loss of generality, we can write down a Pauli as P = i^s ∏_j X_j^{m^X_j} Z_j^{m^Z_j}, where s ∈ {0, 1, 2, 3} and m^X_j, m^Z_j ∈ {0, 1}. Hypothetically, if we were to measure the individual P_j s, we could use an ancillary qubit initialized in the |+⟩ state and apply a controlled entangling gate for each P_j: CX for P_j = X, CZ for P_j = Z, and both together for P_j = Y, where the control is the ancillary |+⟩ state and the target is the qubit being measured. Here the S† gate acts on the control qubit. The measurement is performed in the X basis. One can verify that Z is applied to the ancillary state if and only if the qubit being measured is in the −1 eigenstate of the chosen Pauli.
Of course, what we really want is to measure P without measuring the individual P_j s explicitly. For that purpose, one can use the |CAT⟩ state, defined as |CAT_N⟩ = (|0⟩^⊗N + |1⟩^⊗N)/√2. Now, one can pair up the j'th qubit that forms the |CAT⟩ state with the j'th data qubit and apply the gates in Eq. (D3). If we then measure each of the cat qubits in the X basis, the parity of those measurement outcomes is precisely the non-destructive measurement outcome of P; see Fig. 5 for an example.
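A small statevector check of this construction, for N = 2 and P = Z ⊗ Z (a sketch; the qubit ordering and gate choices below are our own conventions, not a reproduction of Fig. 5):

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)

def cz(n, c, t):
    # CZ between qubits c and t on n qubits (qubit 0 = most significant bit).
    dim = 2**n
    U = np.eye(dim)
    for i in range(dim):
        bits = [(i >> (n - 1 - q)) & 1 for q in range(n)]
        if bits[c] and bits[t]:
            U[i, i] = -1.0
    return U

def x_parity_distribution(data_bits):
    # Qubits 0,1 = cat ancillae, qubits 2,3 = data register |d1 d2>.
    psi = np.zeros(16)
    for a in (0b00, 0b11):               # |CAT_2> = (|00> + |11>)/sqrt(2)
        idx = (a << 2) | (data_bits[0] << 1) | data_bits[1]
        psi[idx] = 1 / np.sqrt(2)
    psi = cz(4, 0, 2) @ cz(4, 1, 3) @ psi            # P_j = Z gates of Eq. (D3)
    psi = np.kron(np.kron(H, H), np.eye(4)) @ psi    # X-basis meas. = H then Z
    probs = np.abs(psi) ** 2
    parity = {0: 0.0, 1: 0.0}
    for i, p in enumerate(probs):
        a1, a2 = (i >> 3) & 1, (i >> 2) & 1
        parity[(a1 + a2) % 2] += p
    return parity

even_for_00 = x_parity_distribution([0, 0])[0]   # Z⊗Z eigenvalue +1 => even
odd_for_01 = x_parity_distribution([0, 1])[1]    # Z⊗Z eigenvalue -1 => odd
```

For eigenstates of P the outcome parity is deterministic and equals the eigenvalue, as claimed.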

Fast PPM for surface code
Now, let us take a step back from this abstract circuit picture and think about how one can implement all the procedures we have explained so far fault-tolerantly, in particular using patches of surface code [54]. Recall that the surface code is defined on a lattice with open boundary conditions, with two "smooth" and two "rough" boundaries. The qubits are arranged on the edges of the lattice. The surface code is a stabilizer code with a stabilizer group generated by a set of "star" operators, which are tensor products of Xs along the edges incident on a vertex, and a set of "plaquette" operators, which are tensor products of Zs along the edges that surround a plaquette; see Fig. 6. Since our goal is to implement the circuit in Fig. 5 in constant time, we must understand whether the gates that appear in this circuit diagram can be implemented in constant time on the surface code. We shall see towards the end of this section that preparation of an analogue of the |CAT_N⟩ state (as opposed to the exact |CAT_N⟩ state over the logical qubits) and measurement in the X-basis can be done in constant time. So let us focus instead on whether we can implement the unitary gates in Eq. (D3).
Because the surface code is a CSS code, it allows for a straightforward implementation of a transversal CNOT gate. Moreover, the CZ gate can be viewed as a CNOT gate conjugated by Hadamards. By folding a surface code in half along its diagonal line, it is possible to implement the logical Hadamard by applying a transversal Hadamard gate on all the qubits followed by a local relabeling of the qubits [79] (which builds upon the local unitary equivalence between the color code and the surface code [80]). We remark that this folding operation is accessible in an interleaved architecture [21]. Therefore, the CZ gate can also be implemented in constant time. Implementation of S can also be done in constant time, either by injecting a +1 eigenstate of the Pauli-Y operator, or by applying a transversal gate in the folded surface code picture. The transversal-gate option, however, requires an extra entangling gate which may adversely affect the threshold and the sub-threshold scaling. On the other hand, the injection requires an additional logical qubit for every logical data qubit, incurring up to a 50% increase in the footprint.
Fortunately, it is possible to have the best of both worlds. Below, we explain a method to modify Eq. (D3) so that one does not need to apply the S† gate explicitly. The price we have to pay is modest; we only need two additional logical qubits: one that keeps the +1 eigenstate of Y, and another dedicated to increasing the size of the |CAT⟩ state to N + 1.
The key idea is to not apply the S† explicitly, but instead update the observable that needs to be measured. In this case, the Pauli string simply becomes a tensor product of Xs and Ys. In the surface code, measurement in the X basis is easy, but measurement in the Y basis is less straightforward; the former can be done in constant time, but to the best of our knowledge, measurement in the Y basis requires an implementation of a single-qubit Clifford gate such as H or S. However, this Pauli string (say Q), up to the stabilizer of the |CAT⟩ state, is "almost" equal to a Pauli string consisting purely of Xs. An important observation is that the |CAT_{N+1}⟩ state obeys the stabilizer constraint Z̄_n Z̄_m = +1 for all n, m ∈ {1, ..., N + 1}, both before and after the gates in Eq. (D3) are applied. Therefore, if Q has an even number of Ys, we can pair up the Ys and multiply Q by the corresponding Z̄_n Z̄_m stabilizers to obtain, up to a phase, a string consisting purely of Xs, where n_Y is the number of Ys in Q. If n_Y = 0 mod 2, up to a phase, the joint parity of Q can be measured by measuring all the qubits in the X-basis, an operation that can be performed in constant time. If n_Y is odd, then there will be one leftover Y. In that case, one can apply the gate corresponding to the P_j = Y case for the (N + 1)'th data qubit, which will be initialized to the +1 eigenstate of Y. This makes the last Pauli in Q equal to Y, making n_Y even. Moreover, the +1 eigenstate of Y is returned to its original state after this process, and can be reused later; see Fig. 7 for an example. Therefore, we have seen that one can perform a PPM using a single |CAT⟩ state and a +1 eigenstate of Y, as well as CNOT, CZ, and measurement in the X-basis. Once the eigenstate of Y is prepared at the beginning of the computation, if the |CAT⟩ state can be prepared in constant time, every operation discussed so far can be done in constant time. Next, we explain how the state preparation can be carried out, thus completing the description of our protocol.
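The pairing argument rests on the operator identity Y⊗Y = −(X⊗X)(Z⊗Z), which a quick numpy check confirms on a three-qubit example (illustrative; explicit 2×2 matrices):

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I = np.eye(2, dtype=complex)

def kron(*ops):
    # Kronecker product of a list of single-qubit operators.
    out = np.array([[1.0 + 0j]])
    for op in ops:
        out = np.kron(out, op)
    return out

# A pair of Ys equals minus a pair of Xs times a ZZ operator. On a state
# stabilized by Z_n Z_m (such as the cat state), a string like X⊗Y⊗Y can
# therefore be measured as the all-X string X⊗X⊗X, up to a known sign.
lhs = kron(X, Y, Y)
rhs = -kron(X, X, X) @ kron(I, Z, Z)
pairs_match = np.allclose(lhs, rhs)
```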
Naïvely, preparation of the |CAT⟩ state seems to require O(d) time. A straightforward way to do this is to initialize all the logical qubits in the |+⟩ state on a one-dimensional array and measure the joint parities Z̄_j Z̄_{j+1}. While the first step requires constant time, the second step requires O(d) time if we insist on interacting the surface code patches only via their boundaries, using lattice surgery [44].
However, for our purposes, it is possible to get away with preparing the |+⟩ state over a long surface code patch; see Fig. 8. Of course, this long surface code patch is not exactly the |CAT⟩ state. However, they are equivalent up to a Pauli-X measurement along the qubits that connect the neighboring patches. Upon measuring these qubits, one must eventually apply a correction.

FIG. 8. The ancilla block is initially prepared on a "long" surface code patch. By measuring the red qubits and applying a Pauli correction, one can prepare the |CAT⟩ state over the ancilla block.
These partial X measurements by themselves do not seem to make the preparation of the |CAT⟩ state fault-tolerant. However, used in conjunction with the final X-basis measurements on the remaining qubits in the ancilla block, we can measure the Pauli product operator fault-tolerantly. This can be seen by inspecting the stabilizer generators with support strictly on the ancilla block after applying the gates described in Appendix D 3. One can verify that these include all the star operators of the long ancilla block. Moreover, while errors will undoubtedly spread from the data block, they will spread in a local manner thanks to the transversal nature of the gates used. Therefore, the standard argument in surface-code-based error correction [55] implies that one can simply measure the entire ancilla block in the X-basis and decode, thus fault-tolerantly extracting the measurement outcome of the logical X operator of the long ancilla block. This measurement outcome is precisely the PPM. Since the preparation of the |CAT⟩ state on the long ancilla block can be performed in O(1) time, by the discussion in Appendix D 3, we conclude that it is possible to perform an O(1)-time fault-tolerant PPM.
To summarize, we have shown that an arbitrary fault-tolerant PPM can be performed in constant time. Moreover, all of these operations can be implemented on a planar architecture, involving two folded layers of surface code patches that are superimposed; see Fig. 9.
FIG. 9. After folding all the surface code patches [79], we obtain two layers of folded surface code patches.

Injection of distilled magic states can be implemented by simply increasing the number of logical qubits in the data and the ancilla blocks. Specifically, suppose we can create m |T⟩ states in a single logical clock cycle. It will suffice to add m extra logical qubits in the data and
the ancilla block and store the magic states in the newly allotted qubits in the data block. Since the number of magic state factories is typically much smaller than the number of logical qubits, this additive overhead is expected to be small. Moreover, the magic state factories can be spaced out sufficiently far apart from each other to avoid any routing issues. In the regime studied in this paper, once the number of magic state factories exceeds ∼ 10, the gain from having an extra magic state factory is expected to be insignificant. Since the smallest number of logical qubits needed exceeds a few thousand, we can take the average spacing between the magic state factories to be at least a few hundred, enough room to have independent magic state factories that do not interfere with each other. Of course, one may opt to use more sophisticated distillation protocols that generate multiple magic states at the same time [47][48][49]51]. In those cases, the extra logical qubits will need to be physically close to each other.
One may worry that assigning a fixed location to the magic states will cause a routing problem; this is not the case. Because we can apply arbitrary PPMs, the location of the magic states does not matter.

Compilation and Execution
We now explain how the newly introduced PPM can be used in practice to compile a quantum algorithm. What we describe below mostly follows Litinski's observations [25]. Suppose we are given a description of a quantum circuit in terms of Clifford and T-gates. One can, without loss of generality, rewrite this circuit as a sequence of gates of the following form: e^{-iπP/8}, (D7) where P is a Pauli operator, followed by Cliffords and single-qubit measurements. Instead of applying the Cliffords directly, one can simply replace the single-qubit measurements by PPMs. Moreover, Litinski showed that a gate in Eq. (D7) can be implemented by consuming a single |T⟩ state and a PPM, followed by a Clifford correction. Therefore, an arbitrary quantum computation can be expressed as a sequence of PPMs on |T⟩ states and a data block, followed by a Clifford correction.

In fact, it is possible to remove the need to apply the Clifford correction entirely. The idea is to dynamically update the gates used in the circuit. Without loss of generality, suppose we have converted the circuit into the form described above, namely, a sequence of PPMs followed by Clifford corrections of the following form: e^{-iπkP/4} (D8) for some k ∈ Z, where P is the Pauli being measured. One can view this as a low-level instruction to the fault-tolerant quantum computer. Given such an instruction, we can execute the circuit on a quantum computer in the following way. Define the Clifford frame, a lookup table of size 2N × (2N + 1), where N is the number of qubits in the circuit. The Clifford frame is similar to the tableau first introduced by Aaronson and Gottesman [81] and is a compact representation of a Clifford gate. Since the Clifford group generates an automorphism on the Paulis, it suffices to specify their action on the generators, which we choose to be X̄_n and Z̄_n for n = 1, . . . , N, each representing the logical X- and Z-operators of each surface code patch.
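The rewriting above relies on conjugating Pauli operators through Clifford gates. A minimal sketch of this operation, using the standard symplectic bit representation (an illustration of the general technique, not the paper's implementation; signs and phases are dropped for brevity, so results hold only up to a global phase):

```python
def conjugate(x: int, z: int, gates):
    """Return the bitmasks of C P C† (up to phase) for the Pauli with
    X-mask x and Z-mask z, with the gate list applied in time order.
    Supported gates: ("H", q), ("S", q), ("CNOT", control, target)."""
    for gate in gates:
        kind = gate[0]
        if kind == "H":
            q = 1 << gate[1]
            if bool(x & q) != bool(z & q):   # H swaps X and Z; Y is fixed up to sign
                x ^= q
                z ^= q
        elif kind == "S":
            q = 1 << gate[1]
            if x & q:                        # S: X -> Y, Y -> X (up to sign); Z fixed
                z ^= q
        elif kind == "CNOT":
            c, t = 1 << gate[1], 1 << gate[2]
            if x & c:                        # X_c -> X_c X_t
                x ^= t
            if z & t:                        # Z_t -> Z_c Z_t
                z ^= c
    return x, z

# Example: conjugating Z_0 through H on qubit 0 followed by CNOT(0, 1)
# yields X_0 X_1, a stabilizer of the Bell pair this circuit prepares.
print(conjugate(0b00, 0b01, [("H", 0), ("CNOT", 0, 1)]))  # (3, 0)
```

The inverse direction C† P C is obtained by reversing the gate list and inverting each gate; up to the dropped signs, H and CNOT are their own inverses under these rules.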
It is convenient to view the Clifford frame as a table that tells us how to convert each PPM and Clifford correction based on the Clifford corrections that precede them. Specifically, let X_n and Z_n be the n-th Pauli operators appearing in the instruction. The Clifford frame converts these Paulis as X_n → i^s ∏_j X̄_j^{m^X_j} Z̄_j^{m^Z_j}, where m^X_j, m^Z_j ∈ {0, 1} and s ∈ {0, 1, 2, 3}, and similarly for Z_n. The Clifford frame table contains this data; see Table IX for an example.

When executing a quantum circuit, the instruction will remain as is, but the Clifford frame will change dynamically. In the beginning, the Clifford frame will be initialized to the identity map, X_n → X̄_n and Z_n → Z̄_n for every n. In this very first step, we will have a PPM, which can be implemented using either the procedure described above or the one in Ref. [25]. Then, suppose the result of the PPM indicates that a Clifford update must be carried out. Instead of applying this Clifford, say C, we can update all the remaining PPMs and Clifford corrections by updating the Pauli that underlies them as P → C†PC. (D11) However, the length of the instruction sequence for practical quantum algorithms will be lower bounded by the number of T-gates used, which is often on the order of at least n_T = 10^10 even for state-of-the-art algorithms [13,14]. Updating every gate appearing afterwards would lead to a number of updates of order O(n_T^2) ≈ 10^20, a significant amount of computation even for a modern computer; this is an unwieldy approach. Instead, it is better to update the table, as in Table IX. This amounts to updating X_n → C†X_nC and Z_n → C†Z_nC for every n.
Happily, such a calculation is extremely simple. For instance, given X_n, if it commutes with the Pauli operator P appearing in the correcting Clifford defined in Eq. (D8), no update is needed. Checking the commutation relation requires only O(1) bitwise operations and parity calculations on N-bit strings. If they do not commute, X_n is updated to X_nP up to a phase. Both the phase and the updated Pauli can also be computed using O(1) bitwise operations and parity calculations on N-bit strings. Thus, the total amount of classical computation scales as O(N) parity calculations and bitwise operations over N-bit strings.
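The bitwise check and update described above can be sketched as follows. This is an illustration under the standard symplectic bit representation, not code from the paper; a full implementation would also track the phase, which is dropped here.

```python
# A Pauli on N qubits is a pair of N-bit integers (x, z); two Paulis commute
# iff the symplectic inner product popcount(x1 & z2) + popcount(z1 & x2)
# is even. Both checking and updating reduce to a handful of bitwise
# operations and parity calculations on N-bit strings, as stated above.

def commutes(x1: int, z1: int, x2: int, z2: int) -> bool:
    """True iff the two Paulis commute (symplectic product is even)."""
    return (bin(x1 & z2).count("1") + bin(z1 & x2).count("1")) % 2 == 0

def update_row(x: int, z: int, px: int, pz: int):
    """Update one Clifford-frame row under a correction built on Pauli P:
    rows commuting with P are untouched; otherwise the row becomes (row)·P,
    up to a phase that a full implementation would track."""
    if commutes(x, z, px, pz):
        return x, z
    return x ^ px, z ^ pz            # Pauli multiplication is a bitwise XOR

# Example: the row X_0 anticommutes with P = Z_0, so it becomes Y_0 = X_0 Z_0.
print(update_row(0b01, 0b00, 0b00, 0b01))  # (1, 1)
```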
Let us emphasize that this calculation can be trivially parallelized to O(1) parity calculations and bitwise operations distributed over 2N processors. Moreover, modern CPUs can perform such operations on thousands of bits in nanoseconds. Therefore, we do not expect the update of the Clifford frame to be a significant bottleneck for the computation.

Speedups
At first, one may think that a sequence of PPMs can be implemented in O(1) time, independent of whether they commute with each other. While we cannot completely discount this possibility, we will demand that the sequence of PPMs satisfy an important constraint: if we were to implement k PPMs in a single logical clock cycle, all the Paulis that underlie the PPMs must commute with each other.
If they do not, we encounter the following difficulty. Decoding a logical qubit requires syndrome measurement outcomes in a d × d × d block, where the last factor of d comes from time [55]. Upon collecting all of these measurement outcomes and decoding, one must apply a correction operation. If certain layers of the d × d × d block undergo the transversal gates described above, the measurement outcome of the ancilla block may, depending on the outcome of the decoding, have to be retroactively changed. Doing so would necessitate changing a PPM outcome that occurred in the past, which would in turn signal that a Clifford correction in Eq. (D8) was applied incorrectly, resulting in an incorrectly updated Clifford frame. The incorrect Clifford frame may, in turn, have already triggered an incorrect PPM.
Fortunately, such a problem is moot if the Paulis that underlie the k PPMs commute with each other. In that case, the Clifford corrections also commute with the PPMs; see Eq. (D8). Therefore, the Clifford corrections do not change the PPMs within that block. One can simply perform all the PPMs and then update the Clifford frame before we move to the next d × d × d block.
With this constraint in mind, the speedup we can achieve using this approach is limited by the ratio of the T-count to the T-depth. In our setup, we can optimistically expect up to a 15-fold speedup with a negligible change in the footprint of the device.
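A greedy scheduler makes the constraint and the resulting speedup concrete. The sketch below is my illustration, not the paper's compiler: it packs a PPM sequence into layers of mutually commuting Paulis, with a per-cycle cap standing in for the O(d) limit on simultaneous PPMs; the attainable speedup is then the number of PPMs divided by the number of layers, bounded by the T-count/T-depth ratio.

```python
def commutes(x1: int, z1: int, x2: int, z2: int) -> bool:
    """Symplectic commutation test for Paulis stored as (x, z) bitmasks."""
    return (bin(x1 & z2).count("1") + bin(z1 & x2).count("1")) % 2 == 0

def layer_ppms(ppms, max_per_cycle: int):
    """ppms: list of (x, z) Pauli bitmask pairs in program order.
    Start a new layer whenever the next PPM anticommutes with a PPM already
    in the current layer, or the layer has reached the per-cycle cap, so
    every layer consists of mutually commuting PPMs."""
    layers = [[]]
    for x, z in ppms:
        cur = layers[-1]
        if len(cur) >= max_per_cycle or any(
                not commutes(x, z, x2, z2) for x2, z2 in cur):
            cur = []
            layers.append(cur)
        cur.append((x, z))
    return layers

# Example: the sequence X_0, X_1, Z_0, Z_1 packs into two commuting layers,
# a 2-fold speedup over serial execution.
ppms = [(0b01, 0), (0b10, 0), (0, 0b01), (0, 0b10)]
print(len(layer_ppms(ppms, max_per_cycle=15)))  # 2
```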
Another constraint is the code distance. At best, one can apply O(d) PPMs in a single logical clock cycle using our method. The estimated code distance in Section III C ranged from 24 to 44, suggesting that the best-case scenario of a 15-fold speedup may be attainable. However, note that whether this is possible or not depends on many microscopic details, such as the timescale needed for different physical operations. Moreover, we would like to emphasize that the threshold and the sub-threshold behavior of the code may change when we implement the transversal gates. The exact extent to which we can speed up the computation using our approach can only be determined by carefully inspecting all of these different factors. These studies are beyond the scope of this paper and are left for future work.
Keeping these caveats in mind, we can estimate the optimistic computation time for a single phase estimation for the molecules described below to be on the order of 1–4 hours for the cc-pVDZ basis set and 35–90 hours for the cc-pVTZ basis set.