Fault-tolerant quantum simulation of materials using Bloch orbitals

The simulation of chemistry is among the most promising applications of quantum computing. However, most prior work exploring algorithms for block-encoding, time-evolving, and sampling in the eigenbasis of electronic structure Hamiltonians has either focused on modeling finite-sized systems, or has required a large number of plane wave basis functions. In this work, we extend methods for quantum simulation with Bloch orbitals constructed from symmetry-adapted atom-centered orbitals so that one can model periodic \textit{ab initio} Hamiltonians using only a modest number of basis functions. We focus on adapting existing algorithms based on combining qubitization with tensor factorizations of the Coulomb operator. Significant modifications of those algorithms are required to obtain an asymptotic speedup leveraging translational (or, more broadly, Abelian) symmetries. We implement block encodings using known tensor factorizations and a new Bloch orbital form of tensor hypercontraction. Finally, we estimate the resources required to deploy our algorithms to classically challenging model materials relevant to the chemistry of Lithium Nickel Oxide battery cathodes within the surface code.

Recently, first quantization quantum algorithms and constant factor resource estimation analysis for molecular systems [1,2] have been adapted to materials [3]. While the first quantization approach using a plane wave representation is attractive due to the smooth convergence to the continuum limit [4,5] a local basis representation such as atomcentered basis sets has other advantages. Similar to the molecular simulation setting, local basis functions can be advantageous when describing spatially localized phenomena such as heterogeneous catalysis or efficiently describing cusps [6].
The desire for systematically improvable electronic structure methods to treat the many examples of strongly correlated phenomena [7][8][9] in the condensed phase has recently driven the application of ab initio wavefunction theories to the periodic setting [10][11][12][13][14][15][16][17][18][19][20][21][22]. Standard treatments of symmetry in wavefunction theories [23,24] can be used to exploit the translational symmetry of periodic systems, thus enabling the application of post-Hartree-Fock methods to material systems. Despite these advantages, classical ab initio treatment of such problems is limited due to the large simulation cells needed to converge to the thermodynamic limit. This drawback has further driven the use of embedding theories [25][26][27][28][29][30] and downfolding [31]. Naturally, one may ask if fault-tolerant quantum computers can alleviate the computational burden associated with ab initio simulation of solids within the local basis framework.
In this paper, we describe how to extend molecular quantum simulation algorithms of second quantization Hamiltonians represented in local basis sets to periodic systems using the qubitization framework [32,33]. Though the general structure of the algorithms is largely unchanged, introducing symmetry-i.e. symmetry-adapting the block encodings-requires non-trivial modifications to realize an improvement in the asymptotic complexity. The first steps in this direction were taken in Ref. [34] using the "sparse" Hamiltonian representation. We provide an alternative derivation for block encodings using this representation and introduce symmetry-adapted block encodings for three other more performant tensor factorizations of the Hamiltonian: single factorization (SF), double factorization (DF) and tensor hypercontraction (THC). The result is orders of magnitude improvement in the quantum resources required to simulate materials.
For each of the four Hamiltonian representations we describe the origin of the asymptotic speedup (or lack thereof in one case), provide compiled algorithms for constant factor resource estimates, and compare the performance to non-symmetry-adapted block encodings. We note that the derived symmetry-adapted block encodings apply to any Abelian point group symmetry with minor modifications. For SF, sparse, and DF the symmetry-adapted block encodings provide an asymptotic speedup for walk operator construction proportional to the square root of the number of k-points used to sample the Brillouin zone. For THC, there is no asymptotic improvement due to the linear cost of unary iteration in the block encoding. Going beyond asymptotic analysis and compiling to total Toffolis, we find that for DF and THC using symmetry-adapted block encodings provides no asymptotic speedup over their nonsymmetry-adapted counterparts due to the increased number of applications of the walk operator for fixed precision phase estimation. DF and THC are sensitive to the numerical compression of the Hamiltonian, and thus we expect the number of walk operator applications can be decreased. Furthermore, there are classical advantages to using the symmetry-adapted block encodings coming from the reduced classical complexity of representing the Hamiltonian as the system size is increased towards the thermodynamic limit.
In parallel with recent studies estimating quantum resources required to simulate high-value molecular targets [32,[35][36][37], we estimate the quantum resources required to simulate an open materials science problem related to the cathode structure of Lithium Nickel Oxide (LNO) batteries. The LNO systems are universally observed in the high symmetry R3m structure which is at odds with the predicted Jahn-Teller activity of low-spin trivalent Ni [38]; more background can be found in Section V A. This discrepancy combined with the difficulty of synthesizing pure LNO, the size of the unit cells [39], and potential strong correlation at the high symmetry structure [40] makes the LNO problem an interesting application target for quantum simulation advantage. This realistic problem frames the algorithmic improvements articulated in this paper and the prospects of the quantum advantage given modern electronic structure methods. We find that the required resource estimates for simulating a set of benchmark systems and the LNO problem before reaching the thermodynamic limit are already substantial. In fact, the large simulation cells required to converge these calculations to the thermodynamic limit is ultimately a significant hurdle for ab initio simulations.
The layout of the rest of the paper is as follows: Section II describes the atom-centered basis sets and the Hamiltonian that we use, Section III describes the qubitization algorithm and the origin of the asymptotic speedup in constructing walk operators using each of the four Hamiltonian representations. Each subsection is dedicated to a particular Hamiltonian factorization and describes the qubitization algorithm and how to calculate associated parameters. Section IV compares all methods and extrapolates quantum resources required to simulate a diamond crystal converged towards the thermodynamic limit, and Section V reports the accuracy and correlation analysis of various electronic structure methods for LNO while providing estimates of quantum computing resources and runtimes. We close with prospects for this class of methods.

II. ELECTRONIC STRUCTURE HAMILTONIAN OF MATERIALS IN BLOCH ORBITALS
Though plane-wave basis sets are used in most periodic Density Functional Theory (DFT) calculations, there is a long history of local-basis methods as well. The use of a localized basis set has a number of advantages over plane waves: 1 ) 0D (molecular), 1D, 2D, and 3D systems can be treated on an equal computational footing, 2 ) Calculations on low-density systems with large unit cells can be more efficient [41][42][43] 3 ) Hartree-Fock exchange can be more efficiently computed in the smaller, local-orbital basis [42][43][44][45] and 4 ) The local-orbital representations can lower the computational cost of correlation corrections with a more compact representation of the virtual space. (1 ) - (3 ) have spurred the development of local-orbital DFT and Hartree-Fock methods with Gaussian orbitals [42][43][44]46] and numerical atomic orbitals [47], while (4 ) has been behind recent work to apply correlated electronic structure theory to periodic solids [10,11,[13][14][15][16][17]. In the following subsection we describe the symmetry-adapted periodic sum of Gaussian-type orbitals used in this work.

A. Basis functions and matrix elements
A local basis function,χ p , can be adapted to the translational symmetry of a lattice to form a periodized function where T represents a lattice translation vector and k is a crystal momentum vector lying in the first Brillouin zone. The lattice momentum k labels an irreducible representation of the group of translations defined by the translational symmetry of the material. Functions of this form are easily verified to be Bloch functions in that χ p,k (r) = e ik·r u p,k (r) (2) where u p,k (r) has the same periodicity as the lattice. Orbitals are constructed from a linear combination of the underlying Bloch orbitals, where N k is the total number of k points. The expansion coefficients, c p,i (k) are determined from the appropriate periodic self-consistent field procedure, usually Hartree-Fock or Kohn-Sham DFT. The resulting orbitals are normally constrained to be orthogonal by convention and can serve as a basis for representing the second-quantized Hamiltonian. The matrix elements of a one-electron operator, T pkp,qkq = dr φ * pkp (r)O 1 φ qkq (r) (4) are non-zero only when k p = k q as long as O 1 has the translational symmetry of the lattice. We can use a similar strategy to derive the structure of the two-electron integrals which are given by V pkp,qkq,rkr,sks = dr 1 dr 2 φ * pkp (r 1 )φ qkq (r 1 )O 2 φ * rkr (r 2 )φ sks (r 2 ).
The translational symmetry of the Bloch orbitals implies the 2-electron operator O 2 matrix elements can only be nonzero when (k p + k r − k q − k s ) = G where G is a reciprocal lattice vector. We note that this expression for nonzero matrix elements by symmetry is a specific instance of the more general expression. More generally, given a group g with its irreducible representations labeled by {Γ i }, the two-electron integral is nonzero by symmetry whenever Γ p ⊗ Γ q ⊗ Γ r ⊗ Γ s contains the complete symmetric representation [23]. For periodic systems g is the set of translational symmetries. Despite this sparsity, the evaluation of the nonzero matrix elements for all basis functions is often a major computational bottleneck whenever local basis sets are used. Local orbitals provide a more compact representation than plane waves, so fewer basis functions are needed. Unfortunately, there are O(N 3 k N 4 ) generally nonzero two-electron matrix elements for N k k-points and N basis functions in the primitive cell. For very large calculations locality can be exploited to yield asymptotically linear-scaling DFT methods [48,49]. Linear scaling Hartree-Fock is also possible for insulators [50]. This linear regime is almost never reached in practice, and it is usually advantageous to instead reduce the cost by tensor factorization.
Though our discussion has been thus far general with regard to the choice of local basis functions, Gaussian basis functions are by far the most popular choice in molecular calculations, and crystalline Gaussian orbitals are also a popular choice for periodic calculations. This popularity is due to the existence of analytic formulas which allow for fast, numerically exact evaluation of the matrix elements of most common operators. Despite the existence of efficient numerical techniques, the large number of two-electron integrals that must be evaluated in periodic calculations requires a more efficient procedure. Traditionally, this is accomplished with the Gaussian plane wave (GPW) method [41,51] which only requires storage of O(N 2 k N 2 n pw ) integrals where n pw is the number of plane waves used to evaluate the integrals. In molecular calculations, the most common decomposition is called the resolution of the identity (RI) or sometimes density fitting (DF) [52][53][54][55]. This procedure requires the storage of O(N 2 k N 2 n aux ) integrals where n aux is the size of the auxiliary basis set. Both the GPW and the RI method can be considered as density fitting approaches where the former uses a plane-wave fitting basis and the latter uses a Gaussian fitting basis. For this reason, the RI approach is often called "Gaussian density fitting" (GDF) in the context of periodic calculations [56][57][58][59][60].
The two-electron integral tensor can be further factorized into a product of five two-index tensors as was done in the tensor hypercontraction (THC) method of Martinez and coworkers [61][62][63]. Factorizations of this form are most useful for correlated methods where they have the potential to lower the computational scaling. In this work we present a translational symmetry-adapted form of the tensor hypercontraction for the two-electron integral tensors of periodic systems.
We first introduce summation limits for each symbol as we will commonly use short hand summation formulas to indicate multiple sums. For each variable {p, q, r, s} summation is performed over the range [0, N/2 − 1] indexing the spatial orbital or band, {Q, k, k } summation is performed over the Brillouin zone (BZ) at a set number of k-points of which there are N k , and {σ, τ } are electron spin variables and summed over {↑, ↓}. Non-modular differences of k, Q, and k span twice the Brillouin zone. Because V needs to be indexed by values in the Brillouin zone, we use modular subtraction indicated by . That is, if the number of points in each dimension is N x , N y , N z , we perform subtraction modulo N x , N y , N z in each direction, respectively. The Hamiltonian is generally complex Hermitian with four-fold symmetry of the two-electron integrals. 1 In the following sections we demonstrate how the sparse structure of the two-electron integral tensor affects the scaling of block encoding the Hamiltonian for implementation of qubitized quantum walk oracles. The cost of qubitization is greatly affected by the representational freedom of the the underlying Hamiltonian expressed as a linear combination of unitaries. We demonstrate how to construct the sparse, single-factorization (SF), double-factorization (DF), and tensor-hypercontraction (THC) integral decompositions of Bloch orbital Hamiltonians and cost out simulations for a variety of materials.
For all algorithms, we will make a comparison to the case of a Γ-point calculation on a supercell composed of N k primitive cells in the geometry described by the k-point sampling. This allows us to directly observe the proposed speedup due to symmetry-adapting. To demonstrate the scaling of symmetry-adapted block encoding, we estimate quantum simulation resource requirements for the series of systems listed in Table I. Range-separated density fitting [60] is used to construct integrals with Dunning type correlation-consistent basis sets [64] and the Goedecker-Teter-Hutter (GTH) family of pseudopotentials for Hartree-Fock [65]. For each Hamiltonian, cutoffs for the factorization are selected so that the Møller-Plesset second order perturbation theory (MP2) error in the total energy is below one milliHartree per cell or formula unit depending on the system. While prior works used coupled-cluster theory, MP2 is used here for computational efficiency.

III. QUBITIZATION OF MATERIALS HAMILTONIANS
Similar to fault tolerant resource estimates for molecular systems represented in second quantization [32,36,37,70,71], we compare the number of logical qubits and number of Toffoli gates required to implement phase estimation on unitaries that use block encoding [72] and qubitization [33] to encode the Hamiltonian spectrum in a Szegedy walk operator [73] for various linear combination of unitaries (LCU) [74] representations of the Hamiltonian. All LCUs represent the Hamiltonian as where ω ∈ R, ω ≥ 0, and U is a unitary operator. One can then construct the operators SELECT| |ψ → | U |ψ (13) 1 We note that the following generic complex Coulomb integral symmetries are present V pkp,qkq ,rkr ,sks = V rkr ,sks,pkp,qkq = V * qkq ,pkp,sks,rkr = V * sks,rkr ,qkq ,pkp from integration index relabeling and complex conjugation.
where |ψ is the system register, and | is an ancilla register used to index each term in the LCU. The walk operator constructed from select and a reflection operator built from prepare, R = 2|L L| ⊗ 1 − 1, has eigenvalues proportional e ±i arccos En/λ where E n is an eigenvalue of the Hamiltonian in Eq. (11). It was shown in References [70] and [32] when ensuring that select is self-inverse, only the reflection operator R needs to be controlled on the ancilla for phase estimation and not select. Therefore, the Toffoli cost of phase estimating the walk operator scales as where C S is the cost for implementing the select oracle and C P is the cost for implementing the prepare oracle, C P † is the cost for the inverse prepare oracle, and PEA is the target precision for phase estimation. Thus the main costs for sampling from the eigenspectrum of a second quantized operator are the costs to implement select, prepare, and prepare † . These costs need to be multiplied by a factor proportional to λ/ PEA for the number of walk steps needed for phase estimation. Note that when computing intensive quantities, such as the the energy per cell, the λ factor is scaled by 1/N k . The particular choice of LCU changes all of these costs. Prior works have investigated the resource requirements for simulating molecules with four different LCUs. While all these methods can be used without modification in supercell calculations at the Γ-point, the construction of molecular select and prepare do not exploit any symmetries and are not applicable away from the Γ-pointe.g. at the Baldereschi point [75].
The leading costs in constructing select and prepare for second quantized Hamiltonians is the circuit primitive that functions similar to a read-only-memory (ROM) called QROM. The QROM primitive is a gadget that takes a memory address, potentially in superposition, and outputs data, also potentially in superposition. There are currently two variations of QROM that have different costs; traditional QROM that has linear Toffoli complexity when outputting L items with any amount of data associated with each item, and advanced QROM (called QROAM) with reduced non-Clifford complexity [76]. It uses a select-swap circuit construction with Toffoli cost where k is a power of 2 for outputting L items of data where each item of data is m bits long. The notation k here for an integer should not be confused with k for the crystal momentum vector. It needs m(k − 1) ancillas, so increases the logical ancilla count in exchange for reduced Toffoli complexity. When L > m this function is minimized by selecting k ≈ L/m and thus the Toffoli and ancilla cost generically go as O( √ Lm). It is also possible to adjust k to reduce the ancilla count while increasing the Toffoli count. Having QROAM output the minimal amount of information to represent the Hamiltonian is at the core of the √ N k improvements we derive in many of the block encodings. We will also demonstrate that for all LCUs the lowest scaling can be linear in the Bloch orbital basis size O(N k N ) due to the requirement to perform unary iteration at least once over the entire basis.
Another primitive that becomes the dominant cost in constructing symmetry-adapted select is the multiplexedcontrolled swap between two registers. The controlled swap between two registers uses unary iteration [70] on L items to swap M elements between two registers at the cost of O(LM ) Toffolis. For simulating materials this primitive is commonly encountered when swapping all band indices with a particular irreducible representation label, or k-point, into a working register at a cost of O(N k N ). The necessity of coherently moving data thus puts a limit on the total savings one can achieve by leveraging Abelian symmetries. The cost of moving data must be weighed against the benefits, which we describe in each section below. In Table II we summarize the space complexity, in terms of logical qubits, and time complexity, in terms of Toffolis of the four LCUs when considering translational symmetry on the primitive cell and without (denoted as SC for supercell).
For the sparse LCU, exploiting primitive cell translational symmetry reduces the amount of symmetry unique information in the Hamiltonian by a factor of N k , which translates to a reduction of √ N k savings in Toffoli complexity and ancilla complexity. For sparse prepare, the square root savings originates from the QROAM cost of outputting "alt" and "keep" values for the coherent alias sampling component of the state preparation. For sparse select controlled application of all Pauli terms has linear cost in the basis size O(N k N ) and is not the dominant cost. The supercell calculation does not exploit the k-point symmetry unique non-zero coefficients of the Hamiltonian and thus has worse scaling.
The single factorization LCU leverages the fact that the Coulomb integral tensor is positive semidefinite and can be written in a quadratic form. For molecular systems without symmetry-i.e. C1 symmetry, the factorization results in  N ), which is the number of orbitals in the primitive cell or bands.Ξ is the average rank of the second factorization in the supercell calculation and is also expected to scale as O(N k N ). The tilde on O is used to account for logarithmic factors, and can include variables not explicitly given in the scaling. λ for each LCU is different and is denoted as a subscript indicating the LCU type and if it λ for the supercell version.

Representation
Qubits Toffoli Complexity SC Qubits SC Toffoli a three-tensor where there are two orbital indices and one auxiliary index that scales as the number of orbitals in the system [77]. For simulations where orbitals now have point group symmetry labels, such as k-points, each three-tensor factor can now be arranged into a five-tensor; two symmetry labels (irreducible representation labels), two-band labels (orbital labels), and one auxiliary index which still scales with the number of bands due to density fitting of the cell periodic part of the density [60,78]. Thus the origin of the √ N k improvement for the symmetry-adapted block encoding with a single-factorization LCU lies in the fact that the auxiliary index has N k lower scaling in comparison to a supercell variation where the Cholesky factorization or density fitting is performed on the entire supercell twoelectron integral tensor. The single factorization algorithm is also dominated by the QROM cost of prepare-of which there are two state preparations. The inner state preparation [32] for the k-point symmetry-adapted algorithm requires outputting O(N 2 k N 3 ) to be used in the state preparation leading to O(N k N 3/2 ) Toffoli and qubit complexity. Contrasting this to the supercell calculation, we see a √ N k savings due to the fact that the inner state preparation requires only N 2 k information and not N 3 k information. We elaborate on this point further in Section III B. select is implemented in a similar fashion to sparse, scaling as O(N k N ), and is not a dominant cost.
The double factorization LCU represents the Hamiltonian in a series of non-orthogonal bases and leverages that a linear combination of ladder operators can be constructed by a similarity transform of a single fermionic ladder operator, or Majorana operator, by a unitary generated by a quadratic fermionic Hamiltonian. In the molecular case, the dominant cost for these algorithms is the QROM to output the rotations for the similarity transform and implementing the basis rotations with the programmable gate array circuit primitive [32,36] for select. When taking advantage of primitive cell symmetry we reduce the amount of data needed to be output by QROM by N k , which results in a √ N k savings in the Toffoli complexity. Because we are using advanced QROM, this output size advantage is also observed in the logical qubit requirements. As mentioned previously, computing the total number of Toffolis requires scaling the walk operator cost by a linear function of λ. We find that using a canonical orbtial basis set, the total Toffoli cost is higher than the commensurate supercell total Toffoli cost because λ for the symmetry-adapted case increases. The origin of the increase is related to a reduced variational freedom when selecting non-orthogonal bases and is further discussed in Section III C.
Finally, in the THC LCU there is no asymptotic speedup because the molecular algorithm had the lowest possible scaling for second quantized algorithms. This stems from the fact that even iterating over the basis once with unary iteration to apply an operator indexed by basis element has a Toffoli cost of O(N k N ). As we will discuss in Section III D our symmetry-adapted algorithm offers other benefits such as enabling the classical precomputation of the THC factors by exploiting symmetry and lowering the number of controlled rotations.
We now describe the Hamiltonian factorization used in each LCU, the calculation of λ associated with each Hamiltonian factorization, and outline the construction of the qubitization oracles. Detailed compilations are provided for each LCU in the Appendices. In each section we provide numerical evidence that the symmetry-adapted oracles have the reported scaling by plotting the Toffoli requirements to synthesize select + prepare + prepare −1 compared against the number of k-points sampled (N k ).

A. The sparse Hamiltonian representation
In the "sparse" method the Hamiltonian described in Eq. (8) and Eq. (9) is directly translated to Pauli operators which form the LCU. Under the Jordan-Wigner transformation, we take where the notation Z is being used to indicate that there is a string of Z operators on qubits up to (not including) that on which X pkσ or Y pkσ acts upon. This requires a choice of ordering for the qubits indexed by p, k, and σ. We need only apply the string of Z operators for the same value of σ, because we always have matching annihilation and creation operators for the same spin σ (so any Z gates on the other spin would cancel). We also adopt a convention that the ordering of qubits for the Jordan-Wigner transformation takes k as the more significant bits, with qubits for all p with a given k grouped together. For most of the discussion we will not need to explicitly consider this ordering.
We provide the full derivation for this expression in Appendix A 1. To derive the two-body operator LCU we use only complex conjugation symmetry in contrast to the molecular derivation that used eight-fold symmetry. The two-body Hamiltonian can be written as where indicates modular subtraction as defined above. In the case where Q = 0 or p = q and r = s, we can move the creation and annihilation operators using the fermionic anticommutation relations to give the term on the second line as The Jordan-Wigner representation then gives the expression in square brackets in Eq. (20) as Then we can separate Eq. (22) into real and imaginary components as In accounting for cases where Q = 0 with p = q or r = s, the same expression is obtained, but there are also one-body terms obtained. These result in a total one-body operator A full derivation of this expression can be found in Appendix A 2.
Using the representation of the one-body and two-body operators as Pauli operators, we have a linear combination of unitaries form. The λ associated with this LCU is Re(V pk,q(k Q),r(k Q),sk ) + Im(V pk,q(k Q),r(k Q),sk ) .
In determining λH 1 there is a factor of 2 due to the summation over spin σ and then a factor of 2 accounting for the fact that each expression in braces in Eq. (19) is the sum of two different Pauli strings. As a result these factors have cancelled the original 1/4 prefactor. In the expression for λH 1 we have also summed over the native one-body terms and the contributions from the two-body terms. For λ H2 we had a factor of 1/8 in Eq. (23), which is multiplied by the factor of 1/4 in Eq. (20). The two sums over spin σ and τ give a factor of 4. Then for each of the real and imaginary parts in Eq. (23) there were sums over 8 Pauli strings, giving a factor of 8. As a result these factors have also cancelled in the expression for λ H2 . Note that there is a factor of 2 between this expression and that in [32], even when we just consider V that is real. The reason is that in Ref. [32] there was eight-fold symmetry, where here we only have four-fold symmetry. That is, here we have symmetry when simultaneously swapping the pairs p, q and r, s, whereas in Ref. [32] there are two symmetries from swapping p, q or r, s on their own. That meant it was possible to express the Hamiltonian as in Eq. (A2) of that work, then in Eq. (A3) of that work the Jordan-Wigner mapping was used in the form In this mapping there has been a cancellation of half the Pauli strings, which results in λ being reduced by a factor of 2. Here we only have four-fold symmetry, so the value of λ for the two-body term is a factor of 2 larger than that in [32].
In order to implement the Hamiltonian as a linear combination of unitaries, the first step is to perform a state preparation on Q, k, k , p, q, r, s. This state preparation corresponds to the sum, then we will perform controlled operations for each of the operators in Eq. (23). The state preparation is applied using coherent alias sampling as described in [71]. Because there are multiple variables that the state needs to be prepared over, it is convenient to use the QROM to output "ind" values as well as "alt" and "keep" values. Both "ind" and "alt" give values of all variables Q, k, k , p, q, r, s. Then an inequality test is performed between keep and an equal superposition state, and the result is used to control a swap between ind and alt.
There are both real and imaginary values of V pk,q(k Q),r(k Q),sk , so we also include a qubit to distinguish between these values in the state preparation. We also do not prepare all values of {pk, q(k Q), r(k Q), sk }. There is symmetry in swapping pk, q(k Q) with r(k Q), sk , or simultaneously pk with q(k Q) and r(k Q) with sk .
Only those values of {pk, q(k Q), r(k Q), sk } that give unique values of V will be prepared. Then the full range can be obtained by using qubits to control these swaps. There is also a complex conjugate needed in the symmetry, which can be applied with a Clifford gate.
The dominant complexity in the preparation comes from the QROM. The number of items of data is O(N 3 k N 4 ), and by using the advanced form of QROM the complexity can be made approximately the square root of the total amount of data (number of items of data times the size of each). The size of each item of data is logarithmic in N k and N as well as the allowable error. Therefore, the scaling of the complexity can be given ignoring these logarithmic parts as O(N To describe the controlled operations needed in order to implement the operation as in Eq. (23), we need to account for the fact that there are two lines for each of the real and imaginary components of V . In addition, for each line in Eq. (23) there is a product of two factors, each of which is a sum of two terms. To describe the linear combination of unitaries we therefore introduce three more qubits.
• The first is used to distinguish between the two lines for each of the real and imaginary parts in Eq. (23).
• The second distinguishes between the two terms in the first set of square brackets.
• The third distinguishes between the two terms in the second set of square brackets.
When implementing the controlled operations, we perform four operations of the form of ZX or ZY , with X or Y being applied on target qubits indexed by pk, q(k Q) and so forth. These Pauli strings are applied using the approach of [32], but in this case there is the additional complication that we need to select between X or Y . This selection can be performed simply by performing the controlled Pauli string twice, once for X and once for Y . The complexity is proportional to N k N , which is trivial compared to the complexity of the state preparation. The choice of whether X or Y is performed depends on the value of the three qubits selecting between the terms, as well as the qubit selecting between the real and imaginary parts. The processing of these qubits to determine the appropriate choice of X or Y can be performed with a trivial number of gates.
The last part to consider is how the implementation of the one-body part of the Hamiltonian is integrated with the implementation of the two-body part. In the state preparation, amplitudes corresponding to the real and imaginary parts of h pk,qk will be produced, as well as a qubit selecting between the one-and two-body parts. That qubit will be used to also select between the choice of X and Y . For the one-body part there is a product of only two of the Pauli strings, so the other two will not be applied at all for the one-body part. See Appendix A 3 for a more detailed description of the implementation.
In Figure 1 we plot the Toffoli complexity to implement select + prepare + prepare −1 for simulating the aforementioned sample systems using a symmetry-adapted select and prepare at different Monkhorst-Pack grids and different number of bands (cc-pVDZ and cc-pVTZ). We compare the symmetry-adapted calculations to supercell calculations using the same select and prepare. The supercell calculations do not explicitly take into account the symmetry of the primitive cell in the full simulation cell. To demonstrate the symmetry-adapted Hartree-Fock orbitals do not appreciably change the overall scaling with respect to a supercell calculation we plot the total λ for the supercell calculation (which reruns Hartree-Fock on the supercell) and the symmetry-adpated version.  Table I using the cc-pVDZ and cc-pVTZ basis set Γ-centered Monkhorst-Pack grids of size [1, 1, 1] to [3,3,3]. Each point is a single system described at a particular basis set and k-mesh where the threshold for zeroing each nonzero two-electron integral coefficient is determined by MP2 as described earlier. The scaling for implementing the block encoding is shown in the legend. To isolate the N k scaling behavior we divide the Toffoli step complexity by the square of the number of basis functions (N 2 ). For supercell we expect a scaling going as O(N 2 k ) and for symmetry-adapted block encodings we expect a scaling going as O(N 1.5 k ). The ideal symmetry-adapted scaling is not reached due to finite size effects which are further discussed in Figure 2. (b) Total λ for the symmetry-adapted version (denoted UC) and the supercell calculation without explicit primitive cell symmetry (denoted SC) demonstrating no deterioration of λ by symmetry-adapting. Similar to (a) all points are a particular system from the benchmark set in a fixed basis and k-mesh.
Though Figure 1 indicates some computational advantage for the symmetry-adapted case the expected √ N k improvement over the supercell case is not easily observed. The cost of the sparse method largely depends on the number of nonzero elements of V , which is generically expected to go as O(N 3 k N 4 ) for the symmetry-adapted case and O(N 4 k N 4 ) for the supercell case. The scaling ultimately depends on the number of nonzero elements in each block of integrals (indexed by three-momentum indices), which we expect to be independent of supercell size N k . For Diamond we plot this dependence in Figure 2 and demonstrate that convergence is slow and there is a strong N k dependence in the number of nonzero elements in each two-electron integral block. This dependence makes observing the improvement in Toffoli cost for symmetry-adapted oracles difficult in the low N k regime.

B. The single-factorization Hamiltonian representation
For the "single-factorization" method, the Cholesky decomposition of the 2-electron integral tensor can be applied iteratively or the factorized forms can be directly recovered from a density fitted representation of the atomic orbital FIG. 2. N k dependence on the number of non-zero elements in each two-electron integral block indexed by irrep. labels for Diamond in a single-zeta-valence basis with a k-mesh shifted to (1/8, 1/8, 1/8) of the simulation cell. In the thermodynamic limit this value should be independent of N k and thus both supercell (SC) and symmetry-adapted (UC) should have no correlation with N k -i.e. a slope of zero. The N k dependence for small N k makes it difficult to observe the asymptotic improvements from symmetry-adapting the sparse qubitization oracles.
integral. The quadratic representation of the two-electron integral tensor is V pkp,qkq,rkr,sks = n L pkpqkq,n L * sksrkr,n (29) where k p + k r = k q + k s modulo a reciprocal lattice vector G, or k p − k q − (k s − k r ) = G. We can identify k p − k q = Q + G = k s − k r . Thus the two-body interaction operator can be written aŝ Due to the reduced symmetry of the complex valued two-electron integral tensor we take additional steps to form Hermitian operators which can be expressed as Pauli operators under the Jordan-Wigner transform. We express each one-body operator in the product of particle-conserving one-body operators forming the two-electron operator aŝ (31) We now take a linear combination ofρ n (Q) to form Hermitian operators and represent our two-electron integral operator as a sum of squares of Hermitian operators that are amenable to the approach for the qubitization of onebody sparse operators via a linear combination of unitaries. These operators are denotedÂ n (Q) andB n (Q) and are defined asÂ to giveĤ We have taken advantage of the translational symmetry by performing the sum over Q outside the squares ofÂ and B, which reduces the amount of information needed in the representation. In the case Q = 0 we can writeÂ n aŝ Applying the Jordan-Wigner representation then giveŝ The same reasoning can be performed forB n (Q = 0), which gives the plus and minus signs between a † pkσ a q(k Q)σ and a † q(k Q)σ a pk in Eq. (35) reversed, so the roles of the real and imaginary parts are reversed. As a result we obtain Accounting for the cases with Q = 0, we may use the same expressions with an extra identity, which yields a one-body correction when squaring. We show in Appendix B 1 that this results in the total one-body operator as before. Therefore the associated λ is again λH 1 as given in Eq. (25). The λ for the two-body term is then This expression can be obtained by first summing the absolute values of the weights in the linear combination of unitaries forÂ andB to give This is obtained by noting that the sum over the spin gives a factor of 2, and there are two unitary operators for each of the real and imaginary parts; together these cancel the factor of 4. Then this expression is squared for each ofÂ andB, and there is a sum over Q and n in Eq. (34). A further factor of 1/2 is obtained because we use amplitude amplification on each operator as described in [32], thus giving our expression for λ V . Next we describe the method to block encode the Hamiltonian in this single-factorized representation. The key idea is to perform a state preparation over Q and n, then block encode the squares ofÂ n (Q) andB n (Q) using a single step of oblivious amplitude amplification (which saves a factor of 2 for the value of λ). That is, we perform block encodings ofÂ n (Q) andB n (Q), reflect on an ancilla register, then apply the block encodings again.
For the initial state preparation on Q, n, the number of items of data is M N k + 1, where the +1 is for the one-body part of the Hamiltonian. This state preparation is via coherent alias sampling, so the dominant cost is from the QROM needed to output ind and alt values. That has complexity scaling as O( √ M N k ), where the tilde accounts for the size of the items of data.
For bothÂ n (Q) andB n (Q) we have weightings according to the real and imaginary parts of L pkq(k Q),n , but the difference is in what operations are performed in the sum. Therefore, for an LCU block encoding, the state preparation step may be identical betweenÂ n (Q) andB n (Q). For each value of Q, n, the number of unique values of k, p, q to consider is N k N 2 /4. Unlike the supercell case we cannot take advantage of symmetry between p and q, because we have pk and q(k Q). The relation between k and k Q is governed by the value of Q which is given in the outer sum, and so we cannot exchange p and q. There is a further factor of 2 for the number of items of data, because both real and imaginary parts are needed.
Accounting for the values of Q, n, the total number of items of data that must be output by the QROM used in the state preparation is ( , given that M scales as O(N ). Again because the size of the items of data is logarithmic, this gives a complexity O(N k N 3/2 ). In contrast, in the supercell calculation, eachÂ n andB n would have O(N 2 k N 2 ) entries, and the rank would be O(N k N ), for a total number of items of data O(N 3 k N 3 ). That would give a complexity O(N 3/2 k N 3/2 ), so there is a factor of √ N k improvement obtained by taking advantage of the symmetry.
In the state preparation we only prepare p, q for p ≤ q, and the full range of values should be produced using a swap controlled by an ancilla register. A further subtlety in the implementation as compared to prior work is that the complex conjugate is needed as well. This may be implemented using a sign flip on the qubit indicating the imaginary part, so it is just a Clifford gate.
A major difference is in the selection of operations forÂ n (Q) andB n (Q). We see that there are two steps where we need to apply an operation of the form ZX or ZY , and the choice of X or Y . The selection of where the X or Y is applied (indicated by the subscript) can be implemented in the standard way. The choice of whether X or Y is applied depends on four qubits.
1. The qubit selecting between the one-and two-body parts.

A qubit selecting between
A and B, which can simply be prepared in an equal superposition using a Hadamard because there are equal weightings between these operators.

3.
A qubit selecting between the real and imaginary parts of L pkq(k Q),n , which was prepared in the state preparation.

4.
A qubit selecting between the two terms shown above in each line of the expressions forÂ n (Q) andB n (Q). This qubit can also be prepared using a Hadamard.
Using a trivial number of operations on these qubits we can determine whether it is X or Y that needs to be performed. The cost of the controlled unitary is doubled because we apply a controlled ZX and a controlled ZY , but this cost is trivial compared to the state preparation cost so has little effect on the overall complexity. A further subtlety in the implementation is that in the second implementation ofÂ n (Q) andB n (Q), we simply use the qubit flagging the one-body part to control whether the Pauli string ZX or ZY is applied at all. This ensures that the square is not obtained for the one-body part. For a more in-depth explanation of the implementation, see the circuit diagram in Figure 3 and the explanation in Appendix B 2. Figure 4 demonstrates the √ N k improvement in constructing the walk operator by symmetry-adapting. Even for small N k there is a clear separation between the cost of supercell (SC) and symmetry-adapted oracles that agrees with the theoretical scalings of 1.5 and 1.0, respectively.

C. The double-factorization Hamiltonian representation
In the sparse and SF LCU approaches we have found that there is a factor of √ N k savings in Toffoli costs and logical qubit costs for symmetry-adapted block encoding constructions over their non-symmetry-adapted counterparts (supercell calculations). The double-factorization (DF) representation continues this trend, though the origin of the speedup is different. In the double-factorization circuits each unitary of the LCU is a rank-one one-body operator that can be thought of as the outer product of two vectors of ladder operators, where each vector of ladder operators is obtained by a Givens rotation with multiqubit control based on other indices. First notice that for SF there is O(N 2 k N 3 ) data to output to specify the Hamiltonian via the Cholesky factors. The factors come from two momentum indices k, Q, two band indices p, q, and one auxiliary index n. In this section we demonstrate that by using a workspace The circuit for performing the state preparation and controlled operations for the single factorization approach. The register labelled is a contiguous register for Q, n, with Q also output in the state preparation. The inner state preparation uses as a control. Then the minus on Q controlled by k is to compute k Q, with that register reset to Q later with a controlled addition. A qubit is used to swap p and q to generate that symmetry, and another is used to swap the spin up and spin down components of the system so the selection only need act on the spin down component. The qubits labelled Re/Im, A/B, and "term" are the qubits selecting the real versus imaginary parts, A versus B, and the two terms in each line of the Hamiltonian. These correspond to b1, b2, b3 in Appendix B 2, and the Toffoli and CNOT gates are used so that the "term" qubit can be used to select whether the Paul string with X or Y is applied. The selection is performed twice, once for each of the Pauli strings and so is controlled by k Q and q the first time, then k and p the second time. A controlled phase between Re/Im and "term" (also controlled by the success flag qubits) is used to generate the correct sign for the term. The block encoding of A/B is performed twice, with the reflection on the ancilla qubits in the middle generating the step of oblivious amplitude amplification. The third register flags that we have the two-body part of the Hamiltonian, and is used to control the block encoding of A/B the second time to ensure it is not performed for the one-body part.
register to apply Givens rotations to pairs of band indices, {k, k Q}, the complexity of the DF LCU can also be improved by a factor of √ N k over supercell calculations. To construct the DF LCU, we will separateÂ n (Q) andB n (Q) out into sums over k. To express this, instead of having ρ n (Q), we define ρ n (Q, k) so then the Hermitian one-body operators that are squared to form the two body part of the Hamiltonian arê  Table I described using the cc-pVDZ and cc-pVTZ basis sets and Γ-centered Monkhorst-Pack grids of size [1, 1, 1] to [3,3,3]. Each point is a single system described at a particular basis set and k-mesh where the range of the auxiliary index of the Cholesky factorization is selected to produce two-electron integrals corresponding to an MP2 error of one 1 milliHartree per unit cell with respect to an untruncated auxiliary index range. We divide the Toffoli complexity for implementing select + prepare + prepare −1 by N 3/2 , which is the shared scaling in the number of bands. The different scaling in number of k-points becomes clear: N k for symmetry-adapted block encodings and N 3/2 k for supercell non-symmetry-adapted block encodings. We observe similar behavior for qubit count and for plotting oracle Toffoli complexity versus the number of bands. (b) The value of λ per unit cell (λ/N k ) as a function of the total system size N N k for the same systems described with the same cutoffs used in (a).
Just as in the single factorization case, we have the two-body part of the Hamiltonian We can writeÂ n (Q) asÂ where the basis rotation unitary U n (Q, k) acts on orbitals indexed by k and k Q, Ξ Q,n,k,A corresponds to a rank cutoff for A, and f A p (Q, n, k) is the eigenvalue of the one body operator that is diagonalized by U n (Q, k). The expression forB n (Q) is similar, and we use Ξ Q,n,k,B to denote the rank cutoff.
In practice, for the implementation we would apply a different basis rotation for each individual value of p. As explained by [32], when doing that the number of Givens rotations needed only corresponds to the number of orbitals it is acting upon, instead of the square. Here we have two momentum modes {k, k Q} with N orbitals for each, suggesting there should be 2N . However, there is no mixture between the different spin states indexed by σ, so that gives the number of orbitals as N .
To quantify the amount of information needed to specify the rotations for the Hamiltonian, there is a Q summation, n summation, k summation, p summation, and we need to specify N Givens rotations for each. In turn, each Givens rotation needs two angles. The total data here therefore scales as where a factor of N comes from the number of Givens rotations, and the tilde accounts for the bits of precision given for the rotations. By analogy with the supercell case, it is convenient to define an average rank This has division by N k for the Q sum and M for the n sum, but no division by a factor accounting for k. That is, it is the average rank for each value of Q and n, with the sum over k regarded as part of the rank. Then it is most closely analogous to the rank in the supercell case, and it is found that it similarly scales as O(N k N ). In terms of Ξ, the scaling of the amount of data can be given as O(N k N 2 Ξ), using M = O(N ). Next we describe in general terms how to perform the block encoding of the linear combination of unitaries, with the full explanation in Appendix C. As in the case of single factorisation, the general principle is to perform state preparation over Q and n, then block encode the squares ofÂ n (Q) andB n (Q) using oblivious amplitude amplification. The difference is thatÂ n (Q) andB n (Q) are now block encoded in a factorized form. In more detail, the key parts are as follows.
1. Perform a state preparation over Q and n, as well as a qubit distinguishing betweenÂ n (Q) andB n (Q). Using the advanced QROM, the complexity of this state preparation scales approximately as the square root of the number of items of data, so as O( √ N k N ). The tilde accounts for logarithmic factors from the size of the output. For convenience here we use a contiguous register for combined values of Q and n.
2. Apply a QROM which outputs the value of Q, as well as an offset needed for the contiguous register needed in the state preparation forÂ n (Q) andB n (Q).
3. Perform the inner state preparation over k and p. Here the number of items of data is O(N k N Ξ), accounting for the sums over Q, n, k, p. The complexity via advanced QROM is approximately the square root of this quantity, 4. Apply the QROM again to output the rotation angles for the Givens rotations needed for the basis rotation. This time the size of the output scales as O(N ), so the complexity of the QROM scales as O(N √ N k Ξ). This is the dominating term in the complexity.
5. Use control qubits to swap the system registers into the correct location. This is done first controlled by a qubit labelling the spin, σ, which is similar to what was done in prior work. The new feature here is that registers containing k and k Q are also used to swap system registers into N target qubits.
6. Apply the Givens rotations on these N target qubits. The complexity here only scales as O(N ), so is smaller than in the other steps.
7. Apply a controlled Z for part of the number operator. This comes from representing the number operator as (1 1 − Z)/2 and combining the identity with the one-body part of the Hamiltonian.
8. Invert the Givens rotations, controlled swaps, QROM for the Givens rotations, and state preparation over k, p.
The complexities here are similar to those in the previous steps, but the complexities for QROM erasure are reduced. This completes the block encoding ofÂ n (Q) andB n (Q).
9. Perform a reflection on the ancilla qubits used for the state preparation on k, p. This is needed for the oblivious amplitude amplification.
10. Perform steps 3 to 8 again for a second block encoding ofÂ n (Q) andB n (Q). This together with the reflection gives a step of oblivious amplitude amplification, and therefore the squares ofÂ n (Q) andB n (Q).
11. Invert the QROM from step 2. This has reduced complexity because it is an erasure.
12. Invert the state preparation from step 1.
A quantum circuit for the procedure is shown in Figure 5. This is similar to Figure 16 in [32], except it is including the extra parts needed in order to account for the momentum k used here. In particular, shown here is a contiguous register for Q, n, and p shown in the diagram is actually a contiguous register for k, p. The values of Q and k need to be output via QROM after the state preparations. Then k Q is computed and k and k Q are used to swap the required part of the system register into N target qubits where we apply the Givens rotations.
The lambda value for the Hamiltonian can be calculated by determining the total L1-norm of the coefficients of the unitaries used to represent the Hamiltonian. To determine this norm, note first that the number operator is replaced with (1 1 − Z)/2, and the identity is combined with the one-body part of the Hamiltonian. ForÂ n (Q), what is implemented therefore corresponds to The circuit for performing the state preparation and controlled operations for the double factorization approach. The register labelled is a contiguous register for preparing Q, n, with Q output next in the QROM. The register labelled p is actually a contiguous register for preparing both k and p. The value of k is output in the next step together with the rotations. Then the minus on Q controlled by k is to compute k Q. The "×" on |ψ ↓ controlled by the Q and k registers indicates that these registers are used to swap the required part of the system register into N target qubits that the Givens rotations R act upon.
Summing the absolute values of coefficients here gives where the sum over the spin σ has given a factor of 2 which canceled the factor of 1/2. In implementing the square ofÂ n (Q) we use oblivious amplitude amplification, which provides a factor of 1/2 to λ. Combining this with the 1/2 in the definition ofĤ 2 gives 1/4, and combining with the contribution fromB n (Q) then gives where the superscript B on f indicates the corresponding quantity forB n (Q). The one-body Hamiltonian is adjusted by the one-body term arising from the identity in the representation of the number operator in the two-body Hamiltonian. This yields an effective one-body Hamiltonian (see Appendix C) We can rewrite this as where λ k,p are eigenvalues of the matrix indexed by p, q in the the brackets in Eq. (51). Thus the L1-norm of H 1 is the sum  Table I described using the cc-pVDZ and cc-pVTZ basis sets and Γ-centered Monkhorst-Pack grids of size [1, 1, 1] to [3,3,3]. Each point is a single system described at a particular basis set and k-mesh where the threshold to keep eigenvalues and vectors of the second factorization is selected to produce two-electron integrals corresponding to an MP2 error of one 1 milliHartree with respect to an untruncated double factorization. On average this corresponds to a threshold value of 1 × 10 −4 for the benchmark systems. The expected O( √ N k ) scaling improvement for symmetry-adapted walk operators is demonstrated. (b) The value of λ per unit cell as a function of the total system size N N k for the same systems described with the same cutoffs used in (a). The reduced variational freedom in compression of the two-electron integral tensors for the symmetry-adapted walk operator construction translates to an increased value of λ at all system sizes. Figure 6 demonstrates the improved √ N k scaling of the block encodings coming from reducing the number of controlled rotations by N k . Unlike the SF case λ for DF has worse scaling in the symmetry-adapted setting compared to the supercell case. This is rationalized by the fact that there is a larger degree of variational freedom in the second factorization for supercell calculations (and thus more compression) compared to the symmetry-adapted case. The λ value is basis set dependent and can potentially be reduced by orbital optimization [79].

D. The tensor hypercontraction Hamiltonian representation
In the tensor hypercontraction (THC) LCU representation the fact that the two-electron integrals can be represented in a symmetric Canonical Polyadic like decomposition is used to define a set of non-orthogonal basis function in which to represent the Hamiltonian, and we use a similar infrastructure to the DF algorithm to implement each term in the factorization (which is in a different non-orthogonal basis) sequentially. In the following section we describe the Bloch orbital version (symmetry-adapted) of the THC decomposition and the resulting LCU, λ calculation, and qubitization complexities. First we review the salient features of tensor hypercontraction for the molecular case before introducing symmetry labels. Recall that in the molecular THC approach we expand density like terms over a grid of M points (labeled µ) and weight each grid point with a function ξ µ (r) which allows us to write the two-electron integral tensor as where the central tensor is defined as In order to incorporate translational symmetry into the THC factorization the decomposition of the density is performed on the cell periodic part of the Bloch orbitals as [80,81] where u pkp (r) = e −ikpr φ pkp (r). Then the two-electron integral tensor has the form where χ Some care needs to be taken when bringing this into a form similar to Eq. (5). First recall that we have k p − k q + k r − k s = G pqrs , where G pqrs is a reciprocal lattice vector, and we are working with a uniform Γ-point centered momentum grid with dimensions N = [N x , N y , N z ] and N k = N x N y N z . To eliminate one of the four momentum modes, we identify Q = k p k q , and Q = k s k r , and set k p = k, k q = k Q, k s = k and k r = (k Q). To evaluate the ζ tensor we still need to know the values of k p − k q in absolute terms given a value for Q and k. We note that mapping the difference k p − k q back into our k-point mesh amounts to adding a specific reciprocal lattice vector with a similar expression for k (the subtraction here is not modular). Thus, given a Q and k we can determine k − Q and G k,k−Q . With these replacements we can write where we have used In practice there are at most 8 values of G, so we only need to classically determine at most 8 2 N k values of ζ, as opposed to N 3 k . We can then write where in going from the second to the third line of Eq. (62) we have rewritten the sum over k and k as a double sum over all 8 2 values of G 1 and G 2 , and a restricted sum on k such that for a given G 1 and Q we only sum over those k which satisfy G k,k−Q = G 1 . Here the notation G kp,kq is used as equivalent to G pq above. The fourfold symmetry of the two-electron integrals carries over to analogous symmetries in ζ, which are listed in Appendix D.
We will then defineχ which are individually normalized for each k and µ so pχ * pk,µχ pk,µ = 1 and with N k,µ := p |χ pk,µ | 2 . We then use these normalizedχ to give transformed annihilation and creation operators We can then write the two-body Hamiltonian aŝ A complication for the implementation is that we would like to be able to choose the relative weighting between ζ and χ such that The difficulty here is that the values of N k,µ only depend on k, µ, because they are based on χ pk,µ . This sum is also dependent on Q and G, so for this normalization condition to hold it would mean we need to have χ pk,µ also dependent on Q and G in a multiplicative factor (so a non-µ-dependent way). That will leave the normalizedχ pk,µ unaffected, but means that the values of N k,µ need to have dependence on Q, G, which will need to be taken account of in the state preparation. The form in Eq. (65) then gives us a recipe for block encoding the Hamiltonian as a linear combination of unitaries.
1. First prepare a superposition state proportional to This state may be prepared via the coherent alias sampling approach with a complexity dominated by the complexity of the QROM. Accounting for symmetry the dimension is about 32N k M 2 and the size of the QROM output is approximately the log of that plus the number of bits for the keep probability. That gives a Toffoli complexity scaling as 2. For each of the two expressions in brackets in Eq. (65), a preparation over k or k is needed to give a state of the form As explained above, the values of N k,µ need to be chosen with (implicit) dependence on Q, G for this to be a normalised state. This means that the amplitudes here need to be indexed by k, Q, G 1 and µ. The restricted range of values in the sum over k means that the indexing over k, Q, G 1 gives N 2 k items of data, which is multiplied by M for the indexing over µ. So there are N 2 k M items of data needed, which is smaller than that in the first step, because it is missing the factor of 32 and typically N k < M . Given that the output size is approximately log(N k ) + ℵ, the Toffoli complexity is approximately This cost is incurred twice, once for each of the factors in brackets in Eq. (65).
3. For each of the c annihilation and creation operators we perform a rotation of the basis from a. This is done in the following way.
(a) First use the spin σ or τ to control a swap of the system registers. This is done once and inverted for each of the two c † c factors. Each of these 4 swaps has cost N k N/2.
(b) Then use k or k Q to control the swap of the registers we wish to act on into working registers. The value of k Q is used for c µ(k Q)σ , and needs to be computed to use as a control. Each of these eight swaps may be done with a Toffoli complexity approximately as half the number of system registers N k N/2.
(c) Next k (or k Q) and µ (or ν) are used as a control for a QROM to output the angles for Givens rotations. There are two angles for each of N/2 Givens rotations, so if they have each the size of the output is N . Then the QROM complexity is about This must be done 4 times (and has a smaller erasure cost). 4. After the rotation of the basis, we simply need to perform the linear combination of ZX and ZY for c † and c. The X or Y is applied in a fixed location, but the Z needs to be applied on a range of qubits chosen by k or k Q. We therefore have approximately N k for the unary iteration for each Z for a total cost of about 4N k .
The quantum circuit for the block encoding of the THC representation, split into two parts with the right half at the bottom. The top shows the portion of the circuit for the first part controlled by k and ν, and the bottom shows the (right) part of the circuit where it is controlled by k and µ. The dotted rectangles show the regions for implementing the c and c † operators together with the Givens rotations needed to change the basis. The swaps controlled by the k and k registers are to move the appropriate qubits into target registers in order to apply the Givens rotations. The c and c † are applied using a superposition of X and iY applied using an ancilla qubit (not shown for simplicity), together with a string of Z gates for the Jordan-Wigner representation. The preparation at the beginning includes an inequality test between µ and ν to give a qubit flagging whether the real or imaginary part is produced. To make the implementation self-inverse, the µ, ν and k, k pairs of registers are swapped in the middle (the left of the lower half). Also, an X gate is applied to the qubit that controls the swaps at the beginning and end.
Lastly we would perform reflections on control ancillas as usual to construct a qubitised quantum walk from the block encoding. This cost is trivial compared to that in the other steps. For a more detailed explanation, see the circuit diagram in Figure 7 and the discussion in Appendix D.  Table I. Here we compare the MP2 errors as a function of the THC rank parameter cTHC using ISDF or subsequent reoptimization to generate the THC factors.  9. (a) The number of k-points verses Toffoli cost to implement the block encoding for the THC factorization LCU evaluated for the benchmark systems listed in Table I described using the cc-pVDZ and cc-pVTZ basis sets and Γ-centered Monkhorst-Pack grids of size [1, 1, 1] to [3,3,3]. Each point is a single system described at a particular basis set and k-mesh where the range of the auxiliary index of the THC factorization is selected to produce two-electron integrals corresponding to an MP2 error of one 1 milliHartree with respect to an untruncated auxiliary index range. This corresponds to an auxiliary index that is eight times the number of orbitals in the primitive cell for symmetry adapted THC, and eight times the total number of orbitals (N k N ) for supercell THC. We divide the Toffoli complexity for implementing SELECT + PREPARE + PREPARE −1 by N, which is the shared scaling in the number of bands. While we observe a √ N k scaling improvement for symmetry-adapted walk operations, we believe this is a finite size effect and both methods should scale linear with N k for sufficiently large N k The value of λ as a function of the total system size N N k for the same systems described with the same cutoffs used in (a). The reduced variational freedom in compression of the two-electron integral tensors for the symmetry-adapted walk operator construction translates to an increased value of λ at all system sizes. The λ THC value has a one-body component and two-body component. Unlike molecular THC where the two-body component is reduced because we evolve by number operators in the non-orthogonal basis, in this version of the THC algorithm we will evolve by ladder operators in a non-orthogonal basis, and thus there is no one-body part to remove. The one-body contribution to λ THC , λ THC,1 , is computed in a similar way as for the double factorization algorithm but noting that the extra factor of 1/2 coming from the Z operator is no-longer present to cancel the factor of two from spin summing. The one-body contribution to λ THC is The two-body contribution to λ THC , λ THC,2 , is determined by summing over all unitaries in the LCU. This summation can be rewritten in the form using the expression for ζ described in Eq. (61).
To obtain resource estimates for THC with k-points we follow a similar procedure to previous molecular work [32] and first compress the rank of the THC factors (M = c THC N/2, where c THC is the THC rank parameter). In particular, we use the interpolative separable density fitting (ISDF) approach [80,82,83] as a starting point before subsequently reoptimizing these factors in order to compress the THC rank while regularizing λ [32,37], which we will call k-THC. Further details of this procedure are provided in Appendix G. In Fig. 8 we demonstrate that a c THC = 8 is sufficient to obtain MP2 correlation energies within approximately 0.1 mHa/Cell for a subset of the systems considered in the benchmark set. We note that the equivalent ISDF rank may be on the order of 10-15 for comparable accuracy, which would correspond to a much larger value for λ. Fig. 9 (a) demonstrates a √ N k scaling improvement of the block encodings in the symmetry adapted case. Note that this √ N k speedup for the block encodings is partially a finite size effect. In Fig. 10 we plot the Toffoli complexity per step as a function of N k using artificially generated data to explore the large N k behavior. We see that depending on the fitting range employed the extracted asymptotic scaling trends towards linear. While ultimately both the symmetry-adapted and supercell encodings should scale linearly with the system size due to the cost of unary iteration over all basis states, there are several factors that yield a √ N k saving in the symmetry-adapted case, and the relative size of the prefactors becomes important. Similar to DF, we find from Fig. 9 (b) that λ in the symmetry-adapted setting exhibits slightly worse scaling than for supercell calculations. This worsening of λ in the symmetry-adapted case can be understood again as a reduction in variational freedom in the symmetry adapated case, leading to smaller compression. Note that while Eq. (73) nominally scales cubicly with N k , we expect each individual matrix element to decay like N −1 k , which yields the expected quadratic dependence of λ, or a linear dependence of λ when targeting the total energy per cell. In the supercell case, there are simply M 2 = (N k N ) 2 elements in the central tensor, which in turn controls the scaling of λ. From Table II and Fig. 9 we can conclude that there is asymptotically no advantage to incorporating symmetry in the THC factorization for the Toffoli complexity, with both the supercell and symmetryadapted methods exhibiting approximately quadratic scaling with system size for a fixed target accuracy of the total energy per cell.

IV. SCALING COMPARISON AND RUNTIMES FOR DIAMOND
We now compare runtimes and estimate total physical requirements to simulate Diamond as a representative material. In Figure 11 we plot the total Toffoli complexity for the sparse, SF, DF, and THC LCUs using symmetryadapted block encodings and supercell calculations for Diamond with cc-pVDZ and cc-pVTZ basis sets at various Monkhorst-Pack samplings. In sparse and SF there is a clear asymptotic separation between supercell and symmetryadapted Toffoli counts. This is expected from the fact that both block encoding constructions are asymptotically improved and λ does not increase. For the DF case, total Toffoli complexity for supercell and symmetry-adapted cases is similar due to the larger λ for the symmetry-adapted algorithm. For THC, the total Toffoli complexity is similar in the supercell and symmetry adapted case, but the asymptotic scaling is identical for the supercell and symmetry-adapted algorithms. This is due to the increase in λ for the symmetry-adapted algorithm.
In Table III we tabulate the quantum resource requirements and estimated runtimes after compiling into a surface code using physical qubits with error rates of 0.01% and a 1 µs cycle time. We assume four Toffoli factories similar to References [32] and [37] and observe that for systems with 52-1404 spin-orbitals the quantum resource estimates are roughly in line with extrapolated estimates from the molecular algorithms.
It is important to note that while the THC resource requirements look competitive for these small systems, in its current form it is not a practical way to simulate materials at scale. This is due to the prohibitive cost of reoptimizing the THC factors which significantly limits the system sizes that can be simulated. Moreover, as discussed in Section III D, we caution that the THC trend lines are only valid within the fitting range, and we expect that asymptotic THC Toffoli count will trend more towards O(N 2 k ) in the thermodynamic limit.  TABLE III. Diamond represented in a cc-pVDZ basis (52 spin-orbitals in the primitive cell) at various k-mesh sizes and the associated quantum resource requirements to compute the total energy per cell to within 1 kcal / mol. The surface code runtime is estimated using four T-factories, a physical error rate of 0.01%, and a cycle time of 1 µs. The physical qubit count is given in millions.

V. CLASSICAL AND QUANTUM SIMULATIONS OF LNO
In this section, we compare modern classical computational methods with quantum resource estimates in the context of a challenging problem of industrial interest: the ground state of LiNiO 2 .

A. LNO background
Layered oxides have been the most popular cathode active materials for Li-ion batteries since their commercialization in the early '90s. While LiCoO 2 is still the material of choice in the electronics industry, the increasing human, environmental and financial cost of cobalt spells out the need for cobalt-free cathode active materials, especially for automotive applications [84,85].
The isostructural compound LiNiO 2 (LNO) had been identified as an ideal replacement for LiCoO 2 already in the '90s, due to its comparably high theoretical capacity at a lower cost [86,87]. Despite its numerous drawbacks, LNO still serves as the perfect model system for many derivative compounds such as lithium nickel-cobalt-manganese (NCM) and lithium nickel-cobalt-aluminum oxides (NCA) that are nowadays the gold standard in the automotive industry [38]. Moreover, the constant demand for better performing materials pushes the amount of substituted Ni to the dilute regime and the research trend is approaching the asymptotic LiNiO 2 limit, making LiNiO 2 a system of interest in battery research [38].
Even the nature of the ground state of LNO is still under debate. The universally observed rhombohedral R3m symmetry [38], with Ni being octahedrally coordinated to six oxygen atoms through six equivalent Ni-O bonds conflicts with the renowned Jahn-Teller (JT) activity of low-spin trivalent Ni, which has been experimentally proven on a local scale [38]. In a recent DFT study [39], we argued that this apparent discrepancy might be resolved by the dynamics and low spatial correlation of Jahn-Teller distortions. In that work, the energy distance between Jahn-Teller distorted and non-distorted candidates ( Figure 12) compared to zero-point vibrational energies makes a strong argument in favor of the dynamic Jahn-Teller effect. A non-JT distorted structure resulting from the disproportionation of Ni 3+ has also been reported as a ground state candidate [40] despite the 1:1 ratio between long and short Ni-O bonds, which conflicts with the experimentally determined 2:1 ratio. In the original study, the stability of this structure has been found to depend heavily on the value of the on-site Hubbard correction applied to the PBE functional. With the SCAN-rVV10 functional (with and without on-site Hubbard correction) [39], this candidate is consistently less stable than the JT-distorted models; it is also worth mentioning that the on-site Hubbard correction considerably increases the stability of the JT-distorted models. The dependence of Jahn-Teller stabilization energies on the functional had already been observed by Radin [88] and is ascribed to the difficulty to adequately describe the doubly degenerate high-symmetry, undistorted state.
In light of previous studies, we will focus on four candidate structures for the LNO ground state. These structures are shown in Figure 12. We will furthermore focus only on the energetics of the problem. The goal is to compute the relative energies of these different crystal structures without the uncertainty of DFT.

B. Correlated k-point calculations
Local-basis quantum chemistry methods for electron correlation have been increasingly applied to periodic systems as an alternative to DFT with more controllable accuracy. Here we apply two such methods, second order Møller-Plesset perturbation theory (MP2) [89,90] and coupled cluster singles and doubles (CCSD) [91,92], to the three distorted structures of LNO ( Figure 12). Local basis methods like these can be directly compared to quantum algorithms described in this work, since both are formulated within the same framework of a crystalline Gaussian one-particle basis. While these methods cannot be easily applied to the symmetric structure, which is metallic at the mean-field level, they should provide accurate results for the distorted structures provided that the finite-size and finite-basis errors can be controlled. All mean field, MP2 and CCSD calculations were performed with the PySCF program package [93,94]. QMCPACK [95,96] was used to perform the ph-AFQMC calculations, where we used at least 600 walkers and a timestep of 0.005 Ha −1 . The population control bias was found to be negligible. In all calculations, we use separable, norm-conserving GTH pseudopotentials [65,97] that have been recently optimized for Hartree-Fock [98]. In all calculations on LNO we use the GTH basis sets [41,99] (GTH-SZV and GTH-DZVP specifically) that are distributed with the CP2K [42] and PySCF [94] packages. In Figure 13 we show the convergence of the minimal-basis MP2 energy as a function of effective cell size for increasingly large k-point calculations. This demonstrates the essential difficulty in converging to the bulk limit for correlated calculations: the finite-size error will converge with n −1/3 k . Shifting the k-point grid to (1/8, 1/8, 1/8) and/or twist averaging (TA) does not change the asymptotic behavior of the energy. In all other LNO calculations, we use Γ-centered k-point grids. In all calculations, the density of k-points along each reciprocal lattice vector was chosen so that the density of k-points is as close to constant as possible.
While a minimal basis is useful for a qualitative understanding of the finite-size error, it is does not provide sufficient accuracy to resolve the different LNO structures examined in this work. The double-zeta basis set (GTH-DZVP) is large enough to provide qualitative accuracy, but converging the result to the bulk limit is prohibitively expensive for the systems considered here. We can nonetheless provide some estimates of the ground-state CCSD and MP2 (DZVP) energies as shown in Table IV and V. Since there is no evidence of particularly "strong correlation" in any of these systems (see Appendix E for a more detailed discussion), MP2 and CCSD should provide qualitatively correct estimates of the ground state energy. The unusually large MP2 correlation energy for the P2/c structure suggests it may not be as reliable for this structure, and this suspicion is confirmed by the CCSD and ph-AFQMC calculations. For CCSD and ph-AFQMC, the P2 1 /c structure is lowest in energy which agrees qualitatively with the DFT calculations in Ref. [39]. However, this prediction carries with it a great deal of uncertainty due to the small simulation size, small one-particle basis set, and error in the MP2/CCSD/ph-AFQMC approximations.

C. Single shot density matrix embedding theory
Another way to apply high-level correlated methods to periodic solids is through quantum embedding methods in which a local impurity is treated with a high-level method and the remainder of the system, the bath, is treated at a lower level of theory. For periodic solids, dynamical mean-field theory (DMFT) is perhaps the most widely successful such method [20,[100][101][102][103]. Density matrix embedding theory (DMET) is an efficient quantum embedding method for the ground state of quantum systems [104,105], and it has recently been applied to periodic solids with a fully ab initio Hamiltonian [20].
Though very large impurities are necessary to converge to the bulk limit of the correlated method used for the impurity, a fixed impurity size provides a local, systematically improvable approximation to the correlation energy. This is particularly useful in cases where a local treatment of correlation is sufficient for a qualitatively correct solution.
Here we apply DMET with a CCSD impurity solver to the distorted structures in a minimal basis set (GTH-SZV). The libdmet code [19] was used for the DMET calculations with the PySCF program package [93,94] used in the impurity solver. Figure 14 shows the minimal-basis DMET results with for each of the three distorted structures. In the context of DMET, we can effectively converge the mean-field part of the problem. Unfortunately, the small one-particle basis set and modest impurity size make it unable to meaningfully resolve these three structure of LiNiO 2 . Quantum simulation can potentially overcome some of these limits by acting as a lower scaling, unbiased impurity solver [106].

D. Quantum resource estimates for LNO
Quantum resource estimates for LNO using the SF and DF LCUs are reported in Table VI. THC is not reported due to the difficulty of re-optimizing the THC tensors to have low L1-norm as discussed in References [32,37,79]. For the sparse LCU, a threshold of 1 × 10 −4 was determined by averaging the thresholds for the systems in Table I required to achieve 1 mE H per unit cell. For the SF LCU the truncation of the auxiliary basis was set to eight times the number of molecular orbitals which was determined by requiring the error in the MP2 energy for the smallest C2/m system to be less than 1 mE h per formula unit. For DF, the same requirement was used to determine a cutoff for the second factorization of 1 × 10 −3 . The trends are consistent with what was observed in Section IV: DF is consistently more efficient than either sparse or SF LCUs.
For the smaller systems, these calculations are anticipated to be useful for benchmarking faster classical methods. For the larger systems, the estimated run times are daunting, but we are optimistic that further algorithmic improvements can make calculations like these feasible in the future.

VI. CONCLUSION
In this work we developed the theory of symmetry-adapted block encodings for extended system simulation using four different representations of the Hamiltonian as LCUs in order to improve quantum resource costs for reaching the thermodynamic limit when simulating solids. In order to realize an asymptotic speedup due to symmetry, we substantially modify the block encodings compared with their molecular counterparts. To demonstrate these asymptotic improvements we compiled constant factors for all four LCUs and compared their performance on a suite of benchmark systems and a realistic problem in materials simulation. We find that despite a clear asymptotic speedup for walk operator construction there are competing factors (such as lower compression in Hamiltonian tensor factorizations) that make it difficult to observe a large speedup using symmetry. It was recently shown that variationally constructing tensor compressions for Hamiltonian simulation can improve quantum resource requirements [37,79,107] and thus we believe the compressions can be improved to ultimately demonstrate a speedup for these types of simulations.
For the sparse and SF LCUs we derive a O( √ N k ) speedup in constructing select and prepare by ensuring only the minimal amount of symmetry unique information is accessed by the quantum circuit through QROM. In both cases a speedup is observable, though it is much clearer in the SF case. Observing the sparse LCU speedup is more challenging due to the difficulty of converging the N k and N dependence of the two-electron integrals. Compared with the molecular case where sparse was competitive with the DF and THC algorithms [32,71], largely due to the simplicity of select, we find that sparse is not viable for converging to the thermodynamic limit of solids.
The DF and THC tensor factorizations yield LCUs as unitaries in non-orthogonal bases and lead to much higher compression than sparse and SF LCUs. In the DF case we derive an asymptotic O( √ N k ) improvement in Toffoli complexity and qubit cost when constructing the qubitization walk operator. Unfortunately, λ is increased in these cases. The increase is attributed to the lower variational freedom in constructing non-orthogonal bases when repre- senting the two-electron integral tensor in factorized form compared with the non-symmetry adapted setting. For the THC case, no asymptotic speedup is formally possible. This stems from the linear cost of unary iteration over all basis states. Nevertheless, due to competing prefactors between unary iteration and state preparation, we do observe a √ N k scaling improvement in the Toffoli per step and logical qubit cost for the range of systems studied. This is likely a finite size effect, but may be a practically important when considering which algorithm to chose in the future. Thus, improving the λ value of THC through more sophisticated and affordable means is worth further investigation.
Reaching the thermodynamic and complete basis set limit is very challenging, even for classical wavefunction methods like CCSD and ph-AFQMC. Previous ph-AFQMC results for simple insulating solids with two-atom unit cells suggest that at least a 3 × 3 × 3 and 4 × 4 × 4 sampling of the Brillouin zone is required to extrapolate correlation energies to the thermodynamic limit [108]. Similarly, it has been found that quadruple-zeta quality basis sets are required to converge the cohesive energy to less than 0.1 eV / atom, while a triple-zeta quality basis is likely sufficient for quantities such as the lattice constant and bulk modulus [109]. Similar system sizes and basis sets were found to be required for CCSD simulations of metallic systems [18]. Although the theory of finite size corrections [110][111][112][113] is still an area of active research [114,115], the simulation of bulk systems even with these corrections typically requires on the order of 50 atoms, which in turn corresponds to hundreds of electrons and thousands of orbitals. For excited state properties, particularly those concerning charged excitations, even larger system sizes may be required without the use of sophisticated finite size correction schemes [116]. Thus, we suspect that simulating large system sizes will continue to be necessary in order to obtain high accuracy for condensed phase simulations. It is important to note that high accuracy classical wavefunction methods are often considered too expensive for practical materials simulation, and DFT is still the workhorse of the field. Appendix F shows that simulating even simple solids with coarse k-meshes can take on the order of hours, which would otherwise take seconds for a modern DFT code. From the quantum resource estimates it is clear that several orders of magnitude of improvement are necessary before practical materials simulation is possible. Despite this, the fairly low scaling of phase estimation as a function of system size serves as encouragement to pursue quantum simulation for materials further.
The aforementioned convergence difficulties are demonstrated in our classical calculations on the LNO system when attempting to resolve the discrepancy between band-theory predictions and experimental observations of the ground state geometry. Furthermore, the variance in energy between CCSD, MP2, ph-AFQMC, and DMET (and their expenses) make it difficult to select an efficient method for determining Hamiltonian parameter cutoffs to use in quantum resource estimation. If anything, this highlights the need for high accuracy classical computation when performing quantum resource estimates and ultimately picking an algorithm for quantum simulation. The quantum resource estimates for LNO simulations are exorbitantly expensive even at small k-mesh; estimated to run in O(10 2 ) − O(10 3 ) days using the DF LCU. Just as resource estimates for chemistry fell drastically with algorithmic developments clearly further algorithmic improvements are needed to make a LNO sized problem feasible on a quantum computer.
Qubitization is a general tool for Hamiltonian simulation and there may be other simulation scenarios when the improved walk operators yield faster simulations. There are also areas to further improve the quantum algorithms by taking advantage of space group symmetry along with translational symmetry. In classical calculations this can lead to substantial computational savings even at the mean-field level. Just as in the case of quantum algorithms for molecular simulations we expect the quantum resource costs to fall with further exploration of algorithmic improvements.
Appendix A: Sparse representation derivations

The Pauli operator representation of the one-body term
Here we derive the Pauli operator form of the one-body operator amenable to implementation as a Majorana select operation. The one-body operator is rewritten as Re(h pk,qk ) −i ZY pkσ ZX qkσ + i ZX pkσ ZY qkσ Im(h pk,qk ) ZX pkσ ZX qkσ + ZY pkσ ZY qkσ + 1 2 Re(h pk,qk ) ZX qkσ ZY pkσ + ZX pkσ ZY qkσ Im(h pk,qk ) ZX pkσ ZX qkσ + ZY pkσ ZY qkσ + 1 2 Re(h pk,qk ) ZX pkσ ZY qkσ + i 4 Im(h pk,qk ) ZX pkσ ZX qkσ + ZY pkσ ZY qkσ + 1 2 In the last line we have used the symmetry of Re(h pk,qk ) to combine ZX qkσ ZY pkσ and ZX pkσ ZY qkσ , then used the fact that iXY = −Z to combine the sum with p = q with that for p. The complete expression for the Hamiltonian has the sum over σ and k, which we have left out for simplicity here. Including those gives the expression in Eq. (19).

One-body correction for sparse case
Next we derive the effective one-body term from the two-electron part of the Hamiltonian. In the case p = q and Q = 0, the second term in square brackets in Eq. (20) can be written as − V * pk,q(k Q),r(k Q),sk a † pkσ a q(k Q)σ a r(k Q)τ a † sk τ = V * pk,q(k Q),r(k Q),sk a pkσ a † q(k Q)σ a r(k Q)τ a † sk τ − V * pk,q(k Q),r(k Q),sk (a pkσ a † q(k Q)σ + a † pkσ a q(k Q)σ )a r(k Q)τ a † sk τ = V * pk,q(k Q),r(k Q),sk a pkσ a † q(k Q)σ a r(k Q)τ a † sk τ − V * pk,q(k Q),r(k Q),sk a r(k Q)τ a † sk τ . (A2) In the last line we have used the fact that for p = q and Q = 0, a pkσ a † q(k Q)σ + a † pkσ a q(k Q)σ is just the identity, so this becomes a one-body operator.
Similarly, if r = s and Q = 0 (but p = q), the second term in square brackets in Eq. (20) can be written as Thus we see that in either case (p = q or r = s), we have the same expression as in Eq. (21), plus a one-body operator. Moreover, because of the symmetry of V (in swapping the pq pair with the rs pair), these corrections are equal. Note also that we can relabel swapping p with q and r with s to replace V * pk,qk,rk ,sk a pkσ a † qkσ with (now explicitly taking This means that the contribution of these corrections is In this expression the constant factor is determined as follows. There is a factor of 1/4 in Eq. (20). Next, there is a factor of 2 because we have the contribution from p = q as well as that from r = s. Last, there is the factor of 2 from the summation over the spin τ . As a result, these factors cancel to give 1 above. Therefore, for p = q, we can combine this one-body term with h pq as Next we consider the case where p = q, r = s, and Q = 0. Then the second term in square brackets in Eq. (20) can be written as The operators in brackets in the final line can be written as, taking p = q, r = s, and Q = 0, a † pkσ a pkσ a † rk τ a rk τ + a pkσ a † pkσ a † rk τ a rk τ − a pkσ a † pkσ a † rk τ a rk τ − a pkσ a † pkσ a rk τ a † rk τ = a † rk τ a rk τ − a pkσ a † pkσ = a † rk τ a rk τ + a † pkσ a pkσ − 1 1.
By symmetry of swapping p and q, and swapping r and s, we must be able to simplify the final line of (A7) to V pk,pk,rk ,rk (a † rk τ a rk τ + a † pkσ a pkσ − 1 1).
That is, this value of V is real. We can also relabel r and p and use symmetry to show the contribution from the first term in Eq. (A9) is equivalent to V rk ,rk ,pk,pk a † pkσ a pkσ = V pk,pk,rk ,rk a † pkσ a pkσ . (A10) Hence the contribution of these corrections is In this case, the constant factor comes from 1/4 in Eq. (20), and a factor of 2 from the sum over τ . As a result the expression in Eq. (A9) is divided by 2 here, and apart from the identity we have the same expression as that accounting for only one of the pairs p, q and r, s being equal. The operator proportional to the identity can be ignored in the implementation of the Hamiltonian because it just gives a global shift in the eigenvalues.

Complexity for sparse implementation
The fundamental operator we are aiming to implement is in the form of Eq. (19) for the one-body term and Eq. (23) for the two-body term. In both we have a real part and an imaginary part; for the one-body term this is h pq , and for the two-body term this is V pk,q(k Q),r(k Q),sk . We need to perform a state preparation that provides real amplitudes for the real and imaginary parts of h and V on separate basis states (not just real and imaginary parts of an amplitude on each basis state). This means the number of items of data to output is doubled in order to give the real and imaginary parts. The state preparation is otherwise essentially unchanged from that in [71], as described in Eq. (48) of that work and the accompanying explanation.
Recall that in the sparse state preparation procedure, we use a register indexing the nonzero entries (see Eq. (43) of [71]). That is used to output "ind", "alt", and "keep" values via QROM (see Eq. (44) of [71]). The "ind" values are values of p, q, r, s, as well as the sign needed, and a qubit distinguishing between the one-and two-body terms. The "alt" values are alternate values of these quantities, and "keep" governs the probability of swapping these registers for the state preparation via coherent alias sampling. Since we need a bit to flag whether the amplitude being produced is for the real or imaginary part, that would indicate we need two extra bits output, one for the "ind" value and one for the "alt" value. However, we can use one bit in the register indexing the nonzero entries to flag between real and imaginary parts. It is most convenient to make this register the least significant bit. Then we just need to produce "alt" values of this register, so the output size is only increased by 1 bit instead of 2. A requirement for this approach is that the non-zero entries of V that are retained are the same for the real and imaginary parts.
A further increase in the size of the output register is because we need to output values of k, k , and Q. The number of bits needed to store k is not simply log N k because k is a vector. The number of bits will be denoted n k . If we assume that the number of values is given by the product of numbers in the three dimensions N k = N x N y N z , then n k = log N x + log N y + log N z .
(A12) Therefore k, k , and Q increase the size of both the ind and alt registers by 3n k , for a total of 6n k . The size of the output is given in Eq. (A13) of [32] as m = ℵ + 8 log(N/2) + 4, and would here be increased to where we have also increased the size of the output by 1 to account for selecting between real and imaginary parts, as discussed above. The quantity ℵ is the number of bits for the "keep" register. The remaining consideration for the sparse state preparation is the symmetry. In prior work there were three symmetries, with swap of p, q with r, s as well as swaps within the p, q and r, s pairs. The method to take advantage of this was described from about Eq. (49) on in [71]. There you only perform the preparation for a restricted range of p, q, r, s, then use three qubits to control swaps to generate the symmetries.
Here we have the symmetry with swap of p, q with r, s, but we can only swap the p, q and r, s pairs simultaneously. We also need to take the complex conjugate when performing that swap. In order to implement the symmetries here, we will have two control qubits. One qubit will be in a |+ state and control swap of the p, q with r, s as before, except we now have the registers containing k, k , k Q, k Q to swap. That qubit is only set to |+ for the two-body term, since that symmetry does not make sense for the one-body term. The second qubit is used to simultaneously swap the p, q and r, s pairs, as well as the registers containing k, etc. It will also be used as a control for a Z phase gate on a qubit flagging imaginary components. That is a Clifford gate and is not included in the Toffoli count.
The net result is that the cost of the swaps to produce these symmetries is unchanged from that in [71], except in that we are counting the qubits needed to store k, etc, as well as p, q, r, s. Since a controlled swap of two qubits can be performed with a single Toffoli (and Clifford gates), the Toffoli cost of the two controlled swaps of registers is the total number of qubits used to store p, q, r, s as well as k, k , k Q, k Q, which is 4 log(N/2) + 4n k . Note that in the state preparation we will be producing k, k , Q, and need to compute k Q and k Q before performing the swaps for these symmetries.
Assuming for the moment that N x , N y , N z are all powers of two, then the number of Toffolis needed for the modular subtractions of the three components will be n k − 3, unless one or more of N x , N y , N z is equal to 1. It is simpler to give the cost as n k Toffolis, to avoid needing to address special cases. A further complication is when one or more of N x , N y , N z are not powers of two. In this case, the subtraction can be performed in the usual way for two's complement binary. Then you can check if the result for any component is negative, and if it is then add the appropriate N x , N y , N z to make it non-negative. The controlled addition of a classically given number has complexity n k , so this at worst doubles the complexity to 2n k for the modular subtraction.
The other major feature that we need to account for is the modified select operation needed. The basic circuit primitive was given in Figure 13 of [32], in order to apply ZY p,σ followed by ZX q,σ . A more complicated circuit primitive was given in Figure 1 of [71], which included testing p = q which is not needed in the approach of [32]. Here the scheme is more complicated, because instead of having a fixed sequence where we need to apply Y followed by X we have every combination. This can be achieved by simply performing each twice; once with a controlled ZY and once with a controlled ZX, with a doubling of the Toffoli complexity. That can be seen easily from the diagram in Figure 9 of [2]. There a control qubit is used, so that can be used to control application of this circuit with Y , then to control application of this circuit with X.
To understand how X versus Y is selected, note that there are effectively five bits controlling here. Let us call the bit selecting between the one-and two-body terms b 0 ; this is created in the sparse state preparation. Let us call the bit selecting real versus imaginary parts b 1 ; this is again created in the state preparation. There also needs to be a bit b 2 for selecting between the two lines for real and the two lines for imaginary in the expression in Eq. (23). Then we have b 3 to select between the two terms in the first set of square brackets in each line of Eq. (23), and a bit b 4 selecting between the two terms in the second set of square brackets. Now, considering the operators indexed by r, s first, these are applied for the two-body terms but not the one-body term. This control of the operations adds only one Toffoli to the cost. For the first operation, ZX sk τ or ZY sk τ , we can see that the selection between X and Y depends only on bit b 4 . For the second operation, the selection is independent of whether we have the real or imaginary part. We select X if we have b 4 = 0 (the first term) and b 2 = 0 (the first line), or if we have b 4 = b 2 = 1. To create a bit selecting between X and Y we can simply perform a CNOT between these bits, with no Toffoli cost.
Next, consider the operators indexed by p, q. For simplicity we will first consider just the two-body terms. Again the first operation can select between X and Y just by using the bit b 3 selecting between the terms. Then for the second operation, we select X if we have b 1 , b 2 , b 3 equal to 0, 0, 0, or 0, 1, 1, or 1, 0, 1, or 1, 1, 0. It is easily seen that if we apply CNOTs with b 1 then b 2 as control and b 3 as target, then we should apply X if we have b 3 = 0. This selection can be performed without Toffolis again. Now to take account of how the one-body terms are applied, it is convenient to rewrite the first line of Eq. (19) as Then the selection between the operations is identical to that for b 2 = 1 (second lines) for the two-body part. Therefore, for the above analysis of the two-body implementation, we can replace b 2 with a bit that is 1 if b 2 = 1 OR b 0 = 0. This operation requires one more Toffoli. Another modification we need to make is to compute k Q and k Q to use in the selection for the two-body operations. As explained above, these modular subtractions have complexity at worst 2n k . The calculation k Q needs to be controlled on the bit b 0 selecting between the one-and two-body terms, which increases its complexity by n k . Therefore the complexity of this arithmetic is 3 log N k Toffolis. We can keep the working qubits in order to uncompute this arithmetic with Clifford gates.
Finally, we should account for the phase factors needed in the implementation. The phase factors needed are as follows.
1. We should apply an i phase factor on the one-body term. That can be implemented with an S gate which is Clifford.
2. If we have the one-body term (flagged by b 0 = 0) we should flip the sign of the real part (flagged by b 1 = 0). This can be done with a controlled phase, which is again Clifford.
3. For the two-body term (b 0 = 1), real (b 1 = 0), and second line (b 2 = 1) we should flip the sign. This doubly controlled phase has a cost of one Toffoli. 4. We should flip the sign with b 3 = 1 if we have the two-body term (b 0 = 1) and the second line for real (b 1 = 0, b 2 = 1) or the first line for imaginary (b 1 = 1, b 2 = 0). We should also flip the sign with b 3 = 1 if we have the one-body term (b 0 = 0) and real (b 1 = 0). To achieve this we can first perform a CNOT with b 1 as control and b 2 as target. Then, if b 0 = 0, b 1 = 0 OR b 0 = 1, b 2 = 1 we should apply a Z gate to the qubit containing b 3 . This can be achieved with two double controlled phase gates, so has Toffoli cost 2.
5. We should flip the sign for b 4 = 1 if we have the second line b 2 = 1. That is just a controlled phase with no non-Clifford cost.
As a result, the total complexity of implementing these phase factors is 3 Toffoli gates. The total additional complexity is therefore 2 Toffolis for the selection of X versus Y when we account for needing to perform the one-or two-body term, 3 log N k Toffolis for subtractions, 3 Toffolis for phase factors, and doubling the selection cost to select between X and Y . The two Toffolis to account for the one-body term were one for selecting performing the operators indexed by r, s, and another Toffoli to perform an OR between b 0 and b 2 for the operators indexed by p, q.
A further complication arises where the h and V are dependent on the spins σ and τ . This is easily accounted for by outputting the values of σ, τ as part of the state preparation. This means that the size of both the "ind" and "alt" outputs are increased by 2, making the total size of the output increase by 4 to be ℵ + 8 log(N/2) + 6n k + 9. (A15) Often there is the symmetry that for V the value with σ =↑, τ =↓ are the same as for σ =↓, τ =↑. This means that we can omit the case σ =↓, τ =↑, and use a swap of these two qubits controlled by an ancilla qubit in the usual way for obtaining symmetries. In the detailed costing below, we give results for the case where h and V are not dependent on spin for simplicity. The QROM output size is where n N = log(N/2) . This output size is increased above that analysed in [32]. Then, using that output size, the formula for the cost of the preparation with d unique nonzero entries is and of the inverse preparation is Here k 1 and k 2 must be chosen as powers of 2. This formula is the same as in [32], but with the modified value of m.
To begin the state preparation, we need to prepare an equal superposition state over d basis states. The analysis is described in [32], which gives the costing 3 log d − 3η + 2b r − 9 Toffoli gates. Here η is a number such that 2 η is a factor of d, and b r is a number of bits used for rotation of an ancilla qubit to improve the amplitude of success. This is a cost needed both for the preparation and inverse preparation.
Other minor Toffoli costs are as follows. We use extra ancillas to save cost, because a large number of ancillas were used for the QROM, and can be reused here without increasing the maximum number of ancillas needed. In the following we use the notation n N = log(N/2) . Figure 13 of [32] twice, but controlling between X and Y . This complexity is 4N N k − 6, since we have 8 times a complexity of N N k /2 − 1 for each of the selected operations, plus 2 Toffolis to generate the qubits we need for the control. There were two Toffolis needed to account for selecting between one-and two-body terms, and otherwise the selection to account for the various terms can be performed using Clifford gates. In addition to this, we need to perform swaps controlled by spin qubits twice for each of the two spin qubits, with a complexity 2N N k . That then gives a total complexity of this step 6N N k − 6.

Perform select as shown in
2. The state preparation needs an inequality test on ℵ qubits, as well as controlled swaps. The controlled swaps are on n N + 3n k + 2 qubits. Here 4 log(N/2) are for the values of p, q, r, and s, the 3n k is for the k, k , and Q values, a +1 is for the qubit which distinguishes between the one-and two-electron terms, and a further +1 comes from the qubit for selecting between the real and imaginary parts. There are also ind and alt values of the sign, but the correct phase can be applied with Clifford gates, so this does not add to the Toffoli cost. The cost of the inequality test on ℵ qubits is ℵ. As in [32], we can eliminate the non-Clifford cost of the inverse preparation using ancillas and measurements, so the Toffoli cost is ℵ + 4n N + 3n k + 2.
3. The controlled swaps used to generate the symmetries have a cost of 4n N + 4n k . This is increased by 4n k over that in [32], since we need to swap k registers as well. Although there are only two controlled swaps rather than three in [32], two of the controlled swaps in [32] together act on as many qubits as one controlled swap here, so the factor of 4 is the same as in [32]. A further 4n k cost is for computing k Q and k Q (or 2n k if N x , N y , N z are powers of 2), and an extra n k is needed to make the computation of k Q controlled. Again these controlled swaps can be inverted for the inverse preparation with measurements and Clifford gates. Thus the total Toffoli cost here is 4n N + 9n k .
4. For the qubitization construction a reflection on the ancilla is needed as well. The qubits that need to be reflected on are There is no non-Clifford Toffoli cost for the preparation on b 2 , b 3 , b 4 , since an equal superposition may be prepared with a Hadamard. They are control qubits that need to be reflected upon for the qubitization, so add a cost of 3 Toffolis to the reflection giving a total cost log d + ℵ + 6.
5. As before, the control for the phase estimation uses unary iteration on the control registers, with one more Toffoli for each step. The control by these registers is implemented simply by controlling the reflection, which needs just one Toffoli per step.
6. An extra three Toffolis are needed for the phase factors.
The total cost for a single step is then with m = ℵ + 8n N + 6 log N k + 5, n N = log(N/2) , η an integer such that 2 η is a factor of d, and b r the number of bits used for rotation of an ancilla qubit. We may count the qubit costs by considering the maximum used during the QROM, as the advanced QROM has a high qubit usage that will not be exceeded in other parts of the algorithm. The qubit costs are therefore as follows.
Appendix B: Single-factorization derivations 1. One-body correction for single factorization For the single factorized form of the Hamiltonian, we may use the same expressions forÂ andB for the case Q = 0 as for Q = 0, with an additional correction proportional to the identity. This yields a one-body correction in the case ofÂ but notB. ForÂ n (Q = 0) we obtain a term proportional to the identity, as followŝ A n (Q = 0) = 1 2 σ∈{↑,↓} k p =q L pk,qk,n a † pkσ a qkσ + L * pk,qk,n a † qkσ a pkσ + σ∈{↑,↓} k p Here we have used the symmetry of L, so L pkpk,n is real. This derivation is similar to that for the one-body term in Appendix A 1. BecauseÂ n (Q = 0) is squared, the identity term gives rise to a one-body correction Here there was a factor of 1/2 on the square ofÂ n (Q = 0), a factor of 2 from the cross term in the square, a factor of 2 from the sum over the spin on the identity, and so a factor of 1/2 has been cancelled. The form of this correction is identical to that for the one-body term, except h pk,qk is replaced with ForB n (Q = 0), it is easily seen that the symmetry L pkqk,n = L * qkpk,n implies thatB n (Q = 0) = 0. If we use the form forB n (Q = 0) in terms of Pauli operators given in Eq. (37), then it will be proportional to the identity due to the case p = q. Squaring then just gives a correction proportional to the identity (which can be ignored in the implementation because it is just an energy shift), and it gives no one-body correction. As a result we add the expression in Eq. (B4) to h pq to obtain the complete one-body Hamiltonian given in Eq. (38).

Complexity for single-factorized representation
To see the changes we need to make to the algorithm for the single-factorized representation, recall that the two-body term was of the form [32] where Q pqσ was an individual Pauli string. So the changes in the representation are • The sum over up to L has been replaced with a sum over Q and n, as well as a sum over the squares of A and B.
• Inside the square, the sum over just σ, p, q now also has a sum over k.
• Inside the sum, instead of just having a single Pauli string, we have a sum over 4, with real and imaginary parts of L pkq(k Q),n .
The amendments we will make to the original algorithm (according to the description in [32]) to implement the block encoding are as follows.
• For the sum over Q and n we can combine them into , and use the same state preparation method as before.
The value of Q will need to be used in the select operation, so needs to be output as part of that state preparation.
• In the preparation for the block encoding of A and B, the index k will be needed as well as p and q.
• We no longer take advantage of p, q symmetry.
• We need to perform arithmetic to compute k Q and k Q, with a cost of 4n k (or 2n k if N x , N y , N z are powers of 2).
• A number of qubits can be used for selecting between the parts of the linear combination of unitaries, similar to the sparse case. We have b 0 to select between the one-and two-body terms, b 1 for selecting between the real and imaginary parts, and b 3 selecting between the two terms in one application of A or B. The qubit b 2 can be used for selecting between A and B, which is a change from the sparse case, where it was used for selecting between lines. We do not need b 4 because we are implementing A or B twice (and creating the bit b 3 both times).
• There needs to be a doubling of the selection cost to select between X and Y as in the sparse case.
• The creation of the qubits for controlling between X and Y can be performed with one additional Toffoli. Note first that the terms in A are equivalent to the one-body part, and the terms in B are the same except with the real and imaginary lines swapped around. This means that we can use b 0 and b 2 as a control to flip b 1 , which effectively swaps the real and imaginary parts for B so it can be implemented in the same way. Now, for the first selection of X versus Y , we can apply a CNOT with b 1 as control and b 3 as target, and use that as control For the second selection we can simply use b 3 as control.
• For the phase factors, we just need a sign flip if b 1 = 0 and b 3 = 1, which is a Clifford controlled phase.
To explain the modifications needed for the costings, here we give the sequence of steps with the same numbering as in [32], explaining the differences. (B6) where | , Q, n indicates which starts from 1 indexing values of Q, n, but Q and n are also output in registers. That is, we will be preparing while outputting values of Q, n. We are assuming the more difficult case where the number of values of Q or n are not powers of 2, but if they are then further simplifications are possible. This has complexity as follows.
(a) Preparing an equal superposition on M N k + 1 basis states has complexity 3n M N + 2b r − 9, where b r is the number of bits used for the rotation on the ancilla, with ℵ 1 being the number of bits used for the keep values (which govern the precision of the state preparation via the inequality test). Here n M N and 2n k are for and Q, with the factor of 2 accounting for ind and alt values of Q. The extra 2 qubits are for outputting a qubit showing if = 0 (for selecting between the one-and two-body parts). The complexity is (c) An inequality test is performed with complexity ℵ 1 .
(d) A controlled swap is performed with complexity n k + log M + 1.
2. Next, we prepare a state on the second register as pq0 |0, p, q, 0 + 2|Im(h pq )| |θ where θ kpq1 are used to obtain the correct signs on the terms, and the |+ states at the end are used to select the spin and control the swap between the p and q registers. Now we have a distinction from [32] in that we have separate real and imaginary parts, and a separate prepared qubit to flag between the real and imaginary parts. Because of the large number of variables, we will again use a single variable for iteration, and use it to output k, p, q. The complexity of this state preparation is then as follows.
(a) First, prepare an equal superposition over the variable for iteration. There are P = N k N 2 /2 values to take, which includes a factor of 2 for the real and imaginary parts, N k for k, and N 2 /4 for the values of p, q. Then the complexity of preparing the equal superposition is 3n P − 3η + 2b r − 9, where n P = log P , with η being the largest number such that 2 η is a factor of P .
(b) The size of the QROM output is where the first term is for the three components of k, the second is for p and q. The third is for ind and alt values of the qubit to store the correct sign, as well as an alt value of the extra qubit for selecting between the real and imaginary parts. We do not include an ind value for that qubit, because it is part of the register we are iterating over. The complexity of this QROM will be where we are accounting for the cost to select based on both the index from the factorization and the index for k, p, q, and using the result for the complexity of QROM on two registers from Appendix G of [32].
(c) Perform the inequality test with cost ℵ 2 , which is the bits of precision for this state preparation.
(d) Perform the controlled swap with the alt values with cost n k + 2n N + 1. Here we are swapping the ind and alt values of k, p, q, as well as the qubit selecting between real and imaginary parts. The sign required for the sign qubits can be implemented with Cliffords as in [32], so does not add to this Toffoli cost.
3. We no longer perform swaps of p and q for symmetry, but we do need to perform arithmetic to compute k Q and k Q, with a cost of 4n k .
4. Perform select by performing the sequence of four controlled ZX p,σ or ZY p,σ operations. The cost is 4(N N k /2− 1) Toffolis since it must be controlled, and there is a cost of one more Toffoli to create the qubits to control on. In order to select the spin we also perform a swap controlled by the spin selection qubit before and after, with a cost of N N k Toffolis.
5. Reverse steps 2 and 3, where the complexities are the same except the QROM complexity which is changed to 6. Reflect on the qubits that were prepared in step 2. The qubits we need to reflect on are as follows.
(a) The n P qubits for the variable of iteration.
(b) We need to reflect on the ℵ 2 registers that are used for the equal superposition state for the state preparation.
(c) One that is rotated for the preparation of the equal superposition state.
(d) One for the spin.
(e) One for controlling the swap between the p and q registers.
(f) One for selecting between the real and imaginary part.
(g) One for selecting between A and B.
That gives a total of n P + ℵ 2 + 5 qubits. The reflection needs to be controlled on the success of the preparation on the register, and = 0, making the total cost n P + ℵ 2 + 5 Toffolis.
7. Perform steps 2 to 5 again, but this time M N k + 1 is replaced with M N k in Eq. (B12) and Eq. (B13). Also, the select operation needs to be controlled on = 0, which flags the one-body term. That requires another 4 Toffolis.
8. Invert the state preparation on the register, where the complexity of the QROM is reduced to 9. To complete the step of the quantum walk, perform a reflection on the ancillas used for the state preparation. There are n M N + n P + ℵ 1 + ℵ 2 + 5, where the qubits we need to reflect on are as follows.
(a) The n M N qubits for the register.
(b) The n P qubits for the registers in the state preparation for A and B.
(c) The ℵ 1 qubits for the equal superposition state used for preparing the state on the register using the coherent alias sampling.
(d) The ℵ 2 qubits for the equal superposition state for preparing the state for A and B.
where the factor of 1/8 becomes a factor of 1/2 accounting for spin. This factor of 1/2 is further divided by two because we perform oblivious amplitude amplification-i.e. the inner step of qubitization evolving by 2Â n (Q) 2 − 1 and 2B n (Q) 2 − 1.
Next, the one-body terms in the third line of Eq. (C4) can be rewritten as Taking the trace of Eq. (C2) and (C3) then implies k1 A k = 1 Tr(Â n (Q)) = 1 The trace ofρ n (Q) is non-zero only for Q = 0. In that case which is real. Moreover, it is easily seen thatρ(0) is Hermitian using the symmetry L pkqk,n = L * qkpk,n , soÂ n (0) =ρ n (0) andB n (0) = 0. Therefore Therefore the contribution to the one-body Hamiltonian becomes As a result, the complete one-body Hamiltonian is This is identical to the result that was obtained in the single-factorization case as in Eq. (B4). Thus the L1-norm of H 1 is the sum where λ k,p is an eigenvalue of the matrix representing H 1 (k) which are the coefficients in the parenthesis of Eq. (C14).

Complexity of the double-factorized representation
Our form of the two-body part of the Hamiltonian iŝ and similarly forB n (Q). In comparison, the double-factorized Hamiltonian from [32,36] is So, in contrast to the decomposition before, instead of a sum over , we have a sum over Q, n, and a qubit indexing overÂ,B. This difference can be accounted for easily in the method as presented in [32]. That method may be summarized as follows.
1. Perform a state preparation over for the first factorisation.
2. Use a QROM on to output some parameters needed for the state preparation for the second factorisation (the operator that is squared).
3. Perform the inner state preparation over p.
4. Apply a QROM to output the sequence of rotations dependent on and p.
6. Apply a controlled Z.
7. Invert the Givens rotations, QROM, and state preparation over p.
8. Perform a reflection on the ancilla qubits used for the state preparation over p.
11. Invert the state preparation from step 1.
Note that this is distinct from the procedure in [36] which combined the and p preparations.
To account for the changes here, the index can be used to iterate through all possible values of Q, n, and the qubit indexing overÂ,B. Most of the steps can be performed ignoring these values, but we will need to know Q before performing the Givens rotations. It is convenient to output this value in the QROM used in step 2, which slightly increases the output size of this QROM. We will also need to output k values, and these will be given in the second state preparation used in step 3. But, that preparation will produce a joint index of p and k without giving k explicitly (similar to our preparation over not giving Q explicitly. This can be output by the QROM in step 4. In order to apply the Givens rotations, we will need to perform controlled swaps of system registers k, k Q into working registers, then apply the Givens rotations on those working registers. Since k Q is not given directly by the state preparation, it needs to be computed with cost 2n k (or n k if N x , N y , N z are powers of 2). The controlled swaps have a Toffoli cost of 2n k for the unary iteration, and N N k for the controlled swaps. The cost of N N k is because we need to run through N N k /2 system qubits twice. These controlled swaps are performed 4 times, because they need to be performed before and after each application of the Givens rotations. That gives a total cost from this part Then for the QROM outputting the Givens rotations, the number of items of data can be given as where Ξ Q,n,k,A and Ξ Q,n,k,B are the cutoffs in the sums forÂ andB. As per Eq. (47), we define Ξ as this quantity divided by L = 2N k M , so we can write the number of items of data as LΞ.
The Givens rotations need to be on 2N orbitals, so there are 2N Givens rotations. For each of these rotations two angles need to be specified, in contrast to one in [32,36]. The size of the data output for the QROM for the Givens rotations is increased to 2N , because there are 2 registers of size N/2, and there are 2 rotations of of precision for each Givens rotation. The total complexity of applying the Givens rotations is increased to 16N ( − 2). This is an increase of a factor of 4 over that in [32], with a factor of 2 from using 2 working registers, and a factor of 2 because there are two rotations for each Givens rotation.
The other changes in the cost are relatively trivial. There is a swap on the system registers controlled on the spin register. Since this is now on N N k qubits instead of N , the cost is multiplied by N k .
So, to summarize the complexity using the same numbering of steps as in [32], we have the following.
1. The cost of the state preparation over is where L is now 2N k M and as before b p1 = n L + ℵ 1 , n L = log L .

The complexity of the QROM on is now
with the extra n k being to output Q. Here n Ξ is the number of bits needed for Ξ k,p values of p, and n L,Ξ is the number of bits needed for the offset.
3. The cost of the second stage of state preparation is where the brackets are used to indicate the cost of parts (a) to (d) of step 3. As well as using our modified definition of Ξ, the only change over the costing in [32] is replacing N/2 with N N k /2 for the range of values for the one-body term. In this cost we are including the second use of the preparation in part 7.

The cost of the number operators via QROM is
(C26) Here the term 4N (k r − 1) has been increased by a factor of 4 over that in [32], because we have 2 times as many qubits that the Givens rotations need to act on, and there are twice as many rotations needed for each Givens rotation. (This term is corresponding to the output size for the QROM.) We have also added n k for the output size so we can output the value of k needed to select the register. Again N/2 is replaced with N N k /2 for the one-body term. The quantity 16N ( − 2) is for the cost of the Givens rotations, and is also multiplied by a factor of 4 over that in [32]. The 2N N k for the controlled swaps for spin, and is increased over 2N in [32] because we now have k.
5. The inversion of the state preparation has cost 2(7n Ξ + 2b r − 6) + 2(n L,Ξ − 1) + This cost is the same as in part 3, except the cost of erasing the QROM is reduced. We are again including both uses (with the second described in step 7).
6. The reflection for the oblivious amplitude amplification has an unchanged cost 7. The cost of the second use of the block encoding to give the square are already accounted for above.
8. The cost of inverting step 1 is and for inverting step 2 is where we are using the improved cost for erasing QROM.
9. The reflection cost is unchanged at 10. The extra cost of unary iteration on the control register and of controlling the reflection on that register is 2 Toffolis.
11. The new costs of performing controlled swaps into working registers and arithmetic to compute k Q and k Q are Adding all these costs together gives the total cost for block encoding the Hamiltonian. The cost in terms of logical qubits is very similar to that for the original double-factorized approach. The differences are as follows.
which, implies Here ( Q) is used to indicate a modular negative of Q, similar to modular subtraction. In the above expression the complement of G 1 , !G 1 , is defined through and it is important to note that ( Q) is defined to be in the original set of k-points and it is useful as we only build ζ Q . A similar expression can be derived for !G 2 . It is helpful to consider some concrete examples, which are given in Table VII.
. Some examples of the values that the different momentum labels can take in Eq. (D3). We restrict k, Q, ( Q) to be in the original k-point set.

Complexity of the tensor hypercontraction representation
The following is a detailed costing for the qubitization oracles using the THC LCU. In the initial state preparation, we need to prepare a superposition over Q, G 1 , G 2 , µ, ν with weights |ζ Q,G1,G2 µν |. The state can be prepared via the coherent alias sampling procedure, starting with QROM to output keep and alt values. One option here is to produce an equal superposition over Q, G 1 , G 2 , µ, ν, then calculate a contiguous register from these values to use for the QROM. That procedure is fairly complicated, because it requires preparing equal superpositions over three components of Q as well as G 1 , G 2 , µ and ν, then arithmetic for the contiguous register. To simplify the procedure we give the complexity for giving ind values like for sparse state preparation. That is, we prepare the contiguous register, and use the QROM to output both ind (index) and alt values of Q, G 1 , G 2 , µ, ν.
There is the symmetry ζ Q,G1,G2 µν = (ζ Q,G2,G1 ν,µ ) * , which indicates only half the range of µ, ν, G 1 , G 2 values need be prepared. It is convenient to prepare the full range, but use part of the range for real and part for imaginary components. If we only were considering µ, ν, we could use µ ≤ ν for real components and µ > ν for imaginary components. To account for G 1 , G 2 as well, we can combine them with µ, ν as least-significant bits for combined integers to use in inequality tests. This inequality test between µ, G 1 and ν, G 2 is used to give a qubit flagging that the component should be imaginary. A further qubit in a |+ state is used to control a swap of µ, G 1 with ν, G 2 registers, and a controlled Z gate on the qubit flagging the imaginary component gives the desired complex conjugate. As a result the range for µ, ν is M 2 taking account of giving real and imaginary components.
For Q there is also the symmetry where ζ Q,G1,G2 µν = (ζ ( Q),!G1,!G2 µν ) * , so it is only necessary to produce approximately half as many values of Q. This is complicated by the cases where Q = Q. If N x , N y , N z are odd, then the only case where this can be true is that Q = 0, so the number of values of Q that need be considered is (N x N y N z + 1)/2. If one of N x , N y , N z is even and the other two are odd, then for the one that is even there will be a second values of that component of Q that is equal to its negative. That means there are two value of Q overall satisfying Q = Q, and the number of unique values is N x N y N z /2 + 1. Similarly, if there are two even values of N x , N y , N z , then there are four values of Q satisfying Q = Q, and the number of unique values is N x N y N z /2 + 2. For all three of N x , N y , N z even the number of unique values is N x N y N z /2 + 4. We also need N N k /2 values for the one-body term. The number of values is then where v is the number of even values of N x , N y , N z . The size of the output is then where ℵ is the number of bits for the keep register. There is a factor of 2 at the front to account for ind and alt values, then log M for each of µ and ν, and n k for the components of Q. There is a further qubit distinguishing between the one and two-electron terms, a qubit giving the sign of the real or imaginary component of ζ, and 6 qubits for G 1 , G 2 , for a total +8.
1. There is a cost of N k N/2 for controlled swaps for the spin. In principle this is performed four times, because it is performed before and after the two c † c operators. The middle pair can be combined, with the single controlled swap being controlled by the parity of the two spin qubits, for a total cost of 3N k N/2.

2.
Before the state preparation, we need to prepare an equal superposition over d basis states, with costing 3 log d − 3η + 2b r − 9 Toffoli gates. As before, η is a number such that 2 η is a factor of d, and b r is a number of bits used for rotation of an ancilla qubit to improve the amplitude of success. This cost is incurred twice, once for the preparation and once for the inverse preparation.
3. The complexity of the QROM being used for the state preparation is with k p being a power of 2. The inverse preparation then has a cost 4. We perform an inequality test with cost ℵ. Accounting for the inverse of the preparation gives a total cost 2ℵ.

The controlled swap based on the result of the inequality test is on
2 log M + n k + 7 (D8) pairs of qubits, so has this Toffoli cost. Note that we have +7 here rather than +8. This is because we do not need to swap the sign qubits; the sign can be applied with Z gates controlled on the result of the inequality test, not adding to the Toffoli cost (as usual). This cost is incurred again in the inverse preparation for a total of 4 log M + 2n k + 14.
6. As described above, we perform an inequality test between µ, G 1 and ν, G 2 to give the qubit flagging whether we have a real or imaginary component. Then we perform a controlled swap of µ, G 1 with ν, G 2 to generate one symmetry for the state preparation, with the complex conjugate applied using a Clifford gate. This part therefore has Toffoli cost 2 log M + 6. This cost is incurred again in the inverse preparation giving a total cost 4 log M + 12. In addition to this controlled swap, we perform a controlled swap in the middle, but it is not controlled so does not add to the Toffoli complexity.
7. For the symmetry where ζ Q,G1,G2 ) * , we can use a second control qubit to flip the sign on Q, negate G 1 and G 2 , and apply the complex conjugate. The complex conjugate again can be applied with a Clifford gate, and so can the controlled-NOT gates on G 1 , G 2 . A controlled sign flip of Q can be performed with 2n k Toffolis, simply by flipping the sign in usual two's complement binary, then controlling addition of N x , N y , N z in each component.
8. Next we need to prepare a superposition over allowed values of k, because k − Q − G needs to be in the allowed range of k values (using G to indicate G 1 or G 2 depending on which part we are performing). In particular, for the x-component we have an allowed range for k x from Q x to N x − 1 when G x = 0, or 0 to Q x − 1 when G x = 0. It is similar for the other two components. We can therefore prepare a superposition over the appropriate range then add Q x if G x = 0.
Creating an equal superposition requires Hadamards on the appropriate subset of qubits, as well as a Q x , G xdependent rotation to give a high success probability for the amplitude amplification. This information can be output with Toffoli cost 2N x −2 on the qubits representing Q x , G x . The complexity of the controlled Hadamards is then log N x Toffolis, assuming we use a catalytic T state as in [32]. The complexity of preparing the equal superposition is then 6 log N x +2b r −6, including 3 log N x for three rounds of log N x controlled Hadamards. The reason why there is −6 rather than −9 is the inequality test is with a value in a quantum register (in each of three tests), which requires one more Toffoli than an inequality test with a classically given value.
The controlled addition of Q x has complexity 2 log N x . The total complexity of the preparation of the superposition for the three components of k is therefore This cost is incurred four times for the preparation and inverse preparation of k and k .
9. In order to account for the one-body term, we note than the one-body term has a single µ and k rather than Q.
We also do not want the operations we perform in the two-body part for the symmetry to affect the one-body part. We can therefore output µ = ν for the one-body part in the QROM, so the swap of µ and ν has no effect. The value of k for the one-body part can be stored in the same register as used for Q for the two-body part. To prevent the operations used to generate the symmetry ζ Q,G1,G2 µν = (ζ ( Q),!G1,!G2 µν ) * being applied for the one-body part, we can simply apply a Toffoli to produce a new control qubit. The remaining part above is the preparation of the superposition over the k values controlled on Q; this does not need to be amended to account for the one-body part because there we will not be using this value in the extra register.
10. Now that we have prepared the register that is in an equal superposition over the appropriate range of k, we need to use that in combination with Q and µ to prepare a superposition with the correct weights. To do this, we will use coherent alias sampling in the usual way, but will need to construct an appropriate register to iterate over from registers k, Q, G, µ. First we compute k − Q − G in an ancilla register. These two subtractions have cost 2n k . Since it needs to be computed and uncomputed for each of the two factors in the Hamiltonian, the total cost is 8n k . Now, because k and k Q uniquely specify Q, G, these two registers can be used for the iteration instead of Q, with the additional advantage that they are both over the full range of the Brillouin zone. Now we need to compute a contiguous register (((((k x N y + k y )N z + k z )N x + k x )N y + k y )N z + k z )M + µ, where we are using k for k Q. This contiguous register includes many multiplications by classically chosen constants, which has complexity depending on how many ones are in these constants. The worst case is where these numbers are all ones, so we will give the cost for that case even though it is rare.
As discussed in [117] the cost of multiplying two integers when one is given classically is no more than the product of the numbers of bits. For the additions, the cost is no more than the number of bits on the larger number. So, we have a cost as follows.
(a) For multiplying k x N y a cost of log N x log N y . Here N y would have more bits if it were a power of 2, but then the multiplication cost would be zero.
We need to add all these items together to give the total cost, and it needs to be multiplied by 4 because we compute and uncompute for each of the two factors in the Hamiltonian.
Next we have a QROM on this contiguous register with cost N 2 k M k nrm + (k nrm − 1)(n k + ℵ).
with k nrm a power of 2. This is because there are N 2 k M items to iterate over, and we need to output n k bits for the alternate value of k and ℵ for the keep value. We have twice this cost because of the two factors in the Hamiltonian, but the erasure cost for each factor is The last two steps of the coherent alias sampling are an inequality test with cost ℵ and a controlled swap with cost n k . These costs are incurred 4 times, once for preparation and once for inverse preparation for each of the two factors for the Hamiltonian.
11. We will need to prepare a register that is k − Q − G again. We previously computed this, but we need to compute it again because we have performed a state preparation on k. This has a cost of 2n k again, and needs to be done 4 times for a total cost of 8n k .
A further subtlety is that we are storing the value of k to use in the Q register in the one-body case. We can perform a controlled swap into the working register for k or k Q, which has a total cost of 4n k . Combined with the arithmetic cost this is 12n k .
12. To use the register with k or k Q to control the swap of system registers into working registers, we can use each qubit to control swaps of the system registers in a similar way as is used for advanced QROM. The cost for selecting each qubit out of N k is N k − 1, similar to the use in advanced QROM, despite the use of multiple components. In particular, we can perform swaps of system registers based on the x-component of k with cost (N x − 1)N y N z . Then swapping the registers based on the y-component out of the subset of N y N z has cost (N y − 1)N z . Then the cost of swapping based on the z-component has cost N z − 1. Adding these three costs together gives N x N y N z −1 = N k −1. This is performed for each of the N/2 qubits we need, for cost N (N k −1)/2. We need to swap and inverse swap 8 times on N N k /2 system registers, for a total cost of 4N (N k − 1).
13. Next, we consider the output of the rotations for the c modes. These will be controlled by the registers with µ and k (or k Q), as well as the qubit selecting between the one-and two-body terms. A difficulty is that we would need a contiguous register in order to be able to effectively apply the advanced QROM. A method around this is to use a QROM on the selection qubit and k to output an offset, then add µ. The complexity of that QROM is 2N k , then the complexity of the addition is log N k (M + N/2) .
That is in the case where we need to apply the one-body term as part of the implementation. Recall that we only need to do that once when we are applying c † c twice for the two-body term. In the part where we are not applying the one-body term we instead have complexity N k + log N k M .
The size of the QROM output for the rotations is then N . That is again because we need Givens rotations on N/2 qubits, and need two angles for each Givens rotation with each. The complexity is then, in the case with the one-body term, and for the case where we do not need the one-body term with k r being a power of 2. The cost of the Givens rotations is 2N ( − 2), because we have N/2 qubits and 2 angles for each Givens rotation.
The cost of erasing the QROM is for the two cases is 4. A phase gradient state of size is used for rotations.

5.
A single T state is used for the controlled Hadamards.
6. The QROM used for the first state preparation outputs m qubits.
7. This first state preparation also uses m(k p − 1) + log(d/k p ) − 1 temporary qubits. 8. A single qubit is used for the result of the inequality test for the coherent alias sampling. 9. A single qubit is used for the result of the inequality test between µ, G 1 and ν, G 2 .
10. There are n k +3b r qubits used as the output of the QROM used to give the information needed for the preparation of the equal superposition over k, as well as b r for k.
11. In the preparation of the state for k we compute k Q in an ancilla needing n k qubits, and compute a contiguous register that needs log(N 2 k M ) qubits. 12. The state preparation for k uses n k + output qubits. 13. The state preparation uses (k nrm − 1)(n k + ) + log N 2 k M k nrm − 1 (D20) temporary qubits.
14. We also use ℵ in this state preparation for a superposition state, and another qubit for the result of the inequality test.
15. There are log N k (M + N/2) qubits used for the contiguous register for the QROM for the qubit rotations. 16. There are N k r used for the QROM for the rotations, with another log N k (M + N/2) k r − 1 (D21) temporary qubits.
Accounting for the maximum total involves determining the maximum number used which depends on the use of temporary ancillas. We first need to determine whether the temporary qubits described in part 13 or the total qubits in parts 14 to 16 are larger. We take the maximum of these, and add it to the qubits used in parts 8 to 12. Then we compare that to the number of temporary qubits in part 7. The maximum of that is added to the qubits in parts 1 to 6. 15. This shows how to construct a self-inverse procedure for block encoding two-electron terms, as in Fig. 6 of [32]. The left side is the manifestly self-inverse form, and the right side is a more intuitive form where the |+ state is used to generate the symmetry between µ and ν, and the two V operations are controlled by µ and ν in succession.
Next we give a little more detail on how the construction is made self-inverse. As explained in Fig. 6 of [32] (shown above as Figure 15) the THC construction may be made self-inverse by using the qubit in the |+ state which controls the swap of the |µ and |ν registers. As can be seen in Figure 6, we are using a similar procedure, where the |+ state controls the swap of µ and ν at the top left and at the lower right. The X gate and swaps on the lower left corresponds to the X gate and swap in the middle of Figure 15.
We have currently just drawn the quantum circuit as having c and c † , but these would be implemented using ancilla qubits to control the election between X and Y in X ± iY . For the implementation to be self-inverse, we would want the qubit used to control the first c to be the same as that for the final c † . This can be achieved by taking the four qubits for control of each of the c and c † so the top one can be used as the control each time. In particular, after the first c, swap the first two qubits, then after the c † swap the first pair with the second pair, then after the next c swap the first two qubits again. As a result, the first qubit can be used as the control each time. Moreover, this arrangement of swaps is obviously self-inverse.
The application of the controlled X and iY gates is also automatically self-inverse. The reason is that iY squared is −1 1. In the block encoding we perform Z gates for the control qubits for c but not c † . Then performing the unitaries for the block encoding twice, we have first that the final controlled c † in the first block encoding is matched with the first c for the second block encoding. The same control qubit is used, so the two controlled X operations cancel, and the two controlled iY operations give −1 1. This cancels with the Z gate on that control qubit. In this way all the operations can be cancelled, and because each time we have controlled c and c † matched, which cancel the Z gate on the control qubit. There is no additional Toffoli cost for these swaps and phase gates, because they are all Clifford gates.   For insulators like the distorted structures of LiNiO 2 , there are several common diagnostics that are used in molecular calculations to identify cases of "strong correlation." Here we examine the maximum elements of the CCSD T 1 and T 2 tensors (max(|t 1 |) and max(|t 2 |)), which are commonly used as a measure of correlation [118], as well as the T1 [119] and D1 [120,121] diagnostics computed from the ROHF k-point CCSD calculations. We also show the expectation value of the S 2 operator for the UHF solution because spin contamination in the UHF calculation can be a signature of strong correlation. The results are shown in Table VIII. As expected, these results do not suggest particularly strong correlation. Only the max(|t 1 |) values for the C2/m and P2 1 /c structures and the UHF S 2 for the P2/c structure are larger than might be expected. The larger max(|t 1 |) is likely an indication that the ROHF orbitals are far from optimal, and the symmetry breaking in the UHF calculations does not mean that CCSD cannot provide reliable energies.

Appendix F: Classical timing benchmarks
In order to compare the quantum algorithm run time to state of the art classical algorithms we measured the cost to compute the CCSD and ph-AFQMC total energy for the benchmark systems listed in Table I in double-and triple-zeta basis sets. The results of these timings are presented in Fig. 16.
For CCSD we used pyscf [93,94] and timed the data on a node with 30 3.1 GHz Intel Xeon CPUs (30 OpenMP threads).
For ph-AFQMC, which is considerably more expensive than CCSD for small system sizes, we estimated the run time by performing a short ph-AFQMC calculation for each system using 8 Nvidia V100 GPUs. From this data we can then estimate the run time to achieve an statistical error bar (per atom) of 1 × 10 −4 Ha through the assumption that the statistical error of ph-AFQMC decays like N −1/2 s , where N s is the number of Monte Carlo samples. Formally, ph-AFQMC should asymptotically scale like O(N 3 k ) [22] assuming the number of samples required to reach the desired precision does not scale with the system size. Interestingly, we found that the statistical error bar per atom for a fixed number of samples actually decreased with N k , which implies the variance of ph-AFQMC is increasing sub-linearly with the system size. For smaller system sizes it is important to note that practically one can saturate the GPU with walkers with nearly no loss speed [108], thus reducing the error bar given a fixed wall time. As a result the small N k AFQMC numbers represent a large overestimation in runtime one would practically need to obtain the desired statistical error. Another confounding factor which may affect the scaling of ph-AFQMC is the time step error (we fixed the time step at 0.005 Ha −1 for all systems). Recent results suggest that ph-AFQMC suffers from a size extensivity error [122], which is practically remedied through time step extrapolation. This may break the assumption of a constant number of Monte Carlo steps required to reach the desired precision (with a fixed time step) [122]. Due to these factors we decided not to fit a trend line to the ph-AFQMC data. We should stress that these comparisons are for illustrative purposes only, the absolute timings will differ with different classical implementations, codes and architectures and the run times should be considered with these factors in mind.