Compilation of Fault-Tolerant Quantum Heuristics for Combinatorial Optimization

Here we explore which heuristic quantum algorithms for combinatorial optimization might be most practical to try out on a small fault-tolerant quantum computer. We compile circuits for several variants of quantum accelerated simulated annealing including those using qubitization or Szegedy walks to quantize classical Markov chains and those simulating spectral gap amplified Hamiltonians encoding a Gibbs state. We also optimize fault-tolerant realizations of the adiabatic algorithm, quantum enhanced population transfer, the quantum approximate optimization algorithm, and other approaches. Many of these methods are bottlenecked by calls to the same subroutines; thus, optimized circuits for those primitives should be of interest regardless of which heuristic is most effective in practice. We compile these bottlenecks for several families of optimization problems and report for how long and for what size systems one can perform these heuristics in the surface code given a range of resource budgets. Our results discourage the notion that any quantum optimization heuristic realizing only a quadratic speedup will achieve an advantage over classical algorithms on modest superconducting qubit surface code processors without significant improvements in the implementation of the surface code. For instance, under quantum-favorable assumptions (e.g., that the quantum algorithm requires exactly quadratically fewer steps), our analysis suggests that quantum accelerated simulated annealing would require roughly a day and a million physical qubits to optimize spin glasses that could be solved by classical simulated annealing in about four CPU-minutes.

The prospect of quantum enhanced optimization has driven much interest in quantum technologies over the years. This is because discrete optimization problems are ubiquitous across many industries and faster solutions could potentially revolutionize fields as broad as logistics, finance, machine learning, and more.
Since combinatorial optimization problems are often NP Hard, we do not expect that quantum computers can provide efficient solutions in the worst case. Rather, the hope is that there may exist ensembles of instances with structure that would enable a significant quantum speedup on average, or for which a quantum computer can approximate better solutions.
Among the most studied algorithms for quantum optimization are those that can function as heuristics. The objective of a heuristic algorithm is to produce, within a reasonable amount of computational resources, a solution that is "good enough" (or at least the best one can afford) for the problem at hand. While heuristics are often able to efficiently find the exact solution, sometimes they fail to do so and instead only approximate it (potentially in an uncontrolled fashion). Such techniques are nonetheless valuable because finding some usable result does not require a prohibitively long time. Accordingly, heuristics are often used without regard for rigorous bounds on their performance. Indeed, the NP Hardness of many combinatorial optimization problems makes heuristics the only viable option for many problems that need to be routinely solved in real-world applications.
While some heuristic algorithms have a strong theoretical basis, many of the most effective heuristics are based on intuitive principles and then honed empirically through data and experimentation. However, today, our ability to evaluate quantum heuristics through experimentation is limited since the only available quantum computers are small and noisy [1]. We can perform numerics on small instances but extrapolation from those small system size numerics can be potentially misleading [2]. Still, it is reasonable to ask the question: what would be some of the most compelling quantum heuristics for optimization that we would want to attempt on a small fault-tolerant quantum computer, and how many resources would be required to implement their primitives?
There are many prominent approaches to combinatorial optimization on a quantum computer. These include variants of Grover's algorithm [3,4], quantum annealing [5,6], adiabatic quantum computing [7,8], the shortest path algorithm [9], quantum enhanced population transfer [10,11], the quantum approximate optimization algorithm [12], quantum versions of classical simulated annealing [13,14], quantum versions of backtracking [15,16] as well as branch and bound techniques [17], among many others. While often these works focus on the asymptotic scaling of exact quantum optimization, in many cases one can use these algorithms heuristically through trivial modifications of the approach. For instance, the quantum adiabatic algorithm requires that one evolve the system for an amount of time scaling polynomially with the inverse of the minimum spectral gap of the adiabatic evolution. However, one can instead use this algorithm as a heuristic by choosing to evolve for a much shorter amount of time, and hoping for the best (this is similar to the strategy usually employed with quantum annealing).
What essentially all forms of quantum optimization have in common is the requirement that the quantum algorithm query some function of the cost function of interest. This is how the quantum computer accesses information about the energy landscape. For instance, if our cost function is H and H|x⟩ = E_x|x⟩, so that E_x is the value of the cost function for bit string x, then we often need to phase the computational basis by a function f(·) of E_x, e.g.,

|x⟩ → e^{i f(E_x)} |x⟩.    (1)

For example, f(E_x) ∝ E_x is required to implement the quantum approximate optimization algorithm, quantum enhanced population transfer, digitized forms of quantum annealing, and the shortest path algorithm. Alternatively, f(E_x) ∝ arccos(E_x) would describe something related to the quantum walk forms of those algorithms. If f(E_x) ∝ (−1)^{E_x ≤ K}, this primitive would be the bottleneck subroutine for amplitude amplification to boost our support on energies less than K. In most quantum approaches to optimization, a unitary like this is interleaved with a much cheaper operation which does not commute with the operation in Eq. (1). Some algorithms instead call for simultaneously evolving under a function of the cost function together with a simple non-commuting Hamiltonian, but still the bottleneck is usually the complexity of the cost function Hamiltonian. The difference between many of these algorithms often comes down to the choice of f(·) and the choice of the much cheaper non-commuting unitary. The quantum algorithms for simulated annealing (e.g. [13]) work slightly differently, as those algorithms are based on making local updates to the wavefunction. For instance, the quantum version of a simulated annealing algorithm that updates with single bit flips requires computing functions of the energy difference

E_{x_k} − E_x,    (2)

where x_k is defined as the bit string x with the k-th bit flipped, i.e. |x_k⟩ = not_k |x⟩, with k = 0 corresponding to no bit flip.
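To make the phasing primitive of Eq. (1) concrete, the sketch below mimics its action classically on an explicit state vector. This is a toy simulation for intuition only, not a fault-tolerant circuit; the function names and the toy spectrum are ours.

```python
import numpy as np

def phase_by_cost(amplitudes, energies, f):
    """Apply |x> -> exp(i f(E_x)) |x> to an explicit state vector.

    amplitudes[x] is the amplitude of basis state |x> and energies[x]
    its cost E_x; f is the chosen function f(.) from the text.
    """
    return np.exp(1j * f(energies)) * amplitudes

# f(E_x) proportional to E_x, the choice used in QAOA-style phasing.
gamma = 0.3
energies = np.array([0.0, 1.0, 1.0, 2.0])  # toy 2-bit cost spectrum
state = np.full(4, 0.5)                    # uniform superposition
phased = phase_by_cost(state, energies, lambda e: gamma * e)
```

Note that the operation only attaches phases: the probability of each basis state is unchanged, which is why it must be interleaved with a non-commuting operation to do useful work.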
But again, these approaches are still typically bottlenecked by our ability to compute these functions f(·) of the cost function. This paper will not address the important question of how well various heuristic quantum optimization approaches might perform in practice. Rather, our main motivation is to compile common bottleneck primitives for these approaches to quantum circuits suitable for execution on a small fault-tolerant quantum computer. In doing this, we will see that most contemporary approaches to quantum optimization are actually bottlenecked by the same subroutines (e.g., those required for Eq. (1) and Eq. (2)), and thus improved strategies for realizing those subroutines are likely of interest regardless of which paradigm of quantum optimization is ultimately found to be most effective in practice. In essentially all heuristic approaches to quantum optimization there is a primitive that is repeated many times in order to perform the optimization. Instead of investigating how many times those primitives must be repeated, we focus on the best strategies for realizing those primitives within a fault-tolerant cost model. For all algorithms we consider, we report the constant factors in the leading order scaling of the Toffoli and ancilla complexity of these primitives.
For some algorithms studied, such as for the quantum algorithms for simulated annealing, this work is the first to give concrete implementations which determine constant factors in the scaling. In other cases our contribution is to optimize the scaling for certain problem Hamiltonians or improve details of the implementation. We focus on Toffoli complexity since we imagine realizing these algorithms in the surface code [18,19], where non-Clifford gates such as Toffoli or T gates require considerably more time (and physical qubits) to implement than Clifford gates.

A. Overview of results
The goal of this paper is to estimate the performance of an early universal quantum computer for key steps of combinatorial optimization. To achieve this goal, we consider prominent heuristic-based methods for combinatorial optimization on a quantum computer and how their key steps could be executed on early hardware. We consider the following heuristic-based methods: amplitude amplification [20] as a heuristic for optimization and in combination with other approaches; quantum approximate optimization algorithms (QAOA) [12]; time-evolution approaches such as adiabatic algorithms [2] (including a variant incorporating a Zeno-like measurement [21]), quantum enhanced population transfer [11], and "shortest path" optimization [9]; and three quantum methods for simulated annealing (QSA), namely, a Szegedy walk-based [22] implementation of Markov Chain Monte Carlo [13], a qubitized form of the Metropolis-Hastings approach [23], and simulation of a spectral gap amplified Hamiltonian [14]. We review existing approaches in detail and develop several new methods or improvements. For each approach, we compile the primitive operations into quantum circuits optimized for execution in the surface code [19].
For concreteness, we focus our analysis on four families of combinatorial optimization problems: the L-term spin model, in which the Hamiltonian is specified as a real linear combination of L tensor products of Pauli-Z operators; Quadratic Unconstrained Binary Optimization (QUBO), which is an NP-hard special case of a 2-local L-term spin model; the Sherrington-Kirkpatrick (SK) model, which is a model of spin-glass physics and an instance of QUBO that has been well-studied in the context of simulated annealing [24]; and the Low Autocorrelation Binary Sequences (LABS) problem, which is a problem with many terms but significant structure that is known to be extremely challenging in practice. For each of the above problems, we design several methods of calculating the cost function on a quantum computer depending on how a given algorithmic primitive is supposed to query and process the cost of a candidate solution. We present these methods in Section II.
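Since the LABS cost function recurs throughout the paper, a minimal classical evaluation of it may help fix conventions. The sketch below is our own illustration, assuming the standard sidelobe-energy definition E(s) = Σ_{k=1}^{N−1} C_k² with aperiodic autocorrelations C_k = Σ_{i=1}^{N−k} s_i s_{i+k} for s ∈ {±1}^N.

```python
def labs_energy(s):
    """Sidelobe energy of a +/-1 sequence s: E(s) = sum_k C_k^2, where
    C_k = sum_i s_i * s_{i+k} is the aperiodic autocorrelation at lag k."""
    n = len(s)
    return sum(sum(s[i] * s[i + k] for i in range(n - k)) ** 2
               for k in range(1, n))

# The length-13 Barker sequence is a well-known low-autocorrelation
# sequence; all of its sidelobes satisfy |C_k| <= 1.
barker13 = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]
print(labs_energy(barker13))  # -> 6
```

The value 6 matches the known optimal LABS energy for N = 13 (merit factor N²/(2E) ≈ 14.08), which illustrates why the problem is hard: the objective is a quartic function of the spins with O(N²) interacting terms.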
Our analysis has produced several novel techniques that yield improvements over previous approaches. We recount the main ones here in order of appearance. In Section II A 2, we reduce by a logarithmic factor the cost of calculating the Hamming weight of a bit string using our method from [25]. This new technique leads to improvements in several other parts of our paper. In Section II E, we introduce a new technique for evaluating costly arithmetic functions when computational cost matters more than accuracy. Our new technique is based on approximating the function using linear interpolation between classically precomputed points that can be accessed using quantum read-only memory (QROM) [26], or a new variant of QROM designed for sampling at exponentially growing spacings.
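The classical analogue of this lookup-plus-interpolation strategy is easy to state. The sketch below is our own illustration with uniformly spaced tabulated points (the exponentially growing spacings of the QROM variant mentioned above are not implemented here); it shows why few precomputed points suffice for smooth functions.

```python
import numpy as np

def build_table(f, a, b, num_points):
    """Classically precompute f on a uniform grid over [a, b]."""
    xs = np.linspace(a, b, num_points)
    return xs, f(xs)

def interp_eval(x, xs, ys):
    """Evaluate f(x) by linear interpolation between tabulated points,
    a classical stand-in for the QROM lookup described in the text."""
    i = int(np.searchsorted(xs, x)) - 1
    i = min(max(i, 0), len(xs) - 2)       # clamp to a valid interval
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return (1 - t) * ys[i] + t * ys[i + 1]

xs, ys = build_table(np.exp, 0.0, 1.0, 65)  # 65 points, spacing 1/64
approx = interp_eval(0.3, xs, ys)           # close to exp(0.3)
```

With spacing h, the interpolation error for a twice-differentiable f is O(h²), so doubling the table size gains roughly two bits of precision, a favorable trade when lookups are cheap and arithmetic is expensive.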
In Section III B, we introduce a method of cost function evaluation for QAOA based on amplitude estimation. This technique gives a quadratic improvement over the original approach. In Section III C, we introduce a heuristic method for adiabatic optimization that is likely to be computationally cheaper for some applications of early quantum computers, although we do not expect an asymptotic advantage over other state-of-the-art approaches. The idea is to simulate the adiabatic path generated by the arccosine of the given Hamiltonian, not by the Hamiltonian directly, by "stroboscopically" simulating time evolution with short time steps produced by evolving under a qubitized walk.
In Section III D we give a new method for constructing the Szegedy walk operator suggested in [13]. Our key technique is a state preparation circuit that avoids expensive on-the-fly calculations by using the techniques introduced in [27]. In Section III E, we introduce an alternative method for executing the controlled qubit-rotation step in the qubitized Metropolis-Hastings approach introduced in [23]. Our approach is preferable in cases where the Hamiltonian has a higher connectivity, i.e. when the probability of accepting a proposed transition depends on many bits in the candidate solution. In those cases the approach of [23] would have exponential complexity. In Section III F, we give an explicit LCU-based oracle for the spectral gap amplified Hamiltonian introduced in [14]. This explicit oracle enables a cost analysis of the approach, which we provide. Apart from assisting with our goal of estimating early quantum computer performance, many of these innovations produce asymptotic improvements to the approaches we consider.

[TABLE I (columns: Problem, Algorithm, Primitive, …). The figures drawn from Table VIII and Table IX are based on a problem size of N = 256, a surface code cycle time of 1 µs, and a physical gate error rate of 10^−3 (there are other assumptions as well, covered in more detail in Section IV). Note that depending on how they would be used, it might be appropriate to scale the Hamiltonian walk steps by a factor of λ, which is roughly λ_SK ≈ N²/2 and λ_LABS ≈ N³/3. We simplify the complexity scaling estimates from Table VII.]
Having compiled the primitive operations of our chosen approaches and established how to query cost functions for our chosen problems, we are able to numerically estimate the computational resources needed to execute these primitives on a quantum computer. Based on our assumption that the quantum computer will be built from superconducting qubits and employ the surface code to protect the computation from errors, we focus on minimizing the number of ancilla qubits and non-Clifford gates that would be required. This approach is founded on the knowledge that non-Clifford operations are significantly harder than Clifford operations to perform in the surface code.
We give an example of some of our ultimate findings in Table I. In the table we provide the leading order scaling of the number of Toffoli gates needed to perform an update using five of the heuristics that we consider for two benchmark problems, LABS and SK. These scalings are reproduced from Table VII and presented in a simplified form where we assume that the working precision for various calculations is a constant. We also reproduce key figures from Table VIII and Table IX to show how we expect these estimated complexity scalings to translate into the runtime of an early quantum computer. In Table I we show the estimated number of steps of the chosen algorithmic primitive that could be executed in a single day on a quantum computer for a problem size of N = 256, a relatively small problem size that would be reasonable to execute with only a single Toffoli factory, as we assume in Table VIII and Table IX. We also present the estimated number of physical qubits needed.
We find that, despite great efforts made to optimize our compiled quantum circuits, the costs involved in implementing heuristics for combinatorial optimization will be taxing for early quantum computers. Not surprisingly, to implement problems between N = 64 and N = 1024 we find that hundreds of thousands of physical qubits are required when physical gate error rates are on the order of 10^−4 and sometimes over a million are required for physical gate error rates on the order of 10^−3. But even more concerning is that the number of updates that we can achieve in a day (given realistic cycle times for the error correcting codes) is relatively low, on the order of about ten thousand updates for the smallest instances considered of the cheapest cost functions. With such overheads, these heuristics would need to yield dramatically better improvements in the objective function per step than classical optimization heuristics. From this we conclude that, barring significant advances in the implementation of the surface code (e.g., much faster state distillation), quantum optimization algorithms offering only a quadratic speedup are unlikely to produce any quantum advantage on the first few generations of superconducting qubit surface code processors.

B. Organization of paper
Our paper is divided into essentially two parts. In the first part (Section II) we introduce and provide explicit compilations for a wide variety of subroutine or "oracle" circuits which perform operations related to specific problem Hamiltonians. In the second part of our paper (Section III) we describe a variety of heuristic algorithms for quantum optimization and discuss how the oracle circuits of Section II can be called in order to implement these algorithms. We will see that the same "oracle" circuits are required by many algorithms. The results of Section III essentially provide query complexities to implement the primitives of common quantum optimization heuristics with respect to the oracles of Section II. Thus, while the results of Section II are adapted to particular problem Hamiltonians, the results of Section III are fairly general. We now describe our results in slightly more detail.
Section II details strategies for realizing, for each of four problem Hamiltonians, the five straightforward oracle circuits summarized in Table III. The specific problems we focus on are introduced at the beginning of Section II. These five oracles correspond to: (Section II A) the direct computation of a cost function into a quantum register, (Section II B) the computation of the difference between the cost of two computational basis states which differ by a specific single bit, (Section II C) an operation which phases the computational basis by an amount proportional to the cost, (Section II D) the realization of a qubitized quantum walk [28] which encodes eigenvalues of the cost function, and (Section II E) the computation of arithmetic functions of an input value using QROM [26]. Our approach to computing arithmetic operations using QROM is likely useful in other contexts and is a new technique from this work. The culmination of Section II is Table IV, which gives leading order constants in the scaling of Toffoli, T, and ancilla complexities for all five of these oracles and for all four of the problems. Even though the first two cost functions we introduce in Section II have fairly general specifications, they do not capture exploitable structure in all optimization problems of interest. Still, we imagine that the motifs developed in Section II will be helpful for any future work seeking to develop similar circuits for other cost functions.
Section III describes how the oracle circuits of Section II are queried in order to realize the essential primitives of many fault-tolerant quantum heuristics for optimization. This section contains a mixture of new results and a review of established methods. Section III A reviews how one can use amplitude amplification [20] heuristically for optimization and also discusses how and why one might combine amplitude amplification with other algorithms in this section. Section III B discusses strategies for executing QAOA [12] within fault-tolerant cost models. While most of this section is review, we also discuss the combination of QAOA with amplitude amplification based methods for more efficiently extracting the cost function value.
Section III C discusses several approaches to quantum optimization that are based on time evolution or quantum walks generated by a cost function and simple driver. First, we review the adiabatic algorithm [2] and well known methods for how it might be digitized using product formula type circuits. We then introduce a method of simulating the adiabatic algorithm based on qubitized quantum walks. Next, we review how the adiabatic algorithm can be combined with a Zeno-like measurement approach which corresponds to evolution under static Hamiltonians for random durations [21], and give some new results about how to optimally choose the distribution of those durations.
The remainder of Section III focuses on three approaches to a quantum algorithm which accelerates classical simulated annealing. In terms of implementation, these are the most complex algorithms studied in the paper. For the three variants of the quantum simulated annealing algorithms, we provide the first complete compilation of circuits which execute the heuristic primitive. In Section III D we analyze and compile the original version [13] of these algorithms that is based on Szegedy quantum walks [22]. As anticipated, this approach is the least efficient of the three studied. In Section III E we focus on what is essentially a qubitized version of the Szegedy quantum walk. The primary characteristics of this approach were independently described in [23] (a paper that came out during the preparation of our own) but we go beyond that work to determine (and in some ways improve upon) constant factors in the scaling. Finally, in Section III F we compile the algorithm for quantum simulated annealing based on spectral gap amplification [29], using an improvement based on qubitization. The results of Section III are summarized in Table VI and Table VII, which give the query complexities with respect to the oracles of Section II and overall gate and ancilla complexities of all algorithms of Section III for all of the cost functions of Section II.
Finally, we conclude in Section IV with a discussion of these results. Our discussion includes an attempt to contextualize the ultimate cost of these heuristic primitives by giving the Toffoli count, ancilla count, and total number of physical qubits and wallclock time that would be required to realize these primitives given various resource budgets and assumptions in the surface code. These concrete resource estimates are given in Table VIII and Table IX. We then finish with a discussion of how these results lead to a fairly pessimistic outlook on the viability of obtaining quantum advantage for optimization by using a small quantum computer unless one is able to obtain significantly better than a quadratic speedup over classical alternatives.

II. ORACLES AND CIRCUIT PRIMITIVES FOR SPECIFIC COST FUNCTIONS
While many paradigms of quantum optimization require the same bottleneck subroutines for their implementation, aspects of these subroutines will always be specific to the particular problem that one intends to optimize. Thus, in order to give concrete implementations and develop a sense of how many resources would be required for steps of common quantum heuristics, aspects of our work are adapted to particular problem Hamiltonians (equivalently here, "cost functions") of interest. There are four main types of Hamiltonians that we consider in this paper.
The first two types of Hamiltonians we will study are of interest because they are programmable instances of optimization problems that one might encounter in practical situations. The latter two are of interest for a different reason, one more familiar to those who study statistical physics: they define ensembles of instances for which the average case has known and interesting properties. While solutions to specific instances of the latter two problems are probably not of much value, we anticipate they will be interesting problems on which to investigate the performance of a quantum computer. The four problems we study are described below.
1. L-term spin model: The most general Hamiltonian we will consider is the one we will refer to simply as the "L-term spin model". This Hamiltonian is a linear combination of L tensor products of Pauli-Z operators,

H_L = Σ_{ℓ=1}^{L} w_ℓ Π_{i∈q_ℓ} Z_i,

where the w_ℓ are real scalars, Z_i is the Pauli-Z operator on qubit i, N is the number of qubits in the cost function, and each q_ℓ is a unique set of up to N integers taking values between 1 and N (it is the set of indices of the qubits on which term ℓ acts). One might anticipate that it would be helpful to also specify this Hamiltonian in terms of its many-body order k = max_ℓ |q_ℓ|. However, perhaps surprisingly, none of the algorithms discussed in this paper have a Toffoli complexity that scales explicitly in k.
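Since each term of H_L is diagonal in the computational basis, the energy of a bit string follows directly from the parities of the selected bits. The sketch below is a classical reference evaluation under our own conventions (bits are 0-indexed, and x_i = 1 corresponds to Z_i eigenvalue −1); the function name and toy Hamiltonian are ours.

```python
def spin_model_energy(x, terms):
    """Energy of bit string x under H = sum_l w_l * prod_{i in q_l} Z_i.

    terms is a list of (w_l, q_l) pairs.  Bits are 0-indexed here, and
    x[i] = 1 means qubit i is in |1>, i.e. Z_i has eigenvalue -1, so
    term l contributes w_l * (-1)**(parity of the selected bits).
    """
    return sum(w * (-1) ** sum(x[i] for i in q) for w, q in terms)

# Toy two-qubit Hamiltonian H = 1.5 Z_0 Z_1 - 0.5 Z_1:
terms = [(1.5, [0, 1]), (-0.5, [1])]
print(spin_model_energy((0, 1), terms))  # -> -1.0
```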

2. Quadratic Unconstrained Binary Optimization:
We will also consider an NP-Hard example of the L-term spin model with L ≈ N²/2 terms, known as Quadratic Unconstrained Binary Optimization (QUBO). The QUBO Hamiltonian is expressed in terms of the bits x_i ∈ {0, 1} of the candidate solution as

E_x = Σ_{1 ≤ i ≤ j ≤ N} w_{ij} x_i x_j,

which maps to a 2-local spin Hamiltonian under the standard substitution x_i = (1 − Z_i)/2.

[TABLE II. Symbols used when reporting computational complexity results.]
symbol   meaning
x        bit string corresponding to a candidate solution of the optimization problem
N        number of bits needed to specify a candidate solution
E_x      cost (a.k.a. energy) of candidate solution x as specified by a cost function
H_cf     Hamiltonian operator corresponding to a cost function "cf"
b        number of bits used to specify the precision of an oracle
L        number of terms in a spin model (type of cost function)
λ        the normalization parameter for LCU methods, related to the Hamiltonian 1-norm
β        inverse temperature in simulated annealing
C        Toffoli or T cost of some oracle
A        ancilla required to implement some oracle that must be kept
B        temporary ancilla required to implement some oracle

[TABLE III. Quick definitions of the most important "oracle" circuits discussed in this work. Here, we slightly abuse the term "oracle" to mean a circuit primitive which is repeatedly queried throughout an algorithm, usually revealing information about the problem we are solving. Throughout the paper we will use C to denote Toffoli (or occasionally T) complexity while A and B will denote persistent and temporary ancilla costs, respectively. For some of these oracles there are different Toffoli costs when performing them in the forward and reverse directions. We always pair a forward oracle with a reverse oracle, so will give the average cost. In some cases the computation may introduce ancilla qubits not shown here, that are erased in the inverse computation. For the function evaluation oracle we incorporate multiplication by the inverse temperature β. The approximation f̃ is given to b_sm bits, but for generality we allow an error 2^−b_fun which may be larger than 2^−b_sm.]
The oracle circuits compiled in this section are used by multiple algorithms throughout our paper. In Section II A, we explain how to implement cost function oracles that are required to return the cost of a specific candidate solution x. We refer to such oracles as "direct energy oracles". In Section II B, we explain how to implement cost function oracles that are required to return the difference in cost between two candidate solutions that differ by exactly one bit. In Section II C, we explain how to implement cost function oracles that are required to return the cost function as a phase, rather than as a value written to a separate quantum register. In Section II D, we explain how to implement cost function oracles that are required to implement the cost function as a direct application of the Hamiltonian onto a target quantum register. Finally, in Section II E, we consider the cost of evaluating functions whose input is the difference in cost of candidate solutions as described in the other parts of this section. We summarize the content of this section using three tables. In Table II we give a list of the symbols we use for reporting our computational complexity results. This table aids in the interpretation of the following two tables. In Table III, we summarize the definitions of the various different kinds of oracles considered in this section. Finally, in Table IV, we summarize the complexities of each of the sixteen cost function oracles (four types of oracles for each of four types of cost functions) as well as the complexity of calculating functions of those oracle outputs. In these tables, and throughout the paper, we use log to indicate logarithms base 2.

A. Oracles for direct cost function evaluation
Many of the algorithms considered in this work are formulated in terms of a query to an oracle which computes the value of the cost function C (for instance, one of the Hamiltonians discussed above) in a binary register. For instance, if we have a wavefunction |ψ⟩ = Σ_x ψ_x |x⟩ where the computational basis states |x⟩ are eigenstates of C such that C|x⟩ = E_x|x⟩, then we define the direct energy evaluation oracle O_direct as a circuit which acts as

O_direct |x⟩ |0⟩ = |x⟩ |Ẽ_x⟩,

where Ẽ_x is a binary approximation to E_x using b_dir bits. We provide some strategies for how to realize this oracle for specific problems with low Toffoli complexity. We will refer to the Toffoli complexity of this oracle as C_direct. However, first we will discuss an efficient method for performing reversible in-place addition of a constant. This routine will be critical to our implementation.

[TABLE IV. Summary of complexities for realizing oracles used throughout this paper. Next to the complexity entry is a number indicating the equation in the paper which gives the full expression in context. The energy difference for H_L and LABS just has twice the Toffoli cost and the same ancilla cost as the direct energy oracle, because it is found by evaluating the energy twice. These oracles and the meaning of their precision parameters b are defined in Table III. The Toffoli count is reported except when the oracle type for that cost function is marked with (*), which indicates that T count is reported instead. Here we include only the main terms in the order expressions. We use these costings to determine the complexities in Table VII.]

Direct energy oracle for L-term spin model and QUBO
We will now explain how to implement the direct energy oracle for the H_L Hamiltonian with low Toffoli complexity. We will represent the energy Ẽ_x in the two's complement binary representation, as this encoding enables efficient methods for addition [32]. In two's complement, positive integers have their normal binary representation, whereas a negative integer −v is represented by the bitwise complement of the representation of v, plus one. For instance, in 4-bit two's complement, 3₁₀ = 0011₂, whereas −3₁₀ = 1100₂ + 1 = 1101₂. Zero still corresponds to all bits zero. The fact that we need to add one for negative numbers complicates our approach, but this representation is still preferable for our purposes.
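A minimal sketch of this encoding (our own helper function, operating on ordinary integers rather than qubit registers) reproduces the examples above:

```python
def twos_complement(v, bits):
    """Bit pattern of the integer v in `bits`-bit two's complement."""
    return format(v & ((1 << bits) - 1), f'0{bits}b')

# The 4-bit examples from the text:
print(twos_complement(3, 4))   # -> 0011
print(twos_complement(-3, 4))  # -> 1101  (complement of 0011 is 1100; plus 1)
print(twos_complement(0, 4))   # -> 0000
```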
The main idea behind our approach will be to add or subtract the value of each term's coefficient w_ℓ to a b-bit output register based on the parity of the bits addressed by Π_{i∈q_ℓ} Z_i. To perform addition or subtraction controlled on a qubit, we use the fact that one can switch between addition and subtraction by applying not gates to the target register in two's complement representation. That is, applying not gates to all qubits of a register will change |v⟩ to |−v − 1⟩. Adding w_ℓ to this register will give |w_ℓ − v − 1⟩, and applying not gates to all qubits again will yield |v − w_ℓ⟩. An explicit circuit for addition controlled on a qubit is shown in Figure 4(a) of [32] (see Appendix C 2). The complete procedure to compute the energy is given in Algorithm 1.

Algorithm 1 Energy evaluation for L-term spin model and QUBO
Require: A quantum state Σ_x a_x |x⟩, and a vector of weights {w_ℓ} that specifies the L-term spin model or QUBO Hamiltonian.
Ensure: An output state of the form Σ_x a_x |x⟩|Ẽ_x⟩, where H is the relevant Hamiltonian and Ẽ_x is the approximate eigenvalue of H corresponding to |x⟩.
1: Use Clifford gates (cnot gates) to compute the parity of the term Π_{i∈q_ℓ} Z_i in-place in a single system qubit |π_ℓ⟩. Specifically, if x_i is the i-th bit of computational basis state x, then we are using cnots to compute π_ℓ = (Σ_{i∈q_ℓ} x_i) mod 2.
2: Controlled on |π_ℓ⟩, use more cnot gates to negate every bit of the output register. We will refer to this output register as |v⟩. Thus, after this step we will have the state |0⟩|v⟩ if the first bit holds π_ℓ = 0, and the state |1⟩|−v − 1⟩ if the first bit holds π_ℓ = 1.
3: Using the strategy described in Appendix C 2 for the addition of a constant, add a b_dir-bit binary approximation w̃_ℓ to the coefficient w_ℓ into the output register. This step has Toffoli complexity b_dir − 2, where b_dir is the size of the output register. After this step we will have the state |0⟩|v + w̃_ℓ⟩ if π_ℓ = 0, and the state |1⟩|w̃_ℓ − v − 1⟩ if π_ℓ = 1.
4: Negate the output register using cnot gates, controlled on |π_ℓ⟩. After this step we will have the state |0⟩|v + w̃_ℓ⟩ if π_ℓ = 0, and the state |1⟩|v − w̃_ℓ⟩ if π_ℓ = 1.
5: Using Clifford gates, uncompute the parity π_ℓ.
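The steps above can be traced classically on a single basis state; the following sketch (with names of our choosing) mirrors the register arithmetic modulo 2^b:

```python
def energy_via_algorithm1(x_bits, terms, b):
    """Classically trace Algorithm 1: for each term (weight w, qubit set q),
    add or subtract w in a b-bit two's complement accumulator depending on
    the parity of the bits of x addressed by q. All arithmetic is mod 2^b,
    mirroring the quantum output register."""
    mask = (1 << b) - 1
    v = 0
    for w, q in terms:
        parity = 0
        for i in q:                 # step 1: cnots computing the parity
            parity ^= x_bits[i]
        if parity:                  # step 2: controlled negation, v -> -v - 1
            v = ~v & mask
        v = (v + w) & mask          # step 3: in-place addition of w
        if parity:                  # step 4: second negation, giving v - w
            v = ~v & mask
    return v - (1 << b) if v >> (b - 1) else v  # decode two's complement

# H = 2 Z0 Z1 - 3 Z2: eigenvalue on |x> is 2(-1)^(x0+x1) - 3(-1)^(x2)
assert energy_via_algorithm1([1, 1, 0], [(2, [0, 1]), (-3, [2])], 6) == -1
assert energy_via_algorithm1([1, 0, 0], [(2, [0, 1]), (-3, [2])], 6) == -5
```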
After performing this for all L terms, one can verify that this produces the intended state |v⟩ = |Ẽ_x⟩ in the output register. Toffoli gates enter only through the adder in step 3. Thus, in total our approach has Toffoli complexity C_direct and ancilla requirements A_direct, where the ancilla refer to the carry bits for the adder in addition to the b_dir bits required to output the energy. We note that for this oracle these costs have no dependence on the many-body order of the Hamiltonian H_L, since that only affects the number of cnot gates used to compute the parity of the terms. This exact same reasoning can be used to determine the complexity of computing the energies for the QUBO Hamiltonian. Due to the relative lack of structure in QUBO, there is no obvious way to improve over this general complexity. There we have L = N(N + 1)/2 terms, and so from Eq. (8), Eq. (9) and Eq. (10) we require a number of Toffolis and ancillas equal to

C_direct^QUBO ≈ N(N + 1)(b_dir − 2)/2,  A_direct^QUBO ≈ 2 b_dir,

since each of the L terms costs b_dir − 2 Toffolis to add into the output register.

Direct energy oracle for SK model
Here we show that the energy for the SK model can be computed with only N^2 Toffolis and a logarithmic number of ancillas. The method we use is a sum of tree sums of bits. It is also possible to use a single tree sum with a Toffoli cost of about N^2/4, but the drawback is that this method would need N^2/2 ancilla qubits, which is prohibitive. For the SK model it is convenient to replace −1 with 0, so the sum takes values between 0 and L. That corresponds to dividing the Hamiltonian by 2 and shifting it, which does not change the optimization problem, but means we are only summing bits. If we were to sum the bits in the obvious way, the Toffoli complexity would be approximately N^2 log N. However, we can take advantage of the fact that we are summing bits to reduce the complexity to O(N^2).
Our methods are based on tree sums of bits. In [25] it was shown that it is possible to sum L bits using L − 1 Toffoli gates and L − 1 ancilla qubits, and this sum can be uncomputed with no Toffoli cost. As discussed in [25], it is also possible to perform sums in approaches that reduce the number of ancilla at the price of increasing the number of Toffoli gates. In particular, we can subdivide the bits we are summing into about L/ log L groups of size log L, start by using the tree sum approach to sum each of the groups, add it into a running sum, and uncompute it. The number of ancillas needed is reduced to approximately log L for each of the tree sums. There is also a cost of approximately L for adding the tree sums, giving a total complexity of approximately 2L.
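The Toffoli counting behind this grouping argument can be checked numerically; the sketch below (a slight variant of the grouping above, with names of our choosing) counts g − 1 Toffolis per tree sum of g bits, plus one fewer Toffoli than output bits per addition into the running total:

```python
from math import ceil, floor, log2

def sum_of_tree_sums_toffolis(L):
    """Count Toffolis for summing L bits by the sum-of-tree-sums strategy
    (a slight variant of the grouping described in the text): split the bits
    into groups of size g = floor(log2 L), tree-sum each group at g - 1
    Toffolis, and add each group sum into a running total at one Toffoli
    fewer than the bits needed to hold the running maximum."""
    g = floor(log2(L))
    M, J = divmod(L, g)        # M full groups plus a remainder of J bits
    tree = M * (g - 1) + max(0, J - 1)
    adds = sum(ceil(log2(J + j * g + 1)) - 1 for j in range(1, M + 1))
    return tree + adds

# The bound derived in the text: the total cost stays below 2L.
for L in [10, 100, 1000, 10000]:
    assert sum_of_tree_sums_toffolis(L) < 2 * L
```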
To be more specific, taking into account that L need not be a power of two, we can use M = ⌈L/⌊log L⌋⌉ − 1 groups of size ⌊log L⌋, together with a remaining group of size J ≤ ⌊log L⌋ such that M⌊log L⌋ + J = L. That is, there are ⌈L/⌊log L⌋⌉ groups in total, and J can be smaller than ⌊log L⌋. The Toffoli cost of computing the tree sums is

M(⌊log L⌋ − 1) + J − 1 < L.

The cost of the additions is

Σ_{j=1}^{M} (⌈log(J + j⌊log L⌋ + 1)⌉ − 1) ≤ M(⌈log(L + 1)⌉ − 1) ≤ M⌈log L⌉ < L⌈log L⌉/⌊log L⌋ ≈ L.    (15)

We have assumed that L > 1 and hence ⌊log L⌋ > 0. The sum in the first line of Eq. (15) comes from starting with the sum over J bits and then adding each of the sums over ⌊log L⌋ bits to it. After adding j of the sums over ⌊log L⌋ bits, the maximum value of the sum is J + j⌊log L⌋, so the number of bits needed to store the result is ⌈log(J + j⌊log L⌋ + 1)⌉, and the number of Toffolis needed for that addition is one less than that. The first inequality comes from the fact that the running total never exceeds L, so the cost of each addition is never greater than ⌈log(L + 1)⌉ − 1. The second inequality holds because ⌈log(L + 1)⌉ − 1 ≤ ⌈log L⌉. The final inequality uses M < L/⌊log L⌋. Therefore, the total Toffoli cost is less than 2L. The ancilla cost of each tree sum is ⌊log L⌋ − 1, there are ⌈log(L + 1)⌉ ancillas needed for the total, and ⌈log(L + 1)⌉ − 1 temporary ancillas for the addition of a tree sum into the total. Since the ancillas in the tree sums are uncomputed, they contribute only to the temporary ancilla cost, so the temporary ancilla cost is 2 log L + O(1) and the persistent ancilla cost (for the total) is log L + O(1).
Since L = N(N − 1)/2, if we were to use a single tree sum the cost would be less than N^2/2, but the ancilla cost would be approximately N^2/2. The sum could be uncomputed without ancillas, giving an average (compute and uncompute) cost of N^2/4. We expect that this tradeoff is not worthwhile in this case. However, by using the sum of tree sums, we get a Toffoli cost less than N^2 and an ancilla cost that is logarithmic in N. That gives costs for the SK model of

C_direct^SK < N^2,

with 2 log N + O(1) persistent and 4 log N + O(1) temporary ancillas.

Direct energy oracle for LABS model
Next we show that for the LABS problem it is possible to compute the energy with a Toffoli cost of 5N(N + 1)/4 for N ≥ 64, with a logarithmic number of ancilla qubits. We improve over the application of our general technique by specializing the implementation to the LABS problem. Since the LABS problem has L = O(N^3) terms with maximum integer energy values of O(N^3), we would expect a complexity of O(N^3). Instead, we show that it is possible to perform the direct energy evaluation at cost O(N^2). We focus on the form of the LABS Hamiltonian expressed as Σ_{k=1}^{N} |H_k|, where H_k is as defined in Eq. (6) (as we mentioned, this form of the problem has the same ordering of the low energy landscape).
In the following we use E_k to denote the eigenvalue of H_k. It will be most efficient to use the sum of tree sums approach described above. Here we need to find E_k using the values +1 and −1 rather than 0 and +1, because we need to take the absolute value, so we need an extra bit for the sign. Therefore, after summing bits, we will need to multiply by 2 (which has no Toffoli cost) and then subtract the number of bits. The overall approach is then as follows. We will iterate over k starting at k = N − 1 and going down to zero, so that the number of bits at each step is minimized. For each value of k we perform Algorithm 2.

Algorithm 2 Energy evaluation for LABS model
Require: A quantum state Σ_x a_x |x⟩, and the set of all terms in the LABS Hamiltonian {H_k}.
Ensure: An output state of the form Σ_x a_x |x⟩|E_x⟩.
1: Compute for computational basis vector |x⟩ the value of E_k in a scratch register |u⟩, which requires ⌈log(N − k + 1)⌉ + 1 ancilla to store (with +1 for the sign).
2: Controlled on the sign (highest) bit of |u⟩, use cnot gates to negate every bit of the output register |v⟩, so that it holds |−v − 1⟩ when E_k is negative.
3: Add the scratch register into the output register using the addition strategy of Appendix C 2.
4: Controlled on the sign bit of |u⟩, negate the output register again, so that it holds |v + |u|⟩.
5: Uncompute the scratch register |u⟩.
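Classically, the per-k procedure amounts to accumulating the absolute autocorrelations of the ±1 sequence; a sketch (names ours, with the loop range chosen for illustration):

```python
def labs_energy(bits):
    """Mirror Algorithm 2 classically for the sum-of-|H_k| form of LABS:
    for each shift k, accumulate the autocorrelation E_k of the +/-1
    sequence in a scratch value u, then add |u| to the output register v."""
    s = [1 - 2 * b for b in bits]              # bit 0 -> +1, bit 1 -> -1
    N = len(s)
    v = 0
    for k in range(N - 1, 0, -1):              # largest shift first, as in the text
        u = sum(s[i] * s[i + k] for i in range(N - k))   # step 1: scratch E_k
        v += u if u >= 0 else -u               # steps 2-4: sign-controlled negation adds |u|
        # step 5: the scratch register is uncomputed (a no-op classically)
    return v

assert labs_energy([0, 0, 0]) == 3    # s = (+1, +1, +1): |E_2| + |E_1| = 1 + 2
assert labs_energy([0, 0, 1]) == 1    # s = (+1, +1, -1): |-1| + |0| = 1
```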

In Step 1, the Toffoli complexity of computing each E_k is approximately 2(N − k), plus the cost of subtracting N − k. In two's complement we can determine whether a number is negative or positive by looking at the highest bit; if the highest bit is 1, then the value is negative. This justifies the operations in Step 2. Since Step 2 requires no non-Clifford operations, it can be neglected in our cost analysis. In Step 3, the state is |u⟩|u + v⟩ if u ≥ 0 or |u⟩|v − u⟩ if u < 0; equivalently, we now have the state |u⟩|v + |u|⟩. The output register will be of size ⌈log[(N − k)(N − k + 1)/2 + 1]⌉ + 1, so the Toffoli cost is ⌈log[(N − k)(N − k + 1)/2 + 1]⌉. The output register is significantly larger than the scratch register. However, with a slight modification of the procedure in Appendix C 2 we can allow the scratch register to be smaller with no additional Toffoli cost. First, consider expanding the number of qubits |u⟩ is encoded on. This is trivial for positive numbers. For negative u, the encoding on n bits is 2^n + u. Therefore, if we have a negative number and we need to map it to a negative number on some larger number of bits n′, then we need to map 2^n + u to 2^{n′} + u, which means adding 2^{n′} − 2^n = Σ_{j=n}^{n′−1} 2^j. This means that bits n + 1 to n′ of the negative number encoded on n′ bits need to be ones. These can be set using cnots controlled by bit n, which means no additional Toffoli cost is needed to encode the number on more qubits. A further simplification eliminates the need for those extra qubits. First, rearrange the addition circuit as in Figure 17 so that the qubits of |u⟩ are only used as controls and are not changed. Since all of the additional qubits for |u⟩ contain the same value as the sign qubit of |u⟩, we may use that sign qubit as the control instead of any of those additional qubits. Then the additional qubits are never used, and can be omitted.
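The sign-extension argument can be checked directly (names ours):

```python
def extend(enc, n, n_prime):
    """Extend a two's complement encoding from n to n_prime bits by
    replicating the sign bit: a negative u stored as 2^n + u needs
    2^n_prime - 2^n added, i.e. bits n through n_prime - 1 set to one."""
    sign = (enc >> (n - 1)) & 1
    if sign:
        enc += (1 << n_prime) - (1 << n)   # sum of 2^j for j = n .. n_prime - 1
    return enc

u = -3
assert (1 << 4) + u == 0b1101                        # 4-bit encoding of -3
assert extend((1 << 4) + u, 4, 8) == (1 << 8) + u    # 8-bit encoding 11111101
assert extend(5, 4, 8) == 5                          # positive numbers unchanged
```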
There is an improvement that we can make when we take into account that each computation needs to be paired with an uncomputation. In Step 5, if we are computing an energy that we will later uncompute, then we can use the strategy of [32] to erase |u⟩ using X measurements and no Toffoli cost. A phase correction is required, but that can be done when we later uncompute the LABS energy. This means that in Step 5 we have a cost of N − k in uncomputing the LABS energy, but no Toffoli cost in computing the LABS energy. Because each computation is paired with an uncomputation, it is convenient to quote the average complexity of N − k. The largest temporary ancilla cost occurs when we need to uncompute the overall Hamiltonian, when it is 2 log(N − k) + O(1). That is still less than the temporary ancilla cost in Step 3, so it can be ignored.
After repeating this for the N values of k, one can verify that the output register will contain the energy of the LABS Hamiltonian. Toffoli gates enter only through Steps 1, 3 and 5. The primary contribution to the complexity is the computation of E_k in Steps 1 and 5. Ignoring the complexity of subtracting N − k, the Toffoli complexity is

Σ_{k=0}^{N−1} [2(N − k) + (N − k)] = 3N(N + 1)/2,

where 2(N − k) is the cost of Step 1 and N − k the average cost of Step 5. The cost of the subtractions as well as the additions in Step 3 will increase the cost, but 2(N − k) is also an overestimate of the cost of adding N − k bits. In particular, we can use tree sums of as many as approximately log N bits, rather than just log(N − k), with no penalty in the temporary ancilla cost. The computed costs are shown in Figure 1(a), and we find that for the range of N we are interested in (64 to 1024), the constant factor on N(N + 1) is less than 1.2, rather than 1.5 (in fact, this bound holds for N ≥ 45). In particular, the constant factors for N = 64, 128, 256 and 1024 are 1.16466, 1.12673, 1.13945, and 1.0901, respectively. To simplify the expressions we give the slightly looser bound 5N(N + 1)/4 in the table, with the caveat that it holds for N ≥ 45. The number of ancillas we require is approximately 3 log N + O(1) persistent and 3 log N + O(1) temporary. The persistent ancillas are for the output value. Approximately 2 log N of the temporary ancillas are for carry bits in the addition and log N are for the scratch register. We assume N > 1 for the inequalities, which omits the trivial case. This example illustrates how taking advantage of problem structure can lead to advantages over the implementation of an oracle intended to handle a more general case.

B. Energy difference oracles
For some of the algorithms discussed in this work (specifically the quantum versions of simulated annealing), we often need the direct energy oracle only as a means to compute the difference between the energies of two states which differ in only one bit. The ultimate objective in that context is a circuit that performs the mapping

O_diff^(k) |x⟩|0⟩ = (X_k |x⟩) |δẼ_x^(k)⟩,

where (as usual) X_k is the not operation on qubit k and δẼ_x^(k) is a binary approximation, using b_dif bits, to the energy difference δE_x^(k) between the state x and the state with bit k flipped. Especially when the many-body order is 2-local, it is more efficient to consider a specialized implementation of O_diff^(k) than to realize this operation using one call to O_direct and one call to O_direct X_k.
First, we will discuss the energy difference oracle for QUBOs. In this case, δE_x^(k) is the eigenvalue of the operator

δH^(k) = X_k H X_k − H,

which contains only the N terms of H that act on qubit k. We see that δH^(k) is itself a simple cost function which is an example of H_N (the L-term spin model with L = N). Thus, to compute the eigenvalue of this operator (equivalent to implementing O_diff^(k)) we would require approximately N(b_dif − 2) Toffolis. This scaling is much less than the N^2 b_dif + O(N b_dif) Toffoli gates that would be required by making two queries to the direct energy oracle for QUBO.

For the SK model we can simplify the QUBO result. We would then have the difference operator 2 Σ_{i≠k} w_ik Z_i Z_k, so we just need to sum N − 1 bits, and can take b_dif = ⌈log N⌉. We also need to subtract N − 1 from the bit sum to obtain the energy difference, but the cost of that subtraction plus the cost of the bit sum is still no more than the upper bound of 2N we gave previously on the cost of the bit sum. Therefore the energy difference oracle for SK has Toffoli cost at most 2N, with a logarithmic number of ancillas.

For higher many-body order Hamiltonians like LABS, or the H_L model with many-body order greater than two, the best strategy will probably involve two applications of the direct energy oracle O_direct. However, rather than actually using two registers to output the energies and then performing a subtraction, one can instead compute the energy of x first and then, in the same register, compute the energy of y while subtracting all of the terms instead of adding them. There is a slightly greater Toffoli cost because the subtraction is on a slightly larger number of qubits, but that cost is small enough to be ignored. This leads to a Toffoli complexity of 2 C_direct but requires no additional ancilla.
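That the energy difference depends only on the terms touching bit k can be verified exhaustively for a small instance (a classical check with names of our choosing):

```python
import itertools, random

def spin_energy(z, w):
    """Energy of a +/-1 configuration z under H = sum_{i<j} w[i][j] Z_i Z_j."""
    N = len(z)
    return sum(w[i][j] * z[i] * z[j] for i in range(N) for j in range(i + 1, N))

def delta_via_local_terms(z, w, k):
    """Energy change from flipping spin k, using only the N - 1 terms that
    act on spin k -- the locality the specialized difference oracle exploits."""
    N = len(z)
    return -2 * sum(w[min(i, k)][max(i, k)] * z[i] * z[k]
                    for i in range(N) if i != k)

random.seed(1)
N = 6
w = [[random.randint(-3, 3) for _ in range(N)] for _ in range(N)]
for z in itertools.product([1, -1], repeat=N):
    for k in range(N):
        zf = list(z)
        zf[k] *= -1
        full = spin_energy(zf, w) - spin_energy(list(z), w)
        assert full == delta_via_local_terms(list(z), w, k)
```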

C. Oracles for phasing by cost function
In some contexts our goal will be to phase each computational basis state on which the wavefunction has support by an amount proportional to the energy of that state (this task is equivalent to performing evolution under a diagonal Hamiltonian for unit time). We will refer to circuits that achieve this task as a "phase" oracle and define them to act as

O_phase |x⟩ = e^{iγE_x} |x⟩

for some scalar γ. To simplify the following discussion, we assume that E_x is shifted such that it is non-negative. Such a shift corresponds to an unobservable global phase.
To realize this oracle, one strategy would be to first approximately compute E_x into a register using O_direct, then multiply by γ and perform further logic to phase the system by the amount in the register. For instance, one can apply O_direct, phase the energy register, and then apply O_direct† to uncompute, where the phasing operation needed is

Z_γ |k⟩ = e^{2πi k γ̃ / 2^{b_dir}} |k⟩.

The value of 2πkγ̃/2^{b_dir} corresponds to the approximation of γE_x, with k the integer approximating E_x, so γ̃ is a scaled form of γ. We will limit ourselves to simulations where the phase factor is no more than a factor of 2π, so γ̃ ≤ 1. Using the "phase gradient" trick of [32,33], it is possible to apply such a phase by adding into a reusable ancilla register initialized to the state

|φ⟩ = 2^{−b_grad/2} Σ_{k=0}^{2^{b_grad}−1} e^{−2πi k / 2^{b_grad}} |k⟩.

Here we use b_grad rather than b_dir in this state to allow for needing more bits to obtain the required precision in the phase. For details see Appendix A. In this case we need to multiply by the classically specified number γ̃ to obtain the required phase. This number can be given with log γ̃ + b_pha + O(1) digits in order to obtain error < 2^{−b_pha}. There will be error due to the finite number of digits for E_x, the finite number of bits for γ̃, and the multiplication. Rather than performing the multiplication by γ̃, adding into the phase gradient state, and then uncomputing the multiplication, a more efficient method is to perform the multiplication by repeated addition into the phase gradient state. For each non-zero bit of γ̃, we can add a bit-shifted copy of k into the phase gradient state. Each addition into the phase gradient state has cost b_grad − 2, and on average approximately half the bits of γ̃ will be zero, giving cost roughly b_grad(log γ̃ + b_pha)/2. To address cases where more bits of γ̃ are nonzero, we can write γ̃ as a sum of powers of 2 with plus and minus signs. In that case it is possible to use no more than (log γ̃ + b_pha)/2 + O(1) additions, giving cost b_grad(log γ̃ + b_pha)/2 + O(b_grad).
The error due to the omission of bits in the multiplication is no more than approximately 2^{−b_grad}(log γ̃ + b_pha)π, so to obtain error < 2^{−b_pha} one should take b_grad = b_pha + O(log b_pha). That gives an overall cost for the multiplication of

b_pha^2/2 + O(b_pha log b_pha).

For more details see Appendix A. Note finally that the state |φ⟩ can be initialized prior to simulation and reused throughout, with a negligible additive one-time cost scaling as O(b_grad^2). This one-time cost comes from synthesizing b_grad arbitrary rotations. However, since this cost is additive to the overall cost (whereas all other oracle costs are multiplicative with the number of queries), we expect it will be negligible.
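The phase kickback underlying the phase gradient trick can be seen in a small state-vector sketch (names ours): adding k into the gradient register multiplies the state by a global phase e^{2πik/2^{b_grad}}, which becomes a relative phase when the addition is controlled.

```python
import cmath

b_grad = 6
dim = 1 << b_grad
# the phase gradient state |phi> = 2^(-b/2) sum_j e^{-2 pi i j / 2^b} |j>
phi = [cmath.exp(-2j * cmath.pi * j / dim) / dim ** 0.5 for j in range(dim)]

k = 13   # integer (scaled energy) to phase by
# In-place addition of k modulo 2^b permutes basis states |j> -> |j + k>,
# so the new amplitude at index j is the old amplitude at index j - k.
shifted = [phi[(j - k) % dim] for j in range(dim)]

# The register returns to |phi> up to the global phase e^{2 pi i k / 2^b},
# which is kicked back as a relative phase when the addition is controlled.
expected = cmath.exp(2j * cmath.pi * k / dim)
assert all(abs(a - expected * p) < 1e-12 for a, p in zip(shifted, phi))
```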
For the L-term spin Hamiltonians and QUBOs, the cost of the multiplication by γ can be eliminated by simply including it in the coefficients of the problem Hamiltonian. For these cases an even more efficient approach is to simulate each term explicitly in a Trotter-like fashion and perform rotation synthesis to decompose each rotation into a sequence of T gates. In that case, one requires a number of T gates equal to the number of terms times the cost of rotation synthesis, which gives a complexity of O(L(b_pha + log L)). Using the repeat-until-success circuits of [34], this gives a T count of roughly L(1.15 b_pha + 10.925), with a single temporary ancilla qubit used by the repeat-until-success circuits. Expanding the measure of error used in [34] in a series gives a Frobenius distance of 2^{−b_pha}/√8 + O(2^{−3b_pha}). That means the per-rotation cost becomes 1.15(b_pha + log √8) + 9.2 = 1.15 b_pha + 10.925, which is why the second term above differs from [34]. Because Toffoli gates require roughly twice the resources to distill as T gates [35], this approach is likely to be more efficient in practice. This gives T and ancilla complexities for QUBO of roughly

C_phase^QUBO ≈ N(N + 1)(1.15 b_pha + 10.925)/2,

assuming N > b_pha.

For the SK model it is better to compute the energy, add the energy into the phase gradient state, then uncompute the energy. That has Toffoli complexity 2N^2, with 2 log N persistent ancillas and 4 log N temporary ancillas. The cost of the multiplication directly into the phase gradient state is b_pha^2/2 + O(b_pha log b_pha) (with γ̃ ≤ 1), with b_grad permanent ancillas for the phase gradient state and b_grad − 1 temporary ancillas for the addition. That gives costs for SK of roughly

C_phase^SK ≈ 2N^2 + b_pha^2/2,  A_phase^SK ≈ b_grad + 4 log N.

For the parameters we consider for examples of gate counts, 4 log N ≥ b_pha, so we give that in Table IV. For the LABS model we still need to explicitly compute the partial sum for H_k and then take the absolute value.
Instead of adding the absolute value of that to an output register, we can cnot the highest bit (indicating the sign of the partial sum u) into a single ancilla. Then, we can negate the whole partial sum controlled on this ancilla so that we have the state |u⟩|0⟩ if u ≥ 0 or |−u − 1⟩|1⟩ if u < 0. Then, we can add this ancilla to the partial sum register, giving us either |u⟩|0⟩ if u ≥ 0 or |−u⟩|1⟩ if u < 0. At this point we can multiply by γ and add the value of u into the |φ⟩ register, performing phase kickback in order to phase the system by the absolute value of the partial sum. Then, we need to invert the addition of the sign qubit to the sum register and uncompute |u⟩ and the ancilla.
Using the sum of tree sums, we numerically find that the Toffoli cost to compute and uncompute the partial sums is no greater than 8N(N + 1)/5 for N in the range 64 to 1024 that we consider. The numerically computed ratios are shown in Figure 1(b). The number of ancillas needed is b_grad persistent ancillas for the phase gradient state, b_grad − 1 temporary ancillas for the addition, log N + O(1) for the temporary ancillas holding the partial sum for H_k, and 2 log N + O(1) for the temporary ancillas used by the sum of tree sums. The ancillas for the partial sum for H_k are needed at the same time as those for the addition into the phase gradient state, but the temporary ancillas for the sum of tree sums are not. The temporary ancillas for the sum of tree sums number fewer than those for the addition into the phase gradient state, so they can be ignored. That gives a total temporary ancilla count of

B_phase^LABS = b_grad + ⌈log(N + 1)⌉ + O(1).

For 9N/5 < b_pha^2, it is more efficient to just compute the entire energy, multiply by γ̃, then uncompute the energy, as explained above. Then we obtain complexity

C_phase^LABS ≈ 5N(N + 1)/2 + O(b_pha^2),

[Figure 2: A circuit realizing the qubitized quantum walk operator W controlled on an ancilla qubit [26,28]. Here R is a reflection about the zero state for the entire |ℓ⟩ register, and therefore has Toffoli complexity ⌈log L⌉ + O(1), where ⌈log L⌉ is the size of the |ℓ⟩ register. This overhead is negligible compared to the cost of the prepare and select operators in the constructions of this paper.]
where the cost of multiplying by γ̃ is absorbed into the order term. Because this is smaller than the cost given above when 9N/5 < b_pha^2, we should give the cost as the minimum of the two complexities,

C_phase^LABS ≤ 8N(N + 1)/5 + min(N b_pha^2/2, 9N(N + 1)/10).

In that case we need 2 log N + O(1) temporary ancillas for the energy, and b_grad − 1 temporary ancillas for the addition into the phase gradient state at the same time. There are also 3 log N + O(1) temporary ancillas for computing the energy, which are not used at the same time as the b_grad − 1 temporary ancillas. That gives a number of temporary ancillas increased to

B_phase^LABS = max(b_grad + 2 log N, 3 log N) + O(1).

We give this cost in Table IV to account for the possibility of using either method. In the table we assume 3 log N ≥ b_pha, because that is true for most combinations of parameters we consider.

D. Oracles for linear combinations of unitaries
A number of approaches to quantum simulation are based on accessing the Hamiltonian as a linear combination of unitaries. This so-called "linear combination of unitaries" (LCU) query model [36] has been used for Taylor series simulation [37], interaction picture simulation [38], and generalized to block encodings for "qubitization" [28]. These approaches begin from the observation that any Hamiltonian can be decomposed as a linear combination of unitaries,

H = Σ_ℓ w_ℓ U_ℓ,    (51)

where the w_ℓ are real scalars and the U_ℓ are unitary operators.
Here we consider an approach to forming quantum walks known as qubitization [28]. The quantum walk involves LCU using queries to two oracles, followed by a reflection operation, as shown in Figure 2. The first oracle circuit, the "preparation oracle", acts on an empty ancilla register of ⌈log L⌉ qubits and prepares a particular superposition state related to the notation of Eq. (51),

prepare |0⟩ = Σ_ℓ √(w_ℓ/λ) |ℓ⟩,  λ = Σ_ℓ |w_ℓ|.    (52)

The quantity λ has significant ramifications for the overall algorithm complexity; specifically, the qubitization oracles will need to be repeated a number of times proportional to λ in order to realize the intended quantum walk. The second oracle circuit we require acts on the ancilla register |ℓ⟩ as well as the system register |ψ⟩ and directly applies one of the U_ℓ to the system, controlled on the ancilla register. For this reason, we refer to the ancilla register |ℓ⟩ as the "selection register" and name the second oracle the "Hamiltonian selection oracle",

select |ℓ⟩|ψ⟩ = |ℓ⟩ U_ℓ |ψ⟩.

Using two queries to prepare and a single query to select, we are able to implement a controlled quantum walk W which encodes the eigenvalues of H as a function of its own eigenvalues [28]. Specifically, in a certain subspace this quantum walk has eigenphases equal to the arccosine of the eigenvalues of the problem Hamiltonian divided by λ. We now discuss the realization of this quantum walk for the problems discussed in Section II.
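The defining block-encoding identity, ⟨0| prepare† · select · prepare |0⟩ = H/λ, can be verified for a toy one-qubit Hamiltonian (a sketch with names of our choosing; not a circuit-level construction):

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def kron(A, B):
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]

Z = [[1, 0], [0, -1]]
X = [[0, 1], [1, 0]]
I = [[1, 0], [0, 1]]

w = [0.8, 0.6]                          # toy H = 0.8 Z + 0.6 X
lam = sum(w)                            # lambda = sum of |w_l|
s = [math.sqrt(wl / lam) for wl in w]
prepare = [[s[0], s[1]],                # unitary whose first column is
           [s[1], -s[0]]]               #   sqrt(w_l / lambda)

# select = |0><0| (x) Z + |1><1| (x) X, with the selection qubit first
select = [[Z[0][0], Z[0][1], 0, 0],
          [Z[1][0], Z[1][1], 0, 0],
          [0, 0, X[0][0], X[0][1]],
          [0, 0, X[1][0], X[1][1]]]

P = kron(prepare, I)
PT = [list(r) for r in zip(*P)]         # prepare is real, so dagger = transpose
block = matmul(PT, matmul(select, P))

# The top-left 2x2 block of prepare^dag select prepare equals H / lambda.
H = [[0.8, 0.6], [0.6, -0.8]]
for r in range(2):
    for c in range(2):
        assert abs(block[r][c] - H[r][c] / lam) < 1e-12
```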

LCU oracles for L-term Hamiltonian
Using the strategy for unary iteration introduced in [26], we can implement select for H_L with a Toffoli complexity of exactly L − 2 and ⌈log L⌉ − 1 extra ancilla qubits (or L − 1 Toffolis and ⌈log L⌉ ancillas if the operation needs to be controlled by another qubit, as it is in [26]). The circuit given there has ⌈log L⌉ ancillas, but one of them is just a control and is not needed for the iteration; if we do not need the operation to be controlled, the number of ancillas needed is ⌈log L⌉ − 1 and the Toffoli cost is L − 2. The operator we are to implement is

select |ℓ⟩|x⟩ = |ℓ⟩ Π_{i∈q_ℓ} Z_i |x⟩.    (54)

A simple way to understand the strategy is to first map the binary representation of |ℓ⟩ to a one-hot unary register (a register that contains L qubits which are all off except for qubit ℓ, which is on). Then, one could control the application of the Z_i associated with i ∈ q_ℓ on this qubit with only Clifford gates. This strategy would have low Toffoli complexity but would require L ancillas. The basic insight of the unary iteration circuits in [26] is that one can stream through the bits of this unary register using just ⌈log L⌉ − 1 extra ancillas. A circuit primitive is repeated L times, and at iteration j a particular ancilla is on if and only if ℓ = j. At that point in the circuit we can use Clifford gates to control the application of Hamiltonian terms like Z_i Z_j Z_k.
In [26] a strategy referred to therein as "coherent alias sampling" is introduced, and explicit circuits are provided which allow one to realize prepare for an arbitrary model with a Toffoli cost of L + b_LCU + ⌈log L⌉ + O(1). We need approximately ⌈log L⌉ ancillas for the state being prepared, ⌈log L⌉ for the alternate index values, and ⌈log L⌉ for the temporary ancillas in the QROM. There are b_LCU ancillas for the keep probabilities in the coherent alias sampling and b_LCU for the equal superposition state. Another temporary ancilla is used for the result of the inequality test. select uses L Toffolis and ⌈log L⌉ temporary ancillas, but these can be reused from the temporary ancillas used by prepare.
Here, b_LCU is a parameter that sets the precision of the cost function. In particular, this strategy will generate the state in Eq. (52) but with approximate coefficients w̃_ℓ in place of the exact coefficients w_ℓ, accurate to within a relative precision determined by b_LCU. Per the realization depicted in Figure 2, the quantum walk of interest is realized using two queries to prepare and one query to select. Thus, the strategy we have outlined requires Toffoli and ancilla counts of approximately

C_walk ≈ 3L + 2 b_LCU + 2⌈log L⌉ + O(1),  A_walk ≈ 3⌈log L⌉ + 2 b_LCU + O(1).
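The classical data loaded by the QROM in coherent alias sampling — a keep probability and an alternate index for each ℓ — can be sketched with the standard alias-table construction (exact arithmetic for clarity; names ours):

```python
from fractions import Fraction

def build_alias_table(w):
    """Classical sketch of the data a QROM loads for coherent alias sampling:
    for each index l, a keep probability keep[l] and an alternate index
    alt[l]. Sampling l uniformly, keeping it with probability keep[l] and
    otherwise jumping to alt[l], reproduces the distribution w exactly."""
    L, total = len(w), sum(w)
    p = [Fraction(wl * L, total) for wl in w]   # mass per uniform slot
    keep, alt = [Fraction(1)] * L, list(range(L))
    small = [l for l in range(L) if p[l] < 1]
    large = [l for l in range(L) if p[l] >= 1]
    while small and large:
        s, g = small.pop(), large.pop()
        keep[s], alt[s] = p[s], g               # donate the deficit of s to g
        p[g] -= 1 - p[s]
        (small if p[g] < 1 else large).append(g)
    return keep, alt

w = [5, 1, 1, 1]
keep, alt = build_alias_table(w)
for l in range(len(w)):
    mass = keep[l] + sum(1 - keep[m] for m in range(len(w)) if alt[m] == l)
    assert mass / len(w) == Fraction(w[l], sum(w))
```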

LCU oracles for QUBO and using dirty ancilla
In some cases, especially when there is some structure in the Hamiltonian terms and one is willing to reduce gate complexity at the cost of space complexity, another method of implementing prepare might be appropriate. In particular, we can combine the coherent alias sampling ideas of [26] with the on-the-fly "dirty QROAM" of [39] (which is a concrete realization of an idea in [40] which builds on the QROM idea of [26], and is named "QROAM" since it incorporates attributes of both QROM and QRAM). Using Theorem 1 of [39] in conjunction with the coherent alias sampling of [26], we can implement prepare with approximately

2⌈L/k⌉ + 4 b_LCU (k − 1)

Toffolis and (k − 1)b_LCU dirty ancillas in addition to 2 b_LCU + log(L/k) + O(1) clean ancillas (not counting the selection register), where k ∈ [1, L] is a free parameter that must be a power of 2. This sort of QROAM can be uncomputed faster than it can be computed [39]. Combining Theorem 3 of [39] with coherent alias sampling [26] leads us to the result that the Toffoli cost of uncomputing prepare is less than the complexity quoted above by 4(b_LCU − 1)k and can reuse the same ancillas. The number of dirty ancillas is reduced to k − 1, which means that the value of k can be taken larger, reducing the Toffoli complexity. See Table V for detailed costs of various types of QROAM.

[Table V: Toffoli and ancilla costs for the various types of QROAM computation and uncomputation from [39], where L is the number of items, k is a power of 2, and M is the output size. The table omits the ⌈log L⌉ ancillas of the selection register and the M-qubit output.]

We will use this dirty QROAM strategy for the QUBO Hamiltonian. Our approach will involve indexing the terms and coefficients with two registers, each of size ⌈log N⌉, so that |ℓ⟩ = |i⟩|j⟩. This makes applying select particularly easy, as we can use two applications of the unary iteration strategy that we discussed for implementing Eq. (54) to
realize select with Toffoli complexity 2N − 4 and ⌈log N⌉ − 1 ancillas (again, not counting those in the selection register). Because the QROAM strategy needs a single register that takes a contiguous set of values, we need to compute a new register for QUBO. For QUBO with i ≤ j one would calculate j(j − 1)/2 + i. (Note that this is with indexing starting from 1, which we have done to simplify the sums, but 1 would be represented in binary as 00…00, and so forth.) We apply the QROAM to this register, then uncompute it afterwards. The cost of computing and uncomputing this register is O(log^2 N) due to the multiplications. Since L = N(N + 1)/2 for QUBO, the Toffoli cost of implementing select, in addition to implementing (and later uncomputing) prepare, will be approximately

2N + 4⌈L/k⌉ + 4 b_LCU k,

and will require k b_LCU + O(1) dirty ancillas and 2 b_LCU + 2 log(N/k) + 2 log N + O(1) clean ancillas. For simplicity we are taking k to be the same for the computation and uncomputation here, though it is more efficient to take k larger for the uncomputation. Minimizing over k by taking the derivative gives

k ≈ N/√(2 b_LCU),

which leads to a Toffoli complexity for the entire walk (including select) going like

4N√(2 b_LCU) + 2N + O(log N),

and an ancilla complexity for the entire walk going like

N√(b_LCU/2) + O(b_LCU + log N),

where the first term in the ancilla scaling corresponds to the dirty ancillas, and thus can use the system qubits. For simplicity we have used the exact optimal value of k here; there will be a slight increase in the complexity because k needs to be a power of 2 and so cannot be taken exactly equal to N/√(2 b_LCU). While this result optimizes the Toffoli complexity of our implementation, it does so at a fairly high cost: we have increased the space complexity from O(N) to O(N√b_LCU). In many cases this will not be a sensible tradeoff, and one should instead choose a smaller k so that the total number of qubits is not increased.
For instance, k = N/b_LCU will never increase the spatial complexity, because we will always have N system qubits available in the system register that are not acted upon while we apply prepare. In some cases (for instance, the quantum simulated annealing algorithm realized by Szegedy quantum walks) we will actually have 2N qubits available for use during prepare, and so we can safely take k = 2N/b_LCU without increasing the spatial complexity.
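The tradeoffs in k can be explored numerically with the cost model quoted above (a sketch; the cost expressions follow Table V as described in the text, and the parameter values are illustrative):

```python
from math import ceil, log2, sqrt

# Toffoli cost model for QROAM lookup of L items with M output bits and a
# power-of-2 parameter k, matching the costs quoted in the text and Table V:
clean_compute   = lambda L, M, k: ceil(L / k) + M * (k - 1)
clean_uncompute = lambda L, M, k: ceil(L / k) + k
dirty_compute   = lambda L, M, k: 2 * ceil(L / k) + 4 * M * (k - 1)
dirty_uncompute = lambda L, M, k: 2 * ceil(L / k) + 4 * k

def best_k(cost, L, M):
    """Brute-force the best power-of-2 value of k in [1, L]."""
    return min((1 << i for i in range(int(log2(L)) + 1)),
               key=lambda k: cost(L, M, k))

# Illustrative parameters: N = 100 and b_LCU = 10 for QUBO, so there are
# L = N(N + 1)/2 items with M = b_LCU + 2 ceil(log N) output bits.
N, b_lcu = 100, 10
L = N * (N + 1) // 2
M = b_lcu + 2 * ceil(log2(N))

k_c1 = best_k(clean_compute, L, M)
assert k_c1 / 2 <= sqrt(L / M) <= k_c1 * 2        # rounded optimum near sqrt(L/M)
assert clean_compute(L, M, k_c1) <= 1.1 * 2 * sqrt(L * M)
```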
Next we give a more detailed explanation of the costing. The QROAM costings are, for output size M, given in Table V. The value of L is L = N(N + 1)/2 for QUBO. The output consists of b_LCU qubits for the keep probability in the state preparation, plus 2⌈log N⌉ qubits for the alternate values of i and j, so

M = b_LCU + 2⌈log N⌉.    (64)

With clean ancilla qubits, the optimal value of k for the preparation, limited to powers of 2, is

k_c1 = 2^{round(log √(L/M))},    (65)

and for the inverse preparation it is

k_c2 = 2^{round(log √L)}.    (66)

The other Toffoli costs in the parts of the LCU beyond the QROAM are as follows.
• There is O(log N ) cost to prepare the equal superposition states over i and j with i ≤ j.
• There is 2(b LCU +2 log N )+O(1) Toffoli cost for the inequality test and controlled swaps for the state preparation and inverse preparation.
• The cost of the arithmetic for producing the contiguous ancilla is O(log 2 N ).
• The select has a Toffoli cost of 2N − 4, or 2N − 2 if it needs to be made controlled.
Altogether these costs give a Toffoli cost with clean ancilla of with the values of M, k_c1, and k_c2 in Eq. (64), Eq. (65), and Eq. (66). If we ignore the rounding in k_c1 and k_c2, then the Toffoli cost is The rounding in k_c1 and k_c2 can potentially increase the cost by a factor of 1/√2 + 1/√8 ≈ 1.06, i.e. about 6%. In costing the total number of ancillas for the state preparation, we also need to account for the following (in addition to those in Table V).
• There are 2 log N qubits needed for the prepared state.
• There are b LCU qubits used for the register in equal superposition that we use to perform an inequality test with in the state preparation.
• There are M output qubits.
• There are log L temporary ancilla qubits used for the contiguous register.
• There are b LCU − 1 temporary ancillas used in computing the inequality test for the state preparation.
There are also log N temporary registers needed for the select step, but many of the qubits are only temporarily used by the QROAM, and these can be reused, so we do not get an additional ancilla cost for select. The ancillas additional to those in Table V can therefore be given as 2M persistent ancillas and max(log L, b_LCU) + O(1) temporary ancillas. The ancillas in Table V are temporary as well, and the b_LCU qubits are not needed at the same time, giving a maximum of temporary ancillas. Ignoring the rounding in k_c1 for simplicity gives the leading-order term as N√(M/2) temporary ancillas. Next we consider the cost with N dirty ancilla. The optimal value of k for the QROAM computation is For the uncomputation cost it is optimal to take k_d2 = √(L/2), which gives a cost of 4√(2L), ignoring rounding of k_d2 to a power of 2. With L = N(N + 1)/2, the optimal k_d2 is √(N(N + 1)/4) < N, so there are enough dirty ancilla available. With rounding, the value of k_d2 for uncomputation would be Together with the additional Toffoli costs for the state preparation, the Toffoli cost for LCU is To simplify the expression, we will use N/M rather than N/M + 1 in the expression for k_d1, and not take into account rounding k to a power of 2. Then we get a computation Toffoli cost of For the ancilla cost, the persistent ancilla cost is again 2M, and the temporary ancilla cost loses the term M(k − 1) because dirty ancilla are used for that, so it does not increase the ancilla cost. The temporary ancilla cost is Using L = N(N + 1)/2 and k_d1 = N/M gives log(L/k_d1) = log(N + 1) + log M − 1. Then log(N + 1) = log N + O(1/N).
In Table IV we just give 3 log N for the temporary ancilla cost, because it is true (or close to true) for the combinations of parameters we consider.

LCU oracles for the SK model
For the SK model we can considerably improve over the naive implementation. Because the SK model coefficients only need to give a sign, we just need to apply a sign to the terms in the superposition. That corresponds to the phase fixup used for the QROAM uncomputation, and the cost is the same. Another advantage of this approach is that we eliminate the 2(b LCU + 2 log N ) + O(1) cost for the inequality test and controlled swaps that would otherwise be needed for the coherent alias sampling. Therefore the Toffoli cost with clean ancilla is If we ignore the rounding in k c2 then we obtain the complexity Beyond the ancillas needed for the QROAM, we just need the 4 log N + O(1) qubits for the i, j, and contiguous registers. Again select can use the same temporary ancillas as the QROAM and does not add to the ancilla cost. Therefore the ancilla cost is Ignoring the rounding in k c2 for simplicity gives If we are using dirty ancilla, then the Toffoli cost becomes Ignoring the rounding in k d2 we obtain the complexity The persistent ancilla cost is only 2 log N for the i and j registers, and there is temporary ancilla cost of 2 log N for the contiguous register and log(L/k d2 ) ≈ log N from the QROAM. The total ancilla costs are therefore

LCU oracles for the LABS model
The LABS problem has L = O(N³) terms, which would lead to a high-complexity quantum walk if our general strategy were applied. Fortunately, there is much structure in this problem. We start by rewriting Eq. (6) as a sum of four-body Z terms indexed by three integers. Instead of linearly indexing all O(N³) terms, we will use three registers, each of size log N, which store the values of i, j and k. Thus, our select operation will act as select |i⟩|j⟩|k⟩|ψ⟩ = |i⟩|j⟩|k⟩ Z_i Z_{i+k} Z_j Z_{j+k} |ψ⟩. To accomplish this, we simply need 4 applications of the unary iteration primitive described in [26]. Each of these primitives requires N − 1 Toffoli gates. The only nuance is that we will need to compute the values i + k and j + k before implementing the primitive to perform Z_{i+k} and Z_{j+k}. These additions can be performed in place (and then uncomputed) in the i and j registers and introduce a negligible additive 4 log N cost to the cost of unary iteration, where the cost of each addition is log N − 1 ≤ log N. Thus, the total Toffoli cost of our select implementation is 4N + 4 log N. We require approximately 3 log N persistent ancilla for the i, j and k registers, another log N temporary ancilla for computing i + k and j + k (since they are computed in place), and log N temporary ancilla for the addition. The unary iteration uses log N − 2 < log N ancillas, which can be reused from the temporary ancillas for the addition, so they do not add to the cost. Because all terms have the same coefficient, prepare needs to initialize a superposition over a number of items that is not a power of 2. The Toffoli cost is O(log N). The only unfortunate aspect is that for the LABS problem the corresponding normalization λ is quite large, and this will enter into the complexity of our quantum walks as the number of times the quantum walk must be repeated to realize the intended unitary. In total then the cost to realize the quantum walk in Figure 2 is
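The rewriting that makes this select cheap can be checked classically. The sketch below is our own illustrative code (the index ranges are our assumption, taken from the standard sidelobe definition of the LABS energy); it confirms that the sum of four-spin terms over registers i, j, k reproduces the LABS cost function:

```python
import itertools

def labs_energy(s):
    # Sidelobe energy: sum_k (sum_i s_i s_{i+k})^2 for k = 1 .. N-1.
    N = len(s)
    return sum(sum(s[i] * s[i + k] for i in range(N - k)) ** 2 for k in range(1, N))

def labs_energy_expanded(s):
    # The same quantity written as a sum over three indices (i, j, k) of
    # four-spin terms s_i s_{i+k} s_j s_{j+k}, i.e. O(N^3) terms of the
    # form Z_i Z_{i+k} Z_j Z_{j+k} that select iterates over.
    N = len(s)
    return sum(s[i] * s[i + k] * s[j] * s[j + k]
               for k in range(1, N)
               for i in range(N - k)
               for j in range(N - k))

# Exhaustive check on all 2^6 spin configurations of length 6.
for s in itertools.product((-1, 1), repeat=6):
    assert labs_energy(s) == labs_energy_expanded(s)
```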

E. QROM-based function evaluation
Now that we have explained how to implement oracles for various cost functions of interest, we turn to the question of how to calculate functions of the cost. This is important for several possible approaches to heuristic-based combinatorial optimization. In simulated annealing, for instance, the probability of moving from one candidate solution to another is proportional to an exponential of the energy difference between the two solutions, multiplied by an inverse temperature β. We would thus require the quantum computer to calculate an exponential of the output of the relevant energy difference oracle.
Because we are implementing heuristic approaches to combinatorial optimization, we do not expect that the functions of the cost need to be calculated to a high degree of accuracy so long as the functions we compute are still monotonic in the cost (to make sure that the energy landscape is not inverted in any way). We instead want to minimize the computational complexity of evaluating these functions given rather weak requirements on the accuracy of the output. Here we describe a general strategy for such cheap approximate function evaluation.
Our overall strategy is to approximate a function f of a b-bit input z by a piecewise linear approximation f̃. This approximation f̃ is calculated based on a choice of sample points z_0 < z_1 < . . . < z_g, where z_0 ≤ z < z_g. These sample points separate the interval [z_0, z_g) into g different sub-intervals of the form [z_ℓ, z_{ℓ+1}) with ℓ = 0, 1, . . . , g − 1. The input z belongs to exactly one of these sub-intervals, and so we find an ℓ such that z_ℓ ≤ z < z_{ℓ+1}. Having found ℓ, we use some data that can be looked up in order to calculate f̃(z) = α_ℓ z + β_ℓ, where α_ℓ and β_ℓ are a slope and intercept. That is, the function f̃ is defined by interpolating between known values f(z_ℓ) and f(z_{ℓ+1}) of the target function f. QROM [26] can be used to obtain the region that z is in (i.e. the correct value of ℓ above), and for that region the QROM outputs the slope and intercept for the linear approximation. The Toffoli cost of looking up one of g different possible values in the scheme of [26] is g − 2, or g − 1 if the output is controlled by a qubit. This Toffoli count relies on a technique from [32] in which certain naively expected Toffolis can be replaced with Clifford gates plus measurement. Also note that the Toffoli count of QROM-based lookup is independent of the number of bits of data output, meaning that we are free to choose any number of bits to represent the slope and intercept without introducing additional Toffoli cost from the QROM. We choose the number of bits in order to obtain b_sm bits for f̃. That is, f̃ may be a rough approximation of f, but we give f̃ to more bits than needed by that approximation so that f̃ has smooth behaviour.
We will not use QROM precisely as specified in [26] but rather a variant of it. To explain the distinction, we begin with some terminology. QROM is a method for executing a quantum circuit that operates on two registers, an input register and an output register. The input register has an initial value of ℓ encoded into it and the output register starts in the all-zero state. Each value of ℓ corresponds to some piece of data d_ℓ that has been specified classically before the quantum circuit was constructed. The effect of the QROM is |ℓ⟩|0⟩ → |ℓ⟩|d_ℓ⟩. Our variant of QROM is designed for the case in which there are data collisions. That is to say, we consider the case where d_ℓ = d_{ℓ'} for several different pairs ℓ and ℓ'. This variant is illustrated in Figure 3. We imagine that we have distinct parts of the iteration: iterate by ℓ → ℓ + 1, iterate by ℓ → ℓ + 2, iterate by ℓ → ℓ + 4, and so on for each power of two. This variant of QROM is appropriate for our purposes because we want to improve computational efficiency by spacing the z_ℓ unevenly. This is equivalent to treating many pieces of data d_ℓ as being equal, as the data is simply the information needed to calculate a linear function. The total number of Toffoli gates is still g − 2 for g distinct regions, provided these regions correspond to ignoring bits of the input. For example, we can use a region such as {4, 5}, but not {3, 4}, because 4 ≡ 100 and 5 ≡ 101, so grouping 4 and 5 corresponds to ignoring the least significant bit, whereas 3 ≡ 011 and 4 ≡ 100 differ in more than just the least significant bit.
A further subtlety is that all regions need to be of a size corresponding to a power of 2 for this cost. In some cases we may wish to have a final region that is larger than half of the input domain, so its size is not a power of 2. That will occur because we can have a large energy difference, but the exponential will give a transition probability that will simply be approximated as zero for a wide range of energies. Then the cost can be larger. For example, if we are distinguishing 0 from 1–15, it will take 3 Toffolis. The cost can be seen from the diagram where the size of the regions increases in powers of 2, shown in Figure 3(b). There one can choose the numbers used for d_{4-7} and d_{8-15} to be equal, which gives a single region for 4–15. This choice corresponds to a situation where the gap between neighboring interpolation points z_ℓ grows exponentially.
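The alignment condition on regions can be phrased as a simple classical predicate (an illustrative sketch; the function name is ours):

```python
def shares_qrom_value(lo, hi):
    """Whether inputs lo..hi (inclusive) can be grouped into one QROM region
    at no extra Toffoli cost: the region size must be a power of 2 and the
    region must be aligned, i.e. grouping = ignoring low bits of the input."""
    size = hi - lo + 1
    return size & (size - 1) == 0 and lo % size == 0

assert shares_qrom_value(4, 5)        # {4, 5} = 10x: ignore the least significant bit
assert not shares_qrom_value(3, 4)    # 3 = 011 and 4 = 100 differ in every bit
assert shares_qrom_value(8, 15)       # the d_{8-15} block of Figure 3(b)
```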
For many of the piecewise approximations, we can obtain accurate approximations using just powers of 2, as in Figure 3(b). Two main types of function that we aim to approximate are the exponential and the arcsine of the exponential. For the exponential the piecewise approximation can use points at argument values of 0, 1/2, 1 and so on and achieve a piecewise linear approximation within about 0.03. The arcsine of the exponential is more difficult to approximate because the slope diverges at an argument of 0, but using piecewise linear approximation points starting at 1/2^11 and going up by powers of 2 gives similar precision as for the exponential.
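As a minimal classical sketch of this kind of approximation (the slope/intercept data that the QROM would output is emulated here with a Python list; the point spacing of 1/2 follows the text), the piecewise linear fit to exp(−z) indeed stays within about 0.03:

```python
import math

points = [k / 2 for k in range(41)]  # interpolation points 0, 1/2, 1, ..., 20

def f_tilde(z):
    # Find the sub-interval [z_l, z_{l+1}) containing z, then apply the
    # slope/intercept data a QROM lookup would output for that region.
    ell = min(int(2 * z), len(points) - 2)
    z0, z1 = points[ell], points[ell + 1]
    f0, f1 = math.exp(-z0), math.exp(-z1)
    return f0 + (f1 - f0) / (z1 - z0) * (z - z0)

# Worst-case error |f''| dz^2 / 8 <= (1/2)^2 / 8 ~ 0.03, matching the text.
worst = max(abs(f_tilde(i / 1000) - math.exp(-i / 1000)) for i in range(20000))
assert worst < 0.032
```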
FIG. 4. The numbers of intervals multiplied by 2^(−b_fun/2) for the five functions we consider. In (a) we allow the intervals to have general endpoints, and in (b) we restrict the intervals to change by factors of 2, to be consistent with the QROM method we use. This demonstrates that the number of intervals scales as 2^(b_fun/2) with a scaling constant around 1.

To estimate the number of interpolation points needed for higher precision, note that the error of interpolation of a function f(z) is approximately |f''(z)| δz²/8,
where δz is the width of the interval. To obtain error no greater than 2^(−b_fun), we can therefore take δz ≈ √(8 × 2^(−b_fun)/|f''(z)|). We can therefore estimate the number of intervals needed to approximate the function by g ≈ 2^(b_fun/2) ∫ dz √(|f''(z)|/8). In the case where we are approximating arcsin(exp(−z/2)), we would get g ≈ 1.31103 × 2^(b_fun/2), and if we were approximating exp(−z), then we would have g ≈ 2^((b_fun−1)/2). For the three functions used for spectral gap amplification, the corresponding constants are approximately 0.346002, 0.566302, and 0.517075, respectively. The variation of 2^(−b_fun/2) g with b_fun is shown in Figure 4(a). In practice, we need to limit the intervals to sizes that increase by factors of two as described above. That increases the values of 2^(−b_fun/2) g to around 1.0, 1.9, 0.5, 0.8, and 0.7 for the five cases, as can be seen in Figure 4(b), an increase of around 44%. Nevertheless, it is reasonable to give the scaling of g as O(2^(b_fun/2)), with the constant factor somewhere between 0.5 and 2.
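The quoted constant for exp(−z) can be checked by evaluating the integral in the interval-count estimate numerically (a sketch under the error formula above):

```python
import math

# Numerical check of the interval-count estimate for f(z) = exp(-z):
# g ~ 2^(b_fun/2) * integral of sqrt(|f''(z)| / 8) dz. The integral evaluates
# to 1/sqrt(2), reproducing the g ~ 2^((b_fun - 1)/2) quoted in the text.
dz = 1e-4
constant = sum(math.sqrt(math.exp(-k * dz) / 8) * dz for k in range(int(50 / dz)))
assert abs(constant - 2 ** -0.5) < 1e-3
```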
In the linear interpolation, the primary cost is that of multiplying the argument by the slope. This cost will depend on how many digits are used for the slope and the argument. For simplicity, consider the case where the bits of the argument can be divided between those before the decimal point and those after the decimal point. The maximum value needed for the argument is O(b_sm), because beyond that the functions are within 1/2^(b_sm+1) of their asymptotic values. That means only log b_sm + O(1) bits would be needed before the decimal point. The number of digits after the decimal point would depend on the maximum value of the slope. In the case of the exponential the maximum slope is 1, so only b_sm bits would be needed. Multiplying this slope by an argument of b_sm + log b_sm + O(1) bits then requires O(b_sm²) Toffoli gates. The same result is obtained for all other functions we consider except the arcsine. The arcsine has a slope that goes to infinity, but the linear interpolation will only use a finite slope. The minimum interpolation point needs to be O(2^(−2b_fun)), which gives a maximum slope of O(2^(b_fun)), so the argument would require another b_fun + O(1) bits after the decimal point. The slope would need b_fun + O(1) bits before the decimal point, and b_sm + log b_sm + O(1) bits after the decimal point to account for the maximum argument. Then both numbers would need b_fun + b_sm + log b_sm + O(1) bits. We will take b_fun similar to b_sm, giving a multiplication cost of O((b_fun + b_sm + log b_sm)²) Toffoli gates. To estimate the numbers of bits needed, we have performed simulations of the technique of Section III E with the SK Hamiltonian on 16 qubits, as shown in Figure 5. In that technique, we need an approximation of the arcsine of the transition probability to control a qubit rotation, rather than the transition probability itself. Choosing interpolation points such that the error in the approximation of the rotation angle is no more than 0.01, the success probabilities are almost unchanged.
So far we have assumed that the energy difference has been multiplied by the inverse temperature β before being input to the procedure. It is possible to bundle the multiplication by β into the oracle, and as shown in Figure 5 that again has similar performance. There is also the question of how many bits are needed in the function approximating the transition probability. We again find that low-precision approximations have very little impact on the success probability.

FIG. 5. The effect of various methods of function approximation on optimization performance. We numerically simulate the quantum simulated annealing technique of Section III E using various methods of approximating the transition probability. We consider the performance when the rotation angle is calculated to machine precision ("exact"), with piecewise linear approximation chosen to ensure the worst-case error does not exceed 0.01 ("interpolated"), incorporating the inverse temperature β into the definition of the function so that we are interpolating f(z) = arcsin(exp(−βz/2)) rather than f(z) = arcsin(exp(−z/2)) to avoid a multiplication ("include β"), and when we round off the output of the function to 7 bits ("rounded output"). Each of these approximations builds upon the previous approximation, so we perform linear interpolation in all but the exact method. We simulate the performance averaging over 4096 random SK instances on 16 qubits, with β linearly increasing over 50 steps from 0 to 1.2. We report the average failure probability (bottom) as well as an estimate of the computational cost (top) in which we calculate the number of annealing steps divided by the probability of success. We observe that the performance is not meaningfully affected by the method of function approximation, suggesting that we can pick the computationally cheapest option for our cost analysis.
The overall complexity of the interpolation excluding the QROM is therefore when the arcsine is needed. To estimate the QROM complexity, we need to account for the final region not being a size which is a power of 2. In the worst case the additional cost can be no larger than b dif , which is the total size of the input register. We can therefore bound the QROM complexity as b dif + O(2 b fun /2 ), giving total interpolation complexity of or, for the case where the arcsine is needed, For the number of ancilla qubits needed, except for the arcsine case there are 2b sm + O(log b sm ) needed for the slope and intercept, and 2b sm +O(log b sm ) used as temporary ancillas for the arithmetic. We need b dif −1 temporary ancillas for the QROM, which is more than the number used for the arithmetic. The output for the transition probability can be added into the slope, so does not increase the ancilla cost. Therefore the ancilla costs are These considerations give the costs for function evaluation in Table IV. For the arcsine case we need 2b sm + b fun + O(log b sm ) ancillas for the slope and intercept, because we need another b fun ancillas for the slope. Again the temporary ancilla cost is primarily for the QROM, so the ancilla costs are

III. OPTIMIZATION METHODS
In this section we review proposals for heuristic quantum optimization algorithms and explain how those algorithms can be implemented in terms of the oracles we describe in Section II. These include methods based on Hamiltonian walks, methods based on time evolution, and methods related to simulated annealing. In most cases we will suggest improvements to these methods, but an important motivation for this section is to give a complete analysis of the complexity of these algorithms, including constant factors, so that we can estimate the resources required to realize them in the surface code in Section IV. We describe the complexities of these methods in terms of the oracles from the previous section in Table VI, then give the complexity in terms of Toffoli or T gates in Table VII. As this section incorporates a wide variety of sophisticated techniques, we begin with a brief summary of the approaches we are considering.
• Amplitude amplification (Section III A). We start by considering amplitude amplification, which can be used to directly amplify the amplitude of the solution. Unlike the other methods, it takes no advantage of the structure of the solution, so is a useful reference point to compare to the other optimization approaches. Amplitude amplification can also be used in combination with the other optimization approaches, by performing amplitude amplification on the output of the optimization.
• The Quantum Approximate Optimization Algorithm (Section III B). The steps of this approach (QAOA) are equivalent to Trotter steps, so the costing for QAOA and Trotter steps is given in the same lines in Table VI and Table VII. Trotter steps can be used for adiabatic approaches, which are considered in the next subsection. But here we also focus on strategies for efficiently estimating the QAOA objective value that are more appropriate for a fault-tolerant cost model than standard approaches.
• Adiabatic quantum optimization (Section III C). We review the quantum adiabatic algorithm [2] and the most straightforward way of implementing that approach using a Trotter method that queries the phase oracles presented in Section II. We will then suggest a strategy for implementing the adiabatic algorithm using the LCU oracles presented in Section II. The LCU oracles have a different costing to Trotter/QAOA, so are given in separate lines in Table VI and Table VII. Next, we will review a method for digitizing the adiabatic algorithm while suppressing certain types of errors that is based on inducing quantum Zeno-effect-like projection to the ground state by randomizing phases [21]. We will suggest how this approach can be improved by using carefully chosen probability distributions to eliminate the errors that would manifest from incorrect measurements in the Zeno approach. The time-evolution oracles used by these methods are also suitable for the quantum enhanced population transfer algorithm and the shortest path algorithm. Since we do not introduce new techniques for those algorithms, but rather review how our oracles can be queried within those frameworks, we discuss that content in Appendix D.
• Szegedy walk based quantum simulated annealing (Section III D). Simulated annealing is a classical algorithm that mimics a physical cooling process via Markov chain Monte Carlo techniques. The quantum algorithm of Somma et al. [13] replaces the Markov chain with a corresponding Szegedy walk. If the spectral gap of the Markov transition operator is ∆, the number of Szegedy walk steps grows as O(1/√∆), in contrast with the best known bound on the worst-case scaling of the number of Markov transitions needed in the classical approach, which goes like O(1/∆). Thus, the result appears to be a quadratic speedup over simulated annealing. We note that the O(1/∆) scaling of classical simulated annealing is known to be a very loose bound for a broad class of problems. Typically, simulated annealing is used heuristically by lowering the temperature much faster than would be suggested by this bound. Our results constitute the first complete cost analysis for this algorithm that involves constant factors in the complexity.
• LHPST qubitized walk based quantum simulated annealing (Section III E). Lemieux, Heim, Poulin, Svore, and Troyer (LHPST) [23] give a Metropolis-Hastings-like qubitized walk approach which is significantly more efficient than the direct Szegedy approach. We will refer to this method by their initials, but we provide an improved technique that is efficient for more complicated problem Hamiltonians with high connectivity. LHPST consider a method that is efficient for simpler problem Hamiltonians with low connectivity, but would have exponential cost for the problem Hamiltonians considered here.
• Spectral gap amplification based quantum simulated annealing (Section III F). In [14], the authors construct an inverse-temperature-dependent Hamiltonian whose ground state in the zero-temperature limit is a superposition of solution configurations. By performing spectral gap amplification on their Hamiltonian, they obtain a gap that is similar to that for the quantum walk approach, indicating a similar speedup. Our main purpose is to outline these techniques and summarize the work needed to execute such algorithms in general and for the specific problems of interest outlined in the Introduction. We will also suggest a variant of this algorithm where one can use qubitized quantum walks rather than time evolution for the adiabatic evolution. In both cases, our results provide the first constant-factor bounds on the complexity of implementing these algorithms.

TABLE VI. (Table body not reproduced; columns: algorithm primitive, Toffoli complexity, ancilla complexity; the listed primitives begin with the amplitude amplification step, with costs expressed in terms of Table III.) For both the LHPST walk step and the gap amplified walk step the cost is reduced by 8 log N when N is a power of 2. By combining these scalings with the oracle costs given in Table IV, we arrive at the resource estimates in Table VII. For an order-ρ Suzuki formula (meaning that the error is O(δt^(ρ+1))) we multiply the QAOA cost by 2 × 5^(ρ/2−1). For the QAOA/Trotter step, the b_pha ancillas are for a phase gradient state, and may be saved if those are already accounted for in A_phase. The quantity is an allowable error in synthesizing a rotation in the state preparation.
We summarize the outcomes of this section in Table VI. The entries of Table VI show how the Toffoli complexity and ancilla cost of each of the above named algorithm primitives depend on the relevant costs of oracles presented in Table IV. We can then use Table VI together with Table IV to calculate the overall Toffoli complexity and ancilla cost of each algorithm primitive for each type of cost function. The results of this analysis are summarized in Table VII. In giving the complexities in this table, we assume 2^(b_fun/2) < b_sm log b_sm < b_rot to simplify the order terms, which is reasonable for the examples we consider in Section IV.
A. Amplitude amplification

Combining amplitude amplification with quantum optimization heuristics
All of the optimization heuristics discussed in this paper can be seen as methods of preparing a quantum state with overlap on a low-energy subspace of interest. We will refer to the subspace of interest as S. Sometimes this subspace of interest is actually the lowest energy state (or states) and other times it is any state with energy less than a certain threshold. Furthermore, all algorithms discussed in this paper are heuristics that can be systematically refined. Let us refer to an algorithm for quantum optimization that is run for duration t as U(t). Let us assume that these algorithms always begin in the state |+⟩^⊗N and denote the output state of the algorithm by |ψ(t)⟩ = U(t)|+⟩^⊗N. Thus, after running our algorithm U(t) and sampling in the computational basis, the probability of measuring a state in the subspace of interest S is p_0(t) = ∑_{s∈S} |⟨s|ψ(t)⟩|². When we say that these heuristics can be systematically refined, what we mean is that we can (on average) increase p_0(t) by increasing t. This refinement will come at a cost C(t) > 0, which we define as the complexity of implementing U(t). This complexity is greater than zero because preparing the initial state |+⟩^⊗N requires nonzero time even if we do nothing further. We can also boost the probability of seeing a state in S by repeating U(t) more times and sampling. On average we will need to run our algorithm U(t) a number of times equal to 1/p_0(t) in order to see a state in S. Thus, on average the cost to sample a state in S is given by C(t)/p_0(t). There is a compromise to be reached between the duration t of the optimization heuristic U(t) and the success probability p_0(t); heuristics run for more time can reach a higher success probability and therefore be repeated fewer times, but increasing t beyond a certain point has a negligible impact on the success probability p_0(t).
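This tradeoff is easy to visualize with a toy model. In the sketch below, the saturating success probability p0(t) = 1 − exp(−t/10) and the linear cost C(t) = 1 + t are hypothetical stand-ins chosen only for illustration, not models from the paper:

```python
import math

def mean_cost(t):
    """Expected cost to sample a state in S when repeating U(t): C(t)/p0(t)."""
    C = 1.0 + t                      # hypothetical cost of one run of U(t)
    p0 = 1.0 - math.exp(-t / 10.0)   # hypothetical success probability
    return C / p0

# Neither very short runs (p0 -> 0) nor very long runs (C -> infinity) are
# optimal; the minimum sits at an intermediate duration.
best_t = min((0.1 * k for k in range(1, 1000)), key=mean_cost)
assert mean_cost(best_t) < min(mean_cost(0.1), mean_cost(99.9))
```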
While past work [23] has discussed this dichotomy in terms of a minimum time-to-solution metric which is parameterized in terms of a target success probability, here we focus on the mean cost to succeed because this seems more reasonable to consider in a context where p_0(t) is unknown. Still, given knowledge of p_0(t) one could optimize this mean time by choosing t to minimize Eq. (102). But rather than simply repeating the state preparation 1/p_0(t) times, one could instead boost the success probability with amplitude amplification. Amplitude amplification is an idea which generalizes Grover search and can be used to boost the probability of a marked state or subspace. For instance, we might define these marked states to be any state in S. In this context, amplitude amplification would allow us to perform a series of m reflections (involving two preparations of the state |ψ(t)⟩) which boosts the probability of measuring the marked subspace to p_m(t) = sin²((2m + 1) arcsin √(p_0(t))).

TABLE VII. Resource estimates for the various heuristic optimization primitives explored throughout this paper, applied to our four problems of interest. For both the LHPST walk step and the gap amplified walk step the Toffoli count is reduced by 8 log N when N is a power of 2. In all cases, these algorithms are refined by applying the primitive more times. The parameters used are as follows: N is the number of bits on which our cost function is defined; L is the number of terms in an L-term spin Hamiltonian; b_pha is the number of bits we use to approximate phases in the implementation of our phase oracle; b_dir is the number of bits we use to approximate the value of energies; b_LCU is the number of bits used to approximate the square root of Hamiltonian coefficients in LCU methods; and b_rot is the number of bits of precision used in rotations. The Trotter step and Hamiltonian walk steps can be used to realize the adiabatic algorithm, the Zeno phase randomization variant of the adiabatic algorithm, heuristic variants of the short path algorithm or quantum enhanced population transfer, and many other heuristics based on Hamiltonian time evolution. These scalings result from combining the query complexities in Table VI with the oracle costs in Table IV. When the algorithm type is decorated with (*) we report T complexity rather than Toffoli complexity. We have only given the main terms in the order expressions to simplify them.
For instance, if we hoped to boost the probability to 1, then by using repeated sampling we would need roughly O(1/p_0(t)) repetitions. However, by using amplitude amplification we would need only O(1/√(p_0(t))) iterations if p_0(t) is small (this is akin to the usual quadratic Grover speedup). For each round of amplitude amplification one needs to reflect about a qubit marking the subspace of interest S. In our context the idea would be to amplify either a target energy (if a target energy, e.g. the ground state energy, is known) or to amplify all states with energy less than a certain threshold. To do this, one will need to compute the energy value into a register and perform either an equality or inequality test to determine whether we have reached a marked state. The energy can be computed simply by using the direct energy oracles introduced in Section II A. However, both that step and the equality or inequality evaluation will typically add only a negligible cost to the cost C(t) of actually running the quantum algorithm U(t). Moreover, the ancilla used for storing the value of the energy can be borrowed from ancilla used in other parts of the algorithm.
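The boost formula and the resulting quadratic saving are easy to verify numerically (a classical sketch of the formula above; the value p_0 = 10⁻⁶ is an arbitrary illustration):

```python
import math

def p_m(p0, m):
    """Success probability after m amplitude amplification reflections:
    p_m = sin^2((2m + 1) arcsin(sqrt(p0)))."""
    return math.sin((2 * m + 1) * math.asin(math.sqrt(p0))) ** 2

p0 = 1e-6
theta = math.asin(math.sqrt(p0))
m = round(math.pi / (4 * theta))     # ~ (pi/4) / sqrt(p0) reflections
assert m < 1000                      # versus ~ 1/p0 = 10^6 classical repetitions
assert p_m(p0, m) > 0.99
```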
For amplitude amplification to be most effective one should have an estimate of the overlap p_0(t) in order to avoid "overshooting" the peak of the function in Eq. (103). Unfortunately, a reliable estimate of p_0(t) will not be known in advance in general. In some rare cases one might instead have a somewhat tight estimate of a lower bound on p_0(t), and in those cases some advantages can be realized by using a variant of amplitude amplification known as fixed point amplitude amplification [41]. However, one can confirm that fixed point amplitude amplification will have no advantage in our context compared to the exponential search heuristic proposed in [20] when the best lower bound available is p_0(t) > 0. The idea behind the approach in [20] is to run amplitude amplification for m = 2^j iterations for j = 0, 1, 2, 3, . . . and so on until we sample a marked state. The cost of each iteration of amplitude amplification is 2C(t), and so if we need to repeat this procedure until m = 2^k it will have a total cost that goes like ∑_{j=0}^{k} 2^{j+1} C(t) ≈ 2^{k+2} C(t). Therefore, since the probability of failure in a single run with m = 2^j iterations is 1 − p_{2^j}(t), we see that the overall mean cost of the procedure is given by Eq. (106). Though this expression cannot be simplified analytically, it converges quickly and can be easily numerically computed for any p_0(t) > 0. Comparing Eq. (102) to Eq. (106), we can see that there is a clear asymptotic advantage to using amplitude amplification over classical sampling, and we expect this advantage will be realizable in practice in many contexts of interest for us. As with Eq. (102), if one has knowledge of p_0(t) then one can minimize Eq. (106) with respect to t to make the optimal tradeoff between running the algorithm U(t) for longer and using more rounds of amplitude amplification.
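The exponential search heuristic can be emulated classically by drawing success or failure with probability p_m(t) at each stage (a sketch; the unit normalization C(t) = 1 and the value p_0 = 10⁻⁴ are our illustrative choices):

```python
import math, random

def p_m(p0, m):
    return math.sin((2 * m + 1) * math.asin(math.sqrt(p0))) ** 2

def exponential_search_cost(p0, rng):
    """One run of the heuristic of [20]: amplify with m = 2^j iterations for
    j = 0, 1, 2, ... until a marked state is sampled. Each amplification
    iteration is charged 2 units (two preparations of |psi(t)>, C(t) = 1)."""
    cost, j = 0, 0
    while True:
        m = 2 ** j
        cost += 2 * m
        if rng.random() < p_m(p0, m):
            return cost
        j += 1

rng = random.Random(7)
mean = sum(exponential_search_cost(1e-4, rng) for _ in range(2000)) / 2000
assert mean < 0.1 / 1e-4   # orders of magnitude below the ~1/p0 classical cost
```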
In some cases the optimal choice might actually be $t = 0$, which corresponds to using amplitude amplification directly as a heuristic for optimization. The only downside to using amplitude amplification in conjunction with other heuristic quantum algorithms for optimization is that we have traded incoherent repetitions of the primitive $U(t)$ for coherent repetitions of it. In some cases this will mean that we need to target a lower logical error rate in order to make the longer coherent computation fault-tolerant using an error-correcting code.

Directly using amplitude amplification
In the prior section we described how amplitude amplification can be combined with any of the other optimization heuristics in this paper in order to boost overlap on a target low energy subspace of interest. However, one can also use amplitude amplification by itself as a heuristic for optimization. This heuristic provides an interesting point of comparison to other algorithms because it offers a quadratic advantage over classical brute force search without leveraging any structure that might be available in a particular optimization problem. Thus, it is asymptotically the optimal strategy for solving totally unstructured problems like those described by the typical Grover oracle (all computational basis states have energy zero except for a solution with energy −1) or the random energy model (all computational basis states have a unique, Gaussian distributed energy).
To use amplitude amplification on its own, all one needs to do is regard the algorithm $U(t)$ as the preparation of the symmetric superposition state $|+\rangle^{\otimes N}$, which requires only Clifford gates. In the analysis of Section III A 1 we assumed that the cost of directly computing the energy and then performing the comparison operation would be negligible compared to the cost of applying $U(t)$, but that is not the case when we aim to apply amplitude amplification directly. Here, the main cost of a step will be the cost to compute (and then later uncompute) the energy. Following Eq. (106), in this context we find that the mean cost of applying amplitude amplification directly scales like $O(2^{N/2})\left(2\,C_{\rm direct} + N + O(b_{\rm dir})\right)$, where we have used $p_0(t) = 1/2^N$. The cost $2\,C_{\rm direct} + N + O(b_{\rm dir})$ comes from cost $C_{\rm direct}$ to directly compute the energy, cost $O(b_{\rm dir})$ to apply the inequality operator that determines whether the energy is below the target threshold, cost $C_{\rm direct}$ to uncompute the energy, and cost $N - 2$ to reflect about the equal superposition state. Note that this procedure is exactly the heuristic approach introduced in [20]. When the subspace $S$ contains only a single state this algorithm reduces exactly to standard Grover search [3]. For later comparisons in this paper we will refer to a single step of amplitude amplification as having Toffoli complexity $2\,C_{\rm direct} + N + O(b_{\rm dir})$ and requiring $A_{\rm direct} + B_{\rm direct} + O(1)$ ancilla.

B. The Quantum Approximate Optimization Algorithm
The quantum approximate optimization algorithm (QAOA) is another popular approach to quantum optimization, introduced in [12]. The QAOA initially attracted significant interest after it was shown to produce a better approximation ratio for a specific combinatorial optimization problem of bounded occurrence, Max E3LIN2, than any known efficient classical method [12]. While a more efficient classical algorithm was presented shortly afterwards [42], interest in QAOA has only increased since then. While bounds on the performance of QAOA are sometimes available, in most contexts it is studied as a heuristic in the sense that the intention is to use the algorithm without knowing how well it will perform in practice. Part of the appeal of QAOA has been that it is an easy-to-implement algorithm that can be tested on noisy intermediate scale quantum (NISQ) devices even before fault-tolerance is available [43]. Nonetheless, QAOA would still be an interesting algorithm to perform within error-correction.
The QAOA is more straightforward than other algorithms discussed in this work. It consists of two components that are applied repeatedly. The first component is parameterized evolution under the diagonal problem Hamiltonian $C$, $U_C(\gamma) = e^{-i\gamma C} = O_{\rm phase}(\gamma)$, where in the last equality we emphasize that $U_C(\gamma)$ is equivalent to the phase oracle $O_{\rm phase}(\gamma)$ that we introduce and provide explicit circuit constructions for in Section II C. The second component is parameterized evolution under a local transverse field driver Hamiltonian $B = \sum_{i=1}^{N} X_i$, namely $U_B(\beta) = e^{-i\beta B}$. The QAOA is a variational algorithm that uses repeated application of these unitaries to prepare a parameterized state that is then optimized. The depth of the variational algorithm is usually denoted "$p$" in the QAOA literature. Specifically, for depth $p$ we prepare a state parameterized by $\gamma = (\gamma_1, \ldots, \gamma_p)$ and $\beta = (\beta_1, \ldots, \beta_p)$, $|\gamma, \beta\rangle = U_B(\beta_p)\, U_C(\gamma_p) \cdots U_B(\beta_1)\, U_C(\gamma_1)\, |+\rangle^{\otimes N}$, where $|+\rangle^{\otimes N}$ is the symmetric superposition of all $2^N$ computational basis states. For a given $p$, we attempt to find parameters that minimize the expectation value of the cost, $\langle C \rangle = \langle \gamma, \beta | C | \gamma, \beta \rangle$. The QAOA proposes to use the quantum computer to estimate this expectation value and then to use a classical processor to perform the outer-loop optimization, in a fashion similar to other variational algorithms [44,45]. In general, finding the globally optimal values of $\gamma$ and $\beta$ could prove very challenging. However, QAOA is a heuristic algorithm and the idea is that even locally optimal parameter settings might provide good approximations. The original implementation of QAOA suggested that one directly sample the cost function $C$ to estimate $\langle C \rangle$. Using this method, if one wishes to converge an unbiased estimator of $\langle C \rangle$ so that the error is at most $\Delta_C$, then the state $|\gamma, \beta\rangle$ must be prepared and sampled a number of times $M \in O(\sigma^2/\Delta_C^2)$, where $\sigma^2$ is the variance of $C$ in the state $|\gamma,\beta\rangle$. While one will not know $\sigma^2$ in advance, one can obtain a reasonable estimate of it after only a handful of measurements and use that to determine how many more measurements are required.
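As an illustration of the state preparation just described, the following statevector sketch evaluates $\langle\gamma,\beta|C|\gamma,\beta\rangle$ for a depth $p=1$ QAOA on a 4-qubit Ising ring. The instance and all names are illustrative choices, not from the paper:

```python
import numpy as np
from functools import reduce

# Toy p = 1 QAOA instance: cost C = sum_<ij> Z_i Z_j on a 4-qubit ring.
N = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
z = np.array([[1 - 2 * ((idx >> i) & 1) for i in range(N)]
              for idx in range(2 ** N)])            # Z_i eigenvalue per basis state
cost = sum(z[:, i] * z[:, j] for i, j in edges).astype(float)

def qaoa_expectation(gammas, betas):
    """<gamma,beta| C |gamma,beta> for prod_k U_B(beta_k) U_C(gamma_k) |+>^N."""
    psi = np.full(2 ** N, 2 ** (-N / 2), dtype=complex)  # |+>^N
    for g, b in zip(gammas, betas):
        psi = np.exp(-1j * g * cost) * psi               # U_C(gamma): diagonal phase oracle
        rx = np.array([[np.cos(b), -1j * np.sin(b)],
                       [-1j * np.sin(b), np.cos(b)]])    # single-qubit e^{-i b X}
        psi = reduce(np.kron, [rx] * N) @ psi            # U_B(beta) = prod_i e^{-i b X_i}
    return float(np.real(np.vdot(psi, cost * psi)))

grid = np.linspace(-np.pi / 2, np.pi / 2, 21)
best = min(qaoa_expectation([g], [b]) for g in grid for b in grid)
print("best <C> on a coarse p=1 grid:", best)  # exact ground state value is -4
```

Even this coarse $21\times 21$ grid over $(\gamma_1, \beta_1)$ finds angles with a substantially negative cost expectation, illustrating the variational landscape the outer loop would explore.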
The cost of QAOA is always dominated by the number of times one must repeat the unitary $U_C(\gamma)$; the cost of implementing $U_B(\beta)$ is essentially free in comparison. Thus, if $J$ is the number of outer-loop optimization iterations, each requiring a query of the energy accurate to within $\Delta_C$, then in total we will require Toffoli complexity $J\, M\, p\, C_{\rm phase}$ with $M \in O(\sigma^2/\Delta_C^2)$. It is difficult to say what an appropriate choice of the quantities $J$ and $\Delta_C$ should be, as this depends on the problem, the choice of optimizer, and how aggressively one is attempting to optimize. However, in many circumstances one might not need to perform the outer-loop optimization at all and can thus take $J = 1$. This is the case when optimal (or "good enough") parameters can be inferred before running the algorithm. Such a situation often arises when running large instances of optimization problems drawn from a well defined ensemble (for example, instances of the Sherrington-Kirkpatrick model). This is due to the observation that normalized energy landscapes (proportional to $\langle C\rangle$ as a function of $\gamma$ and $\beta$) concentrate to instance- and size-independent average values for large $N$ [46,47]. Thus, surprisingly, it is possible to find the optimal values of $\gamma$ and $\beta$ by optimizing much smaller (presumably classically tractable) instances of these problems. Another possibility is to simply use the $\gamma$ and $\beta$ parameters that would be obtained from a Trotterization of the quantum adiabatic algorithm; in fact, there is evidence that these parameters become optimal as one increases $p$ [48]. Thus, for problems where it is appropriate to forgo the outer-loop optimization step of QAOA, we can approximate the Toffoli complexity as $p\, M\, C_{\rm phase}$ where $M$ is the number of samples we desire. The number of logical qubits required for its implementation will be $N$ (not counting any extra ancilla used for the phase oracle).
Within the context of NISQ computations it makes sense to use this sampling method to estimate the cost function expectation value and to perform the optimization on a classical computer. The reason is that both strategies minimize the size of each quantum circuit that must be executed, although potentially at the cost of a larger number of repetitions compared to other strategies. However, within cost models appropriate for fault-tolerance the primary resource to consider is the total number of gates required by the computation, and no particular distinction is made between gates involved in repeated applications of short quantum circuits and those in a single application of a longer quantum circuit. Thus, on a fault-tolerant quantum computer it may make sense to consider more elaborate versions of QAOA in which the expectation value estimation, and potentially even the optimization, is also performed on the quantum computer. For instance, the variational parameters $\gamma$ and $\beta$ could be stored in a quantum register on which the QAOA unitary is controlled. Such a scheme is considered in [49], where it is shown that this method can enable quadratically faster resolution of the gradient than would otherwise be possible, albeit with significant constant overhead. Similarly, by using the amplitude amplification based Monte Carlo techniques discussed in [50] (see Theorem 5 therein) one can reduce the number of state preparations needed for an estimate of the cost function to $O((\sigma/\Delta_C)\log^{3/2}(\sigma/\Delta_C)\log\log(\sigma/\Delta_C))$, an almost quadratic improvement over the naive sampling strategy. However, as that method requires a number of copies of the system register scaling as $O(\log(\sigma/\Delta_C)\log\log(\sigma/\Delta_C))$, it might prove prohibitively expensive to realize on small fault-tolerant quantum computers. We now consider two alternative ways to measure the energy in QAOA which might prove more practical for small fault-tolerant quantum computers.

Amplitude estimation based direct phase oracle evaluation
Apart from sampling, the next most natural algorithm for estimating the energy is to use amplitude estimation to compute the expectation value of each term in the cost function in sequence. Let us assume that the cost function takes the form $C = \sum_{\ell=1}^{L} w_\ell U_\ell$, where each $U_\ell$ is a unitary operator (and will typically be a product of diagonal Pauli operators), as in Eq. (51). Further, we take $\lambda = \sum_\ell |w_\ell|$. The algorithm we employ is simple: for each $\ell$ from 1 to $L$ we compute the quantity $\langle\psi|U_\ell|\psi\rangle$ within error $\Delta_C/(L|w_\ell|)$. An unbiased estimate of the cost function is then given by $\sum_\ell w_\ell \langle\psi|U_\ell|\psi\rangle$, and from the triangle inequality the error is at most $\Delta_C$. An estimate of $\langle\psi|U_\ell|\psi\rangle$ can be obtained by performing the Hadamard test (as shown in Figure 6). Specifically, the probability of measuring the ancillary qubit to be zero is $P_0 = (1 + \mathrm{Re}(\langle\psi|U_\ell|\psi\rangle))/2$. If amplitude amplification is used to mark the zero state for this circuit, then the eigenphases of the resultant walk operator (within the two-dimensional space spanned by the initial state and the marked state) are [20] $\phi = \pm 2\arcsin\sqrt{P_0} = \pm 2\arcsin\sqrt{\frac{1 + \mathrm{Re}(\langle\psi|U_\ell|\psi\rangle)}{2}}$.
We then have that $\mathrm{Re}(\langle\psi|U_\ell|\psi\rangle) = 2\sin^2(\phi/2) - 1 = -\cos(\phi)$.

FIG. 6. Hadamard test circuit for computing the expectation value of one of the terms in the cost function. Here $U_\psi$ is a unitary operation that prepares the ansatz state: $U_\psi|0\rangle = |\psi\rangle$.
From calculus we see that $\left|\frac{d}{d\phi}\cos(\phi)\right| = |\sin(\phi)| \le 1$. Thus, from Taylor's remainder theorem, for any $\delta \ge 0$, if $\phi \to \phi + \delta$ for some error $\delta$, then the uncertainty that propagates to the expectation value is at most $\delta$. Therefore, if we wish to estimate the expectation value of a single term within error $\epsilon_\ell$, it suffices to use phase estimation with error $\epsilon_\ell$ on the Grover operator. Finally, as discussed above, we take $\epsilon_\ell = \Delta_C/(L|w_\ell|)$ to ensure that the errors sum to at most $\Delta_C$ as required.
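The Hadamard test relation $P_0 = (1 + \mathrm{Re}\langle\psi|U|\psi\rangle)/2$ that this estimation rests on can be checked directly with a small statevector simulation (an illustrative sketch with our own naming):

```python
import numpy as np

rng = np.random.default_rng(7)

def haar_state(n):
    """A random normalized state on n qubits."""
    v = rng.normal(size=2 ** n) + 1j * rng.normal(size=2 ** n)
    return v / np.linalg.norm(v)

def hadamard_test_p0(U, psi):
    """Probability of measuring the control qubit in |0> after H, ctrl-U, H."""
    Hg = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    d = len(psi)
    cU = np.block([[np.eye(d), np.zeros((d, d))],
                   [np.zeros((d, d)), U]])           # controlled-U, control first
    full = np.kron(Hg, np.eye(d)) @ cU @ np.kron(Hg, np.eye(d))
    out = full @ np.kron([1, 0], psi)                # start in |0>_ctrl |psi>
    return np.linalg.norm(out[:d]) ** 2              # control-qubit-0 block

psi = haar_state(2)
U = np.kron([[1, 0], [0, -1]], [[1, 0], [0, -1]])    # U = Z (x) Z, a diagonal Pauli term
p0 = hadamard_test_p0(U, psi)
print(p0, (1 + np.real(np.vdot(psi, U @ psi))) / 2)
```

The two printed numbers agree to machine precision, confirming that sampling the control qubit estimates $\mathrm{Re}\langle\psi|U_\ell|\psi\rangle$.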
Using the QFT-based phase estimation algorithm in [26] we find that, if we neglect the cost of the quantum Fourier transform and any additional costs due to extra precision required in the QROM, then the number of queries to the Grover oracle needed is (for $\epsilon_\ell \le \pi$) $2^m = 2^{\lceil \log_2(\pi/\epsilon_\ell)\rceil} \le 2\pi/\epsilon_\ell$. Here the factor of 2 comes from the fact that rounding up to a power of 2 leads to, in the worst case, a factor of 2 in the number of iterations required.
Next, the Grover oracle requires two reflection operators: one that reflects about the state yielded by the Hadamard test circuit and another that reflects about the target space marked by the top qubit in Figure 6 being zero (i.e. $R_0 = 1\!\!1 - 2|0\rangle\langle 0| \otimes 1\!\!1$). The Grover walk operator is a product of these two operators, $W = -R_1 R_0$, and as a result, if we neglect the cost of the additional Hadamard and Toffoli gates needed to implement the conditional phase flip, the costs of this process are entirely due to the reflection about the initial state, which requires two applications of the preparation of the initial state. We further follow the assumption of the previous section that the cost of state preparation dwarfs the cost of applying prepare or select. Thus, under these assumptions, and taking the uncertainty in the objective function to be $\Delta_C$, the Toffoli complexity for the entire estimation is approximately $\sum_{\ell=1}^{L} \frac{2\pi}{\epsilon_\ell} \cdot 2\,C_\psi = \frac{4\pi L \lambda}{\Delta_C}\, C_\psi$, where $C_\psi$ is the Toffoli cost of preparing $|\psi\rangle$. Thus, under these assumptions, direct energy evaluation yields an advantage over sampling when roughly $\Delta_C \lesssim \sigma^2/(4\pi L \lambda)$. We expect this to occur when the error tolerance is small and the number of terms is relatively modest. On the other hand, if the variance is small, the target uncertainty is large, or $L$ is large, then sampling will be preferable to the direct phase oracle evaluation process.

Amplitude estimation based LCU evaluation
One inexpensive approach to estimating the expectation value comes from combining the Hadamard test circuit with amplitude estimation [20]. Here we use a slightly generalized form of the Hadamard test circuit, shown in Figure 7. The expectation value of the first qubit for this circuit is $\frac{1}{2} + \frac{\mathrm{Re}(\langle\psi|C|\psi\rangle)}{2\lambda}$.

FIG. 7. Generalized form of a Hadamard test that we will use for our QAOA implementation using LCU oracles. Here $U_\psi$ is a unitary operation that prepares the ansatz state: $U_\psi|0\rangle = |\psi\rangle$.

In
order to see this, note that the circuit applies the block encoding $(\langle 0|\,{\rm prepare}^\dagger)\,{\rm select}\,({\rm prepare}\,|0\rangle) = C/\lambda$ conditioned on the control qubit, so the quantity $\mathrm{Re}(\langle\psi|C|\psi\rangle)/\lambda$ appears in the interference between the two control branches. Therefore, the probability of measuring 0 in the top-most qubit in Figure 7 is $P_0 = \frac{1}{2} + \frac{\mathrm{Re}(\langle\psi|C|\psi\rangle)}{2\lambda}$. If amplitude estimation is used, the number of invocations of prepare and select needed to estimate this probability within error $\epsilon$ is $O(\lambda/\epsilon)$, which is a quadratic improvement over the sampling bound in Eq. (113). Following the same reasoning used to derive Eq. (119), we find that the overall Toffoli count is then, under the assumptions that the Toffoli count is dominated by applications of the prepare, select, and phase circuit operations, and further that the cost of adding an additional control to select is negligible, approximately $O\!\left(\frac{\lambda}{\Delta_C}\right)\left(C_{\rm Sel} + C_{\rm Prep}\right)$. Here $C_{\rm Sel}$ and $C_{\rm Prep}$ are the Toffoli counts for select and prepare, respectively. Eq. (124) shows that the favorable scalings of the sampling approach and the direct phase evaluation method can be combined in a single method. However, this advantage potentially comes at the price of a worse prefactor, owing to the additional complexity of the prepare and select circuits. In particular, we find that this approach will be preferable to sampling and to direct phase estimation, respectively, when $\lambda\,(C_{\rm Sel} + C_{\rm Prep}) \lesssim (\sigma^2/\Delta_C)\, C_\psi$ and when $C_{\rm Sel} + C_{\rm Prep} \lesssim L\, C_\psi$. In general, we suspect that in fault-tolerant settings this approach will be preferable to direct phase oracle evaluation because the costs of the prepare and select circuits will often be comparable to, or less than, that of $U_\psi$, as we will see in the following section where we provide explicit constructions for the prepare and select oracles.
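A small statevector check of the generalized Hadamard test statistics is below. The two-term cost function, the explicit prepare and select matrices, and all names are our own illustrative choices:

```python
import numpy as np

# Verify P(0) = 1/2 + Re<psi|C|psi> / (2 lambda) for a Hadamard test on
# V = (prepare^dag (x) 1) select (prepare (x) 1), with C = 0.7 Z + 0.5 X.
w = np.array([0.7, 0.5])
lam = np.sum(np.abs(w))
Z = np.array([[1, 0], [0, -1]], dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
C = w[0] * Z + w[1] * X

amps = np.sqrt(w / lam)
prep = np.array([[amps[0], -amps[1]],
                 [amps[1], amps[0]]])                 # prepare|0> = sum_l sqrt(w_l/lam)|l>
select = np.block([[Z, np.zeros((2, 2))],
                   [np.zeros((2, 2)), X]])            # |0><0| (x) Z + |1><1| (x) X
I2, I4 = np.eye(2), np.eye(4)
V = np.kron(prep.conj().T, I2) @ select @ np.kron(prep, I2)

psi = np.array([np.cos(0.3), 1j * np.sin(0.3)])       # arbitrary system state
init = np.kron([1, 0], np.kron([1, 0], psi))          # |0>_ctrl |0>_anc |psi>
Hg = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
cV = np.block([[I4, np.zeros((4, 4))], [np.zeros((4, 4)), V]])
out = np.kron(Hg, I4) @ cV @ np.kron(Hg, I4) @ init
p0_ctrl = np.linalg.norm(out[:4]) ** 2                # control measured in |0>

print(p0_ctrl, 0.5 + np.real(np.vdot(psi, C @ psi)) / (2 * lam))
```

Here $\langle 0|_{\rm anc}\langle\psi| V |0\rangle_{\rm anc}|\psi\rangle = \langle\psi|C|\psi\rangle/\lambda$, so the usual Hadamard test analysis yields the stated bias of the control qubit.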
C. Adiabatic quantum optimization

Background on the adiabatic algorithm
The adiabatic algorithm [51] works by initializing a system in the easy-to-prepare ground state of a known Hamiltonian, and then slowly (adiabatically) deforming that Hamiltonian into the Hamiltonian whose ground state we wish to prepare. For instance, we might use a Hamiltonian $H(s) = (1-s)H_0 + s H_1$ parameterized by $s \in [0, 1]$, where $H_0$ is a Hamiltonian with an easy-to-prepare ground state and $H_1$ is a Hamiltonian whose ground state we wish to prepare. We start the system in the ground state of $H(0) = H_0$ and then slowly deform the Hamiltonian by increasing $s$ from 0 to 1 until $H(1) = H_1$. If this is performed slowly enough, the system will be in the ground state of $H_1$ at the end of the evolution. The main challenge with the adiabatic algorithm is that we may need to increase $s$ extremely slowly for the procedure to succeed. The rate at which we can increase $s$ depends on features of the spectrum of $H(s)$, including its derivatives and the minimum gap $\Delta$ between the ground state eigenvalue and the first excited state eigenvalue. It is often empirically observed that the total evolution time $T$ should scale as $O(1/\Delta^2)$. Indeed, this result has been proven using the so-called boundary adiabatic theorem. That analysis treats the adiabatic algorithm in terms of phase randomization between the different paths that describe quantum dynamics for a slowly varying time-dependent Hamiltonian. This randomization causes paths that lead to different excitations to destructively interfere, which effects a mapping from the eigenvectors of an initial Hamiltonian to the corresponding eigenvectors of the target Hamiltonian in the limit of slow evolution relative to a relevant gap in the instantaneous eigenvalues of the time-dependent Hamiltonian.
The boundary adiabatic theorem holds that if we let $|\psi_k(s)\rangle$ be the $k$th instantaneous eigenvector of any Gevrey-class time-dependent $H(s)$, then the error in tracking $|\psi_k(s)\rangle$ after total evolution time $T$ scales inversely with $T\Delta^2$ [52], where $\Delta$ is the minimum eigenvalue gap between the state $|\psi_k(s)\rangle$ and the remainder of the spectrum. It then follows that if we pick an appropriate value of $T \in O(1/\Delta^2)$ (with the constant depending on the target error $\epsilon$) then we can make the error less than $\epsilon$ for an arbitrary gapped adiabatic path. Alternatively, if very high precision is required, then the time required for adiabatic state preparation can be improved for analytic Hamiltonians to $O({\rm poly}(\|\dot H\|, \|\ddot H\|, \ldots)(1/\Delta^2 + \log(1/\epsilon)/\Delta))$ by adaptively choosing the adiabatic path to obey $\partial_s^q H(0) = \partial_s^q H(1) = 0$ for all positive integers $q$ less than $Q(\epsilon) \in O(\log(1/\epsilon))$; however, this approach requires a small error tolerance on the order of $\epsilon \in O(\Delta)$ in order to see the benefits of these improved adiabatic paths [53][54][55].
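The suppression of the diabatic error with growing $T$ is easy to observe numerically. The following sketch integrates a two-level linear sweep (the model Hamiltonian and the discretization are our own illustrative choices, not from the paper):

```python
import numpy as np

# Two-level adiabatic sweep: H(s) = -(1-s) X - s Z with s = t/T.
# The minimum gap is sqrt(2) at s = 1/2; the final diabatic error
# shrinks as the total evolution time T grows.
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def final_error(T, steps=4000):
    """1 - |<ground(H_1)|psi(T)>|^2 after a linear sweep of duration T."""
    psi = np.array([1, 1], dtype=complex) / np.sqrt(2)   # ground state of -X
    for k in range(steps):
        s = (k + 0.5) / steps
        H = -(1 - s) * X - s * Z
        evals, evecs = np.linalg.eigh(H)                  # exact step propagator
        psi = evecs @ (np.exp(-1j * evals * (T / steps)) * (evecs.conj().T @ psi))
    return 1 - abs(psi[0]) ** 2                           # ground state of -Z is |0>

for T in [2.0, 10.0, 50.0]:
    print("T =", T, " diabatic error =", final_error(T))
```

Longer sweeps leave the system closer to the instantaneous ground state, consistent with the $T \sim 1/\Delta^2$ heuristic above.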
Note that the boundary adiabatic theorem only tells us about the state at the end of the evolution, and does not actually tell us anything about the state in the middle of the evolution. For that there are "instantaneous" adiabatic theorems which bound the probability of being in the ground state throughout the entire evolution. For instance, one such bound is based on the Zeno stabilized adiabatic evolutions described in Section III C 3 [21]. These instantaneous adiabatic theorems have complexity $O(L^2/(\epsilon\Delta))$, where $L$ is the path length. In the case of simulated annealing, one can show that the path length is independent of $\Delta$, whereas in general the worst-case bound is $L \le \|\dot H\|/\Delta$, which yields $O(\|\dot H\|^2/(\epsilon\Delta^3))$ complexity [21]. It is not completely clear which style of adiabatic evolution will give the best results when using the approach as a heuristic, and so we discuss both here. With either approach we typically take $H_1$ to be the cost function of interest and $H_0$ to be a simple-to-implement Hamiltonian that does not commute with $H_1$ and has an easy-to-prepare ground state. For instance, a common choice is $H_0 = -\sum_{i=1}^{N} X_i$, where $X_i$ is the Pauli-$X$ operator on qubit $i$, so that the initial ground state is $|+\rangle^{\otimes N}$. Other $H_0$ Hamiltonians (or more complicated adiabatic paths) are also possible.
The simplest way to use the adiabatic algorithm as a heuristic is to discretize the evolution using product formulas. For instance, if we assume the adiabatic schedule in Eq. (127) then we could attempt to prepare the ground state as $\prod_{m=1}^{M} e^{-i(1 - s_m)H_0 T/M}\, e^{-i s_m H_1 T/M}\, |+\rangle^{\otimes N}$ with $s_m = m/M$, where $M$ is the number of first-order Trotter steps used to discretize the adiabatic evolution. The idea of the heuristic is to choose $M$ based on available resources. $T$ will also need to be chosen heuristically rather than from knowledge of the gap, which we do not expect to have in general. For fixed $M$, smaller $T$ enables a more precise approximation of the continuous-time algorithm, but smaller $T$ also means the system is less likely to remain adiabatic. Of course, one can also easily extend this strategy to higher-order product formulas, or to different adiabatic interpolations or adiabatic paths. For example, if we define the symmetric second-order step $U_2(s, \delta) = e^{-i(1-s)H_0\delta/2}\, e^{-i s H_1 \delta}\, e^{-i(1-s)H_0\delta/2}$, then higher-order versions of such integrators of order $\rho$ can be formed via Suzuki's recursive construction (for any $s \in [0,1]$): $U_\rho(s, \delta) = U_{\rho-2}(s, \gamma_\rho\delta)^2\, U_{\rho-2}(s, (1 - 4\gamma_\rho)\delta)\, U_{\rho-2}(s, \gamma_\rho\delta)^2$. Here $\gamma_\rho = (4 - 4^{1/(\rho-1)})^{-1}$, which approaches $1/3$ as the order of the formula, $\rho$, goes to infinity. Furthermore, the error in the order-$\rho$ Trotter-Suzuki approximation scales as $O((T/M)^{\rho+1} M)$, which results in near-linear scaling with $T$ for large $\rho$. In practice, however, since the number of exponentials in the Trotter-Suzuki formula grows exponentially with the order, there is an optimal tradeoff in gate complexity that is attained by a finite-order formula for fixed $\epsilon$ and $T$. For simplicity, we will assume that the time evolution under $H_0$ is much cheaper to implement than the time evolution under $H_1$. As $H_1$ can be implemented using the phase oracles $O_{\rm phase}$ discussed in Section II, the total cost of the procedure will be approximately $M\, C_{\rm phase}$.
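A minimal sketch of this first-order Trotterized sweep on a 3-qubit Ising instance is below (the couplings, schedule, and names are illustrative choices, not from the paper); the overlap with the ground space grows with the total time $T$:

```python
import numpy as np
from functools import reduce

# First-order Trotterized adiabatic sweep with H_0 = -sum_i X_i and a
# frustrated diagonal Ising cost H_1 on 3 qubits.
N = 3
z = np.array([[1 - 2 * ((idx >> i) & 1) for i in range(N)]
              for idx in range(2 ** N)])
cost = (z[:, 0] * z[:, 1] - z[:, 1] * z[:, 2] + z[:, 2] * z[:, 0]).astype(float)

def trotter_anneal(T, M=400):
    """Apply prod_m e^{-i(1-s_m)H_0 T/M} e^{-i s_m H_1 T/M} to |+>^N, s_m = m/M."""
    psi = np.full(2 ** N, 2 ** (-N / 2), dtype=complex)   # ground state of H_0
    for m in range(1, M + 1):
        s = m / M
        psi = np.exp(-1j * s * cost * T / M) * psi        # phase oracle step for H_1
        b = -(1 - s) * T / M                              # negative angle: H_0 = -sum X
        rx = np.array([[np.cos(b), -1j * np.sin(b)],
                       [-1j * np.sin(b), np.cos(b)]])     # e^{-i b X}
        psi = reduce(np.kron, [rx] * N) @ psi             # e^{-i(1-s) H_0 T/M}
    return psi

ground = np.where(np.isclose(cost, cost.min()))[0]        # degenerate ground space
for T in [1.0, 20.0]:
    p = sum(abs(trotter_anneal(T)[g]) ** 2 for g in ground)
    print("T =", T, " ground-space probability =", p)
```

A fast sweep ($T = 1$) leaves the state spread over many configurations, while the slower sweep concentrates probability on the minimizing configurations, exactly the heuristic tradeoff described above.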
This implies that, for a finite value of $M$, the cost of performing the heuristic optimization using the above adiabatic sequence is approximately $2 \times 5^{\rho/2-1} M\, C_{\rm phase}$ for an order-$\rho$ Suzuki formula. Assuming that our target error in the adiabatic sweep is $\epsilon$ and that $\Delta^2 \in O(\epsilon)$, it suffices to take $T \in O(1/\Delta^2)$, and after optimizing the cost over the formula order we find $M \in O(T^{1+o(1)}/\epsilon^{o(1)})$. Therefore the total cost obeys ${\rm Cost} \in O(C_{\rm phase}\,\Delta^{-2(1+o(1))}/\epsilon^{o(1)})$. Similarly, if we are interested in the limit where $\Delta^2 \in \omega(\epsilon)$, then boundary cancellation methods [53,54] can be used to improve the number of gates needed to reach the global optimum to $O(C_{\rm phase}(1/\Delta^2 + \log(1/\epsilon)/\Delta)^{1+o(1)})$. These results show that, provided the eigenvalue gap is at most polynomially small, we can use simulation routines for $e^{-iH_1 t}$ and $e^{-iH_0 t}$ to find the optimum in polynomial time. However, in practice we will likely want to use such an algorithm in a heuristic fashion wherein the timesteps do not precisely conform to the adiabatic schedule. To give the cost of a single step a little more precisely, we can also include the cost of implementing a transverse driving field. Since that involves applying a phase to $b_{\rm pha}$ bits on each of $N$ qubits, using repeat-until-success circuits this has cost $1.15\, N b_{\rm pha} + O(N)$ in terms of T gates, with a single ancilla qubit. It is also possible to sum the bits with Toffoli cost $N$, then phase by the sum with cost $b_{\rm pha}^2/2 + O(b_{\rm pha}\log b_{\rm pha})$ (accounting for multiplying the phase by a constant factor), though that has large ancilla cost. Using the tree sum approach would give complexity $4N + b_{\rm pha}^2/2 + O(b_{\rm pha}\log b_{\rm pha})$, with $3\log L + O(1)$ temporary ancillas. There will also be $b_{\rm grad} = b_{\rm pha} + O(\log b_{\rm pha})$ persistent ancillas needed for a phase gradient state, but in many cases that state will be the same as in other steps of the procedure, so it does not increase the ancilla cost. Using this approach, and omitting the factor of $2 \times 5^{\rho/2-1}$ for order-$\rho$ Suzuki, gives Toffoli cost $4N + b_{\rm pha}^2/2 + O(b_{\rm pha}\log b_{\rm pha})$ for a single step.

Heuristic adiabatic optimization using quantum walks
While the procedure we described for heuristically using the adiabatic algorithm with Trotter based methods is well known, it is less clear how one might heuristically use LCU methods with the adiabatic algorithm. One reason we might try doing this is because the qubitized quantum walks that we discuss in Section II are sometimes cheaper to implement than Trotter steps for some problems. One approach to using LCU methods for adiabatic state preparation might be to directly attempt to simulate the time-dependent Hamiltonian evolution using a Dyson series approach, as was recently suggested for the purpose of adiabatic state preparation in [56]. However, this would require fairly complicated circuits due to the many time registers that one must keep to index the time-ordered exponential operator. In principle, we could always use quantum signal processing (or more generally quantum singular value transformations) to convert the walk operator at time t into the form e −iH(t)δ for some timestep δ.
Instead, here we suggest a strategy that is something of a combination of qubitized quantum walks and a product formula approximation. Our method is unlikely to be asymptotically optimal for this purpose, but it is simple to implement and we suspect it would be cheaper than either a Dyson series approach or a Trotter approach for some applications on a small error-corrected quantum computer. The idea is to stroboscopically simulate time evolution as a short-time evolved "qubitized" walk. The result is that we actually simulate the adiabatic path generated by the arccosine of the normalized Hamiltonian $H(s)$ rather than the adiabatic path generated directly by $H(s)$, but we expect that the relevant part of the eigenspectrum will lie in the linear part of the arccosine, which means there will be little effect on the dynamics. The main challenge in this approach will be to artificially shrink the effective duration of these quantum walk steps so that the method can be refined.
In the following we will assume that ${\rm select}^2 = 1\!\!1$, which is to say that every term in the decomposition is self-adjoint (consistent with the problem Hamiltonians we consider). For every eigenvector $|\psi_k(t)\rangle$ of $H(t)$ with $H(t)|\psi_k(t)\rangle = E_k(t)|\psi_k(t)\rangle$, if we define $|L\rangle = {\rm prepare}\,|0\rangle$ then we can write ${\rm select}\left(|\psi_k(t)\rangle \otimes |L\rangle\right) = \frac{E_k(t)}{\lambda(t)}\, |\psi_k(t)\rangle \otimes |L\rangle + \sqrt{1 - \frac{E_k^2(t)}{\lambda^2(t)}}\, |\psi_k^\perp(t)\rangle$, where for each $k$ and $t$ we define $|\psi_k^\perp(t)\rangle$ to be the appropriately normalized state orthogonal to $|\psi_k(t)\rangle \otimes |L\rangle$. The walk operator can be seen as a direct sum of two operators, $W = W_H \oplus W_\perp$, where $W_H$ is the portion of the walk operator that acts non-trivially on the subspace generated by $|\psi_k(t), L\rangle = |\psi_k(t)\rangle \otimes |L\rangle$ and $W_\perp$ is the operator that acts on the remaining states; within each two-dimensional invariant subspace, $W_H$ acts as a rotation with eigenvalues $e^{\pm i\arccos(E_k(t)/\lambda(t))}$. It may be unclear how to implement a short time-step from $W(t)$, since the operation is only capable of applying unit-time evolutions. Fortunately, we can address this by supplementing the LCU decomposition, for any $r \ge 1$, with two additional terms proportional to $\pm\frac{(r-1)\lambda(t)}{2}\, 1\!\!1$ that cancel one another. In this case we can block encode the Hamiltonian using a unary encoding of the extra two operators in the prepare oracle, yielding a state $|L(t, r)\rangle$. The select oracles for this Hamiltonian require one additional control for each of the original terms in the Hamiltonian, and the additional terms need only a single Pauli-$Z$ gate to implement. We will define this operator to be ${\rm select}'$. With these two oracles defined, we can then describe the walk operator $W_r(t)$ for any fixed value of $t$ as $W_r(t) = (I - 2\, I \otimes |L(t, r)\rangle\langle L(t, r)|)\, {\rm select}'$.
This new Hamiltonian has exactly the same eigenvectors as $H(t)$, but its normalization $\lambda$ is larger by a factor of $r$. In particular, the walk operator restricted to the eigenspace supported by the instantaneous eigenvectors of $H(t)$ has eigenvalues $e^{\pm i \arccos(E_k(t)/(r\lambda(t)))}$. Using the fact that $\arccos(x) = \pi/2 - \arcsin(x)$, we have that, up to an irrelevant global phase, this operator can be written as $V_{H,r}(t) = e^{\mp i \arcsin(E_k(t)/(r\lambda(t)))}$ on each such eigenspace. Thus the operator $V_{H,r}(t)$ can be seen to generate a short time-step of duration $1/r$ for an effective Hamiltonian $H_{\rm eff}(t) = r \arcsin\!\left(\frac{H(t)}{r\lambda(t)}\right)$. Note that as $r \to \infty$ the eigenvalues of this Hamiltonian approach $\pm E_k(t)/\lambda(t)$, and more generally $r\arcsin\!\left(\frac{E_k(t)}{r\lambda(t)}\right) = \frac{E_k(t)}{\lambda(t)} + O(1/r^2)$. For any fixed value of $r$ we can choose an adiabatic path between an initial Hamiltonian and a final Hamiltonian. The accuracy of the adiabatic approximation depends strongly on how quickly we traverse this path, so it is customary to introduce a dimensionless time $s = t/T$, which allows us to easily change the speed without altering the shape of the adiabatic path. In Appendix B we show that the adiabatic theorem then implies that the number of steps of the quantum walk required to achieve error $\epsilon$ in an adiabatic state preparation for a maximum-rank Hamiltonian with gap $\Delta$ scales as $O(1/\epsilon^{3/2})$, with prefactors depending on $r$, $\lambda$, the derivatives of $H$, the gap $\Delta$, and the minimum value of $|E_k|$ (the precise bound is given in Appendix B). The reason this result depends on the minimum value of $E_k$ is an artifact of the fact that several of the eigenvalues of the walk operator can be mapped to 1 under repeated application of $W_r$. This can potentially alter the eigenvalue gaps for eigenvalues near zero, which impacts the result.
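The $e^{\pm i \arccos(E_k/\lambda)}$ eigenphase structure that this construction relies on can be verified numerically for a toy two-term Hamiltonian (our own illustrative construction, not from the paper):

```python
import numpy as np

# Check that W = ((2|L><L| - 1) (x) 1) select has eigenphases
# +/- arccos(E_k / lambda) for H = w_0 Z + w_1 X with lambda = w_0 + w_1.
w = np.array([0.6, 0.8])
lam = w.sum()
Z = np.array([[1, 0], [0, -1]], dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
H = w[0] * Z + w[1] * X

amps = np.sqrt(w / lam)                               # prepare|0> = sum_l sqrt(w_l/lam)|l>
R = np.kron(2 * np.outer(amps, amps) - np.eye(2), np.eye(2))
select = np.block([[Z, np.zeros((2, 2))],
                   [np.zeros((2, 2)), X]])            # |0><0| (x) Z + |1><1| (x) X
W = R @ select

E = np.linalg.eigvalsh(H)                             # exact eigenvalues of H (+/- 1 here)
expected = np.sort(np.concatenate([np.arccos(E / lam), -np.arccos(E / lam)]))
phases = np.sort(np.angle(np.linalg.eigvals(W)))
print(phases)
print(expected)
```

The walk's spectrum matches $\pm\arccos(E_k/\lambda)$ exactly; shrinking the effective step by enlarging $\lambda \to r\lambda$ correspondingly moves these phases toward zero.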
The key point behind this scaling is that it shows that as the number of time slices increases this heuristic converges to the true adiabatic path. Just as the intuition behind Trotterized adiabatic state preparation hinged on this fact, here this result shows that we can similarly use a programmable sequence of parameterizable walk operators to implement the dynamics. The main advantage relative to Trotter methods is that the price that we have to pay using this technique does not depend strongly on the number of terms in the Hamiltonian which can lead to advantages in cases where the problem or driver Hamiltonians are complex.
This scaling can be improved by using higher-order splitting formulas for the time evolution [57] and by using boundary cancellation methods to improve the scaling of the error in adiabatic state preparation. In general, if we assume that $\Delta \in O(1)$ for the problem at hand, then it is straightforward to see that we can improve the scaling from $O(1/\epsilon^{3/2})$ to $O(1/\epsilon^{o(1)})$ [53][54][55]. It is also worth noting that the bounds given above for the scaling with respect to the derivatives of the Hamiltonian and the coefficients of the Hamiltonian are expected to be quite loose, owing to the many simplifying bounds used to make the expression easy to use. On the other hand, the scaling with the gap and error tolerance is likely tighter.

Zeno projection of adiabatic path via phase randomization
The principle of adiabatic path traversal with phase randomization was given in [21,58]. This is essentially an alternative strategy for discretizing adiabatic evolutions. Here we summarize that method and show how to further optimize it. The principle of the Zeno approach is that we increment the parameter of the Hamiltonian ($s$ or $\beta$) by some small amount chosen so that the ground state of the new Hamiltonian retains large overlap with that of the previous Hamiltonian. One can then perform phase estimation to check that the system is still in the ground state.
In the case where the phase estimation verifies that the system is still in the ground state, one continues with incrementing the parameter. If the ground state is not obtained from the phase estimation, one could abort, in which case no output is given and one needs to restart. Because the probability of failure is low, one could just continue regardless, and check at the end. That means that the result of the phase estimation is discarded.
The phase estimation is performed with control qubits controlling the time of the evolution, followed by an inverse quantum Fourier transform on the control qubits to give the phase. But if the result of the measurement is ignored, then one can simply omit the inverse quantum Fourier transform and regard the control qubits as being measured in the computational basis with the result discarded. That is equivalent to randomly selecting values for these control qubits in the computational basis at the beginning. But if these qubits take random values in the computational basis, one can instead just classically randomly generate a time and perform the evolution for that time.
In performing a phase measurement using control qubits, one uses a superposition state on those control qubits, and the error in the phase measurement corresponds to the Fourier transform of those amplitudes. That is, with $b$ control qubits, we have a state of the form $\sum_{j=0}^{2^b-1} \chi_j e^{ij\phi} |j\rangle$, where $\phi$ is a phase that would correspond to $-E\,\delta t$, the energy eigenvalue of the Hamiltonian times the shortest evolution time. Then the phase measurement using the inverse quantum Fourier transform corresponds to the POVM $|\hat\phi\rangle\langle\hat\phi|$, with $|\hat\phi\rangle \propto \sum_{j=0}^{2^b-1} e^{ij\hat\phi} |j\rangle$. The probability distribution for the error $\delta\phi = \hat\phi - \phi$ is then given by $\Pr(\delta\phi) \propto \big|\sum_j \chi_j e^{-ij\,\delta\phi}\big|^2$. These measurements are therefore directly connected to the theory of window functions in spectral analysis. A particularly useful window to choose is the Kaiser window, because it gives exponential suppression of errors [59].
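The window-function connection can be made concrete numerically: take uniform control-register amplitudes (a rectangular window) versus a Kaiser window and compare the resulting error distributions via an FFT. This is an illustrative sketch; the register size and Kaiser parameter are our own choices:

```python
import numpy as np

b = 64          # number of control-register amplitudes chi_j
pad = 64        # zero-padding factor: gives a fine grid over delta phi

def error_distribution(window):
    """|sum_j chi_j e^{-i j dphi}|^2 on a fine grid, normalized to peak 1."""
    amps = window / np.linalg.norm(window)
    spec = np.abs(np.fft.fft(amps, b * pad)) ** 2
    return spec / spec.max()

rect = error_distribution(np.ones(b))            # rectangular window
kais = error_distribution(np.kaiser(b, 14.0))    # Kaiser window, beta = 14

def worst_tail(spec, guard=8 * pad):
    """Largest relative error probability outside a central guard band."""
    return spec[guard:-guard].max()

print("rectangular window tail:", worst_tail(rect))
print("kaiser(beta=14) tail:   ", worst_tail(kais))
```

With these parameters the Kaiser window's error tail is suppressed by many orders of magnitude relative to the rectangular window, at the price of a wider central lobe, which is exactly the tradeoff exploited here.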
In the case where the evolution time is chosen classically, it can be given by a real number, and we do not need any bound on the evolution time. Then the expected cost is the expectation value of |t|,

⟨|t|⟩ = ∫ dt |t| p_time(t).    (151)

Because there is no upper bound on t, we can obtain a probability distribution for the error that drops strictly to zero outside a given interval, rather than being exponentially suppressed. Still considering a coherent superposition for the moment, the state is a superposition over evolution times t with amplitudes χ_t, where E is the energy and p_time(t) = |χ_t|². The POVM then has elements |Ê⟩⟨Ê|, from which one obtains the probability distribution for the error in the measurement of E. An alternative description is that the system is in a superposition of eigenstates |ψ_j⟩ of the Hamiltonian with energies E_j. Evolving for time t chosen with probability p_time(t) then suppresses the coherences between eigenstates according to the Fourier transform of p_time. If the width of the Fourier transform of the probability distribution p_time is less than the spectral gap ∆, then the coherences between the ground state and the excited states are removed. In comparison, if Pr(δE) is equal to zero for |δE| ≥ E_max, then the same result will be obtained for 2E_max = ∆. This is what would be expected, because if p_time(t) = χ_t², then the Fourier transform of p_time is the autocorrelation of the Fourier transform of χ_t, and therefore has twice the width.
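The dephasing mechanism can be illustrated with a small classical simulation. The sketch below is illustrative only (a two-level example with a Gaussian distribution of evolution times, which is an arbitrary choice rather than one of the window functions discussed here): averaging the evolved state over random times suppresses the coherence between eigenstates by the Fourier transform of p_time evaluated at the gap.

```python
import numpy as np

rng = np.random.default_rng(0)
delta = 1.0          # spectral gap between the two eigenstates
sigma = 8.0 / delta  # width of the Gaussian distribution of evolution times

# Equal superposition of ground and excited state; the coherence is the
# off-diagonal density-matrix element, initially 1/2. Random-time evolution
# multiplies it by the average of exp(-i*delta*t).
ts = rng.normal(0.0, sigma, size=200_000)
coherence = 0.5 * np.mean(np.exp(-1j * delta * ts))

# For a Gaussian p_time, the Fourier transform at the gap is exp(-(delta*sigma)^2/2),
# so the coherence is driven to (numerically) zero while populations are untouched.
print(abs(coherence))   # ~1e-3 Monte Carlo noise around an essentially zero value
```

The same computation with one of the bounded-support windows discussed in the text would make the suppression exact once the window's Fourier width fits inside the gap.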
Next we consider appropriate probability distributions. For the probability distribution for t suggested in [21], one obtains ⟨|t|⟩ = 12 ln 2/(π∆), so ⟨|t|⟩∆ ≈ 2.648. There Pr(δE) is equivalent to the square of a triangle window, but better performance can be obtained by using the triangle window itself. Then the corresponding χ_t is obtained from the Fourier transform of Pr(δE) as

χ_t = [sin(∆t/2) C(√(∆t/π)) − cos(∆t/2) S(√(∆t/π))] / (∆t/2)^{3/2},

where C and S are the Fresnel integral functions. That gives ⟨|t|⟩ = 7/(3∆), so ⟨|t|⟩∆ ≈ 2.333. To find the optimal window, we take x to be the difference in energy divided by E_max, and expand the amplitude window as (1 − x²) times a polynomial in x. We use only even orders, so it is symmetric, and the factor of (1 − x²) ensures that it goes to zero at ±1. The expectation of the absolute value of the time is then a quadratic form aᵀA a/π in the vector a of polynomial coefficients. We also need, for normalization, the quadratic form aᵀB a with matrix entries

B_{kℓ} = 16 / {[2(k + ℓ) + 1][2(k + ℓ) + 3][2(k + ℓ) + 5]}.    (167)

Then defining b = B^{1/2} a, the normalization corresponds to ‖b‖ = 1. The minimum ⟨|t|⟩ corresponds to minimizing aᵀA a/π, which is equivalent to minimizing bᵀB^{−1/2}AB^{−1/2}b/π, so we need to find the minimum eigenvalue of B^{−1/2}AB^{−1/2}. That gives ⟨|t|⟩E_max ≈ 1.1580 with terms up to a_22 (a 46th-order polynomial). This explanation is for the case where there is Hamiltonian evolution for a time t that can take any real value. In the case of steps of a quantum walk with eigenvalues e^{±i arccos(H/λ)}, the number of steps takes an integer value. The Hamiltonian evolution could be implemented by steps of such a quantum walk as well, but it is more efficient to simply use the steps of that quantum walk directly without signal processing. To obtain the corresponding probability distribution for a discrete number of steps, we simply take the probability distribution for t at points separated by 1/λ. That yields a probability distribution for the error that is the same as for the continuous distribution, except with a periodicity of λ.
That periodicity has no effect on the error, because it is beyond the range of possible values for the energy. The reason for this correspondence is that taking the probability distribution at a series of discrete points corresponds to multiplying by a comb function, which is equivalent to convolving the error distribution with a comb function.
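The aliasing statement can be checked numerically. The following sketch (illustrative; the Gaussian p_time is an arbitrary choice) samples the time distribution at points separated by 1/λ and verifies that the resulting error characteristic function is periodic in energy with period 2πλ (the 2π factor depends on units):

```python
import numpy as np

lam = 4.0                          # walk parameter lambda; time spacing is 1/lam
sigma = 3.0
ts = np.arange(-400, 401) / lam    # discrete times separated by 1/lam
p = np.exp(-ts**2 / (2 * sigma**2))
p /= p.sum()                       # sampled (discrete) time distribution

def char_fn(E):
    # Characteristic function of the error distribution at energy offset E.
    return np.sum(p * np.exp(-1j * E * ts))

# Sampling in time makes the spectrum periodic with period 2*pi*lam:
E = 0.7
assert np.isclose(char_fn(E), char_fn(E + 2 * np.pi * lam))
print("periodic in E with period", 2 * np.pi * lam)
```

The periodicity is exact because every sampled time is an integer multiple of 1/λ, so shifting E by 2πλ multiplies each term by e^{−2πi n} = 1; as stated above, it is harmless as long as the period exceeds the range of possible energies.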

D. Szegedy walk based quantum simulated annealing
In the remainder of Section III we consider quantum simulated annealing, where the goal is to prepare a coherent analogue of a Gibbs state and cool to a low temperature. More specifically, the coherent Gibbs state is

|ψ_β⟩ = Σ_x √(π_β(x)) |x⟩,   π_β(x) = e^{−βE_x} / Z_β,

where β is the inverse temperature and Z_β is the partition function. For annealing, we have transition probabilities of obtaining y from x, denoted Pr(y|x), which must satisfy the detailed balance condition Pr(y|x) π_β(x) = Pr(x|y) π_β(y).
The detailed balance condition ensures that π_β is the equilibrium distribution of these transition probabilities. For the costings in this work we take the Metropolis rule

Pr(y|x) = (1/N) min(1, exp(−β(E_y − E_x)))

for y differing from x by a single bit flip, and Pr(x|x) = 1 − Σ_{y≠x} Pr(y|x). This choice is similar to that in [23]. Another choice, used in [14], is Pr(y|x) = χ exp(β(E_x − E_y)) for χ chosen to prevent sums of probabilities greater than 1. If one were to construct a Hamiltonian with matrix elements

⟨x| H_β |y⟩ = δ_{x,y} − √(Pr(x|y) Pr(y|x)),

then the detailed balance condition ensures that the ground state is |ψ_β⟩ with eigenvalue zero. One can then apply an adiabatic evolution under this Hamiltonian to gradually reduce the temperature (increase β).
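As a concrete check of these definitions, the following sketch (illustrative, assuming the single-bit-flip Metropolis rule above with randomly chosen energies) builds the transition matrix for a small system and verifies detailed balance and that the coherent Gibbs state is a zero-eigenvalue eigenstate of H_β.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3                      # number of bits; 2**n basis states
N = n                      # number of single-bit-flip moves
beta = 0.7
E = rng.normal(size=2**n)  # random energies E_x

# Metropolis transition probabilities for single bit flips.
P = np.zeros((2**n, 2**n))
for x in range(2**n):
    for j in range(n):
        y = x ^ (1 << j)
        P[y, x] = min(1.0, np.exp(-beta * (E[y] - E[x]))) / N
    P[x, x] = 1.0 - P[:, x].sum()

pi = np.exp(-beta * E)
pi /= pi.sum()             # Gibbs distribution pi_beta

# Detailed balance: Pr(y|x) pi(x) = Pr(x|y) pi(y).
assert np.allclose(P * pi[None, :], (P * pi[None, :]).T)

# H_beta with <x|H|y> = delta_{x,y} - sqrt(Pr(x|y) Pr(y|x)) annihilates
# the coherent Gibbs state amplitudes sqrt(pi_beta(x)).
H = np.eye(2**n) - np.sqrt(P * P.T)
psi = np.sqrt(pi)
print(np.abs(H @ psi).max())   # ~0: |psi_beta> is the zero-eigenvalue ground state
```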
In the approach of [13], the method is to instead construct a quantum walk for which the coherent Gibbs state is an eigenstate. One could change the value of β between each step of the quantum walk, similarly to the adiabatic algorithm for the Hamiltonian. Alternatively, for each value of β one can apply a measurement of the walk operator to project the state onto |ψ_β⟩ via the quantum Zeno effect. Reference [13] also proposes using a random number of steps of the walk operator to achieve the same effect as the measurement. The advantage of using the quantum walk is that the complexity scales as O(1/√δ), where δ is the spectral gap of H_β, rather than the O(1/δ) that is the best rigorous bound for the scaling of (classical) simulated annealing.
The quantum walk used in [13] is based on a Szegedy walk, which involves a controlled state preparation, a swap between the system and the ancilla, and inversion of the controlled state preparation. Then a reflection on the ancilla is required. The sequence of operations is as shown in Figure 8. The dimension of the ancilla needed is the same as the dimension of the system. The reflection and swap have low cost, so the Toffoli cost is dominated by the cost of the controlled state preparation.
The Szegedy approach builds a quantum walk in a similar way to the LCU approach in Figure 2, where there is a block-encoded operation followed by a reflection [28]. That is, preparation of the ancilla in the state |0⟩, followed by unitary operations U and projection onto |0⟩ on the ancilla, would yield the block-encoded operator A = ⟨0| U |0⟩.
Instead of performing a measurement on the ancilla, the reflection about |0⟩ results in a joint operation that has eigenvalues related to the eigenvalues of A as e^{±i arccos a}, where a is an eigenvalue of A.
Here the controlled state preparation is of the form

|x⟩|0⟩ → |x⟩ Σ_y √(Pr(y|x)) |y⟩,

where the sum is taken over all y that differ from x by at most one bit. As a result, the block-encoded operation is

Σ_{x,y} √(Pr(x|y) Pr(y|x)) |y⟩⟨x|.
Thus the block-encoded operation has matrix elements √(Pr(x|y) Pr(y|x)), which is equivalent to 𝟙 − H_β. Therefore the coherent Gibbs state |ψ_β⟩ is an eigenstate of this operation with eigenvalue 1. Combining this operation with the reflection gives a step of a quantum walk with eigenvalues corresponding to the arccosine of those of the block-encoded operator [60,61]. It is this arccosine that provides the square-root improvement in the scaling with the spectral gap: if the block-encoded operation has gap δE from the eigenvalue 1 for the target state, taking the arccosine yields a gap of approximately √(2 δE) for the quantum walk. This gap governs the complexity of the algorithm based on the quantum walk.
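The arccosine gap amplification is easy to check numerically; this short sketch (illustrative only) compares arccos(1 − δE) against √(2 δE) for a range of small gaps.

```python
import numpy as np

# Eigenvalue 1 maps to phase 0; eigenvalue 1 - dE maps to phase arccos(1 - dE).
for dE in [1e-2, 1e-4, 1e-6]:
    walk_gap = np.arccos(1.0 - dE)
    print(dE, walk_gap, np.sqrt(2.0 * dE))
# The phase gap ~ sqrt(2*dE) is quadratically larger than the gap dE itself.
```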
In implementing the step of the walk, the state preparation requires calculation of each of the Pr(y|x) for a given x. In turn these require computing the energy difference under a bit flip, and the exponential. The probability Pr(x|x) is computed from the formula Pr(x|x) = 1 − Σ_{y≠x} Pr(y|x) required for normalization of the probabilities. To prepare the state, one can first prepare a state with amplitudes √(Pr(x_k|x)) on |k⟩, where x_k indicates that bit k of x has been flipped, k = 0 indicates no bit flip, and |k⟩ is encoded in one-hot unary. The state |α_x⟩ can then be prepared by applying CNOTs between the respective bits of the two registers. In order to prepare the state |ψ_x⟩ in unary, an obvious method is to perform a sequence of controlled rotations depending on the transition probabilities. However, that tends to be expensive, because our method of performing rotations involves multiplications, and high precision is required because the error in each rotation accumulates. A better method can be obtained by noting that the amplitudes for k > 0 are limited. We can then perform the state preparation by the following method.
1. Compute N Pr(x_k|x) for all N bit flips, and subtract those values from N to obtain N Pr(x|x). Note that N Pr(x_k|x) ≤ 1, and we compute this value to b_sm bits. The value of N Pr(x|x) will need ⌈log N⌉ + b_sm bits, but only the leading b_sm bits can be regarded as reliable. The complexity of the subtractions is N(⌈log N⌉ + b_sm).

2. We have N qubits in the target system K on which we need to prepare the state, and five ancilla registers: A, B, and C are single-qubit ancillas, and Z and ZZ are s-qubit ancillas. Apply Hadamards to the ancillas to give equal superpositions on all except ZZ and B.
3. Controlled on ancilla A, prepare an equal superposition state on ⌈log N⌉ qubits of K. If N is a power of 2, this can be performed with ⌈log N⌉ controlled Hadamards, each of which can be performed with two T gates. It is also possible to prepare an equal superposition for N not a power of 2 with complexity O(log N). For more details see Section III E 2.
4. Map the binary encoding to unary in place, with cost no more than N − ⌈log N⌉.

5. Compute the approximate square of z, denoted z̃², placing the result in register ZZ. The complexity is no greater than s²/2, as discussed in Appendix C 6. To obtain b_sm bits of precision in the square, we need to take s = b_sm + O(log b_sm), giving complexity b_sm²/2 + O(b_sm log b_sm).

6. For each k = 1, . . . , N, perform an inequality test between N Pr(x_k|x) and z̃² in the ZZ register, controlled by qubit k of K, placing the result in B. This has cost N b_sm Toffolis.
7. Controlled on ancilla A being zero, perform an inequality test between N Pr(x|x) and N z̃², with the output in B. The inequality test has complexity b_sm. In the case where N is not a power of 2, multiplying by N has complexity approximately b_sm² + O(b_sm log b_sm) to obtain b_sm bits, and we incur this cost twice, once for computation and once for uncomputation. If N is a power of 2 the multiplication by N has no cost. We obtain a state in which the transition probabilities appear as approximations, with the imprecision primarily due to the imprecise squaring of z.
8. Uncompute z̃² in register ZZ, with complexity no more than s²/2.
9. Use a sequence of CNOTs with the N qubits of K as controls and ancilla A as target. This will reset A to zero.
10. Perform Hadamards on the qubits of K. The resulting state contains the desired component together with a component |ψ⊥⟩ perpendicular to zero states on Z, B, and C.
11. Now conditioned on |0⟩_Z |0⟩_B |0⟩_C, we have the correct state with amplitude approximately 1/2. We simply need to perform one round of amplitude amplification: reflect about |0⟩_Z |0⟩_B |0⟩_C, invert steps 10 down to 2, reflect about zero, then perform steps 2 to 10 again. In the limit of large s we then have the correct state. As well as incurring three times the cost of steps 2 to 10, we have a cost of N + O(b_sm) for the reflection.
The overall Toffoli complexity of this procedure, excluding the computation of the Pr(x_k|x), is

N(⌈log N⌉ + b_sm) + N + 3[N + 3b_sm² + (N + 1)b_sm + O(log N + b_sm log b_sm)].

Here the first term is for the subtractions in step 1, the second term N is for the reflection, and the terms inside the square brackets are from steps 2 to 10. In the square brackets, N is for the binary-to-unary conversion, b_sm² is for computation and inverse computation of z̃², 2b_sm² is for multiplication by N (computation and uncomputation), which is only needed for N not a power of two, and (N + 1)b_sm is for the N + 1 inequality tests. The cost log N in the order term is for the controlled preparation of an equal superposition state, and b_sm log b_sm is the order term for the squaring and multiplication.
Note that the preparation will not be performed perfectly, because the initial amplitude is not exactly 1/2. We will use a flag qubit to indicate success, which will control the swap. To see the effect of this procedure, suppose the system is in basis state x. Then the prepared state has a success component, flagged by the first qubit, with amplitudes μ_y on the desired transitions, and a failure component with amplitude ν_x and some state φ_x that can depend on x. Here we have ignored the imperfect approximation of Pr(y|x), and are focusing just on the imperfect success probability. The swap is then only performed in the case of success. The net effect is that the imperfect preparation gives a qubitized step corresponding to slightly lower transition probabilities, which should have only a minor effect on the optimization.
The cost of the quantum walk in this approach is primarily in computing all the transition probabilities N Pr(x_k|x). If we were only concerned with the inequality tests for k > 0, then we could incur that cost only once with a simple modification of the above scheme. The problem is that we also need N Pr(x|x), which requires computing all of the N Pr(x_k|x). The steps for computing each N Pr(x_k|x) are as follows.
1. Query the energy difference oracle to find the energy difference δE of a proposed transition to b_dif bits.

2. Calculate exp(−β δE) to b_sm bits using the QROM/interpolation method from Section II E.
The costs for the energy difference oracles were discussed in Section II A, and are as given in Table IV. In this table, the costs for the energy difference oracles for the L-term spin model and LABS problem are obtained by evaluating the energy twice. Computing N values of the energy difference would suggest multiplying this cost by N, but we can save computation by calculating the energy for x only once, and then computing the energy for each of the x_k. That means the cost for these problem Hamiltonians can be given as the cost of a single energy evaluation multiplied by N + 1. For QUBO and the SK model it is considerably more efficient to compute the energy difference than the energy, so in these cases we simply compute the energy difference N times. The number of output registers is increased by a factor of N in all cases. For the cases where we compute the starting energy and the N energies under bit flips, we can compute the starting energy first, copy it into the N outputs, and subtract the energy under the bit flip from each of the output registers. In summary, the complexity can be given as the minimum of N + 1 times the cost of the energy oracle and N times the cost of the energy difference oracle.
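To illustrate why the energy difference is cheaper than the full energy for models like SK, the sketch below (illustrative classical code, using the assumed sign convention E(s) = −(1/2) sᵀJ s, which is not fixed by the text) computes a single-flip energy difference with O(N) arithmetic rather than re-evaluating the O(N²) energy.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8
J = rng.normal(size=(N, N))
J = np.triu(J, 1); J = J + J.T        # symmetric couplings, zero diagonal
s = rng.choice([-1, 1], size=N)       # spin configuration

def energy(spins):
    return -0.5 * spins @ J @ spins   # full evaluation: O(N^2) terms

def flip_diff(spins, k):
    # E(flip k) - E(spins): only the terms touching spin k change.
    return 2.0 * spins[k] * (J[k] @ spins)   # O(N) terms

for k in range(N):
    t = s.copy(); t[k] = -t[k]
    assert np.isclose(energy(t) - energy(s), flip_diff(s, k))
print("all single-flip differences agree")
```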
To perform the state preparation, we need to compute the energy differences, use those to compute the transition probabilities, prepare the state, then uncompute the transition probabilities and energy differences. In each step of the Szegedy walk as shown in Figure 8, we need to perform the controlled preparation and its inverse, which means that the energy differences would need to be computed four times for each step. However, we can save a factor of two by taking the controlled preparation and moving it to the end of the step, as shown in Figure 9. The reason we can save a factor of two is that then, in between the controlled inverse preparation and preparation, there is a reflection on the target, but the control is not changed. That means we can keep the values of the energy differences and transition probabilities computed in the controlled inverse preparation without uncomputing them, and only uncompute them after the controlled preparation.
This approach does not change the effect of a sequence of steps if β is kept constant. However, if β is changed between steps, then the procedure shown in Figure 8 will differ from that obtained by moving the controlled preparation to the end of the step. That is, the value of β is then changed at the swap operation rather than at the reflection.
FIG. 8. The qubitized quantum walk operator W using the Szegedy approach.

FIG. 9. The quantum walk operator using the Szegedy approach, where we have moved the controlled preparation to the end.
We have verified numerically that this does not change the performance. Because there is a factor of 2 rather than 4, the cost of the energy difference and function evaluations is halved. Adding twice the complexity of the state preparation from Eq. (182) gives the total complexity. Here we have omitted b_sm log b_sm in the order term because it is smaller than N for the parameter ranges we are interested in. The term 9b_sm² includes 3b_sm² from squaring and 6b_sm² from multiplication. In the case where N is a power of 2 the cost of 6b_sm² can be omitted. To evaluate the number of ancillas needed, we need to distinguish between the persistent ancillas and temporary ancillas in Table IV. This is because the persistent ancillas need to be multiplied by N, whereas the temporary ancillas are reused, so we only need to take the maximum. Considering the persistent ancillas first, the ancilla costs are as follows.
1. The N qubits for the Szegedy walk for the copy of the system.

For the temporary ancillas, we have contributions from the energy difference evaluation, the function evaluation, and the state preparation. Since these operations are not done concurrently, we can take the maximum of the costs. The most significant will be that for the state preparation. In the state preparation we have the following costs.

1. Ancilla ZZ has b_sm + O(log b_sm) qubits, and it is temporary because it is uncomputed.
2. If N is not a power of 2, then we need another b_sm + O(log b_sm) qubits for an ancilla holding N z̃².
3. We use b_sm + O(log b_sm) qubits for squaring, or 2b_sm + O(log b_sm) qubits if we are performing the multiplication by N.
As a result, the temporary ancilla cost is 2b_sm + O(log b_sm) qubits if N is a power of 2, or 4b_sm + O(log b_sm) otherwise.
Considering the worst case that N is not a power of 2, this temporary ancilla cost is larger than that for the difference function evaluation, giving the total ancilla cost.

E. LHPST qubitized walk based quantum simulated annealing

The same quantum walk approach to quantum simulated annealing can be achieved using an improved form of quantum walk given by Lemieux, Heim, Poulin, Svore, and Troyer (LHPST) [23] that requires only the computation of a single transition probability for each step. Here we provide an improved implementation of that quantum walk that can be efficiently achieved for more general types of cost Hamiltonians than considered in [23]. The step of the walk is built from operations V, B, F, and R, where p_{x,y} = N Pr(y|x) in the notation used above, and we have specialized to an equal superposition over j and only single bit flips. This walk is equivalent to the Szegedy approach of [13] because it yields the same block-encoded operation. That is, ⟨0| V† B† F B V |0⟩ has matrix elements √(Pr(x|y) Pr(y|x)), which can be shown by following the sequence of operations through. Just as with the Szegedy approach, most operations are trivial to perform, and the key difficulty is in the operation B, which depends on the transition probability. However, B only depends on one transition probability, whereas the Szegedy approach requires computing all the transition probabilities for a state preparation. Lemieux et al. [23] propose a method for the B operation that is not useful for the cost Hamiltonians considered here, but is useful for Hamiltonians with low connectivity. Instead of computing the energy difference then the exponential, they consider an approach where the required angle of rotation is found from a database.
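The claim that the V, B, F sequence block encodes √(Pr(x|y) Pr(y|x)) can be checked with a small classical simulation. The sketch below is illustrative: it assumes B rotates the coin qubit C by angle arcsin √(p_{x,x_j}) and F flips bit j when C = 1 (consistent with the description in the text), and tracks the amplitude returned to the zero ancilla sector branch by branch.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3; N = n; beta = 0.9
E = rng.normal(size=2**n)

# Metropolis transition matrix with single bit flips, Pr(y|x) = p_{x,y} / N.
P = np.zeros((2**n, 2**n))
for x in range(2**n):
    for j in range(n):
        y = x ^ (1 << j)
        P[y, x] = min(1.0, np.exp(-beta * (E[y] - E[x]))) / N
    P[x, x] = 1.0 - P[:, x].sum()

# Amplitude <y, 0_M, 0_C| V^dag B^dag F B V |x, 0_M, 0_C>, branch by branch:
# V spreads 1/sqrt(N) over j; B puts sqrt(p_{x,x_j}) on C=1; F flips bit j
# when C=1; B^dag and V^dag then return amplitude to the zero ancilla sector.
A = np.zeros((2**n, 2**n))
for x in range(2**n):
    for j in range(n):
        y = x ^ (1 << j)
        p1 = N * P[y, x]                      # p_{x, x_j}
        p2 = N * P[x, y]                      # p_{x_j, x}
        A[y, x] += np.sqrt(p1 * p2) / N       # C=1 branch: bit j flipped
        A[x, x] += (1.0 - p1) / N             # C=0 branch: no flip
# Off-diagonal elements are sqrt(Pr(x|y) Pr(y|x)); the diagonal is Pr(x|x).
assert np.allclose(A, np.sqrt(P * P.T))
print("block encoding matches sqrt(Pr(x|y) Pr(y|x))")
```

By detailed balance, the coherent Gibbs amplitudes √(π_β(x)) form a +1 eigenvector of this block-encoded operator, matching the Szegedy construction.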
That is, one considers the qubits that the transition probability for the move (here a bit flip) depends on, and classically precomputes the rotation angle for each basis state on those qubits. For each value of j, one sequentially performs a multiply-controlled Toffoli for each computational basis state of these qubits, and performs the required rotation on the ancilla qubit C. The complexity given by [23] is O(2^{|N_j|} |N_j| log(1/ε)), where |N_j| is the number of qubits that the transition probability for move j depends on. That complexity is a slight overestimate, because each multiply-controlled Toffoli has a cost of |N_j|, and the cost of the rotation synthesis is O(log(1/ε)). It should also be noted that this is the cost for each value of j, and there are N values of j, giving an overall cost of O(N 2^{|N_j|} [|N_j| + log(1/ε)]).
To improve the complexity, one can divide this procedure into two parts, where first a QROM is used to output the desired rotation angle into an ancilla register, and then those qubits are used to control a rotation. Using the QROM procedure of [26] to output the required rotation, the cost in terms of Toffoli gates is O(N 2^{|N_j|}). Then one can apply rotations using the phase gradient state, which was discussed above in Section II C. Addition of the register containing the rotation into an ancilla in the state |φ⟩ from Eq. (34) results in a phase rotation. To rotate the qubit, simply make the addition controlled by this qubit, and use Clifford gates before and after so that the rotation is in the y-direction. The cost of this rotation is O(log(1/ε)) Toffolis; for more details see Appendix A. With that improvement the complexity is reduced to O(N 2^{|N_j|} + log(1/ε)).
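The phase gradient mechanism can be illustrated classically: adding an integer a into a register holding a phase gradient state only produces a phase e^{2πi a/2^b}, which becomes a relative phase (a rotation) when the addition is controlled on a qubit. A minimal numpy sketch follows; the normalization and sign conventions here are assumptions for illustration, not necessarily those of the paper's Eq. (34).

```python
import numpy as np

b = 6
dim = 2**b
t = np.arange(dim)
phi = np.exp(-2j * np.pi * t / dim) / np.sqrt(dim)   # phase gradient state

a = 13                     # value held in the rotation-angle register
# Modular addition |t> -> |t + a mod 2^b> permutes the amplitudes:
added = np.roll(phi, a)    # amplitude at t becomes the old amplitude at t - a

ratio = added / phi
# The state is unchanged up to a global phase e^{2*pi*i*a/2^b} -- which becomes
# a controlled rotation when the addition is controlled on the target qubit.
assert np.allclose(ratio, np.exp(2j * np.pi * a / dim))
print("addition by", a, "imparts phase exp(2*pi*i*", a, "/", dim, ")")
```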
Even with that improvement, any procedure of this type is exponential in the number of qubits that the energy difference depends on, |N_j|. That is acceptable for the types of Hamiltonians considered in [23], but here we consider Hamiltonians typical of real-world problems, where the energy difference will depend on most of the system qubits because the Hamiltonians have high connectivity. We thus propose alternative procedures to achieve the rotation B.

Rotation B
We propose a completely different method to perform the rotation B than that of LHPST [23]. We first compute the energy difference E_x − E_{x_j}, then the rotation angle arcsin √(p_{x,x_j}), with the result put in an ancilla register. The rotation of the qubit ancilla C is controlled on the value of this ancilla as explained above, and then the value of arcsin √(p_{x,x_j}) is uncomputed. There are many possible approaches to the computation of arcsin √(p_{x,x_j}), for example that of [62]. For the purposes of quantum optimization, we expect that we will not need to compute this function to high precision, as long as the function we compute is still monotonic in the actual energy, so there is the opportunity to use methods that are very efficient for low precision but would not be suitable for high precision. We propose a method based on a piecewise linear approximation, with the coefficients output by a QROM, as described in Section II E. One could then apply the controlled rotation with cost b_sm Toffolis using the phase gradient state in Eq. (34), as described in detail in Appendix A. After uncomputing the rotation angle we would have implemented B. That approach would mean that a single step of the walk has four times the cost of computing arcsin √(p_{x,x_j}), because it needs to be computed and uncomputed for B, and the operation B is applied twice in each step. It is possible to halve that cost by computing and uncomputing only once in a step, retaining the value of arcsin √(p_{x,x_j}) during the F operation. Because F is a controlled flip of bit j of x, it would reverse the roles of x and x_j, and the sign of E_x − E_{x_j} would be flipped. In more detail, the procedure is as follows.
1. Compute the energy difference between x and x_j, namely E_x − E_{x_j}.

2. Compute arcsin √(p_{x,x_j}).
3. If E_{x_j} < E_x, then perform an X operation on the qubit C. That can be achieved with a CNOT (Clifford) controlled by the sign bit of E_x − E_{x_j}.
4. The remaining rotations, for the case E_{x_j} > E_x, need to be controlled on the sign bit being −1.
5. When we apply F, as well as applying the Toffolis to change x to x_j, we need to flip the sign bit of E_x − E_{x_j} controlled on qubit C. This is another CNOT, with no non-Clifford cost.
6. Then at the end we uncompute arcsin √(p_{x,x_j}) and E_x − E_{x_j}.
This procedure assumes that E_x − E_{x_j} is represented with an explicit sign bit. The computation of E_x − E_{x_j} uses two's complement, so there will be an additional cost of b_dif to switch to the signed representation. Because there is only a factor of two instead of four, the overall cost will then be 2C_diff + 2C_fun + 2b_dif + O(1). Next we consider the other (simpler) operations used in the step of the quantum walk.

Equal superposition V
The operation V generates the equal superposition over the N values of j starting from a zero state. In the case where N is a power of 2, we can create the equal superposition over binary encodings using Hadamards (and no Toffolis). More generally, if we wish to create an equal superposition where the number of items is not a power of 2, we can rotate an ancilla qubit such that the net amplitude is 1/2 for |1⟩|1⟩ on the result of an inequality test and the ancilla qubit. We can then perform a single step of amplitude amplification to obtain the superposition state. Our procedure is explained below and gives a complexity of 4⌈log N⌉ + O(1) Toffolis. Our method for V is also very different from that of LHPST [23]. There they proposed encoding the M register in unary, whereas here we use binary, which greatly reduces the ancilla cost (making it sublinear in N). Moreover, LHPST did not consider using equal superpositions in cases where N is not a power of 2, and instead just allowed for a constant-factor overhead in the complexity.
Our procedure to create an equal superposition over N < 2^k items is as follows. With Hadamards we prepare an equal superposition over the 2^k basis states |j⟩. Then we perform an inequality test between j and N, flagging success for j < N. This is an inequality test on k bits, and since it is an inequality test with a constant we save a Toffoli gate; the cost is therefore k − 2 Toffolis, as per the explanation in [39]. We would have an amplitude of √(N/2^k) for success, and would aim to multiply it by another amplitude of approximately (1/2)√(2^k/N), so that the net amplitude is 1/2 and we can use a single step of amplitude amplification. For an amplitude of (1/2)√(2^k/N), we can rotate another register according to the procedure in Appendix A. We can then perform a single step of amplitude amplification for |1⟩ on both this qubit and the result of the inequality test.
The steps needed in the amplitude amplification and their costs are as follows.

1. The rotation on the ancilla qubit. If we use the procedure for the rotation with s bits, it takes s − 3 Toffolis, because the angle of rotation is given classically.

2. A first reflection on the rotated qubit and the result of the inequality test. This just needs a controlled phase (a Clifford gate).
3. Inverting the rotation and the inequality test, with cost k + s − 5.
4. Hadamards, then reflection of the k qubits and the single-qubit ancilla about zero (k − 1 Toffolis).

5. Applying the inequality test again (k − 2 Toffolis).
The total cost is 4k + 2s − 13 Toffolis. Conditioned on success of the inequality test, the state is the desired equal superposition, and the probability of success is given by the normalization. It is found that highly accurate results are obtained for s = 7, as shown in Fig. 10. This procedure enables construction of equal superposition states, flagged by an ancilla qubit, for N not a power of 2. If we take s = 7, then the cost is 4k + 1 Toffolis.
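The success probability of this preparation can be estimated classically. The sketch below is illustrative only: it assumes the s-bit rotation rounds the target angle to the nearest multiple of 2π/2^s (the actual procedure of Appendix A may differ) and applies the standard amplitude amplification relation sin(3θ) for one round.

```python
import numpy as np

def success_probability(N, s):
    """Estimated success probability of the equal-superposition preparation."""
    k = int(np.ceil(np.log2(N)))
    a_test = np.sqrt(N / 2**k)                 # amplitude from the j < N test
    target = np.arcsin(0.5 * np.sqrt(2**k / N))
    # Assumed model of the s-bit rotation: round to a multiple of 2*pi/2**s.
    angle = np.round(target / (2 * np.pi / 2**s)) * (2 * np.pi / 2**s)
    a = a_test * np.sin(angle)                 # combined amplitude, close to 1/2
    return np.sin(3 * np.arcsin(a))**2         # one round of amplitude amplification

for N in [3, 5, 6, 7, 11, 100]:
    print(N, success_probability(N, s=7))      # all close to 1
```

This matches the observation that s = 7 already gives highly accurate results: one amplification round maps amplitude 1/2 exactly to 1, and the s-bit rounding perturbs it only slightly.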

Controlled bit flip F
We also need to modify the operation F compared to that in LHPST to account for the M register being encoded in binary. This operation flips bit j of x when the control qubit C is in the state |1⟩. It can be achieved using the iteration procedure of [26] with Toffoli complexity N, which allows us to perform the operation with register M encoded in binary. A complication is that, in the case where N is not a power of 2, there is a nonzero probability of the state preparation in V failing. We should only perform the operation F in the case where the state preparation succeeded. We include another two Toffolis to create and erase a qubit that flags whether the C register is in the state |1⟩ and the state preparation succeeded. Because the other operations are inverted, in the case that the state preparation does not work the net operation performed is the identity.
To be more specific, V prepares a state with a first register flagging success. Since we only perform F when the flag qubit is in the state |1⟩, the block-encoded operation consists of two terms. The first is the desired operation that would be obtained if the equal superposition state were prepared exactly (with a multiplying factor corresponding to the probability of success), and the second is proportional to the identity 𝟙 on the target system. This small offset by the identity just gives a trivial shift to the eigenvalues.

Reflection R
This operation applies a phase flip for zero on the ancillas. As well as the ancillas M and C, this reflection also needs to act on any ancilla qubits used in the preparation of the equal superposition state, including the flag qubit. There are ⌈log N⌉ qubits used to encode j, one qubit for C, and one ancilla used for the rotation, for a total of ⌈log N⌉ + 2 qubits. Therefore the number of Toffolis needed for the reflection about zero is ⌈log N⌉.

Total costs
The total Toffoli costs of implementing Ũ_W = R V† B† F B V are as follows.
1. The cost of V and V† is 8⌈log N⌉ + O(1).
2. The cost of F is N Toffolis.
3. The cost of R is log N .

4. The cost of two applications of B is 2C_diff + 2C_fun + 2b_dif + O(1).
The total cost of a step is then

2C_diff + 2C_fun + 2b_dif + N + 9⌈log N⌉ + O(1).

Note that 8⌈log N⌉ of this cost is for preparing equal superposition states, and can be omitted if N is a power of 2.
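The itemized costs above can be collected into a small helper (illustrative; the constant O(1) terms are dropped, and the values of C_diff, C_fun, and b_dif are problem-dependent inputs, chosen arbitrarily in the example):

```python
import math

def lhpst_step_toffolis(N, C_diff, C_fun, b_dif):
    """Approximate Toffoli count for one LHPST walk step (O(1) terms dropped)."""
    k = math.ceil(math.log2(N))
    cost_V = 8 * k                            # V and V-dagger (drop if N = 2^k)
    cost_F = N                                # controlled bit flip
    cost_R = k                                # reflection about zero
    cost_B = 2 * C_diff + 2 * C_fun + 2 * b_dif   # two applications of B
    return cost_V + cost_F + cost_R + cost_B

# Example with hypothetical oracle costs:
print(lhpst_step_toffolis(N=100, C_diff=500, C_fun=200, b_dif=16))   # 1595
```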
The ancilla qubits needed are as follows.
1. The ancilla registers M and C need log N + 1 qubits.
2. The resource state used to implement the controlled rotations needs b sm qubits.
3. The ancilla requirements of the energy difference and function evaluation oracles.
For the temporary ancilla cost, we need to take the maximum of that for the energy difference and function evaluation, giving the total ancilla cost.

F. Spectral gap amplification based quantum simulated annealing

An alternative, and potentially simpler, approach to preparing a low-temperature thermal state is given by [14]. The idea behind this approach is to construct a Hamiltonian whose ground state is a purification of the Gibbs state. Similarly to the case with the quantum walk, one can start with an equal superposition state corresponding to infinite temperature, and simulate the Hamiltonian evolution starting from β = 0 while gradually increasing β. This can correspond to using an adiabatic approach on this Hamiltonian, or one can instead apply a quantum Zeno approach via phase measurements on the Hamiltonian evolution, or apply Hamiltonian evolutions for randomly chosen times.
A simple choice of Hamiltonian is similar to the block-encoded operation for the quantum walks, and so has a small spectral gap. In order to obtain a speedup, one needs to construct a new Hamiltonian whose spectral gap is the square root of that of the original Hamiltonian, thus yielding the same speedup as the quantum walks. That procedure, from [14], is called spectral gap amplification. Simulating the resulting time-dependent Hamiltonian, for example using a Dyson series, has significant complexity.
To avoid that complexity, here we suggest that one instead construct a step of a quantum walk using a linear combination of unitaries. Such a quantum walk could be used to simulate the Hamiltonian evolution, but as discussed in [60,61] one can instead just perform steps of the quantum walk, which has eigenvalues that are the exponential of the arccosine of those of the Hamiltonian. By applying the steps of the quantum walk we can obtain the advantage of the spectral gap amplification without the difficulty of needing to simulate a time-dependent Hamiltonian.

Implementing the Hamiltonian
In order to implement the Hamiltonian, we will use a linear combination of unitaries. We can rewrite the square root of the Hamiltonian in terms of operators H_β,k. Each H_β,k is a 2-sparse Hamiltonian, so summing over k to obtain A_β gives a 2N-sparse Hamiltonian. To express A_β as a linear combination of unitaries, we can express H_β,k in terms of the quantities q_xk and r_xk, where we take the inequality test to yield a numerical value of 0 for false and 1 for true. Note that with these definitions, q_xk and r_xk can take values in the range [0, 1]. We use the procedure from [63] (Lemma 4.3) to obtain a linear combination of unitaries, so that the operator is approximated as a sum, and A_β is approximated accordingly. The part Δ_β(𝟙 − |0⟩⟨0|) can also be written as a linear combination of unitaries. The complete approximation of the Hamiltonian with spectral gap amplification is then given in Eq. (230), where we have grouped the terms such that the operations in square brackets are unitaries. Summing the coefficients in the sums gives the λ-value of Eq. (231). To implement the operator by a linear combination of unitaries, we need two single-qubit ancillas, a register with z, and a register with k. The prepare operation is trivial. The roles of these registers are as follows.
1. The register with k selects terms in the sum over k in Eq. (230).
2. The register with z selects terms in the sum over z in Eq. (230).
3. The F register selects between the terms in square brackets in the first and second lines of Eq. (230).
There are registers containing k for both this prepared control state and the target state. We will call these the control and target K registers. In the prepare operation, creating the superposition over z can be trivially achieved with Hadamards. The superposition over the N values of k can be achieved similarly if N is a power of 2, but otherwise the procedure outlined in Section III E 2 can be used with cost 4 log N + O(1). The rotation on qubit F can be achieved to precision ε using 1.15 b_rot + O(1) T operations, where b_rot = log(1/ε). The select procedure for the linear combination of unitaries may be performed as follows.
1. Perform a test of whether the target system K-register is in the space {|0⟩, |k⟩}, placing the result in an ancilla qubit; call this qubit E.
2. Controlled on E being |1⟩, and selected by F, compute q_xk or r_xk.
3. Controlled on E being |0⟩, place the value δ_β into the output register also used for q_xk or r_xk.
4. Perform the inequality test z < 2^(s−1)(1 + q_xk) (or the corresponding test with r_xk), placing the result in an ancilla qubit.
5. Apply a Z gate to the output of the inequality test.
6. Controlled on the E register being |1⟩ and the register F being |1⟩, apply X to qubit k of the target system.
7. Apply a NOT between |0⟩ and |k⟩ for the target system. That gives |k⟩⟨0| + |0⟩⟨k|.
8. Invert the inequality test from step 4.
9. Invert step 3, erasing δ_β.
10. Invert step 2, uncomputing q_xk or r_xk.
11. Invert the test from step 1, erasing qubit E.
12. Apply a Z gate to F to introduce the −1 sign.
Here we call the register that would carry |k⟩ for the target system the K-register. The cost of these steps may be quantified as follows, ignoring O(1) costs.
Steps 1 and 11. We need an equality test between the K-register for the ancilla and the K-register for the system, with cost log N + O(1). We also test if the system has 0 in its K-register, with cost log N + O(1), and perform an OR on the results of the two comparisons with cost 1. Since the comparisons need to be computed and uncomputed, there is cost 4 log N + O(1) for the two steps.
Steps 2 and 10. Computing q_xk and r_xk may be performed by first computing the energy difference, then using a QROM to output coefficients for linear interpolation. The cost estimation is as given in Section II E; we pay the QROM lookup cost twice for q_xk and r_xk, but we pay the multiplication cost only once. Since that is the dominant cost, the cost may be regarded as that of a single function oracle. The computation and uncomputation in the two steps means we pay twice the cost of the energy difference and function oracles. Note that q_xk and r_xk are unchanged under the bit flip in step 6 (since there is no bit flip for q_xk, and r_xk is symmetric under the bit flip). There is O(1) cost to making the computation controlled on the ancilla qubit E.
Steps 3 and 9. Outputting δ β controlled on a single ancilla may be performed with CNOTs (no Toffoli cost) because δ β is classically computed.
Steps 4 and 8. The inequality test is simply performed in the form z < 2^(s−1)(1 + q_xk), and similarly for r_xk. There are no multiplications involved, because q_xk and r_xk are output as integer approximations. The inequality test has cost s Toffolis, so computation and uncomputation for the two steps has cost 2s.
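This acceptance gadget can be sanity-checked classically. The sketch below (illustrative Python; we assume q_xk is stored as an (s−1)-bit integer Q with q = Q/2^(s−1), so the threshold 2^(s−1)(1 + q_xk) becomes the integer 2^(s−1) + Q) confirms that a uniform s-bit z passes the test on exactly a fraction (1 + q)/2 of branches, with no multiplication required:

```python
# Classical check of the coherent "accept with probability (1+q)/2" gadget:
# with z uniform over s-bit integers, the inequality z < 2^(s-1)(1 + q)
# succeeds on exactly the desired fraction of branches.
def acceptance_fraction(Q: int, s: int) -> float:
    """Fraction of z in [0, 2^s) with z < 2^(s-1) + Q, where
    q = Q / 2^(s-1) is the stored (s-1)-bit approximation of q_xk."""
    threshold = (1 << (s - 1)) + Q  # integer threshold, no multiplication needed
    return sum(1 for z in range(1 << s) if z < threshold) / (1 << s)

s = 8
for Q in (0, 64, 128):              # q = 0, 0.5, 1.0
    q = Q / (1 << (s - 1))
    assert abs(acceptance_fraction(Q, s) - (1 + q) / 2) < 1e-12
```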
Step 5. This is just a Z gate with no Toffoli cost.
Step 6. The cost is two Toffolis to prepare a control qubit that flags whether the required conditions are satisfied. Then this qubit is used as a control for the QROM on the value of k to apply an X operation to the target system. That QROM has complexity N.
Step 7. Controlled on the system K-register being equal to k, we subtract k from it, and controlled on the system K-register being 0 we add k to it. We then swap the registers with the results of these two equality tests. Since we still have the qubits with the results of the equality tests from step 1, we have no additional cost for that here. The cost of the two additions is 2 log N + O(1).
The Toffoli cost of the steps is therefore 2s + N + 6 log N + O(1), plus twice the cost of the function evaluation and energy difference oracles. Note that we pay four times the cost of the QROM lookup within the function evaluation oracle, but we regard the cost as that of two function oracles because the QROM lookup cost is smaller and is absorbed into the order term. The cost of the preparation and inverse preparation is 8 log N + O(1) Toffolis and 2.3 b_rot + O(1) T gates, or just 2.3 b_rot + O(1) T gates if N is a power of 2. Taking s = b_sm + O(1) gives the total cost, where we have put the T cost in the order term. The ancilla cost is as follows.
1. Two qubits for the E and F ancillae.
2. Two qubits from the results of the two equality tests for the system K-register.
3. The register with k for the control ancilla and that with k for the system each need log N qubits.
4. The register with z for the control ancilla needs s qubits.
5. The ancillas for the energy difference oracle.
6. The ancillas for the function evaluation oracle.
The number of qubits s used for z can be taken to be within O(1) of the number of qubits c used for q_xk or r_xk. We need temporary qubits for working, but the same working qubits as for the oracles can be used, so we will not count these ancilla costs again. The function evaluation oracle may use more or fewer temporary ancillas than the energy difference oracle, so we need to take the maximum of these two costs. That gives an ancilla cost of 2 log N + b_sm + O(1) plus the ancilla costs of the two oracles.

IV. ERROR-CORRECTION ANALYSIS AND DISCUSSION
TABLE VIII. Estimates of resources required to implement steps of various heuristic algorithms for the Sherrington-Kirkpatrick (SK) problem within the surface code. All overheads are reported assuming a single Toffoli factory using state distillation constructions from [35]. Surface code overheads in parentheses assume a physical error rate of 10^−4, whereas the overheads not in parentheses assume a physical error rate of 10^−3. The target success probability is 0.9. These estimates are based on Table VII, where we somewhat arbitrarily choose to set all values of the parameter quantifying the number of bits of precision (b) that appear in the table to 20, except for b_fun and b_sm, which can be smaller, so we take b_fun = b_sm = 7.

Previous sections of this paper have discussed and optimized the compilation of various heuristic approaches to quantum optimization into cost models appropriate for quantum error-correction. Specifically, we focused on reducing the Toffoli (and in some cases T) complexity of these algorithms while also keeping the number of ancilla qubits reasonable. This cost model is motivated by our desire to assess the viability of these heuristics within the surface code (the most practical error-correcting code suitable for a 2D array of physical qubits) [19, 64-66]. T gates and Toffoli gates cannot be implemented transversally within practical implementations of the surface code. Instead, one must implement these gates by first distilling resource states. In particular, to implement a T gate one requires a T state (|T⟩ = T|+⟩) and to implement a Toffoli gate one requires a CCZ state (|CCZ⟩ = CCZ|+++⟩); in both cases these states are consumed during the implementation of the associated gates. Distilling T or CCZ states requires a substantial amount of both time and hardware.
Here, we will analyze the cost to implement our various heuristic optimization primitives using the constructions of [35], which are based on applying the lattice surgery constructions of [67] to the fault-tolerant Toffoli protocols of [68, 69]. We will further assume a correlated-error minimum-weight perfect-matching decoder capable of keeping pace with 1 µs rounds of surface code error detection [70], and capable of performing feedforward in about 15 µs. We will assume that our physical hardware gates have error rates of either 10^−3 or 10^−4, the former consistent with the best error rates demonstrated in state-of-the-art intermediate-scale superconducting qubit platforms [1] and the latter consistent with improvements in the technology that we hope would be feasible in the next decade. Under these assumptions, the spacetime volume required to implement one Toffoli gate or two T gates with two levels of state distillation and code distance d = 31 (which is safely sufficient for the computations we analyze here) is roughly 26 qubit-seconds [35]. For instance, to distill one CCZ state using the approach in [35] requires 5.5d + O(1) cycles using a factory with a data-qubit footprint of about 12d × 6d (the total qubit count includes measurement qubits, and so is roughly double this figure). Specifically, in our estimates we will assume that executing a Toffoli gate requires about 170 microseconds and 150,000 physical qubits (see the resource estimation spreadsheet included in the supplementary materials of [35] for more detailed assumptions). Due to this large overhead we will focus on estimates assuming that we distill CCZ states in series, which is likely how we would operate the first generation of fault-tolerant surface code computers.
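The headline factory figures follow from these assumptions by a few lines of arithmetic, as the following sketch makes explicit (the constants 5.5d cycles and the 12d × 6d data footprint are the values from [35] quoted above; treating measurement qubits as exactly doubling the footprint is a rounding of the same source):

```python
# Sketch reproducing the serial CCZ-factory figures quoted above
# (constants from [35]; 1 us per round of error detection, d = 31).
d = 31                                   # surface code distance
cycle_us = 1.0                           # microseconds per error-detection round
toffoli_us = 5.5 * d * cycle_us          # ~170 us to distill one CCZ state
data_qubits = (12 * d) * (6 * d)         # data-qubit footprint of the factory
physical_qubits = 2 * data_qubits        # measurement qubits roughly double it
qubit_seconds = physical_qubits * toffoli_us * 1e-6  # spacetime volume per Toffoli

assert 170 <= toffoli_us < 171               # about 170 microseconds per Toffoli
assert 130_000 < physical_qubits < 160_000   # about 150,000 physical qubits
assert 20 < qubit_seconds < 30               # consistent with roughly 26 qubit-seconds
```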
In Table VIII and Table IX we estimate the resources that would be required to implement various heuristic optimization primitives within the surface code (given the assumptions of the prior paragraphs) for the Sherrington-Kirkpatrick and Low Autocorrelation Binary Sequences problems, respectively. We perform this analysis for the primitives of amplitude amplification, a first-order Trotter step (which can be used for QAOA, population transfer, the adiabatic algorithm, etc.), a qubitized Hamiltonian walk realized from the linear combination of unitaries query model (which can be used for measuring energies in QAOA, performing population transfer, the adiabatic algorithm, etc.), the qubitized quantum walk approach to quantum simulated annealing ("LHPST walk"), and the spectral gap amplified approach to quantum simulated annealing. The only primitive discussed in this paper omitted from these tables is the Szegedy walk approach to quantum simulated annealing. This is because we can see from Table VII that the Szegedy walk approach is strictly less efficient than the qubitized variant, and would require so many ancillas that analyzing it under the assumption of serial state distillation seems unreasonable. Because we do not know how many times one would need to repeat these primitives to solve the various optimization problems, in Table VIII and Table IX we report how many times one would be able to implement these primitives for various system sizes, assuming maximum run times of one hour or one day (24 hours). We also report how many physical qubits would be required to realize these computations assuming physical gate error rates of 10^−3 or (10^−4).
We focus on the SK and LABS cost functions primarily for concreteness. As seen in Table IV and Table VII, the choice to focus on these specific problems rather than QUBO or the H_L model means that we do not need to choose a precision parameter in some cases. For example, with amplitude amplification we know how many bits of precision we should compute the energy to, since SK and LABS both have integer-valued energies in a well-defined range. However, in order to produce specific numerical estimates for other primitives it is necessary to assume values for the precision parameters b appearing in Table IV (defined in Table III); e.g., for the Trotter steps one must realize time evolutions of non-integer duration such that the phase is accurate to within some precision b_pha, which we must choose independently of the particular problem. Thus, in order to produce actual numerical estimates, in Table VIII and Table IX we set the various free precision parameters b to 20 bits. However, as discussed in previous sections, the parameters b_fun and b_sm can generally be chosen to be smaller than the other values of b without compromising precision; here we take b_fun = b_sm = 7.
It is tempting to directly compare the costs of the various primitives shown in Table VIII and Table IX. While comparisons of the same primitives between the two problem types are straightforward (e.g., SK is more efficient than LABS in most, but not all, cases), comparisons between the different primitive types are challenging. Quantum simulated annealing, amplitude amplification, QAOA, population transfer, and the adiabatic algorithm are simply different algorithms so it is difficult to compare the relative values of a step of these algorithms.
It seems more reasonable to compare the Trotter steps to the qubitized Hamiltonian walk steps, since these primitives can be used for the same ends (e.g., population transfer or the adiabatic algorithm). But first, the choice of b = 20 means something different for these two algorithms. And second, while the Hamiltonian walks are capable of more precise evolutions (scaling as O(log(1/ε)) in the precision ε, compared to the O(poly(1/ε)) scaling of fixed-order Trotter based methods), for heuristic optimization the evolution does not necessarily need to be precise, so the Trotter approach may be more efficient by using large steps. The Trotter steps can be made arbitrarily large without increasing gate count (although at a cost of less precision), whereas the Hamiltonian walk effectively simulates time of at most 1/λ, where λ_SK ≈ N^2/2 and λ_LABS ≈ N^3/3 (but it does so quite precisely). Thus, although the Hamiltonian walk steps require the fewest Toffolis in Table IX, they may still be less efficient than other approaches.
For the various forms of quantum simulated annealing, the number of steps needed is governed by the spectral gap. The qubitized annealing (LHPST) and Szegedy approaches are directly comparable because they have the same gap, so the same number of steps should suffice; the smaller step cost of LHPST therefore makes it more efficient. The spectral gap amplified approach has a similar gap to the LHPST and Szegedy approaches, because it provides a similar square-root improvement. The problem is that the Hamiltonian has a λ-value proportional to √N, as shown in Eq. (231). This increases the cost of implementing the Hamiltonian by a factor of √N, so the cost given for a single step should be multiplied by √N for a fair comparison with the other simulated annealing approaches. When that is taken into account, the spectral gap amplified approach is less efficient.
TABLE IX. Estimates of resources required to implement steps of various heuristic algorithms for the Low Autocorrelation Binary Sequence (LABS) problem within the surface code. All overheads are reported assuming a single Toffoli factory using state distillation constructions from [35]. Surface code overheads in parentheses assume a physical error rate of 10^−4, whereas the overheads not in parentheses assume a physical error rate of 10^−3. The target success probability is 0.9. These estimates are based on Table VII, where we somewhat arbitrarily choose to set all values of the parameter quantifying the number of bits of precision (b) that appear in the table to 20, except for b_fun and b_sm, which can be smaller, so we take b_fun = b_sm = 7.

With these caveats and context, we believe that Table VIII and Table IX give a rough sense for the feasibility of implementing these various heuristic optimization primitives on a small fault-tolerant surface code quantum processor. In most cases one can attempt these algorithms up to roughly a thousand bits with around a million physical qubits or less (especially given 10^−4 error rates). However, we can see that the significant overheads of state distillation make the execution of these algorithms painfully slow. The quantum simulated annealing steps are often more efficient to implement than most other steps. The one exception is the Hamiltonian walk steps, which are highly efficient; but again, it is likely that the large value of λ means that many more Hamiltonian walk steps would be required. We see that for SK model problem sizes between N = 64 and N = 1024 one can perform between about 4×10^3 and 3×10^4 quantum simulated annealing updates per hour. As a comparison, the work of [71] discusses the implementation of a very performant classical simulated annealing code for optimizing sparse spin glasses. This same code deployed to an N = 512 spin instance of SK is capable of performing a simulated annealing update step in an average of 7 CPU-nanoseconds [72] (this average accounts for the fact that most updates for the Sherrington-Kirkpatrick model are rejected). This works out to about 6×10^11 attempted updates per core-hour, or about one hundred million times more steps than the quantum computer can implement in the same amount of time for an N = 512 spin SK model. The state produced after the 2×10^5 quantum simulated annealing steps that our quantum computer can make in one day for the N = 512 spin SK model could be produced by a single classical core in about four CPU-minutes, assuming that the classical algorithm would require exactly quadratically more (4×10^10) steps. The comparison is even less favorable for quantum computing if we consider larger problem sizes. Furthermore, given the high costs of quantum computing, it is unclear why we should restrict the classical competition to one core rather than to millions of cores.
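The four-CPU-minute figure follows directly from the numbers above, as the following sketch makes explicit (the step counts and the 7 ns classical update time are the estimates quoted in the text; the exactly quadratic step-count relationship is the stated assumption):

```python
# Sketch of the "four CPU-minutes" comparison for the N = 512 SK model,
# assuming an exactly quadratic quantum speedup in annealing steps.
quantum_steps_per_day = 2e5                    # quantum SA steps in one day (text estimate)
classical_steps = quantum_steps_per_day ** 2   # quadratically more classical steps
ns_per_update = 7                              # classical SA update time from [71, 72]
classical_seconds = classical_steps * ns_per_update * 1e-9

assert classical_steps == 4e10                 # the 4e10 figure quoted in the text
assert 3.5 * 60 < classical_seconds < 5 * 60   # about four CPU-minutes (280 s)
```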
If we assume a quadratic speedup in the number of annealing steps required, then for a sufficiently difficult problem the quantum computer must give a speedup. For the N = 512 spin SK model, by comparing the number of steps that the classical algorithm from [71] can make in one hour (5×10^11) to the number of steps that the quantum algorithm can make in one hour (8×10^3), we can estimate a crossover point. In particular, solving M/(8×10^3) = M^2/(5×10^11) yields M ≈ 7×10^7 as the minimum number of steps that would be required for the quantum algorithm to give an advantage. Unfortunately, this would mean the quantum computer would need to run for 7×10^7/(8×10^3) hours, which works out to about one year. Moreover, this analysis is very favorable to the quantum computer in that (1) it does not adjust the surface code distance (and thus, resource overheads) for runtimes longer than an hour, (2) it compares to a single classical core, and (3) it assumes that N = 512 is a large enough instance to warrant this many steps in some cases. Of course, most N = 512 instances of the SK model can be solved with much less than a CPU-year of simulated annealing run time, thus precluding the possibility of a quantum speedup for most instances at that size under the assumptions of our analysis.
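The crossover arithmetic can be reproduced in a few lines (a sketch using the step rates quoted above; at the crossover the quantum time M/q equals the classical time M^2/c, so M = c/q):

```python
# Sketch of the crossover estimate for the N = 512 SK model.
c = 5e11                      # classical SA steps per core-hour [71]
q = 8e3                       # quantum SA steps per hour (Table VIII estimate)
M = c / q                     # M/q = M^2/c  =>  M = c/q at the crossover
hours = M / q                 # quantum runtime needed at the crossover

assert 6e7 <= M <= 8e7        # about 7e7 steps, as quoted in the text
assert 300 < hours / 24 < 400 # about one year of quantum runtime
```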
Comparisons for amplitude amplification are similarly discouraging. For these two problems one can perform between about ten and three thousand steps of amplitude amplification using between about one-hundred thousand and one-million qubithours of state distillation. In the same amount of time one could conservatively check hundreds of billions of solutions on even a single core of a classical computer. Assuming the quantum computer would require quadratically fewer steps of amplitude amplification (still at least a hundred thousand steps) compared to random classical energy queries, we would still need roughly billions of qubithours of state distillation in order to compete with what a single core of a classical computer can do in one hour. Once again, if we instead make our comparisons to a classical supercomputing cluster rather than to a single classical core, the overheads appear even more daunting.
The LABS problem is an example where the scaling of the best known classical algorithm is worse than O(2^(N/2)), and thus an approach based on amplitude amplification would have better scaling. In particular, the best scaling method in the literature goes as Θ(1.73^N) [31]. That scaling is obtained for a branch-and-bound type method that queries the effect of local spin flips (and thus, not the entire objective function). Each of these queries requires slightly less than 7 CPU-microseconds with an optimized classical implementation for N = 64 (about 5×10^8 steps per hour). If we were to compete with this approach using amplitude amplification on a quantum computer (where we can perform about 2×10^3 steps per hour at N = 64), then we can approximate the crossover point by solving 2^(M/2)/(2×10^3) = 1.73^M/(5×10^8), so long as we remember that these numbers are only valid in the vicinity of M ≈ N = 64. Coincidentally, that is the case, as we find that M = 62, which corresponds to about 2×10^9 queries and would take about 116 years. Once again, here we are being generous to the quantum computer by making comparisons to a single core and not adjusting the code distance for long runtimes. Still, we again see that a small error-corrected quantum computer cannot compete with classical methods under such a modest scaling advantage.
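The crossover equation above can be solved by taking logarithms of both sides; the following sketch reproduces M ≈ 62 and the associated runtime (using the step rates quoted above):

```python
import math

# Sketch of the LABS crossover: solve 2^(M/2)/q = 1.73^M/c for M.
q = 2e3                       # quantum amplitude-amplification steps per hour
c = 5e8                       # classical branch-and-bound queries per hour [31]
# (M/2) ln 2 - ln q = M ln 1.73 - ln c  =>  solve linearly for M:
M = (math.log(c) - math.log(q)) / (math.log(1.73) - 0.5 * math.log(2))
queries = 2 ** (M / 2)        # quantum queries at the crossover
years = queries / q / (24 * 365)

assert 60 < M < 64            # M ~ 62, consistent with the text
assert 1e9 < queries < 3e9    # about 2e9 queries
assert 100 < years < 130      # on the order of a century of runtime
```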
The heuristics based on Trotter steps or qubitized walk LCU queries are more difficult to compare to classical competition, since algorithms such as QAOA, the adiabatic algorithm, or population transfer lack a clear classical analog. In that sense, it is not straightforward to predict what being able to perform a few hundred Trotter steps or a few thousand qubitized walk steps in an hour might buy us, but it is clear that these would suffice for only very short quantum walks or time evolutions, or very inaccurate time evolutions. Eventually, it will at least be possible to find out by using our constructions to realize these algorithms on a small fault-tolerant quantum computer and experimentally discovering what happens. We note that for these algorithms the number of steps should be interpreted as the product of the number of repetitions of the primitive and the total number of times the algorithm is repeated. For instance, we see that for either the SK model or LABS at N = 256, slightly more than 100 Trotter steps can be implemented in an hour. In the context of QAOA, this could mean that we run QAOA at p = 100 and draw one sample, run QAOA at p = 10 and draw ten samples, or run QAOA at p = 1 and draw one hundred samples, etc. However, as we have explained in Section III A and Section III B, one is probably better off using coherent repetitions in an amplitude-amplification-like scheme rather than making classical repetitions.
Although we have tried to optimize the realization of these heuristic primitives for the cost functions considered in this paper, clever improvements to our approaches might further reduce the resources required. However, we would expect the complexity of these primitives to be no better than N . In particular, LCU-based methods require a minimum of N − 1 Toffolis just to access N qubits in a controlled way. For Trotter step methods, evolution under the problem Hamiltonian could be below N for a particularly simple problem Hamiltonian, but then the evolution under the transverse field driver Hamiltonian would be the dominant cost and require O(N ) non-Clifford gates. For amplitude amplification, one could again have a small cost for a particularly simple problem Hamiltonian, but amplitude amplification requires a reflection on at least N qubits, with cost at least N − 2 Toffolis.
We are already at about 5N for the LHPST walk with SK, so we would not expect more than about a factor of 5 improvement even for the easiest problem. If we were to use the sum of bits directly as in [25], then the complexity would be about 2N, but another N ancilla qubits would be needed. One could also propose to use a larger fault-tolerant quantum computer and distill more Toffoli states in parallel. But even if this strategy is pursued to the fullest extent possible (requiring tens or hundreds of millions of physical qubits) and parallelized near-optimally, the surface code will then be bottlenecked by Clifford gates (or the overhead of routing), which are, at best, only about a hundred to a thousand times faster to implement.
In conclusion, we have optimized and compiled the basic primitives required for many popular heuristic algorithms for quantum optimization to a cost model appropriate for practical quantum error-correction schemes. This allowed us to assess and compare the cost of several quantum algorithms that have not previously been compiled in such detail. We focused on doing this for only a subset of the possible cost function structures that one might hope to algorithmically exploit for more efficient implementations, but our constructions led to the development of various methodologies which we expect will be useful in a more general context. For instance, we expect that work outside the context of quantum optimization might benefit from our strategy of interpolating arithmetic functions using an adaptive QROM. However, despite our attempts at optimization, the concrete resource estimates from Table VIII  and Table IX are predictably discouraging. The essential reason for this is the substantial constant factor slowdown between error-corrected quantum computation and classical computation. Based on these numbers we strongly suspect that in order for early fault-tolerant quantum computers to have a meaningful impact on combinatorial optimization, we will either need quantum optimization algorithms that afford speedups which are much better than quadratic, or we will need significant improvements in the way that we realize error-correction.
given classically, this technique is slightly less efficient than rotation synthesis with T gates as in [34], because Toffolis have a cost equivalent to two T gates in magic state distillation [35]. On the other hand, rotation angles that are integer multiples of 2π/2^b_grad can be performed exactly, up to the accuracy of synthesizing the resource state |φ⟩.
To obtain a rotation that performs the desired mapping, one can simply perform the operations S H e^(−2πi Z/2^b_grad) H. Here the Hadamard H and S gates are Clifford gates, so this gives the state preparation with the only Toffoli cost being in synthesizing the Z-rotation.
The complexity of performing the addition is only b_grad − 2 rather than b_grad − 1, as would normally be the case for addition of b_grad-bit numbers (modulo 2^b_grad). The reason is that the most significant qubit of |φ⟩ is in a |+⟩ state, so NOT gates on this qubit can be replaced with phase gates, and this qubit can be discarded. Doing that yields the circuit shown in Figure 11. The Toffoli is not immediately saved, but the CNOTs and Z gate on the final carry qubit can be replaced with two Z gates, as shown in Figure 12. Then the Toffoli used on the final carry qubit can simply be replaced with a controlled phase, as shown in Figure 13. The resulting complexity is b_grad − 2 Toffolis. If the angle to be rotated by is given as a classical variable, then the cost is further reduced to b_grad − 3 Toffolis, because addition of a classical number takes one fewer Toffoli. This means that b_grad = 4, which would give a T gate, takes one Toffoli.
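The phase-gradient identity underlying this addition-based rotation is easy to verify numerically: adding a classical integer k modulo 2^b_grad into the gradient register imparts the phase e^(2πik/2^b_grad) while returning the resource state unchanged. A minimal pure-Python statevector sketch (the convention |φ⟩ ∝ Σ_j e^(−2πij/2^b_grad)|j⟩ is an assumption of this illustration):

```python
import cmath

# Statevector check: adding a classical k (mod 2^b) into the phase-gradient
# state |phi> returns |phi> up to the phase exp(2*pi*i*k/2^b).
b = 5                                    # b_grad bits in the gradient register
N = 1 << b
phi = [cmath.exp(-2j * cmath.pi * j / N) / N ** 0.5 for j in range(N)]

def add_k_mod(state, k):
    """Modular addition of a classical integer k: the relabeling |j> -> |j+k mod 2^b>."""
    return [state[(j - k) % N] for j in range(N)]

for k in range(N):
    out = add_k_mod(phi, k)
    phase = cmath.exp(2j * cmath.pi * k / N)
    assert all(abs(o - phase * p) < 1e-12 for o, p in zip(out, phi))
```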
Next we consider the case that we need to multiply an integer k with b bits by a constant γ to give the phase. Given that γ is represented on n bits, we can write γ as a sum of no more than (n + 1)/2 powers of two, with positive and negative signs. This formula is checked in Figure 14. To prove the formula, assume that it is true for numbers with ≤ n_0 bits, and consider a number m with n = n_0 + 2 bits (so the most significant bit must be a 1). There are then three cases to consider.
1. For m < (3/4)2^n, we find that m − 2^(n−1) < 2^(n−2), so m − 2^(n−1) has no more than n − 2 = n_0 bits, and so can be written as a sum of at most (n_0 + 1)/2 powers of two. That means m can be written as a sum of at most (n_0 + 1)/2 + 1 = (n + 1)/2 powers of two.
Since we have checked that the formula is true for small numbers of bits in Figure 14, the formula is true for all n by induction. To perform the multiplication, we will take each term in the sum for γ, and add or subtract a bit-shifted copy of k to the phase gradient state. We have no more than (n + 2)/2 additions/subtractions, each of which is into the phase gradient state with b_grad bits, which gives cost no more than (b_grad − 2)(n + 2)/2.
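The bound on the number of signed powers of two can also be checked exhaustively for small n. The sketch below uses the standard non-adjacent-form (NAF) recoding, which realizes a signed decomposition of this kind (using NAF here is our illustrative choice, not necessarily the construction proved above):

```python
import math

def naf(m):
    """Signed-digit recoding with digits in {-1, 0, +1} and no two adjacent
    nonzero digits (non-adjacent form), least significant digit first."""
    digits = []
    while m:
        if m & 1:
            d = 2 - (m % 4)   # +1 if m = 1 (mod 4), -1 if m = 3 (mod 4)
            m -= d
        else:
            d = 0
        digits.append(d)
        m //= 2
    return digits

# Exhaustively verify the bound for all n-bit numbers up to n = 10:
for n in range(1, 11):
    for m in range(1, 1 << n):
        terms = sum(1 for d in naf(m) if d != 0)
        assert terms <= math.ceil((n + 1) / 2)
        # sanity check: the signed digits really reconstruct m
        assert sum(d * (1 << i) for i, d in enumerate(naf(m))) == m
```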
The error due to omitted bits in the multiplication (those omitted in bit-shifting k) can be bounded as follows. First, note that the error for the additions is entirely in underestimating the product, since we are omitting digits. For the subtractions the error is in overestimating the product. Therefore, to obtain the maximum error we need to consider the case with entirely additions, since the subtractions would cancel the error. For each addition the error is upper bounded by 2π/2^b_grad, because we omit adding bits that would correspond to phase shifts of 2π/2^(b_grad+1), 2π/2^(b_grad+2), and so forth. That means the upper bound on the error from (n + 2)/2 additions is (n + 2)π/2^b_grad. To make the error in the multiplication no larger than ε we should take b_grad = log[(n + 2)π/ε] = log(n/ε) + O(1).
FIG. 11. A circuit to perform addition on 5 qubits modulo 2^5 when the most significant target qubit is in a |+⟩ state.

FIG. 12. A simplification of Figure 11 to eliminate the CNOTs on the last carry qubit. The |+⟩ state is omitted here because it is not acted upon.

FIG. 13. A simplification of Figure 12 where the last carry qubit is eliminated entirely and the Toffoli is replaced with a controlled phase.

However, the Hamiltonian H_eff for the time-evolution operator in this case is not known except in terms of its action on the space containing the instantaneous eigenvectors of H. In order to use this result, we need to bound the derivatives acting on the entire space. In order to find an asymptotic bound on these derivatives we define,

It is then clear from the unitarity of W_r(s) that for any |s| ∈ O(1), if we choose the principal logarithm for H_eff, then ‖H_eff(s)‖ ∈ O(1). The derivatives of the Hamiltonian are more involved to estimate.

Derivatives of matrix logarithms of unitary matrices
In order to compute the derivatives of the effective Hamiltonian, we will need to compute the derivatives of the logarithm function. Such an analysis is usually based on differentiating the Mercator series for the matrix logarithm; however, the Mercator series of log(A) does not converge for ‖A − 𝟙‖ ≥ 1. For greater generality we will use an integral representation for the matrix logarithm log(A) from [73]. This representation converges unless there is a real non-positive eigenvalue; for the case where A is unitary, this requirement prohibits matrices that have any eigenvalues equal to precisely −1. Next, defining V := [t(A − 𝟙) + 𝟙]^(−1), and in turn (A − 𝟙) = t^(−1)(V^(−1) − 𝟙), this expression simplifies. Next, note that for any invertible matrix-valued function A(s), the product rule gives an expression for the derivative of its inverse. Taking the derivative of (B6), we can use the fact that A is unitary to simplify the result. In the case where A is close to the identity, the resulting derivatives can be bounded whenever the absolute values of the phases of the eigenvalues are no greater than Γ. Next we consider W_r^2(t) in the case where the Hamiltonian is a linear combination of self-inverse unitaries, so select^2 = 𝟙, which is the case for all Hamiltonians considered here. Expanding it out, we use ⟨L(s, r)| select |L(s, r)⟩ = H(s)/(λ(s)r).
The adiabatic theorem then implies that, under reasonable assumptions about the derivatives of the Hamiltonian [52] (specifically that the Hamiltonian is Gevrey class G^α for α ≥ 1), the value of T needed to achieve error ε, given that the minimum eigenvalue gap for the effective Hamiltonian is ∆_eff, scales at most inverse-polynomially in ∆_eff; this simplifies further when λ ∈ Ω(1) and ∆_eff ∈ o(1). If the Hamiltonian H is maximum rank then the spectral gap of the effective Hamiltonian is on the order of ∆_eff ∈ Ω(min(∆, min_k |E_k|)/λ), where ∆ is the minimum spectral gap of the Hamiltonian H. The minimum over the energies comes from the fact that the eigenvalues of W_r in the set {±1, ±i} are mapped to 1, which can lead to degeneracies in the effective Hamiltonian that were absent in the original Hamiltonian. This confirms that by taking the number of steps sufficiently large we can force the diabatic error to become arbitrarily small. Thus we can use the walk operator in place of a Trotterized sequence for adiabatic state preparation, and in turn as a heuristic that will converge to the global optimum given a large enough r. It should be noted, however, that the bounds used in this analysis are extremely loose; if a quantitatively correct estimate of the scaling is desired, many of the simplifications used above can be eschewed at the price of increasing the complexity of the expression.
Note that in practice, the adiabatic paths can be chosen such that the second derivative of the Hamiltonian is zero and similarly we can choose paths such that λ is constant by absorbing it into the definition of the evolution time for each infinitesimal step. However, we give the above expression for generality. Higher order versions of this can also be derived using time-dependent Trotter-Suzuki formulas [57].

Appendix C: Cost of multiplication
As the multiplication operation is a major contributor to the overall complexity of our algorithms, we need to be quite careful in our analysis of the operation. We also frequently require only low-precision arithmetic, meaning that we can make our multiplications less accurate and therefore computationally cheaper. This Appendix presents our algorithms for performing four variations of the multiplication task, with modifications to be used when one of the inputs is given classically rather than quantumly.
Our strategy is to use schoolbook multiplication. In schoolbook multiplication, the product γ := κ × λ is calculated by writing κ = Σ_ℓ 2^ℓ κ_ℓ with κ_ℓ ∈ {0, 1} and then calculating the sum γ = Σ_ℓ 2^ℓ κ_ℓ λ. This reduces the task of multiplication to two very simple kinds of multiplication and the task of adding a list of numbers. The two multiplications are simple because multiplication by a power of two can be accomplished by an appropriate bit-shift operation, and multiplying by a single bit can be accomplished by using that bit as a control for the addition operation. That is, we perform that part of the addition if and only if the control bit is one.
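For concreteness, the shift-and-add structure can be sketched classically. The following toy Python function (all names ours) mirrors the quantum procedure: each set bit of κ plays the role of a control qubit triggering one shifted addition of λ.

```python
def schoolbook_multiply(kappa: int, lam: int, d_a: int) -> int:
    """Multiply two integers by the shift-and-add (schoolbook) method.

    Each set bit of kappa triggers one addition of a shifted copy of lam,
    mirroring how each |1> control qubit triggers one controlled addition.
    """
    product = 0
    for ell in range(d_a):            # iterate over the bits of kappa
        if (kappa >> ell) & 1:        # classical stand-in for a control qubit
            product += lam << ell     # bit-shift implements multiplication by 2**ell
    return product

print(schoolbook_multiply(13, 11, 4))  # 143
```

On a quantum computer the shifts are free relabelings of wires and only the controlled additions cost Toffolis, which is why the analysis below counts additions.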
We begin in Appendix C 1 by reviewing the parts of the main text where we need to multiply two numbers together. In Appendix C 2 we explain how to add a constant value to a quantum register, which is used separately in Algorithm 1 but is also used through the rest of this appendix in order to multiply a quantum variable by a classical constant. We then explain the simplest variant of multiplication in Appendix C 3, where we must multiply two integers together.

FIG. 15. A circuit to perform addition on 5 qubits modulo 2^5 from [32].
We then explain the remaining variants by modifying the integer-integer multiplication algorithm as appropriate.
In Appendix C 4, we explain the case where we multiply an integer to a real number. In Appendix C 5, we explain the case where we multiply a real number to another real number. Finally, in Appendix C 6, we explain the case where we calculate the square of a given real number. In all cases we indicate how the algorithm is to be modified when one of the inputs is classically specified.

Uses of multiplication in this paper
In Section II C we need to multiply a quantum register by a classical constant γ̃ or γ to obtain the phase to apply. The multiplication is performed directly into the phase gradient state, so we cannot use the savings available when the multiplication result is placed in an initially zero register. The fastest method seems to be to write the classical constant as a sum of powers of 2 with plus and minus signs.
In Section II E we consider QROM for interpolation of functions, and we need to multiply the input register by the slope. In that case, both registers are quantum. The input register is given to b_dif bits, and the goal is to give the approximation to the function to b_sm bits. This may require giving the slope to b_sm + O(log b_sm) bits, or b_sm + b_fun + O(log b_sm) bits in the case of the arcsine. For Szegedy walks, we need to take the square of a quantum register, and need to multiply a quantum register by a constant.

Methods for addition
When adding a classically given constant to a quantum register, it is possible to save the qubits that would be used to store this classical constant. Consider the quantum circuit for addition following [32], as shown in Figure 15, where i is the classically given integer and t is the quantum register. For this diagram we use the convention of [32], in which a Toffoli whose target is known to be initially zeroed is drawn with a distinguishing marker on the target; the first operation on the left in Figure 15 is of this kind. Toffolis whose targets are known to be zero afterwards are drawn with a different marker on the target. These may be performed with measurements and Cliffords, so they do not add to the non-Clifford cost.
The circuit for the adder contains a subsection where a CNOT gate is performed on a qubit of i, say i_1, as shown in Figure 16(a). The state after the CNOT can alternatively be obtained on the control by switching the control and target of the CNOT. Then, for the following Toffoli where i_1 would be the control, we switch the control to the carry register at the top. After that the carry register needs to be used as a control where it should take its original value, so we need another CNOT to undo the first. The resulting section of the circuit is as shown in Figure 16(b). Replacing all such sections of the circuit in this way, we obtain an addition circuit as shown in Figure 17. This adder only uses the i_j registers as controls. Since these registers have classically known values, all controls by these qubits may be replaced with classical controls, and these qubits need not be used. This also reduces the Toffoli cost by 1, because the first Toffoli is replaced with a CNOT. The Toffoli cost is therefore the number of bits minus 2, and the number of ancillas needed is the number of bits minus 1.

FIG. 16. (a) The component of the adder circuit where the qubit containing classical data is the target of a CNOT. (b) The circuit may be rewritten so the value on the second qubit is never changed.
FIG. 17. A circuit to perform addition on 5 qubits modulo 2^5 such that the i_j registers are only used as controls. Because they are only used as controls, if the number i is given classically the addition can be performed entirely using classical control, without using any ancillas to store i.
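The carry-ripple logic behind this classically controlled adder can be sketched in a few lines. In this hypothetical Python model (names ours), each carry is the majority function of the constant bit, the data bit, and the previous carry; the quoted Toffoli count of n − 2 for an n-bit adder is attached for reference.

```python
def add_classical_constant(t: int, i: int, n: int):
    """Ripple-carry addition t + i (mod 2**n) with classically known i.

    Carries come from majority logic; when bit i_k is classical, the gates
    controlled by it become classical branches, leaving the n - 2 Toffoli
    count quoted in the text for the adder of Fig. 17.
    """
    carry = 0
    result = 0
    for k in range(n):
        tk = (t >> k) & 1
        ik = (i >> k) & 1
        result |= (tk ^ ik ^ carry) << k
        # majority(t_k, i_k, c_k) produces the next carry
        carry = (tk & ik) | (tk & carry) | (ik & carry)
    toffoli_count = max(n - 2, 0)
    return result, toffoli_count

print(add_classical_constant(27, 9, 5))  # (4, 3): (27 + 9) mod 32 = 4
```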

Multiplying two integers
In this variant of the multiplication task, we are to multiply the d_A-bit integer κ by the d_B-bit integer λ. These integers are encoded into quantum registers A and B, respectively, and our task is to prepare a (d_A + d_B)-qubit register out holding the product γ := κ × λ. We explain how to perform this multiplication using schoolbook multiplication and the Gidney adder [32], and we explain how to reduce the computational cost if one of the inputs is presented to us classically rather than quantumly. We now explain the schoolbook multiplication algorithm in some detail. Let the bits of κ and λ be indexed so that λ_{d_B} refers to the least-significant bit of λ. Our procedure is then as follows.

1. Controlled on the d_B-th (least-significant) qubit of B, copy the value of A into the low-order bits of out.

2. For each ℓ = d_B − 1, . . . , 1, add 2^ℓ times the value of A to out in place, controlled on the (d_B − ℓ)-th qubit of B. This can be done by using the control to copy the d_A bits to an ancilla, and adding this ancilla. The ancilla can be erased with no Toffoli cost. We add A to out with the final qubits of out ignored. Note that the number of nonzero bits will always be no greater than d_A.
The total number of Toffolis is 2 d_A d_B − d_A, and the total number of temporary ancilla qubits needed is 2 d_A − 1, since we are copying d_A qubits out to an ancilla as well as using d_A − 1 temporary qubits in the addition.
We now consider how the cost of the algorithm can be reduced when one of the inputs is presented classically, rather than quantumly. The effect on the algorithm is different depending on whether κ or λ is known classically. In the case that κ is known classically, we can replace all the Toffolis in the copy operation in step 1 with CNOTs or identity gates, depending on whether the relevant bit of κ is 1 or 0. More interestingly, each addition step would involve adding a known constant rather than an unknown variable to be read from a quantum register during the computation. The effect on computational complexity depends on the classical constant κ; in particular, on the largest power of two that divides κ. In the worst case (κ mod 2 = 1), we save one Toffoli per addition step. In the best case (κ = 0), the computational cost is zero, because we know in advance that we are multiplying by zero.
In the case that λ is presented to us classically rather than quantumly, we can make the addition "controlled" by performing the (non-controlled) addition circuit when the classical control bit is 1, or doing nothing when it is 0. The number of quantum-to-quantum additions thus depends on the number of non-zero classical bits: the greater the Hamming weight of the classical input, the greater the number of additions to be performed. Note that this is distinct from the cost of performing classical-to-quantum multiplication when κ is the classical variable, in which case the complexity is determined by the number of trailing zeros of the number.
The case where λ is given classically is more relevant for this paper. We are unlikely to be dealing with classically specified integers that are a multiple of a large power of two. On the other hand, we will frequently have some information about the Hamming weight of the classically known number λ, which we denote w_λ. Each addition costs at most d_A Toffolis, we perform w_λ ≤ d_B such additions, and no other operations require Toffoli gates. We therefore have a total Toffoli cost of at most d_A d_B, which can be replaced with d_A w_λ if we can assume knowledge of the Hamming weight of λ. Thus we save a factor of 2 if one of the inputs is classical.
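These cost formulas are easy to tabulate. The following small cost-model sketch (function names ours; the formulas are the ones quoted above) compares the fully quantum count 2 d_A d_B − d_A against the classical-λ count of at most d_A Toffolis per nonzero bit of λ.

```python
def toffoli_cost_quantum_quantum(d_a: int, d_b: int) -> int:
    # 2*d_A*d_B - d_A Toffolis when both integers are quantum (see text)
    return 2 * d_a * d_b - d_a

def toffoli_cost_classical_lambda(d_a: int, lam: int) -> int:
    # with lambda classical, only its nonzero bits trigger additions,
    # each addition costing at most d_A Toffolis
    return d_a * bin(lam).count("1")

print(toffoli_cost_quantum_quantum(8, 8))            # 120
print(toffoli_cost_classical_lambda(8, 0b10100001))  # 24
```

For a typical λ with about half its bits set, the classical-λ count is roughly d_A d_B / 2, which is the factor-of-2 saving noted above.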
In the following subsections, we explain how to modify the above procedure for variants of the multiplication task. These variants have at least one of the inputs being a real number between 0 and 1, rather than an integer. Thus the task is not to calculate the multiplication exactly, as this would involve infinitely many bits for real numbers. Instead, we truncate the binary expansions of real numbers to ensure that an error threshold is achieved.

Multiplying an integer to a real number
Now we consider a variant of the multiplication task where one of the inputs is a real number between zero and one. For reasons that become clear below, we specify κ to be the real number and λ to be the integer. We assume that the real number is defined to infinitely many digits, and that our task is to approximate γ := κ × λ to within an error tolerance ε; that is, we are to calculate some γ̃ such that |γ − γ̃| < ε, preparing a new quantum register out that encodes γ̃. Here we are free to choose the number of bits for the register A, and hence the number of bits for the register out; this choice will naturally depend on the error tolerance ε. We begin by specifying symbols for the bits of the inputs κ and λ, noting that the indexing differs somewhat from the previous section. We then select an integer d_A ≥ d_B (presuming ε < 1) that counts the number of bits of the input κ we plan to use; we use d_A − 1 bits of κ. We explain our plan by first representing the ideal product with a vertical line denoting where we truncate the binary expansion of κ, and we calculate γ̃ from the truncated expansion. To ensure that the error tolerance ε is achieved, we should choose d_A > d_B + log(d_B/ε). We therefore choose d_A = d_B + ⌈log(d_B/ε)⌉. We follow a similar strategy to that described in Appendix C 3, meaning that we are to perform a sequence of controlled additions, working bottom to top in Eq. (C5).
We start with the bottom line by copying d_A − d_B bits into the output register, with Toffoli cost d_A − d_B. After that, the number of Toffolis for each controlled addition is twice the number of bits added, and summing these contributions gives the total number of Toffolis. Hence the Toffoli cost of multiplying a real number by an integer on a quantum computer is bounded by an expression depending only on d_B, the number of bits used to specify the integer, and ε, the allowable error in the overall multiplication. The algorithm requires that the real number is specified to d_A = d_B + ⌈log(d_B/ε)⌉ bits and uses d_A − 1 ancilla qubits.
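A quick numerical sanity check on the bit-count prescription is possible. This sketch assumes the concrete choice d_A = d_B + ⌈log₂(d_B/ε)⌉, which satisfies the inequality d_A > d_B + log(d_B/ε) stated above, and verifies empirically that truncating κ to d_A − 1 bits keeps the product error below ε.

```python
import math
import random

def choose_d_a(d_b: int, eps: float) -> int:
    # d_A = d_B + ceil(log2(d_B / eps)), following the bound in the text
    return d_b + math.ceil(math.log2(d_b / eps))

def truncated_product_error(kappa: float, lam: int, d_a: int) -> float:
    # error from using only d_A - 1 bits of the real input kappa
    kappa_trunc = math.floor(kappa * 2 ** (d_a - 1)) / 2 ** (d_a - 1)
    return abs(kappa * lam - kappa_trunc * lam)

random.seed(7)
eps, d_b = 1e-3, 8
d_a = choose_d_a(d_b, eps)
worst = max(truncated_product_error(random.random(), random.randrange(2 ** d_b), d_a)
            for _ in range(1000))
print(worst < eps)  # True
```

This only models the truncation of κ; in the full quantum algorithm the per-addition truncations are what motivate the extra log(d_B/ε) guard bits.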

Multiplying two different real numbers
In this subsection we consider the task where we are to multiply two real numbers κ (0 ≤ κ < 1) and λ (0 ≤ λ < 1). Our task is to calculate an approximation γ̃ to γ := κ × λ such that |γ − γ̃| < ε, where ε > 0 is some given error tolerance; that is, we are to prepare a new quantum register out encoding γ̃. We are free to choose the number of qubits in each of the registers A, B, and out to ensure that the output encodes a value for γ̃ that approximates γ to within the error tolerance ε. We begin by discussing these choices of register size, starting with the size of A and B.
To explain our choice for the numbers of qubits for registers A and B, we begin by introducing notation for the inputs κ and λ. As before, we define the bits of the inputs according to the equations

κ := Σ_{ℓ=1}^{∞} κ_ℓ/2^ℓ;  λ := Σ_{ℓ=1}^{∞} λ_ℓ/2^ℓ;  κ_ℓ, λ_ℓ ∈ {0, 1}. (C11)

This suggests our strategy for calculating γ. As before, solid lines in the expansion of the product indicate where we truncate the calculation in order to produce the approximation γ̃ instead of γ.
Here we have assumed that d_A ≥ d_B; if d_A < d_B, our repeated addition procedure would involve several additions by zero. The repeated addition strategy has a Toffoli cost of (d_A − d_B + 1) + 2(d_A − d_B + 2) + . . . + 2(d_A − 1) = (d_A − d_B + 1) + 2 Σ_{j=d_A−d_B+2}^{d_A−1} j.

FIG. 18. We plot the ratio of (d + 1)/2^d to the error bound ε. A value less than 1 ensures that our choice of d yields a multiplication result whose error is less than ε.
It seems reasonable to set d A = d B , and numerical evidence indicates that this choice makes the optimal tradeoff between computational complexity and error tolerance.
We can ensure that the error of the approximation γ̃ is within tolerance ε by requiring (d + 1)/2^d ≤ ε. Though this could be solved exactly using a Lambert-W function, it is satisfied with

d = 1 + log(1/ε) + log(1 + log(1/ε)). (C17)

Figure 18 justifies this choice by depicting the value of (d + 1)/(2^d ε) as a function of ε with d chosen as per Eq. (C17). To choose d, we take the ceiling of this expression, because d must be an integer.
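The adequacy of Eq. (C17) can also be checked numerically. This sketch assumes the requirement being satisfied is (d + 1)/2^d ≤ ε, as indicated by the caption of Figure 18, and takes the logarithms to be base 2.

```python
import math

def choose_d(eps: float) -> int:
    # d = 1 + log2(1/eps) + log2(1 + log2(1/eps)), rounded up (cf. Eq. C17)
    l = math.log2(1 / eps)
    return math.ceil(1 + l + math.log2(1 + l))

for eps in [1e-2, 1e-4, 1e-8, 1e-12]:
    d = choose_d(eps)
    # the ratio (d + 1)/2**d should sit at or below the tolerance
    print((d + 1) / 2 ** d <= eps)  # True for each tolerance
```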
Hence our strategy for multiplying two real numbers uses

d^2 − d − 1 = log^2(1/ε) + 2 log(1/ε) log log(1/ε) + O(log(1/ε)) (C18)

Toffoli gates to achieve an output with error less than ε. The ancilla cost is d − 1 bits for a copy of the bits of κ for the controlled addition, and another d − 1 bits for the addition itself. Thus the ancilla cost is

2 log(1/ε) + O(log log(1/ε)). (C19)

Squaring a real number

We are given a quantum register A with a real number κ that satisfies 0 ≤ κ < 1. Our task is to calculate an approximation γ̃ of γ := κ^2 such that |γ − γ̃| < ε, where ε is given (0 < ε < 1). That is to say, we are to prepare a new quantum register out encoding γ̃. We will include d bits in the sum, so the sum can be expressed as

2 Σ_{n=2}^{d} Σ_{m=1}^{n−1} κ_n κ_m 2^{−(n+m)} + Σ_{n=1}^{d} κ_n 2^{−2n}, (C22)

where we have used κ_n^2 = κ_n. The first term in this sum contains the parts where n > m, and is multiplied by 2 because those parts with n < m are equal by symmetry. The second term is that for n = m. This form of the sum is more efficient, because only about half as many terms appear. Now the term in the second sum for n = d/2 is half the size of any of the other terms, so it is convenient to omit it. Equation (C23) shows the form of the sum for the odd example d = 15, with the terms from the second sum of Eq. (C22) highlighted in blue. In the case where d is odd, ⌊d/2⌋ = (d − 1)/2, and we can write the sum accordingly. To compute the sum we start with n = d − 1, and copy the value κ_{d−1} κ_1 into the output at position d − 1 (to initialize the output as κ_{d−1} κ_1 2^{−(d−1)}) with Toffoli cost 1. Next, we use κ_{d−2} to control addition of κ_1 2^{−(d−2)} + κ_2 2^{−(d−1)} into the output. This controlled addition has cost 2 × 2 because it is for 2 bits. At step j = d − n ≤ (d − 1)/2, the cost of controlled addition of j bits is 2j. The cost of that part is therefore 1 + Σ_{j=2}^{(d−1)/2} 2j. For the remaining steps with n = (d − 1)/2 down to 2, we have a cost of n − 1 to produce the n − 1 values of κ_n κ_m, and there are n + 1 bits that need to be added into the output. That includes the bit for κ_n 2^{−2n}, which is spaced by one bit from the remaining bits for κ_n κ_m. That gives cost (n − 1) + (n + 1) = 2n. For n = 1, we just have a cost of one Toffoli to add in the single bit (without a control, because it is just κ_n). That gives the same cost as the first half, for a total cost of 2(1 + Σ_{j=2}^{(d−1)/2} 2j). In the case where d is even, ⌊d/2⌋ = d/2, and we can write the sum similarly; Eq. (C30) shows the form for the even example d = 16. The middle sum in Eq. (C30) has cost d − 2, the final sum has cost 2n for n = 2 to d/2 − 1, and cost 1 for n = 1, giving the same cost as the first sum. Thus in both the odd and even cases the total complexity is less than d^2/2.
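The symmetric split of the square can be verified exactly with rational arithmetic. This sketch (naming ours) assumes the split κ^2 = 2 Σ_{n>m} κ_n κ_m 2^{−(n+m)} + Σ_n κ_n 2^{−2n}, using κ_n^2 = κ_n, and checks that it reproduces κ^2 exactly when all d bits are kept.

```python
from fractions import Fraction

def square_via_symmetric_sum(bits):
    """Square kappa = sum_n bits[n-1]/2**n via the split
    2 * sum_{n>m} k_n k_m 2^{-(n+m)} + sum_n k_n 2^{-2n};
    k_n**2 = k_n since each bit is 0 or 1."""
    d = len(bits)
    cross = sum(Fraction(bits[n - 1] * bits[m - 1], 2 ** (n + m))
                for n in range(2, d + 1) for m in range(1, n))
    diag = sum(Fraction(bits[n - 1], 2 ** (2 * n)) for n in range(1, d + 1))
    return 2 * cross + diag

bits = [1, 0, 1, 1, 0, 1]            # kappa = 0.101101 in binary
kappa = sum(Fraction(b, 2 ** (i + 1)) for i, b in enumerate(bits))
print(square_via_symmetric_sum(bits) == kappa * kappa)  # True
```

The quantum algorithm further truncates the small cross and diagonal terms; the exact identity above is only the starting point of that truncation.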
Thus the average probability of success is at least as large as |⟨ψ_{0,1}|ψ⟩|^2 |⟨ψ_{0,0}|ψ_{0,1}⟩|^2, as given by Hastings' approach. A minor drawback compared to Hastings' approach is that only a single time is used, so if it happens that this time gave p_succ(t) significantly smaller than average, the amplitude amplification would not yield the solution. The original motivation for the SPA seems to be primarily to produce an algorithm admitting a rigorous analysis, so it is debatable whether one would actually want to use the algorithm heuristically rather than via some other approach. If one did want to use this approach heuristically, there are many ways that could be accomplished; for instance, by choosing E_target, B and K heuristically and then resolving to use a fixed number of rounds of amplitude amplification. Note that in the variant we have described it is no longer necessary to have an E_threshold, although one will still need to choose the precision to which one performs phase estimation. One can see that such a heuristic variant of this algorithm could be implemented by using either our Hamiltonian walk or Trotter step oracles for the evolution, followed by our direct energy oracles for computing the amplitude amplification target.

Quantum enhanced population transfer
Another heuristic algorithm for optimization which has been proposed is the quantum enhanced population transfer (QEPT) method of [10,11]. Unlike quantum heuristics which begin in a uniform superposition state, QEPT proposes to use quantum dynamics to evolve from one low energy solution of an optimization problem to other low energy solutions of similar energy. The idea is motivated by the search of configuration space in the classically non-ergodic phase associated with hard optimization problems. Such energy landscapes contain an extensive number of local minima separated by large Hamming distances. Algorithms relying on classical dynamics satisfying the detailed balance condition, such as simulated annealing, tend to get trapped at local minima of these landscapes. Thus, one could alternatively apply classical simulated annealing until the algorithm becomes trapped, then apply QEPT starting from that state, then again apply simulated annealing starting from the QEPT solutions, and so on.
Specifically, the context studied in [10,11] is as follows. Consider a cost function C on N qubits and a bitstring x with energy E_x (so that C|x⟩ = E_x|x⟩). The problem solved by QEPT is to produce another bitstring y within a small energy window, E_y ∈ [E_x − δ/2, E_x + δ/2], such that the Hamming distance d_{x,y} between x and y is O(N). In the presence of a spin glass type energy landscape, finding such states y using a classical algorithm takes exponential resources. The QEPT procedure suggests solving the above computational task as follows.
1. Prepare the system in the initial state |x⟩.

2. Evolve the system for a time T under a Hamiltonian combining the cost function with a transverse-field driver.

3. Measure in the computational basis to obtain a candidate bitstring y.

In general, we would expect that T will scale exponentially in order for the procedure to succeed with fixed probability. However, for the worst case scenario when there are M states with energy −1 and 2^N − M states of energy 0, the work of [11] was able to show that this procedure succeeds with high probability for T = O(√(2^N/M)), which is the same as the Grover scaling. However, unlike Grover, this protocol does not require any fine tuning of the transverse field or computation time. The procedure has also been shown empirically to produce similar results for the random energy model (where each bit string has a totally random energy). The suggestion is to use this algorithm heuristically by simply choosing T, as well as the accuracy with which we implement the time evolution, heuristically. As with the adiabatic algorithm, this essentially corresponds to the number of steps that we take in either a product formula or quantum walk approach to simulating the evolution. We propose that when using the quantum walk form of the algorithm, one not use signal processing and instead perform population transfer directly on the quantum walk. We note that since the norms of the problem and driver Hamiltonians are similar in magnitude, there would be no advantage to performing the simulation in the interaction picture, and so an approach based on qubitization is likely the best LCU-style algorithm for QEPT.