What is the Computational Value of Finite Range Tunneling?

Quantum annealing (QA) has been proposed as a quantum enhanced optimization heuristic exploiting tunneling. Here, we demonstrate how finite range tunneling can provide considerable computational advantage. For a crafted problem designed to have tall and narrow energy barriers separating local minima, the D-Wave 2X quantum annealer achieves significant runtime advantages relative to Simulated Annealing (SA). For instances with 945 variables, this results in a time-to-99%-success-probability that is $\sim 10^8$ times faster than SA running on a single processor core. We also compared physical QA with Quantum Monte Carlo (QMC), an algorithm that emulates quantum tunneling on classical processors. We observe a substantial constant overhead against physical QA: D-Wave 2X again runs up to $\sim 10^8$ times faster than an optimized implementation of QMC on a single core. We note that there exist heuristic classical algorithms that can solve most instances of Chimera structured problems in a timescale comparable to the D-Wave 2X. However, we believe that such solvers will become ineffective for the next generation of annealers currently being designed. To investigate whether finite range tunneling will also confer an advantage for problems of practical interest, we conduct numerical studies on binary optimization problems that cannot yet be represented on quantum hardware. For random instances of the number partitioning problem, we find numerically that QMC, as well as other algorithms designed to simulate QA, scale better than SA. We discuss the implications of these findings for the design of next generation quantum annealers.


I. INTRODUCTION
Simulated annealing (SA) [1] is perhaps the most widely used algorithm for global optimization of pseudo-Boolean functions with little known structure. The objective function for this general class of problems is

$$H^{\mathrm{cl}}_P(s) = \sum_{k=1}^{K} \; \sum_{1 \le j_1 < \cdots < j_k \le N} J_{j_1 \ldots j_k}\, s_{j_1} \cdots s_{j_k}, \qquad (1)$$

where $N$ is the problem size, $s_j = \pm 1$ are spin variables and the couplings $J_{j_1 \ldots j_k}$ are real scalars. In the physics literature $H^{\mathrm{cl}}_P(s)$ is known as the Hamiltonian of a $K$-spin model. SA is a Monte Carlo algorithm designed to mimic the thermalization dynamics of a system in contact with a slowly cooling reservoir. When the temperature is high, SA induces thermal excitations that can allow the system to escape from local minima. As the temperature decreases, SA drives the system towards nearby low energy spin configurations.
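To make the SA procedure concrete, a minimal sketch for the pairwise ($K = 2$) case of this objective is given below. The linear inverse-temperature schedule and all parameter values here are illustrative assumptions, not the tuned settings used in the benchmarks of Sec. II.

```python
import math
import random

def simulated_annealing(J, h, n_sweeps=1000, beta_start=0.1, beta_end=3.0, seed=0):
    """Metropolis SA for H(s) = -sum_{j<k} J[j][k] s_j s_k - sum_j h[j] s_j."""
    rng = random.Random(seed)
    n = len(h)
    s = [rng.choice([-1, 1]) for _ in range(n)]
    for sweep in range(n_sweeps):
        # Illustrative schedule: inverse temperature grows linearly with sweep index.
        beta = beta_start + (beta_end - beta_start) * sweep / max(1, n_sweeps - 1)
        for j in range(n):
            # Local field on spin j; flipping s_j changes the energy by 2*s_j*field.
            field = h[j] + sum(J[j][k] * s[k] for k in range(n) if k != j)
            dE = 2.0 * s[j] * field
            if dE <= 0 or rng.random() < math.exp(-beta * dE):
                s[j] = -s[j]
    return s
```

A single sweep attempts one update per spin; restarting from fresh random configurations (as done in the benchmarks below) is embarrassingly parallel.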
Almost two decades ago, quantum annealing (QA) [2] was proposed as a heuristic technique for quantum enhanced optimization. Despite substantial academic and industrial interest, a unified understanding of the physics of quantum annealing and its potential as an optimization algorithm remains elusive. The appeal of QA relative to SA is due to the observation that quantum mechanics allows for an additional escape route from local minima. While SA must climb over energy barriers to escape false traps, QA can penetrate these barriers without any increase in energy. This effect is a hallmark of quantum mechanics, known as quantum tunneling. The standard time-dependent Hamiltonian used for QA is

$$H(t) = -A(t) \sum_{j=1}^{N} \sigma^x_j + B(t)\, H_P, \qquad (2)$$

where $H_P$ is written as in Eq. (1) but with the spin variables $s_j$ replaced with Pauli matrices $\sigma^z_j$ acting on qubit $j$, and the functions $A(t)$ and $B(t)$ define the annealing schedule parameterized in terms of time $t \in [0, T_{\rm QA}]$ (see Fig. 7). These annealing schedules can be defined in many different ways as long as the functions are smooth, $A(0) \gg B(0)$ and $A(T_{\rm QA}) \ll B(T_{\rm QA})$. At the beginning of the annealing, the transverse field magnitude $A(t)$ is large, and the system dynamics are dominated by quantum fluctuations due to tunneling (as opposed to the thermal fluctuations used in SA).
The question of whether D-Wave processors realize computationally relevant quantum tunneling has been a subject of substantial debate. This debate has now been settled in the affirmative with a sequence of publications [6,8,9,12,13,18,21] demonstrating that quantum resources are present and functional in the processors. In particular, Refs. [35,36] studied the performance of the D-Wave device on problems where eight-qubit cotunneling events [37] were employed in a functional manner to reach low-lying energy solutions.
In order to investigate the computational value of finite range tunneling in QA, we study the scaling of the annealing time with the size of the tunneling domain $D$, which is exponential, $\propto e^{\alpha D}$, where $\alpha = a_{\min}/\hbar$ and $a_{\min}$ is the rescaled instanton action (see Eqs. (21) and (24)). In SA, the system escapes from a local minimum via thermal fluctuations over an energy barrier $\Delta E$ separating the minima. The time required for such events scales as $e^{\Delta E / k_B T}$, which is exponentially long with respect to $\Delta E$. However, for sufficiently tall and narrow barriers, such that $\Delta E / k_B T > \alpha D$, QA can overcome barriers exponentially faster than SA. This situation was studied in Ref. [38] and it also occurs in the benchmark problems studied in this paper. Path integral Quantum Monte Carlo (QMC) is a method for sampling the quantum Boltzmann distribution of a $d$-dimensional stoquastic Hamiltonian as a marginalization of a classical Boltzmann distribution of an associated $(d+1)$-dimensional Hamiltonian. For specific cases, it was recently shown that the exponent $\alpha$ for physical tunneling is identical to the corresponding quantity for QMC [39]. However, in the present work we find a very substantial computational overhead associated with the prefactor $B_{\rm QMC}$ in the expression for the runtime of QMC, $T_{\rm QMC} = B_{\rm QMC}\, e^{\alpha D}$. In other words, $B_{\rm QMC}$ can exceed $B_{\rm QA}$ by many orders of magnitude. The role of this prefactor becomes essential in situations where the number of cotunneling qubits $D$ is finite, i.e., is independent of the problem size $N$ (or depends on $N$ very weakly).
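The competition described above, thermal (Arrhenius-like) escape over a barrier versus finite range tunneling through it, can be illustrated with toy numbers; the values below are made up for illustration and are not device or instance parameters.

```python
import math

def thermal_escape_time(delta_E, kT, prefactor=1.0):
    """Arrhenius-type scaling for SA: escape time ~ exp(delta_E / kT)."""
    return prefactor * math.exp(delta_E / kT)

def tunneling_time(alpha, D, prefactor=1.0):
    """Finite range tunneling scaling for QA: time ~ exp(alpha * D)."""
    return prefactor * math.exp(alpha * D)
```

For a tall, narrow barrier (e.g. $\Delta E / k_B T = 20$ but only $D = 8$ cotunneling spins with $\alpha = 1$), the tunneling timescale is exponentially shorter than the thermal one, which is the regime exploited by the benchmark problems in this paper.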
Because tunneling is more advantageous when energy barriers are tall and narrow, we expect this resource to be most valuable in the upper part of the energy spectrum. For instance, a random initial state is likely to have an energy well above the ground state energy for a difficult optimization problem such as the one in Eq. (1). However, the closest lower energy local minimum will often be less than a dozen spin flips away. Nevertheless, the energy barriers separating these minima may still be high. In such situations, if the transverse field is turned on to facilitate tunneling transitions, the transition rate to lower energy minima will often increase. By contrast, once the state reaches the low part of the energy spectrum, the closest lower local minimum is asymptotically N spin flips away [2,[40][41][42]. There, finite range tunneling may assist by effectively "chopping off" narrow energy ridges near the barrier top but the transition probability is still largely given by the Boltzmann factor. This description illustrates that finite range tunneling can be useful to quickly reach an approximate optimization, but will not necessarily significantly outperform SA when the task is to find the ground state (see Fig. 1).
The canonical QA protocol initializes the system in the symmetric superposition state, $|+\rangle^{\otimes N}$, which is the ground state at $t = 0$. By a similar argument, we expect that finite range tunneling will drive the system adiabatically across energy gaps associated with narrow barriers, preventing transitions to higher energies. However, in general, finite range tunneling will not be able to prevent Landau-Zener diabatic transitions for very small gaps resulting from emerging minima in the energy landscape separated by a wide barrier. This will often include the gap separating the ground state from the first excited state [2,[40][41][42].

FIG. 1. Upper: quantum annealer dynamics are dominated by paths through the mean field energy landscape that have the highest transition probabilities. Shown is a path that connects local minimum A to local minimum D. Lower: the mean field energy V(q(τ)) along the path from A to D, as defined by Eqs. (12) and (15). Finite range, thermally assisted tunneling can be thought of as a transition consisting of three steps: I. The system picks up thermal energy from the bath (red arrow up); II. The system performs a tunneling transition between the entry and exit points (blue arrow); III. The system relaxes to a local minimum by dissipating energy back into the thermal bath (red arrow down). In transitions A → B or B → C finite range tunneling considerably reduces the thermal activation energy needed to overcome the barrier. For long distance transitions in the lower part of the energy spectrum, such as C → D, the transition rate is still dominated by the thermal activation energy and the increase in transition rate brought about by tunneling is negligible.

The paper is organized as follows: In Section II we present our main results, consisting of benchmarking the D-Wave 2X processor against SA and QMC on a crafted problem with a rugged energy landscape; Section III introduces the theory of instantons in multi-spin systems, discusses tunneling simulation in QMC, and presents numerical results comparing QA and QMC for the "weak-strong cluster pair" problem; Section IV presents numerical studies of generic problems with rugged energy landscapes that can potentially benefit from QA; Section V discusses the challenges associated with designing future annealers of practical relevance; and Section VI concludes with an overview and discussion. Further technical details can be found in Appendices A and B.

II. BENCHMARKING PHYSICAL QUANTUM ANNEALING ON A CRAFTED PROBLEM WITH A RUGGED ENERGY LANDSCAPE
In this section, we consider a problem designed so that finite range cotunneling transitions of multiple spins strongly impact the success probability. The previous section outlined several reasons why QA has a chance to outperform SA for problems with a rugged energy landscape.

FIG. 2. A pair of weak-strong clusters, consisting of 16 qubits in two unit cells of the Chimera graph. All qubits are ferromagnetically coupled and evolve as part of two distinct qubit clusters. At the end of the annealing evolution, the right cluster is strongly pinned downwards due to strong local fields acting on all qubits in that cell. However, the local magnetic field $h_1$ in the left cluster is weaker and serves as a bifurcation parameter. For $h_1 = 0.44 < 1/2$ the left cluster will reverse its orientation during the annealing sweep and eventually align itself with the right cluster.
In previous work we proposed a problem consisting of a pair of strongly connected spins (called clusters) to study the presence of functional cotunneling in D-Wave processors [35,36]. Each cluster coincides with a unit cell of the native hardware graph, known as the Chimera graph.
The problem Hamiltonian $H_P$ in Eq. (2) is of Ising form,

$$H_P = -J \sum_{k=1,2}\; \sum_{\langle i,j \rangle \in \mathrm{cell}_k} s^{(k)}_i s^{(k)}_j \;-\; J \sum_{j=5}^{8} s^{(1)}_j s^{(2)}_j \;-\; \sum_{k=1,2} h_k \sum_{j=1}^{8} s^{(k)}_j. \qquad (7)$$

All the couplings are ferromagnetic with $J = 1$. The index $k \in \{1, 2\}$ indexes a unit cell of the Chimera graph, the first sum in (7) goes over the intra-cell couplings depicted in Fig. 2, the second sum goes over the inter-cell couplings corresponding to $j = 5, 6, 7, 8$ in Fig. 2, and $h_k$ denotes the local fields within each cell. The local fields $0 < h_1 < 0.44$ (weak cluster) and $h_2 = -1$ (strong cluster) are equal for all the spins within a cell. In this parameter regime, all spins of both clusters will point along the direction of the strong local fields in the ground state of the problem Hamiltonian $H_P$. However, in the initial phase of the annealing evolution, all spins in the weak cluster orient themselves by following the local field in the opposite direction. At a later stage of the annealing evolution, the pairwise coupling between clusters becomes dominant; however, in the mean field picture, the state of the weak cluster is driven into a local minimum. Using a noise model with experimentally measured parameters for the D-Wave 2X processor, we numerically verify that the most likely mechanism by which all spins arrive at the energetically more favorable configuration is multi-qubit cotunneling (see Ref. [35] and Appendix A).
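As an aid to the description above, the following sketch builds the couplings and fields of a single weak-strong cluster pair. The 0-based qubit labeling and the sign convention $H_P = -\sum_{ij} J_{ij} s_i s_j - \sum_i h_i s_i$ are our assumptions for illustration only.

```python
def weak_strong_pair(h1=0.44, h2=-1.0, J=1.0):
    """Couplings and local fields for a 16-qubit weak-strong cluster pair.

    Qubits 0-7 form the weak cell, qubits 8-15 the strong cell. Within each
    K_{4,4} Chimera unit cell, qubits 0-3 couple to qubits 4-7. The two
    cells are joined through their second-partition qubits (j = 5..8 in the
    1-based labeling of Fig. 2). All couplings are ferromagnetic (J = +1
    with the convention H_P = -sum J_ij s_i s_j - sum h_i s_i).
    """
    couplings = {}
    for cell in (0, 8):  # offsets of the two unit cells
        for a in range(4):
            for b in range(4, 8):
                couplings[(cell + a, cell + b)] = J  # intra-cell K_{4,4} edges
    for j in range(4, 8):
        couplings[(j, 8 + j)] = J  # the four inter-cell links
    fields = [h1] * 8 + [h2] * 8  # weak cell, then strong cell
    return couplings, fields
```

Larger "weak-strong cluster network" instances are assembled by wiring the strong cells of many such pairs together with random $\pm 1$ couplings, as described next.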
Using the weak-strong cluster pairs as building blocks, larger problems are formed by connecting the strong clusters to one another in a glassy fashion. That is, the four connections between two neighboring strong clusters are all set either to +1 (ferromagnetic) or −1 (antiferromagnetic), at random. With this procedure, we define a large class of instances of varying size, which we refer to as the "weak-strong cluster networks" problem. Fig. 3 shows several examples of the layout of instances that were used in these benchmark tests [43]. Because not all of the qubits in the D-Wave 2X processor are properly calibrated, and hence available for computation, the instances are somewhat irregular.

A. D-Wave versus Simulated Annealing
We now compare the total annealing time of SA to the total annealing time of the D-Wave 2X processor. Fig. 4 shows the time to reach the ground state with 99% success probability, as a function of problem size for different quantiles.
For D-Wave, we fix the annealing time at 20 µs, the shortest time available due to engineering compromises, and estimate the probability p j of finding the ground state for a given instance j [44]. Shorter times are optimal in this benchmark, as explained in Appendix A and Ref. [35]. The total annealing time to achieve p j = 0.99 is 20 µs × log(1 − 0.99)/ log(1 − p j ). See Appendix A for details of the physical QA parameters (see also [45]).
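The repetition bookkeeping in this formula can be computed directly; the sketch below just evaluates the time-to-99%-success expression from the text, with the 20 µs anneal time as the default.

```python
import math

def time_to_99(p_single, t_anneal=20e-6):
    """Total annealing time (seconds) to reach 99% success probability.

    p_single is the per-run ground-state probability; the required number
    of repetitions is log(1 - 0.99) / log(1 - p_single).
    """
    n_runs = math.log(1 - 0.99) / math.log(1 - p_single)
    return n_runs * t_anneal
```

For example, an instance solved with probability 0.5 per anneal needs roughly $\log 0.01 / \log 0.5 \approx 6.6$ repetitions of the 20 µs anneal.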
We measure the computational effort of SA in units of runtime on a single core, which is natural when using server centers and convenient in order to compare to other classical approaches. Of course, the total runtime can be shortened by parallelizing the overall computation. Part of that process is embarrassingly parallel since SA finds its best solutions not in a single run, but by restarting from random bit configurations. Every restart could be executed on a different core. In fact, we used this strategy for our numerical benchmark. We find that, on average, median-case instances with 945 variables required $10^9$ independent runs (with $945 \times 5 \times 10^4$ spin updates each) to find the optimal solution.
We note that for instances with more qubits, more quantum hardware resources are brought to bear, and therefore a fair comparison needs to take this into account [12,15]. As an extreme example, one could contemplate building special purpose classical hardware that would update as many spins in parallel as possible at state-of-the-art clock rates. The sets of spins that could be updated in parallel depend on the connectivity graph. Though such considerations are reasonable, we do not explore this possibility here, as we believe that future quantum annealers will have higher connectivity, which will severely limit the usefulness of such parallelism.
When estimating runtimes for SA, we follow the protocol laid out in Refs. [12,15,46] and tune SA for every problem size and quantile. Tuning means that the starting and end temperatures, as well as the number of spin updates and the number of restarts, are optimized to achieve a short overall runtime. We first measure the computational effort in units of sweeps (one sweep attempts to update all the spins). The runtime plotted is $n_{\rm sweeps} \times N \times T_{\rm su}$, where $N$ is the number of spins. We used a spin update time $T_{\rm su} = 1/5$ ns (see Ref. [12]).
Our key finding in this comparison is that SA performs very poorly on the weak-strong cluster networks. The D-Wave 2X processor is $1.8 \times 10^8$ times faster at the largest size instances we investigated, which consisted of 945 variables. This problem was specifically engineered to cause the failure of SA: as explained above, the "weak-strong cluster networks" problem is intended to showcase the performance of annealers on a problem that benefits from finite range cotunneling. By contrast, the random Ising instances studied in Refs. [12,15] have only low energy barriers, as explained in Ref. [47].

FIG. 4. To assign a runtime for the classical algorithms we take the number of spin updates (for SA) or worldline updates (for QMC) that are required to reach a 99% success probability and multiply that with the time to perform one update on a single state-of-the-art core. Shown are the 50th, 75th and 85th percentiles over a set of 100 instances. The error bars represent 95% confidence intervals from bootstrapping. This experiment occupied millions of processor cores for several days to tune and run the classical algorithms for these benchmarks. The runtimes for the higher quantiles for the larger problem sizes for QMC were not computed because the computational cost was too high. For a similar comparison with QMC with different parameters please see Fig. 13.

B. D-Wave versus Quantum Monte Carlo
Next we compared the performance of path integral Quantum Monte Carlo (QMC) with that of D-Wave for the same benchmark. QMC samples the Boltzmann distribution of a classical Hamiltonian which approximates the transverse field Ising model. In the case of a 2-spin model, the discrete imaginary time QMC classical Hamiltonian is

$$H_{\rm QMC} = -\sum_{\tau=1}^{M} \Big( \Delta\tau\, B(t) \sum_{j<k} J_{jk}\, \sigma_j(\tau)\, \sigma_k(\tau) + J_{+} \sum_{j} \sigma_j(\tau)\, \sigma_j(\tau+1) \Big), \qquad (9)$$

where $\sigma_j(\tau) = \pm 1$ are classical spins, $j$ and $k$ are site indices, $\tau$ is a replica index, $\Delta\tau = \beta/M$, and $M$ is the number of replicas. The coupling between replicas is given by

$$J_{+} = -\tfrac{1}{2} \ln \tanh\big( \Delta\tau\, A(t) \big), \qquad (10)$$

where $\beta$ is the inverse temperature. The configuration for a given spin $j$ across all replicas $\tau$ is called the worldline of spin $j$. Periodic boundary conditions are imposed between $\sigma_j(M)$ and $\sigma_j(1)$. We used continuous path integral QMC, which corresponds to the limit $\Delta\tau \to 0$ [48] and, unlike discrete path integral QMC, does not suffer from discretization errors of order $1/M$. We numerically compute the number of sweeps $n_{\rm sweeps}$ required for QMC to find the ground state with 99% probability at different quantiles. In our case, a sweep corresponds to two update attempts for each worldline. The computational effort is $n_{\rm sweeps} \times N \times T_{\rm worldline}$, where $N$ is the number of qubits and $T_{\rm worldline}$ is the time to update a worldline. We average $T_{\rm worldline}$ over all the steps in the quantum annealing schedule; however, the value of $T_{\rm worldline}$ depends on the particular schedule chosen. As explained above for SA, we report the total computational effort of QMC in standard units of time per single core. For the annealing schedule used in the current D-Wave 2X processor, we find

$$T_{\rm worldline} = \beta \times 870~\mathrm{ns} \qquad (11)$$

using an Intel(R) Xeon(R) CPU E5-1650 @ 3.20GHz. This study is designed to explore the utility of QMC as a classical optimization routine. Accordingly, we optimize QMC by running at a low temperature, 4.8 mK.
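For readers unfamiliar with the method, a minimal discrete-time sketch of the QMC update described above is given below. It uses single-spin Metropolis updates at one fixed point $(A, B)$ of the schedule, whereas the production code behind our benchmarks uses continuous-time worldline updates; this is therefore a structural illustration only, with illustrative parameter choices.

```python
import math
import random

def qmc_sweep(spins, J, h, A, B, beta, rng):
    """One Metropolis sweep of discrete-time path-integral QMC.

    spins[tau][j] = +/-1 is replica tau of spin j (M replicas, periodic
    along imaginary time). The effective classical energy is
        -sum_tau ( dtau*B * [sum_{j<k} J[j][k] s_j(tau) s_k(tau)
                             + sum_j h[j] s_j(tau)]
                   + J_plus * sum_j s_j(tau) s_j(tau+1) ),
    with dtau = beta/M and replica coupling J_plus = -0.5*ln(tanh(dtau*A)).
    """
    M, n = len(spins), len(spins[0])
    dtau = beta / M
    J_plus = -0.5 * math.log(math.tanh(dtau * A))
    for tau in range(M):
        up, down = spins[(tau + 1) % M], spins[(tau - 1) % M]
        for j in range(n):
            s = spins[tau][j]
            # Field from problem couplings within the same replica ...
            field = dtau * B * (h[j] + sum(J[j][k] * spins[tau][k]
                                           for k in range(n) if k != j))
            # ... plus the ferromagnetic coupling along imaginary time.
            field += J_plus * (up[j] + down[j])
            dE = 2.0 * s * field
            if dE <= 0 or rng.random() < math.exp(-dE):
                spins[tau][j] = -s
    return spins
```

The ferromagnetic replica coupling $J_+$ is what lets worldlines "bend" collectively, which is how QMC emulates a tunneling event on classical hardware.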
We also observe that QMC with open boundary conditions (OBC) along the imaginary time axis performs better than standard QMC with periodic boundary conditions in this case [39]; therefore, OBC is used in this comparison. We further optimize the number of sweeps per run which, for a given quantile, results in the lowest total computational effort. We find that the optimal number of sweeps for the 50th percentile at the largest problem size is $10^6$. This enhances the ability of QMC to simulate quantum tunneling, and gives a high probability of success per run in the median case, $p_{\rm success} = 0.16$.
All the qubits in a cluster have approximately the same orientation in each local minimum of the effective mean field potential. Neighboring local minima typically correspond to different orientations of a single cluster. Here, the tunneling time is dominated by a single purely imaginary instanton and is described by Eq. (24) below. It was recently demonstrated that, in this situation, the exponent $a_{\min}/\hbar$ for physical tunneling is identical to that of QMC [39]. As seen in Fig. 4, we do not find a substantial difference in the scaling of QMC and D-Wave (QA). However, we find a very substantial computational overhead associated with the prefactor $B$ in the expression $T = B\, e^{D a_{\min}/\hbar}$ for the runtime. In other words, $B_{\rm QMC}$ can exceed $B_{\rm QA}$ by many orders of magnitude. The role of the prefactor becomes essential in situations where the number of cotunneling qubits $D$ is finite, i.e., is independent of the problem size $N$ (or depends on $N$ very weakly). Between some quantiles and system sizes we observe a prefactor advantage as high as $10^8$.

C. D-Wave versus other Classical Solvers
Based on the results presented here, one cannot claim a quantum speedup for D-Wave 2X, as this would require that the quantum processor in question outperform the best known classical algorithm. This is not the case for the weak-strong cluster networks: a variety of heuristic classical algorithms can solve most instances of Chimera structured problems much faster than SA, QMC, and the D-Wave 2X [49][50][51] (for a possible exception see [25,52]) [53]. For instance, the Hamze-de Freitas-Selby algorithm [49,50] performs a greedy sequence of random large-neighborhood optimizations. Each large neighborhood is defined by first replacing each 4-qubit column in a cluster with a large spin, and then expanding a tree of large spins which covers ∼ 77% of all spins (and more than half of all 8-qubit clusters). In the particular case of a weak-strong cluster pair, this algorithm avoids the formation of the energy barrier.
However, we believe that such solvers will become ineffective for the next generation of annealers currently being designed. The primary motivation for this optimism is that the connectivity graphs of future annealers will have higher degree. For example, we believe that a 10-dimensional lattice is an engineerable architecture. For such dense graphs, we expect that cluster finding will become too costly. By contrast, multi-qubit cotunneling is a general phenomenon in spin glasses that is not limited to sparse graphs.
We have also learned from the Janus team, working with special purpose FPGAs to thermalize Ising models on a $32^3$ cube, that they found cluster finding not to be helpful [54].

A Remark on Scaling
Certain quantum algorithms have asymptotically better scaling than the best known classical algorithms. While such asymptotic behavior can be of tremendous value, this is not always the case. Moreover, there can be substantial value in quantum algorithms which do not show asymptotically better scaling than classical approaches. The first reason for this is that current quantum hardware is restricted to rather modest problem sizes of less than order 1000 qubits. At the same time, when performing numerical simulations of quantum dynamics, in particular when doing open system calculations, we are often limited to problem sizes smaller than 100 qubits. Extrapolating from such finite size findings can be misleading as it is often difficult to determine whether a computational approach has reached its asymptotic behavior.
When forecasting the future promise of a given hardware design, there is a tendency to focus on scaling with the number of qubits. However, this perspective is not necessarily helpful. For instance, when looking at runtime as a function of the number of qubits, one may conclude that the measurements reported here indicate that QMC and physical annealing have a comparable slope, i.e., scale similarly, and that therefore the upside of physical annealing is bounded. However, the large and practically important prefactor depends on a number of factors such as temperature. Furthermore, we expect future hardware to have substantially better control precision, substantially richer connectivity graphs and dramatically improved $T_1$ and $T_2$ times. With such changes, next generation annealers may drastically increase the constant separation between algorithms, leading to very different performance from generation to generation. To illustrate how dramatic this effect can be, when we ran smaller instances of the weak-strong cluster networks on the older D-Wave Vesuvius chips, we predicted that at 1000 variables D-Wave would be $10^4$ times faster than SA. In fact, we observed a speedup of more than a factor of $10^8$. This is because certain noise parameters were improved and the new dilution refrigerator operates at a lower temperature. Similarly, we suspect that a number of previous attempts to extrapolate D-Wave runtimes to 1000 qubits will turn out to be of limited use in forecasting the performance of future devices. For this reason, the current study focuses on runtime ratios that were actually measured on the largest instances solvable using the current device, rather than on extrapolations of asymptotic behavior which may not be relevant once we have devices that can attempt larger problems.

III. SPIN COTUNNELING IN QA AND QMC
A. Instantons in systems with multiple spins

Cotunneling consists of system state transitions in which a group of spins simultaneously change their orientation with energy well below the energy of the (mean field) potential barriers. Tunneling is a quintessential quantum phenomenon; real time dynamics of classical trajectories cannot describe barrier penetration when the system wavefunction extends to classically forbidden regions. In such situations, the exponential decay of the wavefunction under the barrier is often captured through the path integral formalism by computing the minimum action of the trajectories in imaginary time [55,56]. This approach was also extended to treat the tunneling of large magnetic moments with conserved total spin [57].
Tunneling in mean field spin models can be described using the path integral over spin-coherent states in imaginary time [58]. The tunneling path connects the minima of the mean field potential

$$V(\mathbf{m}, t) = \langle \Psi_{\mathbf{m}} | H(t) | \Psi_{\mathbf{m}} \rangle, \qquad (12)$$

where $H(t)$ is the time-dependent QA Hamiltonian from Eq. (2) and $|\Psi_{\mathbf{m}}\rangle$ is a product state

$$|\Psi_{\mathbf{m}}\rangle = \bigotimes_{j=1}^{N} |\mathbf{n}_j\rangle. \qquad (13)$$

The coherent state $|\mathbf{n}_j\rangle$ of the $j$-th spin is defined by a vector $\mathbf{n}_j$ on the Bloch sphere, and the corresponding state of the $N$-spin system is defined by the vector $\mathbf{m} = (\mathbf{n}_1, \mathbf{n}_2, \ldots, \mathbf{n}_N)$.
Towards the beginning of a QA evolution, the system remains near $\mathbf{m}_0(t)$, the global minimum of the time-dependent potential $V(\mathbf{m}, t)$, which connects to the global minimum at the initial time. Later on, $V(\mathbf{m}, t)$ undergoes a bifurcation, which may cause the initial minimum to become metastable. At that point, the system may be able to tunnel to the new global minimum $\mathbf{m}_1$. Here we omit the argument $t$, whose value corresponds approximately to the moment when the minima exchange order. Such tunneling events are sometimes accompanied by thermal activation if QA is performed at finite temperatures (i.e. thermally assisted tunneling) [38]. The sequence of bifurcations and associated tunneling events can continue multiple times before the global minimum of $H_P$ is reached.
Mathematically, the tunneling process can be described as an evolution of the spin system along the instanton path

$$q(\tau) = (\mathbf{n}_1(\tau), \mathbf{n}_2(\tau), \ldots, \mathbf{n}_N(\tau)) \qquad (15)$$

over imaginary time $\tau \in (0, \beta)$, where $\beta = \hbar/k_B T$ and $T$ is the system temperature. The details of this analysis will be provided elsewhere [59]. Here, we simply outline the main argument. The path component for the $j$-th spin is described by a vector $\mathbf{n}_j(\tau)$ with periodic boundary conditions $\mathbf{n}_j(0) = \mathbf{n}_j(\beta)$. The vector $\mathbf{n}_j(\tau)$ is complex (see Eqs. (67), (68) in the Supplementary Material of Ref. [39]) and can be written in the form

$$\mathbf{n}_j = (\sin\theta_j \cosh\varphi_j,\; -i \sin\theta_j \sinh\varphi_j,\; \cos\theta_j),$$

corresponding to a purely imaginary azimuthal angle $\phi_j(\tau) = -i\varphi_j(\tau)$. The initial point of the instanton $q(0)$ corresponds to the initial minimum of $V(\mathbf{m})$. The midpoint of the trajectory $q(\beta/2)$ typically corresponds to the exit point of the potential barrier in the vicinity of the final minimum.

We first consider the simplified situation where a domain of $D$ spins tunnels simultaneously. The total spin of the tunneling domain is conserved and all spins in the domain move identically through the instanton trajectory, $\mathbf{n}_j(\tau) \equiv \mathbf{n}(\tau)$ for all $j \in [1, D]$. Then, the mean field potential for the instanton can be rescaled as $V(q(\tau)) = D\, \upsilon(\mathbf{n}(\tau))$.
In the limit of low temperatures, $\beta \to \infty$, the instanton action takes the form $S[\mathbf{n}(\tau)] = D\, a[\mathbf{n}(\tau)]$, where the rescaled action $a[\mathbf{n}(\tau)]$ contains, in addition to the potential term $\upsilon(\mathbf{n}(\tau))$, a "Berry phase" type contribution $\omega$. The exponential dependence of the tunneling rate,

$$\Gamma \propto e^{-D a_{\min}/\hbar}, \qquad (21)$$

is given by the minimum $a_{\min}$ of the rescaled action. We now consider the case when the tunneling of the spin domain is enabled by the transverse field and the Hamiltonian is of the type in Eq. (2), where $H^{\rm cl}_P(s_1, \ldots, s_D)$ is a classical cost function of binary variables $s_k = \pm 1$. Assuming that $H^{\rm cl}_P = H^{\rm cl}_P\big(\sum_{j=1}^D \sigma^z_j\big)$ and the system is in a state of maximum total spin, all spins will tunnel together. Thus,

$$\upsilon(\mathbf{n}(\tau)) = -A \sin\theta(\tau) \cosh\varphi(\tau) - B\, g(\cos\theta(\tau)), \qquad (23)$$

where the function $g(x) = D^{-1} H^{\rm cl}_P(Dx)$ is the rescaled mean field potential energy of the spin system. At zero temperature, the minimum action can be written in a more familiar form (24) (cf. Supplementary Material in Ref. [39] and also Ref. [60]), in which

$$v(\theta) = \sqrt{B^2 \big( g(\cos\theta) - g(\cos\theta_0) \big)^2 - A^2 \sin^2\theta}$$

is a linear velocity along the instanton path and the angles $\theta_i$ correspond to the minima of the potential $\upsilon(\mathbf{n})$ (the values of $\sinh\varphi_j = 0$ at the minima). In general, in mean field spin models, the total spin is not conserved throughout the instanton trajectory (15). The corresponding action [58] can be written in a form that is a direct generalization of the simplified case (19), where $\omega$ is given in Eq. (20). If the azimuthal angles are purely imaginary, as discussed above, the action is real.
The dominant contribution to the path integral is given by the instanton path with the least action, $S_{\min} = \min_{q(\tau)} S[q(\tau)]$. In general, the action in Eq. (26) also grows linearly with the number of cotunneling spins, as in the simplified case of Eq. (21), because the individual spin contributions to the action are highly correlated along the instanton trajectory. Accordingly, we may define $S_{\min} = D\, a_{\min}$.

B. Tunneling simulation in QMC
In the path integral QMC algorithm, one introduces an extra dimension associated with the imaginary time axis in order to simulate the multispin tunneling phenomenon on a classical computer, as seen above. This is done by using a Suzuki-Trotter decomposition and representing the partition function of the system $Z$ in terms of the path integral over the spin trajectories $\sigma(\tau) = \{\sigma_j(\tau)\}_{j=1}^{N}$, see Eq. (9). These trajectories are periodic along the imaginary time axis, $\sigma_j(0) = \sigma_j(\beta)$. For each spin $j$, the set of values $\sigma_j(\tau) = \pm 1$ forms a path component referred to as a worldline. The time step along the worldline is $\Delta\tau = \beta/M$, where $M$ is the number of Trotter slices (i.e. the number of spin replicas in the worldline). Sampling the system states along this extra dimension introduces an additional overhead in classical computation that does not exist for the corresponding quantum dynamics.
The runtime $T_{\rm QMC}$ of QMC can be thought of as a product of three factors,

$$T_{\rm QMC} = N\, n_{\rm sweeps}\, T_{\rm worldline}, \qquad (27)$$

where $N$ is the problem size, equal to the number of worldlines. The number of sweeps $n_{\rm sweeps}$, in general, depends exponentially on the typical size $D$ of the cotunneling domain, $n_{\rm sweeps} \propto e^{\alpha D}$, where $\alpha = \alpha(\beta)$ also depends on the inverse temperature $\beta$. In cases where $D = O(N)$, the growth of this factor with $N$ reflects a major computational bottleneck of QMC and QA. As was shown recently by some of the authors [39], the exponent in QMC and QA is the same for a broad class of problems.
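The decomposition above can be written as a one-line cost model. The prefactor and the schematic form $n_{\rm sweeps} = \text{prefactor} \cdot e^{\alpha D}$ are illustrative assumptions; the per-worldline time uses the measured value $T_{\rm worldline} = \beta \times 870$ ns quoted in Sec. II B for the reference core.

```python
import math

def qmc_runtime(N, D, alpha, beta, prefactor=1.0, t_per_beta=870e-9):
    """Schematic cost model T_QMC = N * n_sweeps * T_worldline,
    with n_sweeps = prefactor * exp(alpha * D) and
    T_worldline = beta * 870 ns on the reference core."""
    n_sweeps = prefactor * math.exp(alpha * D)
    return N * n_sweeps * (beta * t_per_beta)
```

The model makes explicit that for fixed domain size $D$ the exponential factor is a constant, so the comparison between QMC and physical QA is then dominated by the prefactors, as discussed next.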
According to the findings reported in this paper, the prefactor in $n_{\rm sweeps}$, along with the factors $N$ and $T_{\rm worldline}$ in $T_{\rm QMC}$, are significantly different for QMC and QA. The value of $T_{\rm worldline}$ found in our simulations is given in Eq. (B1). We also find that $n_{\rm sweeps} \gg T_{\rm QA}/(1~\mathrm{ns})$, where $T_{\rm QA}$ is the duration of QA and we use a normalization factor of one nanosecond, corresponding to the typical time scale of superconducting QA devices. We expect that the above relation will remain true even if the scaling of both quantities with $D a_{\min}/\hbar$ is the same.
C. Comparison of QA and QMC for the "weak-strong cluster pair" problem

We use the modeling considerations described above to compare QA and QMC on the system corresponding to the weak-strong cluster pair problem (see discussion in Sec. II, Fig. 2, and Eqs. (6), (7), (8)). Tunneling in this system corresponds to an avoided crossing between the two lowest energy levels of the Hamiltonian, shown in Fig. 5. All other levels lie high above the first two and are never excited. During tunneling, the total spin of the left cluster reverses orientation. The number of cotunneling spins is $D = 8$ in this case, while the total number of spins is $N = 16$. Proper analysis of the tunneling probabilities and related QA success rates should also account for coupling to the environment. We studied the success probabilities in QA for the 2-cluster problem using the approach developed in [35]. The results are summarized in Fig. 11 of App. A. Decreasing the temperature of QA relative to the temperature of the D-Wave device steeply suppresses the transition rates between the states because of the increase in the reconfiguration energy [35,61,62] (see Fig. 10). This, together with the suppression from the Boltzmann factor in the transition rates, leads to an increase of the final success probability $p_0$ to find the ground state. Once the temperature reaches 5 mK, the success probability stays above 90%, even for QA schedules faster than $T_{\rm QA} = 300$ nanoseconds. On the other hand, adiabatic transitions near the avoided crossing are suppressed even at $T_{\rm QA} \simeq 71$ nanoseconds, as can be seen from solutions to the time-dependent Schrödinger equation, shown in Fig. 12.
In previous studies involving D-Wave devices (see [35] for references), it was inferred from the data that the low frequency noise components of the spectral density, which provide the leading contribution to the qubit line-width $W$ (Eqs. (A1), (A2)), have effective frequency cutoffs much below 314 MHz (15 mK). In the current QA schedule of 20 µs, the system spends only a small fraction of this time in the vicinity of the avoided crossing where thermal excitations from the ground state are possible. For a QA schedule duration of ∼100 ns, we expect that the effective noise strength will be weaker than at the current schedule, leading to further suppression of thermal excitations from the ground state.
To compare simulations of QA at a temperature of 5 mK with the QMC performance, we choose a duration of QA such that the probability of reaching the ground state at the end of QA equals 0.95. As discussed above, this can be achieved at $T_{\rm QA} \approx 71$ ns. In setting up the QMC simulations, our objective is to select the two parameters, the number of sweeps per qubit $n_{\rm sweeps}$ and the inverse temperature $\beta$, to minimize $T_{\rm QMC}$ for a given probability of success $p_0$ of finding the system in the ground state (all spins pointing down) at the end of the QA. Essentially, we need to minimize the product $\beta\, n_{\rm sweeps}$ while keeping $p_0$ fixed. In Fig. 6 we plot the QMC success probability $p_0 = p_0(\beta, n_{\rm sweeps})$ as a function of $\beta$ for different numbers of sweeps. We see that increasing $\beta$ increases $p_0$; however, the success probability saturates at some value $p_0(\beta, n_{\rm sweeps}) \leq p_0^{\rm sat}(n_{\rm sweeps})$, which itself depends on the number of sweeps. The saturation probability $p_0^{\rm sat}(n_{\rm sweeps})$ is plotted in the inset of Fig. 6. By fixing the success probability $p_0 = 0.95$, we select the optimal number of sweeps. Then, from the main plot, we determine the value $\beta_{\rm sat}$ where saturation occurs. A detailed study shows that this procedure gives the minimum value of the product $\beta\, n_{\rm sweeps}$. The optimal values are $n_{\rm sweeps} = 23{,}000$, $\beta = 32.5$.
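The parameter selection described above is a constrained minimization of $\beta\, n_{\rm sweeps}$ at fixed success probability. The sketch below illustrates only the bookkeeping: `p_success` is an invented saturating model, not measured QMC data, and the numerical scales in it are arbitrary.

```python
import math

# Toy success-probability model: saturates in beta, with a saturation level
# that grows with the number of sweeps (made-up functional form).
def p_success(beta, n_sweeps):
    p_sat = 1.0 - math.exp(-n_sweeps / 12_000.0)   # toy saturation curve
    beta_sat = 25.0 + 2e-4 * n_sweeps              # toy saturation scale
    return p_sat * (1.0 - math.exp(-3.0 * beta / beta_sat))

def tune(p_target=0.95, betas=range(5, 201), sweeps=range(1000, 100_001, 1000)):
    """Minimize beta * n_sweeps subject to p_success >= p_target."""
    best = None
    for n in sweeps:
        for b in betas:                    # betas ascend: first hit is cheapest
            if p_success(b, n) >= p_target:
                if best is None or b * n < best[0]:
                    best = (b * n, b, n)
                break
    return best

cost, beta, n_sweeps = tune()
print(f"optimal: n_sweeps={n_sweeps}, beta={beta}, cost={cost}")
```

The same structure, with `p_success` replaced by measured QMC statistics, reproduces the saturate-then-minimize procedure used for Fig. 6.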
The total time to update one worldline with the D-Wave 2X schedule, combined with the total QMC runtime per qubit given by Eq. (27), can be compared with Eq. (29) to yield the estimate $T_{\rm QMC}/(N\, T_{\rm QA}) \sim 10^7$ ($T_{\rm QA} = 71$ ns, $T = 5$ mK). (34) This ratio needs to be multiplied by the number of qubits to obtain the overall speedup factor (e.g., $\sim 10^{10}$ for 1000 qubits). Implementing fast QA schedules or operating flux qubits at 5 mK will require improvements in the control electronics and other elements of the design. Furthermore, readout will need to be made much faster than in the current D-Wave devices. Nonetheless, the estimates presented above emphasize the significant promise of QA compared to QMC when the adiabatic evolution of the system "under the gap" becomes coherent and thermal excitations are suppressed.

IV. NUMERICAL STUDIES OF QUANTUM ANNEALING FOR GENERIC PROBLEMS WITH RUGGED ENERGY LANDSCAPES
Runtime advantages for the quantum processor described above are only valuable if they extend to problems of practical interest. While rather obvious, it may be worth delineating the criteria for problems that are suitable for treatment with a quantum annealer:
1. Solutions to the problem are valuable or interesting.
2. The problem is representable on hardware that can be built in the near future.
3. Quantum annealing offers a runtime advantage.

A. Number Partitioning
A valuable and interesting practical problem is the Number Partitioning Problem (NPP). The NPP is defined as follows: given a set of $N$ positive numbers $(a_1, \ldots, a_N)$, find a partition $P$ of this set into two groups that minimizes the partition residue $E = \left|\sum_{j \in P} a_j - \sum_{j \notin P} a_j\right|$. A partition $P$ can be encoded by Ising spin variables $s_j = \pm 1$: $s_j = +1$ if $j \in P$ and $s_j = -1$ otherwise. Thus, the NPP cost function is $E = |\Omega_s|$, where $\Omega_s = \sum_{j=1}^N a_j s_j$ (35) is a signed partition residue and $s = (s_1, \ldots, s_N)$ is a spin configuration. Number partitioning is also one of Garey and Johnson's six basic NP-hard problems that lie at the heart of the theory of NP-completeness [63]. In studies of the average-case computational complexity of NPP, one usually assumes that $\{a_1, \ldots, a_N\}$ are independent, uniformly distributed random numbers in the interval $[0, 1)$. The average-case complexity of NPP is exponential in $N$ when the number of bits $b$ used to represent the numbers $a_j$ satisfies the condition $b/N > \kappa_c$. Our focus is on hard instances of NPP, and we study the random NPP ensemble with $b = N$. NPP has many practical applications, including multiprocessor scheduling, the minimization of VLSI circuit size and delay, and public key cryptography (see references in [65]). NPP is attractive for numerical studies because for $b/N > \kappa_c$ the typical runtime of all known algorithms for NPP scales exponentially, with large coefficients in the exponent, and the asymptotic behavior can often already be seen at sizes as low as 20 variables. For our purposes, NPP is a useful problem to study in the context of quantum annealing because it possesses extremely rugged energy landscapes where a single bit flip can result in dramatic energy changes. Its low energy band resembles that of the random energy model, as there is almost no correlation between a state and its energy [66].
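As a concrete reference point, the signed residue of Eq. (35) and the exact minimum residue of a small instance can be computed directly (function names are ours; exhaustive search is only feasible for small N):

```python
import itertools
import random

# Partition residue E = |sum_j a_j s_j| for Ising spins s_j = +/-1, Eq. (35).
def residue(a, s):
    return abs(sum(aj * sj for aj, sj in zip(a, s)))

def brute_force_min(a):
    """Exact minimum residue by exhaustive search over 2^(N-1) partitions."""
    n = len(a)
    best = float("inf")
    for bits in itertools.product((1, -1), repeat=n - 1):
        s = (1,) + bits        # fix s_1 = +1: E is invariant under global flip
        best = min(best, residue(a, s))
    return best

random.seed(0)
a = [random.random() for _ in range(16)]
print("min residue for N=16:", brute_force_min(a))
```

For uniform random instances, the minimum found this way shrinks roughly as $2^{-N}$, consistent with the extreme-value argument below.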
The $2^N$ signed partition residues $\Omega$ can be thought of as drawn from a Gaussian distribution $p(\Omega)$ of zero mean and variance $N \langle a^2 \rangle$, where $\langle a^2 \rangle = \frac{1}{N} \sum_{j=1}^N a_j^2$. The distribution of the cost function values $E = |\Omega|$ is given by $2 p(E)$. By picking a bit string at random, one gets an average value of the cost function $\langle E \rangle = \sqrt{2 \langle a^2 \rangle N / \pi}$. The minimum value of this cost function is exponentially small, with median value $E_{\min} \sim \langle E \rangle\, 2^{-N}$ [65].
An obvious heuristic algorithm for NPP starts by placing the largest number in one of the two subsets. The next largest number is then placed in the set whose elements sum to the smaller value, and this continues until all numbers are assigned. The idea behind this greedy heuristic is to keep the discrepancy small with every decision. The resulting partition residue scales as $O(1/N)$. The differencing method of Karmarkar and Karp [65], also called the KK heuristic, is a polynomial time approximation algorithm. The key idea of this algorithm is to reduce the size of the numbers by replacing the two largest numbers with the absolute value of their difference. It has been proven [67] that the differencing method gives a minimum residue $E^{\rm KK}_{\min}$ that scales as $N^{-\alpha \log N}$ for some constant $\alpha$ (37). The time complexity of both the greedy and KK heuristics is $O(N \log N)$ [65]. The residual energies reached by both methods are much smaller than the average partition residue $\langle E \rangle$, but far greater than the minimum residue $E_{\min}$. The absence of efficient heuristics for these hard cases is a particular feature of NPP. It is attributed to the extremely rugged energy landscape in the low part of the energy spectrum [65]. The statistics of the NPP energy landscape was studied analytically in [68] and numerically in [69].
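Both heuristics are simple enough to sketch. The function names are ours; the KK step follows the description above, committing the two largest numbers to opposite subsets:

```python
import heapq
import random

# Greedy heuristic: assign numbers, largest first, to the lighter subset.
def greedy_residue(a):
    diff = 0  # running difference between the two subset sums
    for x in sorted(a, reverse=True):
        diff += -x if diff > 0 else x
    return abs(diff)

# Karmarkar-Karp differencing: repeatedly replace the two largest numbers
# by the absolute value of their difference.
def kk_residue(a):
    heap = [-x for x in a]          # max-heap via negation
    heapq.heapify(heap)
    while len(heap) > 1:
        x = -heapq.heappop(heap)
        y = -heapq.heappop(heap)
        heapq.heappush(heap, -(x - y))
    return -heap[0]

random.seed(1)
a = [random.random() for _ in range(64)]
print("greedy:", greedy_residue(a))
print("KK    :", kk_residue(a))
```

Both run in $O(N \log N)$; on random instances KK typically reaches far smaller residues than the greedy rule, yet both remain exponentially far from $E_{\min}$.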
This type of landscape leads to the exponential complexity of QA for NPP that was obtained in [68] via direct solution of the time-dependent Schrödinger equation. We observe that the particularly challenging instances of NPP violate the second condition of our "suitability" criteria, as the numbers $a_j \in \{0, \ldots, 2^N - 1\}$ should be drawn from a set whose cardinality grows exponentially with $N$. This translates into a requirement that the bit precision of the coupling coefficients $J_{jk}$ grow with $N$ if one were to express NPP as a quadratic binary optimization problem with objective function $\sum_{j,k=1}^N a_j a_k s_j s_k$, corresponding to the form (1) with $K = 2$. However, for numerical studies this is not a concern.
The runtime behavior of SA, QMC and QA, simulated using the Schrödinger equation or other methods, is shown in Table I. The relative performance of these algorithms applied to NPP is similar to their relative performance on the weak-strong cluster networks problem. QMC scales like QA and both scale better than SA. We believe this is the case because both problems are characterized by rugged energy landscapes for which tunneling transitions are a more useful way to reach low energy states than thermal transitions. In Table I we also give pointers to some more efficient algorithms that achieve asymptotically better performance by exploiting existing problem structure.
To achieve such scaling behavior, it will be necessary for the size of the domains of cotunneling qubits to grow with the problem size. However, it is interesting to explore the problem from a different perspective and ask: "How much can the residual energy be lowered in QA before the system reaches a state where lowering the energy further would require cotunneling of spin domains with sizes greater than κ?" To answer this question one can use an algorithm that starts at a random initial state and, at each step, i) considers all groups of bits of size up to κ and ii) flips the bits of the group that results in the largest reduction of the residual energy, iterating until no further improvement is found. We call this procedure "Algorithmic Tunneling" (AT). In other communities it would be referred to as κ-opt or large neighborhood search. We emphasize that AT does not provide any information about the actual system dynamics during QA, nor about the runtime of QA. We investigate AT as an upper bound on the typical performance of QA. AT does not consider the entropic component of tunneling events arising from the statistics of different mechanisms for arriving in the same minima. Likewise, AT does not consider how the height of energy barriers affects tunneling events. However, AT allows us to develop intuition about the value of an optimal tunneling procedure which always finds the lower energy solution within a finite Hamming distance.

TABLE I. For Simulated Annealing to work at all, it has to be run at very high temperatures to overcome the enormous energy barriers present in this problem. However, at these high temperatures, SA behaves almost like random sampling, and hence its scaling is almost that of exhaustive search. The scalings of Quantum Monte Carlo as well as other methods that simulate Hamiltonian evolution are comparable to each other, and they all scale better than SA. This mirrors the situation we encountered for the weak-strong cluster networks. However, these scalings are close enough to warrant an estimation of the standard error in these values, which we have not done. The value of α for the solution of the time-dependent Schrödinger equation was initially obtained in [68]. We also give references for the state-of-the-art classical and quantum algorithms in the bottom rows of the table.
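The AT procedure described above is a best-improvement search over Hamming balls of radius κ. A minimal sketch follows (the neighborhood is enumerated exhaustively, so this is only practical for small N and κ; names are ours):

```python
import itertools
import random

def algorithmic_tunneling(a, kappa, seed=0):
    """Greedy kappa-opt descent on the NPP residue |Omega_s|."""
    rng = random.Random(seed)
    n = len(a)
    s = [rng.choice((-1, 1)) for _ in range(n)]
    omega = sum(aj * sj for aj, sj in zip(a, s))   # signed residue Omega_s
    while True:
        best_gain, best_group, best_delta = 0, None, 0
        for k in range(1, kappa + 1):
            for group in itertools.combinations(range(n), k):
                # flipping `group` changes Omega by -2 * (sum of flipped terms)
                delta = -2 * sum(a[j] * s[j] for j in group)
                gain = abs(omega) - abs(omega + delta)
                if gain > best_gain:
                    best_gain, best_group, best_delta = gain, group, delta
        if best_group is None:                     # kappa-opt local minimum
            return abs(omega)
        for j in best_group:
            s[j] = -s[j]
        omega += best_delta

random.seed(4)
a = [random.random() for _ in range(20)]
for kappa in (1, 2, 3):
    print(f"kappa={kappa}: residue {algorithmic_tunneling(a, kappa):.3e}")
```

Each step costs $\sum_{k \le \kappa} \binom{N}{k}$ residue evaluations, which is why κ cannot grow with N in practice.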
To investigate the lowest residual energies that AT can reach, we focus on the conditional distribution of the signed partition residues $\Omega'$ (35) over all possible spin configurations $\{s'\}$ generated from a given (ancestor) configuration $s$ by simultaneously flipping a fixed number of spins $\kappa$. This conditional distribution was studied in [68]; it is expressed in terms of $P(\Omega)$ given in (36) and the factor $\zeta(s) = \sin(\Delta\Omega s/4)/(\Delta\Omega s/4)$. In the distribution $P_\kappa(\Omega, \Omega')$, we averaged over the residue of the ancestor configuration within the interval $\Omega_s \in [\Omega - \Delta\Omega/2, \Omega + \Delta\Omega/2]$, where $\langle E \rangle / \binom{N}{\kappa} \ll \Delta\Omega \ll \Omega$. Near its maximum, the distribution $P_\kappa(\Omega'|\Omega)$ is approximately Gaussian; for small $\kappa \ll N$, its width is approximately $\sqrt{\kappa \langle a^2 \rangle}$. After a single step of the AT algorithm, the average partition residue is reduced by a factor of $1 - 2\kappa/N$. Therefore, once the number of steps of AT far exceeds $N/\kappa$, the algorithm reaches residues $|\Omega| \lesssim \sqrt{\kappa \langle a^2 \rangle}$. For those residues the distribution becomes flat near zero. In total we have $\binom{N}{\kappa}$ samples of this distribution (spin configurations located at a Hamming distance $\kappa$ from the ancestor configuration). Therefore the minimum value is given by extreme order statistics [74]: within the set of bit configurations $\{s'\}$ generated from a given ancestor configuration $s$ by simultaneously flipping a fixed number of spins $\kappa$, the minimum partition energy $E$ is a random variable drawn from an exponential distribution, from which the average (and median) value of the minimum partition residue achieved by the AT algorithm, $E^{\rm AT}_{\min}$, follows (43). It is instructive to compare the minimum cost values reached by the AT and KK heuristics. Using (37) and (43), we obtain the ratio $E^{\rm AT}_{\min}/E^{\rm KK}_{\min}$ (44). One can see from this that tunneling over barriers of length $\kappa > \alpha \log N$ allows AT to reach cost values lower than those of the KK heuristic as $N$ increases.
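The extreme-order-statistics step can be checked numerically: near zero the residue density is roughly flat with some scale $W$, so the minimum of $M$ samples concentrates near $W/(M+1)$. The check below uses synthetic uniform samples purely for illustration:

```python
import random

def mean_min_of_samples(M, scale, trials=2000, seed=5):
    """Average minimum of M i.i.d. Uniform(0, scale) draws over many trials."""
    rng = random.Random(seed)
    return sum(
        min(rng.uniform(0, scale) for _ in range(M)) for _ in range(trials)
    ) / trials

M, W = 200, 1.0
print(mean_min_of_samples(M, W), "expected ~", W / (M + 1))
```

With $M = \binom{N}{\kappa}$ samples in a flat band of width $\sim \sqrt{\kappa \langle a^2 \rangle}$, the same $1/M$ suppression is what drives $E^{\rm AT}_{\min}$ far below the band width.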
Consider AT with values of $\kappa$ that do not scale with $N$. We observe that in the asymptotic limit the KK heuristic produces smaller residues than AT. However, if we consider tunneling with $\kappa = 8$, corresponding to the case of the weak-strong clusters studied in this paper, the number of neighboring configurations $\binom{N}{\kappa}$ is already very large at moderate $N$ (e.g., $\binom{35}{8} \approx 2.4 \times 10^7$). We also observe that for the high-precision ($b = N$) instances, NPP becomes intractable already for $N > 100$. Thus, for a broad range of problem sizes, AT reaches cost values much smaller than conventional heuristics.
Motivated by the above observation, we conclude that asymptotic scaling behavior is not essential for this analysis. For example, one can choose AT with barrier sizes $\kappa = \alpha_0 \log N$, where $\alpha_0 - \alpha \sim 1/\log N$. In this case the ratio $E^{\rm AT}_{\min}/E^{\rm KK}_{\min}$ approaches 0 in the asymptotic limit $N \to \infty$, while the length of the barriers remains relatively small over a very broad range of $N$. For example, for $\alpha_0 = 1.16$ and $N = 1000$, the barrier length is $\kappa = 10$ while $E^{\rm AT}_{\min}/E^{\rm KK}_{\min} \sim 10^{-9}$. In summary, a greedy search procedure flipping at most $\kappa$ bits at a time (referred to above as Algorithmic Tunneling) allows one to find cost values in NPP that are much below those given by the KK heuristic for $\kappa \sim 10$ at all realistic values of $N$. It would be interesting to compare the minimum residue obtained by AT with the minimum residue obtained by the other algorithms in Table I when constrained to terminate in polynomial time.

B. K th Order Binary Optimization
Our current best candidate for a problem class that fulfills all three criteria consists of K th order binary optimization problems with K > 2. K th order binary optimization is NP-Hard and occurs naturally in many engineering disciplines and many computational tasks. In unpublished work underway, we seek to establish that for many K-local problems, QA indeed offers a runtime advantage over SA. Currently we are focusing on K ∈ {4, 5, 6}. As energy landscapes get more rugged with higher K, our conjecture is that we will see larger subsets of instances for which QA runs faster as K increases. However, representing K-body terms in hardware becomes more challenging as K grows.

V. DESIGNING FUTURE ANNEALERS OF PRACTICAL RELEVANCE
Should numerical studies confirm that QA offers a substantial runtime advantage, there is still one more hurdle to overcome. We need to ensure that K-local problems can be economically represented in hardware. We would like to be able to tell a user: "If you have a binary optimization problem with N variables and L terms, and the many-body order of the highest term is K, then you can send this problem to the quantum annealing coprocessor." However, annealers built to date only support pairwise qubit couplings, i.e., K = 2. Two routes have been proposed to increase the locality.
One route is to build physical $K$-body couplers. However, it may prove difficult to lay out $K$-local couplers on a two-dimensional chip or even in layered architectures. Of course, the general case in which one aims to implement all possible $\sum_{k=1}^{K} \binom{N}{k}$ couplings will be infeasible. While many applications will only necessitate $L = O(N)$ coupling terms, this could still prove challenging. Furthermore, economically embedding problems in a fixed graph with only a limited number of specific $K$-local terms may prove difficult.
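The count of all possible couplings up through order $K$ grows so quickly that an exhaustive coupler layout is hopeless even for modest $K$; a quick tally makes the point:

```python
from math import comb

# Number of possible coupling terms up to order K among N qubits:
# sum_{k=1}^{K} C(N, k).
def n_couplings(N, K):
    return sum(comb(N, k) for k in range(1, K + 1))

for K in (2, 4, 6):
    print(f"N=1000, K={K}: {n_couplings(1000, K):.3e} possible terms")
```

This is why hardware designs must target the sparse $L = O(N)$ subsets that applications actually need rather than general $K$-local connectivity.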
Another possibility is that we could use logical embeddings that map K-local problems to 2-local problems. A new proposal on how to accomplish such embeddings has been put forth [75], which has reinvigorated interest in this direction. Our main worry regarding any reduction to quadratic problems is that this will involve ancillary qubits. As argued above, it is crucial that the problem features tall and narrow barriers for tunneling transitions to contribute and the introduction of additional variables may cause these barriers to become wider. This makes purely thermal annealing more competitive and may negate gains seen in the numerics prior to embedding.

VI. SUMMARY
It is often quipped that Simulated Annealing is only for the "ignorant or desperate". Yet, in our experience we find that lean stochastic local search techniques such as SA are often very competitive and they continue to be one of the most widely used optimization schemes. This is because for sufficiently complex optimization tasks with little structure to exploit (such as instances of K th order binary optimization) it often takes considerable expert effort to devise a faster method. Therefore we regard SA as the generic classical competition that Quantum Annealing needs to beat.
Here we showed that for carefully crafted proof-of-principle problems with rugged energy landscapes dominated by tall and narrow barriers, QA can have a significant runtime advantage over SA. We found that for problem sizes involving nearly 1000 binary variables, quantum annealing is more than $10^8$ times faster than SA running on a single core. We also compared the quantum hardware to QMC. While the scaling of runtimes with size is comparable between these two methods, they are again separated by a large factor, sometimes as high as $10^8$.
For higher order optimization problems, rugged energy landscapes will become typical. As we saw in our experiments with the D-Wave 2X, problems with such landscapes stand to benefit from quantum optimization because quantum tunneling makes it easier to traverse tall and narrow energy barriers. Therefore, we expect that quantum annealing might also deliver runtime advantages for problems of practical interest such as K th order binary optimization with larger K.
More work is needed to turn quantum enhanced optimization into a practical technology. The design of next generation annealers must facilitate the embedding of problems of practical relevance. For instance, we would like to increase the density and control precision of the connections between the qubits as well as their coherence. Another enhancement we wish to engineer is to support the representation not only of quadratic optimization but of higher order optimization as well. This necessitates that not only pairs of qubits can interact directly but also larger sets of qubits. Such improvements will also make it easier for end users to input hard optimization problems.
The work presented here focused on the computational resource that is experimentally most accessible for quantum annealers: finite range tunneling. However this analysis is far from complete. A coherent annealer could accelerate the transition through saddle points, an issue slowing down the training of deep neural networks, for reasons similar to those that make a quantum walk spread faster than a classical random walker [76][77][78]. It could also dramatically accelerate sampling from a probability distribution via the mechanism of many body delocalization [79]. The computational value of such physics still needs to be properly understood and modeled.
In the present study we apply this detailed model with the new schedule functions A(s), B(s) (see Fig. 7) and the new values of the noise parameters, line-width W, Ohmic coefficient η, and device temperature T, for the D-Wave 2X processor. The noise parameters were measured near the end of the quantum annealing schedule, s = 1. The values of the noise parameters at a point during the annealing can be related to the measured ones via the transition rates $W_{jk}(s)$ from state $j$ to state $k$, whose explicit form is given in Ref. [35]. The transition rate $W_{10}(s)$ decays very fast after the avoided crossing (see Fig. 8) because the weak cluster (left cluster in Fig. 2) becomes progressively more polarized along the z-axis and the effective size of the tunneling domain $D = D(s)$ grows. This gives rise to the multi-qubit "freezing" phenomenon, in which the system population gets partially trapped in the excited state after a certain value of s in the later stages of QA [35]. Fig. 9 shows the ground state population given by the solution of (A2). The success probability of being in the ground state at the end of the annealing is $p_0 = 0.85$, which is close to the experimentally observed mean value of 0.9. The equilibrium population of the ground state exceeds the actual population for $s \gtrsim 0.64$, corresponding to the onset of freezing of the transition rates.

Appendix B: D-Wave versus Quantum Monte Carlo with Linear Schedule
There is ongoing work directed at optimizing the QMC parameters further. In preliminary results, we compared QMC with a linear schedule against D-Wave [80]. The transverse field in this case is lower, resulting in a faster time $T_{\rm worldline}$ to update a worldline ($T_{\rm worldline}$ scales linearly with the transverse field). We measured this time at the D-Wave annealing time $t_{\rm QA} = 20\ \mu$s; it is consistent with the one reported in Ref. [12].
We also took a different approach when optimizing β and the number of sweeps per run to minimize the total computational effort. In the case reported in Sec. II B we optimized the number of sweeps for each quantile at fixed β = 10. In the case of a linear schedule, we used our knowledge of the structure of the weak-strong cluster networks problem to optimize β and the number of sweeps $n_{\rm sweeps}$ concurrently. We first measure the probability of success $p(n_{\rm sweeps}, \beta)$ for a single weak-strong cluster pair. Then we estimate the performance for the cluster network problems taking into account the number of cluster pairs $c$ for each size. The estimate is
$$\text{total time} \propto n_{\rm sweeps} \times \beta \times \frac{\log(1 - 0.99)}{\log\left(1 - p(n_{\rm sweeps}, \beta)^c\right)}.$$

FIG. 13. To assign a runtime for QMC, we take the number of worldline updates required to reach a 99% success probability and multiply it by the time to perform one update on a single state-of-the-art core. Shown are the 50th, 75th and 85th percentiles over a set of 100 instances. The error bars represent 95% confidence intervals from bootstrapping. The runtimes for the higher quantiles at the larger problem sizes for QMC were not computed because the computational cost was too high.
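The repetition-count bookkeeping behind this estimate is standard time-to-solution accounting: if one run succeeds with probability $p$, then $R = \log(1-0.99)/\log(1-p)$ runs are needed for 99% aggregate success. A sketch (the cost function and its numbers are illustrative):

```python
import math

def repetitions_to_target(p, target=0.99):
    """Runs needed so that at least one succeeds with probability `target`."""
    if p >= target:
        return 1.0
    return math.log(1.0 - target) / math.log(1.0 - p)

def total_cost(n_sweeps, beta, p_single, c):
    """Cost proxy n_sweeps * beta * R, where a run succeeds only if all c
    cluster pairs succeed, so the per-run probability is p_single ** c."""
    return n_sweeps * beta * repetitions_to_target(p_single ** c)

print(repetitions_to_target(0.5))            # ~6.64 runs for p = 0.5
print(total_cost(23_000, 32.5, 0.95, 8))     # toy numbers
```

Minimizing `total_cost` over `(n_sweeps, beta)` is the concurrent optimization described above, with `p_single` replaced by the measured single-pair success curve.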
Here we also run with open boundary conditions (OBC) in imaginary time. We estimate an optimal β = 130 for all sizes. We then optimize the number of sweeps for each quantile and size running the actual benchmark. Finally, we modified the path integral QMC code to search for the minimum energy configuration along all replicas at the end of the annealing. The results, following the same methodology as in Sec. II B, are plotted in Fig. 13. We obtain a prefactor ∼ 10 6 for the median and up to ∼ 10 8 for the 85th quantile.
When optimizing QMC to the extent performed here a methodological concern arises. Since QMC has many parameters and modes of execution (e.g. temperature, number of sweeps, annealing schedule, open versus closed boundary conditions in imaginary time, discrete or continuous time Monte Carlo), overlearning can become an issue when working with just 100 instances. Moreover, optimizations over many parameters will become computationally prohibitive as the problem size increases. By contrast, the current quantum hardware has only a single parameter that can be tuned, the number of annealing sweeps.