Demonstration of a scaling advantage for a quantum annealer over simulated annealing

The observation of an unequivocal quantum speedup remains an elusive objective for quantum computing. A more modest goal is to demonstrate a scaling advantage over a class of classical algorithms for a computational problem running on quantum hardware. The D-Wave quantum annealing processors have been at the forefront of experimental attempts to address this goal, given their relatively large numbers of qubits and programmability. A complete determination of the optimal time-to-solution (TTS) using these processors has not been possible to date, preventing deﬁnitive conclusions about the presence of a scaling advantage. The main technical obstacle has been the inability to verify an optimal annealing time within the available range. Here we overcome this obstacle using a class of problem instances constructed by systematically combining many-spin frustrated-loops with few-qubit gadgets exhibiting a tunneling event — a combination that we ﬁnd to promote the presence of tunneling energy barriers in the relevant semiclassical energy landscape of the full problem — and we observe an optimal annealing time using a D-Wave 2000Q processor over a range spanning up to more than 2000 qubits. We identify the gadgets as being responsible for the optimal annealing time, whose existence allows us to perform an optimal TTS benchmarking analysis. We perform a comparison to several classical algorithms, including simulated annealing, spin-vector Monte Carlo, and discrete-time simulated quantum annealing (SQA), and establish the ﬁrst example of a scaling advantage for an experimental quantum annealer over classical simulated annealing. Namely, we ﬁnd that the D-Wave device exhibits certiﬁably better scaling than simulated annealing, with 95% conﬁdence, over the range of problem sizes that we can test. However, we do not ﬁnd evidence for a quantum speedup: SQA exhibits the best scaling for annealing algorithms by a signiﬁcant margin. This is a ﬁnding of independent interest, since we associate SQA’s advantage with its ability to transverse energy barriers in the semiclassical energy landscape by mimicking tunneling. Our construction of instance classes with veriﬁably optimal annealing times opens up the possibility of generating many new such classes based on a similar principle of promoting the presence of energy barriers that can be overcome more eﬃciently using quantum rather than thermal ﬂuctuations, paving the way for further deﬁnitive assessments of scaling advantages using current and future quantum annealing devices.


I. INTRODUCTION
The elusive and tantalizing goal of experimentally demonstrating a quantum speedup is being actively pursued using a variety of quantum computing platforms. The holy grail is an exponential speedup, such as that believed to be attainable with Shor's algorithm for factoring integers [1], or with the simulation of quantum systems [2][3][4]. However, by necessity all experimental demonstrations are carried out on physical systems of finite size, as measured in terms of the number of qubits. The accuracy threshold theorem provides a theoretical guarantee that for sufficiently low noise levels and through the use of quantum error correction, a finite-size device can be scaled up fault tolerantly [5]. In the absence of an asymptotic guarantee provided by fault-tolerance, a finite-size device provides evidence of what can be expected at larger, future sizes, provided the device temperature, coupling to the environment, and calibration and accuracy errors, can be appropriately scaled down. With this caveat in mind, how do we benchmark a quantum speedup on a finite-size device? This was first systematically addressed in Ref. [6], which defined several types of quantum speedup, followed by some refinements in Ref. [7]. Implicit in the notion of a quantum speedup is that the advantage over classical scaling is directly traceable to quantum effects. For noisy quantum hardware without quantum error correction, this is potentially a contentious issue, so instead we ask whether we can demon-strate a scaling advantage for a well-defined family of computational problems running on such hardware. We answer this question in the affirmative and report on the first observation of a quantum hardware scaling advantage when comparing simulated annealing with singlespin updates (SA) [8] to a quantum annealer [9], for a specific problem class we introduce here. SA is not only a very general meta-heuristic, but can also be viewed as a classical analogue of quantum annealing (QA) [10], in the sense that it performs temperature annealing with thermal fluctuations instead of quantum fluctuations. In fact, it is exactly this analogy that gave rise to QA in its present form [11]. Hardware quantum annealers are devices designed to represent physical implementations of QA and the quantum adiabatic algorithm [12] for combinatorial optimization problems.
However, the scaling advantage we observe for hardware-based QA relative to SA does not qualify as a quantum speedup as defined in Ref. [6]: we find that several classical algorithms have superior scaling. This includes discrete-time simulated quantum annealing (SQA) [13], and the spin-vector Monte Carlo (SVMC) algorithm (also known as SSSV, the author initials in Ref. [14]). Both SQA and SVMC are transverse field annealing algorithms, and thus more closely model QA than SA does. But while SVMC is purely classical, in the sense that it provides a time-dependent description in terms of unentangled planar rotors, SQA is a classical algorithm based on path-integral Monte Carlo, which in its continuoustime limit and for sufficiently many spin updates (or "sweeps") generates samples from the quantum Gibbs state. In particular, at sufficiently low temperatures SQA can describe entangled ground states such as those followed by QA. We find that SQA exhibits a speedup relative to SVMC for our problem class, and argue that this performance difference is related to SQA's ability to traverse energy barriers more efficiently even as the temperature of the simulation is lowered, in a way mimicking tunneling. We use this to argue that a key reason for the quantum annealer's slowdown relative to SVMC and SQA is its sub-optimally high temperature.
A great deal of work has already been invested in the design of suitable scaling benchmarks for optimization via QA [6,[15][16][17] and the identification of suitable classes of problem instances [7,[18][19][20][21][22][23]. Despite the large body of work on benchmarking quantum annealers, conclusive evidence about how their performance scales has been unattainable. The primary reason, as we shall discuss in detail, is that it has not been possible to identify an optimal annealing time for any class of problem instances. Here, using a new class of problem instances we introduce, we show that this obstacle can be overcome, and present for the first time a complete algorithmic scaling analysis of a quantum annealer, up to the largest available problem size of more than 2000 spins or qubits. We stress that our observation of an optimal annealing time is not simply due to advances in the quantum annealing hardware; we demonstrate optimal annealing times for these problem instances on the two most recent generations of D-Wave processors. Our work thus provides a clear path forward towards a rigorous assessment of the possibility of attaining an unqualified scaling advantage using such devices.
We first review and discuss, in Sec. II, the time-tosolution metric, and how to establish optimality. Section III presents our results. First, Sec. III A establishes the empirical evidence for optimal annealing times for our class of problem instances. Then, Sec. III B presents the empirical evidence we have found for a QA scaling advantage over SA, but a disadvantage relative to SQA and SVMC. In Sec. III C, we introduce and describe the properties of the class of problem instances for which we observe the optimal annealing time and the scaling advantage over SA. We discuss the implications of our results and provide an outlook in Sec. IV. Additional technical details and methods are provided in the Appendices.

II. OPTIMAL TIME-TO-SOLUTION
We consider the standard setting where the goal of the optimizer is to find the optimal solution (i.e., the global minimum of the cost function) and one is interested in minimizing the time taken to find the solution at least once. There is a tradeoff between finding the solution with a high probability in a single long run of the algorithm, and running the algorithm multiple times with a shorter runtime and (usually) a smaller single-run suc-cess probability. This tradeoff is reflected in the time-tosolution (TTS) metric, which measures the time required to find the ground state at least once with some desired probability p d (often taken to be 0.99): .
(1) Here p S (t f ) is the success probability of a single-instance run of the algorithm with a runtime t f , and R(t f ) is the required number of runs; success means that the optimal solution was found. The instance size is N , and N max is the size of the largest instance that the device accommodates (typically set by the total number of qubits); the factor N/N max accounts for maximal parallel utilization of the device. While R and N max /N should correspond to whole numbers, we do not round them here since this can result in sharp TTS changes that complicate the extraction of scaling with N ; see Appendix A for more details.
However, when considering the performance of an algorithm evaluated over an ensemble of randomly chosen instances from the same class, we are typically interested not in the TTS of a single instance but in a given quantile q of the TTS distribution over such instances at a given problem size N ∈ [N min , N max ]. We denote the qth quantile of the TTS evaluated at t f (N ) by TTS(t f ) q , and suppress the N dependence for simplicity. Since the goal of optimization is to find the solution as rapidly as possible, there is an optimal t f value for a given quantile q, t * q , where TTS(t f ) q is minimized, and we denote TTS * q ≡ TTS(t * q ) q . While the success probability of individual instances may exhibit many minima (as in the case of coherent evolution when oscillations in the success probability are observed as the annealing time is varied), the quantile of the TTS distribution exhibits only one minimum because the many minima of the individual instances are unlikely to coincide, and there is no ambiguity in the determination of an optimal t f value. Finally, it is important to note that since quantiles are solverdependent, a comparison between different solvers at the same quantile involves different sets of instances.
We can now state the precise nature of the critical obstacle alluded to above: if t * q < t min , where t min is the smallest possible annealing time on the given quantum annealer, then it becomes impossible to determine TTS * q . As was shown in Refs. [6,18], when operating with a suboptimal t f , one can easily be led to false conclusions about the scaling with N of TTS(t f ) q compared to the all-important scaling as captured by TTS * q , and even be led to conclude that there is a scaling advantage where there is none.
None of the experimental quantum annealing benchmarking studies to date [6,7,[15][16][17][18][19][20][21][22][23] have provided a complete scaling assessment, precisely because it has not been possible to verify that t * q > t min (for any quantile). The culprit was the absence of a suitable class of problem instances for which an optimal annealing time could be verified. Here we report on a class of instances that ex- A clear minimum in the TTS is visible at t * = 50µs, thus demonstrating the existence of an optimal annealing time for this particular instance (but not by itself for the problem class). Note that the decreasing or increasing TTS is associated with pS growing sufficiently fast or too slowly, respectively, with increasing t f . (b) Median TTS as a function of annealing time for L ≥ 12, from 1000 instances. Dotted curves represent best-fit quadratic curves to the data (see Appendix G for the scaling of t * with L, and Appendix K for details on the fitting procedure). The position of the minimum of these curves gives t * . The position of TTS * shifts to larger t f as the system size increases. An optimum could not be established for L < 12 for this instance class, i.e., it appears that t * < 5µs when L < 12. (c) The distribution of per-instance optimal annealing times t * i for different system sizes, as inferred directly from the positions of the minima as shown in (a). It is evident that the number of instances with higher optimal annealing times increases along with the system size, in agreement with (b).
hibits an optimal annealing time greater than t min = 5µs on the D-Wave 2000Q (DW2KQ, fourth generation, for which the largest energy scale is ∼ 50GHz in = 1 unitssee Appendix B) and the D-Wave 2X (DW2X, third generation, for which the largest energy scale is ∼ 40GHz). In the main text, we focus on the DW2KQ and provide results from the DW2X in the Appendix. This allows us to obtain the first complete optimal-TTS scaling results for an experimental quantum annealer, defined as the TTS scaling obtained from certifiably optimal annealing times.
The D-Wave processors used in our study are designed to implement quantum annealing using a transverse field Ising Hamiltonian: where j is the Ising, or 'problem' Hamiltonian whose ground state we are after. The σ x i and σ z i are the Pauli matrices acting on superconducting flux qubits that occupy the vertices V of a 'Chimera' hardware graph G with edge set E [24,25] and the local fields h i and couplings J ij are programmable analog parameters. The system is initialized in or near the ground state of the initial Hamiltonian H(0), and the annealing schedules A(s) and B(s), which set the energy scale, are described in Appendix B, along with further technical and operational details, including a schematic of the Chimera graph. The DW2KQ processor comprises 16 × 16 unit cells, so we can consider L × L subgraphs up to L max = 16 for our analysis, where each subgraph comprises L 2 unit cells, and each complete unit cell comprises 8 qubits (a small number of unit cells are incomplete, as a total of 21 out 2048 qubits are inoperative).

III. RESULTS
We start by describing our key results: the evidence for optimal annealing times, and the evidence for a scaling advantage of a physical quantum annealer over SA, along with its scaling disadvantage against the SQA and SVMC algorithms (we review these algorithms and discuss how we implemented and timed them in Appendix C). We then describe in detail the construction of the class of problem instances exhibiting these properties, and the role of tunneling in explaining them.

A. Evidence for optimal annealing times
We first present the evidence for optimal annealing times in Fig. 1. Figure 1(a) shows the TTS for a single representative L = 16 instance from a class we call 'logical-planted' problems. The unambiguous minimum at t f = 50µs is the optimal annealing time for this instance. The presence of a minimum is a robust feature: Fig. 1(b) shows that the optimal annealing time feature persists for the median TTS ( TTS * 0.5 ), at all sizes L ∈ [12,16]. In all previous benchmarking work, only the rise in TTS as a function of t f was observed, i.e., t * was always below t min , thus precluding the identification of an optimal annealing time. The increase in the optimal annealing time from L = 12 to L = 16 seen in Fig. 1(b) can be attributed to the general increase with problem size of the per-instance optimal annealing time as shown in Fig. 1(c), which shows the distribution of optimal annealing times over all the logical-planted problem instances we tested.
B. Evidence for a scaling advantage for QA hardware over simulated annealing, and a disadvantage against SQA and SVMC Having established accessible optimal annealing times (≥ 5µs) for the logical-planted instances, we are now ready to present a complete optimal-TTS scaling analysis. Our results for the dependence of TTS * on problem size are shown in Fig. 2, where we compare the DW2KQ results to three classical algorithms: SA with single-spin updates [8], SQA based on discrete-time pathintegral quantum Monte Carlo [13], and SVMC [14]. Figure 3 summarizes the performance of each algorithm in terms of the coefficients of exponential and polynomial fits, respectively, for several quantiles and two simulation temperatures for SQA and SVMC (a hybrid polynomialexponential fit does not work as well; see Appendix D).
The results presented in Fig. 3 demonstrate a (95% confidence) scaling advantage for the DW2KQ over SA in the case of the logical-planted instances, for the entire range of quantiles [0. 25, 0.9]. This represents the first observation of a scaling advantage over a classical algorithm on an experimental quantum annealer.
However, the SQA algorithm outperforms the DW2KQ in all quantiles and at both the colder inverse temperature of β = 2.5 and the warmer β = 0.51 (which corresponds to the operating temperature of the DW2KQ). The SVMC algorithm outperforms the DW2KQ in all quantiles at the warmer inverse temperature of β = 0.51 and in all quantiles at β = 2.5 except q = 0.9 where the error bars are too large to make a statistically significant determination. Thus, the scaling advantage over SA we observe is definitively not an unqualified quantum speedup.
We note that our results are robust to modifying the SA annealing schedule from a quadratic to a linear function in β (see Appendix E), and under a change of the metric to the so-called 'quantile-of-ratios speedup' [6] (see Appendix F).
We also note that of all the solvers featured in Fig. 3, the scaling of SVMC at β = 2.5 increases fastest from the easiest to the hardest quantile. As we discuss in more detail below, and is clear from Fig. 3, the SVMC and SQA performance depends strongly on the temperature at which the simulations are run. Specifically, we find that SVMC performs better at higher quantiles at higher temperatures, whereas SQA performs better at all quantiles at lower temperatures. We attribute this to harder instances involving an energy barrier that SVMC must thermally hop over, while SQA can mimic tunneling through. This also suggests that the DW2KQ per-formance is severely hindered by its sub-optimally high temperature. To explain this, we next discuss and motivate how we constructed our problem instances.

C. Construction of problem instances with an optimal annealing time
Having presented the evidence for optimality and the scaling analysis, we next describe the instance class with these properties. The two key properties we wish our instances to possess are (1) a guarantee of knowing the ground state energy (a useful feature for benchmarking optimizers at ever-growing problem sizes), and (2) an optimal annealing time on the D-Wave processors.

Planted solutions
In order to guarantee a known ground state energy, we construct 'planted' solution instances. The method builds the problem Hamiltonian as a sum of frustrated loop Hamiltonians H , such that itself is 'frustration-free', i.e., the planted solution is the simultaneous ground state of all H terms and hence is the ground state of H P [18]. Without loss of generality, we can always pick the planted solution to be the |0 · · · 0 (all-zero state) configuration, where henceforth the states |0 and |1 denote the eigenstates of the σ z operator with +1 and −1 eigenvalues, respectively. We consider planted solutions defined on the logical graph formed by the complete unit cells of the D-Wave hardware graph (i.e., without faulty qubits or couplers; in the case of an ideal Chimera graph, this would form a square lattice) [20]. Frustrated loops are then built on this logical graph, where logical couplings between adjacent unit cells are imposed only when all four physical inter-unit cell couplings are available. The intra-unit cell couplers are then all set to be ferromagnetic, guaranteeing that the planted-solution on the hardware graph is the plantedsolution on the logical graph with all physical spins in the unit cell set to their corresponding logical spin value. We refer to these as 'logical-planted' instances. In Appendix G we introduce 'hardware-planted' instances and demonstrate that they also exhibit an optimal annealing time.

Gadgets
In order to identify problem instances that exhibit an optimal annealing time, we first recall that previous studies of planted-solution instances on the D-Wave processors [18][19][20] found a TTS that rises monotonically as a Scaling of the optimal TTS with problem size. Result shown are for the 'logical-planted' instance class. The data points represent the DW2KQ (blue circles) and three classical solvers, SA (red diamonds), SVMC (purple left triangles), and SQA (green right triangles). The dashed and dotted curves correspond, respectively, to exponential and polynomial best fits with parameters shown in Fig. 3 (also given in table format in Table III in the Appendix). Panels (a), (b) and (c) correspond to the 25th quantile, median, and 75th quantile, respectively. SVMC and SQA were run with β = 2.5 here. Additional simulation parameters for SA, SVMC, and SQA are given in Appendix C. The data symbols obscure the error bars, representing the 95% confidence interval for each optimal TTS data point (computed from the fit of ln TTS to a quadratic function as explained in Appendix K). function of the annealing time. Keeping Fig. 1(a) in mind, a decreasing or increasing TTS results from the success probability rising sufficiently fast or too slowly, respectively, with increasing annealing time (we formalize this in Sec. III C 4 below). In the case where the system is very weakly coupled to its thermal environment, we can expect a competition between adiabaticity (unitary dynamics) and open system effects such as thermal excitations [26,27], resulting in a peak in the success probability and a minimum in the TTS. Though we note that it is unlikely that the DW2KQ operates entirely in the weak coupling regime (the minimum gap associated with the gadget is already below the temperature energy scale, as shown in Fig. 4, and we expect the minimum gap of the large instances to be smaller), from this perspective, shifting the minimum in the TTS to larger t f val-ues corresponds to prolonging the timescale over which adiabaticity dominates over open system effects. One way to try to accomplish this is by enhancing the role of finite-range tunneling in the dynamics. Motivated by this insight, and by recent work on the possibility of a computational role of finite multi-qubit tunneling in quantum annealers [23,28], we introduce a key modification and supplement the planted-solution instance Hamiltonian [Eq. (3)] with terms corresponding to the addition of identical 8-qubit 'gadgets' that exhibit tunneling during their anneal: Here H Gi denotes the gadget Hamiltonian in unit cell i, and the gadgets are placed into randomly chosen unit cells: S denotes a randomly chosen subset comprising a fraction p of complete unit cells (we use p = 0.1). The specific 8-qubit gadget we used fits into the unit cell of the D-Wave processors, and its connectivity and parameters are depicted in Fig. 5. The ground state of the gadget is the all-zero state of the eight qubits, so that the ground state of the full Hamiltonian remains the all-zero state.
The first excited state of the gadget is doubly degenerate with average Hamming weight seven. Generically, one would not expect the annealing properties of H G to be shared by H P , but below we show to what extent they are for our instances.

Tunneling
We next establish in what sense our gadget exhibits tunneling. We show in Fig. 4 the expectation value of the Hamming weight operator HW = 1 2 n i=1 (1 − σ z i ) in the ground state and first excited state, computed by numerically solving the time-dependent Schrödinger equation for the evolution of the gadget. This expectation value exhibits a sharp change at the same point in the evolution where the minimum gap occurs (shown in the inset). The ground state reorients itself to the |0 · · · 0 state, while the first excited state reorients to align closely with the |1 · · · 1 state. This already suggests a tunneling transition, but in order to confirm this we wish to establish the presence of an energy barrier in the semiclassical potential that the quantum system must tunnel through during the anneal. Such tunneling transitions have been well-studied in the context of systems with qubit-permutation invariance [28][29][30][31][32], but less so in the context of systems such as ours without this symmetry.
The semiclassical potential as derived from the spincoherent path integral formalism [33] is given by the expectation value of H(t) in the spin-coherent state In the context of the transverse field Ising Hamiltonian [Eq. (2)], the semiclassical potential becomes: Equation (5) provides a multi-dimensional energy landscape for the quantum annealing protocol. Unfortunately, due to the absence of any symmetries it is infeasible to exhaustively explore this landscape and identify the actual location of barriers, even under the simplification where ϕ i = 0, ∀i. Instead, as proxies for a direct calculation of tunneling transition matrix elements or an instanton analysis [34], we consider the behavior of the SVMC and SQA algorithms. SVMC performs Metropolis updates on the potential energy landscape, Eq. (5) [14]. Since this algorithm can only thermally 'hop' over energy barriers, we expect its performance to deteriorate with decreasing temperature in the presence of a relevant energy barrier. On the other hand, a path-integral Monte Carlo based approach like SQA should be able to not only thermally hop over these barriers but also mimic tunneling through them [35][36][37], which should benefit from a decreasing temperature. Therefore, we expect to be able to identify tunneling energy barrier bottlenecks in the quantum anneal by contrasting the temperature dependence of the performance of SVMC and SQA. Figure 6(a) shows that for our 8-qubit gadget, SQA and SVMC behave as expected in the presence of a tunneling energy barrier: the success probability of SVMC  ) and (c) Probability of reaching the ground state at the end of the anneal using SVMC and SQA for different simulation inverse temperatures β using two different instances, with and without the 8-qubit gadget, at the largest available size L = 16. (b) An instance that exhibits a clear signature of a tunneling energy barrier when the gadget is introduced. (c) An instance that does not exhibit a signature for a tunneling energy barrier even with the gadget. In all panels SVMC and SQA simulations used 8M and 3M sweeps respectively, and both algorithms use the DW2KQ annealing schedule. The drop in success probability at large β for SQA is because spin updates become less efficient at high β and more spin updates are required to maintain the high success probability.

FIG. 7.
Empirical scaling behavior with and without the gadget. Shown is the distribution of the power law scaling coefficient a obtained after fitting ln pS to a ln t f + b for 100 instances at L = 16, run on the DW2KQ processor. We choose t f ∈ [5,50]µs since this is the range over which the TTS decreases as seen in Fig. 1(a). The instances with the gadget typically exhibit a larger scaling coefficient, which leads to the observation of an optimal annealing time. In (a) and (b) error bars represent 95% confidence intervals (2σ) calculated using 1000 bootstraps of 100 gauge transformations [15].
decreases with decreasing temperature, whereas the success probability of SQA increases with decreasing temperature. As shown in the inset and comparing to Fig. 4, we see that SVMC is effectively trapped in the higher excited states, while SQA is able to follow the ground state.
Next, we use the same technique to probe our plantedsolution instances with and without the gadget. We show the behavior of two very different L = 16 instances in Fig. 6. For one of the instances [ Fig. 6(b)], the introduction of the gadget adversely affects the performance of SVMC, pushing the success probability to zero for increasing inverse-temperature β. For SQA, the improvement in performance as β increases is significantly sharper with the gadget. For the second instance [ Fig. 6(c)], we see that while SQA's behavior is almost identical, SVMC exhibits an improving performance with increasing β in the presence of the gadget, suggesting an absence of an energy barrier. This analysis demonstrates that the tunneling properties induced by our 8-qubit gadget can be inherited by the problem instances even at the largest problem size, and that the success probability exhibits a strong temperature dependence. For further details on the behavior of an ensemble of instances see Appendix G.
Unfortunately we cannot directly probe tunneling or study the temperature dependence on the D-Wave processors. To the extent that SQA models the behavior of the physical quantum annealer, one may choose to interpret the evidence we have presented above as evidence for the role of tunneling energy barriers induced by the gadgets. 4. The gadget is responsible for the observed optimal annealing time To more directly understand the effect of our gadget on the hardware quantum annealer, it is instructive to contrast the scaling behavior with and without the gadget. Toward that end, we fit the empirical success probability p S to a power law of the form b(t f ) a (see Appendix H for the fit quality). We show in Fig. 7 the distribution of the scaling coefficient a for 100 instances for the logicalplanted instances with and without the gadget. The two distributions differ substantially: the instances with the gadget exhibit larger coefficients, almost all with a value greater than 1, resulting in a significantly larger initial rate of increase in p S (t f ) than for the instances without the gadget. This, in turn, leads to the observed initial decrease in the TTS with increasing annealing time: upon expanding the logarithm in Eq. (1) for small p S , we find that Fig. 1(a), where the slope of p S (t f ) is indeed seen to be initially > 1, then dropping to < 1 for larger t f . Since the TTS must eventually increase with the annealing time [ideally, for sufficiently large t f , R(t f ) → 1, at which point TTS(t f ) ∝ t f ], this helps to explain why the instances with the gadget exhibit an optimal annealing time. It also suggests a useful heuristic for future studies attempting to identify problem instance classes with an optimal annealing time: the instances should have the property that if p S (t f ) = g(t f ) 1 for some function g, then TTS(t f ) ∝ t f /g(t f ) must be decreasing for some range of t f > t min values. This is compatible with any faster-than-linear form for g. We already alluded earlier to a competition between adiabaticity and thermal excitations as being potentially responsible for an optimal annealing time. Another mechanism, that appears to be more consistent with the fact that the DW2KQ is not operating in the weak coupling regime, is that thermal relaxation is fast for small t f and then slows down for sufficiently large t f , presumably since the system has already entered the quasistatic regime [38].

IV. DISCUSSION AND OUTLOOK
There are relatively few theoretical examples where QA offers a demonstrable scaling advantage even over SA (for a review see Ref. [39]), and most of these examples do not involve physically realizable classical cost functions [29,32,[40][41][42]. In light of this, what is the significance of our demonstration of a QA hardware scaling advantage over the SA algorithm? We believe that an important clue lies in the fact that SVMC also exhibits an advantage over SA for these problems. The SA and SVMC algorithms can both be viewed not only as classical analogues of QA, but also as implementing two of its possible classical limits [11,14,43]. While SA performs updates on the classical energy landscape associated with the Ising Hamiltonian, SVMC performs updates on the semiclassical potential associated with the quantum anneal. A scaling difference between the two, with an advantage for SVMC, suggests that thermal updates on the semiclassical energy landscape is more efficient. While it is unclear whether the quantum effects in the D-Wave devices that have already been demonstrated on a smaller scale (N 16) [9,15,23,28,[44][45][46][47] remain operative at the much larger scales we have employed in our study, the fact that the DW2KQ also exhibits an advantage over SA suggests that it must be evolving in a landscape that also allows for better scaling. It is in this sense that quantum effects that are necessarily absent from SA and might be present in the quantum annealer, can provide an advantage. This is especially significant since there is a large overlap between the instances solved at the median quantile by the DW2KQ and all three of the classical algorithms, including SA, as shown in Fig. 8. This means that if any quantum effects are responsible for the scaling advantage of the DW2KQ over the SA algorithm, then they are operative in largely the same set of problem instances, so that these instances may define a target class for quantum enhanced optimization.
Likewise, SQA also evolves on the same semiclassical energy landscape as SVMC, but in addition to thermal updates is also capable of mimicking tunneling. The fact that SQA's scaling is far superior to SVMC's, and that this improves as the simulation temperature is lowered, shows that tunneling is effective at enhancing SQA's performance for the logical planted instances. What does this tell us about the possibility that the relative performance of the algorithms is indicative of quantum effects in the DW2KQ device? It is known that the scaling of quantum Monte Carlo can be as efficient [35,36,48] or less efficient [37] than the incoherent tunneling rate scaling of a true quantum annealer. Therefore, the fact that SQA overwhelmingly outperforms the DW2KQ but that the DW2KQ still outperforms SA, might suggest that the device is dominated by classical dynamics with a very small quantum component. While only speculative at this point, this type of situation might be the best we can hope for in the current generation of highly noisy quantum annealers, without some form of quantum error correction or suppression [49][50][51][52][53]. Figure 3 shows how the scaling of both SVMC and SQA is strongly affected when we increase their temperatures to the DW2KQ dilution fridge temperature. In both cases, for the median and lower percentile, SQA and SVMC's performance is hurt at this higher temperature relative to the colder temperature. This strongly suggests that the DW2KQ's performance for this class of instances is severely impacted by its temperature. However, we also find that the SVMC algorithm scales better than the DW2KQ. If SVMC is the classical limit of the device, then the DW2KQ should outperform SVMC. One possible explanations is that additional noise sources, such as implementation errors, might further degrade the performance of the DW2KQ relative to solvers run on digital We calculate the TTS for 1000 instances with each solver's respective optimal annealing time for the median at a given size L, and check which instances fall below the median TTS. Shown is the (normalized) fraction of the overlap of the instances between the solvers. Further details are given in Appendix I. Error bars represent 95% confidence intervals (2σ) calculated using 1000 bootstraps of 1000 instances.
classical computers [19]. We note that SVMC's performance improves at higher percentiles at higher temperatures relative to colder temperatures, which is consistent with the algorithm being able to thermally hop over energy barriers. This indicates that temperature is another algorithmic parameter that must be optimized separately for each quantile, a point we leave for future studies.
We emphasize that the discrete-time SQA algorithm studied here should not be interpreted as a true model of a thermal quantum system. It has been demonstrated that time-discretization may result in improved residual energy minimization performance over the continuum case [54], although this does not necessarily translate into a scaling performance advantage. Nevertheless, the superior performance of the SQA algorithm we have observed is an interesting finding in its own right: we are unaware of another example of an Ising model cost function where SQA with closed boundary conditions bests SA as a ground state solver (Ref. [23] reports an example but uses SQA with open boundary conditions). We have attributed this advantage to a more favorable energy landscape with the presence of tunneling barriers than can be traversed efficiently, but to test whether the observed scaling advantage would hold for a thermalizing quantum annealer requires quantum Monte Carlo simulations without Trotter errors [55][56][57]; unfortunately, at the > 2000 qubits scale we have worked with here, this is computationally prohibitive. The same is true for master equation simulations [28,58], even when implemented using the quantum trajectories method [59].
We emphasize that the instances presented here are not necessarily computationally hard, as suggested by the fact that, considering the entire range of sizes we tested, the quality of the polynomial fits is better than that of the exponential fits (see Fig. 2). The logicalplanted instances, in the absence of the gadget, are defined on a square lattice and can be solved in polynomial time using the exact minimum-weight perfect-matching algorithm [60]. However, we have confirmed that this algorithm performs poorly once the gadget is included, as expected when local fields are present (see Appendix J). Therefore, it is natural that algorithms optimized with respect to the problem structure demonstrate superior performance. For example, simulated annealing with both single and multi-spin updates (SAC), with the latter being simultaneous updates of all the spins comprising a unit cell (super-spin approximation [7]), scales significantly better than SA with single spin updates but still does not perform as well as SQA (see Appendix J).
Furthermore, there are many other classical algorithms that do not implement the same algorithmic approach as quantum annealing, such as the Hamze-Freitas-Selby (HFS) [61,62] algorithm. The latter exploits the low tree-width of the Chimera connectivity graph and has in all studies to date been the top performer for Chimeratype instances. In contrast, here we find that HFS's scaling performance lies between the DW2KQ and SAC (see Appendix J). Another competitive algorithm is parallel tempering with iso-energetic cluster moves [63].
We can expect that a more highly connected hardware graph will prevent algorithms such as HFS or SAC from being efficient; which architectures may lend themselves to an unqualified quantum speedup remains an open research question. Meanwhile, our approach defines a clear path forward by concretely establishing the possibility of generating instance classes with accessible optimal annealing times, amenable to a complete scaling analysis. We provide a derivation of the TTS expression given in Eq. (1) (see also, e.g., the Supplementary Materials of Ref. [6]). Let us assume that the probability of observing the ground state energy in any given repetition is p S (t f ), and we ask how many repetitions R must be performed to observe the ground state energy at least once. The probability of not observing the ground state energy once in R trials is (1 − p S (t f )) R . Therefore, to observe the ground state energy at least once with probability p d is: Solving for R gives the expression in Eq. (1) in the main text. Technically, the number of repetitions is defined as , but we do not include the ceiling operation in the calculation of R(t f ) in this work, since this can result in sharp TTS changes that complicate the extraction of a scaling. Similarly, the ratio N/N max should be ( N max /N ) −1 . The TTS should in principle also include all time-costs accrued by running the algorithm multiple times, such as state initialization and state readout times, as well as multiple programming times if different gauges are used (see Appendix B). We do not include these here either, because at least on the D-Wave processors, the readout and programming times are several factors larger than the annealing time and hence can effectively mask the scaling behavior. Instead, we restrict t f to be the runtime between state preparation and readout for all our algorithms.
To see how an analysis of the TTS that does not account for optimal annealing times can lead one astray, consider the following extreme example: suppose t f is too large at all problem sizes N ∈ [N min , N max ], such that R(t f ) = 1 always suffices to find the global minimum; in this case TTS(t f ) ∝ N t f for all N (i.e., is constant except for the parallelization factor N ), which except for trivial problems, must obviously be false.   coefficient (a, b, c) in fits of ln TTS * to a + b ln L + cL for the hardware-planted instances using L ∈ [8,16]. Errors are 95% confidence intervals.
about the processors are provided in the Appendix B. For each instance, we ran 100 random gauges (also known as spin-reversal transforms). A gauge is the application of a particular bit-flip transformation to the σ z operators in the problem Hamiltonian, i.e., 1}. This transformation does not change the eigenvalues of the transverse field Hamiltonian, and it is meant to minimize the effect of local biases and precision errors on the device [44]. For each gauge we took n reads = 1000 readouts, unless constrained by t f n reads < 10 6 µs. For example for t f = 2ms, we only took 400 readouts per gauge.
The annealing schedules of the D-Wave 2000Q (DW2KQ) processor housed at Burnaby and the DW2X processor housed at USC/ISI devices are shown in Fig. 9, in units of GHz. These schedules are not measured but computed and reported by D-Wave Systems Inc. based on their flux qubit models. The Chimera hardware graphs of the DW2KQ and DW2X processors we used in this work are shown in Fig. 10.
In the main text we focused on the annealing time. There are several other relevant timescales that we present here for completeness. We used the default initial state preparation time (t initial ). The readout time for the DW2KQ is t readout = 124.98µs. A complete characterization of the required runtime (the "wall-clock time") would include the thermalization and readout times in each independent run of the quantum annealer. Fur-   thermore, since we program the same instance multiple times using different gauges, the programming time of t program = 6987.80µs needs to be accounted for. In total, the wall-clock TTS would be given by: where G is the number of gauges, and R is the total number of runs, divided equally among the G gauges. However, since these timescales can be much larger than the optimal annealing time, they can mask the scaling of the TTS, and hence we focus just on the annealing time, as in previous work [6,15]. In principle, the initial state preparation time can be reduced and optimized along with the annealing time if included as part of the TTS, but we have not explored in this work how this impacts performance.

Appendix C: Simulation Parameters and Timing
Our implementation of the SA, SQA, and SVMC algorithms is based on the graphics processing unit (GPU) implementation used in Ref. [20] and described in more detail in Ref. [64]. We briefly describe our CUDA implementation of these algorithms here for completeness. In what follows, a sweep is a single Monte Carlo update of all the spins. For all implementations, we use the default cuRAND random number generator (XORWOW). We compile the CUDA code using the '-use fast math' flag, which, we note, may not be suitable for Monte Carlo simulations that require accurate calculations of thermal expectation values.
We first discuss our implementation of SA [8]. Each GPU thread updates the eight spins in a single unit cell. Because the Chimera graph is bipartite, each thread updates the four spins in one partition followed by the four spins in the second partition. A key feature of the implementation is that the eight local fields, 16 inter-cell couplers, and 16 intra-cell couplers are stored in the memory registers of the GPU. Only the spin configuration is stored on local memory. This minimizes the cost of retrieving data from global memory. We use the GPU intrinsic math function for the calculation of the Metropolis acceptance probability in order to maximize execution speed. As many copies n copies as allowed by register memory are run in parallel in separate GPU blocks. Therefore, we have for the timing of SA: where n sweep is the number of sweeps and τ sweep is the time required to perform a single sweep. Because n copies depends on the total number of threads (L 2 ) and hence on the problem size, this can be equivalently written as: where f SA is the number of total spin updates per unit time performed by the GPU. For consistency we use the timings reported in Ref. [64] for runs performed on an NVIDIA GTX 980, which have f SA = 50ns −1 . For SA, we use a temperature annealing schedule that is the DW2X annealing schedule for B(s) (shown in Appendix B) times β = 0.132 (in units where the maximum Ising coupling strength |J ij | is 1), such that βB(1) ≈ 5. Dotted curves represent best-fit quadratic curves to the data (see Appendix K for details). The position of TTS * shifts to larger t f as the system size increases. An optimum could not be established for L < 8 for this instance class, i.e., it appears that t * < 5µs when L < 8. (c) The distribution of instance optimal annealing times for different system sizes, as inferred directly from the positions of the minima as shown in (a). It is evident that the number of instances with higher optimal annealing times increases along with the system size, in agreement with (b). In (a) and (b) error bars represent 95% confidence intervals (2σ) calculated using 1000 bootstraps of 100 gauge transformations [15]. Scaling of t * with problem size for the two problem classes. Error bars represent the 95% confidence interval for the location of t * by fitting to a quadratic function as described in Appendix K.
The implementation of SVMC follows the same structure as SA, except that the spin configuration is replaced by angles {θ i } ∈ (0, 2π] [14]. The energy potential along the anneal is given by: a special case of Eq. (5) in the main text. An update involves drawing a random angle ∈ (0, 2π], and it is accepted according to the Metropolis-Hastings rule [65,66] with β = 2.5 for the logical-planted instances and β = 0.51 for the hardware-planted instances (in units where the maximum Ising coupling strength |J ij | is 1). We use the GPU intrinsic math function for the calculation of the cosine, sine, and Metropolis acceptance probability in order to maximize the speed of the algorithm. The timing of SVMC is the same as in Eq. (C2) but with f SVMC = 29ns −1 replacing f SA . We use the DW2X annealing schedule for A(s) and B(s) shown in Appendix B; this schedule keeps A(s) > 0 longer than that of the DW2KQ, which favors the SVMC algorithm, since once A(s) = 0 the system becomes the classical Ising model and the most efficient updates use θ = 0, π, but SVMC chooses angles randomly.
The implementation of SQA follows the same structure as SA, and also uses the DW2X schedule for similar reasons as just mentioned for SVMC. We restrict the Trotter slicing to 64 in order to fit the spins along the imaginary-time direction into a 64-bit word. A sweep involves performing a single Wolff cluster update [67] along the imaginary time direction for each spin. Once a cluster of spins is picked, it is flipped according to the Metropolis-Hastings rule using the Ising energy of the cluster with β = 2.5 for the logical-planted instances and β = 4.25 for the hardware-planted instances. We use the GPU intrinsic math function for the calculation of the Metropolis acceptance probability in order to maximize execution speed. At the end of the anneal, one of the 64 slices is picked randomly as the final classical state. The timing of SQA is the same as in Eq. (C2) but with f SQA = 5ns −1 replacing f SA .
We note that increasing the number of Trotter slices, while decreasing the Trotter error, appears to reduce the final success probability for one of the instances we have checked (see Fig. 11, where the peak success probability occurs for 64 slices), an effect noted in Ref. [54]. Studying this effect over the entire set of instances is computationally prohibitive at our > 2000 qubits scale.  17. Scaling of the optimal TTS with problem size for hardware-planted instances. The data points represent the DW2KQ (blue circles) and three classical solvers, SA (red diamonds), SVMC (purple left triangle), and SQA (green right triangle). The dashed and dotted curves correspond, respectively, to exponential and polynomial best fits with parameters given in Table II. Panels (a), (b) and (c) correspond to the 25th quantile, median, and 75th quantile, respectively. Simulation parameters for SA, SVMC, and SQA are given in Appendix C. The data symbols obscure the error bars, representing the 95% confidence interval for each optimal TTS data point (computed from the fit of ln TTS to a quadratic function as explained in Appendix K).

Appendix D: Alternative fits
In Fig. 2 of the main text (see also Figs. 17 and 20 below), we present exponential and polynomial fits to the optimal TTS as a function of L. Here we show that a hybrid three-parameter fit, i.e., ln TTS = a + b ln L + cL, does not give reasonable fits with good confidence bounds for all solvers. We restrict our attention to the hardware-planted instances, since in that case we have 9 sizes for the fit. Table I gives the results of the fits for the median; we see that the estimate for the exponential scaling coefficient c of the classical solvers is especially poor, likely due an insufficient number of data points for a three-parameter fit.

Appendix E: SA with a linear schedule
We used the DW2X annealing schedule in Fig. 9(b) for the SQA, SVMC, and SA simulations. Further optimization of this schedule is likely to improve the overall performance of the algorithms, although it is not evident whether it will substantially change their scaling with problem size. As an example, we provide results for the median TTS for SA using the DW2X schedule with a different overall temperature β = 0.396 and a linear schedule in Fig. 12, where we observe that the different schedules only shift the TTS curve but do not change the scaling within the statistical error bars. This indicates that our SA scaling results are robust to minor modifications of the schedule.

Appendix F: Quantile of Ratios
The benchmarking analysis we performed in the main text is akin to the 'ratio-of-quantiles' comparison performed in Ref. [6], where an alternative metric for speedups was also defined, called the 'quantile-of-ratios'. For this case, we find the annealing time that minimizes the TTS for each instance individually, and the perinstance optimal TTS, denoted TTS * i , is the minimal TTS for each instance individually. For each instance, the ratio of TTS * i for two different solvers is calculated, and different quantiles over the set of ratios is taken. We show in Fig. 13 the results for the median ratio using the logical-planted instances. The advantage of the DW2KQ relative to SA continues to hold, and SQA continues to exhibit the best scaling.

Appendix G: Gadget and Instance construction
The key new ingredient in our instance construction is an eight qubit 'gadget' that fits into the unit cell of the D-Wave processors. The gadget has a bipartite K 4,4 graph connectivity with the following Ising parameters, as also depicted in Fig. 5 in the main text: The logical-planted class of instances involves constructing planted-solution instances on the logical graph of the DW2KQ. The construction of the planted instance is similar to that of Ref. [20]. We define the logical   graph of the DW2KQ as being comprised of vertices corresponding to only the complete unit cells (with no faulty qubits or couplers). We also included one unit cell that was missing a single intra-cell coupler (unit cell 251), since having this missing coupler does not change the analysis. This is a minor difference relative to Ref. [20], where only complete unit cells were used. The edges of the logical graph correspond to having all four inter-cell couplers. We did remove the logical edge between unit cells 251 and 252. On an ideal Chimera graph, this would form a square grid. We constructed an Ising Hamiltonian as a sum of αL 2 frustrated loops, where we picked α = 0.65. We again chose to plant the all-zero state. We constructed loops as follows. Choose a random vertex on the graph as the starting vertex, and randomly pick an available edge. If that edge does not already have |J| = 3,   add the vertex connecting it to the chain until a loop is formed. Continue until the chain forms a loop by hitting a member of the chain. Only the loop and not the tail is kept. Accept the loop if it includes more than 4 vertices; this means that the minimum loop has 6 vertices. This then generates a planted-solution instance on the logical graph. In order to embed it on the hardware graph, turn on all the available couplings in the unit cell to be ferromagnetic with J = −3. In the notation of Ref. [20], this amounts to constructing instances with R = ρ = 3. Finally, we randomly placed our gadget into a fraction p = 0.1 of all the connected unit cells in the plantedsolution instance, and added these terms to the Ising Hamiltonian. (The gadget on unit cell 251 has the same ground state even with the one missing coupling.) The final Hamiltonian now has a maximum range of 6, i.e., |J ij | ≤ 6 for all couplers. Again, the ground state of the final Hamiltonian remains the all-zero state.

The eight qubit gadget
The key ingredient in our study is an eight qubit 'gadget' that fits into the unit cell of the D-Wave processors. Figure 14 compares the results for the 8-qubit gadget on the two D-Wave processor generations, for different representative unit cells. The success probability exhibits a single maximum, with the peaks occurring at different annealing times on the two devices. While there is some variation in the magnitude of the success probability depending on which unit cell is used, the position of the peak remains robust. We note however that the position of the peak differs on the two devices (around 100µs on the DW2X and around 300µs on the DW2KQ), indicating that the physical characteristics of the two devices are different beyond simply having different connectivity graphs.

Hardware-planted instances
Here we describe a class of instances we call "hardwareplanted" (not discussed in the main text), that also exhibits an optimal annealing time within the accessible range of the DW2KQ, as we demonstrate below. The class is defined by constructing planted-solution instances on the hardware graph of the DW2KQ, shown in Fig. 10(a). This method builds an Ising Hamiltonian as a sum of α8L 2 frustrated loops, where the all-zero state is a ground state of all loops (somewhat confusingly, the Hamiltonian is thus 'frustration-free' in the terminology of Ref. [68]). We picked α = 0.35 (this value is approximately where the peak in hardness occurs for the HFS algorithm, described in Appendix J). We constructed loops as follows. Choose a random vertex on the graph as the starting vertex. From the (at most six) available edges connected to this vertex, randomly pick one. If this new vertex has not been visited already, it is added to the chain. Continue until the chain forms a loop by hitting a member of the chain. Only the loop and not the tail is kept. The loop is discarded if any of the couplings along the loop already have |J| = 3, and if the loop does not visit at least two unit cells [19]. The second condition means that the shortest possible loop includes six vertices (within each unit cell the degree of each vertex is four, but including other unit cells the degree is six, except for unit cells along the boundary of the Chimera graph, where the degree can be five). Along the loop, choose the couplings to satisfy the planted solution, i.e., set them all to be ferromagnetic. Then randomly pick a single coupling and flip it. The couplings along the loop are added to the already-present coupling values on the graph. This process is repeated until α8L 2 loops are generated for the chosen value of α.
Finally, we randomly placed our gadget into pL 2 complete unit cells (without faulty qubits or couplers), where in this work we set p = 0.1, and added these terms to the Ising Hamiltonian. The final Hamiltonian now has a maximum range of 6, i.e., |J ij | ≤ 6 for all couplers. The ground state of the final Hamiltonian remains the all-zero state because this state is the ground state of all loop and gadget terms in the Hamiltonian.
We provide in Fig. 15 analogous results to those in Fig. 1 of the main text for the hardware-planted instances. In Fig. 15(a), we show a representative instance at L = 16 that exhibits an optimal annealing time above 500µs. In Fig. 15(b), we show that the median TTS exhibits a clear minimum for sizes L ∈ [8,16] (no minimum was observed for L < 8), which moves to higher annealing time values with increasing problem size. This is reflected in the distribution of instance optimal annealing times, as shown in Fig. 15(c). The steady increase in the hardness of the instances with problem size is reflected in the upward shift of the minimum TTS in Fig. 15(b).
Apart from the obvious difference of the existence of optimal annealing times at smaller sizes (L ≥ 8 compared to L ≥ 12), the optimal annealing time is significantly higher for the hardware-planted instances than for the logical-planted instance class, as summarized in Fig. 16. The optimal annealing time is seen to increase with problem size in all cases, rising faster for the DW2KQ than for the DW2X, but eventually flattening for both types of problem instances. The increase is consistent with both the possibility of benefit from a longer adiabatic evolution time or from a longer thermal relaxation time at larger problem sizes.
In Fig. 17 we present the scaling results for the hardware-planted instances at three different quantiles, in analogy to Fig. 2 in the main text. The simulation parameters for the solvers are identical except that we use colder temperatures for SA and SQA. For SA we use β = 0.396 (this corresponds to βB(1) ≈ 15), while for SQA we use β = 4.25. The scaling coefficients are summarized in Table II. We again find that SQA has the smallest scaling coefficient. The scaling coefficient of the DW2KQ is larger than that of all the classical solvers we tested, so for this class of instances we can definitively rule out the possibility of scaling advantage against the solvers we tested.

Correlating SQA and SVMC for logical planted instances
While a detailed analysis for each instance such as shown in Fig. 6 in the main text is prohibitive, we correlate in Fig. 18 the performance of SQA and SVMC at β = 2.5 and a relatively large number of sweeps. We observe that for almost all the instances, SQA finds the ground state with a significantly higher success probabil- ity and substantially fewer spin updates. Furthermore, a significant (but not overwhelming) number of instances hug the vertical axis of the scatter plot, corresponding to instances where SVMC completely fails to find the ground state but SQA succeeds with a non-vanishing probability.
Appendix H: Success probability scaling In Fig. 7 of the main text, we showed the power law scaling coefficient of the success probability extracted for t f ≤ 50µs. Here we provide supplemental data to support the quality of these fits. First, we show the data and fit in Fig. 19(a) for the instance depicted in Fig. 1(a) of the main text as well as its counterpart without the gadget. The data for this instance nicely agrees with a polynomial fit. In Fig. 19(b), we show that the uncertainty in the power law scaling coefficient for the majority of the instances is below 10%, indicating that the polynomial fits to the data points are reasonable.
Appendix I: Calculating the normalized overlap of instances.
In Fig. 8 of the main text we showed the overlap of the logical-planted instances below the median between the classical solvers and the DW2KQ. In order to calculate this quantity, we first fit the ln(TTS) of each instance to the function a(ln t f − b) 2 + c, and we evaluate the function at the optimal annealing time for the median TTS. This gives us a mean value TTS i (t * ) and its associated 1σ error ∆TTS i for the i-th instance. We then perform 1000 bootstraps over 400 instances, where for each bootstrap we generate two sets of 100 normally distributed random numbers η i, (1,2) in order to calculate two TTS realizations for each instance, i.e. TTS i,(1,2) = TTS i (t * )+η i,(1,2) ∆TTS i . For the two sets of TTS realizations, we calculate the median TTS and find which instances have a TTS below the median. We calculate the overlap fraction of instances between two solvers S and S for realizations α and β respectively, which we denote by f Sα,S β . The normalized fractionf C,DW2KQ between a solver C and the DW2KQ is then given by: The normalization ensures that even with the noisy realization of the TTS, the overlap of instances between a solver and itself is one.  Table III. (a) Hardware-planted instances. (b) Logical planted instances. The data symbols obscure the error bars, representing the 95% confidence interval for each optimal TTS data point (computed from the fit of ln TTS to a quadratic function as explained in in Appendix K).

Appendix J: Comparison to other algorithms
In this section we present results from testing a number of other algorithms. Of course, for practical reasons we cannot consider all other relevant algorithms (e.g., we do not consider the iso-energetic cluster updates algorithm [63]). Instead, we aimed to find other algorithms in addition to SQA that have a better scaling than the DW2KQ for the logical-planted instances. SQA remains the best-scaling algorithm among those we tested.

HFS
In the main text, we did not make comparisons to the HFS algorithm because it does not implement the same algorithmic approach as the other annealing algorithms. Nevertheless, because it is an algorithm tailored to solve spin-glass problems on the Chimera architecture, it is instructive to compare its performance. For HFS, we use the implementation provided by Ref. [69] (which does not utilize a GPU), and we ran it in mode "-S3", meaning that maximal induced trees (treewidth 1 in this case) were used. The TTS is given by [18]: where τ HFS = 0.3µs is the time for a single update. R(n trees ) is the number of repetitions with n trees tree updates. In principle, the optimal TTS is found by finding the value of n trees that minimizes the TTS, but the implementation of Ref. [69] continues to increase n trees until an exit criterion is reached. Specifically, the algorithm exits when the same lowest energy is found consecutively after n exit = 4 tree updates. We have found that this can be highly non-optimal, especially for the hardwareplanted instances. Therefore, in all our scaling plots, we have optimized the value of n trees . This is an important distinction from all previous work using the HFS algorithm, which to the best of our knowledge did not optimize n trees , and hence the scaling of the HFS algorithm in previous work is likely to be an underestimate of the true scaling, in the very same sense that the D-Wave scaling reported previously underestimates the true scaling. The behavior of the optimal TTS with problem size is shown in Fig. 20, with the scaling parameter fits given in Table III. We find that for the logical-planted instances, HFS scales better than the DW2KQ, while for the hardware-planted instances the scaling of the two is statistically indistinguishable.
For the HFS algorithm, we find that a quadratic fit does not capture the general features of the TTS curve as a function of number of tree updates. Instead, we find that a function of the form ln TTS = a(ln n tree ) −3 + b(ln n tree ) − 4 3 3/4 (a 3 b) 1/4 + c captures the data well. The value of c gives the value of ln TTS * . We give the fit values and their confidence intervals in Tables IV and V.

SAC
We can also consider simulated annealing with both single and multi-spin updates (SAC), with the latter being simultaneous updates of all the spins comprising a unit cell (super-spin approximation [7]). This requires the algorithm to know about the underlying hardware graph. The implementation of SAC is identical to that of SA, except each sweep of single spin updates is followed by a sweep of unit cell updates. The eight spins in the unit cell are flipped, and the move is accepted according to the Metropolis-Hastings rule. Because the unit cell graph is bipartite, unit cells in the first partition are updated first, followed by the unit cells in the second partition. This algorithm can be implemented as efficiently on GPU's as the single spin SA algorithm since it does not require storing any more data in memory. Because SAC effectively involves updating twice as many spins as SA in a single sweep, the timing of SAC is the same as in Eq. (C2) in the main text but with f SAC = 25ns −1 . For consistency, we use the same annealing schedule in B(s)β for SAC as we did for SA, with B(s) as in Fig. 9(b) (in units where the maximum Ising coupling strength |J ij | is 1). We use β = 0.132 (this corresponds to βB(1) ≈ 5) for the logical-planted instances and β = 0.396 (this corresponds to βB(1) ≈ 15) for the hardware-planted instances. We give the fit values and their confidence intervals in Tables VI and VII, and the behavior of the optimal TTS with problem size is shown in Fig. 20, with the scaling parameter fits given in Table II. We see that HFS and DW2KQ are statistically indistinguishable, and SAC is the top-scaling algorithm for the hardware-planted instances, outperforming even SQA. Results for the logical-planted instances are given in Table III, which for convenience reproduces data from Fig 3 in the main text. Here we see that SAC outperforms HFS, which in turn outperforms DW2KQ, while SQA is the top-scaling algorithm, outperforming SAC.

Minimum-weight perfect-matching
Before the mapping onto Chimera and the introduction of the gadget, the logical planted solution instances are defined on a two dimensional square grid. Here we check a polynomial-time algorithm for solving the minimumweight perfect-matching (MWPM) problem [70,71] and show that it cannot be used to efficiently determine the ground state of the logical planted instances defined on Chimera with the gadget. In order to do so, we map the the Ising Hamiltonian on the square grid to a MWPM problem [72], run the Blossom V algorithm [71] to determine the solution to the MWPM problem, and finally map the solution of the MWPM problem to a ground state of the Ising Hamiltonian. While the MWPM algorithm does find the ground state of planted solution instance defined on the square grid, it does not necessarily find the ground state of the associated Chimera instance with the gadget. The reason for this is that the gadget reduces the degeneracy of the ground state by selecting only those states for which the unit cell on which the gadget is placed points up. Without knowing this and because of the large ground state degeneracy, the MWPM predominantly selects the wrong ground state. We show in Fig. 21 how the success probability of finding the ground state decreases with increasing problem size. While at L = 8, MWPM finds the ground state for approximately half the instances, at L = 16, it finds the In order to determine the position of the optimal annealing time t * and its associated TTS * (we drop the quantile notation q for simplicity) at a given size L for the DW2KQ, DW2X, SA, SQA, and SVMC results, we fit the ln TTS data for different annealing times to a function of the form a(ln t f − b) 2 + c. The value of b gives the value of t * , and the value of c is the associated ln TTS * .
The fit values and their confidence intervals are given in Tables VIII (logical-planted) and IX (hardwareplanted) for the DW2KQ, in Table X for the DW2X (hardware-planted only, since logical-planted instances did not exhibit an accessible optimal annealing time on the DW2X), in Tables XI and XII for SA, in Tables XIII  and XIV for SQA, and in Tables XV and XVI for SVMC. Results of this fitting procedure for the DW2KQ results on the logical-planted instances are shown in Fig. 1(b) of the main text and on the hardware-planted instances in Fig. 15(b). The fits for the DW2X on the hardwareplanted instances are shown in Fig. 22. Because the largest problem size we can program on the DW2X is at L = 12, we studied only hardware-planted instances on this device. Note that because the hardware graph of the DW2X differs from that of the DW2KQ [see Figs. 10(a) and 10(b)], we should not assume that the instances defined on both are necessarily from the same class. Median TTS as a function of annealing time for hardware-planted instances on the DW2X. Unlike the DW2KQ, the optimal annealing time for L = 8 appears to be smaller than 5µs; for L ≥ 9 the optimal annealing time t * lies within the range of achievable annealing times for the DW2X device. Furthermore, for the same problem size, the DW2X optimal annealing times are consistently smaller than those of the DW2KQ [compare to Fig. 15(b) and recall Fig. 16]. While the difference in the hardware graph may play a role in this, it is likely that the intrinsic differences between the two devices is responsible (see Appendix G 1 for additional comparisons).          Fit to y = a(x − b) 2 + c for the median results of the hardware planted instances using SQA with y = log TTS and x = log nsweeps. The factor c does not include fSQA.