Seeking Quantum Speedup Through Spin Glasses: The Good, the Bad, and the Ugly

There has been considerable progress in the design and construction of quantum annealing devices. However, a conclusive detection of quantum speedup over traditional silicon-based machines remains elusive, despite multiple careful studies. In this work we outline strategies to design hard tunable benchmark instances based on insights from the study of spin glasses, the archetypal random benchmark problem for novel algorithms and optimization devices. We propose to complement head-to-head scaling studies that compare quantum annealing machines to state-of-the-art classical codes with an approach that compares the performance of different algorithms and/or computing architectures on different classes of computationally hard tunable spin-glass instances. The advantage of such an approach lies in having to compare only the performance hit felt by a given algorithm and/or architecture when the instance complexity is increased. Furthermore, we propose a methodology that might not directly translate into the detection of quantum speedup, but might elucidate whether quantum annealing has a "quantum advantage" over corresponding classical algorithms like simulated annealing. Our results on a 496-qubit D-Wave Two quantum annealing device are compared to recently used state-of-the-art thermal simulated annealing codes.


I. INTRODUCTION
Optimization plays an integral role across disciplines. Not only do modern manufacturing and transport heavily depend on efficient optimization methods to reduce cost and emissions, many fields of research also depend on a multitude of optimization techniques to solve a wide variety of problems. Similarly, the ever-increasing amount of data available to mankind means an urgent need for more efficient approaches to querying, parsing, and mining data, approaches that often depend on optimization techniques. Within physics-related disciplines alone, optimization is needed to solve many difficult problems ranging from frustrated spin systems [1][2][3] to novel approaches in material discovery, as well as the efficient parsing of high-energy event data or astrophysical spectra. As such, the search for more efficient optimization approaches is of great importance. Because the speedup of current silicon-based computing technologies is slowly coming to an end, mostly due to manufacturing and material constraints [4], interest in developing faster optimization methods has shifted to the development of new state-of-the-art algorithms, as well as novel computing paradigms, e.g., based on quantum architectures.
Quantum computing [5,6] and, in particular, adiabatic quantum optimization [7][8][9][10][11][12][13][14][15][16][17][18] have gained increased momentum since D-Wave Systems Inc. introduced the D-Wave Two (DW2) quantum annealing device [19]. Inspired by the work of Santoro et al. [12], multiple teams have attempted to demonstrate that quantum adiabatic optimization, or quantum annealing (QA) [20][21][22][23], has advantages over conventional thermal optimization techniques such as simulated annealing (SA) [24]. The idea behind QA is to adiabatically quench quantum fluctuations to optimize a cost function (Hamiltonian) of a given complex optimization problem. Potentially, the wave function of the problem might be able to quantum tunnel through barriers in the free-energy landscape, i.e., QA might be able to outperform other approaches like SA, where temperature fluctuations are slowly reduced to find the optimum. Towards the end of the annealing schedule in SA, when these temperature fluctuations are small, the system is unable to overcome free-energy barriers and, especially for problems with rough energy landscapes such as in spin glasses [25,26] and related problems, it might become trapped in metastable states, thus missing the true optimum of the problem.
The fact that a broad range of hallmark optimization problems, such as the satisfiability problem (k-SAT), the number partitioning problem, vertex covers, knapsack problems, coloring problems, the traveling salesman problem, etc. can be mapped onto quadratic unconstrained binary optimization problems [27], means that devices that are tailored to solve these, such as the DW2, could revolutionize today's optimization efforts. Although not a fully programmable universal quantum computer, the D-Wave device represents a sizable advance in (quantum) computing.
The seminal work of Rønnow et al. [28] took great care in defining the notion of quantum speedup. While at the moment the demonstration of strong quantum speedup remains a distant goal, the detection of limited quantum speedup [29], i.e., a speedup relative to a given corresponding classical algorithm such as SA, seems more graspable. The number of studies (see, for example, Refs. [28,30-33]) attempting to detect quantum speedup is growing at a fast pace; however, the definite detection of quantum speedup remains elusive. So why, despite these large efforts, does quantum speedup remain to be demonstrated? Potentially, there are many reasons why this might be the case. On one hand, the complex circuitry, combined with the extreme fragility of quantum states to perturbations, might be a source of decoherence and thus loss of any advantage over conventional techniques. On the other hand, the systems currently available (maximally 512 qubits on DW2, soon up to ∼ 1000) might be too small for the benchmarks to be in the asymptotic scaling regime. However, a more mundane reason that is relatively easy to fix is the choice of the wrong benchmark problem. In Ref. [34], Katzgraber et al. demonstrated that the native benchmark to search for quantum speedup on a device like the DW2, an Ising spin glass with discrete uncorrelated disorder, is likely too easy a problem to detect any speedup (think of two world-class skiers on a bunny slope). In addition, the energy landscape of a spin glass on the DW2 Chimera topology [35] might actually favor thermal approaches like SA, simply because the spin-glass state exists only at zero temperature. Furthermore, the use of either bimodal or uniform range-k disorder [28,31-33] creates an energy landscape that has a huge number of configurations that minimize the cost function. As such, any method like SA run with multiple restarts will naturally excel in optimizing such a problem. Attempts to mitigate this issue by planting solutions [36] deliver problem instances that might not be challenging enough for both classical algorithms and quantum devices alike.
To overcome the limitations imposed by the small size of current devices, it is imperative to use a native benchmark problem that uses as many qubits N as possible on the device. Any embedding of a potentially harder problem [37] will further reduce the number of logical qubits, thus pushing the asymptotic regime farther away. Furthermore, it is hard to mitigate the effects of noise on both qubits and couplers without improving manufacturing. However, it is considerably easier to design hard benchmark instances that attempt to work around the flaws and limitations of the DW2 architecture. Reference [38] focuses on designing instance problems that are affected as little as possible by the chip's intrinsic noise. Here, we present a simple road map that uses insights from the study of spin glasses to design hard, as well as tunable, benchmark instances.
In addition, we propose to search for quantum advantages over classical architectures not only by comparing to state-of-the-art classical algorithms [39], but by studying the effects of tuning the instance complexity for a given type of disorder on both classical and quantum approaches. By studying the performance hit felt by the different approaches on carefully tailored problems with a free-energy landscape that is either dominated by large barriers or is reminiscent of a ferromagnetic system, further insights into the nature of quantum annealing devices can be gained. To perform a fair comparison across instances, here we fix the ground-state degeneracy (ideally) to 1 (or as low as possible) and vary the complexity of the free-energy landscape by using the spin-glass order parameter distribution as a proxy to the dominant features of the landscape [40,41]. We show that, indeed, the spin-glass order parameter distribution produces tunable instances, and that predictions from the study of spin glasses on the complexity of the energy landscape allow us to produce problems that are on average considerably harder than those of any previous study.
We emphasize that we are not attempting to perform a scaling analysis as done in previous studies, simply because we believe that the currently accessible system sizes of up to 512 qubits are too small to be in the asymptotic limit [42]. We base this statement on previous simulations of two-dimensional Ising spin glasses on a square lattice at zero temperature with discrete disorder [43], where corrections to scaling due to the finite system sizes were very strong for systems with ∼ 10^3 spins.
Our results show that the DW2 device is outperformed at finding the ground state by classical state-of-the-art optimization algorithms. However, there is a potential signature that the DW2 device might be able to optimize certain classes of carefully designed native spin-glass problems more efficiently than the classical counterpart SA, especially if noise is reduced. This suggests that the DW2 device potentially has a "quantum advantage" over corresponding classical algorithms like SA for certain problems. In addition, there are signs that the DW2 device might in some cases be more effective than SA at generating low-lying states, as opposed to strict ground states. Finally, our results suggest that "classical computational hardness" in spin glasses seems to carry over to quantum annealing devices, therefore facilitating the design of spin-glass-based instances. The day that quantum annealing machines have lower noise levels, higher connectivity to enable the simple embedding of spin-glass problems with, e.g., a finite transition temperature [34,37], or a larger number of qubits, a combination of the approach presented in Ref. [28] with error-correction techniques [31,44] and the designer instances described in this work will likely show if quantum speedup is myth or reality.
The paper is structured as follows. In Sec. II, we introduce the native benchmark problem, followed by a detailed description of the limitations of current approaches as well as how we design hard instance problems in Sec. III. Section IV summarizes results on both the DW2 device, as well as classical simulation codes, followed by a discussion and summary. Appendix A outlines our experimental methodology on the DW2 device housed at D-Wave Systems Inc., followed by simulation details in Appendix B and numerical results in Appendix C. Appendix D summarizes less fruitful efforts experimenting with other instance classes.

II. NATIVE BENCHMARK: SPIN GLASSES
We illustrate our benchmarking ideas using the D-Wave Systems, Inc., D-Wave Two quantum annealing machine [45]. The native benchmark problem for the DW2 device is an Ising spin glass [6,25-27] defined on the Chimera topology of the system [35],

$$H = -\sum_{\{i,j\} \in E} J_{ij} S^z_i S^z_j - \sum_{i \in V} h_i S^z_i. \qquad (1)$$

The N Ising spins S^z_i ∈ {±1} are defined on the vertices V of the Chimera lattice (see Fig. 7) and can be coupled to a (local) field h_i. The first sum is over all edges E connecting vertices {i, j} ∈ V. In this study we set h_i = 0 ∀i.
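As an illustration of the model just defined, the following Python sketch builds an idealized Chimera edge list and evaluates the Ising energy. The function names, the qubit indexing, and the assumption of a fully working 8 × 8 array of K_{4,4} unit cells (512 qubits, whereas the device used here has 496 working qubits) are illustrative choices and not the authors' code. The sign convention H = -Σ J_ij S_i S_j - Σ h_i S_i is likewise assumed.

```python
import random

def chimera_edges(L=8, k=4):
    """Edge list of an idealized L x L Chimera graph with K_{k,k} unit cells.

    Hypothetical indexing: each cell holds two partitions of k qubits.
    Partition 0 couples to the cell below, partition 1 to the cell on the right.
    """
    def idx(row, col, side, u):
        return ((row * L + col) * 2 + side) * k + u

    edges = []
    for r in range(L):
        for c in range(L):
            # complete bipartite couplers inside the unit cell
            for u in range(k):
                for v in range(k):
                    edges.append((idx(r, c, 0, u), idx(r, c, 1, v)))
            # couplers between neighboring unit cells
            for u in range(k):
                if r + 1 < L:
                    edges.append((idx(r, c, 0, u), idx(r + 1, c, 0, u)))
                if c + 1 < L:
                    edges.append((idx(r, c, 1, u), idx(r, c + 1, 1, u)))
    return edges

def ising_energy(spins, J, h=None):
    """Energy under the assumed convention H = -sum J_ij S_i S_j - sum h_i S_i."""
    e = -sum(Jij * spins[i] * spins[j] for (i, j), Jij in J.items())
    if h is not None:
        e -= sum(hi * si for hi, si in zip(h, spins))
    return e

edges = chimera_edges()
n_spins = 2 * 4 * 8 * 8
rng = random.Random(0)
J = {edge: rng.choice([-1, 1]) for edge in edges}      # bimodal couplers
spins = [rng.choice([-1, 1]) for _ in range(n_spins)]  # random configuration
print(len(edges), ising_energy(spins, J))
```

On a real chip, broken qubits and their couplers would simply be removed from the edge list before drawing the disorder.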
We emphasize that it is of paramount importance to study native problems that use as many qubits as possible to prevent overhead that might yield smaller embedded problems. At the moment, with approximately 500 (soon 1000) qubits at hand, it will be difficult to detect any quantum speedup. As such, our focus does not lie in performing a detailed scaling analysis with the problem size N, but in showing how to select tunable hard problems that have the same disorder distribution, i.e., have the same strengths or weaknesses with respect to the intrinsic noise found in these devices. Tuning the complexity of the problem instances will then allow for a systematic testing of any potential advantages or disadvantages that the DW2 device might have over other architectures and/or simulation approaches. Note that in this study we disregard the effects of noise on the couplers and qubits and will report on these in a subsequent publication with strategies on how to mitigate the effects of perturbed problem Hamiltonians [38]. However, for the generated problems, the resilience to noise (robustness to perturbations) on the qubits and couplers is roughly similar and mostly agrees within error bars for the different instance subclasses that use interactions based on Sidon sets [46]; see Sec. III B for details. This means that the noise of the DW2 should not bias the comparison between instance subclasses.

III. DESIGNING HARD INSTANCES
We start by describing the shortcomings of previous instances to detect quantum speedup and then outline our approach to produce tunable, hard instances.
In Ref. [34] it was shown that a spin glass on the Chimera topology has a zero-temperature phase transition. Although the worst-case complexity of finding a ground state of an Ising spin glass on the Chimera graph falls into the NP-hard class, any minimization of the energy based on an annealing approach will likely have a rather simple phase space to traverse for small system sizes because dominant barriers will not be as pronounced. Embedding problems that have a finite-temperature spin-glass transition is difficult, mainly due to the large overhead; i.e., only systems with few logical qubits can be studied because many physical qubits are needed to emulate long-range interactions. Because the resulting systems are small, the problems are far from the asymptotic regime needed to detect any quantum speedup in a scaling analysis.
A more promising route is thus to use insights from the study of spin glasses and carefully design the interactions between the qubits on the native Chimera graph, such that the problems are as hard as possible in order to challenge any optimization approach.

A. Problems with current approaches
In addition to a restrictive geometry, the D-Wave hardware has clear restrictions as to what values the interactions between the spins can have. This is rather limiting and, as such, only discrete and well-separated values of the couplers can be set. The simplest approach used in previous studies [28,31-33] is to select the disorder from a bimodal distribution, i.e., J_ij ∈ {±1} (we shall refer to these as U_1), followed by uniform range-k problems where the interactions J_ij are chosen from the integer set {±1, ±2, . . . , ±k}. We refer to the latter as U_k. The problem with these choices for systems up to N = 512 variables is the huge degeneracy of the ground states, which again yields benchmarks too simple to challenge any optimization approach (see Sec. IV). A simple analogy to this problem is a game of golf where the green has, for example, 10^7 holes. Hitting a hole in one is a trivial task! However, having a course with only one hole makes the sport truly challenging. As such, we design herein problems that, within the hardware restrictions of the machine, have a unique configuration that minimizes the Hamiltonian in Eq. (1).
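The effect of the coupler values on the ground-state degeneracy can be seen on a toy example. The sketch below counts minimizing configurations by exhaustive enumeration, which is only feasible for a handful of spins and is not the method used in this work. The four-spin frustrated ring, the function name, and the integer coupler numerators 8, 13, 19, and 28 (in units of 1/28, so that energies stay exact) are assumptions made for illustration.

```python
from itertools import product

def ground_state_degeneracy(n, J):
    """Return (ground-state energy, number of minimizing configurations).

    Exhaustive enumeration of H = -sum J_ij S_i S_j with h_i = 0.
    Integer couplers keep the energy comparison exact.
    """
    best, count = None, 0
    for spins in product((-1, 1), repeat=n):
        e = -sum(Jij * spins[i] * spins[j] for (i, j), Jij in J.items())
        if best is None or e < best:
            best, count = e, 1
        elif e == best:
            count += 1
    return best, count

# Frustrated ring 0-1-2-3-0 with bimodal couplers (one negative bond)
bimodal = {(0, 1): 1, (1, 2): 1, (2, 3): 1, (3, 0): -1}
# Same ring with well-separated magnitudes
separated = {(0, 1): 8, (1, 2): 13, (2, 3): 19, (3, 0): -28}

print(ground_state_degeneracy(4, bimodal))    # (-2, 8)
print(ground_state_degeneracy(4, separated))  # (-52, 2)
```

With bimodal couplers any of the four bonds can be the broken one, giving eight minimizing configurations. With distinct magnitudes only the weakest bond is broken, leaving the two configurations related by a global spin flip, a symmetry that is always present for h_i = 0 and is included in the count here.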
Other approaches [36,47] using planted solutions suffer from similar problems: While the instances are harder than the problems in the U_k class, they often still have a large degeneracy and their complexity is not high enough for the currently available systems of up to ∼ 10^3 qubits. In particular, the very careful work presented in Ref. [36] shows a clear easy-hard-easy transition of the planted k-SAT solutions that could be exploited to generate hard instances. However, one problem with these instances is that the disorder is not drawn from a particular distribution; i.e., two different planted k-SAT instances will likely have a very different (classical) energy spectrum and thus also be differently susceptible to the intrinsic noise found in the DW2 device [48]. Furthermore, we perform experiments with planted k-SAT solutions as presented in Ref. [36] using the benchmark codes in Ref. [39] and find that these instances are at times easier than the ones in the U_1 class. The authors of Ref. [36] do emphasize that harder problems must be designed to allow for the optimization of the annealing time, as well as the need to find problems where the benefits of quantum annealing can be assessed ahead of time.
Finally, setting the spin-spin interactions within the K_{4,4} unit cell of Chimera (see Fig. 7) to be of larger magnitude than those between the cells (often referred to as "cluster problems") has given DW2 an advantage over classical codes in a scaling analysis [49] when cluster Monte Carlo updates are not allowed. However, by design, simulated annealing (and any other Monte Carlo-like simple-sampling variation) will have a large disadvantage. The addition of simple cluster-like moves would again give classical approaches the upper hand and, as such, these approaches are not a viable route to detect any speedup, especially because they are unphysical.

B. Designing tunable hard instances
Our approach to generate hard instances capitalizes on the similarity between classical hardness of spin-glass-like problems and quantum hardness. In Fig. 6 of Ref. [40], it was shown in detail how the "mixing" or "autocorrelation" time strongly correlates to the complexity of the spin-glass order parameter distribution while performing the simulations with state-of-the-art parallel tempering Monte Carlo methods [50][51][52]. Autocorrelation times uniquely characterize the time a classical algorithm needs to completely decorrelate the system. As such, the time can be used as an indirect proxy of the time complexity of a particular disorder instance.
In spin glasses, order is measured by comparing two copies of the system with the same disorder [25]. For simplicity, we set S^z_i ≡ S_i, because we are studying the system classically. In that case, the overlap between two replicas α and β with the same disorder J but independent Markov chains is defined via

$$q = \frac{1}{N} \sum_{i=1}^{N} S^{\alpha}_i S^{\beta}_i,$$

where the sum is over all N spins. One can then study the distribution of the order parameter P(q), which characterizes a given disorder instance J. After a disorder average [· · ·]_av over many instances, the averaged distribution [P(q)]_av displays a single peak around q ∼ 0 for high temperatures. For T → 0 two peaks at ±q_EA emerge [53,54], a characteristic signature of a broken symmetry. However, for a given instance the structure of the distribution P(q) can be rather complex and can have multiple peaks at different values of q in addition to the two dominant peaks at ±q_EA. Individual peaks can be identified with pairs of dominant valleys in the (free-) energy landscape [26]. When these peaks are close to q ≈ 0, one can assume that a thick barrier separates these valleys, whereas when the peaks are close to each other the barriers are typically thin. Reference [40] showed that when the distribution P(q) has large support for an area close to q = 0, the autocorrelation times were typically larger than when the support around q = 0 is close to zero. As such, by measuring the distribution function P(q), we can predict approximately the time complexity of a particular disorder instance [41]. This is illustrated in the main panel (bottom left) of Fig. 1. There, three characteristic instances are shown (color coded). An instance with many peaks close to q = 0 will typically be computationally harder than one that has only two peaks at |q| ∼ 1 (red line).
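The overlap measurement can be sketched in a few lines of Python. The code assumes that pairs of low-temperature replica configurations for one disorder instance are already available, for example from two independent Monte Carlo runs; the function names and the binning are illustrative assumptions.

```python
def overlap(replica_a, replica_b):
    """Spin overlap q = (1/N) sum_i S_i^alpha S_i^beta of two replicas."""
    n = len(replica_a)
    return sum(a * b for a, b in zip(replica_a, replica_b)) / n

def overlap_histogram(pairs, bins=20):
    """Normalized histogram estimate of P(q) on [-1, 1] for one instance.

    pairs: list of (replica_a, replica_b) configurations sampled with the
    same disorder but independent Markov chains.
    """
    width = 2.0 / bins
    counts = [0] * bins
    for a, b in pairs:
        q = overlap(a, b)
        k = min(int((q + 1.0) / width), bins - 1)  # q = 1 goes into the last bin
        counts[k] += 1
    total = len(pairs)
    return [c / (total * width) for c in counts]

a = [1, -1, 1, 1]
flipped = [-s for s in a]
print(overlap(a, a), overlap(a, flipped))
print(overlap_histogram([(a, a), (a, flipped)], bins=4))
```

The histogram is normalized so that the bin heights times the bin width sum to one, which is the normalization assumed when quoting threshold values for P(q).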
Our experiments (shown herein) on the DW2 device show that, indeed, the complexity of an instance can be tuned by studying the structure of P(q), where the distance between two dominant peaks corresponds roughly to the barrier thickness in phase space and the relative depth between the peaks and maxima can be interpreted approximately as the barrier depth. While we are confident that there is a clear correlation between the distance ∆q of two well-defined peaks and the thickness of barriers in the energy landscape, the correlation of the depth between the peaks and the height of the barriers remains to be tested experimentally by a more precise mining of the data. However, if the depth between the peaks is nonzero, then it is safe to assume that there is some relatively trivial path that connects the valleys [55].
In addition to selecting instances according to the complexity of the phase space by studying the behavior of the spin-glass order parameter distribution, we estimate the number of configurations for a given instance that minimize the Hamiltonian in Eq. (1). The goal is to make the problem as difficult as possible by restricting the number of minimizing configurations ideally to one, i.e., a unique ground state. To estimate the number of ground-state configurations a given instance has, we use the method pioneered in Refs. [56,57], where states at very low temperatures are sampled with parallel tempering Monte Carlo techniques. Once the ground-state energy is found, a histogram with minimizing configurations is created (indexed by translating the binary configuration string to a number) and sampled until every bin has at least 50 hits. We make sure that we find the true ground-state energy by studying every instance with different simulational heuristics. However, we cannot be completely certain that we have found all configurations that minimize the Hamiltonian, simply because in some cases this number can be huge (in the worst case 2^N). Having exactly one ground state is not a necessary condition to generate a hard problem. However, if our efficient low-temperature search is unable to find more states that minimize the cost function, it will be unlikely that other methods will.
A large source of degeneracy in an Ising Hamiltonian is due to zero local fields. The Hamiltonian in Eq. (1) with h_i = 0 can be written as a single-spin expression, namely,

$$H = -\frac{1}{2} \sum_{i} F_i S_i,$$

where the local fields F_i are given by

$$F_i = \sum_{j : \{i,j\} \in E} J_{ij} S_j.$$

Whenever for a given disorder F_i = 0, spin S_i can take any value without influencing the energy of the system. Therefore, if a given disorder instance has k spins where F_i = 0, the degeneracy of the ground state will grow by a factor 2^k. To prevent this from happening, we need to choose the disorder from a distribution that, within the restrictions of the device, minimizes the cases where the local fields are zero. The most convenient choice is thus to select the values of |J_ij| from a Sidon set [46]. In a Sidon set, as the term is used here, the sum of two distinct members of the set gives a number that is not part of the set. For example, the set {2, 5, 10} is a Sidon set because the pairwise sum of distinct members of the set never adds up to a member of the set. This is not the case for {2, 5, 7}, where 2 + 5 = 7.
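The two ingredients of this argument, the pairwise-sum property and the local fields, can be checked with a short Python sketch. The function names are hypothetical, the local field is taken as F_i = Σ_j J_ij S_j over the neighbors of spin i (an assumed normalization), and the set property implemented is exactly the one stated in the text: no sum of two distinct members is itself a member.

```python
from itertools import combinations

def no_pairwise_sum_in_set(values):
    """True if no sum of two distinct members is itself a member of the set."""
    members = set(values)
    return all(a + b not in members for a, b in combinations(sorted(members), 2))

def local_fields(spins, J):
    """Local fields F_i = sum_j J_ij S_j for h_i = 0, with J keyed by edge (i, j)."""
    fields = [0] * len(spins)
    for (i, j), Jij in J.items():
        fields[i] += Jij * spins[j]
        fields[j] += Jij * spins[i]
    return fields

print(no_pairwise_sum_in_set([2, 5, 10]))       # True
print(no_pairwise_sum_in_set([2, 5, 7]))        # False, since 2 + 5 = 7
print(no_pairwise_sum_in_set([8, 13, 19, 28]))  # True

# Spin 0 has two neighbors pointing in opposite directions.
spins = [1, 1, -1]
print(local_fields(spins, {(0, 1): 1, (0, 2): 1}))   # [0, 1, 1]: spin 0 is free
print(local_fields(spins, {(0, 1): 8, (0, 2): 13}))  # [-5, 8, 13]: no free spin
```

The pairwise property reduces, but does not eliminate, zero local fields: two couplers of equal magnitude attached to the same spin can still cancel, which is consistent with the text's statement that the choice minimizes such cases.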
To illustrate our ideas, we choose the interactions between the spins from the Sidon set S_28, i.e., J_ij ∈ {±8/28, ±13/28, ±19/28, ±28/28}, where we normalize the interactions to be restricted between ±1 [58]. To select instances with particular properties, we can therefore generate large numbers of random problems using different disorder distributions and then mine the data. We first fix the number of ground-state configurations to 1, and then we divide the instances into subclasses by studying the (normalized) overlap distribution P(q) for each instance. For example, we define the following classes:
(a) Hard instances with thick barriers: These are instances where P(q) > 5 for |q| ≤ 0.75. See Fig. 1, main panel. We are interested in instances that have dominant peaks in the central (blue/dark) window. Based on classical simulations, we expect these instances to be on average among the hardest. In particular, we expect that both simulated and quantum annealing will have trouble finding the optimum; see Fig. 1(a).
(b) Hard instances with thin barriers: These are instances where P(q) ≈ 0 for |q| ≤ 0.50 and where P(q) > 2.5 for |q| ≥ 0.5 with at least two peaks in the range |q| ∈ [0.5, 1.0]. See Fig. 1, main panel. We are interested in instances that have dominant peaks that are close to each other in the gray boxes at |q| > 0.5. Based on classical simulations, we expect these instances to be hard on average, although not as hard as the instances with a thick barrier. We expect that while simulated annealing will have similar problems as with the instances with a thick barrier, quantum annealing might show an enhanced performance if the device has some quantum advantage over classical codes; see Fig. 1(b).
(c) (Hard) instances with small barriers: These are instances where P(q) < 0.1 for |q| ≤ 0.75. The overlap distribution is reminiscent of a ferromagnet at low temperature. In this case no peaks are allowed in the large central (red/light) box of Fig. 1, main panel. In these instances we expect one dominant energy valley (up to smaller wiggles), i.e., these should be the easiest instances on average for any annealing approach. See Fig. 1(c).
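One possible way to turn the three window criteria into code is sketched below in Python. The thresholds 5, 2.5, 0.1, 0.50, and 0.75 are taken from the text; everything else is an assumption made for illustration: the binned representation of P(q), the tolerance `zero_tol` used for "P(q) ≈ 0", the local-maximum definition of a peak, the restriction of the peak count to q > 0 (P(q) is symmetric), and the choice to report each criterion independently rather than impose a precedence between classes.

```python
def count_peaks(values, threshold):
    """Number of local maxima above threshold in a 1D sequence."""
    n = 0
    for k, v in enumerate(values):
        left = values[k - 1] if k > 0 else float("-inf")
        right = values[k + 1] if k + 1 < len(values) else float("-inf")
        if v > threshold and v > left and v >= right:
            n += 1
    return n

def window_criteria(q, p, zero_tol=0.1):
    """Evaluate the thick / thin / small-barrier criteria on a binned P(q)."""
    inner75 = [pv for qv, pv in zip(q, p) if abs(qv) <= 0.75]
    inner50 = [pv for qv, pv in zip(q, p) if abs(qv) <= 0.50]
    outer_pos = [pv for qv, pv in zip(q, p) if qv >= 0.50]
    return {
        "thick": max(inner75) > 5.0,
        "thin": (max(inner50) <= zero_tol
                 and max(outer_pos) > 2.5
                 and count_peaks(outer_pos, zero_tol) >= 2),
        "small": max(inner75) < 0.1,
    }

# Bin centers for 40 bins of width 0.05 on [-1, 1]
Q = [round(-0.975 + 0.05 * k, 3) for k in range(40)]

def toy_distribution(peaks):
    """Symmetric toy P(q): peaks maps |q| bin centers to bin heights."""
    return [peaks.get(abs(qv), 0.0) for qv in Q]

thick = toy_distribution({0.225: 6.0, 0.975: 4.0})
thin = toy_distribution({0.625: 4.0, 0.975: 6.0})
small = toy_distribution({0.975: 10.0})
for name, dist in (("thick", thick), ("thin", thin), ("small", small)):
    print(name, window_criteria(Q, dist))
```

The toy distributions are hand-made, normalized examples and do not represent data from this study.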
Note that the individual windows we use are tuned such that from 10^5 randomly simulated instances approximately 5000 match the aforementioned criteria. After filtering out the instances that have more than one minimizing configuration, we obtain approximately 2500 instances to experiment with. The detailed simulation strategy, as well as simulation parameters, are listed in Appendix B.
Noise on the DW2 device is approximately 5% of a particular external field (qubit noise) h and 3.5% of a spin-spin interaction (coupler) J_ij. For the instances in S_28, the smallest classical energy gap is ∆E = 2/28, i.e., slightly larger than the noise found on the DW2 device. While this will affect the success probabilities, it will affect all instances, either easy or hard, approximately the same way. To verify this, we perform detailed simulations where we compute the ground-state energy and configuration of a given instance with no degeneracy, perturb the couplers and qubits with Gaussian random noise of a typical strength found in the current DW2 device, and recompute the ground-state configuration. We apply 10 noise gauges and compute how stable the different instance subclasses defined above are on average. Our results show that all Sidon-set-based instance subclasses with different barrier thicknesses are affected similarly by the intrinsic noise of the device (not shown). As such, when comparing instance classes, on average a fair comparison is performed.
FIG. 1 (caption, beginning missing): Because the barriers are large and thick, we expect both classical and quantum approaches to have difficulties. In (b), we illustrate the expected behavior when the barriers are thin, i.e., double peaks (or more) that protrude from the dark boxes in the region |q| > 0.5. The features in the energy landscape of these hard instances with thin barriers are still very pronounced, but we expect the barriers to be thinner than in (a). While SA should show little to no advantage when the barriers remain high but are thinner, if the DW2 device has any quantum advantage, it might be able to overcome these barriers. Finally, we study instances that have no features for |q| < 0.75 (large red box in the main panel) and only have a single peak at ±q_EA. These (hard) instances with small barriers have the simplest energy landscape (c) with mostly only one dominant feature. As such, we expect any annealing approach to efficiently find the optimum of the problem (on average). Note that these are cartoons intended to illustrate the different instance classes and do not represent actual data.
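The noise-robustness check described in the text (compute the ground state, perturb couplers and fields with Gaussian noise, recompute) can be sketched as follows. This Python example uses exhaustive enumeration on a tiny instance in place of the simulations used in this work. It also assumes that the quoted 5% and 3.5% are standard deviations in units of the maximum coupler strength; both the function names and that reading of the noise levels are assumptions.

```python
import random
from itertools import product

def ground_states(n, J, h):
    """All minimizing configurations of H = -sum J_ij S_i S_j - sum h_i S_i."""
    best, states = None, []
    for spins in product((-1, 1), repeat=n):
        e = -sum(Jij * spins[i] * spins[j] for (i, j), Jij in J.items())
        e -= sum(hi * si for hi, si in zip(h, spins))
        if best is None or e < best - 1e-12:
            best, states = e, [spins]
        elif abs(e - best) <= 1e-12:
            states.append(spins)
    return best, states

def noise_stability(n, J, sigma_j=0.035, sigma_h=0.05, n_trials=10, seed=0):
    """Fraction of noise realizations whose ground state is an unperturbed one."""
    rng = random.Random(seed)
    _, reference = ground_states(n, J, [0.0] * n)
    reference = set(reference)
    stable = 0
    for _ in range(n_trials):
        J_noisy = {edge: Jij + rng.gauss(0.0, sigma_j) for edge, Jij in J.items()}
        h_noisy = [rng.gauss(0.0, sigma_h) for _ in range(n)]
        _, noisy = ground_states(n, J_noisy, h_noisy)
        # Field noise breaks the global spin-flip symmetry, so the noisy
        # ground state only needs to be one of the unperturbed ones.
        if set(noisy) <= reference:
            stable += 1
    return stable / n_trials

J = {(0, 1): 8 / 28, (1, 2): 13 / 28, (2, 3): 19 / 28, (3, 0): -28 / 28}
print(noise_stability(4, J))
```

Averaging this fraction over the instances of each subclass would give a stability estimate per subclass, which is the quantity compared in the text.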

IV. RESULTS
A detailed list of the average success probabilities is given in Appendix C. To make sure that an approximately fair comparison with a known baseline study is performed, we tune the number of sweeps for the SA codes [39] such that the average success probabilities for SA and the DW2 device are approximately the same for bimodal disorder. This is the case for N_sw = 900 sweeps. Note also that below we quote mainly average success probabilities. The reason is that for the hardest instance classes the DW2 device is often unable to minimize the cost function for the number of runs performed; i.e., a median would be zero and thus deliver no useful information. Because probabilities are restricted to be in the interval [0, 1], an average is well defined.
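For orientation, the following Python sketch shows a plain single-spin-flip simulated annealing run and the resulting success probability estimate. It is not the optimized code of Ref. [39]: the linear ramp of the inverse temperature, the value of `beta_max`, the function names, and the toy instance with integer couplers are all assumptions made for illustration.

```python
import math
import random

def simulated_annealing(n, J, n_sweeps, beta_max=5.0, rng=random):
    """One SA run for H = -sum J_ij S_i S_j using Metropolis single-spin flips."""
    neighbors = [[] for _ in range(n)]
    for (i, j), Jij in J.items():
        neighbors[i].append((j, Jij))
        neighbors[j].append((i, Jij))
    spins = [rng.choice((-1, 1)) for _ in range(n)]
    for sweep in range(1, n_sweeps + 1):
        beta = beta_max * sweep / n_sweeps  # linear ramp of inverse temperature
        for i in range(n):
            field = sum(Jij * spins[j] for j, Jij in neighbors[i])
            delta_e = 2 * spins[i] * field  # energy change if spin i is flipped
            if delta_e <= 0 or rng.random() < math.exp(-beta * delta_e):
                spins[i] = -spins[i]
    energy = -sum(Jij * spins[i] * spins[j] for (i, j), Jij in J.items())
    return energy, spins

def success_probability(n, J, e_ground, n_sweeps, n_runs, seed=0):
    """Fraction of independent SA runs that end at the known ground-state energy."""
    rng = random.Random(seed)
    hits = sum(
        simulated_annealing(n, J, n_sweeps, rng=rng)[0] == e_ground
        for _ in range(n_runs)
    )
    return hits / n_runs

# Toy frustrated ring with integer couplers; its ground-state energy is -52.
J = {(0, 1): 8, (1, 2): 13, (2, 3): 19, (3, 0): -28}
print(success_probability(4, J, e_ground=-52, n_sweeps=50, n_runs=20))
```

In this picture, tuning the number of sweeps as described in the text amounts to varying `n_sweeps` until the average success probability on a reference class matches the target value.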
A. The ugly: D-Wave Two fails often
Figure 2 shows sorted success probabilities p for SA (left) and the DW2 device (right) and different instance classes normalized by the number of samples N_sa studied. We compare classes S_28 with thick, thin, and small barriers with uniform range-4 (U_4) instances and bimodal disorder (U_1) used in previous studies [28]. The data for the DW2 device show a clear progression in complexity and, in particular, that the device is unable to solve many of the harder problems (success probabilities below 10^-4). The SA simulations using the codes of Ref. [39] show that bimodal disorder is considerably easier than all other instance classes. Furthermore, for the number of sweeps used, the complexity of U_4 is similar to S_28 with small ("none") barriers. Interestingly, the SA codes do not distinguish between S_28 instances with thin and thick barriers. Note that this is not the case for the DW2 device.
Furthermore, SA can solve a much wider range of instances, as can be seen by the distributions dropping to zero only close to n → N_sa. This means that while the typical (median) probability to solve a problem is finite for the SA codes, for the hardest instance classes the median is zero for the DW2 device. A double-peaked success behavior of the quantum annealer is consistent with what has been reported in Refs. [28,32], which present it as evidence of quantum behavior, although the hypothesis has been subsequently challenged by studies of quasiclassical models [59,60]. Finally, we emphasize that by optimizing the number of sweeps in the SA codes these can be tuned to outperform the DW2 device for all disorder classes studied.
[Figure caption fragment, beginning missing] ... i.e., the machine would need many more runs to be able to find the optimum of hard native problems. Error bars are omitted for better viewing.

B. The bad: Previous instance classes are too easy
Figure 3 shows averaged (and gauge-averaged) success probabilities in logarithmic scale for both DW2 and SA for different instance classes. The data clearly illustrate that the average success probabilities for bimodal disorder are approximately one order of magnitude larger than for any other type of disorder studied. Note that we choose the number of sweeps for SA such that the average success probability in the bimodal class is comparable to the DW2 device. For the DW2 device, one can clearly see a progression in difficulty between U_1, U_4, as well as the Sidon set S_28 with small barriers, followed by the Sidon sets with thin and thick barriers. For the choice of sweeps in SA, U_4 is comparable to S_28 with no dominant barriers, and the S_28 instances with thick and thin barriers have approximately the same average success probabilities. For all Sidon instance classes studied, the classical SA simulations outperform DW2 based on raw success probabilities. This is seen in more quantitative detail in Fig. 4, which shows the ratio of the average success probability for SA divided by the average success probability for DW2 for each instance class. To establish any quantum speedup, a system-size scaling is needed. However, the fact that the average success probabilities for the bimodal disorder for DW2 and the classical SA codes are much larger than for all other problems suggests that bimodal disorder (or, more generally, highly degenerate random problems) is too easy a problem to detect any quantum speedup. Running any classical SA code in repetition mode with highly degenerate problems potentially represents an advantage over any quantum annealing scheme.
Overall, DW2 has far lower average success probabilities on the Sidon sets. This can be explained by the inherent noise present in the device. In the Sidon sets the gap to the first excited state is considerably smaller than for, e.g., bimodal disorder. As such, solving a Hamiltonian that is not the target Hamiltonian due to noise-induced perturbations is likely. Therefore, in an attempt to filter out these effects, we study relative probabilities between instance classes and not between optimization techniques. Because the problem instances are randomly generated, one can expect that within a given instance type, e.g., S_28, the noise affects all instance classes in a similar fashion [58], as we see in our simulations. This means also that the difference in the performance of DW2 for S_28 instances with thick and thin barriers is likely not an artifact of the chosen values for the couplers.
[Figure caption fragment, beginning missing] In all Sidon instance classes (S_28) the classical codes outperform DW2. Furthermore, success probabilities for bimodal disorder (U_1) are much larger than for any other instance class, therefore suggesting that the degeneracy produced by bimodal disorder makes this instance class too easy to detect quantum speedup. Note also that the classical codes, on average, do not seem to distinguish between instances with thick and thin barriers. Labels are from left to right.

C. The good: Evidence of a quantum advantage?
Figure 3 suggests that, at least with the choice of annealing parameters made, in the Sidon instance class the classical codes do not seem to differentiate between thin and thick barriers on average, whereas DW2 does seem to show an improvement in the average success probabilities when the barrier thickness is decreased.
Given the stochastic nature of the classical algorithms, the thickness of a barrier should have a much weaker effect on the algorithmic efficiency than its height. We have selected the instances in such a way that barriers are predominantly tall. Although we have no exact control at the moment as to how tall these barriers are, we can expect them to be on average of similar height for both Sidon sets with thin and thick barriers. However, by selecting instances with peaks in the overlap distribution at a given distance from each other, we have good control over the barrier thickness. Figure 5 shows the ratio of average success probabilities when reducing the barrier thickness (left) and removing dominant barriers (right) for both SA and DW2. While reducing the barrier thickness has no effect on average on the classical algorithms, DW2 experiences a performance increase. To make sure this is not an artifact of our choice of simulation parameters, we run the SA codes with both $N_{\rm sw} = 900$ and $2000$ sweeps, obtaining qualitatively the same results. Furthermore, we find no correlation between the barrier thickness and the effects noisy couplers and qubits have on the success probabilities for both instance classes. When removing dominant barriers altogether, both classical and quantum algorithms show a noticeable performance increase. One can, therefore, surmise that when the barriers are thin enough (and tall) the DW2 device might experience a quantum advantage over classical approaches. However, a far more careful and systematic study must be performed before strong conclusions can be drawn.

FIG. 5: Average success probability increase when reducing the barrier thickness (ratio between the average success probabilities for S28 thick and S28 thin) and removing the barriers (ratio between the average success probabilities for S28 thick and S28 none). While in the latter case both classical algorithms and the quantum annealer show a performance boost on average, in the former only the quantum annealer shows improvement.
To gain a deeper understanding of the noise effects that affect the DW2 device, we relax our criterion for a successful optimization run by allowing the $k$ lowest excited states to count towards a "successful" run in the Sidon sets. In this case, the smallest classical energy gap when flipping a spin is $\Delta E = 2/28 \approx 0.0714$. This should be compared with the disorder-averaged ground-state energy of the system, i.e., $[E_0]_{\rm av} \approx -551$. We compute the success probabilities for energies in the interval $[E_0, E_0 + k\Delta E]$ for different instance classes using SA and the DW2. Figure 6 shows the average success probabilities as a function of the number of energy levels $k$. Although we only fix the average success probabilities for the $U_1$ class to be similar for DW2 (full symbols) and SA (empty symbols) at $k = 0$, it seems this result holds for at least the first 10 excited states. As can be seen, average success probabilities increase with an increased inclusion of low-lying energy levels for all instance classes. The trend is far more pronounced for the DW2 device than for SA in the case of the Sidon sets $S_{28}$, indicating that noise clearly affects the ability of the machine to detect ground states. Furthermore, note that allowing for the lowest 10 energy levels in the $S_{28}$ class corresponds to an increase of less than 1% in the overall energy of the system. Averaging over gauges (i.e., different instances of noise terms in the Hamiltonian) does help the DW2 device, thus illustrating that an increased performance strongly depends on reducing noise, and also on performing multiple quenches.
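The relaxed success criterion described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `relaxed_success_probability`, the toy ground-state energy, and the list of excitation levels are hypothetical and do not come from the measured data; only the gap 2/28 is taken from the text.

```python
import numpy as np

def relaxed_success_probability(energies, e0, delta_e, k_max):
    """Fraction of runs with energy in [e0, e0 + k*delta_e] for k = 0..k_max."""
    energies = np.asarray(energies, dtype=float)
    tol = 1e-9  # guards against floating-point round-off at the window edge
    return np.array([
        np.mean(energies <= e0 + k * delta_e + tol)
        for k in range(k_max + 1)
    ])

# Hypothetical toy data: excitation level (in units of delta_e) of ten runs.
delta_e = 2 / 28          # smallest classical gap quoted in the text
e0 = -10.0                # assumed ground-state energy of a toy instance
levels = np.array([0, 0, 1, 3, 3, 7, 12, 2, 0, 5])
energies = e0 + levels * delta_e

p_of_k = relaxed_success_probability(energies, e0, delta_e, k_max=10)
print(p_of_k)
```

By construction the resulting curve is non-decreasing in k, so the relevant comparison between solvers is how quickly it rises with k, not whether it rises.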
Is the DW2 device of any use then? For problems affected by noise due to device restrictions, the DW2 thus might efficiently deliver low-lying energy states. This is of particular relevance to problem domains such as machine learning [61] and Bayesian statistical analysis [62].
For optimization, the data suggest that error-correction strategies [31] that enhance robustness to noise should be explored in greater depth. Combined with a hybrid approach that either breaks up the problem into smaller groups that are easier to tackle [63][64][65], or uses other efficient computing architectures [66] to complement the minimization, the DW2 device (or any other quantum annealing machine) might be an efficient optimization tool one day.

V. DISCUSSION
We illustrate that a careful design of the benchmark instances is key when attempting to detect quantum speedup. In particular, using insights from the study of spin glasses can help in designing benchmark problems that are considerably harder than previous attempts, and are tunable. Noise levels combined with the small number of qubits on the DW2 device make it difficult to detect any quantum speedup at the moment. Below, we attempt to discuss sources of the poor performance of the device as seen from the spin-glass perspective.
Disordered frustrated binary systems are the native, likely hardest, as well as simplest benchmark problems for any new (quantum) computing paradigm. It is important to consider some of the hallmark properties of spin glasses that could make it extremely difficult to detect any (quantum) speedup in the presence of coupler, as well as local-field qubit noise.

A. Effects of coupler noise
The extreme fragility of the spin-glass state was predicted a long time ago [67,68] and analyzed on the basis of scaling arguments [69,70]. These scaling arguments predict that the configurations that dominate the partition function change drastically and randomly when temperature, local fields, or the interactions between the spins are modified. There is strong (numerical) evidence of disorder chaos (coupler noise) in spin glasses [71][72][73][74][75][76][77][78]. Therefore, small perturbations of the couplers due to noise might lead to the destruction of the spin-glass state, as well as to a change of the problem to be solved. The latter can be alleviated slightly by performing multiple gauges. However, the weak chaos regime is dominated by rare events that can flip large spin domains that can directly affect experimental results [77]. Increasing the classical energy gap beyond the noise level of the machine can partially reduce these effects, however at the cost of producing considerably easier benchmark instances [38]. One might argue that the minimum classical gap of the Sidon instances (∆E = 2/28) is too small compared to the machine restrictions when encoding problems. However, we perform tests with a different instance class with a larger classical energy gap and where the couplers are drawn from the Sidon set {±5, ±6, ±7}, finding qualitatively similar results.

B. Effects of local-field noise
In mean-field theory [79], an Ising spin-glass system has a line of transitions in a field [80], known as the de Almeida-Thouless line that separates the paramagnetic phase at high temperatures and fields from the spin-glass phase at lower temperatures and fields [81][82][83][84][85][86]. Although the existence of a de Almeida-Thouless line for short-range spin glasses is still under some debate (see, for example, Refs. [87][88][89]), there is vast numerical evidence for a multitude of geometries and, in particular, low-dimensional systems that the spin-glass state is strongly affected by any longitudinal (random) fields [90][91][92][93]. As for the case of disorder chaos in spin glasses, the spin-glass state can be easily affected by the intrinsic qubit noise of the DW2 device. Therefore, it might be plausible that, again, the high levels of noise might reduce the success probabilities because the studied system is perturbed and dominant barriers are affected.

VI. SUMMARY AND CONCLUSIONS
We find that for most disorder types studied, DW2 is systematically slower at finding the ground state than the state-of-the-art classical SA codes developed by Isakov et al. [39]. Note that, by optimizing the number of sweeps in the SA codes, these can be tuned to outperform the DW2 device for all disorder classes studied. Although this might be discouraging at first, we argue that an improved machine calibration [94], noise reduction [95], and the ability to likewise optimize the quantum annealing schedule combined with larger system sizes and tailored spin-glass problems might help in the quest for quantum speedup. We also show that a "classically computationally hard" problem seems to typically also be a hard problem for the quantum annealing device. However, it could also be that the DW2 device is a thermal annealer [59,60,96-99] in disguise.
For the hardest Sidon instances the DW2 device does show a promising trend when the success constraints are relaxed. Furthermore, reducing the thickness between barriers in the free-energy landscapes suggests that for the large Sidon instances studied some quantum advantage might be present. However, this would not be enough to deem the hardware to be efficient, especially because it is unclear if this effect persists for larger problem sizes. We conclude by stressing that a careful design of benchmark instances is key to detecting quantum speedup [28] or any quantum advantage a novel quantum annealing device might have. We thus expect that a combination of the methodologies outlined in this work with the approach outlined in Ref. [28] that defines the notion of "quantum speedup" in detail, combined with better hardware (and maybe quantum error correction [31,44]), will finally show whether or not quantum annealing has an advantage over classical thermal annealing.

[Figure caption:] Circles represent the individual qubits and lines the couplers. White circles represent fully functional qubits, whereas light gray circles represent working qubits with missing couplers. Broken qubits are represented by dark circles (16). This means that the total number of working qubits is 496.

D-Wave Two Methodology
An annealing time of 20 µs is used for all experimental runs on the DW2 processor, which is cooled to a temperature of 18 mK. Each problem instance is run $N_R = 10^4$ times in $N_G = 10$ batches of randomly chosen gauge transformations in order to provide protection against parameter noise and control errors. To generate a gauge transformation, a set of $N$ random variables $\{t_i\}$, with $t_i \in \{-1, 1\}$, is sampled uniformly, and the transformation is made. In principle, this procedure does not fundamentally change the problem, but due to parameter noise on the physical device, each gauge transformation of a given instance will, in reality, correspond to a different Hamiltonian. Following the analysis performed in Ref. [33], an instance's success probability across gauges is derived from the geometric mean of the gauges' failure rates. If $p_g$ is the observed success probability of a gauge $g$, then

$$p = 1 - \left[ \prod_{g=1}^{N_G} (1 - p_g) \right]^{1/N_G} .$$

A "success" is defined as the occurrence of a state meeting a criterion, for example, of having ground-state energy $E_0$, or with energy lying in a range $[E_0, E_0 + \Delta]$, $\Delta > 0$, of the minimum.
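As a minimal sketch of the gauge procedure and of combining per-gauge results through the geometric mean of failure rates, consider the following. The explicit form of the transformation (J_ij -> t_i t_j J_ij, h_i -> t_i h_i), the sign convention of the Hamiltonian, and all function names are assumptions made for illustration; they are not quoted from the text above. The energy invariance checked here holds for either sign convention.

```python
import numpy as np

def gauge_transform(J, h, rng):
    """Return (J', h', t) with J'_ij = t_i t_j J_ij and h'_i = t_i h_i."""
    t = rng.choice([-1, 1], size=len(h))
    return J * np.outer(t, t), h * t, t

def ising_energy(J, h, s):
    """Assumed convention: H = sum_{i<j} J_ij s_i s_j + sum_i h_i s_i,
    with J stored as a strictly upper-triangular matrix."""
    return s @ J @ s + h @ s

def combined_success_probability(p_gauges):
    """One minus the geometric mean of the per-gauge failure rates 1 - p_g."""
    p = np.asarray(p_gauges, dtype=float)
    return 1.0 - np.prod(1.0 - p) ** (1.0 / len(p))

# Toy check on a random 6-spin instance (hypothetical data).
rng = np.random.default_rng(0)
n = 6
J = np.triu(rng.normal(size=(n, n)), k=1)
h = rng.normal(size=n)
s = rng.choice([-1, 1], size=n)

J_g, h_g, t = gauge_transform(J, h, rng)
e_original = ising_energy(J, h, s)
e_gauged = ising_energy(J_g, h_g, t * s)   # state mapped as s_i -> t_i s_i

p_combined = combined_success_probability([0.0, 0.75])
```

On ideal hardware the two energies agree for every state, so all gauges describe the same problem; the gauges differ only once device noise is added to the programmed parameters.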
The DW2 device is run in the so-called "autoscaling" mode for all problems, which adjusts the nominally specified J and h parameters to fully use the range allowed by the device.

Simulated Annealing Methodology
For the software-based simulated annealing experiments, we use the codes developed by Isakov et al. [39] to ensure a fair comparison with previous studies. The authors present a variant of SA that exploits the bipartite nature of topologies such as the Chimera graph in order to halve the number of variables being simulated. This optimization results in considerably improved performance over plain SA. In this study we use the an_ss_ge_nf_bp_vdeg routine.
All instances are simulated $N_R = 10^4$ times for $N_{\rm sw} = 900$ Monte Carlo sweeps each; clearly, no advantage would be gained from gauge transformations in the software case. The default geometric annealing schedule described in Ref. [39] was adequate for our purposes, but the (inverse) temperature scales were appropriately adjusted for each instance class. The parameters of the simulation are listed in Table I.
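To make the role of the number of sweeps and of a geometric schedule concrete, here is a minimal sketch of plain single-spin-flip simulated annealing. It is not the bipartite-optimized code of Ref. [39]. It assumes that "geometric" means a constant ratio between consecutive inverse temperatures, and the start and end values of beta, the sign convention, and the toy chain instance are hypothetical (the actual parameters are those of Table I).

```python
import numpy as np

def geometric_betas(beta_start, beta_end, n_sweeps):
    """Inverse temperatures with a constant ratio between consecutive sweeps."""
    return np.geomspace(beta_start, beta_end, n_sweeps)

def simulated_annealing(J, h, betas, rng):
    """Metropolis SA for H = sum_{i<j} J_ij s_i s_j + sum_i h_i s_i,
    with J stored as a strictly upper-triangular matrix."""
    n = len(h)
    J_sym = J + J.T                      # symmetric couplings, zero diagonal
    s = rng.choice([-1, 1], size=n)      # random initial configuration
    for beta in betas:                   # one sweep per inverse temperature
        for i in range(n):
            # Energy change caused by flipping spin i
            dE = -2.0 * s[i] * (J_sym[i] @ s + h[i])
            if dE <= 0 or rng.random() < np.exp(-beta * dE):
                s[i] = -s[i]
    energy = 0.5 * (s @ J_sym @ s) + h @ s
    return s, energy

# Toy instance: open ferromagnetic chain of 8 spins (J = -1 in this convention).
rng = np.random.default_rng(1)
n = 8
J = np.zeros((n, n))
for i in range(n - 1):
    J[i, i + 1] = -1.0
h = np.zeros(n)

betas = geometric_betas(0.1, 5.0, 200)   # hypothetical schedule endpoints
runs = [simulated_annealing(J, h, betas, rng) for _ in range(20)]
best_energy = min(e for _, e in runs)
```

The fraction of independent runs that reach the ground-state energy is the success probability compared throughout the text; repeating the loop with more sweeps gives the dependence on the sweep count.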
Note that we choose $N_{\rm sw} = 900$, such that the average success probabilities for the DW2 device agree with the SA simulations for the commonly studied bimodal ($U_1$) disorder. We choose this approach to provide a baseline for all other instance classes. Simulations with $N_{\rm sw} = 2000$ sweeps showed qualitatively similar results.

To compute the overlap distribution $P(q)$ we perform finite-temperature parallel tempering Monte Carlo simulations [50][51][52] combined with isoenergetic cluster moves [102] to speed up the simulations. We choose a temperature set with 30 temperatures, and the lowest temperature $T_{\rm min} = 0.212$ is chosen such that thermalization can be completed in a meaningful time and features in the overlap distribution are well defined. Two replicas with $N = 496$ spins and the same disorder are thermalized for $2^{23}$ Monte Carlo sweeps and $P(q)$ is measured over an additional $2^{23}$ Monte Carlo sweeps to obtain high-resolution data. We compute $10^5$ randomly chosen disorder instances for each problem class. The data are then mined according to predefined criteria (see Sec. III B).

Table II lists the numerical values of the average success probabilities for the different instance classes we study either on the DW2 device or with SA codes. All numbers are averaged via a jackknife procedure over $N_{\rm sa}$ instances of the disorder.
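The jackknife average over disorder instances can be sketched as follows. This is a generic leave-one-out estimator for the mean and its error, given under the assumption that each instance contributes one success probability; the function name and the sample values are hypothetical and do not reproduce Table II.

```python
import numpy as np

def jackknife_mean(values):
    """Jackknife estimate of the mean and of its standard error."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    # Leave-one-out means: drop sample i and average the remaining n - 1
    loo = (x.sum() - x) / (n - 1)
    mean = loo.mean()
    err = np.sqrt((n - 1) / n * np.sum((loo - mean) ** 2))
    return mean, err

# Hypothetical per-instance success probabilities.
p_instances = [0.1, 0.4, 0.2, 0.3]
p_mean, p_err = jackknife_mean(p_instances)
```

For a plain mean the jackknife error reduces to the usual standard error of the mean; the leave-one-out form becomes useful for nonlinear quantities such as ratios of averaged success probabilities, where the same resampling is applied to the ratio instead of to the mean.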

Appendix D: Other Instance Classes Studied
We also perform other experiments with different instance classes. However, these are either too easy or it is extremely difficult to obtain unique ground-state instances. Note that for the $J_4$ instances [34], where the interactions are bimodally distributed and the bonds in the $K_{4,4}$ cells are a 1/4, as well as the $S_{1,3,7}$ small Sidon instances, we limit the number of configurations that minimize the Hamiltonian to less than 32 because too few unique ground states could be found. As such, we are merely mentioning here the results to prevent other researchers from attempting to study these systems. Average success probabilities are listed in Table III.

TABLE III (fragment):
[...] (6) 13.3(2)
S1,3,7 (thick barriers) 1615 0.063(4) 0.59(1)
S1,3,7 (small barriers) 1582 0.22(1) 1.14(2)