High-quality Thermal Gibbs Sampling with Quantum Annealing Hardware

Quantum Annealing (QA) was originally intended for accelerating the solution of combinatorial optimization tasks that have natural encodings as Ising models. However, recent experiments on QA hardware platforms have demonstrated that, in the operating regime corresponding to weak interactions, the QA hardware behaves like a noisy Gibbs sampler at a hardware-specific effective temperature. This work builds on those insights and identifies a class of small hardware-native Ising models that are robust to noise effects and proposes a procedure for executing these models on QA hardware to maximize Gibbs sampling performance. Experimental results indicate that the proposed protocol results in high-quality Gibbs samples from a hardware-specific effective temperature. Furthermore, we show that this effective temperature can be adjusted by modulating the annealing time and energy scale. The procedure proposed in this work provides an approach to using QA hardware for Ising model sampling presenting potential new opportunities for applications in machine learning and physics simulation.


I. INTRODUCTION
The computational task of sampling -that is producing independent configurations of random variables from a given distribution -is believed to be among the most challenging computational problems. In particular, many sampling tasks cannot be performed in polynomial time, unless strong and widely accepted conjectures in approximation theory are refuted [1][2][3]. Consequently, state-ofthe-art algorithms for general purpose sampling are based on Monte Carlo methods that require significant computing resources to sample from distributions of practical interest.
Gibbs distributions, i.e., distributions with the form P (σ) ∝ exp(−βH(σ)), have close ties to modeling many physical systems as they capture the equilibrium configurations that matter takes at a given inverse temperature β [4]. It has also been observed that Gibbs distributions provide a powerful foundation for building machine learning models and optimization algorithms [5,6]. However, the computational challenge of sampling from these distributions is often prohibitive for practical applications of machine learning and optimization. It has been observed that a variety of the emerging analog computing devices, including those based on optical [7] and quantum [8] technologies, can be used as fast Gibbs samplers because of the natural connection between probabilistic physical systems and state sampling. The emergence of these novel computing devices has the potential for a dramatic impact on state-of-the-art approaches for physics simulation, machine learning and optimization.
From its inception, Quantum Annealing [9][10][11][12] (QA) has been designed for solving optimization tasks and not sampling tasks. However, in practice, various non-ideal properties of QA hardware platforms lead to outputs that are reminiscent of thermal Gibbs distributions [13][14][15][16]. Recent experiments have demonstrated that, in the operating regime corresponding to weak input interactions, QA hardware behaves like a noisy Gibbs sampler [17] at a hardware-specific effective temperature, that is, a sampler from a mixture of Gibbs distributions with fluctuating parameters caused by noise. Furthermore, it was shown in [17] that samples from this noisy mixture of distributions is indistinguishable from a single Gibbs distribution with spurious additions to interaction structure of the programmed model. In applications where sampling from a specific Gibbs distribution is required, this distortion of the model parameters may present an undesirable feature because of this mismatch in model structure.
Building on the insights of [17], this work demonstrates that it is possible to leverage QA hardware to perform high-quality Gibbs sampling from some types of input models. In particular, we identify a class of hardwarenative Ising models where the D-Wave 2000Q platform consistently produces samples with a total variation distance less than 5% from a desired target Gibbs distribution. To this end, this work explores the full range of operational input parameters, in contrast with [17] that focused on quantifying the effect of noise on the structure of the output distribution and hence considered a narrow range of weak input couplings. We also show that changing the annealing time enables the QA hardware to sample from a range of inverse temperatures between 1.9 and 5.1, which is an essential feature in a variety of applications. This work leverages the combination of three insights to achieve high-quality Gibbs sampling on the QA hardware: (1) it focuses on the hardware-native Ising models that are resilient to the hardware's noise; (2) the input models are re-scaled to avoid distortions caused by the transverse field in the computational model; and (3) both the annealing time and the model's energy scale are leveraged to control the effective temperature of the target distribution. This results in a prescriptive procedure for conducting high-quality Ising model Gibbs sampling on QA hardware without the need to tune the input model or perform post-processing to correct for distorted output distributions. If required, both of these procedures can be leveraged to further improve the results presented in this work, as done in [13,18,19]. While this work shows that, under specific conditions, Gibbs sampling is possible using quantum annealing hardware, it does not illustrate how it can be used in specific sampling applications.
This work begins by reviewing the computational model of quantum annealing and previous works exploring sampling applications in Section II. Building on these previous works, a procedure for conducting high-quality thermal Gibbs sampling is proposed in Section III and evaluated on quantum annealing hardware. Section IV explores theoretical models to provide insights into the mechanisms that under-pin the proposed sampling approach and Section V concludes with a discussion of future directions.

II. PROBLEM FORMULATION & RELATED WORK
The foundational physical system of interest to this work is the Ising model, a class of graphical models where the nodes, V , represent classical spins (i.e., σ i ∈ {−1, 1} ∀i ∈ V ), and the edges, E ⊆ N × N , represent pairwise spin interactions. A local field h i ∀i ∈ V is specified for each spin and an interaction strength J ij ∀i, j ∈ E is specified for each edge. The energy of a given spin configuration is given by the Hamiltonian: This Ising model is widely used to encode challenging computational problems arising in the study of magnetic materials, machine learning, and optimization [20][21][22][23]. The particular computational task of interest to this work is to produce i.i.d. samples σ from the Gibbs distribution associated with the Ising model at the inverse temperature 1 α, that is, Throughout this work, we assume that the parameters J, h are in the range of −1 to 1, to provide a consistent scaling for α. Given the computational challenge of Ising model sampling at finite temperature, our objective is to leverage QA hardware to conduct this task.

A. A Brief Review of Quantum Annealing
The central idea of quantum annealing is to use the transverse field Ising model combined with an annealing process to find the low-energy configurations of a classical Ising model. The elementary unit of this model is a qubit i ∈ V described by the standard vector of Pauli matrices { σ x , σ y , σ z } along the three spatial directions {x, y, z}.
The hardware platform provides a programmable Ising model Hamiltonian on the z-axis [9], which encodes the Ising energy function (1) with a oneto-one mapping of qubits to Ising spins. Note, that the eigenvalues of the Ising Hamiltonian operator are in bijection with the 2 N possible assignments of the classical Ising model from Eq. (1). The quantum annealing protocol strives to find the low-energy assignments to a user-specified H Ising model by conducting an analog interpolation process of the following transverse field Ising model Hamiltonian: The interpolation process starts with s = 0 and ends with s = 1. The two interpolation functions A(s) and B(s) are designed such that A(0) B(0) and A(1) B(1), that is, starting with a Hamiltonian dominated by − i∈V σ x i and slowly transitioning to a Hamiltonian dominated by H Ising . In the closed system setting and when this transition process is sufficiently slow, the quantum annealing is considered adiabatic quantum computing. It is known that under these conditions, quantum annealing will find the ground states of H Ising , and therefore minimizing configurations of H Ising , with high probability [24][25][26][27][28]. To that end, the length of the annealing process t (in microseconds) is a user controllable parameter. The outcome of the quantum annealing process is specified by the binary string σ, where each element σ i takes a value +1 or −1 and corresponds to the observation of the spin projection of qubit i in the computational basis denoted by z.

B. Quantum Annealing and Sampling
Given the energy minimizing nature of quantum annealing, it is not immediately clear how it may be applied to the sampling task posed in (2). It is reasonable to postulate that an adiabatic system would behave similarly to Gibbs sampling at the zero temperature limit (i.e., α = ∞), that is drawing i.i.d. samples from the ground states of H Ising . Interestingly, this is not usually the case as the transverse field component of the Hamiltonian, σ x i , induces biases in the annealing process to prefer some ground states over others [29][30][31][32]. Nevertheless, protocols have been proposed to help increase the fairness of ground state sampling using quantum annealing [33,34]. In contrast to previous works, this work is concerned with sampling from thermal distributions (i.e., α < ∞), which requires accurate sampling at all of the energy levels of H Ising , presenting a formidable challenge in model accuracy.
Real-world QA hardware is an open quantum system that is subject to a wide variety of non-ideal properties, including thermal excitations and relaxations that impact the results of the computation [35][36][37]. In this regard, the relevant adiabatic theorem is the open system adiabatic theorem [38][39][40]: if the dynamics is governed by a master equation of Lindblad form [41] and if the annealing is done sufficiently slowly, the system will converge with high probability to the steady state of the dynamics. This result is promising because, for certain master equations, the thermal Gibbs state is the steadystate solution [42,43].
In addition to the fundamental impacts of open quantum systems, the D-Wave hardware documentation highlights five other sources of deviations from ideal system operations called integrated control errors (ICE) [44], which include: background susceptibility; flux noise; Digital-to-Analog Conversion quantization; Input/Ouput system effects; and variable scale across qubits. Consequently, it has long been observed that output distributions of the QA hardware produced by D-Wave Systems are reminiscent of a Gibbs distribution of the input Hamiltonian H Ising [13][14][15][16]18] with a hardware-specific effective temperature of β ≈ 10 [17,45]. The prevailing interpretation of the hardware's output distribution is the Freeze Out model, which proposes that the output reflects a quantum Gibbs distribution occurring at an input-dependent point towards the end of the annealing process where some small amount of σ x i remains [46]. This model has been notably successful for training quantum machine learning models [47,48] and generating statistics for quantum Gibbs distributions [49][50][51]. However, these quantum Gibbs distributions are inaccurate when targeting a desired classical Gibbs distribution for sampling applications [6,18,52,53]. The recent insight from [17] is that when this hardware is operated at a low-energy scale (i.e., |J|, |h| ≤ 0.050) it behaves as a thermalized classical Gibbs sampler from H Ising but suffers from a notable amount of distortion from instantaneous noise in the local field parameters, h, on the order of 0.036 [45,54], behaving as a so-called noisy Gibbs sampler.
Inspired by the classical Gibbs sampling insights of [17], this work demonstrates that there exists a class of Ising Hamiltonians where D-Wave's 2000Q hardware produces high-quality Gibbs samples, with minimal distortions from noise or transverse fields. In particular, we consider the class of H Ising with J, h ∈ {−1, 0, 1}, |V | ≤ 16 that are representable on the D-Wave 2000Q hardware graph. The primary insight of this work is to operate the QA hardware at an energy scale, α in , that is large enough to be noise-resilient but small enough to avoid the degeneracy breaking properties of the transverse field. It is not guaranteed that such a 'sweetspot' exists but this work demonstrates that the range of 0.2 ≤ α in ≤ 0.4 on the platform considered achieves the desired properties. Section IV of the paper provides a qualitative study postulating why a sweet-spot occurs at this particular energy scale. Additionally, this work shows that in the proposed energy scale the annealing time has a consistent effect of tuning the effective temperature (i.e., α) of the Gibbs samples generated by the hardware. Combined these observations present new opportunities for leveraging QA hardware for conducting the Ising model sampling task described by (2).

III. EXPLORING GIBBS SAMPLING WITH QUANTUM ANNEALING HARDWARE
This work is concerned with the following four parameters: H Ising , the Ising model that one would like to sample from (restricted to J, h ∈ {−1, 0, 1}); α in ∈ [0, 1], a scaling factor that will be used when programming the QA hardware; t, the annealing time of the QA protocol; and α out ∈ [0, α max ], a scaling factor of the distribution output by the QA hardware, where α max is an estimated upper bound and is further described in section III A 3. Given H Ising , the QA hardware is programmed with the re-scaled model H in Ising = α in H Ising and executed with an annealing time t. Note that rescaling is equivalent to replacing J with α in · J and likewise with h. The empirical distribution output by the hardware ν is then compared to Gibbs distributions of H Ising at different effective temperatures, i.e., µ(α, σ) ∝ exp (−αH Ising (σ)). We define α out as, that is, the effective temperature of the closest Gibbs distribution of H Ising to the results output by the hardware, using total variation distance (TV) as the measure of closeness. Details of the total variation metric are provided in Appendix A. The remainder of this section explores how changes in the α in and t parameters impact the output distribution of the D-Wave 2000Q Quantum Annealer located at Los Alamos National Laboratory, known as DW 2000Q LANL. The unexpected finding is that there exists range of α in values where the QA hardware is a high-quality Gibbs sampler.
A. Experiment Design

Ising Model Selection
Given the challenges of sampling from embedded Ising models [52], this work focuses exclusively on H ising Hamiltonians that have native representation in the QA hardware. In particular, the D-Wave 2000Q system implements a C 16 chimera graph [55], which consists of a 16 × 16 grid of unit cells each containing 8 qubits (4 horizontal and 4 vertical). In each unit cell, every horizontal qubit is connected to every vertical qubit through couplers and various qubits in adjacent unit cells are also connected through couplers. The coupling strength J ij between qubits i and j can be programmed to values in the continuous range [−1, 1], and the local fields h i can be programmed to values in the range [−2, 2]. As part of the execution process these unitless quantities are transformed into the hardware's energy scales of 0.0 to 6.36 GHz for A(s)/h and 0.07 to 14.56 GHz for B(s)/h, where h is Planck's constant.
In selecting the Ising models for testing Gibbs sampling with QA, we strive to design models that exhibit a variety of sampling difficulties. Previous works have highlighted that QA will prefer some degenerate ground states over others as a result of residual effects from the annealing process [29][30][31][32]. As ground states are the most heavily weighted states in the low-temperature Gibbs distribution, these asymmetries introduce significant sampling errors. In this work, we elicit this effect by constructing a set of 13 models with ground-state degeneracies ranging from 1 to 38 to capture both easy and challenging instances to sample from with a quantum annealer.
In particular, this work considers seven GSD (groundstate degeneracy) cases with degeneracy values of 2, 4, 6, 8, 10, 24, and 38, without local fields, and six additional GSD-F models with 1, 2, 3, 4, 5, and 6 ground state degeneracies including local fields. Note that the GSD-F models tend to have lower degeneracy because the inclusion of the fields breaks symmetries within energy levels. The specific instances are referred to as GSD-# and GSD-F-# where the # indicates the amount of ground state degeneracy. The Hamiltonians of each instance and a discussion of how they were designed is provided in Appendix B.

Quantum Annealing Data Collection
For each of the GSD models, H Ising , a family of hardware inputs is considered by sweeping α in between 0.0 and 1.0 with step size of 0.0125 below 0.1, 0.025 between 0.1 and 0.5, and 0.1 above 0.5. The increased step size for low scales is helpful as the output statistics tend to be more sensitive to α in in this regime. For each of these models and scales, 10 6 samples are collected from the QA hardware. After every 100 samples, a gauge transform is applied in order to mitigate the effects of bias. For each gauge transformation, a randomly generated vector a ∈ {−1, 1} V is used to redefine the Ising parameters via h i → a i h i , J ij → a i a j J ij . This transformation preserves the energy eigenvalues of H(s), but it relabels the spin configurations σ by σ i → a i σ i . Using gauge transformations is a well-known technique for eliminating these field biases, which can add additional unwanted asymmetries [17,56,57]. After the samples have been collected, the empirical probability of each state is recorded. This data collection process is repeated for annealing times t of 1 µs, 5 µs, 25 µs, and 125 µs.

Estimating the Effective Output Temperature
As discussed at the beginning of this section, we would like to identify the closest Gibbs distribution of H Ising to the empirical distribution output by the QA hardware, i.e., (5). However, the exact solution of the optimization task posed by (5) presents a significant computational challenge because of difficulties in computing candidate Gibbs distributions. The primary reason that this work has focused on the small-scale Ising models with only 16 spins is to make this optimization task computationally tractable. In this work, the optimization problem (5) is solved by a brute-force calculation of the total variation distance for a discrete set of α values. The α values are determined by calculating an optimistic estimate for α max and then scaling this value by the same step sizes that are used for α in . The α max value is calculated by executing the single-qubit protocol proposed in [45] on each of the relevant spins and taking the average of the spin-by-spin effective temperatures recovered by that protocol. Therefore, the minimum of the range is 0.0 and the maximum is α max . The α value from this discrete set that achieves the minimum total variation distance between the quantum annealing data and the Gibbs distribution is selected as α out . In the results, we observe that α out α max indicating that α max is a sufficiently optimistic value for this brute-force optimization procedure in practice.

Bounding Finite Sampling Performance
Because the empirical distribution produced by the QA hardware is calculated from finite samples, there is an unavoidable error caused by finite sampling (details are discussed in Appendix A). To understand the impact of this finite sampling error, a lower bound is calculated by a simulation procedure that generates 10 6 samples from the selected α out Gibbs distribution and calculates the total variation distance between this empirical distribution and the exact distribution. Because the samples are generated from a known source distribution, the only source of error is a result of finite sampling. Therefore, this value represents the best total variation distance that is achievable given the number of samples that were used in these experiments 10 6 . This lower bound is included in the result figures to provide a measure of the portion of the total variation distance that can be attributed solely to the limitations of finite sampling. In this work we find that this sampling issue is only significant when the input scale is low (α in < 0.1) as this regime reflects a distribution that is close to uniform and the probability mass is spread out among exponentially many states. The presence of errors caused by finite sampling motivates our use of a large number of samples (i.e., 10 6 per data point) and provides an explanation for some of the consistent features in the results that are observed in the low α in regime.

A Typical GSD Model
We begin by presenting detailed results on a characteristic GSD model and then provide a summary of the results across the complete collection of models. The total variation distance results for the GSD-6 model are presented in Figure 1. As the value of α in is increased, the results show an up-down-up shape in the TV metric. At very low α in < 0.01, the TV value is limited by finite sampling effects. As α in is increased, it quickly deviates from the family of H Ising Gibbs distributions and gradually becomes more Gibbs-like. At some point, while increasing α in , a minimum TV value is reached and it begins to increase monotonically until the maximum α in value. This up-down-up trend is replicated across a variety of the models considered in this work and plausible causes for it are discussed in Section IV. The central observation of this analysis is that there is a band of α in between approximately 0.2 and 0.4, where the QA hardware's output distribution deviates from the target Gibbs distribution by less than 5% in total variation distance, which we designate as high-quality. Outside of this particular range of α in , the total variation distance increases considerably, showing how the QA hardware's distribution dramatically shifts away from the family of H Ising Gibbs distributions. Interestingly, there does not seem to be any significant dependence of the total variation distance on annealing time (i.e., each of the colored points in Figure 1 follow similar trends). These results suggest that this middle regime of α in is favorable for using QA hardware as a Gibbs sampler at each of the annealing times considered. Similar results on all of the GSD instances are provided in Appendix C.
The results from Figure 1 suggest there are operating regimes where the QA hardware's is an accurate Gibbs sampler; however, the effective temperature (i.e. α out ) of the that distribution is not presented. Figure 2 explores this point by presenting the α out values that were found for each of the high-quality input configurations considered (i.e., 0.2 ≤ α in ≤ 0.4). Given the prevailing theories of how QA hardware operates [45,46], one expects that increasing the energy scale of the input α in or the annealing time will increase the effective temperature of the output distribution, α out . Indeed, that trend is observed in Figure 2, where the recovered α out values increase monotonically (or nearly so) with both α in and annealing time. This monotonicity is a particularly useful result as it suggest a simple procedure for tuning the effective temperature of the distribution output by the QA hardware platform, which is an essential feature for most practical sampling applications.
Generating Gibbs samples in the α out range of from 1.85 to 5.16 as shown in Figure 2 is an encouraging result as the critical points separating high-and lowtemperature regimes of Ising spin glasses on lattices and random graphs typically occurs for values of α out < 1.0 [22]. Sampling from these low-temperature regimes is known to present significant computational challenges for classical algorithms, giving hope that QA hardware may be able to provide a performance enhancement on these tasks. However, additional study is needed to determine if the results presented here can generalize to larger system sizes with different connectivity layouts and to temperatures below the critical point of natively representable Ising spin glasses [58].   2: The values of α out recovered by operating the QA hardware at different values of α in and annealing times in the high-quality regime identified in Figure 1 for the the GSD-6 Hamiltonian. Encouragingly, α out increases monotonically (or nearly so) with both α in and annealing time, allowing the hardware generate distributions at different effective temperatures, which is an essential feature for practical applications.

Summary of the GSD Models
To test if the observations on the GSD-6 instance generalize to a broader collection of models, we repeated the previous experiment on all 13 of the GSD models proposed in this work, seven GRD and six GSD-F cases. Figure 3 displays the minimum total variation distance with respect to the ideal Gibbs distribution for each scale of α in and for each model. Only the 1 µs data is presented, but results with other annealing times are available in Appendix C. Although one can observe a variety of distinct features across these instances, the encouraging finding is that the range of α in within 0.2 and 0.4 consistently yields statistics with low total variation distance, suggesting the observations on the GSD-6 model do generalize to a broader class of models. In particular, α in ≈ 0.3 achieves a low total variation distance for nearly all of the models considered.
It is important to briefly discuss the GSD-2 and GSD-F-1 models, as these are the only ones that do not follow the up-down-up trajectory shown in Figure 1. In these cases, the total variation distance is high for low α in and then drops below 5% for α in greater than 0.2 as expected. However, the total variation distance remains low even for α in greater than 0.4. This result is because when α in is large, the probability distribution is highly concentrated in the ground states and these instances do not exhibit ground state degeneracy breaking. As there is only one ground state in the GSD-F case and two symmetrical ground states in the GSD case, the sampling task is relatively easy and effectively reduces to a ground state identification task. The QA hardware achieves low TV in these cases by simply outputting the ground states with high probability. Although one can consider these two cases to represent easy sampling tasks, we include them to highlight the increased challenge that instances with more ground state degeneracy pose to using QA for sampling applications.
The results from Figure 3 indicate that the proposed high-quality Gibbs sampling regime generalizes to a vari- FIG. 3: Total variation distance between QA hardware output distribution for various scalings α in and target Gibbs distribution with estimated α out on the all GSD instances (top) and GSD-F instances bottom. A consistent high-quality Gibbs sampling band is observed for 0.2 ≤ α in ≤ 0.4, with α in ≈ 0.3 achieving a low total variation distance for nearly all of the GSD models.  ety of Ising models; however, the stability of the effective temperatures (i.e., α out ) of those distributions is an important question. Table I provides a summary of this information by presenting the minimum and maximum α out values that can are achieved on each model by varying both α in and the annealing time (see Appendix D for more detailed information). Although there is some variability in the largest α out values, the results indicate that the range of effective temperatures output by the hardware remains relatively stable and the hardware is suitable for sampling from all models between 1.85 and 3.97. These results also suggest that the the observations on the GSD-6 model generalizes to a broader class of models.

IV. DISCUSSION
Using quantum annealing for Gibbs sampling has been proposed as early as 2010, [13]. Since then, several studies have indicated that the output distribution of a quantum annealer does not sample from the input Hamiltonian's Gibbs distribution [6,18,31,32]. However, many of these works use an input scaling that is outside of the range identified by this work and are therefore subject to the distortions that we see in Figures 1 and 3. These figures show that if α in is too small (< 0.2) or too large (> 0.4) then QA hardware's output distribution will differ greatly from that of a corresponding Gibbs distribution. Additionally, observing that the QA operating protocol proposed by this work yields high-quality Gibbs samples in a variety of Ising models suggests a systematic phenomenon that produces this behavior.
To investigate what might account for the distortions in these Gibbs distributions and the unique features that occur in the 0.2 ≤ α in ≤ 0.4 range, we conduct a detailed study of the output statistics of a very small Ising Hamiltonian, i.e., a three-spin Ising chain. With such a small system, we are able to create a theoretical model that reproduces the experimental data. This theoretical model suggests local field noise and residual transverse field effects as potential explanations for the observed distortions. The theoretical model found that the threespin system sampled from an Ising Hamiltonian with extraneous couplings when α in was in the low-or highscaling regime. Only when α in remained in the range identified by this work did the system produce the desired Gibbs distribution. This result provides additional evidence that this restricted scaling regime is optimal for conducting Gibbs sampling with QA hardware.

A. Three-spin Ising Chain Statistics
In this experiment, three spins denoted by σ 1 , σ 2 , and σ 3 are linked together in a ferromagnetic chain with σ 1 coupled to σ 2 and σ 2 coupled to σ 3 as shown in Figure 4. Notice that σ 3 is not coupled to σ 1 . Formally, the input Hamiltonian is defined as follows: where J in is the value of the coupling. Note that J in is equivalent to α in from the 16-spin experiments. J in is swept between 0.0 and 1.0 with step-sizes matching the discretization of α in . 2 For each value of J in , 5 × 10 6 samples are collected from the QA hardware and the output statistics are recorded. Once again, in order to mitigate bias a spin-reversal transform is applied after every 100 samples. The output distribution is analyzed by solving the inverse-Ising problem using the Interaction screening estimation algorithm [22,59]. This algorithm takes the empirical distribution produced by the hardware and estimates the Ising Hamiltonian that most likely produced the given output statistics. The estimated values for each of the couplings are denoted as J out and more specifically as J 12 , J 23 , and J 13 to indicate the specific edge in the three-spin chain. These estimated values are compared to the corresponding input values J in in Figure  5a. This procedure provides an understanding of the effective Hamiltonian output by the QA hardware, which may differ from the Ising Hamiltonian that has been programmed. The entire protocol is repeated for annealing times of 1 µs, 5 µs, 25 µs, and 125 µs with similar results.  13 indicates that this coupling is not programmed in the input Hamiltonian even though it is reconstructed from the output statistics and is thus a "spurious coupling." Only the data for a 1 µs annealing time is presented in Figure 5a, see Appendix E for results for the other annealing times.
Because σ 1 and σ 3 are not coupled together in the input Hamiltonian, one expects J 13 from the reconstruction to be close to zero. However, as shown in Figure 5a, J 13 appears to be negative and, hence, anti-ferromagnetic below J in = 0.275 and positive above that point. In [17], this effect is referred to as a "spurious coupling" because it appears in the output distribution despite not being programmed in the input Hamiltonian. It is striking that this spurious coupling exactly cancels out when J in = 0.275, which coincides with the optimal input scaling for α in for which almost all GSD models in Figure 3 are shown to have low total variation distance. Furthermore, the regime where the spurious coupling strength is low relative to the intended coupling strengths can be expanded to approximately include J in between 0.2 to 0.35, which again closely matches the optimal range for α in that was found for the 16 spin experiments. This provides additional support to the observation that the QA hardware considered here is an effective Gibbs sampler in this specific scaling range.

B. A Three-Qubit Ising Chain Effective Model
To further explore a connection between these experimental observations and the spurious couplings observed in the hardware's output (i.e., Figure 5a), we propose an extension of the effective single-qubit quantum model developed in [17] to the three-qubit context. The singlequbit model considered in [17,45] includes a transverse field with an intensity proportional to the input local field parameter, h. This transverse field component was able to reproduce an observed saturation of the output field for large input values. Another important feature proposed in [17] was qubit noise perturbing the input parameters of the model, which explained the spurious couplings in the regime of low input parameters. Building on these characteristics, we consider the following toy model on three qubits controlled by a single parameter FIG. 5: Notice how J 13 is reconstructed as negative for low-input couplings and then transitions to positive as input coupling increases beyond 0.275. As this coupling is not programmed in the input Hamiltonian, this non-zero reconstruction marks a disruption to the desired distribution. The negative and positive regions indicate the scaling regimes where noise and quantum effects distort the output distribution, respectively. We observe that the optimal scaling regime to use for Gibbs sampling is when J 13 is small relative to J 12 and J 23 .
J in in the absence of local fields. We assume the output distribution is a noise-averaged thermal distribution: where H is a three-qubit Hamiltonian with independent binary qubit noise realized through binary random variables s i : Equation (8) describes a three-qubit Ising chain, J in is a single input parameter controlling the strength of interactions, while γ i and η i are constants re-scaling the strength of the transverse field and qubit noise. Leveraging the parameters estimated in [17], we select β = 11, γ i = 0.013 and η i = 0.04 as typical values for these parameters. We would like to highlight that the model from Equation (8) is very different quantitatively and qualitatively from what is known as the Background Suceptibility model as discussed in Appendix F.
It was shown in [17] that for small values of the input parameter J in , the negative spurious coupling can be explained by field noise on the involved qubits. In fact, it is information-theoretically impossible to distinguish between a model with field noise and a model with the corresponding anti-ferromagnetic coupling. As seen in Figure 5b, the toy model in Eq. (7) indeed predicts that when J in is small and thus field noise is significant relative to J in , a negative spurious coupling is preset. We therefore designate this low-scaling regime where J in < 0.2 as the noisy regime.
When J in increases, the field noise becomes less pronounced relative to J in . In this regime of higher input values, the model (7) predicts an emergence of a positive spurious coupling -an effect that is also observed in the experimental data. This suggests that a J-dependent residual transverse field on each of the qubits can account for the positive response in spurious link and the saturation of that link for large inputs. Similarly to how we designated the low-scaling regime as noise-dominated, we can designate the high-scaling regime as dominated by the effect of residual transverse fields.
An interesting intermediate regime predicted by the toy model as shown in Figure 5 happens around J in = 0.275, where noise and transverse field influences cancel each other, leading to an effective absence of spurious couplings, which also occurs experimentally. This observation suggests that the optimal sampling parameter range for α in , referred to as the sweet-spot previously in this work, may be explained by the interacting effects of noise and residual fields in the system.

V. CONCLUSION
Inspired by recent work indicating that quantum annealing hardware behaves as a noisy Gibbs sampler in very low energy scales [17], this work identified an approach for mitigating the impacts of noise and conducting high-quality Gibbs sampling in a range of effective temperatures for a class of small hardware-native Ising models. This approach to using quantum annealing hardware for Gibbs sampling opens new opportunities for applications in machine learning and exploring the physics of condensed matter systems.
More broadly, the computational task of Gibbs sampling and the protocol developed in this work could provide an avenue for exploring the potential for quantum advantage in quantum annealing hardware. To that end, two follow-on works would be required. First, a class of Ising models needs to be identified that are challenging to sample from with classical algorithms (e.g., spinglasses) that also adhere to the criteria proposed in this work, i.e., naively representable on the hardware with J, h ∈ {−1, 0, 1}. Such examples seem unlikely in the Chimera hardware architecture [58,60]; however, the new Pegasus hardware architecture [61] will likely provide new opportunities for identifying such models. Second, significant additional research is required to verify that the protocol proposed by this work will scale to larger systems, ideally with 100's to 1000's of qubits, so that more of the quantum annealing hardware can be used. This type of verification requires an amazing amount of computation and tuning of sophisticated Monte Carlo methods as there are no known efficient algorithms for generating Gibbs samples of the target distributions that would be required as a baseline of comparison. Another practical challenge is that scaling to larger systems may also yield unexpected side effects in practice, such as amplifying the role of hardware programming errors [62,63], putting an implicit limit on sampling accuracy at larger scales. Although significant follow-on investigation is required to more deeply understand the potential for performing thermal Gibbs sampling with quantum annealing hardware, this work provides a foundation for maximizing the performance available hardware platforms when conducting these computational tasks. 89233218CNA000001. This material is also based upon work supported by the National Science Foundation the Given two distributions µ and ν over a binary string of size N , i.e. σ ∈ {−1, 1} N , the total variation distance between them is defined as,

VI. ACKNOWLEDGEMENTS
that is, the absolute difference between the probability of each state is added together and divided by 2, to normalize the metric to the range of 0 to 1. Given that the metric is in the range of 0 to 1, it is often presented as a percentage between 0 and 100. The TV distance can be interpreted as the maximum discrepancy between the probability of an event computed with the distribution µ instead of ν, that is, where A ∈ P {−1, 1} N is an element of the power set of {−1, 1} N . Therefore, it comes as the distance of choice to measure the error that can be made by estimating probabilities using a surrogate distribution ν of a target distribution µ. However, note that this sets a very strong quality criteria for estimating the reliability of a surrogate distribution as the TV is equal to the worse-case estimation error. It is important to note that if µ or ν are only accessible through a set of finite samples, there is an unavoidable error caused by sampling that will be captured by the TV distance. To avoid finite sampling error in the TV estimate, one needs to use a number of samples that is on the order of the typical support of µ and ν. This is required to estimate the absolute difference in Eq. (A1) accurately for each state σ. Typically, the number of samples required grows exponentially in N and is maximal for distribution µ and ν that are close to a uniform distribution where it is in the order of 2 N . It is fairly common to consider KL divergence as an alternative to the TV distance, especially when the nonconvexity of the latter becomes a computational bottleneck, taking advantage of Pinsker's inequality, i.e., TV(µ, ν) ≤ 1/2 KL(µ||ν). However, the KL divergence does not share the same meaning as the TV distance: the KL divergence estimates the inefficiency in compressing a source µ using a code optimized for ν. In general, it is not suitable for measuring the quality of probability estimates between a target distribution and its surrogate. A simple situation that illustrates this last point is when the distribution ν is almost identical to µ except that it has an infinitesimally small probability mass that lies outside of the support of µ. The TV distance between these distribution is infinitesimally small, while the KL divergence becomes infinite. The Ising model instances used in this work were generated by the following procedure. First, 16 spins (2 unit cells) on the D-Wave 2000Q Quantum Annealer DW 2000Q LANL hardware graph were selected with qubits 296 to 303 making up one unit cell and qubits 304 to 311 making up an adjacent unit cell in the Chimera graph architecture. Then the values for J ij incident to these qubits were assigned randomly from the set {−1, 1} for each coupler in the hardware graph. Second, a bruteforce calculation was used to compute the ground-state degeneracy for a given instance. 3 This procedure was repeated many times and a subset of cases were selected to highlight a smooth range of ground-state degeneracies. This yielded a total of seven GSD (ground-state degeneracy) cases with degeneracy values of 2, 4, 6, 8, 10, 24, and 38, which are presented in Table II. The same procedure was repeated while also assigning random values to the local fields h i to {−1, 1}, yielding six additional independently generated GSD-F models with 1, 2, 3, 4, 5, and 6 ground state degeneracies, which are presented in Table III. The specific instances are referred to as GSD-# and GSD-F-#, where the # indicates the amount of ground state degeneracy. Note that the GSD-F models tend to have lower degeneracy because the inclusion of the fields breaks symmetries within energy levels.
It is worth noting that these models are reminiscent of the RAN and RANF models that have been used to benchmark earlier generations of QA processors [64,65]. The RAN model class on Chimera graphs has been shown to exhibit a zero temperature phase transition [58,60], indicating that sampling from this model class at finite temperature is not expected to be computationally hard. Here, however, instances are hand-picked from this class to exhibit varying degrees of ground state degeneracy. In this sense, we do not expect that the instances of this work to be representative samples from the RAN class, and instead expect the high degeneracy instances to be challenging for a quantum annealer to sample from due to the degeneracy breaking effect of the transverse field.

Appendix C: Total Variation Distance at Different Scales and Annealing Times
In this section, we present supplementary figures for each of the experiments in the main text. Figures 7 and 8 display the total variation distance results for every GSD model as a function of α in . As in Figure 1 where only GSD-6 is shown, the results for each of the annealing times are overlaid with different colors. Note how GSD-2 and GSD-F-1 are the only instances where total variation distance decreases as α in increases. This is because models with low degeneracy are easier to sample from in the high scale regime, which was previously explained in Section III B 2. For the rest of the instances, the consistent U-shaped trend supports our claim that the most effective α in for achieving high accuracy is in the medium scale regime. This optimal band between α in of 0.2 and 0.4 remains consistent across all GSD models. In addition, these figures show that varying annealing time does not shift the optimal region by a significant amount, with the exception of GSD-10. However, there is a slight yet noticeable shift to the left in almost all instances as anneal times increase. Figure 9 communicates the same information as Figures 7 and 8, but more directly compares various GSD models by stacking their total variation distance values in a heatmap. Only the 1µs plot was presented in Figure  3 in the main text, but here the results for all anneal times are shown. Once again, it is evident that the optimal region for α in remains the same for various anneal times.

Appendix D: Effective Temperatures at Different
Scales and Annealing Times Figures 9 and 10 present the α out values that correspond to the optimal α in range between 0.2 and 0.4. The minimum and maximum value in each heatmap was used to construct Table I. These figures show how α in and anneal time can be used to smoothly tune α out and that the effect is consistent across all GSD models. Although the exact α out values vary slightly from model to model, the trend is predictable and so these figures suggest a straitforward approach for producing Gibbs samples at at desired α out value.

Appendix E: Three-Spin Ising Chain Annealing Times
Finally, Figure 13 shows results for the three-spin Ising chain experiment for various anneal times. The most important feature of these plots is the point at which the spurious coupling reduces to zero, as we hypothesize that this is the optimal input scale for sampling. From the plots, this optimal J in value slightly decreases from 0.275 to a low of 0.2 as anneal time increases. However, this value still remains within the optimal region seen in the 16-spin scaling experiments. As anneal time increases, both programmed and spurious couplings appear to increase more rapidly with the increase of J in . This increase results in the spurious coupling almost saturating by the time J in reaches 0.4 for anneal times of 25 µs and 125 µs. However, the majority of our claimed optimal region between 0.2 and 0.4 remains in the regime where the spurious coupling is small compared to the programmed couplings. It is believed that physically neighboring qubits are not perfectly isolated from each other and give rise to uncontrolled interactions. These spurious couplings are described by the Background Susceptibility model [44] and take the form of an interaction that is proportional to the coupling intensities emanating from neighboring qubits. For the three-spin Ising chain discussed in Section IV, it results in the following Hamiltonian with background susceptibility, where χ > 0 is the background susceptibility constant between qubits one and three (Note in [44], χ is a negative quantity following the D-Wave programming convention that ferromagnetic couplings are negative.). Notice that the intensity of the spurious link between σ 1 and σ 3 is quadratic in the input coupling intensity J in and is of ferromagnetic nature when J in is positive. Because the Hamiltonian in Eq. (F1) is diagonal, we immediately see that this model behaves very differently than the effective Hamiltonian found experimentally (see Figure 5). The background susceptibility model predicts no saturation in the input coupling or spurious coupling intensity, as the earlier grow linearly and the later grows quadratically. Moreover, the spurious coupling of the background susceptibility model remains ferromagnetic unlike what is measured experimentally for small values of J in . This behavior is depicted in Figure 6 for the typically encountered values of β = 11 and χ = 0.05.