Programmable Quantum Annealers as Noisy Gibbs Samplers

Drawing independent samples from high-dimensional probability distributions represents the major computational bottleneck for modern algorithms, including powerful machine learning frameworks such as deep learning. The quest for discovering larger families of distributions for which sampling can be efficiently realized has inspired an exploration beyond established computing methods and turning to novel physical devices that leverage the principles of quantum computation. Quantum annealing embodies a promising computational paradigm that is intimately related to the complexity of energy landscapes in Gibbs distributions, which relate the probabilities of system states to the energies of these states. Here, we study the sampling properties of physical realizations of quantum annealers which are implemented through programmable lattices of superconducting flux qubits. Comprehensive statistical analysis of the data produced by these quantum machines shows that quantum annealers behave as samplers that generate independent configurations from low-temperature noisy Gibbs distributions. We show that the structure of the output distribution probes the intrinsic physical properties of the quantum device such as effective temperature of individual qubits and magnitude of local qubit noise, which result in a non-linear response function and spurious interactions that are absent in the hardware implementation. We anticipate that our methodology will find widespread use in characterization of future generations of quantum annealers and other emerging analog computing devices.

Drawing independent samples from highdimensional probability distributions represents the major computational bottleneck for modern algorithms, including powerful machine learning frameworks such as deep learning [1]. The quest for discovering larger families of distributions for which sampling can be efficiently realized has inspired an exploration beyond established computing methods [2,3] and turning to novel physical devices that leverage the principles of quantum computation [4]. Quantum annealing [5] embodies a promising computational paradigm that is intimately related to the complexity of energy landscapes in Gibbs distributions, which relate the probabilities of system states to the energies of these states. Here, we study the sampling properties of physical realizations of quantum annealers which are implemented through programmable lattices of superconducting flux qubits [6]. Comprehensive statistical analysis of the data produced by these quantum machines shows that quantum annealers behave as samplers that generate independent configurations from low-temperature noisy Gibbs distributions. We show that the structure of the output distribution probes the intrinsic physical properties of the quantum device such as effective temperature of individual qubits and magnitude of local qubit noise, which result in a non-linear response function and spurious interactions that are absent in the hardware implementation. We anticipate that our methodology will find widespread use in characterization of future generations of quantum annealers and other emerging analog computing devices.
Sampling -the task of producing independent configurations of random variables from a given distribution -is believed to be among the most challenging computational problems. In particular, many sampling tasks cannot be performed in polynomial time, unless strong and widely accepted conjectures in approximation theory are refuted [7][8][9]. The potential value of quickly generating highquality samples is exemplified by the recent application of emerging analog computing devices, including those based on optical [10] and quantum gate [11] technologies, to sampling tasks. However, analog computers are inevitably impacted by hardware imperfections and en-vironmental noise, which distort the computations that they are designed to perform. Assessment of the quality of samples produced by such devices provides a key diagnostic to understand the nature of interactions, biases, and noise inside analog computational machines. In this Report, we leverage state-of-the-art statistical learning methods to conduct a precise fidelity assessment of the sampling properties of analog quantum annealing devices, providing a key foundation for using these devices in high-value sampling tasks.
Adiabatic quantum computing [12] exemplifies a promising physical principle that may lead to an enhanced exploration of the potentially rough energy landscape due to quantum tunneling [5]. State-of-the-art quantum annealing processors [6] were recently used to push the frontiers of quantum simulations [13,14], optimization [15], and machine learning [16]. Comparably, the use of quantum annealers for sampling [17][18][19][20] is not as well understood. This is partially due to the lack of methods for a rigorous characterization of empirical distributions produced by independent runs of quantum annealers. Additionally, as is the case with any sophisticated analog device, quantum annealing processors are inevitably affected by noise and biases of diverse nature that are difficult to characterize [21,22] and complicate the use of these devices as samplers.
In this study, we focus on the family of D-Wave quantum processing units (QPUs) [6]. An elementary unit of the D-Wave quantum annealer is a superconducting quantum qubit i whose final state is specified by a binary spin variable σ z i that takes value +1 or −1 during a read-out process in the computational basis denoted by z. Depending on the particular device, the total number of qubits can vary between 1152 for D-Wave 2X to 2048 for the D-Wave 2000Q machine. The qubits are interconnected through superconducting couplers that form the so-called chimera graph G = (V, E), where V denotes the ensemble of qubits, and E is the set of couplers defined by the connectivity of the chip, see Fig. 1a. The magnitude of currents circulating in the superconducting couplers define the strength of pairwise interactions between individual qubits that can also be biased towards a particular state through a local field.
A D-Wave QPU implements [23] the following interpolating Hamiltonian, also known as energy function, for arXiv:2012.08827v1 [quant-ph] 16 Dec 2020 s ∈ [0, 1]: where H Ising = ij∈E J in ij σ z i σ z j + i∈V h in i σ z i is the target Hamiltonian of the Ising type, i.e. containing only pairwise interactions and local terms. This Ising energy function is specified through the user-defined input parameters: Pairwise couplings J in ≡ {J in ij } (ij)∈E and local fields h in ≡ {h in i } i∈V , where each J in ij can be set in the range [−1, 1] and each h in i in the range [−2, 2]. The annealing schedule is controlled by two monotonic functions A(s) and B(s) satisfying A(0) B(0) and B(1) A(1). The value σ z i (that we will refer to as "spin") for each qubit i is read out in the end of the annealing procedure; In what follows, we will drop index z when discussing these classical measurements of the qubit state. While QPU takes J in and h in as input values, the real values of couplings and magnetic fields implemented on the chip can significantly differ due to a combination of several effects, including programming errors, multiplicative corrections related to effective temperature of the chip, and additive factors such as flux noise and biases, among others.
The premise of adiabatic quantum computation [24] in an isolated setting consists in a sufficiently slow interpolation of the system Hamitonian towards the target one, H Ising . The adiabatic theorem prescribes that when initially prepared in the lowest-energy configuration (ground state) of the starting Hamiltonian, the system is always found in the ground state of the interpolation Hamiltonian; this principle allows one to retrieve ground states of a non-trivial target Ising model, which is useful for optimization applications. However, due to finite temperature [25], decoherence [26] and other sources of infidelity such as flux qubit noise [23], available quantum annealers, such as those produced by D-Wave Systems, do not consistently find ground states of the target models, but instead behave as non-isolated quantum systems, ending up in excited states. In other words, these quantum devices act as samplers from an unknown distribution, which is commonly expressed as a Gibbs distribution: A probability measure that expresses the probability of measuring a certain state σ ≡ {σ i } i∈V as a function of that state's energy, µ(σ) ∝ exp(H(σ)), where H(σ) is some energy function evaluated at σ. This handicap for optimization applications can be turned into an advantage when the annealer is viewed as a sampling device, provided that it is possible to predict the distribution of configurations output by the quantum annealer based on the specified input model. This prediction constitutes the primary objective of this work.
Our investigation of the form and the nature of the distributions produced by quantum annealers begins with a study of the statistics of a single qubit. Consider an experiment that estimates the parameters of the out-put distribution of a single qubit i represented by a binary variable σ i in the form of a Gibbs distribution, µ effective (σ i ) ∝ exp(σ i h out i ). One can estimate the effective local field h out i for different values of input fields h in i by looking at the empirical count of positive observations σ i = +1. The resulting dependence of h out i as a function of h in i is depicted in Fig. 1b. We use the maximum likelihood approach (see Supplementary Information) to infer parameters that best describe the output statistic in terms of classical and quantum Gibbs distributions. Our results show that while the output distribution of the quantum machine is well described by an effective classical Gibbs distribution for the range of parameters |h in i | ≤ 0.35, quantum Gibbs statistics is required to adequately describe the statistics of the effective local field h out i for larger input parameters |h in i | > 0.35, see Fig. 1b. In both regimes, the effective inverse-temperature β of the output distribution is very high, in the range β ∈ [12.4, 13.3] in the programming units for the considered qubits. This corresponds to a low-temperature regime for most classical distributions, which are notoriously challenging to sample from efficiently. For example, the low-temperature phase of Ising spin glasses occurs at (βJ) c = 0.44 [27], which translates to values of field and coupling magnitudes around 0.035 in the quantum annealing programming units. Finally, and most importantly, a detailed study of the resulting distributions in Fig. 1b provides strong evidence that rapidly fluctuating noise in the residual local fields plays an essential role in describing the observed statistics of h out i . These observations at the single-qubit level naturally lead to the central proposal of this study: Within a suitable input parameter range, D-Wave's quantum annealers act as low-temperature noisy Gibbs samplers. In other words, the annealers sample from a Gibbs distribution with energy function parameters that fluctuate due to noise. This proposal is a notable departure from an ideal non-zero temperature quantum annealer, which is expected to sample from the Gibbs distribution of the input Hamiltonian at the annealer's temperature [25].
The single-qubit experiment provides strong evidence of the noisy Gibbs sampler hypothesis for individual qubits, however it is not clear if similar properties will generalize to larger multi-qubit systems. To further investigate this hypothesis in the context of multi-qubit systems, we conduct a comprehensive characterization of the output distribution on eight qubits that form a single cell of the quantum annealing chip, described in Fig. 1a. In order to remain in a regime described by classical Gibbs distributions, we chose for input parameter values J in ij = 0.025 for all edges (i, j) inside the cell and h in i = 0 for all qubits. We note that all discrete distributions on 8 spins with interaction orders up to eight can be fully specified by an exponential family distribution with 255 parameters. We reconstruct these parameters by generalizing our Interaction Screening estimator for learning The QPU takes input parameters J in and h in that are specified on the chip with the chimera topology, depicted here for the 2000Q machine at Los Alamos National Laboratory. The magnifier shows details of inter-and intra-connections between qubits in four cells composed of eight qubits each. In the end of the anneal, a classical binary projection σ z is read out for each qubit. In blue, we highlight the qubits that were used for the reported single-cell experiments. (b) Dependence of the field h out describing the output statistics of σ z in a single-qubit experiment is plotted here as a function of the positive input parameter h in . Error bars represent statistical fluctuations up to 3 standard deviations. The statistics observed for h in i > 0.35 significantly deviates from the linear behavior expected for a classical Gibbs distribution; In the Supplementary Information, we show that this behavior can be described by a quantum model with a residual transverse field. In the regime of input parameters h in i ≤ 0.35, the output distribution can be described by a classical distribution with a linear dependence h out ∝ h in i . We show that the change of the slope around h in i ≈ 0.15 can be explained by fast-fluctuating noise on the local field: The region of h in i ≤ 0.15 is dominated by noise which creates a reduced effective response β eff , while the noise plays a less pronounced role in the intermediate region where the response coefficient β emerges as the inverse-temperature of the model (see Supplementary Information for details). Notably, this single-qubit experiment provides a reliable estimation of fast field fluctuations that are too rapid to be measured directly.
of Ising models [27, 28] to the case of models with multibody interactions (see Supplementary Information). We would like to stress that the resulting estimator is exact up to statistical fluctuations.
The statistical significance of the reconstructed parameters is empirically estimated by conducting 50 independent replicates of the algorithm and measuring the variance in the solutions, as described in the Supplementary Information. Our results show that the output statistics of binary configurations is well described by a classical Gibbs distribution with an energy function structure presented in Fig. 2. In particular, the results indicate that multi-body terms beyond pairwise interactions are statistically insignificant, and hence the distribution output by the quantum annealer is well described by a Gibbs distribution on a model of the Ising type. Surprisingly, the resulting model reveals the presence of additional spurious couplings that do not appear in the input model, and correspond to couplings that are not present in the hardware's implementation, see Fig. 2. We will later see that these spurious links are an unexpected consequence of local field noise, similar to the of change in effective inverse-temperature that noise causes in Fig. 1b.
To better characterize the nature of these spurious links, we employ a data-driven approach to learn how these effects depend on the input parameters. Specifically, we learn a response function that links the input parameters to the effective output parameters that describe the distribution, {J out , h out } = f (J in , h in ), by regressing on 250 pairs of input-output models. Input models have been independently sampled for parameters in the range J in ij ∈ [−0.05, 0.05] and h in i ∈ [−0.05, 0.05]. As a result, we show that the input-output function is well described by a general quadratic response. This inputoutput function reveals that a linear scaling of the input model is the primary driver of the output distribution,

FIG. 2.
Characterization of the output distribution on eight qubits. For a given input model on eight qubits forming a single cell of the chip and characterized by parameters J in ij = 0.025, h in i = 0, we reconstruct the most general Gibbs distribution on eight binary spins, with an energy function containing all interactions up to the maximum possible order, eight. We repeat the experiment for 50 sets of independent samples to quantify the statistical significance of the reconstructed values leading to the shown 3 standard deviations (s.d.) tolerance. Our results indicate that the second order Ising model provides an adequate description of the emerging output distribution, while higher-order couplings are not statistically significant and can be explained by statistical fluctuations. We find that among statistically significant interactions in the Ising model that best describes the output distribution, the strongest ones (purple) are in correspondence with the input couplings, while the weaker ones (blue) are the spurious couplings that are are absent in the chip topology, as well as the spurious fields that are also not present in the input problem. . Noise-aware model of spurious couplings and fields. Inspection of the general input-output quadratic response function indicates that spurious couplings and spurious fields emerge as a quadratic response to the adjacent input parameters. Importantly, the purely quadratic part of the response is described by negative-sign susceptibilities; For instance, interactions favoring same-spin alignment in the input model would lead to spurious interactions that favor opposite-spin alignment in the output model. On the pictured minimal Ising models, we observe that instantaneous qubit noise, introduced here as a zero-mean random variable η with a variance denoted var(η), provides a mechanism by which effective spurious couplings and local fields can appear in the output distribution. This shows that the observed output distribution with spurious effects is indistinguishable from a mixture of distributions that differ by a realization of the random noise in local fields. We derived analytical expressions for these models that show that the resulting spurious couplings and spurious fields respond quadratically to the input parameters of the model, with a negative-sign susceptibility, and with the same geometrical pattern as observed in the experimental response function (see Supplementary Information for a detailed derivation). The spurious link effect is notably sensitive to the noise level as its strength increases with the square of the field noise variance, var(η). but this linear model is distorted by spurious links and fields that depend quadratically on specific input parameters (see Fig. 3). This non-linearity in the response function may explain why previous studies based on a linear response assumption [18, 20, 29-32] found that effective temperatures inferred under this hypothesis were instance-dependent, while the response function we construct is universal across all input models.
The response function analysis provides a valuable insight that functions of specific combinations of input parameters, i.e. negative feedback from specific edge-edge and field-edge pairs, are the drivers of the spurious effects in the output distribution. Inspired by the observation that local-field noise has a significant role in the singlequbit model, we further investigate how noise may impact the output distributions of multi-qubit models. Our first observation is that due to near-instantaneous qubit noise each sample produced by the quantum annealer represents a unique realization of the input model. Consequently, the output distribution represents a mixture of models rather than independent runs from a consistent model. Focusing on the edge-edge and field-edge pairs highlighted by the response function analysis, our second observation (presented in Fig. 3) is that these spurious effects can arise from reconstructing a single effective model from samples of a mixture of models with random local field values. Finally, we observe that spin-reversal transforms, a common persistent bias mitigation technique, cannot eliminate the emergence of spurious effects due to instantaneous noise (see Supplementary Information). Altogether, these observations provide a qualitative evidence that instantaneous noise on local fields represents the underlying feature yielding the spurious effects observed in D-Wave's output distribution.
To further validate our noisy Gibbs distribution hypothesis, we conduct a comprehensive comparison to the statistics of a simulated noisy Gibbs sampler and the output distribution of D-Wave's quantum annealer. Specifically, we replicate the quadratic response function analysis with a simulator that generates samples from a mixture of noisy Ising models calibrated with the noise and scaling parameters extracted from the single-qubit analysis. As shown in Fig. 4, the input-output response function from the simulated distribution provides a strong agreement with the measured susceptibilities, providing compelling evidence that output statistics of D-Wave quantum annealer can be modeled as a noisy Gibbs distribution. Our noisy Gibbs sampler model has been validated across three generations of quantum annealers, and is also indirectly confirmed through replicating this response function analysis on the lower-noise version of 2000Q machine [33], where we observe that the intensity of spurious effects seen in the output distribution are significantly reduced (see Supplementary Information).
We anticipate that methods presented in this study will be broadly used in characterization of analog devices that produce binary samples. Our work opens many avenues for future research, such as studying critical and low-temperature behavior of Ising spin glasses, especially with anticipated increased connectivity [34] or extension to non-stoquastic Hamitonians [35] in future realizations of quantum annealers that would enable the study of a richer class of problems. The learned response function that maps effective output parameters to the input pa-rameters can be used for calibration of analog machines, which would be useful for practical sampling applications. The concept of noisy Gibbs sampler is also promising in its own right for accelerated solution of robust optimization and sampling problems within hardware-inthe-loop approaches.  Fig. 2. We present a comparison of the response measured in the experiment to the simulated response based on the mechanism explained in Fig. 3 using the noise values extracted from the single-qubit experiment presented in Fig. 1. The response for the native chimera coupling J out 304,305 is driven by a linear self-response term or effective temperature. For the spurious coupling J out 308,309 , a linear response is nonexistent but a quadratic response comes from terms involving adjacent couplings forming a triangle with the spurious coupling, in agreement with observations in Fig. 2 and Fig. 3. The response for field h out 309 primarily consists of a linear part driven by the effective temperature but also has a quadratic response involving a neighboring coupling and the adjacent connected field. The complete quadratic response with all terms can be found in the Fig. S13 of the Supplementary Information. The spurious coupling response deviates sensibly more from simulated predictions than the spurious field response. Among the main causes for this discrepancy, we find that measurements of spurious couplings are more prone to statistical fluctuations being significantly weaker than spurious fields, and the sensitivity to noise variation is higher according to the dependence presented in Fig. 3. Remarkably, the simulated quadratic response shows a strong agreement with the measured susceptibilities for spurious fields.

Primary Hardware Platform and Experimental Settings
The primary hardware platform on which the majority of experiments have been conducted is a D-Wave 2000Q quantum annealer at Los Alamos National Laboratory, referred to as DW 2000Q LANL. The DW 2000Q LANL QPU chip has a so-called chimera graph structure with C 16 topology, i.e. it is composed of two dimensional lattice of 16-by-16 unit cells. Each unit cell is composed of 8 qubits connected through a complete bipartite graph structure. A very small number of faulty qubits are disabled and not available for programming. In total, this QPU has 2032 operational qubits and 5924 operational couplers. A complete topology of the DW 2000Q LANL hardware graph is depicted in Fig. 1a of the Main Text.
Most of experiments in this paper deal with a single unit cell with 8 qubits and 16 couplers; the specific identifiers of this cell is given in Table S1. We denote the set of qubits as V with |V | = N , and the set of couplers as E. This specific set of qubits was selected as it is characteristic of a typical complete unit-cell in a hardware chip.
Unless specified otherwise, in this work we set the following additional solver parameters when submitting jobs to the D-Wave hardware: auto scale = False, which ensures that the input parameters are not automatically rescaled to utilize the maximal operating range (a feature sometimes used for optimization applications); flux drift compensation = False, which prevents automatic corrections to input fields based on calibration procedure that is run a few times each hour; annealing time = 5, which corresponds to a single-run annealing time of 5µs; and num reads = 10000, which specifies the number of samples collected for a single programming of the chip. The impact of the specific choice of the annealing time in the regime of parameters considered in this work is negligible, as discussed in Section . The motivations for disabling the flux drift compensation and the impact of spin reversal transformation are thoroughly discussed in Sections and .

Single-qubit Experiments
As demonstrated by the results in this paper, in the regime of couplings that are interesting for sampling applications, the output statistics are perfectly described by a certain classical Boltzmann distribution. Here, we investigate at which coupling strength this classical description breaks down, and one needs to introduce a different statistics, such as quantum Boltzmann distribution, for an adequate description of the output data. We show that this regime corresponds to the couplings strengths that are order of magnitude above the intensities that we consider throughout our study. The experiments consist in looking, for isolated spins, at the relationship between the input magnetic field h in and the outcome statistic described by the effective field h out . The outcome statistic of a single spin being always fully expressible by a probability distribution taking the following form, Our process to estimate h out for a given value of the input magnetic field consists in the following steps. We start by collecting M samples from the D-wave annealer embodied as a list of single spin realizations σ (k) ∈ {−1, 1} for k = 1, . . . , M . The statistic S that we extract from these samples is the count of positive spin realization S = M k=1 δ σ (k) ,1 , where δ is the Kronecker delta. Assuming that each sample is effectively independent and identically distributed from Eq. (S02), we observe that the statistic S is a Bernoulli process with M trials and probability of success p = 1+tanh(h out ) 2 . We estimate the probability of success p using the standard unbiased estimator p = S/M for Bernoulli processes. We compute confidence intervals I α = p, p with confidence level α around our estimator p using the exact method of Crow [36] leading to minimal length intervals. Finally, we invert the relationship between p and h out to find an estimate of the output effective field. Confidence intervals on h out are found using the same relation since it is a monotonic mapping. In the experiments, we have collected M = 5 × 10 6 samples using the D-wave spin reversal transform for values of h in between −1 and 1. We have chosen the confidence to be α = 0.997 corresponding to a "3σ" confidence level. The classical Boltzmann distribution for a single spin at thermal equilibrium and exposed to a magnetic field h in is given by the celebrated formula, where h res is a residual magnetic field independent of h in and β is the inverse temperature times the Boltzmann constant β = 1/k B T . By comparing Eq. (S02) with Eq. (S03), we find that the relationship between h in and h out predicted by a classical Boltzmann distribution is linear The classical parameters β and h res are estimated using the maximum log-likelihood approach for which the loglikelihood function takes the following form, where h out measured are measurements of h out for different values of h in and h out model describes the modeled relationship between the input and output fields, e.g. h out classical from Eq. (S04). Measurements of h out for different values of h in using the aforementioned procedure on spin #309 and the classical relationship between h in and h out are depicted in Fig. S1. We clearly see that if the classical relationship holds for small values of h in it fails to explain the flat tails of the measurements for which |h in | 0.5. In fact, we starts to already see a deviation from the linear curve for magnitudes of the input field between 0.5 |h in | 0.2. In the forthcoming analysis, we will show that the first behavior can be explained by quantum effects, and the latter by classical noise in the input field.
A quantum statistical description of a one spin system is realized through the so-called density matrix formalism. We consider a quantum spin exposed on the z-axis to a magnetic field h in and residual field h res and exposed on the x-axis to a transverse field h trans . This is a natural assumption as the Hamiltonian realized during the considered quantum annealing process is composed of these two terms [6]. The density matrix describing this one spin system at thermal equilibrium is the following object, where σ z and σ x are the usual Pauli matrices for the z and x axis respectively. The transverse field h trans that appears in Eq. (S06) is in general a function of h in . Due to an observation that the experimental points seem to flatten out for large values of h in in Fig. S1, we choose to parametrize the transverse fields as linear transformations of the input field, namely h trans = ξh in , a dependence which is consistent with a saturation in the h out response. The mean value of observing the quantum spin along the z-axis is given by Tr ρ σ z . From this relationship we deduce that the probability of observing the system taking the value σ ∈ {−1, 1} is given by the following probability distribution, With these assumptions we find that that the relationship between h out and h in in the quantum case can be described as follows, Notice that in the limit where ξ = 0, the quantum predictions from Eq. (S08) converges to its classical counterpart from Eq. (S04). As we will see later in Section , the variability in the residual field h res is non-negligible and in fact plays a key role in the explanation of the spurious link behaviors, see Section for a detailed explanation. Assuming that h res is a random variable with probability density function f (h res ), we can obtain a noisy quantum description of the measurement outcomes by applying Bayes's rule to Eq. S08, We will see later that in the regime of noise relevant to our experiments, the precise knowledge about the shape of the distribution f (h res ) plays little role.
Note that for h res 0 = 0, the first order expansion of Eq. (S010) for small values of h in yields the simple relationship h out qnoise ≈ βh in cosh(βh res sd ) −2 . This shows that one effect of noise on the system consists in reducing the effective inverse temperature that one could infer from a classical (i.e. linear) regression for small values of h in . The effective inverse temperature for the noisy quantum model is higher than the effective inverse temperature found with the noiseless quantum model. The noiseless quantum model and the classical models find similar temperatures.
We infer the parameters of the proposed quantum models, i.e. h res 0 , ξ and h res sd , with the maximum log-likelihood approach described by Eq. (S05), replacing the quantity h out model by either the functional relationship h out quantum from Eq. (S08) or h out qnoise from Eq. (S010). In Fig. S2, we show the measurements acquired for spin #309 along with the curves predicted by the noiseless and noisy quantum models. The flattening of the output field response for |h in | 0.5 is correctly accounted for by the addition of the quantum transverse field. The initial deviation from the linear response for values of the input field between 0.5 |h in | 0.2 are predicted accurately only by the noisy quantum model. An important feature highlighted by this study is the effect of the noise on the effective inverse temperature (or linear response) that one may infer from the data. We see from our models that the noise in the residual field effectively lowers the inverse temperature for small values of the input field of order |h in | 0.2. Indeed, the effects due to noise will be more prominent when the input field is small. In this particular illustration, the noise on the residual field accounts for more than 20% of the input field for magnitudes of |h in | ≤ 0.2.
In Fig. S3, we show similar measurements collected for spin #307 as long as the predictions of the proposed quantum models. Among all spins that we tested, we found that spin #307 has the lowest level of residual field noise. In this particular case, we see that both noiseless and noisy quantum model predictions agree and we find an inverse temperatures consistent with the inverse temperature found for spin #309 with the noisy quantum model. This particular spin displays a low-level of residual field noise. As a consequence, the noiseless and noisy quantum models find the same inverse temperature. . List of regression coefficients for eight different spins found with the classical model, the noiseless quantum model and the noisy quantum model. The coefficients are obtained through the maximization of the log-likelihood function respective to each model. The inverse temperature found by the classical and noiseless quantum models differs from the inverse temperature obtained with the noisy quantum model for significant noise magnitudes. The other parameters are consistent between models which shows that these models form a hierarchy of increasing descriptive complexity. The transverse field responses for spins #304 to #307 differs significantly from the transverse field responses of spins #308 to #311. This suggest a difference in the hardware implementation of these two groups of spins for they are physically located on the two different sides of a chimera cell.
For the unit cell of interest to this work, we show the regression coefficients obtained with the classical model, the noiseless quantum model and the noisy quantum model in Table S2.
Finally, we study the effect of the noise probability distribution on the model. We compare our noisy quantum model derived from a binary probability distribution with the noisy quantum model that one obtains with a uniform noise distribution, i.e. f (h res ) = (2 √ 3h res sd ) −1 for h res ∈ h res 0 − √ 3h res sd , h res 0 + √ 3h res sd and zero otherwise. Predictions for the spin #309 made by these two models of noise are displayed in Fig. S4. The parameters for both models are identical and correspond to the parameters inferred with maximum log-likelihood for the binary probability distribution. We see that for noise magnitudes of h res sd = 0.048 the predictions are practically indistinguishable.

Learning of General Classical Distributions on Binary Variables
An arbitrary positive probability distribution on N classical binary variables σ ∈ {−1, +1} N can be represented in the the form of a Gibbs distribution with different interaction orders: where Z denotes the normalization factor known as parition function. In general, the number of terms can be exponential. However, physical systems are typically characterized by a finite number of short-ranged multi-body interactions. For instance, the renowned Ising model corresponds to the case where only first and second orders are present. In this section, we develop an exact learning method that allows one to reconstruct an arbitrary probability distribution on binary variables, i.e. recover the parameters {h i , J ij , J ijk , J ijkl , . . .} from a number of independent spin configurations sampled from the distribution (S011). For the case of Ising models with pairwise interactions only this reconstruction problem is known as inverse Ising problem, and has been extensively studied in the past with a number of heuristic methods [37]. However, the inverse Ising problem has been solved only recently, with the appearance of exact algorithms that showed that parameters of an arbitrary Ising model, including low-temperature and spin-glass models, can be recovered to an arbitrary precision with an appropriate number of samples [28,38]. The state-of-the-art near-optimal performance for inverse Ising problem has been recently achieved with the estimator based on the Interaction Screening Objective (ISO) [27]. For Ising model, the ISO reads where f (σ) M = M −1 M m=1 f (σ (m) ) denotes the empirical average over M independent samples, and J i denotes the set of couplings adjacent to node i, i.e. J i = {J ij } j =i for Ising models. It is easy to see that the ISO in (S013) is a convex function of parameters J i , h i . Furthermore, the unique minimizer of the ISO converges to the true parameters of the distribution in the limit of a large number of samples, and yields an O(1/ √ M ) error on the recovered model parameters for finite M [28]. The ISO is a local estimator, i.e. it is defined for each spin, and only involves couplings adjacent to this spin, so for reconstructing the entire model one needs to run N parallel reconstruction problems (S014). Notice that this procedure yields two estimations for the same coupling, J ij and J ji , and we use their arithmetic mean ( J ij + J ji )/2 as the final estimate of the coupling J ij . This estimator is used throughout the paper for reconstructing parameters of Ising models, where the required number of samples for a given precision and the associated expected confidence interval are obtained through synthetic numerical experiments, as explained below.
Here, we generalize this objective function to the case of general Gibbs distributions with multi-body interactions of the type (S011). The corresponding ISO reads (S015) Similarly to the Interaction Screening estimator for the inverse Ising problem (S013), it is easy to see that the ISO for general models is a convex function of parameters J i , h i . Let us present a simple argument that illustrates the fact that in the limit of large number of samples the unique minimizer (S014) of the convex ISO objective (S015) is achieved at ( J i , h i ) = (J i , h i ), meaning that the true interactions present in the model are fully "screened". Indeed, the ISO is an empirical average of the inverse of the factors in the Gibbs measure; if where · · · denotes the average over the measure (S011). Let us look at the derivative of the ISO with respect to a given coupling, say J ijk . This derivative corresponds to weighted three-body correlation, for instance ∂S * i /∂J ijk = σ i σ j σ k /F i (J i , h i ) , and this reflect the key property of the Interaction Screening based estimator. When ( J i , h i ) = (J i , h i ), ∂S * i /∂J ij | Ji,hi = 0, meaning that the minimum of ISO is achieved at ( J i , h i ) = (J i , h i ) as M → ∞. Again, similarly to the case of Ising models, after running N parallel local reconstructions, the resulting couplings J i = {J ij , J ijk , J ijkl , . . .} are symmetrized using all permutations of tuples (i, j, k, l, . . .). This estimator is used for probing general distributions with multi-body interactions in the next section.
It is important to highlight that the estimator given by the generalized ISO (S015) is exact: With a given number of samples M , the deviation between reconstructed ( J i , h i ) and true (J i , h i ) model parameters decay as ∼ 1/ √ M . Next, we describe an empirical procedure that we have developed to estimate the reconstruction fidelity in practice, which dictate how much data we needed in all experiments to statistically exclude the finite-sample considerations from all conclusions that we draw throughout the work.

Empirical Estimation of Reconstruction Errors
For finite number of samples M , couplings estimated via (S014) are accurate up to a certain error that decays with M . This error can be estimated theoretically, however the resulting worst-case bounds can be loose for a given model. To the best of our knowledge, there is no standard approach to tightly quantify the finite sampling error of learning models of the form (S011) in practice. To address this challenge, we propose the following empirical error estimation procedure for a given model with specific parameters and a fixed number of samples. Specifically, given a black-box sample generator B, a finite sample set M , and a replicate parameter R, we conduct the following procedure: 3. for r from 1 to R independent replicates do the following: (a) collect M synthetic samples from m using an auxiliary sampling algorithm; (b) reconstruct model m r from the collected samples; 4. compute statistics over the parameters reconstructed across the reconstructed {m r } r∈1,...,R models.
In this work the black-box sampler B is given by our DW 2000Q LANL QPU, and the sampling algorithm is a brute-force approach that enumerates every possible state and computes the exact probably of each state. The this brute-force approach is feasible for small number of spins which will be the focus of our targeted experiments. For larger problems one could utilize more scalable sampling algorithms, including those based on Markov-Chain Monte-Carlo techniques, or Belief Propagation with decimation. The last step will provide the information on the typical variability of reconstruction accuracy due the effect of finite samples. Given a set of R replicates of the models of the type (S011) reconstructed from R independent sets of samples. Let us define (h r , J r ) ∀r ∈ R as parameters of the models of the type (S011) learned from each of the R sets of samples. We also define (δh, δJ) representing deviations from the parameters (h, J) reference model that was used to produce synthetic samples: In the step 4 of the procedure above, we estimate the empirical mean and standard deviation on each of the parameters in the set h i , J i from R values. For a sufficiently large R, these quantities indicate an expected scale of error coming from model recovery for a given number of samples.

Probing of Multi-Body Interactions
The objective of experiment in this section is to leverage the ISO for general distributions on binary variables (S015) to determine the class of models that adequately describes the output distribution produced by the DW 2000Q LANL QPU. Although the target Hamiltonian has a form of a classical Ising model, a priori this distribution could be more general than the Ising Gibbs distribution (S012). In this section, we probe the existence of multi-body interactions beyond pairwise in the energy function of the output distribution. We conduct an experiment on a single chimera graph cell involving 8 qubits (see Table S1). As explained in the previous section, for an 8-spin model, the most general positive probability distribution can be written in the form (S011) with an energy function being a polynomial of order eight. Specifically, for σ ∈ {−1, +1} 8 , the possible distribution reads This model has a total of 8 h parameters 247 J parameters. Our goal will be to practically show existence or absence of multi-body interactions in the output distribution. Presence of interactions can be established if the reconstructed couplings are statistically significant, i.e. they are larger in absolute value than the reconstruction error resulting from a finite-sample reconstruction. We show that a few million D-Wave samples will be sufficient to provide an accurate reconstruction of model parameters.  The experiment we conduct here focuses on learning 8-th order models from the samples output by the D-Wave hardware on two canonical models, a ferromagnet and an anti-ferromagnet. The parameter details of these two models are presented in Table S3. The primary objective of this experiment is to determine what model parameters are statistically significant. Specifically, what learned parameters can-and-cannot be attributed to artifacts from to finite sampling and the model reconstruction algorithm. Through a procedure that was detailed in section , a number of supporting simulations are conducted to determine error bounds on the reconstructed model parameters. Leveraging the obtained error value, we determine the three standard deviations threshold that determine the statistical significance of the reconstructed values. Recovered couplings with absolute values above this threshold are very unlikely to be due to a reconstruction error, while values below can be explained by the finite sample noise in the reconstruction process.
Figs. S5 and S6 present the absolute values of the 255 model parameters broken down by the interaction-order for the ferromagnetic and the anti-ferromagnetic cases, respectively. We find that in both cases, a second-order model provides an accurate representation of the output distribution of DW 2000Q LANL. This experiment thus provides a convincing evidence that a second-order model is sufficient for modeling the distribution that the quantum annealer samples from for a range of input parameters of interest to this work.
We further validate these results by quantifying the typucal variation of model parameters used in the 8-th order reconstruction based on finite sampling error. To that end, the validation protocol described in section is executed with with R = 50 replicates and M = 10 7 , to replicate the number of samples used in the data collection for most of the key experiments in this work. Fig. S7 shows data for the mean and the standard deviation for each parameter deviation δh or δJ from the reference model, estimated from running the variance measuring procedure defined in the section on the reconstruction models. We see that estimated variance of reconstructed values is very similar across different couplings. The average-case variance across all model parameters using three standard deviations are 0.0034 and 0.0021 for the Ferromagnet and Anti-Ferromagnet models, respectively. These values have been used for computing the threshold values that appear in Figs. S5 and S6 in the previous section.

Reconstruction and Validation of Two-Body Models
The previous section conducted a 8-th order reconstruction with 10,000,000 samples and argued that a 2-nd order model provides a sufficient approximation of the distribution that the DW 2000Q LANL hardware samples from. To that end, the remaining experiments in this work focuses on learning 2-nd order models of the hardware's output distribution. Leveraging the knowledge that this 2-nd order model is sufficient has a significant advantages in that the number of model parameters reduces from 255 (8-th order) to only 36 (2-nd order), which in turn reduces the amount of data required to accurately learn the associated 2-nd order model. To further lessen the data requirements, we reduce the number of qubits considered from 8 to 4, focusing on the upper-half of the cell. This effectively decreases the number of 2-nd order parameters from 36 to 16. It will later become evident that these reductions are necessary to make the experiments viable on reasonable time scales. After conducting these model reductions, we replicate here the validation experiments from the section to establish an optimal number of samples required for reconstruction of parameters in the two-body model. Table S4 specifies the input parameters for two additional models, strong ferromagnet and strong anti-ferromagnet, using coupling sign convention consistent with (S011). Fig. S8 presents the results of the validation experiment for the 2-nd order model reconstruction. 50 reconstruction replicates are used in the validation experiments conducted in this section. The results indicate that reconstruction accuracy is approximately 0.0025 and 0.0022, which is comparable to the accuracy used in the 8-th order reconstruction experiment, using a smaller number of samples. Fig. S9 presents the strength of the second order terms that are recovered from the hardware data. These absolute values are well above the recovery accuracy threshold, indicating that 4 · 10 6 samples are sufficient for accurately recovering the two-body model. We use this number of samples in the remainder of experiments in this paper.
In Fig. 2   distribution. In particular, we find that among statistically significant couplings in the output distribution, the strongest ones are in correspondence with the input couplings, while the weakest ones are the spurious couplings that are not present in the input problem and, moreover, are absent in the chip topology. In what follows, we construct additional experiments aimed at clarifying the nature of these spurious couplings.

Impact of the Annealing Time
In all presented experiments thus far, we used the annealing time of 5µs per each sample. It is important to understand how the statistics of these reconstruction experiments might differ as the annealing time varies. Here,

Second−Order Interaction Distribution
Interaction Magnitude (log10) we replicate the model reconstruction experiment for the Strong Ferromagnet and Strong Anti-Ferromagnet models, using the following varying annealing time parameters, annealing time = 1, 5, 25, 125, 625, which corresponds to single-run annealing time of 1µs, 5µs, 25µs, 125µs, 625µs respectively. Additionally, in this experiment, the num reads parameter was reduced from 10,000 (this work's default) to 4,000 in all cases, to adhere to the maximum job run-time limit of the DW 2000Q LANL QPU. Fig. S10 presents the results of this experiment. We find that an increase of the annealing time by two orders of magnitudes results only in a slight increase of the absolute coupling values in the reconstructed model. At the same time, this minor change comes at a significant increase in data collection time. Due to this observation, in this work we chose to standardize around the annealing time of 5µs, which is essential for the high-throughput data collection, and represents the fastest available annealing time across several generations of quantum annealers. the number of unknown values in (S022). The core experiment of this section consists in performing a series of 2-nd order reconstruction experiments over random input models, and then in using these pairs of input-output models to recover the coefficients in the quadratic response function. The primary challenge of this experiment is the time required to collect a sufficient amount of data to fit the quadratic response function. To minimize the data requirements, we focus on the 4-spin model defined as N , E in Table S1. For this specific model, the quadratic functions (S020), (S021) have 57 parameters; we consider 250 inputoutput model pairs to recover these parameters. Each of the 250 input models is selected i.i.d. from the following input parameter distribution,  reconstructs the outputs for 250 input models using 4,000,000 samples for each model, which results in a total of a billion samples collected. Characteristic dominant terms in the recovered quadratic response function is presented in the Fig. 4 of the Main Text (data on all measured quadratic response terms is given below, in Sections and .). The zero order terms in the response functions are interpreted as residual fields and couplings; the first-order terms are related to the native couplings present in the chip; and finally, the second-order terms are responsible for the spurious couplings. In previous work, the primary hypothesis behind the response function was formulated in terms of a particular case of a linear assumption [18,20,[29][30][31][32], where each parameter would be multiplied by a single effective temperature. The non-linearity of the general response function that we construct here may explain why this effective temperature was found to be instance-dependent: This corresponds to a linear approximation of a non-linear function.
Spurious couplings identified under a careful statistical analysis indicate that a simple linear model is not sufficient for an accurate characterization of the D-Wave's input-output relationship. It is important to note that the secondorder response that we find here is different from the previously observed next-nearest-neighbour couplings in the strong input regime, where a quadratic cross-talk relation with an opposite sign susceptibility has been suggested (see the section "Compensation of qubit nonidealities" in the Methods of [13]). We conjecture that the emergence of the next-nearest-neighbour couplings observed in the strong regime has a quantum nature and is due to the induced effects of the transverse field; a detailed exploration of this phenomenon is beyond the scope of our study that focuses on the classical regime of the output distribution in a multi-qubit setting.
The discovery of strong and structured quadratic response functions for the output distribution that our D-Wave DW 2000Q LANL QPU samples is invaluable to applications such as hardware calibration, problem embedding and accurate sampling. However, identifying the root cause of these unexpected output parameters can provide valuable information about how to design better quantum annealers and can provide novel analytics for evaluating the performance of quantum annealers. To that end, the next section provides the theoretical grounds to explain the quadratic response as side effects of instantaneous qubit noise.

Characterisation of the Local Field Variably
In preparation for constructing a model that would explain the form of the quadratic response function, in this section we investigate possible drifts of the reconstructed model parameters. To this end, we perform reconstructions over several days, and monitor the stability of the recovered model.
The foundation of this variability study is the reconstruction of the output distribution of the zero value problem, that is h in , J in = 0 as shown in Table S5. Furthermore, we would like to perform this reconstruction accurately but with a minimum number of samples, so that the possible flux drift dynamics can be observed in the time between multiple reconstructions. We begin by calibrating a 2-body reconstruction specifically for the zero-value problem by proposing that only 200,000 samples are required for an accurate reconstruction, which is 20 times less data than what is used for a typical 2-body reconstruction. Repeating the previous reconstruction variability analysis, Fig. S11 presents both the variance and recovery accuracy of this zero-valued model. The reconstruction accuracy is approximately 0.007, which is about two times less accurate than the previously considered reconstruction experiments. However, we find this accuracy is still acceptable as our primary interest in this experiment are model values that are above 0.100. Indeed, Fig. S11 indicates that the DW 2000Q LANL QPU has a number of biases that are on the order of magnitude larger than reconstruction accuracy threshold of 0.008.
Next, we investigate how the reconstructed values of the zero problem change over time. The objective is to understand how the low frequency noise in the hardware changes over time and how that can impact data collection over the span of minutes to hours. In this experiment, the data for the proposed zero problem is collected at 10 minute intervals over a period of 48 hours and then the 2-body reconstruction is used to recovery a model from the observed samples. Fig. S12 shows the reconstructed values over time and whiskers around the points indicate the error bounds on the model reconstruction values. Considering the mean values of these time series one can observe that there is a persistent bias on both the reconstructed fields and couplers, which is most likely an artifact of the initial hardware's calibration. Looking at the variance of the time series highlights a high variance in the fields terms and a much lower variance in the coupler terms. Overall, the results of these experiments suggest that all parameters of the output distribution at the exception of local fields remain stable over time.

Explaining Quadratic Response via Instantaneous Qubit Noise
In the previous section , we saw that magnetic field is affected by comparably large fluctuations. However, this analysis was conducted by performing the reconstruction over a certain time window, which may average out the fluctuations. On the other hand, the analysis in section allowed us to estimate instanteneous fluctuations of the residual random magnetic field for each qubit individually. Here, we show that these rapid fluctuations in the individual qubit fields can be responsible for spurious effective interactions with non-trivial quadratic-type responses in the input quantities. We start by showing theoretically the type of spurious interactions one may expect to reconstruct on toy models for which we can derive close form formulas. Then, using numerical simulations, we will quantify the quadratic response caused by noise on a four qubit system and compare these predictions with the quadratic response measured on the D-Wave quantum annealer.

Spurious Magnetic Field Response
We first consider a simple system consisting of two classical spins σ 1 , σ 2 ∈ {−1, 1} linked by a coupling J ∈ R. The first spin is subject to a noisy magnetic field h sd 1 s 1 whose direction varies according to the uniform random variable s 1 ∈ {−1, 1}, whereas the second spin is subject to a constant magnetic field h 2 . The Bolzmann distribution of this two spins system at inverse temperature β for a particular noise realization s 1 is given by the following expression, where the partition function Z(s 1 ) depends on the noise realization s 1 . Suppose now that we want to perform an Ising model reconstruction using a collection of iid. samples from the distribution in Eq. (S025) where the noise realization changes randomly on a sample to sample basis. Our collection of samples end up arising from a mixture of models as around half of the configurations comes from µ(σ 1 , σ 2 | s 1 = +1) and the other half comes from µ(σ 1 , σ 2 | s 1 = −1). Therefore, the effective model that we can reconstruct with this heterogeneous collection of samples is the following mixture of Ising models, This effective model is also an Ising model and, after some algebra, it can be explicitly formulated with respect to the initial coupling and fields, where h effective For small coupling and field magnitudes J and h 2 , the expression in Eq. (S028) reduces to h effective 1 ≈ −βJh 2 tanh (βh sd 1 ) 2 . We see with this toy model that fast fluctuating magnetic field noise induces an effective magnetic response. The samples coming from a mixture of models with field noise are indistinguishable from samples coming from the single model with an effective response. The key qualitative features of this response is 1) its intensity is roughly proportional to the product of the coupling and opposite (constant) field intensity, and 2) the sign of the response is negative. This effect being roughly proportional to the square of the standard deviation of the noise, it is negligible for low noise values but becomes much more pronounced when the noise becomes large.

Spurious Coupling Response
To illustrate how spurious couplings can occur from a mixture of noisy Ising models, we look at a chain of three spins σ 1 , σ 2 and σ 3 connected via couplings J 12 and J 23 . We assume that the spin at the extremity of the chain are subject to a noisy magnetic field h sd 1 s 1 and h sd 3 s 3 where the random variable s 1 , s 3 ∈ {−1, 1} are independent and uniformly distributed. The magnetic field on the middle spin is assumed to be zero. For a particular noise realization, the Boltzmann distribution of this chain of spins is given by the following conditional distribution, where the partition function Z(s 1 , s 3 ) is a function of the noise realization. Similarly to the previous subsection, we consider fast fluctuating noise that changes randomly on a sample to sample basis. In this case, a collection of samples coming from the mixture of noisy Ising models described by Eq. (S029) becomes indistinguishable from iid. samples coming from the effective model, This effective model appears to be an Ising model as well with no magnetic fields and with an additional coupling between σ 1 and σ 3 , where the effective coupling J effective 13 can be explicitly written with respect to the mixture parameters, For small coupling values, Eq. (S032) reduces to J effective 13 ≈ −βJ 12 J 23 tanh (βh sd 1 ) 2 tanh (βh sd 3 ) 2 . We immediately see that this coupling response induced by field noise retains the main qualitative features observed in the previous subsection. The intensity of the response is quadratic in the couplings J 13 and J 23 for it is proportional to their product and the sign of the response is negative. Note that this coupling response, which involves three spins, is predicted to be weaker than the magnetic field response discussed in the previous subsection as it is roughly proportional to the square of both noise standard deviations h sd 1 and h sd 3 .

Simulations and Predictions Using Single Spin Measurements
In the previous subsections, we have seen on simple toy models that field noise leads to effective field and spurious coupling responses. For small coupling and field magnitudes, these "spurious" responses were mainly quadratic in the input parameters. We now want to quantify the quadratic responses cause by field noise on a realistic four spin system and compare them to the type of quadratic responses found experimentally in Section . This four spin system being already too complex to obtain a closed form formula, we resort to using numerical simulations to extract the quadratic response coefficients. We model the system by a classical Boltzmann distribution conditioned on a noise realization s ∈ {−1, 1} 4 of the field noise parameters. The probability to obtain a configuration σ ∈ {−1, 1} 4 given s reads as follows, where the partition functions Z(s) is noise dependant. The Hamiltonian describing the magnetic field interaction contains terms h i for input fields, h bias for permanent biases and h sd i for the standard deviation of the noise as described in Section . An individual temperature β i is also assigned for each spin, The coupling Hamiltonian contains terms J ij for the input coupling strengths that are only along physical couplers and possessing their individual temperatures β ij . Motivated by considerations from Section , we assume that the interactions are noiseless and without biases, The effective model describing the probability distribution of the four spin system is obtained after averaging Eq. (S033) over the uniform and independant noise realizations, The numerical procedure to reconstruct from Eq. (S036) a quadratic response as a function of the input couplings J ij and fields h i is reminiscent of the experimental protocol described in Sections . We start by randomly selecting 20000 input coupling and field configurations whose values lies in the set {−0.05, −0.04, · · · , 0.05}, see Eq. (S023) and Eq. (S024). Then for each of these configurations, we compute numerically the effective frequencies of the 2 4 = 16 spin configurations using Eq. (S036) and summing over the 2 4 = 16 possible noise realizations. These frequencies are used in our reconstruction procedure, described in Section , to infer an effective Ising model with Hamiltonian, . Finally, we fit a quadratic response model between the inputs configurations and their corresponding inferred effective couplings, as described by Eq. (S020) and Eq. (S021), following the optimization procedure in Eq. (S022). The spins σ 1 , σ 2 , σ 3 , σ 4 in our model are identified with the hardware spins #304, #308, #305, #309 respectively. The values of field temperatures β i , field biases h bias i and field noise standard deviations h sd i are chosen to be those measured using the single spin quantum experiments described in Section . These values can be found in the last column of Table S2. The values of the coupling temperatures have been adjusted such that the simulated and measured effective temperatures coincide, i.e. β 12 = 12.1, β 14 = 12.2, β 23 = 12.5 and β 24 = 12.6.
The typical simulated and measured response for existing couplings, fields, and spurious couplings are depicted in the Fig. S13. We see that the patterns are qualitatively following the predictions from the simple theoretical models: The response for existing couplings is a linear self-transform or effective temperature, the response for fields consists in an effective temperature and a negative quadratic response from adjacent couplings and connecting neighboring fields, and finally the spurious couplings are formed by a negative quadratic response from adjacent couplings that span a triangle with the spurious coupling. The main noticeable difference with the theoretical predictions is the lowering of the coupling effective temperatures from its model temperature due to the presence of field noise. The comparison between the measured and simulated effective temperature can be found in Table S6. The leading offdiagonal coefficients of the quadratic response, both simulated and measured, are displayed in Table S7. The effective temperatures found through the simulation remarkably matches the effective temperatures measured in the hardware with a maximum difference of at most 7%. The negative sign of the quadratic response and the type of interactions involved is in perfect agreement with the theoretical model. There exists a discrepancy in the magnitudes predicted by the simulation and those found experimentally. The predictions are up to two times weaker for the susceptibility of the fields and up to four times weaker for the spurious couplings. This can be explained by the strength of the noise induced response as our theoretical models predicts that the field response is a second order effect in the noise parameters, see Subsection , and the spurious links response is a fourth order effect in the field noise intensity, see Subsection . Therefore, only a 40% difference in the single spin noise standard deviation can explain such differences. Note that the the quadratic response of the spurious coupling J 304,305 being weaker that the spurious coupling J 308,309 is correctly predicted by the simulation and with a similar ratio. FIG. S13. General quadratic response of effective output parameters. A comparison of the simulated and true responses at the quadratic level are given for the effective parameters J out 304,305 , J out 308,309 , and h out 309 . The dominant terms in the general quadratic response presented here are discussed in the Fig. 4 of the Main Text. Notice that here and below, we use a symmetric matrix representation for the quadratic response, which results in a factor 2 difference for matrix elements compared to the results presented in the Fig. 4

Impacts of Spin Reversal Transformations Theoretical Considerations
We are looking at distributions of Ising models on N spins σ = {σ 1 , . . . σ N } depending on input couplings J and magnetic fields h. If we consider that the fields are potentially noisy and there exists individual residual fields b i , random or deterministic, the Boltzmann distribution takes the following form, where β ij and β i are effective individual temperatures for couplings and fields respectively. The residual fields may have potentially strong undesirable effects such as favoring particular spin configurations among others that were initially designed to be equiprobable. There exists a heuristical method that aims at mitigating this problem called the spin reversal transform (SRT). This method consists at looking at 2 N possible remapping of the model, each of them indexed by a "gauge" which is a binary configuration τ = {−1, 1} N . For a given configuration τ , this gauge transform maps a spin configuration, input couplings and input fields to the values σ τ , h τ and J τ in the following way, The particularity of the transformation (S039) is that it creates an equivalence relationship between Hamiltonians without residual fields as for any gauge τ , For system without biases, it implies that samples generated from any set of gauge transformed inputs h τ and J τ are identical after a remapping of the samples using the same gauge σ τ . The SRT method consists in generating samples from a mixtures of randomly selected gauge transformed models with residual fields, and which are therefore no longer equivalent, in order to empirically average over the residual field values. The effective model describing this mixture is given by the average of the Bolztmann distribution (S038) over all possible gauge transforms and reads, where the gauge transformed models is explicitly expressed as follows, Note that the partition function in Eq. (S042) depends on the gauge transforms through the residual field values. We see with Eq. (S042) that the sign of the residual fields are effectively randomly flipped with the SRT method. Thus, the SRT removes the undesirable effects of permanent residual fields but transforms it into magnetic field noise with other potentially unwanted effects such as lower effective temperature, spurious links and field quadratic response described in Section . For the D-wave hardware where the fields noise is more important compared to permanent biases, see Section , the SRT removes permanent biases for only a limited increase in the noise. Therefore, the Ising model reconstructions appears less fluctuating over time with SRT than without. To illustrate this last point, consider an Ising model with permanent residual fields b i and no input fields i.e. h ≡ 0. In this model, the average of value of the spins are non-zero in general and are a non trivial function of the residual fields and couplings. However, in the effective model produced by the SRT µ effective (σ | J, 0), the average value of each spin is identically zero. This implies that the effective model has zero effective magnetic field regardless of the value of the residual biases. To see this, we first note that when h = 0, the partition function of a gauge transformation is invariant under a global sign change Z τ = Z −τ . This further implies the following equivalence between two probabilities of gauge transformed spin configurations µ(σ τ | J τ , 0) = µ(σ −τ | J −τ , 0) for all τ . Since σ −τ = σ τ , it shows that the average value of any spin σ u vanishes as,

Mitigating Persistent Bias and Flux Drift
Section highlighted persistent biases that the hardware exhibits. A useful feature of the spin reversal symmetry group is that combining data collected from symmetric models has the effect of averaging out persistent biases. This property is highlighted by replicating the field variability experiment with and without D-Wave's spin reversal transform feature. In this revised experiment, the data for the zero problem is collected at 1.5 minute intervals over a period of 2 hours and then the 2-body reconstruction is used to recovery a model from the observed samples. This experiment is conducted with two settings, raw data using this work's standard setting of num spin reversal transforms = 0 and spin reversal transform data using num spin reversal transforms = 10 (this setting results in a total of 200 transforms across the 200,000 samples collected for the zero-problem). Fig. S14 shows the reconstructed values over time. Table S8 presents the mean and variance of each of the field values in these time series. These results highlight how the spin reversal transforms can provide a drastic mitigation of the hardware's persistent bias. An unexpected result from conducing these spin reversal transforms is a notable reduction in the variance of the field values. Although the root cause for this reduction is not clear, we hypothesize that it is a side effect of an increased number of QPU programming cycles, which present another source of biases during the data collection process. In any case, the notable bias mitigating impacts of spin reversal transforms can have a significant positive impact on applications where the user would like an unbiased output distribution.
FIG. S14. Illustration of the impact of spin reversal transforms on the persistent biases and flux drifts. Reconstruction of the parameters of the zero input problem was repeated over several hours. Reconstructed output parameters are not exactly centered around zero, and instead are fluctuating around values that can be interpreted as persistent biases. Local fields present much larger fluctuations compared to both native couplers and spurious couplings, that are significantly larger than the reconstruction error, a signature of the instantaneous qubit noise averaged over the reconstruction period. The left plot features normal raw data, and the right plot utilizes data obtained with random spin-reversal transformations applied to the input problems, which to a large extent eliminates the impact of persistent bias and reduces the variance of fluctuations.

Model Reconstruction with Spin Reversal Transforms
Given the potential bias mitigating benefits of spin reversal transforms, it is natural to inquire how this feature impacts the results presented thus far. We begin by reviewing the two-body reconstruction results presented in Fig.  S15, which provides a side-by-side comparison of the results of Strong Ferromagnet model from Table S4 with and without spin reversal transforms. As the first observation, we notice that there is a notable change in the results of the zero-order terms. In the case with spin reversal transforms the zero-order terms are near zero, while statically significant non-zero values are exhibited in the raw data. The second observation is that the first-order terms do not show a notable change; in fact, these two reconstructions are remarkably consistent with and without spin reversal transforms. Both of these results indicate an absence of detrimental artifacts from utilizing this feature during data collection at the time scales that these experiments require.

Quadratic Response with Spin Reversal Transforms
Figs. S16 and S17 replicate the quadratic response experiment from Section with and without spin reversal transforms. This comparison shows that the overall consistency of the quadratic response picture. In accordance with the theoretical predictions outlined in the beginning of this section, we observe that the persistent bias essentially disappears under the SRT setting, and the overall response becomes much cleaner.

First−Order Interaction Distribution
Interaction Magnitude (log10) FIG. S15. The first-order (top) and second-order (bottom) terms of the two-body reconstruction without (left column) and with (right column) spin reversal transforms for the Strong Ferromagnet model.

Tests Across Three Generations of Quantum Annealers and Validation of the Instantaneous Noise Model
The previous sections have argued that the quadratic response model is a good valuable tool for characterizing the input-output behavior of a quantum annealer. Furthermore, the strength of spurious links can provide an indirect measurement of the instanteneous qubit noise that is occurring on a specific hardware device. We had the opportunity to collect data for the response analysis on 7 distinct QPUs spanning three generations of quantum annealing hardware. The specific device names and components used in the experiment are presented in Table S9.
Figs. S18-S25 present the quadratic response motifs from all of the QPUs that have been tested. Overall, the results are remarkably similar, which suggests the universality of the quadratic response characterization that is capturing fundamental properties of D-Wave's quantum annealing implementation, like effective temperature, persistent biases, and instantaneous noise in the local fields. With the exception of the new lower-noise QPU, the existence and strength of the spurious links is consistent across hardware realizations. We have observed that some QPU implementations feature asymmetric spurious links while others are symmetric. Identifying the root-cause of this distinction is an ongoing point of investigation.
A significant difference in the response function has been obtained for the lower-noise version of the D-Wave 2000Q annealer [33]. We observe a drastic reduction of the susceptibility responsible for the strength of the spurious couplings, while the linear scale terms remain on par with other 2000Q implementations, see Fig. S26. This result therefore provides strong evidence in support of the noise-based model introduced in this work. We anticipate that

RAW SRT
FIG. S16. A comparison of the key motifs of the quadratic response function (here, for the chimera couplings) without (left) and with (right) spin reversal transforms. The motifs are largely similar, however one can notice a considerable reduction in apparent noise in the quadratic response of the first-order terms. We hypothesize this is due mitigation of the flux qubit drift that occurs throughout the many hours of data collection required by this analysis. the quantitative measurement of the susceptibility associated with the spurious couplings using methods developed in this work will provide a valuable characterization of the qubit noise in the next generations [34] of quantum annealers and other analog machines with binary output statistics.

RAW SRT
FIG. S17. A comparison of the key motifs of the quadratic response function (here, for the local fields and spurious couplings) without (left) and with (right) spin reversal transforms. The motifs are largely similar, however one can notice a considerable reduction in apparent noise in the quadratic response of the first-order terms. We hypothesize this is due mitigation of the flux qubit drift that occurs throughout the many hours of data collection required by this analysis.

Open-sourced Tools
The experiments conducted in this work require the collection of billions of samples from D-Wave's quantum annealer and reconstruction of graphical models with multi-body interactions. However, neither of these tasks is readily supported by established software and the following software was developed and released as open-source to support this work. The first software is the D-Wave Ising Sample Collector (DWISC, github.com/lanl-ansi/dwisc), which enables the collection of millions runs on D-Wave hardware by orchestrating a series of jobs that conform to D-Wave's single-job run time limit of three seconds. The second software is GraphicalModelLearning (GML, github. com/lanl-ansi/GraphicalModelLearning.jl), which takes empirical state distributions and reconstructs effective multi-body graphical models in a factor graph representation, leveraging the Interaction Screening method described in Section and state-of-the-art second-order nonlinear optimization algorithms to provide model reconstructions that require the least amount of data. The notable improvement of reconstruction accuracy of interaction screening framework over established approaches, such as mean-field, is discussed at length in [27]. DWISC and GML form the foundation of the experiments in this work by providing the data and algorithms required to reconstruct high-accuracy multi-body models of the output from D-Wave's quantum annealer.

Low-Noise 2000Q 2000Q
FIG. S26. Heat maps representing the quadratic terms of the quadratic response function for the spurious link output parameters on a regular (left) and lower-noise (right) 2000Q quantum annealers. The significant reduction in the link strength in the lowernoise response (from -4.1 to -0.9 and -6.8 to -1.9) confirms the theoretical model of high-frequency qubit noise.