The Perils of Embedding for Sampling Problems

Sampling from certain distributions is a prohibitively challenging task. Special-purpose hardware such as quantum annealers may be able to more eﬃciently sample from such distributions, which could ﬁnd application in optimization, sampling tasks, and machine learning. Current quantum annealers impose certain constraints on the structure of the cost Hamiltonian due to the connectivity of the individual processing units. This means that in order to solve many problems of interest, one is required to embed the native graph into the hardware graph. The eﬀect of embedding for sampling is more pronounced than for optimization; for optimization one is just concerned with mapping ground states to ground states, whereas for sampling one needs to consider states at all energies. We argue that as the problem size grows, or the embedding size grows, the chance of a sample being in the logical subspace is exponentially suppressed. It is therefore necessary to construct post-processing techniques to evade this exponential sampling overhead, techniques that project from the embedded distribution one is physically sampling from, back to the logical space of the native problem. We show that the most naive (and most common) projection technique, known as majority vote, can fail quite spectacularly at preserving distribution properties. We show that, even with care, one cannot avoid bias in the general case. Speciﬁcally we prove through a simple and generic (counter) example that no post-processing technique, under reasonable assumptions, can avoid biasing the statistics in the general case. On the positive side, we demonstrate a new resampling technique for performing this projection which substantially out-performs majority vote and may allow one to much more eﬀectively sample from graphs of interest.


I. INTRODUCTION
Improving the efficiency of sampling from certain distributions, such as Boltzmann distributions, could provide significant benefits for machine learning and optimization.Sampling is a challenging task; for example, sampling from a Boltzmann distribution at sufficiently cold temperature is NP-hard.Special-purpose hardware, such as quantum annealers, have been proposed as potentially providing improved sampling capabilities [1][2][3][4][5][6].
Many interesting cases can be reduced to sampling from a Boltzmann distribution e −βH(s) /Z with H(s) a classical Ising model Hamiltonian of the form where the real-valued couplings J ij and local fields h i fully specify the problem, with the partition function Z = s exp(−βH(s)) for normalization of the probability distribution.The energy (cost) associated with state s = (s 1 , . . ., s N ) is given by H(s), where the spin variables s i take values in {−1, 1}.For optimization purposes, one is interested in the low cost configurations, or ideally the global minimum.
Depending on the problem one is considering, the couplings can define a complicated graph, such as a 3dimensional graph, or even a fully connected graph.Additionally, the hardware may have only selected couplings available, often just between nearest neighbors on the chip.For this reason, embedding is often necessary, where embedding maps Eq. (1) to a new Hamiltonian of a similar form but where the couplings are only specified over  1).In the embedded graph (right) an additional variable is used, with the green vertex being split into two, coupled with strength JF .This combined variable is often referred to as a logical vertex, or logical spin.Note, the embedding process is in general not unique.
the hardware graph where angle brackets denote the sum is over a restricted graph, and s necessarily contains more variables than s.See Fig. 1 for a simple example.For optimization, the requirement on embedding is that from the global minimum of Eq. ( 2) one can infer the global minimum of Eq. (1).We call this the global-to-global property.
Embedding, and the related topic of parameter setting, is a well studied concept, beginning with the early work of Choi [7,8].
Here, we consider the effect of embedding on the distribution as a whole.For concreteness, imagine the goal is to sample from a distribution D which depends on Hamil-tonian Eq. ( 1), obtaining samples of the form D(H). If one has a perfect sampler, but with a restricted topology, instead one will sample from a Hamiltonian of the form Eq. ( 2), thus obtaining samples D( H).In order to sample from the target Hamiltonian Eq. ( 1), one therefore needs to perform a projection on the sampled distribution Π : D( H) → D(H).The goal is to find a suitable choice of Π so that the target distribution is faithfully represented.
We focus on the case where D corresponds to a Boltzmann distribution, i.e.D(H) = exp(−βH)/Z where Z is the partition function, and β an inverse temperature.These distributions are of particular relevance given recent work using quantum annealers with a restricted topology to generate thermal samples from (classical) Hamiltonians of the form Eq. ( 1), for use in machine learning [1][2][3][4][5][6].The main goal of this work is to demonstrate that as problem sizes increase, the greater the need to develop new techniques for mapping from the embedded distribution to the native, logical distribution.Our results are three fold.First we will outline in more detail the problem of sampling from an embedded problem.In particular, we argue, and demonstrate numerically, that the number of samples received from D( H) requiring a non-trivial projection procedure grows exponentially in problem size N .That is, the probability of observing a sample from within the logical subspace decreases exponentially.We also show that, under a reasonable set of assumptions, it is not possible to find a projection Π that preserves Boltzmann distributions exactly.To highlight this further, we study perhaps the simplest (and most common) type of projection technique, typically referred to as majority vote (MV), showing that it is a poor choice in general.Next, we introduce a resampling technique (that we call RRS), which empirically outperforms MV.We finish with a discussion and outline possible future research directions based on this work.

II. EMBEDDING: DEFINITIONS AND NOMENCLATURE
An embedding uses multiple physical spins (vertices), and couplings between them, to represent single spins in the original problem on the connectivity-limited hardware.If one performs an edge contraction over these vertices in a specified manner, one will arrive at the graph for the original Hamiltonian.This general idea is illustrated in Fig. 1, where a triangular graph is embedded into a square graph, resulting in one additional variable, and one additional coupling which we denote by J F .The task of picking J F requires special attention; lower bounds on choices of the additional parameters to achieve the global-to-global property are given in Refs.[7,8].
More formally, consider the graph G H associated with Hamiltonian H of the form Eq. (1).Each spin s i in the model H defines a vertex i in G H , and a coupling between spins J ij defines a weighted undirected edge between ver-tex i and j.Each node also has associated with it the corresponding local field h i .
The graph G H is embeddable in another graph G if there exists a mapping φ : G H → G such that 1) each node i of G H is mapped to a (connected) subtree T i of G, with T i ∩ T j = ∅ for i = j, and 2) for each edge (i, j) of G H of weight J ij , there are edges from T i to T j in G which cumulatively sum to J ij .We also require that the local fields of each T i sum to h i .In this way, G H can be constructed from G by contracting the edges of each T i .Since the subtrees T i necessarily introduce additional variables, the dimensionality of the configuration space dim H = 2 Ñ ≥ dimH = 2 N where H and H are the configuration spaces for the models H and H respectively, with Ñ and N variables.
A configuration sL ∈ H, for which in each subtree T i the spins are all aligned identically, is known as a logical configuration, and belongs to the logical subspace HL ⊂ H of size dim HL = 2 N .Any configuration in HL has a corresponding and unique configuration in H which is found by simply replacing the identically pointing spins in each subtree by a single spin of same orientation.We will therefore throughout refer to the subtrees {T i } N i=1 as logical subtrees, or as logical spins when referring to the equivalent variables in model H.If a logical spin contains spins of differing orientations, we will often refer to these as broken.
In order to encourage the spins composing a logical spin to align under thermal sampling, strong ferromagnetic bonds J F < 0 can be placed between the vertices in the logical subtree, so that there is a cost penalty related to |J F | whenever a spin is misaligned.If J F can be chosen to be infinitely large and negative, thermodynamic sampling at finite temperature guarantees one never observes a configuration outside of the logical space.Practically, however, the size of |J F | is limited, both by the hardware and since too large a |J F | can introduce large energy barriers and deep local minima in the landscape of the problem, making it prohibitive for thermal algorithms to traverse [9].
Embeddings of this type guarantee that for any configuration s ∈ H of cost H(s), there is an equivalent logical configuration s ∈ H with cost H(s) = H(s) + C where C is a constant and global energy shift (i.e.independent of any particular s).If subtree T i in G contains n i vertices, with edge weights all J F , the energy shift C is simply given by This property is crucial for sampling purposes since it guarantees relative thermal sampling weights w ij := exp(−β(H(s i )−H(s j ))) are preserved by the embedding process, where s i,j are spin configurations.In particular, if we denote the Boltzmann distribution for Hamiltonian H at inverse temperature β over H by D H (H, β), then, restricting to the logical subspace of the embedded problem preserves the distribution:

A. Embedding Graph
Throughout this work, we use as our hardware restricted graph G = G(K, J F , N ) one in which each subtree is a chain (i.e. a path) with the same number of vertices K, and internal logical spin couplings all of the same strength J F .The total number of spins is N × K.Each problem coupling J ij of H is a single edge in G also of weight J ij , and local fields, h i , are divided evenly between each spin in a logical spin (i.e. with value h i /K).
In the hardware graph, each spin has coordinate (k, i) where i is the logical spin index (equivalent to a vertex index in G(H)), and 0 ≤ k < K denoting the spins position within the chain.We have two ways to connect logical spins in the hardware graph.If there is an edge J ij = 0 in G(H), we can either i) pick random 0 ≤ k i , k j < K such that there is an edge ((k i , i), (k j , j)) with weight J ij in G, or ii) follow a deterministic embedding such that: for This flexibility allows us to either i) simulate random embeddings in the hardware graph, or, ii) perform a direct comparison between different problems using a fixed embedding procedure.The first point is intended to address the fact, as mentioned in Sect.II, that there is typically not a unique choice of embedding, and the second point is so we can later compare between different projection techniques using the same embedding.
An example of our hardware graph is shown in Fig. 2, for K = 3, for the deterministic embedding.
We pick this graph G since each logical spin is treated equivalently, therefore allowing us to study directly the effect of changing J F and K on sampling quality.Moreover, we can embed any type (i.e.fully connected) of graph of size N into G(K, J F , N ).Throughout, our units are defined relative to the native Hamiltonian, i.e. relative to max{|J ij |, |h i |} (which we pick here to be 1 for convenience).

III. THE PROBLEM OF SAMPLING AFTER EMBEDDING: ANALYTICAL RESULTS
Sect.II introduced the key ideas behind embedding.We will now elaborate on this to highlight potential issues using embeddings in a sampling task.We focus on the task of Boltzmann sampling, however similar arguments can be applied to any form of sampling in which the statistics may be biased by embedding and projecting.
Our main result is an equation which shows that for a given embedding, and at fixed temperature, the probability of observing a configuration within the logical sub- Example of our embedding graph G(K = 3, JF , N = 5) for a 5 variable fully connected graph.Each horizontal row of three spins (yellow circles) is a logical spin (subtree); each variable is represented by K = 3 logical variables in this example.Some labels with spin coordinates are shown: each spin has coordinate (k, i), where i is the logical spin index and k denotes the spin position within the chain.Red (solid) lines indicate ferromagnetic couplers of strength JF < 0 'gluing' the logical spins, and the black (dash) lines are the problem couplings Jij between variables.Local fields are also present (divided evenly across the spins in a subtree), but are not shown for simplicity.For larger problems, or different logical subtree sizes (K), this basic structure can be repeated indefinitely (see main text).Performing an edge contraction over the red (solid) ferromagnetic edges results in the native fully connected problem.
space HL is exponentially small in problem size N , and also the subtree sizes K.This means that it is not practical to simply restrict to this subspace and utilize Eq. ( 4).
We see this striking unfavorable exponential scaling in Fig. 3, which for a fully connected graph under the embedding parameters and temperatures we study, demonstrates once the size is above around N ≈ 120, only around one sample per billion will be from the logical subspace.It is therefore prohibitive to simply discard solutions from outside of the logical subspace, for any problem of even modest size (e.g. 100 spins).The hotter the distribution, the worse the scaling and the more likely it is to leave the logical subspace.We study the exponential scaling in the next subsection.
Moreover, hardware restrictions on 1) the logical spin strengths, J F , 2) the size of the logical spins K, by way of the connectivity of the embedded graph, and 3) the temperature, also impose difficulties in skirting around this problem by using favourable parameter setting choices for the embedding or picking a low enough temperature.
Whilst the origin of points 2) and 3) is clear (fixed hardware graph and cooling limitations), we elaborate on 1).There are two factors to consider here.i) Physical device constraints may determine the maximum relative size of |J F | to problem couplings max i,j |J ij |.For example, on the current generation of D-Wave quantum annealing device, for problems with a reasonable distri-Figure 3. The probability PL ≡ P0 of observing a configuration from the logical subspace under an embedding, from Boltzmann sampling at two temperatures (see legend).The embedding is of a fully-connected graph, topology as described in Sect.II A, where each logical spin is made up of 3 spins (K = 3) and JF = −2.0.Couplings Jij and local fields hi chosen uniformly randomly from [−1 : 0.2 : 1] a .The solid lines represent the expected decay in PL from theory (Eq.( 13)).We see a clear exponential decay with problem size N .Each data point is averaged over 100 random instances.Error bars are one standard deviation over the problem instances.For each instance, we compute the exact PL by iterating over all configurations of the embedded problem (for N = 10 the embedded problem contains 2 30 configurations).
a -1 to +1 in steps of 0.2.bution of couplings, the largest relative coupling of |J F | to the largest problem coupling is 2 [10].ii) As mentioned previously, the landscape can become challenging to traverse if |J F | is too large, therefore practically limiting the size of this parameter.
We will now provide a counting argument which demonstrates these issues more precisely.

A. Analytic expression for relative subspace sampling
Let us assume for simplicity that each logical subtree is in fact a path of the same length; i.e. a linear chain, composed of K vertices.We denote by J F < 0 the ferromagnetic bonds linking the spins together.The native problem size is N , and therefore, the embedded version contains N ×K spins (vertices).We now estimate the relative sampling weight between subspaces with n broken logical spins (i.e.chains with not all identically aligned spins), under a Boltzmann distribution at inverse temperature β.In particular, we want to obtain P n , where P n is the probability of sampling from the subspace with n broken logical spins.This quantity will, of course, depend on details of the specific Hamiltonian, that is on the couplings J F , {J ij }, {h i } we are considering.To obtain an estimate of that, we consider its average with respect to the values of the couplings J ij and of the local fields h i , assuming that these random variables are independent and identically distributed with a symmetric probability density function.For simplicity, let us assume that their mean is zero.Now, consider two configurations, σ ( ) and σ( ) , with domain walls distributed over the chains (i.e.number of positions where the spin flips from one site to the neighbor within the chains).See Fig. 4. Notice that 0 ≤ ≤ N (K − 1).Let us relate the spin values of σ( ) i and σ , and else, −1 (where i = 1, . . ., N K).We have, labeling with p(σ) the probability averaged over the values of the couplings J ij and h i ("disorder") of the configuration σ, where Z = σ exp(−βH(σ)) is the partition function, and the overline denotes the average over the disorder.By re-defining couplings via J ij σ i σ j = Jij σi σj , where Jij = ξ i ξ j J ij (similar for J F and h i ), we can relate p(σ ( ) ) and p(σ ( ) ).In particular, as shown explicitly in Appendix A, we have where Z = Z (β, H, ξ) differs from Z = Z(β, H) through the re-mapping of variables via ξ.This calculation uses the fact that the average over the disorder is done with a probability density function which is symmetric with respect to a sign flip of each coupling J ij and h i (see Appendix A).Unfortunately, the change of sign of some of the couplings has the effect of changing the partition function Z → Z , and this is due to the fact that the ferromagnetic couplings J F are fixed and we are not averaging on their value.
To strongly simplify our equations, and ultimately allow us to estimate P n , we consider the so-called annealed approximation (see, for example, Ref. [11]), which consists in considering the couplings J ij and h i as dynamical variables, on the same footing of the spin variables.In this case and with Z = Z ann = Z , we obtain Therefore, under the annealed approximation, the probability of a configuration (averaged over the disorder) depends only on the number of domain walls.If we call p the probability of a configuration with domain walls, we have This fact, together with the fact that there are possible configurations with domain walls, allow us to write for the total probability of observing domain walls P ( ) : where P 0 is the probability to sample a configuration from the logical subspace (summed over all configurations and averaged over the disorder).In other words, P 0 = 2 N p 0 since there are 2 N possible logical configurations.
For the probability to observe a state outside the logical subspace P out , we have, by the binomial theorem, (12) Therefore, using that P 0 + P out = 1, Let us now turn to the general case, that is the computation of the probability of observing n broken chains.We have where the first binomial coefficient comes from the choice of n chains to break (among N available), the term are the possible configurations of n chains with q 1 , . . ., q n domain walls respectively, and p q1+•••+qn is the probability of observing q 1 + • • • + q n domain walls.We obtain where In particular, and we demonstrate the success of this equation, and so of the annealed approximation for our case, in Fig. 5, plotting for several parameter choices P n /P n−1 as a function of n/(N + 1).We now make some brief comments on these relations: i) Eqs. ( 13) and ( 15) are trivially exact for β → 0, since in this case all configurations are sampled equally.In general, the annealed approximation is correct in the thermodynamical limit as long as the partition function is a self-averaging quantity.This happens above the critical temperature of the spin glass transition.
ii) One consequence of our assumptions is that |J F | must be large enough so the global-to-global property holds, i.e.P 0 → 1 as β → ∞.In particular, if |J F | → ∞ then Eq. ( 13) is correct since P 0 → 1, and on the other hand if |J F | = 0 again Eq. ( 13) gives the correct result, that is each configuration has the same probability and therefore P 0 = 2 N /2 N K .The same, correct result is obtained for β = 0, where the annealed approximation is known to be exact.However, in general it is unclear the extent to which the annealed approximation gives an incorrect result in our computation for arbitrary temperatures or problem sizes.It is clear though, that if the global-to-global property does not hold, Eqs. ( 13), (15) will not be valid at low enough temperatures.
iii) The probability P 0 decays exponentially in problem size, and chain size.Thus there can be huge sampling benefits from utilizing more efficient embeddings with smaller chains.Compatible with intuition we see logical subspace sampling can be improved for larger β|J F | (colder temperature and/or stronger ferromagnetic couplings).iv) For hardware constrained β and J F (i.e. can not scale with N ), it is clear that for large enough problems, and ones with more complicated embeddings (larger K), there will inevitably be troubles sampling the logical subspace directly.In Fig 3 we show the decay of P 0 as a function of N , with K = 3, for two temperatures.The theory of Eq. ( 13) matches rather well with the numerical data, giving us confidence about the assumptions we made in our derivation, for the chosen parameters.
In the next subsection, we demonstrate the difficulty of solving this problem through a simple, but tractable, model.

B. Projection techniques and sampling bias
In this subsection, we describe limitations on postprocessing techniques that project from the embedded space back to the logical space.Specifically, we demonstrate by example that under reasonable assumptions on such projections, sampling bias is unavoidable.The example is simple and not contrived, suggesting that this bias is generally hard to avoid.The assumptions we make on the postprocessing are that 1) the temperature of the Boltz-mann distribution we are aiming for remains the same as for the logical subspace, 2) "if it ain't broke, don't fix it" -we do not adjust the values of any spins from nonbroken logical spins, 3) we do not discard solutions, and 4) we carry out the projection one solution at a time.These assumptions are motivated by the need to keep the postprocessing computational effort tractable and to avoid trivial solutions to the problem, such as providing Boltzmann samples at infinite temperature.It might be interesting to see if relaxing some of them, while keeping the computational effort reasonable, can lead to less bias or if one can prove that relaxing the assumptions does not help.These assumptions already encompass the leading postprocessing approach, majority vote, and allow for significantly broader approaches.In the next section, we will numerically demonstrate the significant bias resulting from majority voting, and provide an alternative that does better.
We prove the impossibility of postprocessing without biasing the sampling, under the assumptions above, by showing its impossibility for a simple case, i.e. through a counter example.Consider an N spin problem which is embedded by replacing one of its nodes with two nodes, resulting in an N + 1 spin problem.The postprocessing task is to provide means to decide, given a configuration in which the two spins in the logical spin do not align, with what probability they should be projected to both spin up, or both spin down (fixing the value of all other spins).The hope would be that after this projection, and with sufficiently many samples, the distribution is still Boltzmann at the same temperature.
Let us call C the configuration of the fixed N − 1 spins, and C −1,1 , C 1,−1 , C 1,1 , C −1,−1 the full configuration of N + 1 fixing the N − 1 spins as in C, with the subscript denoting the configuration of the logical spin.Similarly, we call the cost of these configurations E is the partition function for normalization.
Let us assume there does exist a procedure to re-map the probabilities such that they still follow a Boltzmann distribution at the same temperature.Then we have: The second equals sign is used to indicate we require that C −1,−1 , C 1,1 are sampled from a Boltzmann distribution with corresponding partition function over the logical subspace For now, let us assume no solutions are discarded, so that P (C) (−1,1)→(−1,−1) + P (C) (−1,1)→(1,1) = 1 (and similar for C (1,−1) ).In this case, these two equations, with two unknowns, can be solved.

One finds
which specifies two linear equations with the same gradients, but, in general, different intercept values, which therefore have no solutions.To see this, compare the ratio Z/Z L from solving Eqs.(21), with the exact which depends on all possible configurations c, and not just the single configuration C. In general, Eqs. ( 22) and (23) will not be the same, meaning the Eqs.( 19) cannot be simultaneously satisfied.We demonstrate this by example.
Now consider the quantity r(C) := Z Z L − 1 computed using the configurations C 1 and C 2 from Eq. ( 22): We have r(C 1 ) = r(C 2 ) (except for the very particular case β = 0), while the quantity Z Z L − 1 has to be configuration-independent as we can see from Eq. ( 23).Interestingly, in this case even knowing Z and Z L is not enough to solve this problem.Of course this does not exclude the possibility of obtaining Boltzmann samples from an embedded distribution by relaxing at lease one of the restrictions we imposed: 1) one may not require the final distribution is at the same temperature of the sampler, 2) one could use additional information about the structure of the problem, 3) one can discard certain configurations, or 4) performing post-processing on a large set of configurations.
Whilst the above argument indicates it is difficult, or impossible, to perfectly recover the target distribution, it is not clear the extent to which sampling can be biased by certain projection techniques.In the next sections we numerically study some examples.

A. Majority voting
In the context of optimization tasks, one will often use majority vote (MV) to obtain relevant solutions when illogical configurations (configurations outside of the logical subspace) are present in the sampling.This procedure is easy to implement and understand.Given a single configuration, for each logical spin which is not aligned identically, correct it by going with the majority.If there is a tie, one can pick at random.For optimization purposes, this is a simple way to obtain a greater number of solutions and does not cause any intrinsic issues.For sampling however, this introduces biases in the sampling rate of certain logical configurations.
We first demonstrate this by example using an embedding of a fully connected graph, where each variable becomes a logical spin of size K (see Fig. 2).The problems we study have values J ij and h i chosen uniformly randomly from [−1 : 0.2 : 1] (−1 to +1 with step size 0.2).We restrict our analysis for now to small sizes so we can exactly compute the probabilities of each configuration (i.e.compute the partition function).As a result, the largest system we study is 8 × 3 = 24 variables.In order to demonstrate the sampling bias for these small (numerically exactly solvable) problems, we take the temperature parameter β = 0.6.In general, colder temperatures will exhibit less bias (assuming the global-to-global property), by the arguments of the previous section.
Our analysis shows that in general, and unsurprisingly, performing majority vote induces biases into the sampling procedure, even when the ferromagnetic couplings are 'strong' (e.g.twice the magnitude of any coupling in the underlying Hamiltonian, as is the case in typical implementations on current hardware, such as the D-Wave 2000Q).An example of this is shown in Fig. 6 where one can notice a few distinctive features.1) The distribution after performing MV is not a Boltzmann distribution as the points do not lie on a straight line.2) Moreover, there exist configurations of the same cost, but different sampling rates.3) Assigning the best fit temperature to the distribution gives a hotter distribution compared to the sampling temperature; in particular, it tends to flatten out the distribution.
Indeed, in light of the discussion in Sect.III it is not surprising MV fails as it comes under a special case of the argument outlined which shows it is not possible in Figure 6.The effect of majority vote (MV) on sampling for an 8 variable fully connected problem.Here, E(c) is the cost associated with logical configuration c, and P (c) the corresponding sampling probability under a Boltzmann distribution.In the embedding, each variable becomes a logical variable of size K = 3 (see Fig. 2).We demonstrate with two different ferromagnetic coupling strengths, in units of the native Hamiltonian H.The straight lines are found by least squares fitting, where the gradient represents the inverse temperature (see legend).
general to perform such a mapping.What is perhaps not obvious is how poorly MV can perform, failing to capture much semblance of a Boltzmann distribution at all by biasing the statistics.We restricted ourselves to small sizes so that we could perform the computations exactly (i.e.analyzing all N ×K configurations), but our analysis also indicates that in general the biases associated with MV become more detrimental with size.
In Fig. 7 we notice two related effects.Firstly, larger problems are more adversely affected by MV as determined by the KL-divergence at the optimal temperature, and second, this optimal sampling temperature becomes hotter for larger problem sizes.The latter indicates the distribution is becoming flatter as problem size increases.This is not unexpected since here the temperature and ferromagnetic couplings J F are not scaling with problem size, and by the arguments in the previous section one therefore expects to observe a greater number of states outside of the logical subspace.

B. A better approach: restricted resampling
Here we outline a new approach called restricted resampling (RRS) to overcome some of the issues outlined above, inspired by thermal sampling algorithms.As before, we assume one receives perfect thermal (Boltzmann) samples of the embedded problem, at some inverse temperature β [12].In RRS, one performs a thermal re-Figure 7. KL divergence of majority voted distribution to Boltzmann distribution P (β) at inverse temperature β, as a function of problem size N (number of variables in fully connected graph).The sampling of the embedded problem was performed with β = 0.6.We use logical spins of size K = 3 for the embedding (as in Fig. 2).Each data point is averaged over 500 random problems and embeddings.The solid blue curve is the KL divergence between the MV data and the 'ideal' Boltzmann distribution (i.e. if no embedding was required).The dash blue curve is the KL divergence between the MV data and a Boltzmann distribution at the optimal inverse temperature βopt (which is found, for each problem, by minimizing the KL divergence).The dotted red line (right y-axis) is the optimal fitting inverse temperature.Error bars are standard deviation.Here |JF | = 2 in units of the Hamiltonian.
sampling at the designated temperature over a restricted number of problem variables.In particular, when one observes a configuration with N B broken logical spins, one implements a 'resampling' of these variables within the logical space at inverse temperature β; that is, one effectively performs a Monte Carlo algorithm over a space of size 2 N B .Though this does not guarantee to perfectly recover a Boltzmann distribution (again, this algorithm also falls under the arguments outlined in Sect.III), we show numerically it clearly outperforms MV.We therefore propose RRS as an alternative to majority vote and other similar projection techniques.
We outline the general idea of RRS in Algs. 1 and 2. This pseudocode is intended to just give the basic outline of how one could implement RRS, and we stress that any algorithm which can provide thermal samples can be used as the subroutine Alg. 2. For example, one could use cluster flips instead of single spin flips, or replicaexchange Monte Carlo (parallel tempering), to generate the samples.
In Alg. 1 we first construct the set B of broken logical spins, and also a configuration which respects the spinvalues for the logical spins which are not broken.We then thermally resample this configuration at temperature β, but only resampling over the set of spins B. for k = 1 N do 5: return BoltzmannSampleOverSubset(H,β,C,B) 13: end procedure Algorithm 1: Outline of RRS algorithm.The input is a configuration C ∈ H from the embedded space, the native Hamiltonian H (over N spin variables), and the desired sampling inverse temperature β.In line 5, V (T k ) corresponds to the vertices of the k-th logical subtree T k .S k is therefore the configuration of the k-th logical spin.An example implementation of the subroutine BoltzmannSampleOverSubset is given in Alg. 2. C ← FlipRandomSpinFromSet(C,B) 5: if Random(0,1) < min(1, e −β(E −E) ) then In line 4, FlipRandomSpinFromSet(C,B) will flip a spin in configuration C, chosen randomly from set B. We do not specify explicitly the break condition for the while loop since this is up to user implementation (e.g. after a fixed number of steps, or after the energy landscape has been explored sufficiently).
In Fig. 8, the analogue of Fig. 6 of the previous section, we show the effect of RRS for a single problem instance.We see that the resampled distribution is much closer to the ideal as compared to using MV.In particular, the effective temperature after resampling is almost identical to the temperature of the underlying distribution, and configurations of the same cost are sampled with much less variation, as compared to MV.
Note, for our simulations we do this remapping exactly by computing the partition function.In practice, one would need to implement a thermal sampling algorithm, for example based on Monte Carlo techniques.
In Fig. 9 we see that the scaling of RRS is much more  favorable than MV.Moreover, in Fig. 10 we see the effective sampled temperature after applying RRS is much closer to the physical sampling temperature.

C. Discussion
We have identified a potential issue for hardware restricted Boltzmann samplers where embeddings must be used.Whilst for strong enough logical spins (ferromagnetic couplings |J F |) and low enough temperatures it is exponentially unlikely in β|J F | to leave the logical space, in reality, these couplings are limited by hardware and do not scale with N .In fact, in current hardware such as the D-Wave 2000Q, |J F | is typically limited to a strength twice that of a problem coupling.To make matters worse, Ref. [13] found that effective sampling temperatures on an experimental quantum annealer tend to increase with problem size.Embedding therefore inevitably leads to the observation of states which are not in the logical subspace, and since the probability of this occurring nominally scales exponentially in N (Eq.( 13)), even for moderately sized systems, one may rarely (or never) observe logical configurations.These states are not erroneous, caused by errors in the device, but perfectly acceptable configurations in accordance with the Boltzmann distribution of the embedded problem.The task therefore is, given a sampler which works perfectly, what can be done to project back all configurations to the logical subspace, so that the distribution observed is the desired one (e.g. a Boltzmann distribution).If these so called illogical states were observed infrequently, a perfectly acceptable solution would be to simply discard these states, since the relative sampling weights are the same in the logical space of the embedded problem, and the native problem (Eq.( 4)) We argued in Sect.III that under a reasonable set of assumptions, it is not possible to find such a projection in general which works without error.Our argument assumed that 1) the temperature must remain fixed 2) no illogical configurations are discarded, 3) the projection is performed without knowledge of other configurations, and 4) only broken logical spins are changed.This includes a wide range of projection algorithms and applies to techniques such as majority vote (MV), and our introduced restricted resampling (RRS) scheme.This does not preclude the possibility of more advanced schemes where one may violate our assumptions above, for example, collecting many samples first and then performing the projection over the set of samples (e.g. through machine learning techniques), or discarding certain samples.
We have shown that one commonly used technique in the setting of optimization, majority vote, can fail quite spectacularly to capture the intended distribution.The reason for this is it introduces biases to the statistics, and the result is two logical states of the same cost can be sampled at massively different rates (e.g. over an order of magnitude difference in sampling probability).Moreover, the effective temperature after performing MV is much larger than the sampling temperature; i.e. it tends to flatten out the distribution.
We introduced a partial solution to this problem through a scheme called restricted resampling, where one resamples over a restricted set of variables; the ones which are not in the logical space.This not only clearly outperforms MV, but it also gives a distribution with a temperature much closer to the desired one.This resampling can be performed by a classical algorithm (such as a Monte Carlo type algorithm).
We show another example of a comparison between MV and RRS for a larger problem in Fig. 11, where all samples are generated by a Monte Carlo thermal sampler (described in Appendix B).This is in contrast to the previous sections where we exactly computed for small sizes the resampling weights for individual configurations.Since estimating the configuration probabilities is infeasible in this case (with > 100 spins), we focus on estimating the probability of an energy level being sampled P i = gi Z exp(−βE i ).One can see again that RRS matches closer to the ideal distribution, although there is a large variation between different samples (large error bars), in both cases.Fluctuations in the P i is due to errors arising from the inexact Monte Carlo implementation, and also due to biases from the projection methods.
One drawback of RRS is that it can be quite computationally intensive; indeed, when given a configuration where each logical spin has misaligned spins, RRS is equivalent to performing Boltzmann sampling in the entire space.If one regularly observes states where ∼ N logical spins are not aligned, then this will quickly become infeasible.By our Eq.(17) this is determined by the penalty weight term P w ; since P n /P n−1 is decreasing in n (and P 1 /P 0 > 1), the most probable number n max of broken logical spins (P nmax ≥ P n ) is found by setting P n = P n−1 which gives This means if P w is 'large', one may regularly find sam- ples with O(N ) broken logical spins.We see therefore that in looking to sample large problem sizes would require P w ∼ O(1/N ), which, from Eq. ( 16), can be achieved by scaling This scales very reasonably in N and K as shown in Fig. 12, suggesting the possibility of achieving this on hardware in the future.However, even without this restriction there is still hope.For example, for the parameters examined in this work, if β = 0.6 and J F = −2 (in units of the logical Hamiltonian), for chains of length K = 3 we get P w = 0.19, which means n max ∼ N/6.26 for large N .If we wish to sample a 1000 spin (logical) problem, RRS would likely only need to handle up to 300 spins which is significantly easier.Letting J f = −4 reduces the size RRS needs to handle further to around 20 spins (with 1 + P −1 w = 62).

V. CONCLUSION
We have demonstrated a clear potential pitfall for any thermal sampler with a restricted topology, such as a quantum annealer for use in machine learning.We showed, that under the annealed approximation of spinglasses, samples from the subspace one wishes to probe, the logical subspace, are exponentially unlikely in problem size and the complexity of the embedding (size of the logical spins K).We found analytic expressions which numerically capture this unfavourable scaling with good accuracy, for parameters studied in this work.We proposed a new method for projecting states back to the logical subspace, and propose a scaling for the ferromagnetic coupling strength of logical spins J F which guarantees the computational plausibility of this scheme.Fortunately, this scaling is only logarithmic in problem size N .
Going forward, it would be beneficial to improve, or bound in (β, |J F |), the accuracy of our general model (Eqs.( 13), ( 15)), perhaps by restricting to certain problem classes and therefore making more informed approximations.Moreover, there are many questions about how different problem types are effected by embeddings on various topologies.Similarly, it would be useful to obtain results for larger problem sizes and a larger range of temperatures, either analytically where possible, or through advanced sampling techniques (such as parallel tempering).Lastly, it is clear there is a lot of room for development of new projection techniques, expanding on, or going beyond the introduced RRS scheme.In RRS, it is assumed the temperature of the thermal sampler is known, and this may not always be the case; for example, in quantum annealers different sets of problems may be sampled at effectively different temperatures [13][14][15].One would first therefore need to estimate the temperature [2,[16][17][18][19].Since in general one will not obtain the exact temperature, a further study of importance is how the performance of RRS depends on noise in the temperature parameter.

Figure 1 .
Figure 1.Example of embedding a fully connected graph of three nodes (triangle) into a square graph.The edges have weights given by Jij defined by the Hamiltonian Eq. (1).In the embedded graph (right) an additional variable is used, with the green vertex being split into two, coupled with strength JF .This combined variable is often referred to as a logical vertex, or logical spin.Note, the embedding process is in general not unique.

Figure 4 .
Figure 4. Example of spin configuration of chain of size K = 6 with n dw = 3 domain walls.Vertical dash lines represent positions of the domain walls where the spin value changes between sites.There can be at most K−1 domain walls.The red links represent couplings JF .The energy increase (penalty) for introducing n dw domain walls is 2n dw |JF |.There are 2 K−1 n dw possible configurations of a spin chain with n dw domain walls.

Figure 5 .
Figure 5.We compare our theoretical Eq. (17) (dash lines) to numerical simulations, where Pn is the probability of observing a configuration with n broken logical spins.Each data point is an average over 100 random embedded problems for various choices of N, n, and with parameters given in the legend.Error bars are standard deviation.We fix |JF | = 2 in units of the original Hamiltonian for all data points.
for a, b ∈ {−1, 1}.With probability P (C) (a,b)→(c,c) configuration C a,b is projected to C c,c (c ∈ {−1, 1}).If the logical spin is aligned, we should not change it; P (C) (a,a)→(a,a) = 1.The probability to observe configuration C a,b , before any projection, is exp(−βE

Figure 8 .
Figure 8.The effect of RRS on sampling for the same 8 variable fully connected problem of Fig.6.Here, E(c) is the cost associated with logical configuration c, and P (c) the corresponding sampling probability under a Boltzmann distribution.In the embedding, each variable becomes a logical variable of size K = 3 (see Fig.2).The ferromagnetic coupling strengths is units of the native Hamiltonian H.The straight lines are found by least squares fitting, where the gradient represents the inverse temperature (see legend).It is clear that RRS outperforms MV.

Figure 9 .
Figure 9. KL divergence to ideal Boltzmann distribution after performing projection Π of RRS (red) or MV (blue).N is the native problem size, β = 0.6, with embedding as described in Sect.II A using K = 3 and JF = −2.Error bars (standard deviation) are over 500 random samples.

Figure 10 .
Figure 10.RRS version of Fig.7, with the same parameters.The effective temperature is much closer to the sampled temperature, although still decreasing with problem size.Similarly, the KL divergence values are less, by around an order of magnitude.

Figure 11 .
Figure 11.Comparison of RRS and MV for larger problem using Monte Carlo thermal sampler.Here the native problem is fully connected of size N = 35 with couplings and local fields uniformly random from [−1 : 0.2 : 1].Since the native problem is small enough, we can exactly compute the degeneracies gi for each energy level Ei.The blue solid line is the exact profile.Pi is the probability of observing energy level Ei under the sampling.The blue dots (with error bars smaller than the dots) is from sampling from the 35 spin problem using a Monte Carlo algorithm with β = 0.6, showing excellent agreement with the exact solid line.The red (MV) and yellow (RRS) dots with error bars (standard deviation) are from sampling the embedded problem (topology as in Sect.II A) with K = 3 and JF = −2 (in this case the embedded problem contains 35 × 3 = 105 spins).The red and yellow solid lines are from least squares fitting with gradient representing the sampling inverse temperature β as in the legend.The Monte Carlo algorithm uses 1000 thermalization steps per sample, with 200 random initializations and 10 6 samples per realization.

Figure 12 .
Figure 12.Graph of Eq. (27) for proposed scaling of |JF | (relative to problem couplings |Jij|) in order to minimize number of broken chains.We plot for three choices of embedding size K, and two temperatures.