Fault-tolerant quantum speedup from constant depth quantum circuits

A defining feature in the field of quantum computing is the potential of a quantum device to outperform its classical counterpart for a specific computational task. By now, several proposals exist showing that certain sampling problems can be solved efficiently on a quantum device but not efficiently classically, assuming strongly held conjectures in complexity theory, a feature dubbed quantum speedup. However, the effect of noise on these proposals is not well understood in general, and in certain cases it is known that simple noise can destroy the quantum speedup. Here we develop a fault-tolerant version of one family of these sampling problems, which we show can be implemented using quantum circuits of constant depth. We present two constructions, each taking $poly(n)$ physical qubits, some of which are prepared in noisy magic states. The first of our constructions is a constant depth quantum circuit composed of single and two-qubit nearest neighbour Clifford gates in four dimensions. This circuit has one layer of interaction with a classical computer before final measurements. Our second construction is a constant depth quantum circuit with single and two-qubit nearest neighbour Clifford gates in three dimensions, but with two layers of interaction with a classical computer before the final measurements. For each of these constructions, we show that there is no classical algorithm which can sample according to its output distribution in $poly(n)$ time, assuming two standard complexity theoretic conjectures hold. The noise model we assume is the so-called local stochastic quantum noise. Along the way, we introduce various new concepts such as constant depth magic state distillation (MSD) and constant depth output routing, which arise naturally in measurement based quantum computation (MBQC), but have no constant-depth analogue in the circuit model.

Introduction - Quantum computers promise incredible benefits over their classical counterparts in various areas, from breaking RSA encryption [1], to machine learning [2], and improvements to generic search [3], among others [4,5]. Although these and other examples of quantum algorithms do outperform classical ones, on the practical level they generally require quantum computers with a high level of fault-tolerance and scalability, the likes of which appear to be out of the reach of current technological developments [6]. An interesting question is thus what can be done with so-called sub-universal quantum devices, which are not universal, in the sense that they cannot perform any quantum computation, but are realizable in principle by our current technologies. Several examples of such practically motivated sub-universal models which nevertheless capture a sense of quantum advantage have been discovered in recent years [7-22]. In most of these works, sampling from the output probability distribution of these sub-universal devices has been shown to be classically impossible to do efficiently, provided widely believed complexity theoretic conjectures hold [7,8]. Thus, these devices demonstrate what is known as an exponential quantum speedup.
The first experimental demonstration of quantum speedup is a major milestone in quantum information.
Recent audacious experimental efforts [18] and subsequent proposals of their classical simulation [23] bring to light the challenges and subtleties of achieving this goal. Statements of quantum speedup are complexity theoretic in nature, making it difficult to pin down when a problem can in practice be simulated classically or not, even if we know that efficient classical simulation is impossible in the limit of 'infinite size' experiments. At the same time, the role of noise in simplifying the simulation is ever more important: as systems grow, noise becomes more difficult to control, it is a subtle question as to when it dominates, and even simple noise can very easily lead to a breakdown of quantum speedup. Indeed, in [24-30] it was shown that noise generally renders the output probabilities of these devices (which in the noiseless case demonstrate quantum speedup) efficiently simulable classically. There is clearly a great need to better understand the effect of noise, and to develop methods of mitigation.
Applying the standard approach to deal with noise in computation, fault-tolerance, is non-trivial in this setting for at least two reasons [31-34]. Firstly, the resources it consumes can be huge. Secondly, it typically involves operations that step outside of the simplified computational model that makes it attractive in the first place. For example, in [7] the sub-universal model IQP was defined, essentially as the family of circuits where all gates are diagonal in the X-basis, and shown to provide sampling problems demonstrating quantum speedup in the noiseless case. However, in [24] it was shown that a simple noise model (each output bit undergoes a bit flip with probability ε) renders the output probabilities of sufficiently anti-concentrated IQP circuits efficiently simulable classically. Interestingly, for this special type of noise, they also show that quantum speedup can be recovered using classical fault-tolerance and larger encodings of the problem quantumly, still within the IQP framework [24]. However, for more general noise (for example Pauli noise in all the Pauli bases), this does not appear to work, and it is not obvious whether it is possible to do so within the constrained computational model. In this case that would mean keeping all gates diagonal in X, which is not obviously possible, as typical encoding and syndrome measurements involve more diverse gates.
In this work, we study how quantum speedup can be demonstrated in the presence of noise for a family of sampling problems. We take the local stochastic quantum noise model (we will also refer to this noise as local stochastic noise), commonly studied in the quantum error correction and fault-tolerance literature [21, 35-38]. Our sampling problems are built on a family of schemes essentially based on local measurements on regular graph states, which correspond to constant depth 2D nearest neighbor quantum circuits showing quantum speedup [11-13, 15, 20, 39]. We show that these can be made fault-tolerant in a way which maintains constant depth of the quantum circuits, albeit with large (but polynomial) overhead in the number of ancilla systems used, and at most two rounds of (efficient) classical computation during the running of the circuit.
We present two different constructions based on two different techniques of fault-tolerance, the first of which involves the use of transversal gates and topological codes each encoding a single logical qubit [21,31,40]. This construction results in a constant depth quantum circuit demonstrating a quantum speedup but, because of the need for long range transversal gates, can only be viewed as a quantum circuit with single qubit Clifford gates and nearest neighbor two-qubit Clifford gates in 4D (we will henceforth refer to this as our 4D nearest neighbor (NN) architecture). Our second construction avoids using transversal gates by exploiting topological defect-based quantum computation [41], thereby resulting in a constant depth quantum circuit which is a 3D NN architecture. The tradeoff, unfortunately, is that our 3D NN architecture requires polynomially more ancillas than our 4D NN architecture, and has two layers of interaction with a classical computer, as compared to one such layer in our 4D NN architecture.
Our first construction in 4D uses several techniques from [21], in particular regarding the propagation of noise through Clifford circuits. For the second construction, we also develop techniques from [42]. In [42], a construction for fault-tolerant quantum speedup was presented which consisted of a constant depth quantum circuit obtained by using defect-based topological quantum computing [41]. This construction is non-adaptive (no interaction with a classical computer during the running of the circuit), and can be viewed as a 3D NN architecture. The main disadvantage of the construction in [42] was the magic state distillation (MSD) procedure employed, which makes the scheme impractical in the sense that one should repeat the experiment an exponential number of times in order to observe an instance which is hard for the classical computer to simulate. In both our 3D and 4D NN constructions, we overcome this problem by optimizing our MSD procedure, thereby making the appearance of a hard instance very likely in only a few repetitions of the experiment, a feature called single-instance hardness [11]. This, however, comes at the cost of adding adaptive interactions with the classical computer while running the quantum circuit.
This paper is organised as follows. First, we introduce the family of sampling problems using graph states, on which our constructions are based. After briefly defining the noise model, we describe in detail the encoding procedure for our 4D NN architecture. We then describe the effects of noise on our construction, step by step, starting from the Clifford part of the circuit and ending with the MSD, while introducing our optimized MSD techniques based on MBQC, namely constant depth non-adaptive MSD and MBQC routing. Finally, we explain how to modify, using our optimized MSD techniques, the 3D NN architecture in [42] in order to give rise to the single-instance hardness feature [11]. Note that in our 3D NN architecture, we use different (fixed) measurement angles to those in [42] to construct a different sampling problem having an anti-concentration property [13,14,39].
Graph state sampling - Our approach is to construct a fault-tolerant version of the architectures based on measurement based quantum computation (MBQC) [43], which have recently been shown to demonstrate a quantum speedup [11-13, 15, 20, 39]. In these constructions, the sampling is generated by performing local measurements on a large entangled state, known as a graph state. Given a graph G, with vertices V and edges E, the associated graph state $|G\rangle$ of $|V|$ qubits is defined as
$$|G\rangle = \prod_{(i,j) \in E} CZ_{ij} \, |+\rangle^{\otimes |V|}, \qquad (1)$$
where $|+\rangle = (|0\rangle + |1\rangle)/\sqrt{2}$ and $CZ_{ij}$ is the controlled-Z gate (CZ) acting on qubits i and j connected by an edge. For certain graphs of regular structure, such as the cluster [43] or brickwork [44] states, applying single qubit measurements at particular choices of angles in the XY-plane effectively samples distributions in a way that is impossible to do efficiently classically, up to the standard assumptions [11-13, 39, 42].
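As an illustration of the graph state definition above, the following is a minimal numerical sketch, assuming a dense state-vector representation that is only practical for a handful of qubits. The function names `graph_state` and `stabilizer` and the four-vertex path graph are hypothetical choices made for this example; they do not come from the constructions of [39]. The sketch builds the state by applying the CZ signs to the uniform superposition and then checks the standard stabilizer condition $X_v \prod_{u \sim v} Z_u |G\rangle = |G\rangle$.

```python
import numpy as np

def graph_state(num_qubits, edges):
    """State vector of prod_{(i,j) in E} CZ_ij |+>^{(x) num_qubits}.

    Qubit 0 is the most significant bit of the basis-state index.
    """
    dim = 2 ** num_qubits
    state = np.full(dim, 1.0 / np.sqrt(dim), dtype=complex)  # |+>^{(x) n}
    for index in range(dim):
        bits = [(index >> (num_qubits - 1 - q)) & 1 for q in range(num_qubits)]
        # Each CZ contributes a factor -1 when both of its qubits are 1.
        if sum(bits[i] * bits[j] for i, j in edges) % 2:
            state[index] *= -1
    return state

def stabilizer(num_qubits, edges, vertex):
    """Dense matrix of K_v = X_v * prod_{u adjacent to v} Z_u."""
    X = np.array([[0, 1], [1, 0]], dtype=complex)
    Z = np.diag([1, -1]).astype(complex)
    I = np.eye(2, dtype=complex)
    neighbours = {j for i, j in edges if i == vertex}
    neighbours |= {i for i, j in edges if j == vertex}
    op = np.array([[1.0 + 0j]])
    for q in range(num_qubits):
        factor = X if q == vertex else (Z if q in neighbours else I)
        op = np.kron(op, factor)
    return op

# Hypothetical example graph: a path on four vertices (a linear cluster).
edges = [(0, 1), (1, 2), (2, 3)]
psi = graph_state(4, edges)
is_stabilized = all(
    np.allclose(stabilizer(4, edges, v) @ psi, psi) for v in range(4)
)
```

Here `is_stabilized` confirms that every vertex stabilizer fixes the constructed state, which is an equivalent characterization of a graph state.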
Although our techniques can be applied to any such architecture where the measurement angles in the XY-plane of the Bloch sphere are chosen from the set $\{0, \pi/2, \pi/4\}$ [11-13, 39], for concreteness we will focus on the architecture of [39]. Here, this red horizontal line is a linear cluster of twelve qubits measured at an XY angle of 0; this is in order to make the construction nearest neighbor. Note that this only adds single qubit random Pauli gates to the random gates of [39], and therefore does not affect their universality capacity in implementing a t-design.
Following [39], we start with a regular graph state, closely related to the brickwork state [44], composed of n rows and k columns. Then we (non-adaptively) measure the qubits of all but the last column at pre-specified fixed XY-angles from the set $\{0, \pi/2, \pi/4\}$, effectively applying a unitary on the n unmeasured qubits. This is illustrated in Figure 1.
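The statement that measuring at an XY-angle effectively applies a unitary on the unmeasured qubits can be checked on the smallest possible case. The sketch below is an illustration under stated assumptions, not the construction of [39]: it uses a two-qubit cluster state, the convention that an XY measurement at angle θ projects onto $(|0\rangle \pm e^{i\theta}|1\rangle)/\sqrt{2}$, and $Z(\phi) = \mathrm{diag}(1, e^{i\phi})$. The helper names are hypothetical. Under these conventions the unmeasured qubit is left in $X^s H Z(-\theta)|+\rangle$ for outcome s.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
X = np.array([[0, 1], [1, 0]], dtype=complex)
CZ = np.diag([1, 1, 1, -1]).astype(complex)
PLUS = np.array([1, 1], dtype=complex) / np.sqrt(2)

def z_rot(phi):
    """Z(phi) = diag(1, exp(i*phi)); this convention is an assumption."""
    return np.diag([1, np.exp(1j * phi)])

def measure_first_qubit_xy(theta, outcome):
    """Post-measurement state of qubit 2 of a two-qubit cluster state."""
    cluster = CZ @ np.kron(PLUS, PLUS)
    sign = 1 if outcome == 0 else -1
    # Bra of (|0> + sign * e^{i theta} |1>) / sqrt(2).
    bra = np.array([1, sign * np.exp(-1j * theta)]) / np.sqrt(2)
    remaining = bra @ cluster.reshape(2, 2)  # contract the first qubit
    return remaining / np.linalg.norm(remaining)

def expected_state(theta, outcome):
    """X^outcome H Z(-theta) |+>, the unitary picture of the measurement."""
    state = H @ z_rot(-theta) @ PLUS
    return X @ state if outcome == 1 else state

def overlap(theta, outcome):
    """Modulus of the inner product; 1 means equal up to a global phase."""
    return abs(np.vdot(expected_state(theta, outcome),
                       measure_first_qubit_xy(theta, outcome)))

overlaps = [overlap(t, s) for t in (0.0, np.pi / 4, np.pi / 2) for s in (0, 1)]
```

All entries of `overlaps` equal 1 up to numerical precision for the three angles used in the text, and both outcomes occur with probability 1/2, which is why the outcome string selects one unitary uniformly at random.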
Let $V_1 \subset V$ be the set of qubits which are measured at an XY angle $\pi/4$ and $V_2 \subset V$ the set of qubits which are measured at an XY angle $\pi/2$. One can equivalently perform local rotations to the graph state and measure all systems in the Z basis. In this way, with the locally rotated graph state defined as in Equation (2), this procedure effectively samples from the ensemble of unitaries $\{\frac{1}{2^{n(k-1)}}, U_s\}$, where $s \in \{0,1\}^{n(k-1)}$ is the string of measurement outcomes. It was shown in [39] that, for a suitable number of columns k depending on n, t and ε, this ensemble has the property of being an ε-approximate unitary t-design [45], that is, it approximates sampling on the Haar measure up to the t-th moments. This property allows us to reduce the requirements for the proof of quantum speedup, since it implies anti-concentration for t = 2 from [13].
Measuring the qubits of the last column in the computational (Z) basis and denoting the outcome by a bit string $x \in \{0,1\}^n$, our construction samples the bit strings s, x with probability D(s, x), given by Equations (3) and (4). Fixing t = 2 and ε to an appropriate value, the value of k becomes k = O(n); we will use this value of k throughout this work. The results of [13,14] directly imply (see also [15]) that the distribution satisfies the following anti-concentration property [13,14]:
$$\Pr_{s,x}\left( D(s,x) \ge \frac{\alpha}{2^{kn}} \right) \ge \beta, \qquad (5)$$
where α is a positive constant, 0 < β ≤ 1, and $\Pr_{s,x}(\cdot)$ is the probability over the uniform choice of bit strings s and x. By using the same techniques as [9,13,15], the following proposition can be shown.
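To give a feel for what an anti-concentration property asserts, the following sketch estimates the fraction of outcomes whose probability is at least α/N, where N is the number of outcomes. It is an illustration under a loudly stated substitution: Haar-random states (generated as normalized complex Gaussian vectors) stand in for the output states of the ensemble of [39], which is only an approximate 2-design. The function name, the threshold form α/N, and all parameter values are assumptions made for this example.

```python
import numpy as np

def anticoncentration_fraction(num_qubits, num_states, alpha, seed=0):
    """Fraction of (state, outcome) pairs with probability >= alpha / N.

    Assumption: Haar-random states replace the design ensemble of the text.
    """
    rng = np.random.default_rng(seed)
    dim = 2 ** num_qubits
    # A normalized complex Gaussian vector is a Haar-random pure state.
    amps = rng.normal(size=(num_states, dim)) \
        + 1j * rng.normal(size=(num_states, dim))
    probs = np.abs(amps) ** 2
    probs /= probs.sum(axis=1, keepdims=True)
    return float(np.mean(probs >= alpha / dim))

fraction = anticoncentration_fraction(num_qubits=6, num_states=2000, alpha=1.0)
```

For Haar-random states the output probabilities follow the Porter-Thomas distribution, so with α = 1 the fraction is close to 1/e. A constant fraction of this kind, independent of the system size, is what a constant β expresses.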
Proposition 1. Given that the polynomial hierarchy (PH) does not collapse to its 3rd level, and that the worst-case hardness of approximating the probabilities of D (Equations (3) and (4)) extends to average-case, there exists a positive constant µ such that no poly(n)-time classical algorithm C exists that can sample from a probability distribution $D_C$ such that
$$\sum_{s,x} |D_C(s,x) - D(s,x)| \le \mu. \qquad (6)$$
Indeed, as shown in [39], Equation (2) can be viewed as implementing a 1D random circuit, as those in [46]. In this picture the circuits have depth O(n) (for fixed t and ε) and are composed of 2-qubit gates which are universal on U(4). These circuits are therefore universal under postselection, implying that there exist probabilities D(s, x) which are #P-hard to approximate up to relative error 1/4 + o(1) [47] (this property is referred to as worst-case hardness of approximating the probabilities of D, or for simplicity worst-case hardness). Worst-case hardness together with the anti-concentration property of Equation (5) mean that the techniques of [9] directly prove Proposition 1.
Note that Proposition 1 is a conditional statement, meaning that it is true provided some conjectures hold. The first is that the PH does not collapse to its 3rd level, a generalization of P ≠ NP, which is widely held to be true [48]. The second conjecture is that the worst-case hardness of the problem extends to average-case, meaning roughly that most outputs are hard to approximate up to relative error 1/4 + o(1). Although this conjecture is less widely accepted, there exists evidence to support it, mainly in the case of random circuits sampling unitaries from the Haar measure [49,50]. Particularly relevant to our case are arguments in [12,15] which give convincing evidence that worst-case hardness should extend to average-case for distributions of the form D(s, x) (Equation (3)), where the uniform distribution over bit strings s effectively makes D(s, x) more flat as compared to, say, the outputs of random quantum circuits [49,50] or standard IQP circuits [9]. Also, in [11] an average-case hardness conjecture was stated involving an MBQC construction with fixed XY angles, as is the case here. Furthermore, we note that a worst-to-average-case conjecture is effectively always required in all known proofs of hardness of approximate classical sampling up to a constant error in the $l_1$-norm [51].
The circuit implementing this construction is constant depth. To see this, notice that the regularly structured graph states of [11-13, 15, 20, 39] can be constructed from constant depth quantum circuits composed of Hadamard (H) and CZ gates [52]. The measurements, being non-adaptive, may be performed simultaneously (depth one). The explicit form of the circuit can be seen by re-writing the state $|G\rangle$ as in Equation (7), where $|T\rangle = Z(\pi/4)H|0\rangle = (|0\rangle + e^{i\pi/4}|1\rangle)/\sqrt{2}$ is referred to as the T-state or magic state. Taking out the T-state explicitly in this way will be useful for applying fault-tolerant techniques. In this way, these architectures can be viewed as constant depth 2D circuits with NN two-qubit gates [53].
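The constant depth claim for regular graphs can be made concrete with a small scheduling sketch. The example below assumes a plain rows × cols square grid, which is a simplification of the brickwork-like graph of [39], and the function name `grid_cz_layers` is hypothetical. It partitions the grid edges into four layers such that no qubit appears twice in a layer, so all CZ gates of one layer can be applied simultaneously and the depth is independent of the grid size.

```python
def grid_cz_layers(rows, cols):
    """Split the edges of a rows x cols grid into 4 layers of disjoint CZs.

    Layers 0 and 1 hold horizontal edges starting at even and odd columns;
    layers 2 and 3 hold vertical edges starting at even and odd rows.
    """
    layers = [[], [], [], []]
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                layers[c % 2].append(((r, c), (r, c + 1)))
            if r + 1 < rows:
                layers[2 + r % 2].append(((r, c), (r + 1, c)))
    return layers

def is_matching(layer):
    """True when no qubit takes part in two gates of the same layer."""
    qubits = [q for edge in layer for q in edge]
    return len(qubits) == len(set(qubits))

layers = grid_cz_layers(5, 7)
all_parallel = all(is_matching(layer) for layer in layers)
num_edges = sum(len(layer) for layer in layers)
```

Together with one layer of Hadamard gates to prepare the $|+\rangle$ states, this gives a depth that does not grow with the number of rows or columns.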
We will show that this constant depth property prevails in our fault-tolerant version of these architectures as well, in our case using 4D and 3D circuits with NN two-qubit gates. As a final remark, note that the 2D NN circuit presented here has the single-instance hardness property, because the choice of measurement angles is fixed [11].
Noise model - Before going into details of the fault-tolerant techniques, we present the noise model which we adopt. We will consider the local stochastic quantum noise model, following [21,35]. Local stochastic noise can be thought of as a type of noise where the probability of the error E occurring decays exponentially with the size of its support. This noise model encompasses errors that can occur in qubit preparations, gate applications, as well as measurements. It also allows for the errors between successive time steps of the circuit to be correlated [21,35]. More precisely, following the notation in [21,35], a local stochastic noise with rate p, where p is a constant satisfying 0 < p < 1, is an m-qubit Pauli operator $E = \otimes_{i=1,...,m} P_i$, where $P_i \in \{1, X, Y, Z\}$ are the single qubit Pauli operators, such that for all $F \subseteq \{1, ..., m\}$,
$$\Pr[F \subseteq \mathrm{Supp}(E)] \le p^{|F|},$$
where $\mathrm{Supp}(E) \subseteq \{1, ..., m\}$ is the subset of qubits for which $P_i \ne 1$. Also following the notation in [21,35], we will denote a local stochastic noise with rate 0 < p < 1 as $E \sim N(p)$.
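The defining condition of local stochastic noise can be checked empirically on its simplest instance. The sketch below assumes independent, identically distributed Pauli noise, where each qubit independently suffers a uniformly random non-identity Pauli with probability p. This is only one member of the local stochastic family (which also allows correlated errors), and the function names and parameter values are hypothetical. For this i.i.d. model the probability that a subset F lies in the support equals $p^{|F|}$ exactly, so the bound is saturated.

```python
import numpy as np

def sample_iid_pauli_noise(num_qubits, p, num_samples, seed=0):
    """Sample Pauli errors as integer arrays: 0 = identity, 1/2/3 = X/Y/Z."""
    rng = np.random.default_rng(seed)
    hit = rng.random((num_samples, num_qubits)) < p
    paulis = rng.integers(1, 4, size=(num_samples, num_qubits))
    return np.where(hit, paulis, 0)

def prob_subset_in_support(samples, subset):
    """Empirical estimate of Pr[F is contained in Supp(E)]."""
    return float(np.mean(np.all(samples[:, list(subset)] != 0, axis=1)))

p = 0.1
samples = sample_iid_pauli_noise(num_qubits=6, p=p, num_samples=200_000)
estimate_pair = prob_subset_in_support(samples, (0, 3))   # compare with p**2
estimate_single = prob_subset_in_support(samples, (2,))   # compare with p
```

The estimates agree with $p^2$ and $p$ up to sampling error, illustrating the exponential decay of the probability with the size of the support.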
We will use the following property of local stochastic noise, shown in [21], which says that all errors for constant depth Clifford circuits can be pushed to the end. Consider a constant depth-d noiseless quantum circuit
$$U = U_d \cdots U_1,$$
which acts on a prepared input state and is followed by measurements, where each $U_i$ for $i \in \{1, ..., d\}$ is a depth-one circuit composed of single and two-qubit Clifford gates. It was shown in [21] that a noisy version of this circuit satisfies
$$U_{noisy} = E_{out} E_d U_d \cdots E_1 U_1 E_{prep} = E \, U, \qquad (8)$$
where $E_i \sim N(p_i)$ with constants $0 < p_i < 1$ is the noise following the layer $U_i$, and $E_{prep} \sim N(p_{prep})$ and $E_{out} \sim N(p_{out})$ with constants $0 < p_{prep}, p_{out} < 1$ are the errors in the preparation and measurement respectively [54].
For constant depth d, $E \sim N(q)$ where 0 < q < 1 is a constant which is a function of $p_1, ..., p_d, p_{prep}, p_{out}$ [21] [55]. Equation (8) shows that the errors accumulating in a constant depth quantum circuit composed of single and two-qubit Clifford gates can be treated as a single error E. Furthermore, for small enough q (i.e., small enough $p_1, ..., p_d, p_{prep}, p_{out}$; typically, these should be smaller than the threshold of fault-tolerant computing with the surface code [21,34] or of the 3D cluster state [41] in our case), E can be corrected with high probability by using standard techniques in quantum error correction (QEC) [21,40]. Also, E can be propagated until after the measurements, where the error correction procedure is completely classical.
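The mechanism behind pushing errors to the end of a Clifford circuit is that a Clifford gate maps Pauli operators to Pauli operators under conjugation. The following sketch checks this on dense matrices for a CZ and a Hadamard gate; it is a small illustration of the idea, not the proof technique of [21], and the helper name `conjugate` is hypothetical. A Pauli error E occurring before a gate C is equivalent to the Pauli $C E C^\dagger$ occurring after it, and a two-qubit gate can at most double the support of the error.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.diag([1, -1]).astype(complex)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
CZ = np.diag([1, 1, 1, -1]).astype(complex)

def conjugate(clifford, pauli):
    """Return C P C^dagger, the error seen after the gate C."""
    return clifford @ pauli @ clifford.conj().T

x_before = np.kron(X, I2)             # X error on the first qubit, before CZ
x_after = conjugate(CZ, x_before)     # same error pushed past the CZ

spreads_to_xz = np.allclose(x_after, np.kron(X, Z))
z_commutes = np.allclose(conjugate(CZ, np.kron(Z, I2)), np.kron(Z, I2))
h_maps_x_to_z = np.allclose(conjugate(np.kron(H, I2), x_before), np.kron(Z, I2))
# "Error then gate" equals "gate then pushed error".
same_circuit = np.allclose(CZ @ x_before, x_after @ CZ)
```

Since each depth-one layer spreads every single-qubit error to at most two qubits, a constant number of layers keeps the accumulated error local, which is the intuition for the constant rate q above.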

4D NN ARCHITECTURE
In this part of the paper, we will describe the construction of our 4D NN architecture demonstrating a quantum speedup. Our approach takes three ingredients: the sampling based on regular graph states mentioned above [11-13, 15, 20, 39], fault-tolerant single shot preparations of logical qubit states [21], and magic state distillation (MSD) [56-58]. A large part of our fault-tolerant techniques follows the work of [21], which presents a family of constant depth circuits giving statistics that cannot be reproduced by any classical computer of constant depth. To do so, the authors introduce error correcting codes where it is possible to prepare logical states fault-tolerantly in constant depth, and where Clifford gates are transversal. They also show that, for local stochastic quantum noise, all errors for Clifford circuits can be traced through to effectively be treated as a final error, meaning that errors do not have to be corrected during the circuit. Together these allow for constant depth fault-tolerant versions of constant depth Clifford circuits. Compared to [21], the big difference in our work is the need for non-Clifford operations (for the choice of local measurement angle). To address this, we use so-called magic states, which can be distilled fault-tolerantly [56]. Generally their distillation circuits are not constant depth, however, and here we adapt the distillation circuits of [57] to be constant depth using ideas from MBQC. In particular, we do not use feed-forward in the distillation procedure, and instead trade circuit depth for the cost of running many copies of constant depth circuits (each being an MSD circuit with no feed-forward) in parallel. We show that, for specific MSD techniques [57, 59-61], a balance can be reached which gives sufficiently many magic states of high enough fidelity to demonstrate quantum speedup in constant depth with polynomial overhead in the number of ancillas. We then use MBQC notions to route the high fidelity magic states into our sampling circuit. This is also done in constant depth. At this point, interaction with a classical computer is required. This is mainly in order to identify which copies of the MSD circuits (which are run in parallel) were successful in distilling magic states of sufficiently high fidelity. Afterwards, these high fidelity magic states are taken, together with more ancillas, to make a logical version of the graph state, which is then measured. Effectively we then have two constant depth quantum circuits with an efficient (polynomial) classical computation in between.
The constant depth MBQC distillation, together with the constant depth MBQC routing, will ensure that enough magic states with adequately high fidelity are always injected into our sampling problem, thereby enabling us to observe quantum speedup deterministically at each run of the experiment, since we would deterministically recover an encoded version of the 2D NN architecture with the single-instance hardness property described in earlier sections [39]. This is contrary to what happens in [42], where an encoded version of this 2D NN architecture is constructed probabilistically, albeit with exponentially low probability of success.
Logical encoding - Following [21], we use the folded surface code [21,62,63]. A single logical qubit is encoded into l physical qubits. We denote the logical versions of states and fault-tolerant gates using a bar, that is, a state $|\psi\rangle$ of m qubits would be encoded onto its logical version $|\bar{\psi}\rangle$ on $m \cdot l$ qubits and an operator U would be replaced by the logical operator $\bar{U}$. The choice of encoding onto the folded surface code has two main advantages. Firstly, Clifford gates have transversal fault-tolerant versions, meaning the fault-tolerant versions of a constant depth Clifford circuit are also constant depth and composed of single and two-qubit Clifford gates acting on physical qubits of the code [21]. For example,
$$\bar{X} = \bigotimes_{i \in V_{diag}} X_i,$$
where $V_{diag}$ is the set of physical qubits lying on the main diagonal of the surface code, and $X_i$ is a Pauli X operator acting on physical qubit i. Similarly, for the logical version of the Pauli Z operator,
$$\bar{Z} = \bigotimes_{i \in V_{diag}} Z_i.$$
Secondly, the preparation of the logical $|\bar{0}\rangle$ and $|\bar{T}\rangle$ states can be done fault-tolerantly in constant depth [21,58].
The preparation of the logical $|\bar{0}\rangle$ state can be done fault-tolerantly using the single-shot preparation procedure of [21]. This requires a constant depth 3D quantum circuit, together with polynomial time classical postprocessing, which can be pushed until after the measurements of the logical qubits of our circuit (see Figure 2). This constant depth quantum circuit consists of non-adaptive measurements on a 3D cluster state composed of $O(l^{3/2})$ (physical) qubits [21]. The 3D cluster state, being of regular structure, can be prepared in constant depth. The non-adaptive measurements create a two-logical-qubit Bell state up to a Pauli operator. The classical postprocessing serves to trace these Paulis through the Clifford circuits (Figure 2) and correct the measurement results accordingly. In [21] it is shown that this preparation process is fault-tolerant, by showing that, in the presence of local stochastic quantum noise, the overall noise induced from the preparation, measurements, and Pauli correction is a local stochastic noise with constant rate [21]. For our purposes, we will only use one logical qubit of the Bell state [64].
The preparation of the logical T-state $|\bar{T}\rangle$ can also be done in constant depth by using a technique similar to [58]. Indeed, in the absence of noise, a perfect logical T-state can be prepared by the initialization of l physical qubits (over a constant number of rounds), as well as three rounds of full syndrome measurements, as detailed in [58] [65]. Each of the syndrome measurement rounds, because of the locality of the stabilizers in the surface code, can be scheduled in such a way as to be implemented by a constant depth quantum circuit composed of controlled-NOT gates and ancilla qubit measurements [33,58]. In the presence of noise, this procedure prepares a noisy logical T-state (Equation (14)), starting from a noisy physical qubit T-state, and noisy preparations, gates and measurements [58] [66]. However, distillation is required to get sufficiently high quality T-states, which will be dealt with separately later. For simplicity, for now we will assume perfect T-states.
Starting with the prepared logical $|\bar{0}\rangle$ and $|\bar{T}\rangle$, the logical version of Equation (7) is written in terms of the constant depth circuit $\bar{C}_2$ defined in Equation (10). Since all gates are Clifford, the physical circuit implementing $\bar{C}_2$ is constant depth. This circuit is the last circuit element in Figure 2, which combines the elements of our construction.
The logical Z measurements are carried out by physical Z measurements on the physical qubits of the surface code, and several classical decoding algorithms have been established [21,33,34,40]. In the noiseless case, the decoding algorithm consists of calculating the sum (modulo 2) of the measurement results from measuring Z on the physical qubits of the main diagonal of the surface code. In the presence of noise, the decoding algorithm takes as input the (noisy) measurement results of all the l physical qubits of the surface code which are measured in the same basis as the qubits on the main diagonal, and performs a minimal weight perfect matching to correct for the error induced by the noise [21,31,34]. For small enough error rates (below the threshold of fault-tolerant computing with the surface code), the probability that these decoding algorithms fail, that is, the probability that the noise changes the parity of the Z measurement result after decoding, decreases exponentially with the code distance cd, which for surface codes scales as $cd = O(\sqrt{l})$ [21,31,34,62]. Let $s \in \{0,1\}^{n(k-1)}$ denote the measurement results of the logical qubits of all but the last column of $|\bar{G}\rangle$. Similarly, let $x \in \{0,1\}^n$ denote the measurement results of the logical qubits of the last column of $|\bar{G}\rangle$.
If we call $\bar{D}(s, x)$ the probability of getting (s, x) in the absence of noise, it follows straightforwardly from the logical encoding that
$$\bar{D}(s, x) = D(s, x) \qquad (11)$$
for all $s \in \{0,1\}^{n(k-1)}$ and $x \in \{0,1\}^n$, where D(s, x) is as defined in Equations (3) and (4). That is, in the absence of noise, measuring non-adaptively the logical qubits of $|\bar{G}\rangle$ in Z defines a sampling problem with probability distribution D demonstrating a quantum speedup, by Proposition 1.
We will now see that this sampling remains robust under local stochastic noise. Noise must be addressed at each part of the construction. The first part is that each depth-one step of the circuit preparing $|\bar{G}\rangle$ is now followed by a local stochastic noise, as in the example of Equation (8). Also, the single-shot preparation procedure of [21] becomes noisy; however, as shown in [21], this noise is local stochastic with constant error rate and therefore can be treated as a preparation noise in preparing $|\bar{G}\rangle$, analogous to $E_{prep}$ in Equation (8). As seen earlier, the circuit preparing $|\bar{G}\rangle$ is constant depth and composed of single and two-qubit Clifford gates acting on physical qubits. Therefore, we can use the result of [21], which is shown in Equation (8), and treat all the noise accumulating through different steps of the circuit as a single local stochastic noise $E \sim N(q)$ with a constant rate 0 < q < 1, acting on the (classical) measurement outcomes [21]. Therefore, when q is low enough [21], E can be corrected with high probability using the classical decoding algorithms described earlier [21]. In appendix A, we show that it suffices for our needs that the number of physical qubits per logical qubit l scales as in Equation (12), where n is the number of rows of $|\bar{G}\rangle$ [67].
More precisely, we denote by $D_1(s, x)$ the probability of getting outcomes (s, x) in the presence of stochastic noise, after performing a classical decoding of the measurement results [21], but where logical T-states are assumed perfect (noiseless). Then, if l satisfies Equation (12), and for small enough error rates (below the threshold of fault-tolerant computing with the surface code [34]) of preparations, single and two-qubit gates, as well as measurements, $D_1(s, x)$ can be made 1/poly(n) close in $l_1$-norm to the noiseless version (Equation (11)). That is,
$$\sum_{s,x} |D_1(s,x) - \bar{D}(s,x)| \le \frac{1}{poly(n)}. \qquad (13)$$
This means that for a given constant $\mu_1$, there exists a large enough constant $n_0$ such that, for all $n \ge n_0$, classically sampling from $D_1$ up to $l_1$-norm error $\mu - \mu_1$ implies, by a triangle inequality, sampling from D up to $l_1$-norm error µ, which presents a quantum speedup by Proposition 1 [9]. Therefore, we have recovered quantum speedup in the presence of local stochastic noise, assuming perfect T-states.
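The triangle inequality step can be illustrated with a toy numerical example. The three distributions below are arbitrary made-up vectors on four outcomes, chosen only for illustration: `ideal` stands in for D, `noisy` for $D_1$, and `classical` for the output of a hypothetical classical sampler. The sketch checks that the distance from the sampler to the ideal distribution is bounded by the sum of the other two distances.

```python
import numpy as np

def l1_distance(p, q):
    """l1-norm distance between two probability vectors."""
    return float(np.abs(np.asarray(p) - np.asarray(q)).sum())

# Made-up stand-ins, not data from the paper.
ideal = np.array([0.50, 0.25, 0.125, 0.125])     # plays the role of D
noisy = np.array([0.49, 0.26, 0.125, 0.125])     # plays the role of D_1
classical = np.array([0.45, 0.25, 0.16, 0.14])   # hypothetical sampler output

mu_1 = l1_distance(noisy, ideal)             # closeness due to fault tolerance
sampler_error = l1_distance(classical, noisy)
total_error = l1_distance(classical, ideal)
triangle_holds = total_error <= sampler_error + mu_1 + 1e-12
```

In the argument of the text, `mu_1` shrinks as 1/poly(n), so a sampler that is within $\mu - \mu_1$ of $D_1$ is automatically within µ of D.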

Distillation of T -states-
The final ingredient is the distillation of the T-states. The analysis we have done so far assumes we can still prepare perfect logical T-states. In reality, however, this is not the case. Indeed, in the presence of noise, the constant depth preparation procedure of [58] can only prepare a logical T-state with error rate 0 < ε < 1,
$$\rho_{T_{noisy}} = (1 - \varepsilon) |\bar{T}\rangle\langle\bar{T}| + \varepsilon \eta, \qquad (14)$$
with η an arbitrary l-qubit state. In order to get high purity logical T-states, one must employ a technique called magic state distillation (MSD) [56]. An MSD circuit is a Clifford circuit which usually takes as input multiple copies of noisy T-states $\rho_{T_{noisy}}$, together with some ancillas, and involves measurements and post-selection in order to purify these noisy input states [56]. The output of an MSD circuit is a logical T-state $\rho_{T_{out}}$ with higher purity than the input one. That is,
$$\rho_{T_{out}} = (1 - \varepsilon_{out}) |\bar{T}\rangle\langle\bar{T}| + \varepsilon_{out} \eta, \qquad (15)$$
with $0 < \varepsilon_{out} < \varepsilon < 1$, and η an arbitrary l-qubit state.
For small enough ε [68] [69], $\varepsilon_{out}$ could be made arbitrarily small by repeating the MSD circuit an appropriate number of times [56]. MSD circuits need not in general be constant depth. Our approach to depth is, again, via a translation to the measurement based quantum computing (MBQC) paradigm [43]. In MBQC one starts off with a graph state, for example the 2D grid cluster state, and computation is carried out through consecutive measurements on individual qubits. In order to preserve determinism, these measurements must be corrected for. For a general computation this must be done round by round (the number of rounds typically scales with the depth of the corresponding circuit, though there can be some separation thereof [70]). If we forgo these corrections, we end up applying different unitaries, depending on the measurement results; indeed, this is effectively what happens in Equation (2). Thinking of MBQC now as a circuit, if one could do all measurements at the same time, one could think of it as a constant depth circuit, since all that is needed is to construct the 2D cluster state followed by one round of measurements and corrections, which can be done in constant depth. This is possible for circuits constructed fully of Clifford operations, but not generally, and not for the MSD circuits we use here because of the T gates (or feed-forward), so we are forced to sacrifice determinism. Now, in order to get constant depth MSD, we translate the MSD circuits in [57] to MBQC. The choice of this MSD construction is argued in appendix B 2.
Since we want to maintain constant depth, we want to perform all measurements at the same time; however, the cost is that it will only succeed if we get the measurement outcomes corresponding to the original circuit of [57] with successful syndromes. In order to produce enough T-states, the trick is simply to do it many times in parallel. That is, we will effectively implement many copies of the MBQC computation, so that we get enough successes. Effectively we trade depth of the corresponding circuit for number of copies and ancillas. Fortunately, for our specifically chosen MSD protocols [57,61], we will see that this cost is not too high.
Furthermore, this is all done in the logical encoding of the folded surface code. Our construction for this, which we denote zMSD, is designed to take copies of the noisy encoded T-states ([58]) and ancillas in the encoded $|0\rangle$ state, and effect $z$ iterations of the fault-tolerant version of the MSD protocol in [57]. As discussed above, this happens only when the correct results occur in the MBQC. In this case we say the zMSD was successful. We denote the circuit version of this as $C_1$ (see Figure 2). In appendix B, we show that when zMSD is successful, $\varepsilon_{out}$ satisfies Equation (16). We also show that performing $O(n^3 \log(n))$ copies of zMSD circuits (which can be done in parallel), each of which is composed of $O(\log(n))$ logical qubits as seen in appendix B 2, guarantees with high probability that at least $O(n^2)$ copies of zMSD will be successful (we will often refer to these as successful instances of zMSD).
Note that $O(n^2) = O(k \cdot n)$ is the number of perfect logical T-states needed to create $|G\rangle$ [39]. Furthermore, because zMSD is constant depth and composed of single and two-qubit Clifford gates, errors can be treated as a single local stochastic noise after the measurements with constant rate (see Equation (8)), which can be corrected classically with high probability when the error rates are low enough, using the standard decoding algorithms described previously [21,31]. The remaining task is to route these good states into the inputs of the circuit $C_2$ (Equation (10)) depending on the measurement outcomes, i.e., to make sure that only the good outputs go on to make $|G\rangle$. The most obvious approach, using controlled-SWAP gates, results in a circuit whose depth scales with $n$. Here, once more, we use MBQC techniques in order to bypass additional circuit depth. The idea is to feed the outputs through a 2D cluster graph state; dependent on the measurement results of the zMSD, the routing can be etched out by Pauli Z measurements. Since the graph is regular, and since the measurements can be made at the same time, this can be done in constant depth, up to Pauli corrections (which can be efficiently traced and dealt with by the classical computation at the end). We denote the fault-tolerant circuit implementing this as $C_R$, see Figure 2. Details of the construction can be found in appendix C, where we also show that errors remain manageable.
Finally, we denote by $D_2(s, x)$ the probability of observing the outcome $(s, x)$ after measuring all logical qubits after $C_2$ (Equation (10)) in the presence of local stochastic noise, where each T-state fed into $C_2$ is replaced by $\rho_{T_{out}}$, and performing a classical decoding of these measurement results [21]. We then show, in appendix B, that when $\varepsilon_{out}$ satisfies Equation (16), Equation (18) holds. Therefore, by the same reasoning as that for $D_1$, for small enough error rates, for large enough $n$, and with very high probability $p_{succ}$, we can prepare a constant depth quantum circuit sampling from a noisy distribution $D_2$ under local stochastic noise, presenting a quantum speedup.
Our main result can therefore be summarized in the following Theorem, whose proof follows directly from showing that Equation (18) holds and using Proposition 1.
Theorem 1 Assume that the PH does not collapse to its third level, and that worst-case hardness of the sampling problem (4) extends to average-case. Then there exist a positive constant $0 < p < 1$ and a positive integer $n_o$ such that for all $n \geq n_o$, if the error rates of local stochastic noise in all preparations, gate applications, and measurements in $C_1$, $C_R$, and $C_2$ are upper-bounded by $p$, then with high probability $p_{succ}$ (Equation (17)), the sampling problem $D_2$ defined by (18) can be constructed, and no $poly(n)$-time classical algorithm exists which can sample from $D_2$ up to a constant $\mu$ in $\ell_1$-norm.
Overview of the 4D NN architecture - The overall construction is presented in Figure 2 as a combination of the three circuits mentioned above: $C_1$ implementing the MSD, $C_R$ routing the successful T-states, and $C_2$ constructing the state $|G\rangle$. Overall it takes the noisy logical $|0\rangle$ and $\rho_{T_{noisy}}$ states as inputs, and the final measurements are fed back to a classical computer (CC) to output the error corrected results $s, x$, according to the distribution $D_2$ (Equation (18)). The preparation of the logical input states is done in constant depth [21,58], and each of these three composite circuits is constant depth, using at most three dimensions. Furthermore, assuming that classical computation is instantaneous, our entire construction can be viewed as a constant depth quantum circuit. Indeed, as already seen, $C_2$ is constant depth; what remains is to show the same for $C_1$ and $C_R$. We show this in appendices B 2 and C.
During the circuit, we require some side classical computation, which inputs back into the circuit at one point. Classical information passing to and from the classical computer is indicated by dotted orange and black lines in Figure 2. First, the measurements of the non-outputs for the zMSD in $C_1$, along with measurement results (not illustrated in the figure) of the (physical) qubits used in preparing the $|0\rangle$ [21] states making up the copies of zMSD, are fed into the classical computer in order to determine the choice of measurements after the routing circuit $C_R$, as indicated by the orange dotted lines in Figure 2. This part simply identifies the successful zMSD outcomes, followed by calculating the routing path. This is the only point at which classical results are fed back into the circuit; all other classical computations can be done after the final measurements. After these final measurements, the remaining measurement results are fed back into the computer, as indicated by black dotted lines in Figure 2. Together with the measurement results from the state preparations [21] (not illustrated in the figure), these are incorporated into the classical error correction [21,71], giving the outputs $s, x$ with probabilities $D_2$. The classical computation can be done in $poly(n)$-time [40,71,72].
The total number of physical qubits required scales as $O(n^5 poly(\log(n)))$ (where $n$ scales the size of the original sampling problem (Proposition 1)). This breaks down as follows. $C_1$ takes as input $O(n^3 \log^2(n))$ noisy logical T-states $\rho_{T_{noisy}}$ and $O(n^3 \log^2(n))$ ancillas prepared in $|0\rangle$. $C_R$ takes the outputs of $C_1$, and an additional $O(n^5 \log(n))$ logical ancillas prepared in $|0\rangle$. This dominates the scaling. $C_R$ sends $O(n^2)$ distilled T-states to $C_2$, which also takes in $O(n^2)$ copies of $|0\rangle$. This means that in total we would need $O(n^5 \log(n))$ logical qubits. Now, each logical qubit is composed of $l \geq O(\log^2(n))$ physical qubits (Equation (12)), and some of these logical qubits, which need to be prepared in
$|0\rangle$, require an additional overhead of $O(l^{3/2}) \geq O(\log^3(n))$ physical qubits, as seen previously (see also [21]). Therefore, the total number of physical qubits needed is $\sim O(n^5 \log^4(n)) = O(n^5 poly(\log(n)))$.
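The tally above can be reproduced with a short leading-order sketch. It assumes that every hidden constant in the $O(\cdot)$ expressions equals 1 and that every logical qubit pays the $|0\rangle$ preparation overhead (an over-count); the function name `physical_qubit_estimate` and its dictionary keys are hypothetical.

```python
import math


def physical_qubit_estimate(n: int) -> dict:
    """Leading-order qubit tally for the 4D NN construction.
    Assumption: all constants hidden in the O(.) notation are set to 1."""
    ln = math.log(n)
    logical_c1 = 2 * n**3 * ln**2   # noisy T-states plus |0> ancillas entering C_1
    logical_cr = n**5 * ln          # additional |0> ancillas of the routing circuit C_R
    logical_c2 = 2 * n**2           # distilled T-states plus |0> ancillas entering C_2
    logical = logical_c1 + logical_cr + logical_c2
    per_logical = ln**2             # l ~ log^2(n) physical qubits per logical qubit
    prep_overhead = ln**3           # l^(3/2) ~ log^3(n) extra qubits per prepared |0>
    # Upper estimate: charge the preparation overhead to every logical qubit.
    physical = logical * (per_logical + prep_overhead)
    return {"logical": logical, "physical": physical}


n = 10**6
estimate = physical_qubit_estimate(n)
# Ratio to the quoted scaling n^5 log^4(n); it stays close to 1 for large n.
print(estimate["physical"] / (n**5 * math.log(n) ** 4))
```

The routing ancillas of $C_R$ dominate the logical count, and the preparation overhead supplies the remaining three powers of $\log(n)$.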
A crucial question relevant to experimental implementations is the exact values of the error rates of measurements, preparations, and gates needed to achieve fault-tolerant quantum speedup in our construction. Because the quantum depth of our construction is constant and composed of single and two-qubit Clifford gates (as seen previously), we know from [21] and the likes of Equation (8) that these error rates are non-zero constants independent of $n$. However, their values may be pessimistically low. A crude estimate of this error rate is $p \sim e^{-4.6\times 4^{-d-1}}$. This assumes that preparations (including the preparation of noisy logical T-states for distillation), measurements, and gates all have the same error rate $p$. Here $d$ is a constant equal to the total quantum depth of our construction, which is the sum of the depths of all preparations, gate applications, and measurements involved in constructing zMSD, routing the outputs of successful instances of zMSD, and constructing $|G\rangle$. This expression is obtained by using the same techniques as [21], where the error rate $q$ of $E$ in Equation (8) is chosen such that it satisfies $q \leq 0.01$. This is in order for classical decoding to fail with probability decaying exponentially with the code distance of the surface code [21,31]. This construction is a constant depth quantum circuit implementable on a 4D NN architecture (or a 3D architecture with long range gates). The reason for this is that our original (non fault-tolerant) construction is a 2D NN architecture [39], as seen previously, and the process of making this architecture fault-tolerant requires adding an additional two dimensions [21], albeit while keeping the quantum depth constant, as explained earlier. If we do not want to use long range transversal CZ gates in 3D, and want all the CZ gates to be NN, the only way to do this is to work in 4D. Note that this was not a problem in [21], as there the original (non fault-tolerant) circuit was a 1D circuit, and introducing
fault-tolerance added two additional dimensions, making their construction constant depth with NN gates in 3D [21]. Nevertheless, we will show in the next section how to make our construction constant depth in 3D with NN two-qubit gates. We will do this by avoiding the use of transversal gates to implement encoded versions of two-qubit gates; a feature which is naturally found in defect-based topological quantum computing [41]. Armed with the ideas of constant depth MSD and MBQC routing, we shall present in this next section a constant quantum depth fault-tolerant construction demonstrating a quantum speedup with only nearest neighbor CZ gates in 3D.

3D NN ARCHITECTURE
In this part of the paper, we will explain how the construction for fault-tolerant quantum speedup described earlier can be achieved using a 3D NN architecture, based on the construction of Raussendorf, Harrington, and Goyal (RHG) [41]. Note that in this construction (henceforth referred to as the RHG construction), two types of magic states need to be distilled: the T-states seen previously, as well as the Y-states. A perfect (noiseless) Y-state is given by $|Y\rangle = \frac{1}{\sqrt{2}}(|0\rangle + i|1\rangle)$. This state is a resource for the phase gate $Z(\pi/2)$. The noisy Y-state $\rho_{Y_{noisy}}$ is defined analogously to a noisy T-state seen earlier, with $0 < \varepsilon < 1$ representing the noise, and $\eta$ an arbitrary single qubit state. As already mentioned, the RHG construction was also used in [42] to achieve fault-tolerant quantum speedup. However, our construction differs from [42] in mainly two ways. The first, as already mentioned, is that our construction deterministically produces a hard instance, whereas that in [42] produces such an instance with exponentially low probability. Secondly, our sampling problem verifies the anti-concentration property by construction [39], as explained previously, whereas in [42] this anti-concentration was conjectured. Therefore, in our proofs we assume one less complexity theoretic conjecture (we use two conjectures in total, see Theorem 1 and Proposition 1) as compared to [42]. Note that we assume the minimal number of complexity-theoretic conjectures needed to prove quantum speedup, using all currently known techniques [51].
We now very briefly outline the key points in the RHG construction. More detailed explanations can be found in [41,73,74]. In this construction, one starts out by preparing a 3D regular lattice of qubits (call it the RHG lattice). This preparation can be done in constant depth by using nearest neighbor CZ gates [41]. This lattice is composed of elementary cells, which can be thought of as smaller 3D lattices building it up. Elementary cells are of two types, primal and dual, and the RHG lattice is composed of a number of interlocked primal and dual cells [41,74]. Each elementary cell can be pictured as a cube, with qubits (usually initialized in the $|+\rangle$ state) living on the edges and faces of this cube. The RHG lattice is a graph state, and is thus characterized by a set of (local) stabilizer relations [52]. Errors can be identified by looking at the parity of these stabilizers. Usually, this is done by entangling extra qubits, called syndrome qubits, with the system qubits. However, in the RHG construction this is accounted for by including these syndrome qubits a priori when constructing the RHG lattice; this region of syndrome qubits is usually called the vacuum region $V$ [41]. Logical qubits in this construction are identified with defects. These defects are hole-like regions of the RHG lattice inside of which qubits are measured in the Z basis, effectively eliminating these qubits. Eliminating these qubits (and some of their associated stabilizers) results in extra degrees of freedom which define the logical qubits [41]. Defects can also be primal or dual, depending on whether they are defined on primal or dual lattices. Two defects of the same type (either primal or dual) define a logical qubit. The logical operators X and Z are products of X operators and Z operators respectively. These products of operators act non-trivially on qubits either encircling each of the two defects, or forming a chain joining the two defects, depending on whether the logical qubit
is primal or dual [41,74]. By measuring single qubits of the RHG lattice in the bases $X$, $Y$, $Z$, and $\frac{X+Y}{\sqrt{2}}$, one can perform (primal or dual) logical qubit preparation and measurement in the X and Z bases, preparation of (primal or dual) logical T-states and Y-states, and logical controlled-NOT (CNOT) gates between two defects of the same type (this, however, can only be accomplished by an intermediate step of braiding two defects of different types [41], which is one of the main reasons for the need for two types of defects). If performed perfectly (noiseless case), these operations are universal for quantum computation [43]. Note that in our case, as in [42], we will replace measuring qubits in the $Y$ and $\frac{X+Y}{\sqrt{2}}$ bases by initializing these qubits in $|Y\rangle$ and $|T\rangle$, then measuring these qubits in the X basis. In this way, we will only perform single qubit X and Z measurements. One of the spatial dimensions of the 3D RHG lattice is chosen as simulated time, allowing one to perform a logical version of MBQC via single qubit measurements [41].
Figure 3: The red box with an M symbol is a measurement in either X or Z. The circuit is shown up until $C_1$; the remaining part of this circuit is the same as that in Figure 2, with Z measurements replaced by M measurements, and with some ancilla qubits being initialized in $|+\rangle$ as well as $|0\rangle$. These slight changes are in order for the construction to be naturally integrated into the RHG framework [41]. Also shown in the figure is the additional interaction with the classical computer CC (ingoing and outgoing red dotted arrows) needed in order to identify the successfully distilled Y-states as well as construct the measurement pattern for the routing circuit $C'_R$.
The preparation and measurement of logical qubits in the X and Z bases, as well as CNOT gates, can all be performed by measuring qubits of the RHG lattice in X and Z [41,74]. All these operations can be performed fault-tolerantly and non-adaptively (up to Pauli corrections, which can be pushed until after the measurements and accounted for, since all our circuits are Clifford [72]) by choosing the defects to have a large enough perimeter and a large enough separation [41,74]. Indeed, in appendix D, we show that when $L_m = O(\log(n))$, where $L_m$ is the minimum (measured in units of length of an elementary cell) between the perimeter of a defect and the separation between two defects in any direction, we recover the same fault-tolerance results as our 4D NN architecture under local stochastic noise, albeit with different error rates, which we also calculate in appendix D. The noisy logical Y-states and T-states can also be prepared non-adaptively up to Pauli corrections by performing X and Z measurements on qubits of the RHG lattice, some of which are initialized in $|Y\rangle$ (for logical Y-state preparation) or $|T\rangle$ (for logical T-state preparation) [74]. However, these preparations are unfortunately non-fault-tolerant (they introduce logical errors), and therefore these states must be distilled [41].
If we could somehow obtain perfect logical Y-states, then our constant-depth fault-tolerant 3D NN construction under local stochastic noise would follow a similar analysis as our 4D NN case, and have a circuit exactly the same as that in Figure 2 (up to using X measurements in place of H gates followed by Z measurements), with one difference being that instead of using concatenated versions of the MSD circuits of [57] to construct $C_1$, we will use concatenated versions of the MSD circuits of [61]. This is in order to preserve the transversality of logical T-gates, which allows preparation of logical T-states in the RHG construction by using only local measurements [41] [75]. Unfortunately, distilling logical Y-states in the RHG construction is essential. What makes matters worse is that using techniques like those used in the construction of $C_1$ on MSD circuits capable of distilling logical Y-states up to fidelity $1 - \varepsilon_{out}$ (Equation (16)), namely circuits based on the Steane code [41], leads to circuits with a quasi-polynomial number of ancillas. This is much worse than the polynomial number of ancillas used in the circuits $C_1$ needed to distill logical T-states of the same fidelity $1 - \varepsilon_{out}$, and based on the MSD circuits of [57,61] (see appendices B 2 and E).
Happily, we manage to overcome this limitation by observing two facts about our construction. The first is that the $Z(\pi/2)$ rotations (and thus Y-states) are not needed in order to construct our sampling problem. Indeed, in Figure 1 every qubit measured at an XY angle $\pi/2$ in $G_B$ could be replaced by a linear cluster of three qubits measured respectively at XY angles $\pi/4$, $0$, and $\pi/4$ (these measurements can be implemented by only using logical T-states in the fault-tolerant version). To make a graph state of regular shape, we should also replace all qubits at the same vertical level as the $\pi/2$-measured qubits in $G_B$ (see Figure 1), and which are always measured at an XY angle $0$, with a linear cluster of three qubits measured at an XY angle $0$. By doing this replacement, the new graph gadget $G'_B$, which is an extension of $G_B$, now defines a so-called partially invertible universal set [15]. Therefore, by results in [15], using $G'_B$ instead of $G_B$ in our construction (Figure 1) also results in a sampling problem with distribution $D' = \{D'(s, x)\}$ (where $s$ and $x$ are bit strings defined analogously to those in (3) and (4)) satisfying both worst-case hardness and the anti-concentration property [13,15]. Thus, the distribution $D'$, although different from $D$ (Equations (3) and (4)), can be used in the same way as $D$ to demonstrate a quantum speedup (see Proposition 1). Furthermore, all previous results established for $D$ also hold when $D$ is replaced by $D'$.
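The replacement of a $\pi/2$ measurement by three measurements at $\pi/4$, $0$, $\pi/4$ can be checked numerically on a single wire. The sketch below assumes the common MBQC convention in which measuring a linear-cluster qubit at XY angle $\alpha$ with outcome 0 enacts $J(\alpha) = H \cdot \mathrm{diag}(1, e^{-i\alpha})$; this convention and the helper names `matmul`, `J`, and `close` are illustrative assumptions, and only the all-zero outcome branch is covered.

```python
import cmath
import math


def matmul(a, b):
    """Multiply two 2x2 complex matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def J(alpha):
    """Gate enacted by one XY-plane measurement at angle alpha, outcome 0.
    Assumed convention: J(alpha) = H . diag(1, exp(-i alpha))."""
    s = 1 / math.sqrt(2)
    hadamard = [[s, s], [s, -s]]
    phase = [[1, 0], [0, cmath.exp(-1j * alpha)]]
    return matmul(hadamard, phase)


def close(a, b, tol=1e-12):
    """Entry-wise comparison of two 2x2 matrices."""
    return all(abs(a[i][j] - b[i][j]) < tol for i in range(2) for j in range(2))


single = J(math.pi / 2)
# The first measured qubit corresponds to the rightmost factor of the product.
triple = matmul(J(math.pi / 4), matmul(J(0.0), J(math.pi / 4)))
print(close(single, triple))
```

Under this convention the middle measurement contributes a Hadamard that cancels the one from the first measurement, so the two $\pi/4$ phases add up to $\pi/2$. Other outcome branches differ by Pauli byproducts and yield further unitaries, in line with the gadget $G'_B$ sampling from a larger set than $G_B$.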
To see why $G'_B$ defines a partially invertible universal set, call $U_1 \subset U(4)$ ($U_2 \subset U(4)$) the set of all random unitaries which can be sampled by measuring the qubits of $G_B$ ($G'_B$) non-adaptively at their prescribed angles. Straightforward calculation shows that $U_1 \subset U_2$. Furthermore, both $U_1$ and its complement in $U_2$ (denoted $U_2 - U_1$) are (approximately) universal in $U(4)$, since they are composed of unitaries from the gate set of Clifford + T [32,39]. The set $U_1$ being both universal in $U(4)$ and inverse containing [39] implies that $U_2$ satisfies all the properties of a partially invertible universal set [15]. However, note that in using partially invertible universal sets, for technical reasons [15], the number of columns of $|G\rangle$ should now satisfy $k = O(n^3)$, resulting in an increased overhead of ancilla qubits.
One could keep k = O(n) (as in the original construction with G B ) while using only π/4 and 0 measurements, by using one of the constructions of [12].However, the construction of [12] does not have a provable anti-concentration, although extensive numerical evidence was provided to support the claim that this family of circuits does indeed anti-concentrate [12].
Although Y-states are not needed in the construction of our sampling problem, they are still needed to construct MSD circuits for distilling logical T-states of fidelity $1 - \varepsilon_{out}$ (Equation (16)) [61], which brings us to our second observation. In order to distill logical T-states of fidelity $1 - \varepsilon_{out}$ (Equation (16)), we only need logical Y-states of fidelity $1 - \varepsilon'_{out}$, where $\varepsilon'_{out}$ need not be as small as $\varepsilon_{out}$. In other words, the required output fidelity of the logical Y-states need not be as high as that of the logical T-states. In appendix E, we show that this leads to a construction of a (constant-depth) non-adaptive MSD (analogous to how $C_1$ is constructed) which takes as input a polynomial number of logical ancillas, initialized in either noisy logical Y-states, $|+\rangle$, or $|0\rangle$, and which outputs enough logical Y-states of fidelity $1 - \varepsilon'_{out}$ needed in the subsequent distillation of logical T-states. This circuit, which we call $C'_1$ and which is based on concatenations of the Steane code [41], is a constant depth Clifford quantum circuit composed of CNOT gates, and followed by non-adaptive X and Z measurements. $C'_1$, like $C_1$, prepares the graph states needed for non-adaptive MSD via MBQC (as seen previously). Note that here we will use CNOT gates instead of CZ gates in order to prepare logical graph states, since these gates are more natural in the RHG construction [41]. The preparation procedure is essentially the same as that with CZ modulo some H gates, but these logical Hadamards can be absorbed into the initialization procedure (where some qubits become initialized in $|0\rangle$ instead of $|+\rangle$) and the measurements (where some X measurements after $C'_1$ are changed to Z measurements, and vice versa). The same holds for all other circuits based on graph states in this construction.
With the distillation of logical Y-states taken care of, we now summarize our constant depth construction based on a 3D NN architecture. The circuit of this construction is found in Figure 3. It takes as input logical qubits initialized in the states $|+\rangle$, $|0\rangle$, $\rho_{T_{noisy}}$, and $\rho_{Y_{noisy}}$, and outputs a bit string $(s, x)$ sampled from the distribution $D'_2$ demonstrating a quantum speedup (see Theorem 1 and Proposition 1). Note that $D'_2$ is the fault-tolerant version of the distribution $D'$ defined earlier. $D'_2$ is defined analogously to $D_2$ in Equation (18), which is the fault-tolerant version of the distribution $D$ (Equation (4)). Our 3D NN architecture is composed of five constant depth circuits acting on logical qubits: $C'_1$, $C'_R$, $C_1$, $C_R$, and $C_2$. Here $C_1$, $C_R$, and $C_2$ are as defined previously, and $C'_R$ is a routing circuit, analogous to $C_R$, which routes successfully distilled logical Y-states to be used in $C_1$. Furthermore, all of these circuits, as well as the preparation of logical qubits, can be constructed by non-adaptive single-qubit X and Z measurements on physical qubits arranged in a 3D RHG lattice, whose preparation is constant depth and involves only nearest neighbor CZ gates. These physical qubits are initialized in the (noisy) states $|+\rangle$, $|Y\rangle$, and $|T\rangle$ [41]. Our construction has two layers of interaction with a classical computer, needed to identify successfully distilled logical Y-states and T-states respectively. The number of physical qubits needed is $O(n^{11} poly(\log(n)))$; this calculation is performed in appendix E. The additional overhead as compared to our 4D NN construction comes from mainly two sources: the partially invertible universal set condition [15], and the circuits $C'_R$ and $C'_1$, which arise as a result of needing to distill logical Y-states in the 3D RHG construction [41].
As in our 4D NN architecture, the noise model we use here is the local stochastic quantum noise defined earlier [21,76]. Since the circuit needed to construct the 3D RHG lattice is composed of single and two-qubit Clifford gates acting on prepared qubits [41], all errors of preparations and gate applications can be pushed, together with the measurement errors, until after the measurements, as seen previously. Because the circuit preparing the RHG lattice is constant depth, the overall local stochastic noise has a constant rate (see Equation (8)), and can therefore be corrected with high probability for low enough (constant) error rates of preparation, gate application, and measurements [21] (see appendix D, where we calculate an estimate of these error rates). The error correction, as in our 4D NN architecture, is completely classical and involves minimal weight matching [71]. This error correction is $poly(n)$-time and is performed at each of the two layers of interaction with the classical computer, as well as after the final measurements. Also, as in the 4D NN case, other $poly(n)$-time classical algorithms are included in the classical post processing; these serve to identify successful MSD instances and to identify the measurement patterns of the routing circuits. The classical computer, at each layer of interaction as well as after the final measurements, takes as input measurement results of qubits involved in the computation, as well as measured qubits in the vacuum region $V$. These vacuum qubits give the error syndrome at multiple steps in the computation, and are therefore needed for the minimal weight matching [41].
Discussion - In summary, we have presented a construction sampling from a distribution demonstrating a quantum speedup, which is robust to noise. Our construction has constant depth in its quantum circuit, and can be thought of as a fault-tolerant version of the (noise free) constant depth quantum speedup based on generating and measuring graph states [11-13, 15, 20, 39]. We have shown how to implement this construction either by using a 4D architecture with nearest neighbor two-qubit gates, or by using a 3D architecture with nearest neighbor two-qubit gates. The circuits of each of these architectures interact at most twice with an (efficient) classical device while running, and have different requirements in terms of overhead of physical ancilla qubits, owing to the fact that they are based on two different constructions for fault-tolerance [21,41].
The overheads are large in terms of the number of (physical) qubits; however, these may be improved. In any case, our construction is considerably simpler than fault-tolerant full blown quantum computation, where circuits scale in depth and many adaptive layers are required. Therefore our architectures demonstrate potential for an interim demonstration of quantum computational advantage, which may be much more practical. Indeed, if one considers classical computation temporally free, our construction represents a constant time implementation of a sampling problem with fault-tolerance.
We note that although we have presented here a fault-tolerant construction for a specific graph state architecture [39], the same techniques can be applied to any of the sampling schemes based on making local XY measurements from the set $\{0, \pi/2, \pi/4\}$ on regular graph states [11-13, 15, 39].
In particular, it can be easily adapted to cases where the measurements are not fixed but chosen at random before the running of the circuit [11-13]. This would essentially just fix the locations of the distilled T-states, but it could be done beforehand, and would not affect the efficiency of the routing circuits. This has the potential of relating the average-case hardness conjecture to that of other more familiar problems [9,11,12].
Our work also has potentially another interest, as it can alternatively be viewed as a constant depth quantum circuit which samples from an approximate unitary t-design [45] fault-tolerantly. Indeed, our techniques can be used to directly implement a logical version of Equation (2), which samples from an approximate t-design. These t-designs have many useful applications across quantum information theory [45, 77-81].
Several interesting approaches for optimization may be considered. One could think of using different quantum error correcting codes, such as those of [33,76], to decrease the overhead of physical qubits. One could also aim to optimize the overhead of both gates and physical qubits of the MSD by using techniques similar to those of [82,83].
The ability to efficiently verify quantum speedup is also an important goal. Although this question has already been pursued in the regime of fault-tolerance in [42], and the techniques developed there are directly applicable to our 3D NN architecture, it would be interesting to develop verification techniques more naturally tailored to the graph state approach [43,52] and MBQC [43,72], which we use heavily here. In this direction, the work of [84,85] can be used for this purpose when the measurements (both Clifford and non-Clifford) as well as the CZ and Hadamard gates (needed for the preparation of the graph states [52]) are assumed perfect (noiseless). Indeed, in this case the verification amounts to verifying that the graph state was correctly prepared, for which [84,85] provide a natural path, by giving good lower bounds (with high confidence) on the fidelity (with respect to the ideal graph state corresponding to the sampling problem) of the prepared graph state in the case where a sufficient number of stabilizer tests pass [84,85]. These lower bounds on the fidelity, tending asymptotically to one [84,85], allow one to verify that quantum speedup is being observed, as long as one trusts the local measurement devices (which, being small, can be checked efficiently by other means). This verification of quantum speedup can be done by using the standard relation between the fidelity of two quantum states (which in our case are the ideal state and the state accepted by the verification protocol) and the $\ell_1$-norm distance between the two output probability distributions corresponding to measuring the qubits of these two states [32].
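The relation between fidelity and $\ell_1$ distance invoked above can be made concrete. The sketch below uses the textbook bound $\|P - Q\|_1 \leq 2\sqrt{1 - F}$, with $F = |\langle\psi|\phi\rangle|^2$ for pure states, and checks it on random states measured in the computational basis. The function names and the restriction to pure states are illustrative assumptions, not the verification protocol of [84,85].

```python
import math
import random


def l1_bound_from_fidelity(fidelity: float) -> float:
    """Upper bound 2*sqrt(1 - F) on the l1 distance between output distributions,
    with F the squared-overlap fidelity."""
    return 2.0 * math.sqrt(max(0.0, 1.0 - fidelity))


def random_state(dim: int, rng: random.Random) -> list:
    """Random normalized pure state as a list of complex amplitudes."""
    amps = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(dim)]
    norm = math.sqrt(sum(abs(a) ** 2 for a in amps))
    return [a / norm for a in amps]


def fidelity(psi: list, phi: list) -> float:
    """Squared overlap |<psi|phi>|^2 of two pure states."""
    return abs(sum(a.conjugate() * b for a, b in zip(psi, phi))) ** 2


def l1_distance(psi: list, phi: list) -> float:
    """l1 distance between the computational-basis outcome distributions."""
    return sum(abs(abs(a) ** 2 - abs(b) ** 2) for a, b in zip(psi, phi))


rng = random.Random(0)
psi, phi = random_state(8, rng), random_state(8, rng)
print(l1_distance(psi, phi) <= l1_bound_from_fidelity(fidelity(psi, phi)))
```

A certified fidelity lower bound of $1 - \delta$ therefore limits the $\ell_1$ deviation of the sampled distribution to at most $2\sqrt{\delta}$ under these assumptions.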
These techniques, however, do not easily extend to the case where the measurements and gates needed for preparation are noisy, since for graph states of size $m$, even for an arbitrarily small (but constant, for example below the threshold for fault-tolerant computing) noise strength, the verification protocol might fail (not accept a good state) in the asymptotic ($m \to \infty$) limit (see for example [85], where the verification accepts with probability one asymptotically only if the noise strength scales as $1/poly(m)$). We leave this problem for future investigation.
qubit of the Bell state non-adaptively in Z, then decoding the result and applying an X to the unmeasured logical qubit dependent on the decoded measurement result. This should be done after the recovery Pauli operator of [21] has been applied. The noise acting on the unmeasured qubit after completion would still be local stochastic with constant rate. Indeed, after applying the recovery operator of [21], we are left with a Bell state with some local stochastic noise $E$ [21]; then, after measuring one logical qubit and decoding (which succeeds with high probability if error rates are small), we apply a conditional X operator to the unmeasured logical qubit. In the case this X is applied, it also introduces a local stochastic noise $E'$, but because X is a constant depth Clifford gate with only single qubit gates, $E'$ can be merged with $E$ to give a single local stochastic noise $E''$ which is still local stochastic with constant rate, by arguments like those of Equation (8). In what remains, we incorporate this operation into the classical post-processing needed to apply the recovery operator of the single-shot preparation procedure of [21]. We will therefore mean by recovery operator hereafter the Pauli recovery operator of [21] together with the conditional X which is applied to the unmeasured logical qubit. Note also that, as mentioned in the main text, we will often push applying this recovery operator until later parts of
the circuit (for example after measuring the non-outputs of all copies of zMSD as well as after the final measurements of $C_2$); in that case the arguments for the overall noise being local stochastic still hold and follow similar reasoning as above.
[65] The constant depth procedure of [58] also requires some post-selection (in the presence of noise). However, this post-selection is usually over measurement results of a small (constant) number of qubits, and the success probability is also a constant [58]. We can therefore implement O(1) runs of this constant depth procedure in parallel, and we are guaranteed with high probability that at least one run corresponds to the desired post-selection.
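The counting behind the parallel runs in note [65] can be illustrated with a minimal sketch. It assumes the runs succeed independently, each with the same constant probability; the function names and the example value 0.5 are hypothetical and are not taken from [58].

```python
import math


def prob_at_least_one_success(p_success: float, runs: int) -> float:
    """Probability that at least one of `runs` independent runs is post-selected."""
    return 1.0 - (1.0 - p_success) ** runs


def runs_needed(p_success: float, delta: float) -> int:
    """Smallest number of parallel runs whose overall failure probability is <= delta."""
    if not (0.0 < p_success < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("p_success and delta must lie strictly between 0 and 1")
    # (1 - p)^r <= delta  <=>  r >= log(delta) / log(1 - p), both logs being negative.
    return math.ceil(math.log(delta) / math.log(1.0 - p_success))


if __name__ == "__main__":
    r = runs_needed(p_success=0.5, delta=0.01)  # hypothetical constant success rate
    print(r, prob_at_least_one_success(0.5, r))
```

Because the per-run success probability does not depend on n, the number of runs returned here is a constant for any fixed target failure probability.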
[66] The noise model used in [58] is not the same as the one we use here: in [58], independent depolarizing noise is used for preparations and gate applications, with different rates for single and two-qubit gates. Nevertheless, we believe their results hold in our case as well. Indeed, under local stochastic noise with rate p, a single qubit experiences an error (after preparation, measurement or gate application) with probability $p_r \leq p$ (from the definition of local stochastic noise with $|F| = 1$, see main text). This is in line with the noise model of [58], where the probability of error is exactly p. Furthermore, choosing different error rates for the local stochastic noise applied after single and two-qubit gates allows mimicking what happens in the noise model of [58].
[67] This is usually the input part of an MBQC [43], which is the basis of our construction.
[...] very mild and has no effect on our end result. To see this, suppose we drop this assumption; then the probability of success of all $kn$ decodings should be calculated by a union bound. From the properties of local stochastic noise (namely, that local stochastic noise on a subset of qubits of the system is still local stochastic with the same rate [21]), a decoding of a logical qubit succeeds (is able to identify and correct for the error) with probability $p_{single} = 1 - p_f = 1 - e^{-O(\sqrt{l})}$ (when the error rates of all local stochastic noise in our construction are adequately low, i.e., below the threshold of fault-tolerant computing with the surface code). Therefore, the probability that all $kn$ decodings succeed is at least $P = 1 - kn + kn\,p_{single} = 1 - kn\,e^{-O(\sqrt{l})}$, by a standard bound on the intersection of $kn$ events derived from a union bound. The assumption we make in the main text results in a good approximation of $P$, and is simpler to state (which is why we used it in the main text). Finally, note that this does not mean that errors between physical qubits of two entangled logical qubits are uncorrelated. Indeed, the correlation between these qubits is accounted for in the propagation rules of local stochastic noise [21], since forward propagating local stochastic noise in Clifford circuits composed of single and two-qubit gates generally results in local stochastic noise with a higher error rate [21].
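As a numerical illustration of the union bound in the note above, the following sketch evaluates the lower bound $1 - kn\,e^{-c\sqrt{l}}$ and the smallest $l$ that reaches a target success probability. The constant `c` stands in for the unspecified constant hidden in $O(\sqrt{l})$; it and the function names are assumptions made only for this example.

```python
import math


def single_decoding_failure(l: int, c: float = 1.0) -> float:
    """Failure probability e^{-c*sqrt(l)} of one decoding (c is a hypothetical constant)."""
    return math.exp(-c * math.sqrt(l))


def all_decodings_lower_bound(k: int, n: int, l: int, c: float = 1.0) -> float:
    """Union-bound lower bound on the probability that all k*n decodings succeed."""
    return 1.0 - k * n * single_decoding_failure(l, c)


def min_l_for_target(k: int, n: int, delta: float, c: float = 1.0) -> int:
    """Smallest l with k*n*e^{-c*sqrt(l)} <= delta, i.e. sqrt(l) >= ln(k*n/delta)/c."""
    return math.ceil((math.log(k * n / delta) / c) ** 2)


if __name__ == "__main__":
    l_min = min_l_for_target(k=10, n=10, delta=0.01)
    print(l_min, all_decodings_lower_bound(10, 10, l_min))
```

Under these assumptions the required $l$ grows only like $\log^2(kn/\delta)$, which is why a polylogarithmic code size is enough to make the bound close to one.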
[87] Actually, it is closer to $2kn$ if we include the decoding of the measured logical qubits of the Bell states obtained at the end of the single-shot procedure of [21] (see [57]). This changes nothing in the analysis we have done, so we chose to omit it from the main text for simplicity.
[88] Actually, this relation holds for the l1-norm distance between the probability distributions over physical qubits.However, as the absolute value of the sum is less than the sum of absolute values, this relation also holds for the probabilities in Equation (B11).
[89] Since quantum speedup is usually defined with respect to quantum devices using polynomial quantum resources [32].

Properties of zMSD
zMSD implements non-adaptively z iterations of the MSD protocol of Theorem 4.1 in [57]. Note that in the protocol of [57], the MSD circuit was for magic states of the form $|H\rangle = \cos(\pi/8)|0\rangle + \sin(\pi/8)|1\rangle$, whereas in our case we need distillation circuits for T-states $|T\rangle$ defined in the main text. However, since $HZ(-\pi/2)|H\rangle = e^{-i\pi/8}|T\rangle$, the circuits in [57] can be adapted to our case by adding a constant depth layer of H and $Z(-\pi/2)$ gates, whose logical versions can be done fault-tolerantly and also in constant depth in our construction. We call 1MSD a circuit which implements non-adaptively one iteration of the protocol of Theorem 4.1 in [57]. Note that both zMSD and 1MSD will be based on non-adaptive MBQC. We will begin by calculating the number of qubits of 1MSD.
In [...] is the depth of the Clifford part of the circuit, which is composed of long-range Cliffords [57]. Therefore, the MSD circuit is an $O(d)$-qubit circuit of depth $O(d^2\log(d))$. In order to implement this circuit on a regular graph state (for example, the cluster state [43]), one must transform the Clifford circuit composed of long-range gates into one composed of nearest-neighbor and single-qubit Clifford gates, since these single-qubit and nearest-neighbor two-qubit gates can be implemented by measuring $O(1)$ qubits of a cluster state in the X and Y bases [43,72]. An m-qubit Clifford gate can be implemented by an $O(m^2)$-depth circuit composed only of gates from the set $\{CZ_{ij}, H, Z(\pi/2)\}$ [97]. Furthermore, $CZ_{ij}$ can be implemented by a circuit of depth $O(|i-j|)$ composed of nearest-neighbor CZ gates [98]. The same arguments hold in the logical picture by replacing H, CZ, $Z(\pi/2)$, and noisy input T-states with their logical versions $\overline{H}$, $\overline{CZ}$, $\overline{Z}(\pi/2)$, and $\rho_{T_{noisy}}$. In our case $m = O(d)$, thus the number of columns of the cluster state needed to implement 1MSD is $n_c = O(d^2\log(d)) \cdot O(d^2) \cdot O(d) = O(d^5\log(d))$. (B12)
zMSD can be thought of as a concatenation of z layers of 1MSD, where the output of layer j is the input of layer j + 1. Because the noisy input T-states in the protocol of [57] are injected at different parts of the circuit, the output qubits of layer j should be connected to layer j + 1 at different positions by means of long-range CZ gates. Therefore, the graph state implementing zMSD can be seen as cluster states composed of logical qubits and connected by long-range CZ gates, as shown in Figure (4). One could equivalently replace these long-range CZ gates with a series of SWAP gates, which can be implemented (up to Pauli correction, by means of non-adaptive X and Z measurements) on a 2D cluster state with only nearest-neighbor CZ gates [43,72]. Because these long-range CZ gates act on qubits separated by a distance $poly(d)$, the introduction of SWAP gates introduces an additional (constant) overhead of $O(poly(d))$ qubits to $n_T$, but makes the construction of 1MSD implementable on a 2D cluster state with only nearest-neighbor CZ gates. The first layer consists of N copies of cluster states implementing 1MSD (see Figure [...] with respect to $|T\rangle$. The total number of qubits of the graph state implementing zMSD is then given by [...] z is the last layer, therefore $N/d^{z-1} = 1$ and thus [...]
For a successful instance of zMSD, in order to arrive at Equation (16), choose [...] this implies that each copy of zMSD is composed of [...] logical qubits, as mentioned in the main text. Indeed, replacing $d^z = a\log(n)$, with a a positive constant, in Equation (B14) yields [...] while noting that $C\varepsilon^d < 1$ [57]. Equation (16) is therefore obtained for an appropriate choice of a or ε.
Now, we will calculate the probability $p_{szMSD}$ of a single successful instance of zMSD. We will assume, rather pessimistically, that only one string of non-adaptive measurement results of zMSD corresponds to a successful instance. By convention, we take this string to be the one where all the measurement binaries (after decoding) are zero. In this case, $p_{szMSD} \geq 1/2^{n_{NMSD}}$. (B17)
Note that the actual success probability is higher than the lower bound of Equation (B17), for two reasons. The first is that not all qubits of the graph state implementing zMSD are measured. Indeed, the output qubits of the last layer of zMSD are unmeasured and, in the case when zMSD is successful, are in the state $\rho_{T_{out}}$. The second reason is that some of the measurements correspond, in the successful case, to post-selections which in the protocol of [57] occur with probability greater than 1/2. Indeed, for small enough ε, the acceptance rate of the protocol of [57] is approximately 1.
Now, $\varepsilon_{out} = 1/n^{\beta}$, with β ≥ 4, and $n_{NMSD} = \gamma d^z$ (Equations (B15) and (B16)), with γ a positive constant. By choosing $\varepsilon = e^{-\gamma\beta\log(2)}/C^{1/d}$, and performing a direct calculation using Equation (B14), we get that $n_{NMSD} = \log_2(n)$. Therefore, by the lower bound above, $p_{szMSD} \geq 2^{-n_{NMSD}} = 1/n$. (B18)
One might ask why other MSD protocols, like those of [56,99] for example, do not work (using our techniques). The answer has to do with the number $n_{noisy}$ of noisy input T-states with fidelity 1 − ε with respect to an ideal T-state needed to distill a single T-state of sufficiently high fidelity $1 - \varepsilon_{out}$ with respect to an ideal T-state. $n_{noisy}$ is usually given by [56] $n_{noisy} = O(\log^{\gamma}(1/\varepsilon_{out}))$, where γ is a constant which depends on the error correcting code from which the MSD protocol is derived [59]. In the protocol of [57] (as well as those in [61]), γ ∼ 1, whereas for the Steane code [41], for example, which we used to distill Y-states in our 3D NN architecture, γ > 1. Having γ ∼ 1 in the protocol of [57] is what allowed us to get a $p_{szMSD}$ of the form of Equation (B18). On the other hand, the protocols of [41,56,99] have γ > 1, which leads (by similar arguments for calculating $n_{NMSD}$) to a lower bound on $p_{szMSD}$ of the form $1/qp(n)$, where $qp(n)$ is quasi-polynomial in n (if one requires $\varepsilon_{out} = 1/poly(n)$). Indeed, N is proportional to $\alpha\, n_{noisy}$, where α is the number of output T-states with error $\varepsilon_{out}$. Therefore, it follows that $n_{NMSD} = O(N) = O(n_{noisy})$, and that $2^{n_{NMSD}} = 2^{O(n_{noisy})}$, which is quasi-polynomial when γ > 1. Using our proof techniques, this would mean that we need a number of zMSD copies that is quasi-polynomial in n (which is greater than polynomial in n) to get a successful instance, thereby taking us out of the scope of what is considered quantum speedup [89]. Other protocols which we could have used are those of [60,61], which give γ ∼ 1, or that of [59], which gives γ < 1, albeit with a huge constant overhead of $2^{58}$ qubits [59].
It is worth explaining why the overall noise on the routed $\rho_{T_{out}}$ will still be local stochastic with constant rate. Firstly, note that $C_R$ is a constant depth Clifford circuit composed of single and two-qubit Clifford gates acting on outputs of zMSD circuits. Therefore, all local stochastic noise after each depth-one step of this circuit can be treated as a single local stochastic noise $E_d \sim N(m)$ with constant rate m at the end of this circuit, as in Equation (8) [21]. The outputs of zMSD circuits are acted upon by local stochastic noise with constant rate (as seen earlier, the overall noise on zMSD is local stochastic with constant rate, therefore the noise acting on a subset of qubits of zMSD, namely the outputs, is also local stochastic with the same rate [21]). This noise can therefore be incorporated as preparation noise (analogous to $E_{prep}$ in Equation (8)) with $E_d$ to give a net local stochastic noise $E \sim N(c)$ with constant rate c acting on qubits of $C_R$. After measurements, the unmeasured outputs of $C_R$ will also be acted upon by $E' \sim N(c)$, which is local stochastic with the same rate as $E$ but with smaller support, from the properties of local stochastic noise [21].
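Returning to the comparison of MSD protocols above, the following minimal sketch illustrates why $2^{n_{noisy}}$ is polynomial for γ = 1 and quasi-polynomial for γ > 1. It assumes the standard scaling $n_{noisy} = \log_2^{\gamma}(1/\varepsilon_{out})$ with $\varepsilon_{out} = n^{-\beta}$ and sets all hidden constants to one; the function name is hypothetical.

```python
import math


def effective_degree(n: int, gamma: float, beta: float = 4.0) -> float:
    """Return D such that 2**n_noisy == n**D, for n_noisy = (beta*log2(n))**gamma.

    Hidden constants in the O(.) are set to 1 (an assumption of this sketch).
    """
    log_n = math.log2(n)
    n_noisy = (beta * log_n) ** gamma
    return n_noisy / log_n


if __name__ == "__main__":
    for gamma in (1.0, 2.0):
        degrees = [effective_degree(2 ** e, gamma) for e in (10, 20, 40)]
        print(gamma, degrees)
```

For γ = 1 the returned degree stays equal to β for every n, so the number of copies is a fixed polynomial in n. For γ > 1 the degree itself grows with log(n), which is the quasi-polynomial behaviour discussed in the text.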
This relation simply counts the number of ways in which a relevant non-trivial error can occur. This type of error is restricted to errors induced by self-avoiding walks (SAWs) on the lattice, as argued in [34]. Here $n(L) = 6 \cdot 5^{L-1}$ counts all possible SAWs of total length L originating from a fixed point in the lattice [34]; $P(n) = poly(n)$ is the total number of fixed points (i.e., physical qubits) on the lattice, since SAWs can originate at any fixed point; and $prob(L) \leq (4q)^{L/2}$ is the probability that the minimal matching induces an error chain (SAW) of length L. This probability is calculated using the techniques in [34], but adapted to local stochastic noise (whereas independent depolarizing noise acting on each qubit was considered in [34]). The sum is over all non-trivial errors of length $L_m \leq L \leq poly(n)$. Noting that $P(n)$ [...]
Now, we want to find an estimate of the individual rates of preparation, gate application and measurement in our 3D NN architecture. Assuming that at each layer of the circuit, qubits are acted upon by a local stochastic noise $E \sim N(p)$ with 0 < p < 1 a constant, we get that $q \leq 4p^{4^{-D-1}}$ [21], where D is the total quantum depth of the RHG construction. Here D = 6: one step for preparation, one for (non-adaptive) measurements (assuming instantaneous classical computing, as mentioned earlier), and four steps for preparing the RHG lattice [100]. Setting q ≤ 0.0075 [41], we get that the errors in preparation, gate application, and measurement should satisfy $p \lesssim e^{-40000}$. Note that, for completeness, the threshold error rate for the distillation ε should also be taken into account. Usually, ε should be lower than some constant [68] in order for distillation to be possible, but this is accounted for in the chosen value of q [41].
$N_Y$ can be thought of as the number of logical qubits of a 2D logical cluster state needed to distill a logical Y-state of fidelity $1 - \varepsilon'_{out}$. As in Appendix B, if we do this MBQC non-adaptively, we only succeed with probability [...]
In our case, we need $O(n^5\log^2(n))$ logical Y-states of fidelity $1 - \varepsilon'_{out}$ in order to distill $O(kn) = O(n^4)$ T-states to be used in the construction of $C_2$. $O(n^5\log^2(n))$ is the number of qubits of $C_1$ when $k = O(n^3)$ (number of columns of $|G\rangle$). Therefore, by the results in Appendix B 3, we would need $C'_1$ to be composed of $O(n^6\log^3(n))$ logical qubits in order to distill, with exponentially high probability of success, enough ($O(n^5\log^2(n))$) logical Y-states with fidelity $1 - \varepsilon'_{out}$. Now, we will see why logical Y-states of fidelity $1 - \varepsilon'_{out} = 1 - 1/O(poly(\log(n)))$ suffice to distill $O(n^5\log^2(n))$ T-states with fidelity $1 - \varepsilon_{out} = 1 - 1/O(poly(n))$. In the construction of $C_1$ in Appendix B 2, replacing a perfect logical Y-state with a logical Y-state of fidelity $1 - \varepsilon'_{out}$, then measuring this state, results in applying the gate $HZ(\pi/2)$ with probability $\frac{1}{2}(1 - \varepsilon'_{out})$ instead of 1/2 in the perfect logical Y-state case. Therefore, the success probability of zMSD becomes in this case [...] as compared with Equation (B18) in the perfect logical Y case. By choosing, as we did, $\varepsilon'_{out} = 1/poly(\log(n))$, the above equation can be rewritten, for large enough n, as [...]
Thus, we have recovered Equation (B18), and can therefore use the same analysis as in Appendix B to distill logical T-states of fidelity $1 - \varepsilon_{out}$ in our 3D NN construction. This will allow us to construct the sampling problem of Equation (18) showing a quantum speedup.

Overhead
In this subsection, we will estimate the overhead (number of physical qubits in the 3D RHG lattice) of our 3D NN construction. As in [41], we will make use of the concept of a logical elementary cell. Each logical elementary cell is a 3D cluster state composed of λ × λ × λ elementary cells (each of which has eighteen qubits). Logical elementary cells can be either primal or dual. Each logical elementary cell contains a single defect. A defect inside a logical elementary cell has a cross section of d × d (perimeter 4d) on any plane perpendicular to the direction of simulated time. For our purposes, we will choose $\lambda = O(d)$ and $d = O(\log(n))$. This will ensure that the perimeter of the defect (4d) and the distance between two defects (λ − d) satisfy the conditions in Appendix D. In this picture, every logical qubit (composed of two defects of the same type) needs $2 \times 18 \times \lambda^3 = O(\log^3(n))$ physical qubits. In order to avoid distinguishing between primal and dual logical qubits (recall that computation is always carried out on logical qubits of the same type, but we need braiding between two defects of different types in order to implement some gates such as CNOT), we will assume each logical qubit needs four cells (two primal, two dual) to be defined, and therefore the number of physical qubits per logical qubit is $4 \times 18 \times \lambda^3 = O(\log^3(n))$.
Now, all we need to do is calculate the number of logical qubits we need in total. Preparations of logical qubits in the states $|+\rangle$, $\rho_{T_{noisy}}$, and $\rho_{Y_{noisy}}$, and applications of CNOT gates, can be done using a constant number of intermediate elementary logical cells [41]. Therefore, we will only need to count the total number of logical qubit inputs for circuits $C'_1$, $C'_R$, $C_1$, $C_R$, and $C_2$, then multiply this by a constant in order to get the total number of needed logical qubits including preparations and logical CNOT applications. As already calculated in the previous subsection, the total number of logical qubits of $C'_1$ is $O(n^6\log^3(n))$. The total overhead of circuits $C_1$, $C_R$, and $C_2$ is $O(n^9\,poly(\log(n)))$ logical qubits. This is obtained by the same calculations as done in our 4D NN architecture, but replacing $k = O(n)$ with $k = O(n^3)$ in order for the partially invertible universal set condition to be satisfied [15]. Finally, the routing circuit $C'_R$ (see Appendix C) needs $O(n^6\log^3(n) \cdot n^5\log^2(n)) = O(n^{11}\log^5(n))$ logical qubits; this term dominates the scaling. Multiplying $O(n^{11}\log^5(n))$ by a constant (to account for the overhead of preparations and logical CNOT gates), then by $O(\log^3(n))$ (to get the number of physical qubits), we get that the overall number of physical qubits needed is $O(n^{11}\,poly(\log(n)))$.
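The bookkeeping of this subsection can be checked with a short sketch that tracks each scaling as a pair (power of n, power of log n) and multiplies scalings by adding exponents. The names are hypothetical and the hidden constants are ignored; only the exponents quoted in the text are used.

```python
from typing import Tuple

Scaling = Tuple[int, int]  # (power of n, power of log(n)); constants are dropped


def times(a: Scaling, b: Scaling) -> Scaling:
    """Multiply two scalings n^a0 log^a1(n) * n^b0 log^b1(n) by adding exponents."""
    return (a[0] + b[0], a[1] + b[1])


def physical_qubits_per_logical(lam: int) -> int:
    """Four logical elementary cells of lam^3 elementary cells, eighteen qubits each."""
    return 4 * 18 * lam ** 3


# Exponents quoted in the text.
Y_DISTILLATION_LOGICAL: Scaling = (6, 3)  # logical qubits of the Y-state distillation circuit
Y_STATES_NEEDED: Scaling = (5, 2)         # logical Y-states that must be routed
PHYSICAL_PER_LOGICAL: Scaling = (0, 3)    # lambda = O(log n), so lambda^3 = O(log^3 n)

routing_logical = times(Y_DISTILLATION_LOGICAL, Y_STATES_NEEDED)
total_physical = times(routing_logical, PHYSICAL_PER_LOGICAL)

if __name__ == "__main__":
    print(routing_logical, total_physical, physical_qubits_per_logical(3))
```

The routing term comes out as $n^{11}\log^5(n)$ logical qubits and $n^{11}\log^8(n)$ physical qubits, consistent with the $O(n^{11}\,poly(\log(n)))$ estimate.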

FIG. 1. Graph state $|G\rangle$ of [39] together with the pre-specified measurements in the XY plane. This graph state is composed of n rows and k columns as seen in the main text (lower part of figure), and made up of two-qubit gadgets $G_B$ (green rectangles) zoomed in at the upper part of the figure (orange circle and arrow). Blue circles are qubits, blue vertical and horizontal lines are CZ gates, and the symbols inside each circle correspond to the angle in the XY plane at which this qubit is measured. The π/4 symbol is a measurement at an angle π/4 in the XY plane, similarly for π/2 and 0. In the original construction of [39], the red horizontal line is a long-range CZ; these are used periodically in $|G\rangle$ to connect two consecutive $G_B$ gadgets acting on qubits of either the first row or the last row of $|G\rangle$. Here, this red horizontal line is a linear cluster of twelve qubits measured at an XY angle of 0, in order to make the construction nearest neighbor. Note that this only adds single-qubit random Pauli gates to the random gates of [39], and therefore does not affect their ability to implement a t-design.

where H is the Hadamard unitary and $Z(\theta) := e^{-i\frac{\theta}{2}Z}$ is a rotation by θ around Pauli Z, then one can represent the outcome by a measurement result bit string $s \in \{0, 1\}^{n(k-1)}$, with associated resultant state

FIG. 2. Overview of the 4D NN circuit for our sampling problem. The overall circuit takes in noisy logical $|0\rangle$ and $\rho_{T_{noisy}}$ states, which can be prepared in constant depth [21,58] (up to Paulis, which can be traced through and dealt with efficiently classically after measurements of the logical qubits of our circuit, as described in the main text). It is composed of three underlying circuits: $C_1$, which implements the MSD, then $C_R$, which routes the good outputs to the final circuit $C_2$, which generates the graph state $|G\rangle$. The construction also calls on a classical computer (CC) to process correction operations, indicated by the dotted lines of different colours. The orange dotted lines serve to identify the successful MSD outputs and create the paths for routing them. The S gate is either an H gate or an identity gate, depending on the classical control. The black dotted lines are the measurement results of non-output qubits of $C_R$ and the measurement results of qubits of $C_2$, which are fed into the classical computer, which performs a post-processing to output the final sample {s, x} (Equation (18)).

FIG. 3. Constant depth circuit for our 3D NN architecture. Logical states are up to Pauli corrections due to non-adaptivity. The red box with an M symbol is a measurement either in X or Z. The circuit is shown up until $C_1$; the remaining part of this circuit is the same as that in Figure 2, with Z measurements replaced by M measurements, and with some ancilla qubits being initialized in $|+\rangle$ as well as $|0\rangle$. These slight changes are in order for the construction to be naturally integrated into the RHG framework [41]. Also shown in the figure is the additional interaction with the classical computer CC (ingoing and outgoing red dotted arrows) needed in order to identify the successfully distilled Y-states as well as construct the measurement pattern for the routing circuit $C_R$.
Theorem 4.1 in [57], the MSD circuit takes as input $O(d)$ qubits, where d is a positive integer, uses $O(d^2)$ noisy input T-states with fidelity 1 − ε with respect to an ideal (noiseless) T-state, and outputs $O(d)$ distilled T-states with fidelity $1 - O(\varepsilon^d)$ with respect to an ideal T-state (note that the ratio of the number of noisy input T-states to the number of distilled output T-states is ∼ d for large enough constant d [57]). Each time a noisy T-state is inserted it effects a noisy T-gate, inducing a so-called T-gate depth [57]. The depth of the entire circuit is $O(d^2\log(d))$, where $O(d)$ is the T-gate depth, and $O(d\log(d))$
(B12) where the $O(d^2\log(d))$ comes from the depth of the MSD circuit with long-range Cliffords, $O(d^2)$ is the depth needed to implement an arbitrary Clifford using H gates, $Z(\pi/2)$ gates, and long-range CZ gates, and the $O(d)$ is an overestimate and represents the number of nearest-neighbor CZ gates needed to give a long-range CZ. The total number of qubits of the cluster state implementing 1MSD is then $n_T = O(d) \cdot n_c = O(d^6\log(d))$. (B13)
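The size estimate of the cluster state implementing 1MSD can be sketched numerically as follows. This is a minimal illustration with all hidden constants set to one, and the function names are hypothetical. It multiplies the three factors described above (depth with long-range Cliffords, depth per Clifford, and nearest-neighbor cost of a long-range CZ) to get the number of columns, then multiplies by the $O(d)$ rows.

```python
import math


def columns_1msd(d: int) -> float:
    """Columns n_c ~ (d^2 log d) * (d^2) * d = d^5 log d, with constants set to 1."""
    msd_depth = d ** 2 * math.log(d)  # depth with long-range Cliffords
    clifford_depth = d ** 2           # depth to compile an O(d)-qubit Clifford
    long_range_cz = d                 # nearest-neighbor CZ gates per long-range CZ
    return msd_depth * clifford_depth * long_range_cz


def qubits_1msd(d: int) -> float:
    """Total qubits n_T ~ d * n_c = d^6 log d (O(d) rows times n_c columns)."""
    return d * columns_1msd(d)


if __name__ == "__main__":
    for d in (2, 4, 8):
        print(d, columns_1msd(d), qubits_1msd(d))
```

Since d is a constant in the construction, these numbers are constants as well; the sketch only makes the $d^5\log(d)$ and $d^6\log(d)$ growth explicit.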

FIG. 4. Part of the graph state implementing the circuit zMSD. Blue filled circles represent logical qubits in the $|+\rangle$ state, which when measured implement the Clifford part of the MSD protocol of Theorem 4.1 in [57]. The green filled circles are noisy input T-states $\rho_{T_{noisy}}$. Purple filled circles are the output qubits of the first layer of zMSD. When zMSD is successful, these qubits are in a state with fidelity $1 - O(\varepsilon^d)$ with respect to the ideal T-state $|T\rangle$. The orange lines are CZ gates. Note that the output qubits of the first layer (purple circles) are connected to the second layer at different positions by means of long-range CZ gates. These long-range CZ gates can be implemented in constant depth, since each acts on a distinct pair of qubits. Also, as mentioned in the main text of this appendix, these long-range CZ gates can be replaced by a series of SWAP gates, making this construction a constant-depth 2D construction with only nearest-neighbor CZ gates. Measurements consist of non-adaptive X measurements, Z measurements, as well as Y measurements. As described in the main text, we could equivalently perform all measurements in Z by introducing additional constant depth layers of H and $Z(\pi/2)$ gates.

FIG. 6. Routing via etching out from a grid. The purple vertices on the left represent outputs of the zMSD. a) The filled purple vertices are identified as the successfully distilled T-states from previous measurement results, and the paths to the outputs are identified. b) All other qubits are measured out in Z and the successful outputs are teleported via X measurements.