Quantum computation with realistic magic-state factories

Leading approaches to fault-tolerant quantum computation dedicate a signiﬁcant portion of the hardware to computational factories that churn out high-ﬁdelity ancillas called magic states. Consequently, efﬁcient and realistic factory design is of paramount importance. Here we present the most detailed resource assessment to date of magic-state factories within a surface code quantum computer, along the way introducing a number of techniques. We show that the block codes of Bravyi and Haah [Phys. Rev. A 86 , 052329 (2012)] have been systematically undervalued; we track correlated errors both numerically and analytically, providing ﬁdelity estimates without appeal to the union bound. We also introduce a subsystem code realization of these protocols with constant time and low ancilla cost. Additionally, we conﬁrm that magic-state factories have space-time costs that scale as a constant factor of surface code costs. We ﬁnd that the magic-state factory required for postclassical factoring can be as small as 6.3 million data qubits, ignoring ancilla qubits, assuming 10 − 4 error gates and the availability of long-range interactions.

Architectures for quantum computers must tolerate experimental faults and imperfections, doing so in the most practical and efficient way. One aspect of faulttolerance is the use of error-correcting codes, which provides a storage method for protecting quantum information from noise. To perform quantum computations, additional techniques are needed to ensure a universal set of quantum gates can be implemented fault-tolerantly. Most error correcting codes natively allow fault-tolerant implementation of gates from the Clifford group, a nonuniversal set of gates. Fully functional quantum computation is attained by adding the Toffoli or π/8 phase gate to the Clifford group. The prevailing proposal for performing these gates is to first prepare high-fidelity magic states, which are then used to inject a gate into the main computation. These magic states are needed in vast quantities, and their preparation requires a significant portion of a device to operate as a dedicated magic state factory [1][2][3]. Alternatives exist to the magic state paradigm [4][5][6][7][8][9][10], but it is unclear whether they will be feasible substitutes due to worse thresholds [11,12].
Magic state factories use several rounds of distillation protocols, and several directions have been explored [15][16][17][18][19][20] to improve efficiency over the original proposal which uses Reed-Muller codes [1]. Notably, an n → k block protocol takes n input magic states and output k at higher fidelity, with higher ratios of n to k generally offering greater efficiency [15][16][17]. These block protocols do require more complex circuits, but there has been limited investigation into the full resource cost of these protocols. One advance in this direction [3] has shown that block protocols can be realized in constant time, independent of k, by braiding defects in the surface code. Despite this, the same work found that efficiency improvements of block protocols were modest. However, all previous work has taken a very pessimistic estimate of the fidelity of these protocols, leading to an overestimated cost. We present several results improving resource costs and leading to a more optimistic outlook for realising magic state factories.
We present a method of realising the Bravyi-Haah block protocols [16] in constant time without relying on braiding operations, providing a fast implementation to other architectures. Braiding operations natively support control-NOT gates with multiple target qubits in constant time, which is the key step in the circuit reduction of Ref [3]. Our approach does not require availability of such powerful gates, but rather reduces time costs by borrowing concepts from subsystem codes [21][22][23][24] and notions of gauge fixing [7,9,25]. In subsystem codes, operations acting on many qubits are broken up into more pieces each involving fewer qubits, which we find enables greater parallelisation in realising Bravyi-Haah. Our proposal is especially applicable to distributed architectures for quantum computation, and provides these devices with an approach to magic state distillation that is low in ancilla and time costs.
We also show that magic state factories can make more use of block protocols than previously thought by using new techniques to reduce and calculate the global output error. Consider a protocol outputting K qubits with multi-qubit global error rate g . All previous studies have used the union bound, also called Boole's inequality, to assert g ≤ K where K is the number of magic states and is the average error rate of single-qubit outputs, obtained by tracing the qubit from the multi-qubit state. For an uncorrelated product state, ρ g = ρ ⊗K , we have g = 1−(1− ) K and so to leading order g = K −O( 2 ), so the union bound is a good estimate for small . However, the output of block protocols can be highly correlated. Boole's inequality holds for correlated states, but g can be much smaller than K . In distillation protocols such as Bravyi-Haah, correlations are produced because block protocols reduce the probability of single errors, but pairs or clusters of errors can go undetected and lead to a correlated error pair. If one error is present, it almost certainly has a partner. As such, use of the union bound has lead to systematic over estimation of noise in Bravyi- 3 Toffoli gates, which dominates the overhead. We realise each of these gates using single Toffoli magic state or seven T states in parallel [13], whichever proves optimal. In this algorithm, the Toffoli gates are all sequential, and so using time-optimal methods [14] the fastest possible runtime is 40N 3 t meas/ff where t meas/ff is the time taken to make a physical measurement and feed-forward the result to selectively perform a single qubit gate elsewhere in the quantum computer. The number of 'physical qubits in factory' neglects qubit cost associated with measuring surface code stabilizers, and so for many architecture this number will be doubled. The variable tsc is the time taken to perform single round of the parallel stabilizer measurements of the surface code -a process involving four CNOT gates, two single qubit gates and a measurement. We assume throughout that t meas/ff = 0.1 · tsc, which is reasonable for a distributed architecture such as ion traps. magic state factory. We take as our starting point the triorthogonal codes for producing π/8 magic states [16] and a method of producing Toffoli magic states [26,27]. Once we begin tracking correlations, it becomes apparent that quality control of the factory can be improved. Detecting an error in one part of the factory can, due to correlations, indicate a likely undetected partner error elsewhere in the factory. We introduce the notion of module checking whereby we discard magic states brought into disrepute by their correlated partners. The enhanced fidelity of module checking is both analytically estimated and found by numerical Monte Carlo simulations, with excellent agreement between these methods.
The specification of the distillation protocol is just one aspect of designing magic state factories. The size of magic state factories depends both on the distillation protocols and also the underlying error correction codes, and significant saving can be made by judiciously reducing the error correction code at earlier rounds of magic state distillation. The balanced investment of qubits at each round of distillation uses small surface codes for lowfidelity magic states and larger surface codes for highfidelity magic states. A small fraction of magic states in the factory will be high-fidelity and require much larger surface codes. Raussendorf et al. [28] argued that consequently magic state distillation can be achieved at a constant factor cost over surface code error correction, which we also observe and discuss.
Taking all factors into consideration, we present a blueprint for a factory capable of delivering enough magic states to solve large Shor's algorithm tasks, beyond the reach of classical computation, within a surface code quantum computer. Our results are summarized in Ta-ble I, where we look at both the spacetime overhead required to produce a Toffoli magic state and the physical footprint of the factory required to produce magic states at an average rate that can just keep up with the 'time optimal' surface code implementation of the algorithm [14], which we further discuss in Section V.

I. Realising block protocols
Many descriptions of distillation protocols are highlevel, leaving open many aspects of how to implement these protocols. To assess the full resource cost, we require a low-level description of distillation protocols in terms of elementary operations, such as one and two qubit gates, preparations and measurements. We call such a description a realisation of a protocol, and a given protocol can have different realisations with varying costs.
This section presents a realisation of the Bravyi-Haah protocols, giving explicit instructions presented in Fig. 1. The protocol is presented as a four step process acting on a collection of n noisy |T -states, where |T ∝ |0 + exp(iπ/4)|1 . The first step uses ancillas to measure operators composed of Pauli-Z operators, and the second step applies a correction dependent on the measurement outcomes. The third step uses ancillas to measure operators composed of Pauli-X operators, with only certain measurement outcomes kept. After a successful third step, the k output magic states are within an encoded state and delocalized across 3k+8 sites. The fourth step uses measurements to localise the output qubits to ...

Header
Step 1 Measure Z-type stabilisers Step 2 Apply correc ion 2.1) Iden ify qubits to be corrected from outcomes of step (1.3) as follows: All ancilla with "-1" outcome are linked to an odd number of correc ion qubits; All ancilla with "+1" outcome is linked to an even number of correc ion qubits; 2.2) Apply local gate to every correc ion qubit. See App. C. for details.  FIG. 1. Explicit circuit for realising Bravyi-Haah (3k + 8) → k block protocols for k = 2, 6, 10, 14, . . .. Squares indicate the (3k + 8) noisy |T magic states to be distilled, and circles represent ancilla qubits used to effect measurements on the magic states. Increasing k does not increase the number of time steps in the protocol, but increases the number of qubits involved in a block. As we increase k we add qubits to the tail end, with the protocol translationally invariant along the tail. We report that we have independently confirmed the validity of the protocol for k = 2 by full wavefunction simulation, which further confirmed that all single errors are detected and all two errors processes lead to outputs with correlated errors. specific sites. All these steps are detailed in the figure, and show how the multi-qubit measurements are broken down into ancilla preparation, two-qubit gates, and single qubit measurements. Observe in Fig. 1 that each multi-qubit measurement involves only four entangling gates, with each such gate designated a distinct colored link. Therefore, the measurement is a four qubit operator. This is called the weight of the operator. When using a single ancilla to measure a stabilizer of weight m, we need m time steps to perform the required controlledgate operations. The low, and constant, weight of our measurements gives the realisation a constant time cost. More commonly, the Bravyi-Haah protocol is presented as requiring only 2 measurements of X-type observables that have weight 2k + 4, implying a potentially expensive time cost. However, the concept of gauge subsystem codes provides a method whereby such complex measurements can be broken down into a larger number of simpler measurements [21][22][23][24]. In App. C we present a rigorous demonstration that the Bravyi-Haah code can be viewed as a subsystem code, and using a combination of gauge fixing and a cat state ancilla we create a circuit of depth 4 for both X and Z measurement sets. We call this general approach gauge-MSD, where MSD is short for magic distillation. We remark that constant time realisation was also found Fowler et. al. [3], but was tied to a monolithic braiding architecture.
The time complexity of our realisation is simply where t cnot is the 2-qubit gate (e.g. CNOT) time, t A is the gate time for single qubit A = (X + Y )/ √ 2 rotation, t prep is the single qubit preparation time, and t measure is the single qubit measurement time. Here, all operations are applied fault-tolerantly to logical qubits within an error correcting code. Therefore, the time scales are for fault-tolerant gates. We will assume throughout that logical CNOT gates are applied transversally and thus take time t cnot = t g + d × t sc where t g is the time for the physical gate, d is the code distance and t sc is the time for a round of stabilizer measurements. Therefore, for large distances the time is dominated by 8dt sc . We will also take this as the time cost of a merge operation [29,30] which we use to project two ancillary qubits into the even or odd parity subspace. We assume entangling gates can be performed in parallel, but a qubit can only participate in one gate at a time. We allow entangling gates to be long-range as is feasible within distributed architectures for quantum computing [31][32][33][34][35][36][37][38], though this may be relaxed at only a modest increase in resources.
Our realisation uses a number of ancilla qubits, so that in addition to the n = 8 + 3k qubits being distilled we also use n anc = 3k + 6 ancilla qubits, giving n tot = 6k + 14 logical qubits in total. These ancillas appear as circles in Fig. 1. With these logical qubits encoded in a distance d toric code, the total qubit cost is N tot = (6k + 14)d 2 . Therefore, the total space-time cost amounts to N tot t block ∼ (48k + 112)d 3 , which compares well against other approaches (see App. A).
When using the toric code A gates are also not naturally fault-tolerant and thus require their own process of state distillation and injection. The time t A to implement such a gate is therefore the time taken by the logical CNOT which teleports the gate into the computation plus any Clifford correction that is required, although this can be rolled into a later Clifford operation. The resources required of the state distillation of the A are far less than that of the T -gate, and as the ancilla resource is reusable for many A gates [17,39]. As such we neglect the overhead of these gates as a small additional overhead to the main process of |T distillation.

A. Blocks, branches and modules
We begin by introducing some helpful vocabulary for describing magic states factories. Efficient distillation uses n → k block protocols, that take n noisy |T ∝ |0 + exp(iπ/4)|1 and with some probability output k states of a higher fidelity. Such a process we call a block, and the previous section described the details of the inner working of such a block. Now we treat each block as a black box with known relations between input and output, and consider how these blocks are composed together.
Distillation protocols have many levels forming a treelike structure with many branches that merge at points we call modules, shown in Fig. 2. Branches contain many qubits, which are potentially correlated. However, the inputs to block protocols must not be correlated, so each qubit in a branch must be fed into a different block. Therefore, as we enter a module, a branch of B l qubits is split up so that each qubit enters a different block. If each block implements a n l → k l protocol, then the whole module can be thought of taking B l n l inputs to B l k l outputs. This entails that B l+1 = B l k l and that each module has n l branches feeding into it. Initially, branches are single qubits, B 1 = 1, and so B l = 1≤j<l k j . This module-branch structure is common to all proposals to date. Such explicit terminology has not previously been introduced but rather been left as an implicit consequence of statements about correlation avoidance. Establishing clear vocabulary about this structure is important as we delve into the effect of postselecting at different levels on this structure. Previous protocols have considered whether individual blocks succeed or fail, we call this block checking. Below, we outline why it can be advantageous to postselect on the level of the modules, which are collections of blocks. We propose an additional quality check, so that the whole module is discarded whenever any of its blocks fail. We call this module checking.
A block will always detect a single incoming error, but might fail to detect a pair of errors. When a block detects an error, it indicates the presence of damaged branches, and since errors cluster together within branches, this increases the likelihood of errors in other blocks throughout the module. Consider when two branches fail each with a correlated error pair, the first branch sends damaged qubits to blocks 1 and 2, and the second branch sends damaged qubits to blocks 2 and 3. Since blocks 1 and 3 each received a single erroneous qubit, they will detect them. But block 2 receives a pair of errors, so they may go undetected. A simplified illustration of this process is shown in Fig. 2. Module checks improve fidelity by preventing these processes from degrading the output fidelity. Even with module checks, it is possible for a pair of corrupted branches to go undetected, but both branches must carry exactly the same pattern of errors, which is a very rare occurrence.

III. The G-matrix formalism
Bravyi and Haah introduced a matrix description of their n → k block protocols for |T 0 state distillation. The so-called G-matrix is split into two submatrices G 1 and G 0 , with G 0 describing the postselection criteria and G 1 accounting for how input qubits are related to output qubits. For distillation of |T 0 magic states, the G matrix must have the property of triorthogonality that the reader can learn about in Ref [16]. Rather here we give a pragmatic account of the the block's performance. We use |T 0 ∝ |0 + e iπ/4 |1 for a magic state and |T 1 = Z|T 0 for the orthogonal state with a Z error. A pure multi-qubit state |T x1 |T x2 . . . |T xn , we concisely represent with the vector x = {x 1 , x 2 , . . . x n }. If we apply a block protocol to state x, the block succeeds (detecting no errors) if G 0 x = 0 (mod 2) where the maths is performed modulo 2. When successful, the The a tree structure of many rounds of distillation, with branches (directed black lines) that merge at branching points that we call modules. The thickness of the branch increases with each round. The figure shows a fictitious scheme where n l = 3 and k l = 2 for all rounds. (inset) The structure within a module. Incoming branches contain many qubits, here this is shown to be 4. These qubits undergo a permutation σ and are feed into an instance of a block of a distillation protocol shown as a square. Here the 3 incoming branches carry 4 qubits and so we need 4 instances of a 3 → 2 protocol. We use a fictitious protocol to keep the numbers low enough to illustrate clearly. Each of the 4 blocks output 2 qubits and these are merged into a branch of 8 qubits feed into a later module. A pernicious error pattern is shown in red, which is detected in 2 of distillation blocks, marked with crosses, but goes undetected in a third block.
block outputs a state y = G 1 x. Noisy magic states will be some probabilistic ensemble over x, with probability Pr(x). The protocol will detect no errors and output state |T y1 |T y2 . . . Pr(x). (2) The total success probability is captured by the sum over all possible output states, so Pr(x) Pr(x).
Conditioned on success, the normalized distribution on output states is Pr out (y) = Pr unnorm (y)/P suc . Given an explicit form for G 0 and G 1 , this completes the black box picture of block protocol performance. These formulae form the basis upon which we build both our analytic and numerical analysis in following sections.
The G-matrix formalism of Bravyi and Haah has been significantly generalized [40,41]. This extension provides protocols that convert noisy T magic states into another species capable of injecting complex multi-qubit circuits. Included in this framework are protocols, based on G-matrices, which provide resources for implementing Toffoli gates. Protocols independently proposed by Jones [27] and Eastin [26] realized error suppressed Toffoli gates, and here we consider a variant based on Gmatrices that we discuss further in App. E. All three variants perform identically when we use block checking. However, with the G-matrix formalism we can again use module checking to track correlations and achieve superior error suppression. This is just one additional application of module checking, the technique can be deployed in conjunction with the general class of protocols introduced in Ref. [40,41] IV. Analysis of module checking We present a method of tracking the leading order errors, accounting for correlations, through many rounds of module checked protocols. At each level of distillation the protocol is characterized by a function η that summarises how well it tolerates leading order errors Definition 1 For every distance 2 G-matrix code that distills n → k qubits, we define a function η : where | . . . | is the weight |y| = j y j , and # counts the number of elements in a set {. . .}. In other words, the value η(y) counts the number of inputs x such that: 1. they are weight 2 (formally (|x| = 2)) ; and 2. they are undetected by the protocol (formally G 0 x = 0); and 3. give y as output (formally G 1 x = y).
Since η(y) counts the number of lowest-weight errors leading output y, the total error rate for 1 round of distillation can be simply estimated as Counting errors over many rounds is a more subtle problem, but we find that η still provides sufficient information to perform this calculation. If each round can even use a different protocol, we label the corresponding function with a subscript. We now state our key result Theorem 1 Consider L rounds of distillation with module checking, with associated functions η 1 , η 2 , . . . η L . Such a protocol outputs a multi qubit magic state where the l th level modules succeed with probability The leading coefficient C l for a variety of protocols with 2 and 3 levels of distillation. For clarity we also show C l in the large block limit (k → ∞). When we write BH k , we implicitly assume k > 2, as the results differ slightly for the k = 2 case. The final column shows the ratio between the union bound estimate made by utilising the reduced error rate on a single qubit BH made by Bravyi and Haah and the corresponding estimate of the global error rate given by C l 2 l . It can be seen that the benefit (in error rate) of module checking scales with both k and the number of rounds of distillation. and output states have global infidelity where we have made use of the following The most important quantity in the theorem is C l . After two rounds C 2 is simply ( y η 1 (y) 2 ) · ( y η 2 (y)). After l rounds, C l is a product of l terms. One may approximate the theorem to leading order (l) g ∼ C l 2 l . However, our later numerical investigations found that such an approximation was too coarse, and we really need the slightly more complex form given in the theorem. We postpone proof of this result to App. F.
For the Bravyi-Haah protocols it is easy to verify the following so that for k > 2 we have where the binomial factor arises in counting the number of y where |y| = 2. When k = 2, we have a special case as then the |y| = 2 terms and |y| = k terms are the all same, and so For the Toffoli protocol we have η Tof (y) = 4, y = 0.
and so Given expressions for η(y) m we can compose these protocols anyway we wish and obtain an estimate of the global error rate as given in Thm. 1. We perform numerical Monte Carlo simulations by sampling from the probability distribution of the raw magic states, where Pr(x) = |x| (1 − ) N −|x| , and track its evolution through the magic state factory. To our knowledge, this is the first such numerical investigation of correlations inside a magic state factory.
Our simulations track the progress of potentially damaging input error configurations of the initial raw magic states though the factory. Direct Monte Carlo simulation is possible for 1-2 rounds of distillation, but with 3 or more rounds the failure events become so rare that brute force simulation fails to provide statistically meaningful data. Rather we use a novel "rare events" technique that preselects cases with two or more corrupt branches. A full description of the method employed can be found in Appendix G. We thus investigate the performance of a factory which makes use of module checking for a range of raw magic state error rates between 0.1% and 1%. We find that the leading order analytic estimates match well with the numeric results, with the difference between the two being of the order of a few percent in the investigated parameter regime. In fact wherever the block protocols are involved, if k ≤ 14 we find the percentage difference in between the numerical simulation and analytic estimate is < 10%. This discrepancy between the analytic estimate and numerical simulations is not visible on a log-log plots presented in App. G.
The cost of a protocol is the average number of raw magic states consumed to produce one higher fidelity magic state. For an n → k protocol, which takes in states with error p, this is For l rounds of distillation we have C l (p) = . 3 compares the cost of a Bravyi-Haah magic state factory, with block checking and module checking, showing the minimum cost achievable for given output error. For output error rate and success probability we use known expressions for block checking, and for module checking the success probability and global error given by Thm. 1 with an estimate on the reduced error rate on a single qubit given by the global error rate estimate divided by the total number of qubits in this factory's output.
We find that module checking is superior to block checking for a large proportion of target error rates, and can use up to four times fewer raw magic states in some regimes. However, near a transition from j to j+1 rounds of distillation, module checking loses it advantage and may even be slightly outperformed. The best error rates that can be achieved for a given number of rounds use low k block codes, for these the benefit in global error rate of module checking over block checking is smaller (see Table  II) while the success probability is of course much inferior. Above the transition higher k values are being used, for which the success probability is much lower, and this washes out the benefit of the superior error suppression over block-checking, which has a higher success probability. In these regimes, as one might expect, module checking is not the optimal approach. In those regions between the transitions, module checking allows use of higher k protocols, which are more efficient, to achieve the error rates of lower k block checked protocols.

V. Factory overhead analysis
While the 'cost' is a useful guide to the performance of a distillation protocol, it fails to capture several important features of real magic state factories. The very purpose of magic state distillation is to supplement the shortcomings of a particular error correcting code, which is to say that a magic state factory is implemented at the level of logical qubits, with the quantum information already encoded. As such, the more relevant question is not how many noisy input magic states are required per output state, but rather the number of the physical qubits that will build up such a factory and the rate at which the distilled magic states are produced. Both of these numbers are of key importance to determining the size of a quantum computer and the run time of an algorithm. A single number, the 'spacetime overhead' of the magic state factory captures these both as a figure of merit, which was also studied in Ref [3]. In this section we present a comparison based on the spacetime overhead of implementing a magic state factory in a surface code quantum computer, utilising module checking, where appropriate, and our novel implementations of the Reed-Muller, Bravyi-Haah and Toffoli protocols.
We consider a number of issues in this estimate of the footprint of a magic state factory, including 1. balanced investment: the use of smaller surface codes during early rounds of magic state distillation; 2. clock rate zoning: cycling through distillation attempts faster during early rounds of magic state distillation; We will assume throughout that we use the method outlines by Li in Ref [42] to inject the initial raw magic state into a logical surface code. We will therefore assume throughout that the initial error rate on a magic state before distillation is in = 0.4p g . We also use the 'rotated lattice' surface codes [29], such that a distance d surface requires d 2 physical data qubits. Of course a practical surface code requires the use of physical ancilla qubits to make the stabilizer measurements of the code, we leave this as an extra multiplicative factor to be applied to our overhead calculated here, as different physical realisations have different requirements. We estimate the surface code distance required to protect a logical qubit up to error rate P l using the relation . Given an algorithm, and some implementation of it, the number of magic states required is determined in advance. For a given distillation protocol we must then determine whether its magic state factory is capable of producing magic states of fidelity great enough that there is a high probability that none of the magic states required for the algorithm fails. Thus we set a target global error rate that our factory must achieve.
where P suc,alg is desired success probability of our algorithm (i.e. the probability that every non-Clifford gate works) and N is the number of successful iterations of the factory needed to produce the desired number of magic states. For example, a magic state factory which utilises 3 rounds of Bravyi-Haah distillation with a k value of 10 in each round will produce 10 15 |T states after 10 12 successful iterations. A 90% success probability for the algorithm as a whole then implies target = 1.05 × 10 −13 . We must then check that the factory is capable of producing an output of this quality. If the factory is module checked then this '10-10-10' factory has a global error rate glo = 2.3 × 10 −16 , making this a valid protocol. However the estimate of the global error rate of a block checked factory gives 10 3 × BC ∼ 10 −11 so this would not be a valid factory for this task.

A. Balanced investment
Once we have established that a factory is valid we can calculate the spacetime overhead per magic state that it produces. This is done by: calculating the distance of surface code required at each level of distillation d i ; the The cost C of the Bravyi-Haah protocols utilising both block checking and module checking. It can be seen that only around the transitions to an additional level of distillation is block checking very slightly preferable.

Round 1
Round 2 FIG. 4. The concept of balanced investment. During the initial rounds of distillation smaller surface codes can be used to encode the logical information. The factory requires only logical surfaces of the distance corresponding to the target error rate of the round in which that qubit is involved. This target in each round will depend on the final error target, the protocol being used to achieve it and the error rate of the raw state.
length of time in surface code cycles that each round of distillation will take T i and the number of logical qubits Q i , including logical ancillas, required at each level of the factory. Determining the distance required at each level requires slightly different methods depending on whether module or block checking is used. To obtain the benefits of module checking we cannot make full use of balanced investment at the lower levels, as to do so would inject noise at a rate comparable or greater than the rate of correlated error. We can estimate the total global error output in the output of a n → k protocol as tot ∼ glo + k enc , where enc is the random error rate resulting from size of the logical encoding chosen. As such we determine the distance of the code at the intermediate levels based on a desired error rate enc to be 0.1 · glo /k to ensure that the error due to each qubits finite encoding is much less than the reduced error ∼ glo /k on that qubit. The final output of the factory can then encoded according to the target .
For a block checked factory (or the original 15-to-1 Reed-Muller protocol) we do not have worry about 'protecting' correlated errors. This means we can work backwards from our error target to determine an efficient balanced investment of qubits. For a local target error of p top = 10 −14 the top level of distillation needs vP L (d, enc ) < 0.1 × 10 −14 where v is the spacetime volume of a single block of distillation; which is the number of surface code qubits in the block multiplied by the number of times they undergo d rounds of surface code stabilizer measurements. Again we use a distance corresponding to a lower error rate by a factor of 10 to suppress any error injected by the logical circuitry above that which is left over after distillation. The distance required of the next level down can then be determined by, in this example p i−1 = 3 p top /35 for the 15-to-1 protocol, which gives distance d i−1 . And so on until p i > p in . The length of time T i for each round of distillation can be simply determined by the protocol used and how many times we attempt it before abandoning the round. We assume that measurements can be completed in one time step, and the time scale of the CNOTs, preparation and A gates is dominated by the requirement for d rounds of stabilisation afterwards. Therefore we let the time for each of these be d × t sc . The time taken to implement the distillation protocol in round i is then,

B. Clock-rate zoning
Balanced investment also allows another advantage. In the context of surface code computing, the distance of the code is not only relevant for the spatial dimensions of the computer, but also the execution time. A surface code of distance d must undergo d rounds of parallel stabilizers measurements to protect from measurement errors. As such, the time taken for a logical operation is proportional to d (except those which can be handled in software) and therefore using balanced investment the initial rounds of distillation will take less time. In the hypothetical case that a round of distillation leads to a squaring of the input error rate, this would correspond to a doubling of the code distance required by the next round (by the exponential suppression of error with distance of a sub threshold surface code). Therefore one can repeat the first round of distillation twice in the time taken for the second round of distillation to be completed. This increases the chance that all the necessary magic states for the second round will have been produced in time for the next round of distillation, without the need to decrease the rate of the factory. As such the time taken for a round to complete in the distillation factory is T i = τ i t i where t i is the number of attempts that you allow at each round. In all our simulations, any 'idle time' that the qubits experience is counted towards the spacetime cost.

C. Numerical simulations
We have therefore arrived at the expression for the full spacetime volume V in qubits-rounds occupied by a factory where k i labels the magic states protocol used at each round and P i is the probability that round i of distillation will succeed in producing enough magic states to feed the next round. All rounds must succeed for the factory to successfully output r i k i magic states. As discussed V is a function of several variables, the raw magic state error rate in , the error of gate operations p g , the total states required N , the protocol chosen {k i }, the repetitions allowed in each round {t i } and the probability with which you wish the algorithm to succeed P suc,alg .
In practice we calculate V for one, two and three rounds of distillation using combinations of the Bravyi-Haah, Reed Muller and Toffoli protocols with our proposed implementations, with both block and module checking, and a variety of combinations of t i . We then search for the most spacetime efficient method of producing either T or Toffoli magic states, given a p g , assuming To produce T magic states at a 'time-optimal' rate the factory must, on average, produce states at a rate equal to the fastest rate at which they can be used in sequence by an algorithm. In this paper, where we have assumed that tsc = 10t meas/ff , this means 10 magic states must be produced on average every tsc to keep up with the time optimal implementation of the algorithm. As this figure makes clear, it is indeed possible to produce a given number of magic states faster (slower) than this using more (fewer) qubits. Each data point represents one possible magic state factory given the input error and target number of 10 16 T magic states, only the factories near the boundary between possible and impossible factories are shown. Not shown in the lower-right of each graph are myriad other possible factories of lower rate and higher overhead. As seen clearly in panel (a) doubling the size of the factory can allow one to more than double the rate of magic state production when the larger factory allows the use of the higher k (and therefore higher rate) block codes. This point is less evident in panel (b) where only two rounds of distillation can be sufficient and the higher k block codes are not necessarily optimal. in = 0.4p g , the number of magic states desired and an arbitrary choice of 90% for P suc,alg .
The results of these simulations suggest that module checking can provide an improvement of a factor of 3 in the spacetime overhead in certain parameter regimes. We also see that in some regions which, as for the cost analysis, correspond to areas in which there is a transition from i to i + 1 rounds of Bravyi-Haah and it can be in fact detrimental by a small amount.
We use the numbers we have generated to estimate the size of a magic state factory required to perform some post-classical factoring tasks using Shor's algorithm. Our results are summarized in Table II, in which we approximate Shor's algorithm as modular exponentiation -as this is by far its most expensive part -and choose the minimum Toffoli gate-count implementation, which has Toffoli-count and -depth equal to 40N 3 for an N bit number [43]. We then determine the minimum possible spacetime overhead per magic state for this task and also the smallest possible factory -in terms of physical qubits -that can produce all the magic states necessary while keeping up with a time optimal quantum computation [14].
The smallest possible factory is not necessarily the most spacetime efficient factory possible: these factories tend to use larger k blocks which can make the factory formidably vast (cf. Fig. 2), but able to produce more qubits in the same time as lower k protocols. However, these may well produce magic states much faster than required by the fastest possible implementation of the algorithm. Fig. 6 shows that it is possible to use fewer qubits than required by the time optimal implementation and perform your computation at a reduced speed. Equally it is of course possible to increase the size of the factory to produce magic states at a faster rate. In this case though the extra qubit overhead is effectively wasted, unless the computer is performing many calculations in parallel. We find that all the magic states required for a time-optimal factorisation of a 1000 bit number can be produced with a surface code magic state factory of 5.6 million 'data qubits', if the infidelity of operations on physical qubits is 10 −4 . This is based on the cost of d 2 physical qubits to store the information in the rotated lattice surface code [29]. Although, for many architectures this number must be doubled to provide ancillas responsible for syndrome extraction. In this case the physical qubit overhead would be ∼ 11 million.
A summary of the time and space overheads for some example Shor tasks can be found in Table I. We selected Shor's algorithm as a benchmark as the number of non-Clifford gates required has been well studied and these results have been used in previous analyses. We note that due to the large resource overhead that we have demonstrated, early quantum computers may focus on other problems, particularly in quantum chemistry. A recent analysis of some such problems [44] demonstrates that they are solvable with lower overheads than we find here, albeit that work makes more optimistic assumptions of gate times and fidelities.
Of significant interest is how the cost of magic state distillation compares to the surface code overhead. Raussendorf et al. (see Sec. 6.2 of Ref. [28]) were the first to note that using balanced investment in a magic state distillation leads to a constant factor overhead compared to the CNOT gate. Our numerical results in Fig. 7 Space me cost show a ratio between T -gate cost and CNOT cost in the range ∼ 150 − 310 when 10 10 < N < 10 30 . This ratio is much smaller than estimated by Raussendorf et al. who did not make use of Bravyi-Haah distillation routines. A similar ratio can be extracted from the data tables provided in Ref. [3], though the numbers are not directly comparable. For instance, we also count the spacetime volume due to idle qubits, while they wait for distillation circuits to succeed.

VI. Conclusions
We proposed the notion of gauge-MSD, which is faster and uses fewer ancillas than previous realisations of Bravyi-Haah magic state distillation. We further introduced module checking as means to exploit correlations, and found it gave an additional factor ∼ 3 reduction in some parameter regimes. Fowler. et al [3] considered realising Bravyi-Haah using braiding, and found that Bravyi-Haah only offered a modest ∼ 3 improvement over the first magic state proposal that used Reed-Muller codes [1]. Therefore, our gains are comparable to, and build upon, other advances in the field. The work of Bravyi and Haah predicted a much greater improvement because they quantifed cost by the expected conversion efficiency of raw to high-fidelity magic states. In a fully costed analysis, as we perform here, error correction costs overwhelm and dominate the cost of magic state factories. We saw that an efficiently designed fac-tory using balanced investment is entirely limited by the surface code cost, and so refinement in distillation protocols can only offer constant factor improvements.
It is likely that this analysis represents an overestimate of the spacetime overhead of the implementations of the distillation circuits we describe. We have assumed the need for d rounds of surface code measurements after every two-qubit gate. However, it is not clear that this is necessary when performing transversal gates. In the implementation of Bravyi-Haah (see Fig.1) it may prove feasible to perform d rounds of error correction only after, say, the completion of each of the 4 steps described -reducing the time overhead from 11t sc d to 4t sc d [45]. Our cost analysis here could thus be further developed by considering the effect on the performance of the underlying surface code if multiple transversal operations are performed between rounds of error correction. However such simulations lie beyond the scope of this paper.
3D gauge color codes [7,25] and other recent ideas do not require magic states. But they have their own hidden costs. For 3D gauge color codes, spatial overheads scale as ∼ O(log(N ) 3 ) and time overhead as O(1). Using balanced investment and surface codes we see similar asymptotic scaling of resources. However, current evidence indicates an order of magnitude worse phenomenological threshold for color codes [11,12] compared to the phenomenological threshold for the surface code. Although a full circuit-based threshold has not yet been determined, it is unlikely to challenge that of the surface code due to the higher weight stabilizers required. This points towards 3D color codes requiring physical error rates of below 0.1%. Resource costs are heavily influenced by proximity to threshold, and so 3D color codes seem to require significantly lower physical error rates before they start can compete with surface codes augmented by magic states. Therefore, with current technology and fidelities, known schemes for avoiding magic states are a false economy. An additional benefit of the magic state paradigm is that it can also eliminate the additional burden of gate-synthesis costs by preparing exotic magic states [40,41,[46][47][48][49].

VII. Acknowledgments
Earl Campbell is supported by the EPSRC (grant EP/M024261/1). We thank Mark Howard for comments on the manuscript. We acknowledge support from the NQIT quantum hub in facilitating this collaboration. The authors would like to acknowledge the use of the University of Oxford Advanced Research Computing (ARC) facility in carrying out this work. http://dx.doi.org/10.5281/zenodo.22558.

A. Comparison with braiding
Here we discuss how our results compare with prior work on braiding defects in the surface code. It has been shown that Bravyi-Haah can be realized in constant time [3] assuming the architecture supports constant time implementations of multi-target CNOT gates (in time t MT−cnot ). This has time cost where t inject is the time to inject a magic state into the circuit, and we can infer t inject = t cnot + t measure .
In the standard circuit model, multi-target CNOT gates do not take constant time to implement. In the braiding picture, using ancillary defects, this is possible. The "time" cost of braiding 12 such gates is 12 × 1.25 × d × t sc . The total space-time cost was reported as (96k + 216) so-called plumbing pieces, which converts into ( 5 4 ) 3 (96k + 216) qubit-rounds. It has been recently shown that lattice surgery also supports multitarget CNOT gates in constant time [50]. Though the Bravyi-Haah protocol was not considered in this setting, one can infer a lattice surgery time cost also scaling with ∼ 12d, but with the qubit cost is not currently unknown.
The above discussion implies a modest space-time saving of using gauge-MSD in a distributed architecture rather than braiding in a nearest neighbour picture. We remark that gate times and qubit expense will vary on a much greater scale between different hardware platforms. In particular, the long-range gates of distributed schemes are often much slower, with photonics protocols impeded by photon loss and the potential need for entanglement purification [31][32][33][34][35][36][37][38].

B. Formal tools a. Stabilizer Bravyi-Haah codes
Let us begin with some basic concepts from code theory and related notation. Stabilizer codes are subspaces described by an abelian group called the stabilizer S, which is a subgroup of the Pauli group. The projector on to the codespace is Π ∝ s∈S s, so that sΠ = Π for all s ∈ S. There always exists a minimal set of operators {S 1 , S 2 , ...S m } that generate the group, which we denote by S = S 1 , S 2 , ...S m . For the CSS codes, these generators can be chosen so that they are all either X-type or Z-type, as we define shortly. If g is a binary vector of length n we use Z[g] = ⊗ n j:[g]j =1 Z j and similarly X[g] = ⊗ n j:[g]j =1 X j . We say Z[g] are Z-type operators, and X[g] are X-type operators. Therefore, a CSS code Commutation of X[g] and Z[f ] is equivalent to (f, g) = 0 where we use the inner product (f, g) := f · g (mod 2).
Bravyi and Haah introduced the notion of a G matrix which is a binary matrix composed of two submatrices G 1 and G 0 . We label the rows of G as {g 1 , . . . , g b , g b+1 , . . . g m } where the rows {g 1 , . . . , g b } belong to G 0 and the rows {g b+1 , . . . , g m } belong to G 1 . These matrices define a CSS code as follows. The rows of G 0 are some set {g 1 , . . . , g b } that specify the X-type generators {X[g 1 ], . . . , X[g b ]}. The Z-type generators are given less explicitly. Denote G ⊥ to be the binary vector space orthogonal to both G 0 and G 1 , that is G ⊥ := {g : (f, g) = 0; ∀f ∈ G 0 , G 1 }. Note that Bravyi-Haah required that G 0 ⊂ G ⊥ . We define G ⊥ to be some This completes the description of the stabilizer codespace, though we also need to know how information is stored within the subspace. We have that the X operator for the k th logical qubit is X[g k ] where g k is a row of G 1 , and so b + 1 ≤ k ≤ m. These are representatives of the logical operators, with equivalent logical operators differing by only a stabilizer. For Bravyi-Haah protocols all rows in G 1 have odd weight, so we may also take that the Z operator for the k th logical qubit as Z[g k ] where g k is the k th row of G 1 .

b. Subsystem Bravyi-Haah codes
Given a G-matrix we define a subsystem code with stabilizerS := Z[g 1 ], . . . , Z[g b ], X[g 1 ], . . . , X[g b ] where {g 1 , . . . , g b } are the rows of G 0 . Notice that now there are equal numbers of X and Z stabilizers and they both correspond to the rows of G 0 . Bravyi and Haah considered a class of matrices obeying triorthgonality conditions, which require that G 0 ⊂ G ⊥ . Therefore, we have thatS ⊂ S, with the subsystem code having strictly fewer stabilizers than the original Bravyi-Haah code. We denote the subsystem projector asΠ ∝ s∈S s. We further take the logical operator for the subsystem code to be identical to those of the original Bravyi-Haah code. This leaves some degrees of freedom as neither stabilizers nor logical operators. We define the gauge group Let us recap. Our subsystem code is defined by its stabilizerS and gauge group S g , whereas the original Bravyi-Haah code has stabilizer S, and these groups sat-isfyS ⊂ S ⊂ S g . However, S g is inflated in size compared to S and is no longer abelian. Furthermore, one can verify that S g does not contain any logical operators as follows. First, note that Bravyi-Haah use triorthogonal (also called triply even) matrices where for any f, g ∈ G we have that (f, g) = 1 if and only if f = g and f ∈ G 1 .
As logical operators we take X[l] and Z[l] for each l in G 1 . From the triorthogonality of G we see that X[l] and Z[l] anticommute, but X[l] and Z[l ] commute when l = l . To properly describe a subsystem code, where measuring the gauge operators does not damage the logical qubits, we require that the logical operators are not elements of the gauge group S g . Recall the gauge group is defined by vectors that reside in the dual code G ⊥ . Therefore, every gauge operator must have vanishing inner product with every row in G. However, l ∈ G, and (l, l) = 1, so l is not in the dual space and the logical operators are not gauge operators. This completes our proof that the logical operators indeed lie outside the gauge group.

C. Realising the Bravyi-Haah protocols
There are many routes to realising magic state distillation. Assuming perfect Clifford operations, different realisations suppress errors equally, but differ in terms of temporal depth and required ancillas. Many of these potential realisations have only been sketched, without a complete assessment of resources involved. Here we introduce a new method particularly suitable for architectures implementing logical gates via transversal operations or lattice surgery [29]. Conceptually, we are inspired by notions of subsystem codes and gauge fixing techniques, and so call our approach gauge-MSD. We consider only even k with k ∈ {2, 6, . . . , 4m + 2, . . .} as then the Bravyi

Outline of protocol
Here we present an outline of gauge-MSD, with details of how to realise multiqubit Pauli measurements postponed until the next section. First we specify some notation. Refer back to App. B for definitions of the Gmatrix and G ⊥ -matrix. Let R be a binary matrix such that G ⊥ · R = 1l (mod 2), which is ensured to exist by virtue that G ⊥ is full rank. Furthermore, let M be a binary matrix such that M · G ⊥ = G 0 (mod 2), which must exist since G 0 is in the span of G ⊥ . Explicit examples of G ⊥ , R and M will be given in the next section. Measurement outcomes will be recorded in binary: 0 for " + 1" eigenvalues and 1 for " − 1" eigenvalues. where the numbers above the columns correspond to the elements in sets Z and X . Lastly, we will also make use of a k-by-dim(G ⊥ ) matrix denoted Q that satisfies G 1 = W + QG ⊥ where W has W j,6j+3 = 1, W j,1 = W j,2 = 1 and zero everywhere else. An example Q is given in the next section. We now state the protocol. 5. Measure qubits in set X in the X basis, recording outcomes as m X ; Simultaneously, measure qubits in set Z in the Z basis, recording outcomes as m Z .
6. Qubits in X and Z are discarded, while qubits in set O are retained as output as qubits (1, 2, . . . , k); or update Pauli frame accordingly.
After steps 1-2, the state is deterministically projected by , exactly as in the original Bravyi-Haah protocol. After step 3, the system is an eigenstate of (−1) γ k X[f k ] and we can associate some projector with Π X,γ with this process. The postselection in step 4 ensures the system is an eigenstate of +X[g k ] where g k is the k th row of M · G ⊥ . Since we defined M such that M · G ⊥ = G 0 , we have that g k are the rows of G 0 . It follows that Π X,γ = Π X Π X,γ where Π X is the projector for the X stabilizer of the Bravyi-Haah stabilizer code. Combining, steps 1-4 we have the projections where Π = Π X Π Z is the full projector onto the Bravyi-Haah stabilizer codespace. We have obtained the desired Π projection required by the Bravyi-Haah protocol. But we have picked up an additional Π X,γ . This additional projection results from measuring some the gauge operators of the subsystem variant of Bravyi-Haah. Thus the logically encoded qubits are unharmed but there has been a change to the gauge degrees of freedom.
In step 5, we perform single qubit measurements to isolate the k logical qubits from the n qubits in the subsystem code. Recall that in the original presentation of Bravyi Haah we have logical operators Z[l j ] and X[l j ] for the j th logical qubit, where l j is the j th row of G 1 . The measurements localise these logical operators onto single qubits, up to a Pauli correction depending on the measurement outcomes. That is, the measurements cause the state to become stabilized by some new ±Z[w] operators, and every logical Z[l j ] can be multiplied by these operators to obtain a single qubit ±Z operator acting on the (6+3j) th qubit. It is straightforward to verify that the ± sign is corrected to + by the Pauli correction X[H Z m Z ]. To show that X logical operators are also localized on a single output qubit, we first multiply X[l j ] by X-type gauge operators until it is acts on qubits 2,3 and (6 + 3j). Measuring qubits 1,2 and 3 completes the localisation of X[l j ] onto the (6 + 3j) th qubit. The required Pauli operator is now X[H Z m Z + Qγ], where now there is also some dependence on the eigenvalues of gauge operators obtained in step 3.

Step 4
Extract output 3.1) Measure all qubit in Pauli-Z basis, and all qubits in Pauli-X basis.

3.2) Remaining
are the k output qubits, up to a Pauli correc ion given.

Z X
Step 1 Measure Z-type stabilisers Step 2 Apply correc ion 2.1) Iden ify qubits to be corrected from outcomes of step (1.3) as follows: All ancilla with "-1" outcome are linked to an odd number of correc ion qubits; All ancilla with "+1" outcome are linked to an even number of correc ion qubits; 2.2) Apply local gate to every correc ion qubit.
Step 3 Measure X-type stabilisers 3 FIG. 8. Squares represent magic states to be distilled, and circles are ancillas used to implement measurements in Reed-Muller. Edges show required hardware connections and associated required control-phase gates. Edges show time ordering of control phase gates, e.g. red, blue, purple green and gold, demonstrating the required entanglement can be established in 5 time steps. Sometimes we call this the graph representing G RM z .
The measurements corresponding to weight four rows can be implemented with each measurement using a single ancilla in the |+ state and four entangling gates (control-Z or control-X depending on the measurements). Therefore, for these measurements it is possible that all these gates can be realized in four time steps, while respecting that a qubit can only be involved in a single entangling gate at a time. Unfortunately, there is a single row of G ⊥ with weight k +2, and so using a single ancilla to perform this measurement would result a growing time cost with k. Therefore, for this single measurement we make use of a cat state |0 ⊗k+2 + |1 ⊗k+2 so that the entangling gates can be performed in parallel. The cat state itself is constructed by merge operators on |+ ⊗k+2 qubits. The merge operations projecting onto |00 00| + |11 11| or |01 01| + |10 10| subspaces and so commute with the control gates and so the cat state can be built concurrently with the control gates. This opens the possibility of realising each round of measurements in 4 times steps, but depends on whether the entangling gates can be scheduled in an economical manner. The scheduling problem is equivalent to a graph coloring problem and we find that it can be solved in 4 time steps (e.g. using 4 colors) as in Fig. 1. We independently confirmed this using an automated solver of the edge colorability problem for k up to 40, see supplementary Mathematica script for details.

D. Reed-Muller connectivity
The usual form of the 15-qubit punctured Reed-Muller code the G matrix is  1 0 1 0 1 0 1 0 1 0 1 0 1 0 1  1 1 1 1 1 1 1 1 1 1 1 1 1 1 For our purposes an efficient choice of dual is which has a maximum row weight of 4 and a max column weight 5. There is a 5 colorable graph associated with this matrix shown in Fig. 8, similar to what we found for the Bravyi-Haah protocol. It is well known that the Reed-Muller code can also be viewed as a subsystem code, and so the Pauli-X measurements can clone the pattern of the Pauli-Z measurements. This approach yields the realisation shown in Fig. 8.

E. Toffoli protocol
The Toff protocol converts 8 T -states into one Toffoli state It has previously been described in the circuit picture with its performance found by brute force counting of errors [26,27]. Here we consider an equivalent protocol (with the same performance when using block checking) but with the G matrix formalism as the Bravyi-Haah protocols. The G matrix achieving this is simply 1 1 1 1 0 0 0 0 1 1 0 0 1 1 0 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1 1 1 where proof that this distills was given in Ref. [40,41].
Here we show in Fig. 9 how to convert this abstract protocol to a realisation with low spacetime overhead.

F. Proof of error tracking
When iterating distillation protocols we have a collection of inputs, which may have been outputs of an earlier round. We illustrate the structure in Fig. 2, where we introduce the terminology of branches and modules. Our proof proceeds by considering how an individual module maps probability distributions over its inputs to a probability distribution over outputs. If a level-l module uses many n l → k l blocks, then it requires n l branches of inputs from earlier rounds. If each branch carries N l qubits, then we need N l instances of the n l → k l protocol so that the whole module maps N l n l → N l k l qubits. The size of N l equals the product of the k l values for all earlier rounds, which can be verified by simply counting back through the tree.
We use x (j) ∈ Z N 2 to denote the error distribution of the input contribution from the j th branch, and use probability Pr(x (j) ) for the probability of this event. For now we take this probability as given and note only that for a block that quadratically suppresses noise, we have that Pr(x (j) = 0) is to first order proportional to 2 l after l rounds of distillation. If a single block of a n l → k l protocol is described by a matrix G, then N copies within a module are described by G = G ⊗ 1l N where ⊗ is the tensor product and 1l N is the identity matrix of dimension N . On the first round, N = 1 and so we simply have G = G. Before proceeding we must clarify how this tensor product structure relates to the input strings x. We define δ (j) as a length n l binary vector with entries: 1 in the j th location, and 0 everywhere else. It follows that x = j (δ (j) ⊗ x (j) ), so that Gx = j (Gδ (j) ⊗ x (j) ). This ordering ensures that no two qubits from the same branch collide into the same block, which must be prevented because of correlations within branches. We call this the canonical ordering and it is assumed throughout. Other orderings exist that prevent such collisions, such as an arbitrary permutation of qubits within any branch. Potentially such permutations could perform better or worse than the canonical choice, leaving room for further optimisation. We continue with the canonical choice as Step 2 Apply correc ion 2.1) Iden ify qubits to be corrected from outcomes of step (1.3) as follows: All ancilla with "-1" outcome are linked to an odd number of correc ion qubits; All ancilla with "+1" outcome are linked to an even number of correc ion qubits; 2.2) Apply local gate to every correc ion qubit.
Step 3 Measure X-type stabilisers 3 9. Squares represent magic states to be distilled, and circles are ancillas used to implement the stabilizer measurements which project the T-states into the Toffoli state. As before edges show required hardware connections and associated required control-phase gates. Edges show time ordering of control phase gates, e.g. red, blue, purple green and gold, demonstrating the required entanglement can be established in 4 time steps for the Z stabilizers and 5 time steps for the single X stabilizer, which utilises just two of the four ancillas. Sometimes we call this the graph representing G T of f z . it is particularly natural and amenable to analysis.
After a module passes module checking, the output branch has errors distributed as Pr l (y) = 1 P suc,l {x:G0x=0,G1x=y} Pr(x), where the denominator is a normalisation constant accounting for the success probability P suc,l = y {x:G0x=0,G1x=y} Pr(x). (F2) For our proof, it is useful to work with unnormalized probabilities that we write as where we will renormalise later. Herein, we focus on the smallest weight errors that can lead to a particular y. The no error case (y = 0) occurs when there the initial states had no errors, so where m l counts the total number of raw qubits needed for l rounds of distillation, namely m l = j=1,...l n j . For protocols quadratically suppressing noise, 2 l is the smallest number of errors that can evade detection. Therefore, we take the approximation where C l (y) counts the number of different weight 2 l errors that lead to output y. For one round of distillation this is simply Recall that η(y) was defined back in Def. 1 as a function that counts precisely the number of weight 2 errors that lead to a particular output. Finding C l (y) for higher levels (l > 1) is more involved. The relation between input and output error strings is y = G 1 x (assuming G 0 x = 0), so x vectors of weight 2 can result into a variety of weights y vectors, potentially increasing the weight, and so some care is needed.
Consider a module (on some l > 1 level) where some of the incoming branches contain errors. The probability of two branches a and b containing errors is These errors may contribute to undetected errors leaving the module. Fewer or more branch failures do not provide leading order contributions. Consider when only 1 branch contains any errors. After this branch is split up and fed into different blocks, each n l → k l block can contain at most 1 error, and so it will be detected. If t > 2 branches contain an error, the probability will be weighted by t * 2 (l−1) , which for small is a rare process compared to two failed branches. Furthermore, with two erroneous branches a and b, we can further deduce that either x (a) = x (b) or the error will be detected. To see this, we begin by noting that errors will only be undetected if (G 0 ⊗ 1l N )x = 0 (mod 2), and we have Both G 0 δ (a) and G 0 δ (b) are columns of the G 0 and so are nonzero. Since no terms are zero, it can only vanish mod 2 if both G 0 δ (a) = G 0 δ (b) and x (a) = x (b) . This is a central pivot of the proof, so we will give a second explanation of what is happening here. Note that if x (a) = x (b) , then w.l.o.g x (a) had an error in at least 1 location that is absent from x (b) . If the error's location is t, then the t th distillation block will receive an input with only 1 error and must detect it. We conclude that the dominate source of errors is from the identical failure of pairs of branches.
To simplify notation, let u = δ (a) + δ (b) and f = x (a) = x (b) , so we have the more concise expression x = u⊗f for relevant errors, with u being weight 2. Above we noted that an undetected error must satisfy G 0 δ (a) = G 0 δ (b) , which is equivalent to G 0 u = 0 (mod 2). The output from such an error is y = (G 1 ⊗ 1l N )(u ⊗ f ) = (G 1 u) ⊗ f . Therefore, we can now deduce that by counting over all u that lead to the same v. Therefore, The summation is over weight 2 vectors u, but the terms are independent of u, and so we find that where we have again used the η notation introduced in Def. 1. For two rounds of distillation, l = 2, and so C l−1 = C 1 and we can end the recursion by using Eq. F6, so that However, if we have more than 2 rounds, notice that the relevant f will again have the form f = v ⊗ f so that and so on, until we reach C 1 and can use (see Eq. F6) to end the recursion. From C l (y) we have a good leading order approximation of the probability of an error P l (y). The total (unnormalized) probability of an error is where we have used the shorthand Equating also A l = P l (0) = (1 − ) m l , we find we have reached the expressions of Eqs 8. To renormalise the error probability B l , we simply divide through by the total probability A l +B l , yielding one equation of Thm. 1. The denominator A l + B l represents that success probability not just of a single module succeeds, but that all events feeding into that module also succeed. We are actually interested in the success probability conditioned on previous modules being successful, which is (A l +B l ) divided by (A l−1 +B l−1 ) n l . This divider here comes from the unconditional success probability of an (l − 1)-level module, (A l−1 + B l−1 ) and that n l of these feed into an l-level module. This completes the proof of Thm. 1.

Brute Force method
In simulating a magic state factory it quickly becomes apparent that a brute force method of simulation is inadequate. The "brute force" method simply involves randomly generating a boolean string of length l n l to describe the input to the magic state factory and performing calculations of the stabilizers and logical output of each module of the factory. The explicit procedure for a three round factory is given in Alg. 1. For each module in round one, a random input is generated and tested until the stabilizer checks for the module are passed. The logical output of each of these successful modules is saved until enough have been generated to feed into round two. These outputs are shuffled according to Equation F8 before entering round two. Here the module checking is performed again. If all the module checks are passed, then we again proceed by feeding the logical output of round 2 to round 3, after shuffling. Having reached round 3 we then determine the characteristics of this module by recording where the module check fails, and if it succeeds whether the output of the module contains a logical (undetected error).
Clearly a large number of iterations of this protocol are required to obtain reliable statistics for the performance of the factory. For example, our analytic treatment estimates that with a raw magic state error rate = 0.001 that the rate of undetected logical error of a round 3 module is ∼ 10 −21 . As such, successfully simulating this by brute force would require 10 21 attempts at the algorithm to be made, not just to build statistics but to ensure that enough instances of the third round module are generated in the first place. We find that the simulation of two round factories by brute force is achievable for 2 < k < 50, but to simulate three rounds a different method is required.

Rare Events method
An undetected error in the factory's output after three rounds of distillation is a rare event. To simulate these and gain adequate statistics we must use a method in which we as far as possible eliminate the simulations of input error configurations that we know cannot lead to a logical error. Our chosen method of doing this proceeds as follows. We know that a module only has a chance of failing if at least two of the branches entering that module contain an error. For a module in the third round, we focus on cases where at least 2 of its input branches contain errors. We do this by a process that can be called preselection in contrast to postselection. In postselection, we sample from a distribution and reject instances that do not meet a certain criteria. This is not feasible when the criteria (here having at least 2 corrupt branches) is rare, and so we instead construct a new probability distribution conditioned on the criteria being meet. The first step of the algorithm therefore is to first decide how many error containing branches will enter this module, given that this number is ≥ 2, based on the statistics already gathered for the rate of errors in the output of round 2 modules. Later we analytically adjust for this process of preselection.
For each of the 'corrupt' branches entering round three, we know that these must originate from a round 2 module which itself had 2 or more corrupt branches entering it. These branches again would have originated in a round 1 module (equivalently a block) which had at least two errors fed to it. We can thus greatly reduce the size of the simulation and the number of iterations required by simulating only on a subsection of the factory where these errors have occurred, see Fig. 11. The probability of an undetected error after the first round was determined analytically for Bravyi-Haah by explicitly calculating the weight enumerator which allows analytic calculation of the global fidelity of the output of a single round of distillation. The corresponding result for the Toffoli protocol is given in Ref [26].
Having chosen the number of failed branches that need to enter round three we attempt to generate these failed The close correspondence of our analytic treatment of the factory and numerical simulations is demonstrated for two and three rounds of the magic state distillation with module checking. Points represent simulation data, while the lines are the corresponding analytic estimates as determined by Thm. 1. The protocols are labelled by the k value of each round, which is set to be equal for each round of distillation. The global error, the probability of a logical error anywhere in the output of the factory is shown. For a "2-2-2" factory this is the probability of an error anywhere in the 8 output magic states, for a "26-26-26" this would correspond to an error anywhere in the 17,576 magic states that this factory produces as its output.
branches. We thus simulate the reduced factory, as described above, discarding and repeating each module not only if the stabilizer module check has failed, but also when the check is passed and the output does not contain a logical error. Using this method we eliminate the simulation of a vast number of the possible input error strings, while holding smaller boolean vectors in memory at all but the last step, the full simulation of the module in round 3. Some pseudo-code is presented in Alg. 2.
This method of simulation allows us to calculate the correlated error rate as follows.
A key number is the probability p num that two or more branches leaving round 2 of the factory contain an error.
glo is the probability that a branch exiting round 2 contains an undetected error, as determined by numerical simulations of a two round factory.
Using this allows us to determine the success probability and the global error in the output of the round 3 module.  where b 3 = #{"ERROR"} #{"SUCCESS"}+#{"FAIL"}+#{"ERROR"} is the estimate of the probability of a logical error in the output branch of round 3 from output of the simulation; and a 3 = #{"SU CCESS"} #{"SUCCESS"}+#{"FAIL"}+#{"ERROR"} We limit our simulations to a limited set of k values, with k i = k for each simulation. This is an arbitrary choice and simplifies the comparison made in Fig. 10. For three rounds the simulations are limited to low k values, as these can be simulated in a reasonable timeframe and adequate statistics gathered to infer the output error rate. randomly generate binary string v length n1 5: measure stabilizers G0(k1).v 6: if stabilizers failed then return to line 3 7: end if 8: calculate logical output of module v = G1(k1).v

9:
Append v to list of logical outputs V 10: end for 11: shuffle output of round 1 to firewall correlations V → V {Round Two} 12: for j < M2 do 13: for each block ii in a round 2 module (there are k1) do 14: take (ii × j) th string of length n2 from V : v 15: measure stabilizers G0(k2).v 16: if stabilizers failed then return to line 1 17: end if 18: calculate logical output of module w = G1(k2).v 19: Append w to list of logical outputs W 20: end for 21: end for 22: shuffle output of round 2 to firewall correlations W → W {Round Three} 23: for l < k1k2 do 24: measure stabilizers G0(k3).w 25: if stabilizers failed then return FAIL 26: end if stabilizers passed 27: calculate logical output of module G1(k3).w

28:
Search for logical error in output 29: if logical error found then return ERROR generate number of corrupt round 1 modules N1 6: for j < N1 do 7: randomly generate binary string v length n1 8: measure stabilizers G0(k1).v 9: if stabilizers failed then return to line 6 10: end if 11: calculate logical output of module G1.v 12: if there is a logical error then Append v to list of logical outputs V if stabilizers failed then return to line 6 22: end if 23: calculate logical output of module w = G1(k2).v 24: if there is a logical error then Append w to list W end for 28: end for{ Round Three } 29: Pad W with 0s so it is length k1k2n3 30: shuffle output of round 2 to firewall correlations W → W 31: for l < k1k2 do 32: measure stabilizers G0(k3).w 33: if stabilizers failed then return FAIL 34: end if stabilizers passed 35: calculate logical output of module G1(k3).w 36: Search for logical error in output 37: if logical error found then return ERROR