Universal quantum computing with twist-free and temporally encoded lattice surgery

Lattice surgery protocols allow for the efficient implementation of universal gate sets with two-dimensional topological codes where qubits are constrained to interact with one another locally. In this work, we first introduce a decoder capable of correcting spacelike and timelike errors during lattice surgery protocols. Afterwards, we compute logical failure rates of a lattice surgery protocol for a biased circuit-level noise model. We then provide a new protocol for performing twist-free lattice surgery, where we avoid twist defects in the bulk of the lattice. Our twist-free protocol eliminates the extra circuit components and gate scheduling complexities associated with the measurement of higher weight stabilizers when using twist defects. We also provide a protocol for temporally encoded lattice surgery that can be used to reduce both runtimes and the total space-time costs of quantum algorithms. Lastly, we propose a layout for a quantum processor that is more efficient for rectangular surface codes exploiting noise bias, and which is compatible with the other techniques mentioned above.


I. INTRODUCTION
Fault-tolerant quantum computing architectures enable the protection of logical qubits from errors by encoding them in error correcting codes, while simultaneously allowing for gates to be performed on such qubits. Importantly, failures arising during the implementation of logical gates do not result in uncorrectable errors as long as the total number of such failures remains below a certain fraction of the code distance [1][2][3][4][5]. In most practical settings, quantum logic gates are split into two categories. The first category corresponds to Clifford operations, which can be efficiently simulated by classical computers. The second category corresponds to non-Clifford operations, which cannot be efficiently simulated using purely classical resources. Early proposals for fault-tolerant quantum computation used transversal gates to perform logical Clifford operations [6]. Later, it was shown that by braiding defects in a surface code, some Clifford operations could be realised faulttolerantly in a 2D local architecture with a high-threshold [7]. Recently, lattice surgery [8] has replaced the braiding approach due to its ability to retain locality constraints and high thresholds (features which are required by many hardware architectures), while additionally offering a much lower resource cost [9][10][11][12]. These approaches all perform non-Clifford gates by teleportation [13,14] of magic states prepared by some distillation procedure [15][16][17][18][19][20][21]. Alternative ideas have been proposed for circumventing the need for magic states [22][23][24][25], but detailed studies [26][27][28] have not found any of these alternatives to be competitive for a wide range of failure rates below the surface code threshold.
Our work introduces the following key results. After briefly reviewing the model of Pauli based computation and its implementation via lattice surgery in Section II, we then explicitly provide a decoder compatible with lattice surgery in Section III. In particular, our decoder is capable of correcting both spacelike and timelike errors that occur during lattice surgery protocols. We then perform simulations of an X ⊗ X Pauli measurement using a biased circuit-level noise model.
In Section IV, we introduce a twist-free approach for measuring arbitrary Pauli operators using the surface code. Our approach avoids the extra circuit and gate scheduling complexities that arise when using twists, where by twists we refer to lattices which contain twist defects in the bulk (see for instance Fig. 3 in Section II B). We show that the approximate cost of avoiding twists is a 2× slowdown in the algorithm runtime and a negligibly small additive cost to the number of logical qubits. We expect that twist-based lattice surgery has it own associated costs, which may exceed those of our twist-free approach, but twist performance has never been fully quantified and so represents a currently unknown factor in quantum computing design.
In Section V, we show how to reduce algorithm runtimes using a new technique that we call temporal encoding of lattice surgery. By using fast lattice surgery operations (which are inevitably noisier), errors arising from the extra noise can be corrected by encoding the sequence of measured Pauli operators within a classical error correcting code. The resulting runtime improvement grows (as a multiplicative factor) with the parallelizability of the algorithm and total algorithm runtime. We find that in a regime of interest to quantum algorithms of a practical scale, we can achieve a 2.2× runtime improvement. Our temporal encoding does not directly lead to additional qubit overhead costs since it occurs in the time domain, and so the overall spacetime complexity is improved.
Lastly, in Section VI, we describe our core-cache architecture. We show that by using thin rectangular strips of surface codes for settings where a large noise bias is present, routing overhead costs in our proposed architecture add a factor of 1.5× to the total resource costs for performing lattice surgery. This can be compared with the 2× cost of Litinski's fast data access structures [12]. Furthermore, we provide a layout that compactly stores surface code patches in a cache to further reduce the extra overhead arising from routing costs, at the cost of some additional time needed for reading/writing to the cache. Using the numerical results obtained in Section III, in Appendix C we provide resource cost estimates for simulating the Hubbard model using our core-cache architecture.

II. BRIEF REVIEW OF UNIVERSAL QUANTUM COMPUTING VIA LATTICE SURGERY
In Section II A, we briefly review the principles of Pauli based computation used throughout this work. We then review in Section II B how multi-qubit Pauli operators are measured using lattice surgery.

A. Overview of Pauli based computation
In the model of Pauli-based computation (PBC), we have a reserve of magic states and drive the computation by performing a sequence of multi-qubit Pauli measurements {P 1 , P 2 , . . . , P µ } where later Pauli measurements depend on measurement outcomes of earlier measurements. In this notation, P 2 does not denote a specific Pauli, but one conditional on the outcome of P 1 . This conditionality occurs because (in the circuit picture) each Pauli measurement would be followed by a conditional Clifford operation. However, in a PBC, these Cliffords are conjugated to the end of the computation, thereby changing subsequent Pauli measurements. Since in a PBC all Cliffords are performed "in software", it is clear the algorithm runtime will be independent of the Clifford complexity. The idea of PBC appears throughout the literature, but the phrase Pauli-based computation was first coined in Ref. [29]. In Fig. 1, we present several computationally equivalent circuit diagrams for performing 2 T gates, with the last diagram representing the PBC approach.
In Section II B, we review how multi-qubit Pauli measurements can be performed using lattice surgery. Crucially, even when Pauli operators commute, it might not be possible to measure such Pauli operators simultaneously due to the extra space required to perform lattice surgery (known as the routing space). We can be obstructed from measuring commuting Pauli operators when the required lattice surgery operations need access to the same routing space. Therefore, it is appropriate to consider sequentially measuring each Pauli operator, which we call a sequential Pauli based computation (seqPBC). In se-qPBC, the time required to execute all the Pauli measurements will then be proportional to T PBC = (d m + 1)µ where we budget +1 for resetting qubits between lattice surgery operations. Here d m corresponds to the number of rounds of stabilizer measurements during lattice surgery, and µ is the number of sequential Pauli operators being measured. The proportionality factor will depend on the time required to measure the surface code stabilizers during one syndrome measurement round.
There are several contributions to the number of Pauli measurements µ. At a high level of the stack, we may think of a quantum algorithm as consisting of a series of unitaries with some Pauli measurements for readout, and let N A denote the number of such algorithmic readout measurements. However, as we saw in Fig. 1, non-Clifford unitaries are performed by measurements. If an algorithm has N T T -gates, then we also need an additional N T Pauli measurements. The Clifford plus T gate set is universal. However, it is advantageous to use an overcomplete gate set such as Clifford plus T and Toffoli. While Toffoli can be synthesized using 4 T 1. Equivalent approaches to implementing two T gates. Blue rounded rectangles show multi-qubit Pauli measurements. (a) A simple unitary circuit approach. (b) Using gate teleportation with |T := (|0 + e iπ/4 |1 )/ √ 2 magic states, CNOT gates, Pauli Z measurements with outcomes m1 and m2 and classical conditioned Cliffords based on these outcomes. The conditional S gates are S = |0 0| + i|1 1|. (c) Using gate teleportation with the CNOT gates replaced by two-qubit Pauli measurements. (d) Given an input state |ψ which carries a Clifford frame correction C, we have conjugated C through the circuit so that the multi-qubit Pauli-measurements are now P1 = CZ1Z3C † and P2 = CZ2Z4C † (which commute) and the output state carries a new Clifford frame correction C = S m 1 +2q 1 1 S m 2 +2q 2 2 C. This last circuit represents the standard PBC approach for computing T ⊗2 .
gates [30], it is often more efficient to directly prepare Toffoli magic states [30][31][32][33]. Furthermore, it only takes 3 Pauli measurements to teleport a Toffoli state rather than the 4 measurements needed to teleport T states and then synthesize a Toffoli. As such, if an algorithm can be executed with N TOF Toffoli gates and N T T -gates, then we need N T + 3N TOF measurements to perform these teleportations. Further refinements are possible by using an even richer gate set and preparing more exotic states [34][35][36], but we will not explicitly discuss those schemes here. Lastly, we can replace some non-Clifford gates with Pauli measurements and feedforward (with no magic state needed). For instance, such a measurement appears in Gidney's circuits for adders [37] and more generally any uncomputation subroutine of an algorithm where the Toffoli gates are replaced with Pauli measurements and A simple example of lattice surgery illustrating the space overhead needed for routing. In (a), we show 8 rectangular surface code patches with some surrounding idle qubits that form the routing space. In (b) we show the stabilizers needed to measure a logical X ⊗ X operator between the separated surface code patches using the routing space between them. Similarly, in (c) we show the stabilizers needed to measure a logical Z ⊗ Z operator between the separated surface code patches using the available routing space. Stabilizers of mixed X and Z-type are referred to as domain walls and are used to ensure that the dz distance of the surface code does not decrease when measuring Z-type logical Pauli operators via lattice surgery. White ancillas mark the stabilizers that directly contribute towards computing the parity of the X ⊗ X and Z ⊗ Z measurement outcomes. feedforward. We use N unTOF to denote the number of Toffoli uncomputations performed in this manner. Hence, for a Clifford plus T and Toffoli gate set, the total number of Pauli measurements will be µ = N A + N T + 3N TOF + N unTOF . Note that in many algorithms, Toffolis exclusively appear in compute/uncompute pairs, and then we have N unTOF = N TOF . Furthermore, algorithms often only have a small number of qubit readouts, so N A N T , N TOF . As such, we commonly have µ ≈ N T + 4N TOF .
An architecture also requires time T magic to produce the required magic states, say N T T -states and N TOF Toffoli states. If an architecture produces all the required magic states in a shorter amount of time than is required to teleport them, so that T magic < T PBC , then we say the seqPBC is Clifford-bottle necked. On the other hand, if T magic ≥ T PBC , we say it is magic-state bottle necked. The running time of the algorithm will be determined by max{T magic , T PBC }.
It is informative to briefly review the impact of these bottle necks on the history of algorithm resource analysis. The time T magic can be made arbitrarily small, by simply increasing the number of magic state factories, though this comes at an increased qubit cost. Some early algorithm overhead estimates [5,38,39] minimized T magic by having a large number of factories leading to widespread claims that magic state factories could be ∼ 99% of the whole device. However, such resource estimates ignore the question of how quickly these states can be teleported and ignored the T PBC bottleneck. Accounting for this bottleneck, there is no benefit from pushing T magic to be small T magic T PBC . As such, more recent and careful overhead analyses [33,[40][41][42] have assumed only a few factories, which is enough to achieve T magic ≈ T PBC and results in only a small percentage of device footprint (often less than 1%) being used as a magic state factory. While these analyses have minimal qubit cost, the runtime now is T PBC bottlenecked, motivating rigorous approaches to beating the T PBC bottleneck. Later, in Section V A, we will review prior lattice surgery methods to speed up algorithms by using additional teleportation gadgets, though this comes at a high qubit cost. We will then introduce our own approach that instead uses temporal encoding of lattice surgery (TELS).

B. Overview of lattice surgery
For quantum hardware where physical qubits can only interact with one another locally, lattice surgery is a fault-tolerant protocol that can be used to measure arbitrary multi-qubit Pauli operators. The main idea is to encode the logical qubits in some topological code arranged in a two-dimensional layout (in this work, all logical qubits are encoded in the rotated surface code [43]). The layout contains extra routing space between the surface code patches which consists of additional qubits. By applying the appropriate gauge-fixing operations in the routing space (see for instance Ref. [44]), which involves measuring surface code stabilizers, the surface code patches involved in the Pauli measurement are merged into one larger surface code patch. After gauge fixing, the parity of the measurement outcome of the multi-qubit Pauli operator being measured is obtained by taking the products of the appropriate stabilizers in the routing space. Lastly, the surface code patches for each logical qubit can be detached from the merged patch by measuring the qubits in the routing space in the appropriate basis. An illustration of X ⊗ X and Z ⊗ Z Pauli measurements is shown in Fig. 2. Products of the surface code stabilizers marked by white ancilla qubits give the parity of the X ⊗ X and Z ⊗ Z measurement outcomes. Note that for Z-type Pauli measurements, we use domain walls at the Z logical boundaries of the surface code patches. Domain walls correspond to stabilizers of mixed X and Z-type as illustrated in Fig. 3. The primary reason for using domain walls is to prevent a reduction in minimum-weight representatives of logical Z operators during lattice surgery. Step (1) is the initial setup of 2 surface code patches. In step (2), the surface code patches are extended and corners moved using the routing space. Such an extension allows logical Y operators of the surface code to be expressed along a horizontal boundary. In step (3), we measure Y ⊗ Y using a combination of domain walls, elongated stabilizers and two twist defects. Domain walls measure Z ⊗ Z ⊗ X ⊗ X stabilizers and typically offer no additional challenge compared to normal stabilizer measurements. Elongated stabilizers are long range operators that pose an additional difficulty in hardware implementations. Elongated stabilizers can be implemented using either: two ancilla qubits (e.g. prepared in a GHZ state) that we connect with a dashed line; or these two ancilla qubits must be merged into a single qubit (hardwired in the architecture) and using long-range gates. Twist defects (yellow plaquettes) present the biggest difficulty since in addition to the challenges faced by elongated stabilizers, they also require a weight-five measurement.
To measure multi-qubit Pauli operators containing Y terms, one option is to extend the surface code patches using the routing space in such a way that the logical Y operators can be expressed along horizontal boundaries of the surface code. Logical Y operators can then be measured using a twist defect as shown in Fig. 3. Note that such a protocol requires measuring a weight-five operator, such as the ones shown in yellow in Fig. 3. Such high-weight measurements can be undesirable for many hardware architectures. An alternative approach which does not require the extension of surface code patches and the use of twist-defects in the bulk is provided in Section IV.
When performing a lattice surgery measurement of a logical Pauli operator, there will be some probability that we obtain the wrong outcome. Even with large code distances, the lattice surgery measurement could still fail due to time-like errors occurring during the finite time allowed for lattice surgery. The probability of these failure events is exponentially suppressed in the number of rounds d m for which we repeat the stabilizer measurements during lattice surgery. Therefore, we call d m the measurement distance which quantifies the protection against repeated measurement failures during lattice surgery, and hence explains our choice of subscript. Defining d z and d x to be minimum-weights of logical Z and X-type operators of the surface code, this exponential suppression will hold until d m O(d z , d x ) when logical Pauli errors become the dominate mechanism again. Let us assume that code distances d x and d z are chosen so that even a single logical Z and X error is very unlikely over the course of the whole computation. We expect, and numerically find, that these timelike errors occur with a probability P for which we have a bound of the form where p quantifies the physical gate failure probabilities, {a, b, c} are constants and L is the area of the patch used for lattice surgery. The value of L will vary for different measurements and different layouts, but it will be convenient to think of it as a constant representing the worst (or average) case area of lattice surgery patches.
In general, if we want to sequentially perform µ Pauli measurements in the algorithm and we want them to fail with probability no more than δ, then we choose d m to be large enough such that where the approximation holds for small P. In Section III, after introducing a decoder compatible with lattice surgery protocols, we will compute timelike failure probabilities in addition to probabilities for other noise processes given a biased circuit-level noise model. Such results will allow us to obtain accurate resource overhead estimates for implementing quantum algorithms and are discussed further in Section VI.

III. DECODING TIMELIKE ERRORS DURING LATTICE SURGERY
In this section, we provide an explicit decoding protocol for correcting both spacelike and timelike errors which can occur during lattice surgery protocols. In particular, our protocol protects logical qubits encoded in surface code patches while at the same time correcting logical multi-qubit Pauli measurement failures that can occur during lattice surgery. We then provide numerical results for performing X ⊗X measurements Example of an X ⊗ X measurement performed on two dx = 5, dz = 7 logical patches. Data qubits in the routing space are first prepared in the |0 state. In the first round of stabilizer measurements where the logical patches are merged into one large surface code patch, the product of all stabilizers marked by white vertices (which we call parity vertices) gives the result of the X ⊗ X measurement. After measuring the stabilizers of the merged patch for dm rounds, the patches are split by measuring the data qubits in the routing region in the Z basis. In the first round of the merge, measurement errors occurring on parity vertices can result in a wrong X ⊗ X measurement outcome. Additionally, an odd number of dataqubit Z errors along the boundary of the logical patches prior to the merge, such as the one circled in the top row of the figure, can also result in a wrong X ⊗ X measurement outcome.
showing both the logical multi-qubit Pauli measurement failure rate as a function of the number of syndrome measurement rounds and the logical qubit failure rates. The generalization of toric code decoders with periodic boundary conditions to the surface code was first done in Refs. [7,45,46] by adding virtual vertices at the boundaries of the surface code. Follow up work in Refs. [8,9] provided a high-level account of how surface code decoders could be used in the context of lattice surgery. However, details such as the correct specification of boundary vertex locations for both spacelike and timelike failures were not provided. In Section III A, our lattice surgery decoding algorithm makes use of boundary vertices for both spatial and timelike boundaries. Further, no previous work has performed such realistic, circuit-level simulations of lattice surgery. For comparison, in Ref. [33] timelike errors were simulated, but Ref. [33] used a toy model with idealized boundaries to exclude logical spacelike errors and thereby simplifying both the simulations and required decoding algorithm. In what follows, we will refer to a surface code patch encoding a logical qubit as a logical patch. Space used for performing multi-qubit Pauli measure-ments via lattice surgery will be referred to as the routing space or routing region.
In Fig. 4, we provide an example of an X ⊗ X Pauli measurement performed between two d x = 5, d z = 7 logical patches. After preparing ancilla qubits (grey vertices) in the routing space in |+ and data qubits (yellow vertices) in |0 , X-type surface code stabilizers are measured. The products of stabilizers marked by white vertices gives the measurement outcome of the logical X ⊗ X operator. In what follows, white vertices whose product gives the result of a P 1 ⊗ P 2 ⊗ · · · ⊗ P k Pauli measurement will be referred to as parity vertices. We say that a logical timelike failure occurs if a set of errors result in the wrong parity measurement of P 1 ⊗ P 2 ⊗ · · · ⊗ P k . Further, we assume that the logical patches are measured for r rounds prior to being merged in round r + 1.
When measuring P 1 ⊗ P 2 ⊗ · · · ⊗ P k using lattice surgery, in addition to an odd number of measurement errors occurring in round r + 1, an odd number of data-qubit errors along the boundaries of the logical patches prior to the merge can also result in a logical timelike failure. An example is provided in Fig. 4 where a single data-qubit Z error along a boundary of the left logical patch prior to the merge gives the wrong parity of X ⊗ X. We also note that during the syndrome measurement round r + 1, an odd number data-qubit errors which anticommute with P 1 ⊗ P 2 ⊗ · · · ⊗ P k will (unless corrected) also result in a logical timelike failure.
The above examples show that in order to obtain the correct parity measurement of a P 1 ⊗ P 2 ⊗ · · · ⊗ P k operator in the presence of full circuit level noise, one must have a decoding scheme which, while constantly correcting errors on the logical patches, also corrects spacelike and timelike errors which can flip the parity of the measurement outcome.

A. The decoding algorithm
In order to correct logical timelike failures using a minimumweight-perfect-matching (MWPM) decoder [47], we must add timelike boundaries to the matching graphs of the surface code as shown in Fig. 5. In particular, we divide the measurement of an operator P 1 ⊗ P 2 ⊗ · · · ⊗ P k into three steps. In the first step, the logical patches are measured for r rounds. In round r + 1, the patches are merged by measuring the appropriate operators in the routing space (see for instance Fig. 4) and the parity of the measurement outcome is given by the product of all parity vertices. The merged patches are measured for d m rounds, and then in round d m + 1, the qubits in the routing space are measured in the appropriate basis to split the patches back to their original configuration. In round r, we add extra virtual vertices to the matching graph with vertical edges which are incident to such vertices and to the parity vertices in round r +1 (see the pink edges in Fig. 5b for the X ⊗ X measurement). We call such vertices past ancilla vertices and the pink edges incident to them past vertical edges. Similarly, in round r + d m + 1 (i.e. right after the split), we add virtual vertices to the matching graph with vertical edges which are incident to such vertices and to the parity vertices in round r + d m (see the purple edges in Fig. 5a for the X ⊗ X measurement). (a) Two-dimensional slices of the surface code matching graphs for syndrome measurement rounds r + dm (bottom) and r + dm + 1 (top). The graph in round r + dm + 1 includes the future ancilla vertex with future vertical edges (solid purple edges) connecting to the parity vertices (white vertices circled in red) of the matching graph in round r + dm. (b) Two-dimensional slices of the surface code matching graphs for syndrome measurement rounds r (bottom) and r + 1 (top). The graph in round r includes the past ancilla vertex with past vertical edges (solid pink edges) connecting to the parity vertices (white vertices circled in red) of the matching graph in round r + 1. Transition vertices are the purple boundary vertices of the graph prior to the merge (note that transition vertices appear in rounds 1 to r), in addition to the purple parity vertices in round r + 1.
We call such vertices future ancilla vertices and the purple edges incident to them future vertical edges. Importantly, the pink vertical edges that are incident to the past ancilla vertices and to the parity vertices in round r + 1 have zero weight, while the purple vertical edges incident to the parity vertices in round r + d m and the future ancilla vertices have non-zero weights. These weights are computed from all timelike failure processes which can result in measurement errors occurring in round r + d m . When performing MWPM over the full syndrome history, the parity of the measurement outcome of P 1 ⊗ P 2 ⊗ · · · ⊗ P k is flipped if there are an odd number of highlighted vertical edges incident to parity vertices in rounds r +1 and r +2. In such a setting, one would require a sequence of consecutive measurement errors that is greater than (d m −  6. A two-dimensional slice of the surface code matching graph in the timelike direction for syndrome measurement rounds performed during an X ⊗ X measurement protocol using lattice surgery. The vertical axis corresponds to the timelike direction. We show a subset of the vertices and edges of the matching graph for correcting Z errors for an X ⊗ X measurement using lattice surgery. Parity vertices are the white and red vertices in the ancilla patch region. Transition vertices are both boundary and parity vertices colored in purple. Pink edges are past vertical edges incident to the past ancilla vertex and parity vertices in round r + 1, whereas purple edges are future ancilla edges incident to the parity vertices in round r + dm and the future ancilla vertex. We also add a dashed-blue weightless edge connecting the past and future ancilla vertices. 1)/2 in order to cause a timelike logical failure. Note that since the measurement outcomes of the parity vertices in round r + 1 are random, such vertices are never highlighted. If a change in the measurement outcomes of a subset of the parity vertices are observed between rounds r + 1 and r + 2, then vertices corresponding to such parity vertices in round r + 2 would be highlighted. Furthermore, green horizontal edges incident to the parity vertices in round r + 1 are taken to have zero weight (or can be omitted).
As mentioned above, a lattice surgery decoder also needs to correct logical timelike failures arising from sets of data qubit errors along boundaries of the logical qubit patches prior to merging them (recall the example shown in Fig. 4). The decoder also needs to correct wrong parity measurements arising from data qubit errors in round r + 1 which anticommute with the Pauli operator being measured by lattice surgery. In constructing such a decoder, note that prior to merging the logical patches for the X ⊗ X measurement, a single Z error along the relevant boundaries would result in a highlighted edge (after implementing MWPM over the full syndrome history) incident to one of the purple boundary vertices shown in Fig. 5b. Note that such boundary vertices become parity vertices after merging the surface code patches. For the measurement of a general P 1 ⊗ P 2 ⊗ · · · ⊗ P k Pauli operator, we define transition vertices to be vertices in the set V (s) bm } are labels for boundary vertices of the graphs of split logical patches which become parity vertices in round r + 1. If s = r + 1, then V of the logical qubit patches and routing space used to merge the logical qubits (see for instance the parity vertices highlighted in purple in Fig. 6). Based on previous observations, after implementing MWPM over the full syndrome history of a multi-qubit Pauli measurement via lattice surgery, if an odd number of space-like highlighted edges are present in logical patches, and such edges are incident to transition vertices, the parity of the Pauli measurement needs to be flipped. An illustration of a two-dimensional slice of the matching graphs of Fig. 5 in the timelike direction (which contains a subset of the spacelike edges and vertices) is shown in Fig. 6. In particular, the figure illustrates transition vertices, the past and future ancilla vertices in addition to the past and future vertical edges.
Combining all the notions introduced in this section, the decoding algorithm for implementing a multi-qubit Pauli measurement via lattice surgery is described in Algorithm 1. Each highlighted edge in step 7) of Algorithm 1 encodes a particular data qubit correction. Writing such corrections as a binary row vector, where each column corresponds to a data qubit, we add all corrections arising from each highlighted edge using modulo-two arithmetic. Further, note that space-time correlated edges incident to parity vertices in round r + 1 need to be treated with care in order to correct errors up to the full code distance. In particular, a subset of the space-time correlated edges incident to transition vertices can also contribute to v 2 (defined in Algorithm 1). A more careful treatment of such edges is provided in Appendix B.

B. Decoder simplifications
We point out a simplification that can be made in the implementation of Algorithm 1. Notice that vertices in V (r+1) par are never highlighted during MWPM due to the random outcomes of stabilizers in the routing space marked by white vertices in round r + 1. As such, one could remove all vertices in V (r+1) par and instead have the past vertical edges incident to the vertices in V (r+2) par . In such a setting, edges incident to V (r+1) par and V (r+2) par would be removed, and their weights would be assigned to the past vertical edges. Lastly, we remark that all boundary vertices, including the past and future ancilla vertices, can be merged into a single boundary vertex. In such a setting, each edge of the matching graph G r encodes both a space-like and timelike correction. The timelike component for the edge e j is obtained by observing if the failure mechanism resulting in the highlighted edge e j flips the parity of the multi-qubit Pauli measurement. We chose to describe the decoding protocol using Algorithm 1 to avoid figures with multiple edges all incident to the same boundary vertex.

C. Noise model and simulation methodology
Using the decoding algorithm given in Algorithm 1, we performed a full circuit-level noise simulation of various code distances and syndrome measurement rounds to estimate the parameters in Eq. (1) for a X ⊗ X measurement. We chose the following biased circuit-level noise model: 1. Each single-qubit gate location is followed by a Pauli Z error with probability p 3 and Pauli X and Y errors each with probability p 3η .
2. Each two-qubit gate is followed by a {Z ⊗ I, I ⊗ Z, Z ⊗ Z} error with probability p/15 each, and a {X ⊗ I, Algorithm 1: Decoding algorithm for measuring P 1 ⊗ P 2 ⊗ · · · ⊗ P k via lattice surgery. Result: Data qubit and parity measurement corrections. initialize: v 1 = v 2 = 0. Let G r be the graph for the surface code patches before, during and after the merge.; Measurement: Measure the stabilizers of the split logical patches for r rounds. Merge the logical patches in round r + 1 via lattice surgery to perform the P 1 ⊗ P 2 ⊗ · · · ⊗ P k measurement and let s par be the parity of the measurement outcome. Repeat the stabilizer measurements of the merged patch for d m − 1 rounds.; 1) Add the past ancilla vertex v past to G r for the round r (round before the merge). Let V par } be the set of parity vertices for the syndrome measurement round r + 1. Add weightless past vertical edges to G r which are incident to v past and all vertices v ∈ V (r+1) par . Add weightless edges to G r which are between v past and virtual boundary edges of all surface code patches; 2) Add the future ancilla vertex v future to G r for the round r + d m + 1 (round after the merge). Let V par } be the set of parity vertices for the syndrome measurement round r + d m . Add future vertical edges (of non-zero weight) to G r which are incident to v future and all vertices v ∈ V (r+dm) par ; 3) Add a weightless edge to G r which is incident to v past and v future ; 4) Set all edges incident to any two vertices v i , v j ∈ V (r+1) par to have zero weight; 5) Given the full syndrome measurement history, if the total number of highlighted vertices (obtained by taking the difference between any two consecutive syndrome measurement rounds modulo 2) is odd, highlight v future ; 6) Implement MWPM on G r . Set v 1 to be the number of highlighted edges incident to vertices in V (r+1) par and V (r+2) par , and v 2 to be the number of highlighted edges incident to transition vertices in the data-qubit patch regions. If v 1 + v 2 is odd, set s par → s par + 1 (mod 2); 7) Apply data qubit corrections based on all highlighted two-dimensional and space-time correlated edges.

With probability 2p
3η , a single-qubit Z basis measurement outcome is flipped. With probability 2p 3 , a single-qubit X-basis measurement outcome is flipped.
FIG. 8. Spacetime diagram of a lattice surgery protocol illustrating different types of failure mechanisms represented by the vector b = (bZL, bTL, bZR, bX ). For each element bj in the vector b, we associated a logical sheet (shown in yellow or green when triggered) and we have bj = 1 if and only if the relevant error type crosses the logical sheet an odd number of times. Z-strings trigger yellow logical sheets and terminate on pink boundaries. X-strings trigger green logical sheets and terminate on the foreground and background boundaries (these are transparent for visual clarity). For instance, a (1, 0, 0, 0) event (top right) occurs in the presence of a logical Z string-like excitation on the left surface code patch. Since the excitation does not cross the yellow shaped logical sheet, the correct outcome of the multi-qubit Pauli measurement is recorded. A (0, 1, 0, 0) event (middle right) is a pure timelike failure, where the incorrect multiqubit Pauli measurement outcome is recorded without introducing additional logical Z failures to the two logical patches. This holds because the Z string only crosses the yellow shaped logical sheet. A (0, 1, 1, 0) event (bottom left) occurs when a Z string-like excitation crosses the rightmost yellow logical sheet in addition to the yellow shaped logical sheet. Such an error results in both a logical Z error on the right logical patch in addition to a timelike lattice surgery failure. Lastly, we illustrate an X string-like excitation crossing the green logical sheet resulting in a logical X failure on both logical patches.

Lastly, each idle gate location is followed by a Pauli
Z with probability p 3 , and a {X, Y } error each with probability p 3η .
In our simulations, we chose η = 100 and for simplicity added a single idle location on the data qubits during the measurement and re-initialization of ancilla qubits. Note that in the limit η → 1, the above noise model reduces to the depolarizing noise model used in Refs. [48,49] with the exception that two idle locations were included during measurement and re-initialization of the ancillas. Furthermore, the above noise model assumes that the duration of a CNOT gate is identical to the duration of an ancilla measurement and re-initialization. For a biased circuit-level noise model (such as the one described above) and an X ⊗ X Pauli measurement implemented via lattice surgery, there are fifteen different types of failure mechanisms that can arise during the protocol. We label such failure mechanisms using the binary string The bit b ZL = 1 corresponds to a logical Z error on the left logical patch, whereas b ZR = 1 corresponds to a logical Z error on the right logical patch. The bit b TL = 1 indicates a logical timelike failure. Finally, the bit b X indicates if a logical X error occurred during the lattice surgery protocol. Examples of such failure mechanisms using two-dimensional slices of the matching graph are shown in Fig. 7. For instance, in Fig. 7a, a series of consecutive measurement errors occur on the same parity vertex, starting in round r + 1. Since a single measurement error occurs in round r + 1, the wrong parity of X ⊗ X is measured. Due to the series of measurement errors, a single parity vertex near the top boundary is highlighted in G r . The shortest path correction (highlighted in yellow) matches the highlighted parity vertex to the future ancilla vertex. Hence no parity corrections are applied and a logical (0, 1, 0, 0) error occurs.
Another example is shown in Fig. 7b, where a string of Z data-qubit errors results in the highlighted vertices shown in the figure. Prior to the merge, there are no Z errors at the boundary between the logical patches and routing space. Therefore, the correct parity of X ⊗ X is measured. However, the minimumweight path (highlighted in yellow) connecting the highlighted vertex to the future ancilla vertex goes through a transition vertex, so that v 2 = 1. The correction thus results in a logical Z error on the left logical patch, in addition to a logical parity measurement failure (since the decoder incorrectly flips the parity) leaving the code with a logical (1, 1, 0, 0) error. Additional examples of failure mechanisms using space-time diagrams instead of matching graphs are provided in Fig. 8.
We note that if stabilizer measurements are terminated after r + d m rounds, higher order failure mechanisms are required to produce logical failures corresponding to (1, 0, 0, 0), (0, 0, 1, 0), (1, 0, 1, 0) and (1, 1, 1, 0) bit strings. Failures corresponding to such strings are thus much less likely compared to (1, 1, 0, 0), (0, 1, 0, 0) and (0, 1, 1, 0) [50]. For instance, to obtain a logical failure of the type (1, 0, 0, 0), a logical error on the left logical patch would need to occur without flipping the parity of the X ⊗ X measurement. As such, in addition to a logical Z error occurring before the merge, a second failure mechanism would need to occur to undo the wrong parity flip (such as a string of measurement errors like the one shown in Fig. 7a). Given these observations, we only present the logical failure rate polynomials P (0,1,0,0) , P (1,1,0,0) and P (0,1,1,0) . Note that due to the high noise bias, we choose a d x distance such that µP (0,0,0,1) ≤ δ, where µ is the number of lattice surgery operations in the algorithm.
Our simulations are performed for syndrome measurement rounds 1 to r + d m , based on the biased circuit-level noise model described above. The last round is a round of perfect error correction to guarantee projection to the code space. We used Algorithm 1 to correct both spacelike and timelike errors, where each edge in step 7) of the algorithm encodes a particular correction on a subset of the data qubits. Note that we do not perform a round of perfect error correction between rounds r and r + 1 as was done in Ref. [44]. Instead, we performed MWPM using the full syndrome measurement history from rounds 1 to r + d m , and use Algorithm 1 to determine which corrections are applied.

D. Simulation results and conclusions
Here, we report the outcome of our lattice surgery simulations as summarised by Eqs. (3) to (5) and Fig. 9. For each of the dominant failure mechanisms in (b ZL , b TL , b ZR , b X ), we fit all our data to an ansatz with two free parameters to generate the failure rate polynomials given by In Eq. (3), corresponds to the width of the routing space between the two logical patches. All logical error rate polynomials in Eqs. (4) to (6) provide error rates per syndrome measurement round. Per round error rates are computed by varying the number of syndrome measurement rounds for fixed d x and d z distances, repeating such procedures for different d x and d z distances, and fitting all the obtained data to an ansatz. Note that the d z distance in Eq. (6) is taken to be the d z distance of the full merged surface code patches, whereas d z in Eqs. (4) and (5) is the d z distance of the individual logical patches (since logical Z errors are much less likely to occur when surface code patches are merged due to the increased d z distance). In Fig. 9, we compare the best-fit polynomial P (0,1,0,0) with a representative subset of our data obtained from our Monte Carlo simulations for various values of d m , where the chosen parameters are described in the caption. The plot shows the exponential suppression in purely timelike error probabilities as a function of d m , and that the data is in good agreement with our best-fit polynomials. In Section VI, after introducing our protocol for minimizing routing costs, we use the logical failure rate polynomials in Eqs. (3) to (6) to estimate the overhead costs for implementing quantum algorithms.
We conclude this section by pointing out that the above labels used in the logical error rate polynomials, which represent different failure mechanisms that can occur during lattice surgery, can be generalized using k + 2 bits for an arbitrary P 1 ⊗ P 2 ⊗ · · · ⊗ P k Pauli measurement. Out of the k + 2 bits, k bits are used to represent a logical Pauli error on the logical patches which can also flip the parity of the measured Pauli (such as logical Z errors for X ⊗ X). Another bit represents a logical parity measurement failure. The last bit encodes the logical Pauli error that affects all logical qubits in the merged patch (such as a logical X error during an X ⊗ X measurement).

IV. PROTOCOL FOR TWIST-FREE LATTICE SURGERY
Lattice surgery provides a fault-tolerant way to measure Pauli operators and is well suited for topological codes. However, not all Pauli operators are equally easy to measure. We say an operator is a XZ-Pauli when it is a tensor product of {1l, X, Z}. For XZ Pauli operators, standard lattice surgery suffices and a surface code architecture would need only weight 4 stabilizer measurements. However, for some topological codes such the surface code, measuring Pauli operators containing any Y terms is more difficult. It has been shown that this can be achieved by introducing a twist defect for each Y in the Pauli operator [11,12,51]. For examples of twist defect lattice surgery, see Fig. 3 of this work and Fig. 40(d) of Ref. [12]. However, each surface code twist defect requires a stabilizer measurement on 5 physical qubits. This can be very challenging to implement in a 2D architecture with limited connectivity and could require multiple ancilla qubits. The additional ancilla qubits and gates will thus increase the total measurement failure probabilities for weight-five checks. Furthermore, even a single isolated weight-five check will have an impact on the gate scheduling over the whole surface code patch which can introduce additional types of correlated errors. Lastly, twist-based surface codes coupled with a MWPM decoder have been shown to have a reduced effective code distance [52]. As such, we expect twist-based Pauli measurements will suffer a performance loss relative to twist-free Pauli measurements. Any increases in measurement error probabilities during twist-based lattice surgery can be suppressed by extending d m . In other words, we expect use of twists to increase the runtime of lattice surgery computations. The exact magnitude of this runtime cost is currently unknown and will depend on the precise twist implementation details and the noise model [53].
Here we outline an alternative twist-free approach to measuring operators containing Pauli Y terms. The additional cost of supporting Pauli Y terms relative to measuring XZ-Pauli operators is roughly a 2× slowdown in algorithm runtime and a +2 additive cost in the number of logical qubits (though we show later that one of these logical qubits can be borrowed from space allocated to routing). Whether this 2× slowdown is preferable to the slowdown incurred by using twists is an open question, due to a lack of data on twist performance.
To explain our protocol, we make use of the notation for any binary vectors u = (u 1 , u 2 , . . . u N ) and v = (v 1 , v 2 , . . . v N ). It is well-known that any Hermitian Pauli operator can (up to a ±1 phase) be decomposed as Then u · v = j u j v j counts the number of locations where X[u] and Z[v] have overlapping support. Therefore, using . To obfuscate this unwanted information, we perform the protocol as illustrated in Fig. 10  6. If q = 1, then apply a Z[v] correction (to the Pauli frame).
In step 4, we use a constant c that we define as follows  . Example of a 2D layout for a quantum computer and the implementation of twist-free lattice surgery operations to realise a P = Y ⊗ Y ⊗ X ⊗ Z Pauli measurement, for which the circuit diagram is given in Fig. 10. In (a), we show an initial layout of thin, rectangular surface codes (dx = 3 and dz = 7) with labels showing the Pauli operators to be measured. Note the pink rectangle where the hardware layout is slightly adapted to enable elongated stabilizer measurements just within this region. The space between surface code patches is referred to as routing space and will be used in the subsequent steps. In (b), we show the preparation of a logical |0 state in the routing space. In (c), we show a lattice surgery measurement of X ⊗ X ⊗ X ⊗ 1l ⊗ X. In (d), we show a lattice surgery measurement of Z ⊗ Z ⊗ 1l ⊗ Z ⊗ X where at every Z logical boundary we use a domain wall and at the single X logical boundary use elongated stabilizers within the pink region. In both steps (c) and (d), the parity of the logical Pauli measurement is determined by the product of the stabilizers marked with a white vertex, with corrections applied by the decoder.
(|0 + i|1 )/ √ 2. Then, by measuring Y ⊗ P we effectively measure P . Furthermore, if P contains an odd number of Y terms, then Y ⊗ P contains an even number and can be measured using the above construction. This modified variant of the twist-free lattice surgery measurement is also illustrated in Fig. 10. Note that the Y ⊗P measurement does not affect the |Y state and so it can be reused many times and its preparation cost (e.g. through state distillation [7]) only needs to be paid once per algorithm and is therefore negligible.
Our twist-free approach uses up to two logical ancillas, a |0 ancilla that is repeatedly reset and sometimes a |Y ancilla that is reused. Therefore, we have an additive +2 logical qubit cost. The runtime cost is dominated by steps 2 and 3. All other steps use only single-qubit operations that effectively take zero time in lattice surgery. Therefore, the runtime has doubled compared to the runtime of measuring a Pauli operator free from Y terms. If steps 2 and 3 each fail with probability P (e.g as in Eq. (1)), then whole protocol fails with probability P = 2P(1 − P) ≈ 2P. This is a minor effect since failure probabilities are exponentially suppressed by increasing the runtime d m and using large enough d x and d z code distances.
In Fig. 11, we show an example of how the circuit picture of Pauli measurements in Fig. 10 can be explicitly mapped into a 2D lattice surgery protocol consisting only of XZ-Pauli operator measurements. We present our protocol for thin rectangular surface codes, though our protocol would also work for square surface code patches. First, we notice how the temporary |0 ancilla is prepared in the spare routing space provided for performing lattice surgery, so that it does not actually contribute to the space overhead. Notice also that to accomplish the Z[v] ⊗ X A measurement in panel 4, there is one region (highlighted in pink) where we measure elongated stabilizers. Here we have assumed that the hardware is permanently deformed in this region. In other words, the circuit is hardwired at this location so that the elongated stabilizer can be measured with minimal performance loss compared to any other weight-four stabilizers (for instance, by using longer resonators) and thus does not require additional ancilla qubits. Alternatively, these elongated stabilizers could also be measured in a homogeneous hardware layout but with a modified procedure for performing the measurement. For instance, one could use two ancilla qubits prepared in a GHZ state to measure the elongated stabilizers. However, since the result of the stabilizer measurement would be given by the product of the measurement outcomes of both ancillas, and due to the extra fault locations, using GHZ states would increase the total measurement failure probability of the elongated checks. Another possibility would be to use the second ancilla qubit as a flag qubit [20,21,48,49,[54][55][56][57][58][59][60][61]. However, by doing so, one might require an additional time CNOT step per round of stabilizer measurements to perform all two-qubit gates for the stabilizer measurements while avoiding scheduling conflicts.

V. TEMPORAL ENCODING FOR FAST LATTICE SURGERY
In Sections II A and II B, we discussed the standard approach to Pauli-based computation using lattice surgery and related algorithm runtime bottlenecks. In this section, we show how to exceed this bottleneck and run algorithms at faster clockspeeds using temporal encoding of lattice surgery (TELS). The key idea behind TELS is to use fast, noisy lattice surgery operations, with this noise corrected by encoding the sequence of Pauli measurements within a classical error correcting code. This encoding can be thought of as taking place in the time domain, so the encoding does not directly lead to additional qubit overhead costs. Though, there can be a small additive qubit cost when TELS is used for magic state injection, with magic states needing to be stored for slightly longer times.

A. Parallelizable Pauli measurements
Here we review parallelization where the sequence of Pauli measurements can be grouped into sets of parallelizable Pauli measurements. Let P [t,t+k] := {P t , P t+1 , . . . , P t+k } be a sub-sequence of our Pauli operators. We say P [t,t+k] is a parallelizable set if: they all commute; any Clifford corrections can be commuted to the end of the sub-sequence. For example, we get a parallelizable set whenever we use magic states to perform a T ⊗k gate. We show in Fig. 1, several ways to implement T ⊗2 with the PBC approach requiring two parallelizable measurements {P 1 , P 2 } = {CZ 1 Z 3 C † , CZ 2 Z 4 C † }. Therefore, given a circuit with µ T -gates and T -depth γ, the Pauli measurement sequence can always be split into a sequence of γ parallelizable sets of average size k := µ/γ.
Fowler introduced the notion of time-optimal quantum computation [62] and Litinski (see Sec 5.1 of Ref. [12]) showed how this can be realised using lattice surgery in a two-dimensional layout. In time-optimal PBC, an n-qubit computation of T -depth γ can be reduced to runtime O(n + γ). However, the space-time volume is never compressed by using the time-optimal approach, so that reducing the algorithm runtime to 10% of a seqPBC runtime would require at least a 10× increase in qubit cost. Litinski worked through some highly-paralleziable examples in greater detail, showing: a reduction to 56.5% of seqPBC runtime would need over 6× the qubit costs; and a reduction to 11% of seqPBC runtime would need over 20× the qubit cost. The qualifier "over" in these estimates reflects that an increase in space-time volume will also increase the code distance needed, further increasing the overhead of the time optimal approach by a polylogarthmic factor on top of Litiniski's estimates. While this is a powerful approach to exploring space-time tradeoffs, early fault-tolerant quantum computers will be qubit limited. Kim et al. [63] also proposed another way to exploit large parallelizable sets, but they used long-range gates not possible in 2D hardware and also made some strong assumptions regarding the speed at which transversal gates can be fault-tolerantly applied.
From the above, we see that it is crucial to understand the extent to which algorithms possess potential for paralleliza- 12. A simple TELS protocol: it is an error detecting version of the PBC used in Fig. 1 using the measurement code given in Eq. (14). While this approach uses 3 multi-qubit Pauli measurements, the capability to detect an error means that lattice surgery can be executed over a shorter time dm. If there is no error, we will have that the measurement outcomes obey m1 ⊕ m2 = m3. When an error is detected we simply remeasure the Pauli operators.
tion. Fortunately Kim et al. [63] have already studied quantum algorithms for chemistry and found k to vary between 9 and 14 depending on the orbitals used, so we will regard this as a practically reasonable range.

B. Encodings and code parameter proofs
Here we introduce our own approach to exploiting parallelizable Pauli sets. Unlike, previous time-optimal approaches, it does not incur a multiplicative qubit overhead cost and can reduce the overall space-time cost.
Due to the properties of a parallelizable Pauli set, all Pauli operators within the set can be measured in any order. Furthermore, we can measure any set S that generates the group P t , P t+1 , . . . , P t+k . If the set S is overcomplete, there will be some linear dependencies between the measurements that can be used to detect (and correct) for any errors in the lattice surgery measurements. For example, consider the simplest parallelizable set {P 1 , P 2 } as in Fig. 1 and let d m be the required lattice surgery time, so performing both measurements takes 2(d m + 1) error correction cycles. We could instead measure {P 1 , P 2 , P 1 P 2 }. If the 3rd measurement outcome is not equal to the product of the first two measurements, then we know something has gone wrong and can repeat the measurements to gain more certainty of the true values. By measuring the overcomplete set {P 1 , P 2 , P 1 P 2 } we have performed an extra lattice surgery measurement, but we have gained that we can tolerate a single lattice surgery failure. This means that we could instead use d m d m and still achieve the same overall success probability. If 3d m 2d m then the computation has been sped-up. This is the key insight behind temporal encoding of lattice surgery (TELS) and next we dive deeper into more general encoding schemes and their performance.
In general, given a parallelizable Pauli set P = {P t , P t+1 , . . . , P t+k−1 } we can define operators generated from this set as follows where x is a length k binary column vector. Given a set that generates all the required Pauli operators, so S = P we can write the elements as with superscripts denoting different vectors. Since this is a generating set, the vectors {x 1 , x 2 , . . . , x n } must span the relevant space. Furthermore, we can define a matrix G with these vectors as columns and this matrix will specify the TELS protocol. In the simple k = 2 example where S = {P 1 , P 2 , P 1 P 2 }, we would have that Notice that the rows of this matrix generate the codewords of the [3, 2, 2] classical code. In general, we will consider G as the generator matrix for the codewords of an [n, k, d] classical code and we call this the measurement code for the protocol. Note that k is the number of (unencoded) Pauli operators in the generating set. We only consider full-rank G where k equals the number of rows in G. The number n represents how many Pauli measurements we physically perform in the encoded scheme and corresponds to the number of columns in G. The distance d is the lowest weight vector in the row-span of G.
Next we show that the code distance d does indeed quantify the ability of TELS to correct errors. First, we formalise the redundancy in the set of lattice-surgery measurements. For any length n binary vector u = (u 1 , u 2 , . . . , u n ), we have that Since the matrix G is full-rank and has more columns than rows, there will exist u such that j u j x j = 0. For these u, we have that Therefore, these u vectors describe redundancy in the measurements. The condition j u j x j = 0 can be rewritten compactly as Gu = 0. Following the convention in coding theory, this set of u is called the dual of G and denoted Next, we consider how this redundancy is used to detect timelike lattice surgery errors. We let m = {m 1 , m 2 , . . . m n } be a binary vector denoting the outcomes of the lattice surgery Pauli measurements in the set S. That is, if a measurement of Q[x j ] gives outcome "+1" we set m j = 0 and when the measurement of Q[x j ] gives "-1" we set m j = 1. Given a u ∈ G ⊥ , we know the Pauli operators product to the identity (recall Eq. (16)) so when there are no time-like lattice surgery errors we have Conversely, if we observe then we know a time-like lattice surgery error must have occurred. Let us write m = s + e where s is the ideal measurement outcome and e is the measurement error. The ideal measurement outcomes are self-consistent and so always satisfy u · s = 0 for all u ∈ G ⊥ . Therefore, we see that an error e is undetected if and only if u · e = 0 for some u ∈ G ⊥ . This is equivalent to undetected errors e being in the row-span of G (since the dual of the dual is always the original space). Recall, the distance d denotes the lowest (non-zero) weight vector in the row-span of G. Therefore, d also denotes the smallest number of time-like lattice surgery errors needed for them to be undetected by TELS. Consequently, if P is the probability of a single timelike error, TELS error-detection will fail with probability O(P d ).
Matrices such as G also appear in the literature in the context of measuring overcomplete sets of stabilizers for some quantum error correction code. In the error correction setting, these codes have been called measurement codes [64], metachecks [65,66], symmetries [67] and syndrome measurement codes [68]. However, to the best of our knowledge, this is the first time that this idea has been deployed in the context of lattice surgery as a strategy to improve algorithm runtimes.

C. Examples and numerics
The simplest examples of TELS will detect a single error. Given a Pauli set {P t , P t+1 , . . . , P t+k } we measure each of these observables separately, and then the product of them so that the measurement code has generator matrix which is an identity matrix padded with an extra column that is an all 1 vector. Therefore, this corresponds to a [α + 1, α, 2] classical code that detects a single error. Concatenating such a code m times gives a code with parameters [(α+1) m , α m , 2 m ].
We can also consider using a simple [8,4,4]  This corresponds to replacing {P 1 , P 2 , P 3 , P 4 } with S containing the 8 operators S = {P 2 P 3 P 4 , P 2 P 3 , P 2 P 4 , P 2 , P 1 P 3 P 4 , P 1 P 3 , P 1 P 4 , P 1 } Because the generator matrix has distance 4, this scheme will detect up to 3 errors. This Hamming code is the m = 3 member of a family of [2 m , 2 m − m − 1, 4] extended Hamming codes.
There are several viable strategies to handle a detected error. Here we consider the following detect/remeasure strategy: if a distance d measurement code is used with lattice surgery performed for time d m , then whenever an error is detected we "remeasure" but this time using the original Pauli set P instead of using the overcomplete set S. For the remeasure round, we perform lattice surgery using an amount of time qd m where q is some constant scaling factor whose value we discuss shortly. The expected runtime to execute the protocol is then where p d is the probability of detecting an error. When we do not detect an error, the probability of a failure is O(p d( dm +1 2 ) ) ≈ O(p ddm/2 ). When we do detect an error, the remeasure round will fail with probability O(p (qdm+1)/2 ) ≈ O(p qdm/2 ). The total failure probability will then be O(p ddm/2 + p d p qdm/2 ). Therefore, we can ensure the total failure probability is O(p ddm/2 ) by setting q = d. However, due to constant factors the optimal choice of q may be slightly different from q = d and so we numerically optimize from this initial guess. When an error detection occurs, this leads to a long delay of time kqd m to implement the remeasure round, but in practice p d is so small that this has minimal impact on the expected runtime.
We could alternatively just measure the overcomplete set S and run the measurement code in error correction mode with lattice surgery repeated for time d m . Then the protocol would fail with probability ≈ O(p dd m /4 ). Compared to detect/remeasure strategy, we need d m ≈ 2d m to achieve the same failure probability. The runtime of an error correction scheme will then be T = n(d m + 1) ≈ n(2d m + 1).
Compared to Eq. (23), in T we have dropped the second term at the price of roughly doubling the first term. However, the second term was small because p d is small, so overall error correction is not favourable compared to our detect/remeasure scheme. Fig. 13 shows some example numerical results using distance 2 and 4 codes. For example, when performing k = 11 are concatenated single error detection codes. While there are big jumps in k between the best performing codes, these jumps could be partially smoothed out by considering other codes such as concatenated codes with different inner/outer code sizes, such as the [(α + 1)(β + 1), αβ, 4] codes. We also considered the triply concatenated codes with parameters [(α + 1) 3 , α 3 , 8] but they performed poorly in the parameter regime shown here and so have been omitted for clarity. parallelizable gates the TELS scheme using extended Hamming codes will have a runtime of 46% that of a standard se-qPBC approach that measures the original parallelizable Pauli set P. Since Kim et al. [63] found that interesting quantum algorithms can have average k between 9 and 14, this suggests around a 2.2× speedup due to TELS on practical problems. To obtain a similar speedup, Litinski estimated the time-optimal approach would cost over 6× in qubit overhead. The TELS scheme has no multiplicative qubit overhead, though it does have a small additive qubit overhead as all k magic states must be stored for the full duration of the protocol. However, faulttolerant algorithms typically have N 100 logical qubits, and so the increase in logical qubits N → N + k is small. Indeed, overall the spacetime volume will decrease, which is impossible using Fowler's time optimal approach. We did not find any examples of higher distance codes (d > 4) that performed better in this parameter regime (e.g. δ = 10 −15 ). Going to even lower error rates (δ 10 −15 ) or changing the noise model, then higher distance codes will become useful and the advantage will improve further. Indeed, because of the existence of good classical codes with n/k = O(1) and d = Ω(k), we know that TELS will asymptotically (for large k) be able to execute k parallelizable Pauli measurements in O(k) time and with error δ = O(p αk ) for some constant α. In contrast, a standard seqPBC with unencoded lattice surgery would take runtime O(kpolylog(k)) to achieve the same error.

VI. THE CORE-CACHE ARCHITECTURE AND ROUTING OVERHEADS
In this section, we discuss a layout and data access structure for a quantum computer that extends on the layout given in panel 1 of Fig. 11. We will consider patches of (possibly rectangular) surface codes of size d z by d x . Between these patches we will have some qubits dedicated as a lattice surgery "bus" or routing space. We say the routing space supports fast access if logical X and Z operators of every patch are adjacent to the routing space. Litinski proposed several data access structures [12], with his fast data structures using twotile two-qubit patches (surface code patches that each encode two logical qubits) that are sometimes called hexon surface codes.
In Section VI A , we show that the hexon approach is not necessary, and give a layout for a quantum core (what Litiniski calls a fast data access structure). Furthermore, we show that a lower routing overhead is possible when the surface code patches are thin rectangles (e.g. d z d x ) as is the optimal choice for highly biased noise. In Section VI B, we will also discuss the idea of a core-cache model where logical qubits are temporally moved out of the fast data access structure to reduce the routing overhead.

A. A quantum computer core
We count resource costs in terms of square tiles as defined in Fig. 14a. Each tile contains a single data qubit and four quarter ancilla qubits. Therefore, a device with T tiles will require roughly 2T qubits. However, we cannot cut qubits into quarters and so a precise counting will include these. For instance, a rectangular device with a height of h tiles and width of w tiles would have a total of T = wh tiles and 2T + w + h + 1 qubits. When the device is roughly square, then h and w are of size O( √ T ) and so a negligible additive cost compared to 2T . We can realise a surface code patch using d x d z tiles as in Fig. 14b and therefore 2d x d z qubits. The number of data and ancilla qubits actively used in the surface code patch is 2d x d z − 1, and so when we try to pack them in a 2D arrangement, the tightest possible packing will contain one idling qubit per patch.
We collect surface code patches into groups of four, which we call a unit cell (see Fig. 14c). These unit cells are then repeated as shown in Fig. 14d to get the required number of logical qubits. Furthermore, we arrange the unit cells to form a quantum "core", and assume some additional padding shown in Fig. 14e. Notice that in Fig. 14, every patch has logical X and Z boundary operators connected to the routing space, which enables us to quickly perform multi-qubit Pauli measurements between any subset of qubits within the core. Additionally, there are unused qubits between some of the surface code patches. The spacing of the qubits ensures that lattice surgery can be performed (as we saw in Fig. 11) without using lattice twists that incur additional practical difficulties to implement in fixed and low connectivity hardware. In contrast, the data access structures proposed by Litinski [12] assumed liberal use of twists.
The routing overhead for unit cells is then the ratio of the number of tiles divided by the cost without any routing space (e.g. 4d z d x ).
The overhead for the entire core will include a contribution from the additional padding shown in Fig. 14e. However, in the limit of many unit cells, the total overhead is dominated by the unit cell overhead. In the limit of large distances d z , d x 1, we have Therefore, in the limit of large noise bias, d z d x , the routing overhead factor is 1.5×. We can compare this with the fast data blocks of Litinski (see e.g. Figure 23 of [12] ) where the overhead factor is 2× (in the large device limit) and so is more expensive. In contrast, for unbiased noise and d z = d x our scheme has an asymptotic routing overhead factor of 9/4, which is slightly worse than Litinski's 2× routing overhead. Indeed, solving Eq. (26) equal to 2 shows that d x < (2/3)d z is the condition for our approach to have a routing overhead advantage. A more general analysis of the routing overhead which includes the green padded regions shown in Fig. 14e and contributions from the cache (see Section VI B) is given in Appendix C. We also point out that routing overhead is not the only important figure of merit. Our proposed design avoids twist defects and other significant lattice irregularities, and therefore may be useful even without noise bias.

B. A quantum computer cache
We now proceed to build additional structure around the core. Using state distillation to prepare magic states, we will need factories that supply the core. The purely fast data access approach prioritises speed over qubit cost. Here we also discuss the idea of a core and cache architecture, where some logical qubits are temporarily stored in a quantum analog of cache. However, with some time cost, logical qubits can be quickly swapped in and out of the cache. A small-scale sketch of a device comprising core, cache and magic state factories is illustrated in Fig. 15.
Packing qubits more compactly in the cache will clearly reduce the overhead costs. However, such a layout comes at a price since only the X logical operators of these qubits can be accessed when it is in the cache. To access the logical Z operators, it must be swapped out of the cache and into the core. For a qubit stored in the cache, we can perform the following operations 1. Perform single qubit X or Z measurements for a qubit in the cache (time cost: zero).
2. Measure multi-qubit operators of the form A ⊗ B where B acts on the cache qubits and is a tensor product of X operators only, and A acts on the core qubits and can be an arbitary Pauli operator. (time cost: d m ).
3. Perform Pauli updates to the Pauli frame (in software).
4. Perform Cliffords to qubits in the core, by updating the Clifford frame (in software).
For algorithms where swapping in/out of the cache can be made infrequently (compared to other time costs), our approach will reduce the routing overhead with a mild impact on the algorithm runtime. Note also that our core-cache architecture can be used in combination with the twist-free or TELS schemes already proposed. This leaves the question of how to swap the location of qubits from the cache to the core. We cannot directly implement the Clifford SWAP operation, since Clifford operations can only be performed on core qubits. Furthermore, surface code patches cannot be moved around to swap their positions since such operations would require the Clifford frame C to be relabelled. Such a relabelling might make C act non-trivially on qubits in the cache. Rather, when performing a SWAP from a qubit in 17. Example lattice surgery diagrams for: using write to cache and read from cache to exchange qubits j and k between core and cache. (a) Illustration of the initial configuration with a |0 ancilla (index label i) in the cache. Qubit j is in the core and qubit k is in the cache and we wish to swap their locations. (b) We measure P ⊗ Xi where P = CXjC † where C is the Clifford frame. For simplicity, we assume P is composed of only X operators. (c) We measure Q = CZjC † and this time assume it is composed of only Z operators. (d) We measure P ⊗ X k where P = CXjC † is the same operator as in (b). (e) We measure the single qubit Pauli Z k . If either of the simplifying assumptions we made for P or Q do not hold, then we will need to use the twist-free protocol (or use twists). (f) At the end of the protocol, qubit j is in the cache, qubit k is in the core, and there is a |0 ancilla in the cache ready for further swaps. the core to the cache, we need to clean the Clifford frame so it only acts on core qubits.
We now define two elementary operations: write to cache (WTC) and read from cache (RFC), which when combined will enable a Clifford-cleaned swap. We first present the WTC and RFC protocols using circuit diagrams in Fig. 16. They are both essentially 2-qubit teleportation protocols, but with the Clifford frame adapting the Pauli measurements performed. The WTC operation uses two multi-qubit Pauli measurements, whereas RFC can be performed faster as it uses only one multi-qubit Pauli measurement and a single-qubit Pauli measurement (that takes zero time). The WTC operation requires a logical |0 ancilla in the cache, which through the protocol swaps place with a logical qubit in the core. The RFC operation requires a logical |0 ancilla in the core, which through the protocol swaps place with a logical qubit in the cache.
For a pair of logical qubits, one in the core and one in the cache, we can cleanly SWAP their positions, by executing WTC followed by RFC. The |0 ancilla initially in the cache, will move to the core, then back to the cache, but to a different cache location than it started. The whole swap procedure requires three multi-qubit Pauli measurements. Fig. 17 shows an example using lattice surgery. This figure shows a simple scenario where these three multi-qubit Pauli measurements are XZ Pauli measurements (even after conjugated by the Clifford frame) and so can be realised with three simple lattice surgery operations. However, more generally when some measurements are not of XZ-type, we need to either: use twist defects (and benchmark their performance); or use our twistfree protocol and realise the swap with up to six lattice surgery operations.
In Appendix C, we perform a resource cost analysis for simu-lating the Hubbard model using the full core-cache architecture described in this section. In particular, we provide a rigorous analysis of routing overhead costs including contributions from the cache and green padded region in Fig. 14e.

VII. CONCLUSION
In Section III we introduced a decoding algorithm compatible with lattice surgery protocols and numerically computed failure rate polynomials for the dominant failure mechanisms of an X ⊗ X Pauli measurement. Our analysis allows one to compute appropriate d x and d z code distances, as well as the number of syndrome measurement rounds d m during lattice surgery for successfully implementing algorithms.
In Section IV, we introduced a twist free protocol for measuring arbitrary Pauli operators using lattice surgery. The protocol incurs a 2× slowdown in algorithm runtime. However surface codes with twists require higher weight stabilizer measurements, increased gate scheduling complexities and have a reduced effective code distance when using a MWPM decoder. Such features will inevitably cause a reduction in performance compared to lattice surgery protocols involving only X and Z Pauli measurements. Consequently, a careful numerical analysis with twists is needed in order to determine whether using twists can beat the 2× cost of the twist-free approach.
In Section V we introduced a technique which we call temporal encoding of lattice surgery. By encoding lattice surgery measurements in the time domain, we showed that a 2× reduction in algorithm runtimes can be achieved for quantum algorithms of practical scale without incurring additional qubit overhead costs. For more highly parallelizable algorithms or larger algorithms, using larger classical code distances and exploring other code families will lead to even greater improvements in algorithm runtimes. After posting a pre-print of this work, Craig Gidney emailed us and later posted on Twitter, remarking that temporal encoding of lattice surgery could be used to reduce the runtime of magic state distillation factories.
Lastly, in Section VI, we provided a core-cache architecture compatible with our new lattice surgery protocols. A subset of the data qubits are stored in a cache, which reduces the footprint of the routing space, and can be quickly accessed when Z measurements need to be performed. We found that for such an architecture, routing overhead costs add a factor of 1.5× to the total resource costs for performing lattice surgery. A clear direction for future work would be to analyze such architectures in the presence of twists in order to better understand the tradeoffs with using a twist-free lattice surgery protocol.
Apart from considering twists, a direction of future work would be to apply our methods using other error correcting codes to potentially achieve lower resource costs. Promising code candidates include codes tailored for biased noise such as the XZZX surface codes [67,[69][70][71], subsystem codes with high thresholds [72] and other codes families with high encoding rates such as hyperbolic surface codes [73,74].

VIII. ACKNOWLEDGMENTS
We thank Giacomo Torlai for his help with using the AWS clusters. We thank Oscar Higgott, Fernando Brandao and Noah Shutty for their comments and suggestions on our manuscript.
Appendix A: Twist-free proof Here we give a formal proof that the twist-free lattice surgery protocol works as claimed. Consider the case when step 5 yields a q = 0 outcome so we project onto the |0 state. Then steps 1-5 implement where Using that for arbitrary Q and so m x ⊕ m z ⊕ 1 is the outcome of measuring P (justifying c = 1 in this case). Recall that we are currently assuming u · v is even.
In the event that step 5 yields a q = 1 outcome, we have that Using that for arbitrary Q We have Therefore, we see that M −1 differs from M +1 by a Z[v] correction that we perform in step 6. There is also a global phase (−1) mz but this is unimportant. We remark that for the case where u · v = 1 (mod 4), adding an additional Y operator to P and performing the measurement using the |Y ancilla is identical to the case where u · v = 2 (mod 4). Hence c will be 1. A similar argument can be used to show that c = 0 for the case where u · v = 3 (mod 4).
Appendix B: Treating space-time correlated edges incident to parity vertices As mentioned in Section III, the space-time correlated edges incident to the parity vertices during the first round of the merged surface code patches (i.e. vertices v ∈ V (r+1) par ) need to be treated with care. Such edges are highlighted in red, black and purple in Figs. 18b and 18c for an X ⊗ X measurement.
First, in round r + 1, edges highlighted in purple must have infinite weight and can thus be removed. This is due to the fact that when the logical patches are merged using the routing space in round r + 1, individual X-stabilizer measurements performed in the routing space (whose ancilla's are marked by white vertices) will have random outcomes and thus cannot be highlighted. As such, failures arising from two-qubit gates performed in the routing space, and which introduce errors that do not interact with stabilizers belonging to logical patches, cannot generate a non-trivial measurement outcome between two consecutive rounds of stabilizer measurements. Although the purple edges discussed above must be removed when they are incident to vertices in rounds r + 1 and r + 2, such edges incident to vertices v i and v j belonging rounds greater than r + 1 must be included since they will have finite weights.
Second, space-time edges incident to a single parity vertex v ∈ V (r+1) par are highlighted in red in Figs. 18b and 18c. Note that in round r + 1, such edges are incident to parity vertices colored in purple, which are also transition vertices. Such edges arise from two-qubit gate failures in round j ≥ r + 1 with the property that the errors introduced by the failure only flip the parity vertices in that round. In round j + 1, the error is detected by X-stabilizers belonging to both logical patches and the routing region. Errors introduced by such failures can flip the parity of the X ⊗ X measurement outcome. As such, v 2 (defined in Algorithm 1) should also include the number of highlighted red space-time correlated edges incident to transition vertices.
Third, space-time correlated edges highlighted in black have the same effect for correcting errors as all other space-time correlated edges belonging to the logical patches (highlighted in green). The reason is that they are incident to parity vertices in rounds j ≥ r + 2 and thus failure mechanisms leading to such edges cannot flip the parity of the X ⊗ X measurement outcome.
Lastly, we note that a two-qubit gate failure arising in round r + 1 and which results in a red highlighted space-time correlated edge will only highlight a single vertex (belonging to the logical patch in round r + 2) throughout the entire syndrome measurement history (assuming no other failures occur). This is due to the random outcomes of X-stabilizers in round r + 1 (so that vertices for such stabilizers cannot be highlighted in round r + 1). Since there is an asymmetry between the number of red space-time correlated edges incident to the left data qubit patch and those incident to the right data qubit patch, an asymmetry in the logical failure rate polynomials P (1,0,0,0) and P (0,0,1,0) (defined in Section III) will also arise.
Appendix C: Resource cost analysis of the Hubbard model Following [75], the total number of logical qubits used for simulating a Hubbard model of lattice size L is N Q = 2L 2 + L 2 2 + 2. If T gates are performed by catalysis, then an extra logical qubit is needed. We also add another logical qubit for the logical |0 required in the WTC/RFC protocol described in Section VI B. Using the core-cache model shown in Fig. 15, let N 1 be the number of logical qubits in the core, and N 2 the number of logical qubits in the cache. Adding the cost of the green padding shown in Fig. 14e, we now compute the total routing overhead cost in the core with h unit cells stacked in the vertical direction, and w unit cells stacked in the horizontal direction (h = w = 2 in Fig. 15d). We begin by defining the following functions which count the number of tiles in the green padded region of Fig. 14e TABLE I. For a Hubbard model simulation with lattice size L and unit cells of height h and width w in the core, we provide the minimum values of dx, dz and dm such that Eqs. (C7) to (C9) are satisfied with δ ∼ 1%. For the given parameters, we also provide the total number of physical qubits using Eq. (C6), and the number of logical qubits in the core and in the cache. The last column includes the multiplicative factor (defined in Eq. (C3)) that is added to the physical qubit overhead which takes into account routing costs. The values for NTOF and NT used in computing µ are obtained from Ref. [75]. All resource costs exclude contributions from the magic state factory.
region into four sections The total number of tiles in the green padded region is then given by The total routing overhead in the core is then O (core) (dx,dz,h,w) = wh(2d z + d x + 1)(3d x + 1) + S TRGP 4whd z d x . (C2) In the cache, the routing cost adds an additional 1 + N2−1 N2dx ∼ 1 + 1 dx multiplier when using surface code patches of distance d x and d z . Hence, the total routing costs including both the core and the cache is given by where we definedÕ For the Hubbard model, the total number of logical qubits N TLQ which excludes those used in the magic state factory is with since there are 4wh logical qubits in the core. The total number of physical qubits used in the algorithm is then where we used the fact that a surface code patch can be realized using a rectangular region of size 2d x d z .
We now compute the algorithm runtime. Let µ be the total number of injected magic states in the core and Pauli measurements in the algorithm. Recall that in Section II A, µ was shown to be given by µ = 4N TOF + N T . The time T b required to inject magic states via lattice surgery is thus T b = µ(d m + 1)T surf where T surf is the time required to measure the stabilizers and reset ancillas during one surface code syndrome measurement cycle. Using Eqs. (4) and (6) In Eq. (C7), we pessimistically take d x = (d x + 2 + h(3d x + 1))(d x + 2 + w(2d z + d x + 1)) − 4whd x d z , which is the full area of the routing space in the core. By doing so, we consider a worst case scenario where the full routing space is used to perform lattice surgery for each injected magic state. In Eq. (C8), we used P Z L = P (1,1,0,0) since the difference with P (0,1,1,0) is negligible. We also ignore higher order contributions arising from lattice surgery for the reasons explained in Section III. Lastly, in Eq. (C9) we pessimistically set F A = (d x + 2 + h(3d x + 1))(d x + 2 + w(2d z + d x + 1)) to be the full area in the core. Such an assignment is done to take into account the possibility that the full routing space can used when preforming lattice surgery after injecting a magic state. Such a scenario would lead to a large d z distance, which also includes contributions from the logical qubits.
In Table I we provide overhead costs associated with performing a Hubbard model simulation of lattice size L. Given the chosen values of h and w for a unit cells in the core, we first compute the minimum required values of d x , d z and d m by solving Eqs. (C7) to (C9) with δ ∼ 1%. We then compute the required number of physical qubits using Eq. (C6) and give the number of logical qubits in the core and in the cache. The last column includes the multiplicative factor resulting from the routing overhead costs. As can be seen, having more logical qubits in the core relative to those in the cache can substantially increase routing overhead costs.
Apart from the chosen value of d m which satisfies Eq. (C7), the algorithm runtime will depend on several factors. The first factor is the ratio of logical qubits used in the cache and in the core. Such a ratio will affect how many times one needs to read from, and write to the cache during the algorithm runtime.
The second factor involves whether multi-qubit Pauli operators with Y terms are measured using our twist-free approach or with twists. Lastly, the third factor includes runtime savings which can be achieved using our temporal encoding scheme for fast lattice surgery. As such, a more careful analysis of the algorithm runtimes is left for future work. We also leave the inclusion of resource costs associated with the magic state factories to future work. However, from the results of Refs. [33,76], we expect contributions from the magic state factories to only have a mild effect on the total resource overhead costs shown in Table I.