Surface code compilation via edge-disjoint paths

We provide an efficient algorithm to compile quantum circuits for fault-tolerant execution. We target surface codes, which form a 2D grid of logical qubits with nearest-neighbor logical operations. Embedding an input circuit's qubits in surface codes can result in long-range two-qubit operations across the grid. We show how to prepare many long-range Bell pairs on qubits connected by edge-disjoint paths of ancillas in constant depth that can be used to perform these long-range operations. This forms one core part of our Edge-Disjoint Paths Compilation (EDPC) algorithm, by easily performing many parallel long-range Clifford operations in constant depth. It also allows us to establish a connection between surface code compilation and several well-studied edge-disjoint paths problems. Similar techniques allow us to perform non-Clifford single-qubit rotations far from magic state distillation factories. In this case, we can easily find the maximum set of paths by a max-flow reduction, which forms the other major part of EDPC. EDPC has the best asymptotic worst-case performance guarantees on the circuit depth for compiling parallel operations when compared to related compilation methods based on swaps and network coding. EDPC also shows a quadratic depth improvement over sequential Pauli-based compilation for parallel rotations requiring magic resources. We implement EDPC and find significantly improved performance for circuits built from parallel cnots, and for circuits which implement the multi-controlled $X$ gate.


Introduction
Quantum hardware will always be somewhat faulty and subject to decoherence, due to inevitable fabrication imperfections and the impossibility of completely isolating physical systems. For large computations it becomes a certainty that faults will occur among the many qubits and operations involved. Fault-tolerant quantum computation (FTQC) can be implemented despite this by encoding the information in a quantum error correcting code and applying logical operations which are carefully designed to process the encoded information with an acceptably low effective error rate.
The surface code [Kit03; BK98] provides a promising approach to implement FTQC. Firstly, it can be implemented using geometrically local operations on a patch of qubits in a 2D grid, which is the natural setting for many hardware platforms including superconducting [Fow+12;Cha+20a] and Majorana [Kar+17] qubits. Secondly, the logical qubits it encodes remain protected even for relatively high noise rates, with a threshold of around 1% [WFH11]. Thirdly, a sufficiently general set of elementary logical operations can be performed fault tolerantly on qubits encoded in the surface code using lattice surgery [Hor+12]. By tiling the plane with surface code patches, a 2D grid of logical qubits is formed, where the elementary operations are geometrically local; see Figure 1. When combined with magic state distillation [BK05] these operations become universal for quantum computing. Indeed this approach, which we will refer to as the surface code architecture, is seen as among the most promising by many research groups and companies working in quantum computing [Fow+12;Cha+20b;YK17;FC16].
In this work, we seek to minimize the resources required to fault-tolerantly implement a quantum algorithm using the surface code architecture, which we will refer to as the surface code compilation problem. For concreteness, we will assume that the input quantum algorithm is expressed as a quantum circuit composed of preparations and destructive measurements of individual qubits in the Z or X basis, controlled-not (cnot), Pauli-X, -Y , and -Z, Hadamard (H), Phase (S) and T gates. Our results can be easily generalized to broader classes of input quantum circuits. The output is the quantum algorithm executed using the elementary logical surface code operations shown in Figure 1. Ultimately, we would like to minimize the physical space-time cost, which is the product of the number of physical qubits and the time required to run an algorithm. To avoid implementation details, we instead minimize the more abstract logical space-time cost, which is the number of logical qubits (the circuit width) multiplied by the number of logical time steps (the circuit depth) of the algorithm expressed in elementary surface code operations. The logical and physical space-time costs are expected to be 1-to-1 and monotonically related (see Appendix B), such that minimizing the former should minimize the latter.
A well-established approach to implement surface code compilation is known as sequential Pauli-based computation [Lit19], where non-Clifford operations are implemented by injection using Pauli measurements, and Clifford operations are conjugated through the circuit until the end. The circuit that is run in this approach then consists of a sequence of high-weight Pauli measurements which can have overlapping support leading them to be measured one after the other. For large input circuits this can be problematic because highly parallel input circuits can become serialized with prohibitive runtimes.
Figure 1: Elementary logical surface code operations. Single-qubit preparation in the X basis (i), and the Z basis (ii), and destructive single-qubit measurement, which moves the patch outside of the code space, in the X basis (iii), and the Z basis (iv), take 0 logical time steps. 1 logical time step: Two-qubit measurement of XX (v) and ZZ (vi). A move of a logical qubit from one patch to an unused patch (vii). Two-qubit preparation (viii) and destructive measurement (ix) in the Bell basis. 3 logical time steps: A Hadamard gate, which uses three ancilla patches (x). See Appendix A for further details.
A major challenge to solve the surface code compilation problem is that quantum algorithms typically involve operations between logical qubits that are far apart when laid out in a 2D grid. One approach to deal with a long-range gate is to swap logical qubits around until the pair of interacting qubits are next to one another [CSU19]. However, this can result in a deep circuit, see Figure 2a. A more efficient approach is to create long-range entanglement by producing Bell pairs, which for example can be used to implement a long-range cnot with a constant-depth circuit [Bri+98; LO17; Jav+17] (see Figure 2b). Both of these approaches can be implemented with the elementary operations of the surface code.
Moreover, algorithms typically consist of many long-range operations that can ideally be performed in parallel. For swap-based approaches, this can be done by considering a permutation of the logical qubits which is implemented by a sequence of swaps [Lao+18; Mur+19; ZW19]. Finding these swap circuits reduces to a routing problem on graphs [CSU19; SHT19]. There are efficient algorithms that solve this problem for certain families of graphs [ACG94; CSU19], but finding a minimal depth solution is NP-hard in general [BR17]. Alternatively, linear network coding can be used to prepare many long-range Bell pairs in constant depth [LOW10; Kob+09; Kob+11; SLI12; HPE19; BH20], and then these Bell pairs can be used to implement operations on pairs of distant qubits. However, a major barrier for using linear network coding is the lack of known efficient algorithms to find linear network codes.
In this paper, we provide a solution to the surface code compilation problem which generalizes the use of entanglement for long-range cnots discussed above to the implementation of many long-range operations in parallel. In particular, we propose the Edge-Disjoint Paths Compilation (EDPC) algorithm, which is a computationally efficient classical algorithm tailored to the elementary operations of the surface code architecture. By performing a detailed cost analysis for the execution of a set of quantum circuit benchmarks, we find evidence that our EDPC algorithm significantly outperforms other approaches.
EDPC reduces the problem of executing quantum circuits to problems in graph theory. Logical qubits correspond to graph vertices, and there is an edge between qubits if elementary surface code operations can be applied between them. We show how to perform multiple long-range cnots in constant depth along a set of edge-disjoint paths (EDP) in the graph. In other words, long-range cnots can be performed simultaneously, in one round, if their controls and targets are connected by edge-disjoint paths. This leads to the well-studied problem of finding maximum EDP sets [Kle96]. The ability to perform long-range cnots along with the elementary operations allows compilation of Clifford operations. We also give a construction for EDP sets that are asymptotically optimal in the depth of worst-case sets of independent cnots.
The final operations that complete our gate set for universal quantum computation with the surface code are T gates. The T gates are not natural operations on the surface code, but can be implemented fault-tolerantly by consuming specialized resource states, called magic states. Magic states can be produced using a highly-optimized process called magic state distillation, which we assume occurs independently of the computation on our code. We assume that logical magic states are available in a specified region of the grid. EDPC reduces magic state delivery to simple Max Flow instances that have known efficient algorithms [FF56]. We compare the depth of input circuits compiled using surface code compilation algorithms in the literature and EDPC in Table 1.
Table 1: A comparison in the depth of surface code compilation algorithms (that use Θ(n) space) for various input circuits of width n. We compare the worst-case performance for a single long-range cnot gate, for cnot circuits with n/2 parallel cnot gates, and for k rotations, with k ∈ N, that need to be performed at the boundary. Rows: Algorithm. Columns (input circuit, compiled depth): 1 cnot; n/2 parallel cnots; k parallel rotations. (Table entries not recovered.)
The outline of the paper is as follows. In Section 2, we construct key higher-level components from the basic surface code operations in Figure 1 including simple long-range operations. These long-range operations allow us to perform many parallel cnot operations given vertex-disjoint and edge-disjoint paths that connect the data qubits in Section 3. Because of its importance to the algorithms, there we also compare the state of the art graph algorithms for finding vertex-disjoint or edge-disjoint sets of paths and analyze their relation to our algorithms. We complete our gate set by giving an algorithm for efficient remote rotations using magic states at the boundary in Section 4. Putting parallel long-range cnot and remote rotations together, we construct our circuit compilation algorithm, EDPC, in Section 5. Finally, we compare the performance of EDPC to prior surface code compilation work in Section 6, note its connections to network coding, and give numerical results comparing the space-time performance with a swap-based compilation algorithm.

Key circuit components from surface code operations
Recall that our goal in this work is to develop an efficient classical compilation algorithm which re-expresses a quantum algorithm into one that uses the elementary operations of the surface code with a low logical space-time cost. In Appendix A we give an overview of the surface code and justify the resource costs of the elementary operations shown in Figure 1. The initial quantum algorithm is assumed to be expressed as a circuit diagram involving preparations and measurements of individual qubits in the computational basis, controlled-not (cnot), Pauli-X, -Y, and -Z, Hadamard (H), Phase (S) and T gates. In this section we build and calculate the cost of some key circuit components from the elementary surface code operations in Figure 1. The contents of this section are reproductions or straightforward extensions of previously-known circuits.

Single-qubit operations
Some of the operations of the input circuit can be implemented directly with elementary surface code operations, namely the preparation and measurement of individual qubits in the measurement basis, and the Hadamard gate (provided three neighboring ancilla patches are available, see Figure 1). Pauli operations do not need to be implemented at all since they can be commuted through Clifford gates and arbitrary Pauli gates [Kni05] and can therefore be tracked classically and merged with the final measurements. For this reason, while we occasionally explicitly provide the Pauli corrections where instructive, we often show equivalence of two circuits only up to Pauli corrections. The remaining single-qubit operations in the input circuit, namely the S and T gates, can be implemented using magic states and are addressed in Section 4.

Local cnot and swap gates
An important circuit component is the cnot gate, which can be implemented as shown in Figure 3a [ZBL08]. The qubits involved in this example are stored in adjacent patches, i.e., it is local. Another useful operation is a swap of a pair of qubits stored in nearby patches. The surface code's move operation shown in Figure 1 gives a straightforward way to implement this as shown in Figure 3b. With these implementations, the cnot requires one ancilla patch, while swap requires two. Both are depth 2.
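The measurement-based cnot can be checked numerically with a small state-vector simulation. The sketch below is illustrative only: it assumes the sequence "prepare ancilla in |+>, measure ZZ on control and ancilla, measure XX on ancilla and target, destructively measure the ancilla in Z", and it assumes one consistent Pauli-correction rule (Z on the control if the XX outcome is 1, X on the target if the ZZ and ancilla Z outcomes differ). This rule was worked out for the sketch and is not copied from Figure 3a; all function names are hypothetical.

```python
import math
import random

N = 3  # qubit order in the state vector: 0 = control, 1 = ancilla, 2 = target


def apply_pauli(state, pauli):
    """Apply a product of single-qubit X/Z operators given as {qubit: 'X' or 'Z'}."""
    out = [0j] * len(state)
    for index, amp in enumerate(state):
        new_index, sign = index, 1
        for q, p in pauli.items():
            mask = 1 << (N - 1 - q)
            if p == "X":
                new_index ^= mask          # X flips the bit of qubit q
            elif index & mask:
                sign = -sign               # Z gives -1 on |1>
        out[new_index] += sign * amp
    return out


def measure(state, pauli, rng):
    """Projectively measure a Pauli product; outcome 0 means eigenvalue +1."""
    flipped = apply_pauli(state, pauli)
    p_plus = sum(abs((a + b) / 2) ** 2 for a, b in zip(state, flipped))
    outcome = 0 if rng.random() < p_plus else 1
    sign = 1 if outcome == 0 else -1
    proj = [(a + sign * b) / 2 for a, b in zip(state, flipped)]
    norm = math.sqrt(sum(abs(a) ** 2 for a in proj))
    return outcome, [a / norm for a in proj]


def cnot_by_measurement(psi, rng):
    """psi: 4 amplitudes indexed by 2*control + target. Returns the output amplitudes."""
    s = 1 / math.sqrt(2)
    state = [0j] * 8
    for c in (0, 1):
        for a in (0, 1):                   # ancilla starts in |+>
            for t in (0, 1):
                state[4 * c + 2 * a + t] = psi[2 * c + t] * s
    m_zz, state = measure(state, {0: "Z", 1: "Z"}, rng)
    m_xx, state = measure(state, {1: "X", 2: "X"}, rng)
    m_z, state = measure(state, {1: "Z"}, rng)   # destructive ancilla readout
    if m_xx:                               # assumed correction rule (see lead-in)
        state = apply_pauli(state, {0: "Z"})
    if m_zz ^ m_z:
        state = apply_pauli(state, {2: "X"})
    return [state[4 * c + 2 * m_z + t] for c in (0, 1) for t in (0, 1)]


def overlap_with_cnot(psi, out):
    """|<cnot psi | out>|, which is 1 when the two agree up to a global phase."""
    expected = [psi[0], psi[1], psi[3], psi[2]]
    return abs(sum(e.conjugate() * o for e, o in zip(expected, out)))


rng = random.Random(1)
raw = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(4)]
norm = math.sqrt(sum(abs(x) ** 2 for x in raw))
psi = [x / norm for x in raw]
print(overlap_with_cnot(psi, cnot_by_measurement(psi, rng)))
```

The two joint measurements are the only depth-1 operations in the sequence, which is consistent with the stated depth of 2; the preparation and destructive measurement of the ancilla cost no logical time steps in Figure 1.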

Long-range cnot using swap gates
Typical input circuits for surface code compilation will involve cnot operations on pairs of qubits that are far apart after layout. A very intuitive approach to apply a long-range cnot($q_1$, $q_2$) gate is shown in Figure 4. This involves making use of swap gates to first move the qubits $q_1$ and $q_2$ so that they are near one another, and then using the local cnot gate in Figure 3a. Let $P = v_1 v_2 \dots v_k$ be the path between them, where $v_1 = q_1$ and $v_k = q_2$. As each swap has depth 2, we get a circuit of depth $2\lceil \frac{k-1}{2} \rceil$ since we can perform swaps on either end simultaneously. Afterwards, the two qubits are adjacent and we simply perform a cnot in depth 2.
Figure 3: A cnot gate can be implemented in depth 2 using ZZ and XX joint measurements with a $|+\rangle$ ancilla state, followed by classically controlled Pauli corrections. The swap gate can be implemented using four move operations and two ancillas in depth 2.
Figure 4: A non-local cnot can be implemented using swaps which takes depth $2\lceil \frac{k-1}{2} \rceil$ using a zig-zag of ancilla patches along the path $P$ of length $k$. The figure shows the case when k is odd and swaps depth is 2(k − 1). The patches on the path can store other logical information, which will simply be moved during the swap gates. The patches adjacent to the path are ancillas which are used to implement the swap gates.
A lower bound on the depth it takes to perform a long-range cnot gate using swaps is proportional to the length of the shortest q 1 -q 2 path. To move a qubit k patches using swaps takes depth exactly 2k. Therefore, to move control and target to the middle of the shortest path connecting them, it must take time proportional to at least half the length of the path.

Long-range cnot using a Bell pair
A circuit component that we make extensive use of in this paper is the long-range cnot using a Bell pair [LO17]. This allows us to apply cnots in depth 2 between any pair of qubits (provided there is a path of ancilla qubits which connects them).
To understand the construction, we first show in Figure 5a how to prepare a longer-range Bell pair from two Bell pairs. By iterating this construction one can form a circuit to prepare a long-range Bell pair at the ends of any path of adjacent ancilla patches in depth 2. Next, we show in Figure 5b how to implement a cnot operation between qubits stored in patches neighboring a pair of patches storing a Bell pair. Putting these together, using a path of ancilla patches between a pair of qubits, a long-range cnot can be implemented in depth 2 in a two-step circuit shown in Figure 5c and Figure 5d respectively. This approach can be used to implement the cnot with a depth-2 circuit using any path from the control to the target qubit which starts with a vertical edge and ends with a horizontal edge. There is also flexibility in the precise arrangement of the Bell pairs and Bell measurements along the path using the circuits in Appendix C.
Note that here we have focused on implementing a long-range cnot by constructing and consuming a Bell pair. However a similar strategy (of first preparing a long-range Bell pair in the patches at the ends of a path of ancillas) can be used to implement other long-range operations such as teleportation.

Parallel long-range cnots using Bell pairs
Here, we generalize the use of Bell pairs from the setting of compiling an individual non-local cnot gate into surface code operations to the setting in which a set of parallel non-local cnot gates are compiled. In Figure 1 and the circuit components in Section 2, ancilla qubits are used to perform some operations on data qubits. To consider the compilation on large sets of qubits, we must specify the location of data and ancilla qubits: here we assume a 1 data to 3 ancilla qubit ratio, as illustrated in Figure 7.
In Section 3.1 we discuss some relevant background on sets of vertex-disjoint paths (VDP) and sets of edge-disjoint paths (EDP) in graphs. Then in Section 3.2 we define the VDP subroutine and the EDP subroutine that apply parallel cnot gates at the ends of a particular type of VDP or EDP set. In Section 3.3, we show how to use the EDP subroutine to compile more general cnot circuits and prove bounds on the performance of this approach.

Vertex-disjoint paths (VDP) and edge-disjoint paths (EDP)
In Section 2.4 we saw that a long-range cnot could be implemented with the use of a Bell pair produced with a path of ancilla qubits connecting the control and target of the cnot. A barrier to implementing multiple cnots simultaneously can arise when an ancilla resides in the paths associated with multiple different cnots. This motivates us to review some relevant theoretical background concerning sets of paths on graphs. Given a graph $G$, a set of paths $\mathcal{P}$ is said to be a vertex-disjoint-path (VDP) set if no pair of paths in $\mathcal{P}$ share a vertex, and an edge-disjoint-path (EDP) set if no pair of paths in $\mathcal{P}$ share an edge. Note that a set of vertex-disjoint paths is also edge-disjoint. Further consider a set of terminal pairs $T = \{(s_1, t_1), \dots, (s_k, t_k)\}$ for terminals $s_i, t_i \in V(G)$, the vertices of $G$, and $i \in [k]$. We then say that a set of paths $\mathcal{P}$ is a VDP set for $T$ (respectively an EDP set for $T$) if $\mathcal{P}$ is a VDP set (respectively an EDP set), and each path in $\mathcal{P}$ connects a distinct pair in $T$. These path sets do not necessarily connect all pairs in $T$. In what follows, we pay special attention to the square grid graph (see Figure 9a). The grid graph is relevant for qubits in the surface code as shown in Figure 1, where the vertices correspond to code patches and edges connect vertices associated with adjacent patches$^1$.
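The two definitions can be stated compactly in code. The following sketch is an illustration of the definitions only; paths are assumed to be lists of grid vertices given as (row, column) tuples, and the function names are hypothetical.

```python
from itertools import combinations


def edges_of(path):
    """Undirected edges traversed by a path given as a list of vertices."""
    return {frozenset(pair) for pair in zip(path, path[1:])}


def is_vdp_set(paths):
    """True if no two paths share a vertex."""
    return all(set(p).isdisjoint(q) for p, q in combinations(paths, 2))


def is_edp_set(paths):
    """True if no two paths share an edge."""
    return all(edges_of(p).isdisjoint(edges_of(q)) for p, q in combinations(paths, 2))


# Two paths that cross at the vertex (1, 1) of a 3 x 3 grid.
horizontal = [(1, 0), (1, 1), (1, 2)]
vertical = [(0, 1), (1, 1), (2, 1)]
print(is_edp_set([horizontal, vertical]), is_vdp_set([horizontal, vertical]))
```

The crossing pair is an EDP set but not a VDP set, which is the situation the EDP subroutine later has to handle.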
The problems of finding a maximum (cardinality) VDP set for $T$ or a maximum EDP set for $T$ have been well-studied and there are known efficient algorithms capable of finding approximate solutions to each. Unfortunately, on grids it is particularly hard to approximate the maximum VDP set. In particular, for $N := |V(G)|$ there exist terminal sets for which no efficient algorithm can find an approximate solution to within a $2^{O(\log^{1-\epsilon} N)}$ factor of the maximum set size for any $\epsilon > 0$, unless $\mathrm{NP} \subseteq \mathrm{RTIME}(N^{\mathrm{poly} \log N})$ [CKN18]. However, efficient algorithms are available if one is willing to accept a looser approximation to the optimal solution. For example, a simple greedy algorithm is an $O(\sqrt{N})$-approximation algorithm for finding the maximum VDP set [KS04; KT06a], i.e., it produces a VDP set to within an $O(\sqrt{N})$ multiplicative factor of the optimal solution for any graph, not just the grid. For grids, the best efficient algorithm that is known is an $\tilde{O}(N^{1/4})$-approximation algorithm [CK15], where $\tilde{O}(\cdot)$ hides logarithmic factors of $O(\cdot)$.
The situation is better for approximation algorithms of the maximum EDP set: There is a $\Theta(\sqrt{N})$-approximation algorithm [CKS06] for any graph, and on grids Aumann and Rabani [AR95] showed an $O(\log N)$-approximation algorithm that was later improved to an $O(1)$-approximation algorithm [KT95; Kle96]. In practice, these algorithms can be technical to implement and can have large constant prefactors in their solutions that can be prohibitive for the instance sizes that we consider. A simple greedy algorithm forms an $O(\sqrt{N})$-approximation algorithm [KS04] for finding a maximum EDP set on the two-dimensional grid and does not suffer from the constant prefactors of the asymptotically superior alternatives. The runtime of this greedy algorithm is dominated by finding shortest paths for each terminal pair, giving an $O(|T| N \log N)$ runtime upper bound by Dijkstra's algorithm$^2$.
It is informative to consider the comparative size of the maximum EDP and VDP sets for the same terminal set $T$. Since any VDP set is also an EDP set, the size of the maximum VDP set for $T$ cannot be larger than the maximum EDP set for $T$. Moreover, one can construct some cases of $T$ on the grid in which the maximum EDP set is a factor $\sqrt{N}$ larger than the maximum VDP set [Kle96]. For example, consider the set of terminal pairs $T = \{((i, 1), (L, i)) \mid i \in [L]\}$ of an $L \times L$ grid graph, where vertex $(i, j)$ denotes the vertex in row $i$ and column $j$. All terminals can be connected by edge-disjoint paths but the maximum VDP set is of size one.
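The example can be made concrete with one explicit routing. The sketch below assumes 1-indexed (row, column) vertices and routes pair $i$ along row $i$ to column $i$ and then down column $i$ to row $L$; this particular routing and the function names are choices made for illustration. The check shows that these paths are pairwise edge-disjoint while every two of them share a vertex. It does not by itself prove that every other routing also intersects.

```python
from itertools import combinations


def staircase_path(i, L):
    """Path from (i, 1) to (L, i): along row i to column i, then down column i."""
    along_row = [(i, c) for c in range(1, i + 1)]
    down_column = [(r, i) for r in range(i + 1, L + 1)]
    return along_row + down_column


def edges_of(path):
    return {frozenset(pair) for pair in zip(path, path[1:])}


def pairwise_edge_disjoint(paths):
    return all(edges_of(p).isdisjoint(edges_of(q)) for p, q in combinations(paths, 2))


def pairwise_share_vertex(paths):
    return all(set(p) & set(q) for p, q in combinations(paths, 2))


L = 5
paths = [staircase_path(i, L) for i in range(1, L + 1)]
print(pairwise_edge_disjoint(paths), pairwise_share_vertex(paths))
```

For $i < j$ the paths of pairs $i$ and $j$ meet at vertex $(j, i)$, where one passes vertically and the other horizontally, so no edge is shared.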
In Section 3.2, we show that both VDP and EDP sets for $T$ can be used to form constant-depth compilation subroutines for disjoint cnot circuits. Ultimately, as will become clear in Section 3.2, each path in the EDP or VDP sets for $T$ allows us to implement one more cnot gate in parallel by a compilation subroutine. In this work, we focus on EDPs rather than VDPs for two main reasons. Firstly, as mentioned above, better approximation algorithms exist for finding maximum EDP sets than for finding maximum VDP sets on the grid, although in practice we make use of the greedy $O(\sqrt{N})$-approximation algorithm for finding maximum EDP sets in this work. Secondly, as was also mentioned above, the maximum EDP set is at least as large as the maximum VDP set.
An important open problem that could ultimately influence the performance of the surface code compilation algorithm we present in this work is whether an alternative approximation algorithm for finding maximum EDP sets can be used that performs better in practical instances.

Long-range cnot subroutines using VDP and EDP
Here we present one of our main technical contributions, namely a description of how to implement a set of long-range cnots at the end of VDP and EDP sets using surface code operations. This is central to our overall surface code compilation algorithm presented in Section 5.
Consider the $L \times L$ square grid graph $G$ (see Figure 9a). Here, vertices correspond to qubits stored in surface code patches, and edges connect qubits on adjacent patches (see Figure 1). We color the vertices of $G$ with three colors: black, grey, and white (see Figure 7). All vertices with both even row and even column index are colored black and correspond to data qubits (where data qubits correspond to qubits in the input circuit). The vertices (corresponding to ancilla qubits) with both odd row and odd column index are colored white, and all remaining vertices are colored grey. This gives us a 1 : 3 data qubit to ancilla qubit ratio. We set $n$ to equal the number of black vertices, i.e., the number of data qubits. Due to the designation of some vertices as data qubits and others as ancilla qubits in our layout, and due to the asymmetry of two-qubit operations along horizontal and vertical edges in Figure 1, we add some restrictions to the paths we consider. We define an operator path to be a path $P = v_1 v_2 \dots v_k$, for $k \in \mathbb{N}$, such that $v_1$ and $v_k$ correspond to data qubits and its interior $v_2 \dots v_{k-1}$ are all ancilla qubits. Moreover, $v_1$ to $v_2$ must be a vertical edge, and $v_{k-1}$ to $v_k$ must be a horizontal edge. Then an operator VDP (resp. EDP) set is a set of vertex-disjoint (resp. edge-disjoint) operator paths. In addition, we require that the ends of the paths in the operator EDP set do not overlap. With the coloring assignments of the grid graph $G$, it is easy to see that the first and last vertex of an operator path are colored black. In what follows, we show how we can implement cnots between the data qubits at the ends of the paths in an operator VDP (EDP) set in constant depth.
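The coloring and the operator-path conditions translate directly into a small validity check. In the sketch below, vertices are assumed to be (row, column) tuples, an edge is taken to be vertical when only the row index changes, and the function names are hypothetical.

```python
def color(v):
    """Black: both indices even (data). White: both odd. Grey: everything else."""
    row, col = v
    if row % 2 == 0 and col % 2 == 0:
        return "black"
    if row % 2 == 1 and col % 2 == 1:
        return "white"
    return "grey"


def is_operator_path(path):
    """Check the operator-path conditions for a list of (row, col) vertices."""
    if len(path) < 2:
        return False
    steps = [(b[0] - a[0], b[1] - a[1]) for a, b in zip(path, path[1:])]
    if any(abs(dr) + abs(dc) != 1 for dr, dc in steps):
        return False                                   # not a path in the grid
    ends_are_data = color(path[0]) == "black" and color(path[-1]) == "black"
    interior_is_ancilla = all(color(v) != "black" for v in path[1:-1])
    first_edge_vertical = steps[0][1] == 0             # column unchanged
    last_edge_horizontal = steps[-1][0] == 0           # row unchanged
    return (ends_are_data and interior_is_ancilla
            and first_edge_vertical and last_edge_horizontal)


path = [(0, 0), (1, 0), (1, 1), (2, 1), (2, 2)]
print(is_operator_path(path), is_operator_path(path[::-1]))
```

Reversing a valid operator path generally invalidates it, since the vertical first edge becomes the last edge; this reflects the asymmetry between control and target.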
First consider an operator VDP set $\mathcal{P}$. It is straightforward to see that we can simultaneously apply long-range cnots along each $P \in \mathcal{P}$ as in Figure 5 in depth 2. We call this the vertex-disjoint paths subroutine (VDP subroutine). Now consider an operator EDP set $\mathcal{P}$. An EDP set can have intersecting paths, and the ancilla qubits at intersections appear in multiple paths, preventing us from simultaneously producing Bell pairs at their ends. We circumvent this by producing Bell pairs across a path in two stages by splitting the path into segments; see Figure 6. We will show that $\mathcal{P}$ can be fragmented into two VDP sets $\mathcal{P}_1$ and $\mathcal{P}_2$ that, together, form $\mathcal{P}$. More precisely, each path $P \in \mathcal{P}$ can be built by composing paths contained in $\mathcal{P}_1$ and $\mathcal{P}_2$ such that each path in either $\mathcal{P}_1$ or $\mathcal{P}_2$ appears in precisely one path in $\mathcal{P}$. We say that the paths in $\mathcal{P}_1$ and $\mathcal{P}_2$ are segments of paths in $\mathcal{P}$. This forms the basis of the edge-disjoint paths subroutine (EDP subroutine), which is presented in Algorithm 3.1 and illustrated with an example in Figure 7.
We show the following Lemma, which restricts the adjacency of crossing vertices. As will become clear later, adjacent crossing vertices impose systems of constraints on fragmenting $\mathcal{P}$, and their restricted adjacency in any operator EDP set ensures that a fragmentation into two VDP sets always exists.
Lemma 3.1. Given an operator EDP set $\mathcal{P}$, a crossing vertex is a vertex contained in more than one path in $\mathcal{P}$. Let the set of crossing vertices be $V_c$, then the induced subgraph $G[V_c]$ contains only three kinds of connected components:
1. Isolated vertices.
2. A horizontal path, where each vertex $(i, j)$ in the connected component can only be adjacent to $(i - 1, j)$ and $(i + 1, j)$.
3. A vertical path, where each vertex $(i, j)$ in the connected component can only be adjacent to $(i, j - 1)$ and $(i, j + 1)$.

Proof. We consider all possible colors of a vertex $(i, j)$. Black vertices cannot be crossing vertices by definition of an operator EDP set so cannot be contained in $V_c$. It is then easy to see that white vertices in $V_c$ satisfy the Lemma. Therefore, the only relevant case is when $(i, j)$ is a grey vertex. The vertices $(i + 1, j)$ and $(i, j + 1)$ are white and $(i + 1, j + 1)$ is black. We show that these white vertices cannot both be crossing vertices. Suppose that they are, then both edges between the white vertices and the black vertex, $((i + 1, j), (i + 1, j + 1))$ and $((i, j + 1), (i + 1, j + 1))$, are in $\mathcal{P}$. This is a contradiction with the fact that the interior of operator EDP paths cannot contain a black vertex so it must be at the end of two paths, but an operator EDP set cannot contain two paths ending at the same vertex. By the same argument applied to the other white neighbors of $(i, j)$ we see that only $(i - 1, j)$ and $(i + 1, j)$ or $(i, j - 1)$ and $(i, j + 1)$ can both be crossing vertices, and the claim follows.

Figure 7: The EDP subroutine implements a set of parallel cnots connected by an operator EDP set. We assume a qubit ratio of 1 to 3 of data (black) to ancilla (grey and white). (a) The input to the EDP subroutine is a set of cnots and an associated EDP set. (b) We fragment the EDP set into two VDP sets consisting of segments of the original paths, and implement the compiled circuit over two depth-2 stages, one for each of these sets. (c) During the first stage we prepare a Bell pair between the ends of the segments in the first VDP set. (d) During the second stage we perform joint Bell measurements between the ends of segments in the second VDP set, producing long-range Bell pairs on ancillas adjacent to the control and target of each cnot. Then, long-range cnots can easily be applied by using the long-range Bell pairs (Section 2.4). See Figures 6 and 8 for further details of the long-range operations used here.

Figure 8: Long-range operations used in the EDP subroutine of Figure 7. Recovered panel titles: (a) Long-range cnot in two stages; (d) Long-range teleport with ZZ measurement; (e) Long-range teleport with XX measurement. For each segment that is scheduled in phase 1, we use (b) and (c); and for each subpath that is scheduled in phase 2, we use (d) and (e). In (d), variables $x_0$ and $z_0$ are equal to the total parity of all long-range Bell measurements applied during stage 2 on the cnot path. Each of these operations takes depth 2. (d) and (e) share the variables $a$ and $c$.

Algorithm 3.1: EDP subroutine: to apply cnots to the data qubits at the endpoints of a set of edge-disjoint paths $\mathcal{P}$, where the interior of each path is supported on ancilla qubits. The depth is at most 4.
    Input: An operator EDP set $\mathcal{P}$
    $\mathcal{P}_1, \mathcal{P}_2$ ← fragment $\mathcal{P}$ in two VDP sets of segments   // Theorem 3.2
    for segment $P \in \mathcal{P}_1$:
        if $P$ connects two data qubits then
            execute long-range cnot along $P$
        else
            execute phase 1 operation along $P$ (Figure 6a, or 8b, or 8c)
    for segment $P \in \mathcal{P}_2$:
        if $P$ connects two data qubits then
            execute long-range cnot along $P$
        else
            execute phase 2 operation along $P$ (Figure 6b, or 8d, or 8e)
We now prove that P can be fragmented.
Theorem 3.2. We can fragment an operator EDP set $\mathcal{P}$ to produce vertex-disjoint sets of segments $\mathcal{P}_1$ and $\mathcal{P}_2$. If $\mathcal{P}$ is vertex-disjoint, then $\mathcal{P}_1 = \mathcal{P}$ and $\mathcal{P}_2 = \emptyset$.
Proof. We assign edges for inclusion in segments in $\mathcal{P}_1$ or $\mathcal{P}_2$ by an edge labelling $l : E(G) \to \{1, 2\}$. Given a labelling of all edges $e$ in the paths of $\mathcal{P}$, we can assign edges with $l(e) = b$ to segments in $\mathcal{P}_b$. Therefore, given a labelling of all edges in paths in $\mathcal{P}$, it is easy to construct $\mathcal{P}_1$ and $\mathcal{P}_2$. We now label all edges in the paths in $\mathcal{P}$ and prove that their labelling guarantees the vertex-disjointness property of $\mathcal{P}_1$ and $\mathcal{P}_2$.
We constrain the labelling around every crossing vertex $v$ so that the VDP property is satisfied. Clearly, $v$ is contained in the interior of exactly two paths, $P_1$ and $P_2$. Let $v$ be contained in edges $e_1$ and $e_1'$ of $P_1$, and edges $e_2$ and $e_2'$ of $P_2$, then we impose the constraints
$$l(e_1) = l(e_1'), \tag{2}$$
$$l(e_2) = l(e_2'), \tag{3}$$
$$l(e_1) \neq l(e_2), \tag{4}$$
guaranteeing the vertex-disjointness of segments at $v$ since a segment of $P_1$ must span both $e_1$ and $e_1'$, and a segment of $P_2$ must span both $e_2$ and $e_2'$ with a different label. We show there always exists a feasible solution given these constraints. If we consider the graph $G[V_c]$ induced by crossing vertices $V_c$, then we see that every connected component in $G[V_c]$ gives a system of constraints. The adjacency of $G[V_c]$, by Lemma 3.1, is such that each system has one degree of freedom, which we decide arbitrarily.
Finally, for every vertex-disjoint path $P \in \mathcal{P}$, assign $l(e) = 1$ to all edges $e$ in $P$. All remaining edges can be labeled arbitrarily.
The depth of a cnot circuit produced by the EDP subroutine for an operator EDP set $\mathcal{P}$ is at most 4. If $\mathcal{P}$ happens to be vertex-disjoint, then the depth is 2 since all paths are assigned to phase 1 by Theorem 3.2.
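The labelling argument can be prototyped as a two-coloring of path edges. The sketch below is one possible realization, not the procedure used in the paper's implementation: it collects "equal" and "different" constraints at every crossing vertex, propagates labels by breadth-first search, gives every unconstrained edge the label 1, and cuts each path into maximal runs of equal labels. It raises an error if the constraints are inconsistent rather than relying on Lemma 3.1. The function names are hypothetical, and the toy paths in the usage lines are plain grid paths, not operator paths.

```python
from collections import defaultdict, deque


def path_edges(path):
    return [frozenset(pair) for pair in zip(path, path[1:])]


def label_edges(paths):
    """Return {edge: 1 or 2} obeying the constraints at every crossing vertex."""
    visits = defaultdict(list)              # vertex -> [(path index, position)]
    for p, path in enumerate(paths):
        for pos, v in enumerate(path):
            visits[v].append((p, pos))
    constraints = defaultdict(list)         # edge -> [(other edge, must_differ)]

    def add(e, f, differ):
        constraints[e].append((f, differ))
        constraints[f].append((e, differ))

    for v, occurrences in visits.items():
        if len(occurrences) < 2:
            continue
        incoming = []
        for p, pos in occurrences:
            path = paths[p]
            if pos in (0, len(path) - 1):
                raise ValueError("paths may only cross at interior vertices")
            e_in = frozenset((path[pos - 1], v))
            e_out = frozenset((v, path[pos + 1]))
            add(e_in, e_out, False)          # same path keeps its label through v
            incoming.append(e_in)
        add(incoming[0], incoming[1], True)  # the two crossing paths must differ
    labels = {}
    for path in paths:
        for start in path_edges(path):
            if start in labels:
                continue
            labels[start] = 1                # free choice for this component
            queue = deque([start])
            while queue:
                e = queue.popleft()
                for f, differ in constraints[e]:
                    wanted = 3 - labels[e] if differ else labels[e]
                    if f not in labels:
                        labels[f] = wanted
                        queue.append(f)
                    elif labels[f] != wanted:
                        raise ValueError("no consistent labelling exists")
    return labels


def fragment(paths):
    """Split paths into two lists of segments according to the edge labels."""
    labels = label_edges(paths)
    sets = {1: [], 2: []}
    for path in paths:
        edges = path_edges(path)
        start = 0
        for i in range(1, len(edges) + 1):
            if i == len(edges) or labels[edges[i]] != labels[edges[start]]:
                sets[labels[edges[start]]].append(path[start:i + 1])
                start = i
    return sets[1], sets[2]


horizontal = [(1, 0), (1, 1), (1, 2)]
vertical = [(0, 1), (1, 1), (2, 1)]
print(fragment([horizontal, vertical]))
```

A path that crosses nothing receives label 1 on all of its edges, matching the statement that a vertex-disjoint set is assigned entirely to the first phase.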

Compiling parallel cnot circuits with the EDP subroutine
In this section we consider how to compile input parallel cnot circuits using the EDP subroutine. We define the terminal pairs $T \subseteq V(G) \times V(G)$ to be the pairs of control and target qubits for each cnot gate in the parallel cnot circuit.
To use the EDP subroutine, we need to find operator EDP sets $\mathcal{P}_1, \dots, \mathcal{P}_k$ that connect all terminal pairs in $T$. We will refer to any such set $\{\mathcal{P}_1, \dots, \mathcal{P}_k\}$ as a $T$-operator set. The depth of the compiled implementation is minimized when the size $k$ of the $T$-operator set is minimized.
There are reasons to believe that the compilation strategy for parallel cnot circuits formed by finding a minimal T -operator set and applying the EDP subroutine should produce low-depth output circuits. For sparse input circuits, i.e. those with a small number of cnots, one can expect a small T -operator set to exist, giving a low depth output. On the other hand, we now prove that there are dense cnot circuits for which the EDP subroutine with a minimal size T -operator set produces a compiled circuit with optimal depth (up to a constant multiplicative factor).
Theorem 3.3. Let a parallel input cnot circuit with corresponding terminal pairs T be given, and let the n qubits of the input circuit be embedded in a grid among 3n ancilla qubits according to the layout in Figure 7. For simplicity, we assume n is both even and the square of an integer. We can find a T -operator set of size at most 2 √ n − 1 in polynomial time.
Proof. For each cnot we construct an operator path and argue that all such paths can be grouped into O( √ n) disjoint EDP sets. For simplicity, in the following, we specify paths by a sequence of key vertices, with each consecutive pair of key vertices connected by the shortest path (which is a horizontal or a vertical line). We now construct an operator path for each cnot, where the associated control vertex is v = (v x , v y ) ∈ V (G) and the target vertex is u = (u x , u y ) ∈ V (G). We can always form an operator path to connect u and v given by the following sequence of five key vertices: v, (v x , v y − 1), (u x − 1, v y − 1), (u x − 1, u y ), u. This path consists of one vertical end segment, one horizontal interior segment, one vertical interior segment, and finally a horizontal end segment.
Having assigned a path to each cnot, we now show that any of these operator paths can share an edge with at most 2( √ n − 1) of the other paths. Since the operator paths have distinct endpoints, two different paths cannot share an edge on either of their end segments v, (v x , v y − 1) and on (u x − 1, u y ), u. Therefore pairs of these operator paths can only share an edge on their interior segments.
The horizontal interior segment of the operator path from v to u can share an edge with at most √ n − 1 other paths. To see this, consider an operator path that shares at least one horizontal edge with the operator path from v to u. Explicitly, that means the horizontal interior segments of both paths lie in the same row, so the control of the other cnot has the same y-coordinate v y . Since the terminals are unique, there can only be √ n − 1 other cnots with the control sharing the v y coordinate. An analogous argument applies for vertical segments, such that the operator path from v to u can share an edge with at most 2( √ n − 1) other operator paths.
Let us construct a graph H where each vertex represents an operator path as constructed above. We connect two vertices in H if the associated paths share an edge. Every vertex in H has degree at most 2( √ n − 1), therefore, H is (2 √ n − 1)-colorable using the (polynomial time) greedy coloring algorithm.
We construct a T -operator set of size 2 √ n − 1 by grouping the paths associated with each color in a set of edge-disjoint paths.
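The construction in this proof can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: vertices are integer coordinate pairs, the path for control v and target u runs through the key vertices v, (v_x, v_y − 1), (u_x − 1, v_y − 1), (u_x − 1, u_y), u, and the function names are hypothetical. The layout of Figure 7 is not reproduced; the usage example simply places data qubits at odd coordinates so that the interior segments run over ancilla rows and columns.

```python
from itertools import combinations

def operator_path_edges(v, u):
    """Edges of the path v -> (vx, vy-1) -> (ux-1, vy-1) -> (ux-1, uy) -> u."""
    (vx, vy), (ux, uy) = v, u
    keys = [v, (vx, vy - 1), (ux - 1, vy - 1), (ux - 1, uy), u]
    edges = set()
    for (ax, ay), (bx, by) in zip(keys, keys[1:]):
        x, y = ax, ay
        while (x, y) != (bx, by):          # walk the straight line between key vertices
            nx = x + (bx > x) - (bx < x)
            ny = y + (by > y) - (by < y)
            edges.add(frozenset({(x, y), (nx, ny)}))
            x, y = nx, ny
    return edges

def greedy_operator_sets(terminal_pairs):
    """Group the constructed paths into edge-disjoint sets by greedy coloring."""
    paths = [operator_path_edges(v, u) for v, u in terminal_pairs]
    conflicts = {i: set() for i in range(len(paths))}
    for i, j in combinations(range(len(paths)), 2):
        if paths[i] & paths[j]:            # the two paths share an edge
            conflicts[i].add(j)
            conflicts[j].add(i)
    color = {}
    for i in range(len(paths)):            # smallest color not used by a neighbor
        used = {color[j] for j in conflicts[i] if j in color}
        color[i] = next(c for c in range(len(paths)) if c not in used)
    groups = {}
    for i, c in color.items():
        groups.setdefault(c, []).append(terminal_pairs[i])
    return list(groups.values())

# Four data qubits at odd coordinates, two crossing cnots: their paths share
# the edge between (1, 0) and (2, 0), so two edge-disjoint sets are needed.
pairs = [((1, 1), (3, 3)), ((3, 1), (1, 3))]
print(len(greedy_operator_sets(pairs)))
```

Greedy coloring uses at most one more color than the maximum degree of the conflict graph, which is where the bound 2 √ n − 1 comes from.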
We now show a general lower bound on compiling parallel cnot circuits to the surface code architecture. Our strategy will be to consider a parallel cnot circuit with control data qubits in an area with small boundary that generates an amount of entanglement across the boundary proportional to the area for a given initial state. However, each elementary surface code operation is local such that only those operations acting at the boundary can increase the entanglement across it. The depth of any implementation of the cnot circuit is then lower bounded by the entanglement that it generates divided by the boundary size [DBT21;Bap+22].
Theorem 3.4. Consider a surface code architecture of n data qubits embedded in a grid where all ancilla qubits are in the |0⟩ state. For any positive integer k ≤ n/2, there exists a parallel cnot circuit of k cnot gates with associated terminal pairs T that needs depth Ω( √ k) to be implemented on the surface code architecture.
Proof. Consider a cnot circuit with terminal pairs T with control qubits on data vertices in a square region, V L , and target qubits on vertices outside V L . We initialize the 2k data qubits associated with T to a product state |+⟩^{⊗k} |0⟩^{⊗k} , with |+⟩ on control qubits and |0⟩ on target qubits (the remaining data qubits are initialized in an arbitrary product state and ignored). After applying the cnot circuit, we obtain k Bell pairs. Therefore, the (von Neumann) entropy of the reduced state of the data qubits in V L has increased from 0 to k.
Consider a circuit C of depth d that implements the parallel cnot circuit. Any elementary operation of the surface code acting only within V L or within V̄ L := V (G) \ V L or classical communication (together, LOCC) cannot increase the entropy of the state on V L . Moreover, as we show below, each elementary operation that acts both on V L and on V̄ L can increase the entropy by at most a constant 4. We can therefore upper bound the increase in entropy due to C by 4d times the number of vertices adjacent to V L , which is proportional to √ k. To attain the k increase in entropy, we therefore need that d = Ω( √ k).
We now bound the increase in entropy of any elementary operation acting on V L and V̄ L to at most 4. All such elementary operations are built from a single XX or ZZ measurement together with single-qubit operations (Appendix A), and the single-qubit operations cannot increase the entropy. It is possible to implement XX and ZZ measurements acting on V L and V̄ L using two cnots and operations acting only within V L or within V̄ L . The increase in entropy in V L by a cnot operation is bounded by 2 [Ben+03, Lemma 1]. Therefore, XX measurements, ZZ measurements, and indeed any elementary operation of the surface code can increase the entropy by at most 4.
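Written out, the counting argument of this proof reads as follows. The constant $c$ is introduced here only for illustration (it is not named in the text) and stands for the proportionality factor such that at most $c\sqrt{k}$ vertices are adjacent to $V_L$; $\Delta S(V_L)$ denotes the entropy increase on $V_L$.

```latex
k \;=\; \Delta S(V_L) \;\le\; 4\, d \cdot c\sqrt{k}
\quad\Longrightarrow\quad
d \;\ge\; \frac{\sqrt{k}}{4c} \;=\; \Omega\!\left(\sqrt{k}\right).
```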
In practice, it can be difficult to find minimal-size T -operator sets. However, when the minimal size T -operator set is k, in the following theorem we show that a T -operator set {P 1 , . . . , P l } with size at most l = O(k log|T |) can be found by a greedy algorithm that iteratively finds the maximum operator EDP set for remaining terminals in T .
Theorem 3.5. On the grid of n vertices, the greedy algorithm for finding T -operator sets repeats the following two steps, for i = 1, . . . , until there are no more terminal pairs to connect: 1. find a maximum operator EDP set P i , 2. remove all terminal pairs in P i from T .
The resulting set {P 1 , . . . , P k } is a T -operator set, and the greedy algorithm is an O(log|T |)-approximation algorithm for finding minimum-size T -operator sets.
Proof. We base our proof on [AR95]. Assume that the minimum-size T -operator set is {Q 1 , . . . , Q K } for some size K. Then there is an operator EDP set Q i , for i ∈ [K], such that |Q i | ≥ |T |/K. Therefore, the number of unconnected terminal pairs is reduced by at least a factor (1 − 1/K) each iteration and it will require at most O(K log|T |) iterations to connect all terminal pairs [Joh74].
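The iteration count follows from a short calculation. Here $T_i$ denotes the terminal pairs still unconnected after $i$ iterations, a symbol introduced only for this illustration. Restricting $Q_1, \dots, Q_K$ to $T_i$ still gives at most $K$ operator EDP sets that together connect $T_i$, so a maximum operator EDP set connects at least $|T_i|/K$ of the remaining pairs:

```latex
|T_{i+1}| \;\le\; \left(1 - \tfrac{1}{K}\right) |T_i|
\quad\Longrightarrow\quad
|T_i| \;\le\; \left(1 - \tfrac{1}{K}\right)^{i} |T| \;\le\; e^{-i/K}\, |T| .
```

The right-hand side drops below 1 as soon as $i > K \ln |T|$, so all terminal pairs are connected after $O(K \log |T|)$ iterations.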
To make use of Theorem 3.5 we would ideally like to have an algorithm to find maximum operator EDP sets on the grid; however, the efficient algorithms we discussed in Section 3.1 fall short of this in two ways. Firstly, they find EDP sets rather than operator EDP sets, and secondly, they provide approximate maximum sets rather than maximum sets. Fortunately, we find an equivalence between operator EDP sets on the grid and EDP sets on a graph that we call the T -operator graph (see Figure 9b). The T -operator graph is a copy of the grid graph but with all vertices corresponding to control qubits in T only having vertical outgoing edges, and with all vertices corresponding to target qubits in T only having horizontal incoming edges, and all remaining vertices corresponding to data qubits are removed. An EDP set for terminal pairs T on the T -operator graph is an operator EDP set on the grid. It is easy to see that a maximum operator EDP set for T on the grid is equivalent to a maximum EDP set for T on the T -operator graph. Using an approximation algorithm for finding the maximum operator EDP set also still gives approximation guarantees for minimizing the T -operator set, as shown in the following Corollary.
Corollary 3.6. The greedy algorithm for finding minimum T -operator sets, but with a κ-approximation algorithm for finding maximum operator EDP sets, gives an O(κ log|T |)-approximation algorithm for finding minimum T -operator sets.
Proof. We modify the proof of Theorem 3.5 such that in every iteration the κ-approximation algorithm for finding maximum operator EDP sets connects at least a 1/(κK) fraction of the unconnected terminal pairs, reducing their number by at least a factor (1 − 1/(κK)). Therefore we obtain an O(κ log|T |)-approximation algorithm for finding minimum T -operator sets.
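The T -operator graph defined above can be built mechanically from the grid. The following Python sketch is one reading of that definition, not the authors' code: it assumes a rectangular grid with integer coordinates, treats every grid edge as a pair of opposite arcs, and interprets "only having vertical outgoing edges" as "the only arcs at a control vertex are outgoing vertical ones" (and analogously for targets). The function name and arguments are hypothetical.

```python
def t_operator_graph(width, height, data, terminal_pairs):
    """Directed arcs of the T-operator graph on a width x height grid.

    data: set of grid vertices holding data qubits.
    terminal_pairs: list of (control, target) vertices, all distinct.
    """
    controls = {c for c, _ in terminal_pairs}
    targets = {t for _, t in terminal_pairs}
    removed = set(data) - controls - targets      # idle data qubits are deleted
    arcs = set()
    for x in range(width):
        for y in range(height):
            a = (x, y)
            if a in removed or a in targets:      # targets have no outgoing arcs
                continue
            for b in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                bx, by = b
                if not (0 <= bx < width and 0 <= by < height):
                    continue
                if b in removed or b in controls:  # controls have no incoming arcs
                    continue
                vertical = bx == x
                if a in controls and not vertical:
                    continue                      # controls are left vertically only
                if b in targets and vertical:
                    continue                      # targets are entered horizontally only
                arcs.add((a, b))
    return arcs
```

Any edge-disjoint path routine run on these arcs then only produces paths that leave controls vertically and enter targets horizontally.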
The equivalence between operator EDP sets on the grid and EDP sets on the T -operator graph motivates us to seek an efficient algorithm to find approximate maximum EDP sets on the T -operator graph as a key part of our EDPC algorithm. The algorithms we discussed in Section 3.1 come close to doing this, but some of them are intended for finding approximate maximum EDP sets on the grid rather than on the T -operator graph and even if they are adapted, the guarantees of the size of the approximate minimum EDP sets they produce

Algorithm 3.2: Bounded T -operator set algorithm: An approximation algorithm for minimizing the T -operator set size that combines the theoretical guarantees from Theorem 3.3 with pragmatic performance using the greedy algorithm of Theorem 3.5.
Input : T terminal pairs
Q 1 ← the T -operator set given by Theorem 3.3 for T
Q 2 ← ∅
while T ≠ ∅ : // Greedy apx.
    …

In EDPC, we instead combine the theoretical worst-case bounds of Theorem 3.3 with the pragmatic performance of a greedy approach, which does not have a large constant overhead, in Algorithm 3.2. By Theorem 3.4 this gives us asymptotically tight performance in the worst case. The runtime of this algorithm is dominated by O(|T |) iterations of approximately maximizing the operator EDP set in time O(|T |n log n). We leave it as an open question to find better approximation algorithms for finding maximum operator EDP sets that give improved performance outside the worst case and that may also improve the runtime since fewer iterations over T are required.

Remote rotations with magic states
Thus far we have discussed the surface code compilation of all the input circuit operations listed in Section 1 except for the single-qubit rotation gates S = Z(π/4) and T = Z(π/8). In this section we design a subroutine for the compilation of parallel rotation circuits. The S and T gates can be implemented by using specially prepared magic states |S⟩ and |T⟩, respectively. Magic states can be prepared using a highly-optimized process known as magic state distillation [Kni04], which distills many faulty magic states that are easy to prepare into fewer robust states. Still, producing both |S⟩ and |T⟩ involves considerable overhead. The |S⟩ state is used to apply the S-gate in a 'catalytic' fashion, whereby the state |S⟩ is returned afterwards. On the other hand, the state |T⟩ is consumed to apply the T -gate. The reason for this distinction is rooted in the fact that the S-gate is Clifford but the T -gate is non-Clifford.

Figure 10: (a) Long-range cnots for diagonal gates. (b) Remote Z(θ) and X(θ). We assume the capability of performing S and T gates at the boundary qubits (red) where it is easy for us to supply the requisite S and T magic states. We can then execute S or T gates in the Z or X basis for our circuit by using long range cnots and the circuits in Figure 10b. For example, to execute S or T on qubits G3, E7 and HTH on G7, we apply long-range cnots between pairs (G3,G1), (E7,A7), (A5,G7) and then execute S or T on G1, A7, HTH on A5. We can continue applying other Clifford gates to qubits G3, E7, and G7 right after performing the long-range cnot, without waiting for the Z correction, since we can propagate the correction through Clifford operations.
In this work, we do not address the mechanism by which magic states are produced, but instead assume that these states are provided at specific locations where they can be used to implement gates. More specifically, we assume rotation gates S and T (and also Clifford variations of these such as X(π/8) = T x and X(π/4) = S x ) can be applied as a resource on specific ancilla qubits B ⊆ V (G) at the boundary of a large array of logical qubits (Figure 10a). This will allow sufficient space outside the boundary where highly-optimized magic state distillation and synthesis circuits can be implemented. Because a large number of magic states are used in the computation, we consider having magic state distillation adjacent to and concurrent with the computation we are concerned with in this paper to be a reasonable allocation of resources.
We need a technique to apply remote rotations to data qubits which can be far from the boundary making use of the rotations that can be performed at the boundary. We make use of the property that any Z rotation (including T or S) has the same action when applied to either qubit in the state α|00⟩ + β|11⟩. In particular, these two qubits need not be close to one another, so we can apply Z rotations remotely. A similar notion holds for X-rotations (including T x = HTH or S x = HSH) and α|++⟩ + β|−−⟩. Given a qubit q that needs to perform a Z rotation requiring a magic state, we apply a remote Z-rotation (Figure 10b): by performing a long-range cnot(q, q′) to a boundary ancilla q′ ∈ B prepared in |0⟩, the state (α|0⟩ + β|1⟩)|0⟩ of q and q′ becomes α|00⟩ + β|11⟩. Therefore we can apply the Z rotation remotely and use an X measurement on q′ to collapse the state back to one logical qubit. Similarly, for a qubit q that needs to perform an X rotation requiring a magic state, we apply a long-range cnot(q′, q) to an ancilla q′ prepared in |+⟩ on the boundary, giving cnot(q′, q) |+⟩(α|+⟩ + β|−⟩) = α|++⟩ + β|−−⟩. Therefore we apply the X rotation remotely and collapse the state back by a single-qubit Z measurement of q′.

The task of compiling a parallel rotation circuit therefore reduces to applying a set of cnot gates from the boundary to the sites of the rotation gates. This can be achieved by finding an appropriate EDP set and running the EDP subroutine of Algorithm 3.1. Compared to the task of finding an EDP set for parallel cnot gates of Section 3, there is one simplifying condition here: any boundary qubit can be used for each cnot when applying remote rotations. As we explain below, we can find the maximum EDP set for the compilation of remote rotations by solving a (unit) Max Flow problem [KT06b], in which the total flow out of the source s, Σ u f ((s, u)), is maximized.
To understand why this yields a maximum EDP, we first point out that a solution for which f has binary values provides an EDP set by building paths from those edges e for which f (e) = 1. Moreover, this EDP set must be maximum, because a larger EDP set would imply a larger flow than f , which is the maximum flow by definition. Indeed the Ford-Fulkerson algorithm [FF56] solves Max Flow in runtime bounded by O(|E(G)||f |) and finds flow values f (e) ∈ {0, 1} on all e ∈ E(G) because of the unit capacity constraints, f (e) ≤ 1. Therefore, f corresponds to a maximum EDP set [KT06b, Section 7.6].
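A minimal Python sketch of this reduction is given below. It is an illustration under assumptions, not the paper's implementation: the exact flow network used by the authors is not reproduced here, so the sketch assumes a super-source joined to every rotation site and every boundary vertex joined to a super-sink, all with unit capacity, and it uses breadth-first augmenting paths (the Edmonds-Karp variant of Ford-Fulkerson). The function name and the edge-list input are hypothetical.

```python
from collections import deque

def max_edp_to_boundary(edges, rotation_sites, boundary):
    """Size of a maximum set of edge-disjoint paths from rotation sites to the boundary.

    edges: iterable of undirected edges (a, b), each with capacity 1.
    Each rotation site starts at most one path and each boundary vertex
    ends at most one path (assumed network construction).
    """
    s, t = ("source",), ("sink",)
    cap = {s: {}, t: {}}

    def add_arc(a, b, c):
        cap.setdefault(a, {})
        cap.setdefault(b, {})
        cap[a][b] = cap[a].get(b, 0) + c
        cap[b].setdefault(a, 0)          # residual arc

    for a, b in edges:                   # undirected edge = one unit arc each way
        add_arc(a, b, 1)
        add_arc(b, a, 1)
    for q in rotation_sites:
        add_arc(s, q, 1)
    for b in boundary:
        add_arc(b, t, 1)

    flow = 0
    while True:
        prev = {s: None}
        queue = deque([s])
        while queue and t not in prev:   # breadth-first search for an augmenting path
            a = queue.popleft()
            for b, c in cap[a].items():
                if c > 0 and b not in prev:
                    prev[b] = a
                    queue.append(b)
        if t not in prev:
            return flow
        b = t
        while prev[b] is not None:       # push one unit of flow along the path
            a = prev[b]
            cap[a][b] -= 1
            cap[b][a] += 1
            b = a
        flow += 1
```

Because all capacities are 1, every augmentation adds exactly one path, and the returned value is the number of rotations that can be served in one round.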
The remote rotation subroutine (Algorithm 4.1) executes a set of parallel single-qubit rotations. Each iteration can be performed in depth 4 using the EDP subroutine. On the surface code architecture, we can give strong guarantees on the number of iterations required to execute a set of parallel rotations by the Max Flow to min-cut equivalence.

We bound the runtime of the remote rotation subroutine by O(n^2 √|G m |) as follows: At most O(√|G m |) iterations of the while loop are necessary (see proof of Theorem 4.2). In each iteration, the call to max_rotations(G m ) is dominated by solving a Max Flow instance using the Ford-Fulkerson algorithm [FF56], which has a runtime bounded by O(n^2).
One could consider a number of generalizations and variations of this compilation subroutine for parallel rotation circuits. For instance, when the number of rotation gates is small, it may be useful to find VDP sets rather than EDP sets so that the VDP subroutine rather than the EDP subroutine can be applied. There is a different reduction to Max Flow in this case which can be obtained by replacing each vertex with two vertices, one with the incoming edges and one with the outgoing edges, connected by a directed edge with capacity 1. This guarantees that only one unit of flow can pass through every vertex.
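The vertex-splitting step is short enough to write down directly. The sketch below assumes the graph is given as a collection of directed arcs; the function name and the `"in"`/`"out"` tags are hypothetical choices made for this illustration.

```python
def split_vertices(arcs):
    """Turn vertex-disjointness into edge-disjointness by splitting each vertex.

    arcs: iterable of directed arcs (a, b). Vertex v becomes (v, "in") and
    (v, "out"), joined by a single arc that is meant to carry capacity 1.
    """
    arcs = list(arcs)
    vertices = {v for arc in arcs for v in arc}
    split = {((v, "in"), (v, "out")) for v in vertices}
    split |= {((a, "out"), (b, "in")) for a, b in arcs}
    return split
```

Every path through a vertex v must use the arc from `(v, "in")` to `(v, "out")`, so a unit-capacity Max Flow on the split graph yields vertex-disjoint paths.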
Algorithm 5.1: EDPC: a surface code compilation algorithm for any circuit C = g 1 . . . g l . An operation g i is available if it has not been executed and all operations g j with overlapping support, for j < i, are executed.
Input : Circuit C with Paulis commuted to the end and merged with measurement
while available operations in C :
    …
    for P ∈ Q :
        run EDP subroutine (Algorithm 3.1) on P

Although we do not consider other single-qubit rotations in our input circuit for compilation, it is worth noting that any single-qubit rotation gate Z(θ) can be approximately synthesized to arbitrary precision [RS16] using |S⟩ and |T⟩ states along with the surface code operations shown in Figure 1. The approach used to apply S and T gates shown in Figure 10a can also be used to apply any rotation Z(θ) within the grid of surface codes by synthesizing the rotation at the boundary. However, if one considers more general rotations in the input circuit, the time needed for synthesis at the boundary will need to be accounted for and accommodated by other aspects of the overall surface code compilation algorithm. Another extension that can be considered is if multi-qubit diagonal gates are allowed in the input circuit. We show how X and Z rotations generalize to multi-qubit diagonal gates in Appendix D, although we do not use this in our surface code compilation algorithm.
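The remote rotation identity used in this section can be checked numerically for an arbitrary angle θ. The following pure-Python sketch tracks the two-qubit state through a long-range cnot to a boundary ancilla, a Z(θ) rotation on the ancilla, and an X measurement of the ancilla. It assumes the convention Z(θ) = diag(e^{−iθ}, e^{iθ}), chosen so that T = Z(π/8) up to a global phase, and it assumes that the |−⟩ outcome is fixed by a Z correction on the data qubit; the function names are hypothetical.

```python
import cmath
import math

def z_rot(theta):
    """Diagonal of Z(theta) = diag(e^{-i theta}, e^{i theta}) (assumed convention)."""
    return (cmath.exp(-1j * theta), cmath.exp(1j * theta))

def remote_z(alpha, beta, theta, outcome):
    """Apply Z(theta) to data qubit q by rotating a boundary ancilla q'.

    Amplitudes are indexed as amp[(q, q')]. outcome is the X-measurement
    result on q': 0 for |+>, 1 for |->. Returns the normalized state of q.
    """
    # cnot(q, q') on (alpha|0> + beta|1>)|0> gives alpha|00> + beta|11>.
    amp = {(0, 0): alpha, (1, 1): beta}
    # Rotate the boundary ancilla q' only.
    d = z_rot(theta)
    amp = {(q, a): c * d[a] for (q, a), c in amp.items()}
    # Measure q' in the X basis: project onto (|0> + (-1)^outcome |1>)/sqrt(2).
    post = [0j, 0j]
    for (q, a), c in amp.items():
        post[q] += c * (-1) ** (outcome * a) / math.sqrt(2)
    # Outcome |-> leaves a Z error on q, undone here by a Z correction.
    if outcome == 1:
        post[1] = -post[1]
    norm = math.sqrt(sum(abs(c) ** 2 for c in post))
    return [c / norm for c in post]

# Both measurement outcomes reproduce Z(theta) acting directly on q.
alpha, beta, theta = 0.6, 0.8j, math.pi / 8
direct = [alpha * z_rot(theta)[0], beta * z_rot(theta)[1]]
for outcome in (0, 1):
    remote = remote_z(alpha, beta, theta, outcome)
    print(all(abs(r - e) < 1e-12 for r, e in zip(remote, direct)))
```

The X-rotation case is the same calculation with the roles of the Z and X bases exchanged.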

EDPC surface code compilation algorithm
In this section we construct the EDPC algorithm for compiling universal input circuits into surface code operations by combining subroutines Algorithm 3.1 and Algorithm 4.1 for compiling long-range cnots and Z/X rotations respectively. First we provide a more formal definition of surface code compilation:

The surface code compilation algorithm EDPC (Algorithm 5.1) combines the bounded T -operator set algorithm for parallel cnots with the remote rotation subroutine. Note that the input circuit is considered to be a sequence of operations rather than a series of time steps that specify the operations in each time step, such that l is the number of operations of the input circuit, not the depth.
We bound the classical runtime of EDPC given an input circuit with depth D acting on n qubits. It is useful to note that each of the D layers of the input circuit can be decomposed into a set of parallel rotations followed by a set of parallel cnots, each acting on at most n qubits. Recall that the remote rotation subroutine has a runtime bounded by O(n^2.5), whereas compiling a set of parallel cnots has a runtime of at most O(n^3 log n). Thus, EDPC has a runtime bounded by O(D n^3 log n).
Circuits compiled by EDPC can be bounded in depth as listed in Table 1. Our claim for a single cnot is trivial.

There are various modifications of EDPC that are worth considering. Firstly, the bounded T -operator set algorithm (Algorithm 3.2) can be improved by better algorithms for finding maximum operator EDP sets. Secondly, the requirement to execute all available gates before moving on to the next set could be relaxed. This could increase the number of long-range gates that are performed in parallel but would require careful scheduling with Hadamard gate execution, which may block some paths. Lastly, EDPC leans heavily on finding operator EDP paths and the EDP subroutine, but a similar surface code compilation algorithm could be constructed from operator VDP paths and the VDP subroutine instead. We believe that larger maximum EDP sets allow EDPC to apply more gates simultaneously (see Section 3.1), and more so if algorithms for approximating maximum operator EDP sets can be adopted from EDP approximation algorithms [AR95;KT95]. Both of these features can give asymptotic improvements at only a 2× depth increase over the VDP subroutine. However, it is not difficult to construct instances where a VDP-based approach would give a lower depth, motivating a more nuanced trade-off between our EDP-based approach and a VDP-based approach.

Comparison of EDPC with existing approaches
In this section, we compare EDPC with other approaches in the literature. We first mention some of the features and shortcomings of the well-established approach of Pauli-based computation in Section 6.1. Then we address a more recently proposed compilation approach based on network coding in Section 6.2. In Section 6.3 we specify a swap-based compilation algorithm [CSU19] and use this as a benchmark for numerical studies of the performance of an implementation of EDPC in Section 6.4.

Surface code compilation by Pauli-based computation
One well-established surface code compilation approach is known as Pauli-based computation, which is described in [Lit19]. For an algorithm expressed in terms of Clifford and T gates, Pauli-based computation first involves re-expressing the algorithm as a sequence of joint multi-qubit Pauli measurements along with additional ancilla qubits prepared in T states. This re-expressed circuit has no Clifford operations, and the circuit depth can be straightforwardly deduced from the input circuit since each T gate results in two [CC21] joint Pauli measurements. This re-expression of the circuit essentially comes from first replacing each T gate by a small gate teleportation circuit consisting of an ancilla in a T state and a two-qubit joint Pauli measurement, and then commuting all Clifford operations to the end of the circuit. The main advantage of the Pauli-based computation approach is that all Cliffords are removed from the input circuit, resulting in no cost for cnot circuits in Table 1.
That said, this approach has a major drawback. When a Clifford circuit is commuted through a two-qubit joint Pauli measurement, it is transformed into Pauli measurements which can have support on all logical qubits. Therefore, the resulting circuit may contain measurements with large overlapping support that need to be performed sequentially (even when the T gates in the input circuit were acting on disjoint qubits during the same time step). The sequential nature of the joint measurements causes a fixed rate of T -state consumption that does not grow with the number of logical qubits and results in a Θ(k) depth for k parallel rotations, as listed in Table 1. The depth for parallel rotations is significantly higher than for EDPC and could lead to a larger space-time cost for circuits with many T gates per time step.
A modified version of this Pauli-based computation compilation algorithm can be used to implement more T gates in parallel [Lit19, Section 5.1]. However, as highlighted in [CC21, Section V.A], this results in a significant increase of total logical space-time cost when compared to the standard Pauli-based computation compilation algorithm, even when disregarding the increased T -factory costs that would be needed to achieve a higher T state production rate.
In contrast with Pauli-based computation, one of our goals when designing the EDPC algorithm was to maintain the parallelism present in the input circuit, such that input circuits with higher numbers of T gates per time step are compiled to circuits with a higher T -state consumption rate.

Surface code compilation by network coding
Another approach to surface code compilation, based on the field known as linear network coding [Ahl+00], can be built from the framework put forward in Beaudrap and Herbert [BH20]. Similar to our EDPC algorithm, the essential idea in this compilation scheme is to generate sets of Bell pairs in order to implement operations acting on pairs of distant qubits.
In the abstract setting of network coding [LL04], one is given a directed graph G NC and a set of terminal pairs T = {(s 1 , t 1 ), . . . , (s k , t k )} for source terminals s i ∈ V (G NC ) and target terminals t i ∈ V (G NC ) for i ∈ [k]. Messages are passed through edges according to a linear rule. Namely, the value of the message associated with an edge is given as a specific linear combination of the values of those edges which are directed at the edge's head. One can consider the task of "designing a linear network code" by specifying the linear function at each edge in the graph such that when any messages are input via the source vertices s 1 , . . . , s k , then those same messages are copied over to the corresponding output via the target vertices t 1 , . . . , t k .
A number of works have considered how linear network coding theory can be applied to the quantum setting [LOW10; Kob+09; Kob+11; SLI12; HPE19]. Beaudrap and Herbert [BH20] give a construction for a constant-depth circuit to generate Bell pairs across the terminal pairs T on a set of ancilla qubits corresponding to the vertices of G NC with cnots allowed on the edges of G NC . This is similar to, but not precisely the same scenario as we consider for surface code compilation in this paper since the basic operations are cnots rather than the elementary operations of the surface code, and since only ancilla qubits are considered without any data qubits. However, it should be quite straightforward to modify the approach in Beaudrap and Herbert [BH20] to form a surface code compilation algorithm. For example, one could use a layout similar to that which we use for EDPC in Figure 7, with G NC corresponding to a connected subset of ancilla qubits among a set of data qubits. The Bell pairs produced by the linear network coding approach could then be used to compile long-range operations between data qubits.
In such a network coding based compilation algorithm, the task of compiling an input circuit into surface code operations would largely rely on subroutines for (1) identifying T to implement the circuit's long-range gates, and (2) designing a linear network code for T . A major barrier to forming a usable compilation algorithm with linear network coding is that we are unaware of the existence of any efficient algorithm to design linear network codes, or even to identify if a given terminal pair set admits any linear network code. Even if such a linear network code can be found efficiently, there exist sets T for which network coding cannot provide a depth advantage over EDPC.
Any surface code compilation algorithm of cnot circuits with k parallel cnots, including EDPC and algorithms using network coding, is lower bounded in the worst case by Theorem 3.4 to a depth of Ω( √ k). This bound is loose when k is superconstant and sublinear in n since EDPC has a trivial upper bound of O(k) and a bound of O( √ n) by Theorem 3.3 on the compiled circuit depth. Therefore, it remains an open question whether network coding can give an advantage for the compiled circuit depth for such k.

Figure 11: On a rotated L 1 × L 2 grid (here, 4 × 5), we can implement an odd-even pattern of swaps on data qubits (gray) using ancillas (white). Row-wise and column-wise swaps used in swap routing on a grid [ACG94] can be modified as shown above so that ancillas used for swaps do not overlap. Therefore, any arbitrary permutation on a rotated grid can be implemented in space-time 4(L 1 + 1) + 2(L 2 + 1).

Surface code compilation by swap
Here we specify a swap-based compilation algorithm, stated in Algorithm 6.1, which we use to benchmark our EDPC against in Section 6.4. We assume the 1-to-1 ancilla-to-data qubit ratio as illustrated in Figure 11. This is more qubit-efficient than the 3-to-1 ratio we use for EDPC, and it allows the swap gadget in Figure 3b to be implemented between diagonally neighboring data qubits. The first step of the swap-based compilation algorithm is to assign each of the input circuit's qubits to a data qubit in the layout. Then, the gates in the input circuit are collected together into sets of disjoint gates. Before each set of gates, a permutation built from swap-gates is applied, which re-positions the qubits so that the gates in the set can be applied locally. We assume that the available local operations are the same as for our EDPC algorithm. In particular, we assume that the rotation gates (S, T , S x and T x ) can only be implemented at the boundary and that other single-qubit operations are performed as described in Section 2.1. One exception is that, to simplify our analysis, we assume that the Hadamard can be performed without the need of three ancilla patches; this assumption could lead to an underestimate of the resources required for this swap-based compilation algorithm.
Algorithm 6.1: swap compilation: We construct an algorithm based on the greedy depth mapper algorithm from [CSU19]. Let us implicitly define route(π), for mapping π, which finds a swap circuit for implementing partial permutations [CSU19]. We can compute the required partial permutation from the current mapping of qubits, and the given future mapping π.
Input : A circuit C with all Paulis commuted to the end and merged with measurement
function cost(mapping π, vertices v 1 , v 2 ):
    return depth and edge attaining min e∈M depth(route(π + {v 1 → e 1 , v 2 → e 2 }))
while available gates in C :
    …
    execute gates on qubits mapped by π since they are now local

There are two main components of our swap-based algorithm which remain to be specified: how the permutations are implemented, and how we choose to separate the input circuit into a sequence of sets of disjoint gates. To permute the positions of data qubits, sequences of swap operations are used. Any permutation of the n vertices in a square grid can be achieved in at most 3 √ n rounds of nearest-neighbor swaps [SHT19]. To do this involves three stages, with the first and third stages each involving rounds of swap-gates within rows only, and the second stage involving rounds of swap-gates within columns only. A round of swap-gates within either rows only or within columns only is implemented with surface code operations as shown in Figure 11. This immediately shows that this approach is asymptotically tight for parallel circuits because the depth of a swap-based approach is lower bounded by the √ n diameter of the architecture grid for one long-range cnot or rotation gate from the center of the grid. Therefore a parallel input circuit is compiled by the swap-based algorithm to an output circuit with depth Θ( √ n), including all the examples in Table 1.
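The rounds of swap-gates within a single row or column follow an odd-even pattern, which can be sketched as an odd-even transposition sort. The Python below is an illustrative sketch only: it routes one row in which every position holds the index of its destination, and it does not implement the full three-stage grid routing (choosing the row permutations of the first stage is not shown). The function name is hypothetical.

```python
def odd_even_rounds(row):
    """Sort a row of distinct destination indices with rounds of disjoint neighbor swaps.

    Returns the sorted row and, for each of the len(row) rounds, the list of
    swapped position pairs. Rounds alternate between even and odd pairs.
    """
    row = list(row)
    rounds = []
    for r in range(len(row)):
        swaps = []
        for i in range(r % 2, len(row) - 1, 2):   # pairs (i, i+1) of one parity
            if row[i] > row[i + 1]:
                row[i], row[i + 1] = row[i + 1], row[i]
                swaps.append((i, i + 1))
        rounds.append(swaps)
    return row, rounds

# A reversed row of five qubits is sorted within five rounds.
final, rounds = odd_even_rounds([4, 3, 2, 1, 0])
print(final, len(rounds))
```

All swaps within one round act on disjoint pairs, so each round can be executed in parallel, and a row of length L is routed in at most L rounds.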
There is considerable freedom in how to collect together gates from the input circuit into sets of disjoint gates. In our implementation in Algorithm 6.1, we use the greedy depth mapper algorithm from [CSU19], with a small modification to ensure that S and T gates are performed at the boundary. This algorithm also incorporates some further optimizations as described in [CSU19], including a partial mapping of qubits to locations, leaving the remaining qubits to go anywhere in an attempt to minimize the swap circuit depth.

Numerical results
Here we numerically compare the performance of EDPC with the swap-based compilation algorithm (Algorithm 6.1) when applied to a number of different input circuits. Note that our implementation of the EDPC compilation algorithm here differs slightly from that given in Algorithm 5.1, by greedily executing cnots earlier where possible. See Appendix E for details of the implementation.
Our first input circuit example consists of random parallel cnot circuits of different gate densities. The density n cnot of a circuit is how many of the data qubits are involved in a cnot gate in any such set. Therefore, n cnot = 0.1n means that 10% of all qubits (n) are performing a cnot gate in each set. For each data point, we sample 10 random circuits and plot the mean space-time cost in Figure 12 with the standard error of the mean in the shaded region. The runtime of the swap protocol was bounded by 2 days, which was insufficient for larger instances of these random circuits at high densities.
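For concreteness, one way to draw such a random parallel cnot layer and to summarize repeated runs is sketched below. The paper does not state how its random circuits are sampled, so the uniform sampling used here, the function names, and the rounding of the density to an even number of qubits are all assumptions made for illustration.

```python
import random
import statistics

def random_parallel_cnots(n, density, rng):
    """Sample disjoint (control, target) pairs touching about density * n qubits."""
    involved = 2 * (int(round(density * n)) // 2)   # an even number of qubits
    qubits = rng.sample(range(n), involved)         # distinct qubits, random order
    return list(zip(qubits[0::2], qubits[1::2]))

def mean_and_sem(values):
    """Mean and standard error of the mean of a list of costs."""
    return statistics.mean(values), statistics.stdev(values) / len(values) ** 0.5

# Ten percent density on 100 qubits gives 5 disjoint cnots per layer.
layer = random_parallel_cnots(100, 0.1, random.Random(0))
print(len(layer))
```

The mean and standard error computed by `mean_and_sem` correspond to the plotted line and shaded region described above, given one space-time cost per sampled circuit.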
We also consider a more structured input circuit, namely implementing half of a multi-controlled-X gate, $\mathrm{c}^k\mathrm{not}$. We consider decompositions of $\mathrm{c}^k\mathrm{not}$ for $k$ an integer power of 2, but only compile the first half of the circuit, given in Figure 13a. A T-efficient implementation of $\mathrm{c}^k\mathrm{not}$ uses measurement and feedback for uncomputation [Jon13], which are not captured in our model (see Section 7). We plot the space-time cost of compiling the half $\mathrm{c}^k\mathrm{not}$ in Figure 13b. We see that the dependence on the number of qubits $k$ is worse for swap-based compilation, and results in a larger space-time cost starting at 64 qubits. Unfortunately, the swap-based compilation is quite slow: we ran the algorithm for at most 3 days and 9 hours at each data point and were only able to obtain results up to 128 qubits. However, the data we were able to obtain indicates a cross-over for compiling $\mathrm{c}^k\mathrm{not}$ circuits. The swap-based compilation has better space-time performance for small instances, while EDPC has better space-time performance for compiling large $\mathrm{c}^k\mathrm{not}$ circuits.

Conclusion
In this paper, we have introduced the EDPC algorithm for the compilation of input quantum circuits into operations which can be implemented fault-tolerantly with the surface code. The heart of this algorithm lies in the EDP subroutine, which can implement both sets of parallel long-range cnot gates and sets of parallel rotations in constant depth using existing efficient graph algorithms to find sets of edge-disjoint paths. EDPC has advantages over other compilation approaches including Pauli-based computation, network coding based compilation, and swap-based compilation. We numerically find that EDPC significantly outperforms swap-based circuit compilation in the space-time cost of random cnot circuits for a broad range of instances, and for larger $\mathrm{c}^k\mathrm{not}$ gates. However, many details of EDPC can be improved, as it is only a first step towards using long-range operations for surface code compilation.

Figure 13: We compare the space-time cost of compiling a T-gate optimized circuit decomposition for a half $\mathrm{c}^k\mathrm{not}$ circuit to the surface code using EDPC and swap compilation. We see in the log-log plot (b) that the dependence of the space-time cost on $n$ gives a higher scaling dependence in the case of swap compilation than EDPC. This results in a lower space-time cost for EDPC starting from 64 qubits.
EDPC requires sets of constrained edge-disjoint paths, which we call operator paths and which run almost entirely along ancilla qubits. Better algorithms for finding maximum sets of edge-disjoint operator paths could improve EDPC. It seems likely that an $O(\log n)$-approximation algorithm for finding maximum EDP sets on grids [AR95] can be modified to give an algorithm for finding maximum sets of edge-disjoint operator paths on grids. A polylogarithmic approximation algorithm for this task would imply an approximation algorithm for minimizing the depth, up to a polylogarithmic factor, of compiling parallel cnots using the EDP subroutine. In practice, however, it is also important to find approximation algorithms with reasonable constant prefactors.
The runtime complexity of EDPC for an input circuit of depth $D$ acting on $n$ qubits is $O(Dn^3 \log n)$. This is significantly faster than the swap-based compilation in Section 6.3, whose runtime was found to be $O(Dn^5)$ in Childs, Schoute, and Unsal [CSU19]. We found that our implementation of the swap-based compilation is much slower than that of EDPC on small instances, and that the swap-based algorithm had impractically long runtimes when applied to circuits beyond a few hundred qubits, the regime of large-scale applications of quantum algorithms [Rei+17;GE21]. Potential ways to further improve EDPC's runtime include using a dynamic decremental all-pairs shortest path algorithm in the greedy approximation of the maximum EDP set, or finding faster and better approximation algorithms for finding the maximum set of edge-disjoint operator paths.
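To illustrate why shortest-path computations dominate the greedy approximation mentioned above, the following sketch repeatedly routes the remaining terminal pair with the shortest available path and then deletes the used edges. It is an assumption-laden toy: it solves the plain EDP problem on a full grid, not the constrained operator-path variant used by EDPC, and the names `grid_edges`, `shortest_path` and `greedy_edp` are hypothetical.

```python
from collections import deque


def grid_edges(rows, cols):
    """Undirected edges of a rows x cols grid, stored as frozensets."""
    edges = set()
    for r in range(rows):
        for c in range(cols):
            if r + 1 < rows:
                edges.add(frozenset({(r, c), (r + 1, c)}))
            if c + 1 < cols:
                edges.add(frozenset({(r, c), (r, c + 1)}))
    return edges


def shortest_path(edges, source, target):
    """Breadth-first search over the remaining edges; vertex list or None."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if v == target:
            path = []
            while v is not None:
                path.append(v)
                v = parent[v]
            return path[::-1]
        r, c = v
        for w in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if w not in parent and frozenset({v, w}) in edges:
                parent[w] = v
                queue.append(w)
    return None


def greedy_edp(rows, cols, pairs):
    """Greedily route the remaining pair that has the shortest path."""
    edges = grid_edges(rows, cols)
    remaining = list(pairs)
    routed = {}
    while remaining:
        candidates = [(p, shortest_path(edges, *p)) for p in remaining]
        candidates = [(p, path) for p, path in candidates if path]
        if not candidates:
            break  # no remaining pair can be connected
        pair, path = min(candidates, key=lambda item: len(item[1]))
        routed[pair] = path
        remaining.remove(pair)
        # Edges on the chosen path are unavailable to later paths.
        edges -= {frozenset(e) for e in zip(path, path[1:])}
    return routed


pairs = [((0, 0), (0, 3)), ((3, 0), (3, 3)), ((0, 1), (3, 1))]
paths = greedy_edp(4, 4, pairs)
```

Each round recomputes a shortest path for every remaining pair after edges are deleted, which is the step a decremental shortest-path data structure would aim to speed up.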
Any diagonal gate in the Z (or X) basis can be performed remotely on the boundary, including CCZ gates [GF19] (see Appendix D). Therefore, our results on applying Z(θ) rotations can be extended to diagonal gates, which can reduce circuit depth.
Even with the capability to perform long-range operations it may still be helpful to localize the quantum information on some part of the architecture such as by permuting the data qubits. In particular, the size of the EDP set is bounded above by the minimum edge cut separating the terminals. Therefore, it may be beneficial to first redistribute quantum information where it is needed to ensure large EDP solutions exist. It is straightforward to construct a long-range move of a data qubit to an ancilla in depth 2 from a long-range cnot, by performing the cnot targeting a $|0\rangle$ ancilla state and measuring the source in the X basis up to Pauli corrections. It is also straightforward to adapt the EDP subroutine to perform sets of these long-range moves along operator paths, now ending at the ancilla, in depth 4. The depth to permute only a few qubits a long distance can be improved significantly by this technique. For example, a swap of the two corners of an $L \times L$ grid architecture takes $O(1)$ depth using long-range moves, as opposed to $\Omega(L)$ depth using conventional swaps. It remains an open question how to trade off permuting data qubits (using swaps or long-range moves) and directly using long-range cnots.
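The long-range move described above can be checked on a two-qubit state vector. The sketch below is a minimal illustration under the assumption that the long-range cnot acts as an ideal logical cnot; the function name `long_range_move` is hypothetical, and the Pauli correction is taken to be a Z on the target conditioned on the X-basis outcome of the source.

```python
import math
import random


def long_range_move(a, b, rng):
    """Move a|0> + b|1> from a source qubit to a |0> ancilla.

    Amplitudes are indexed as state[source][target].
    """
    # The source holds the data; the ancilla (target) starts in |0>.
    state = [[a, 0j], [b, 0j]]
    # cnot controlled on the source: flip the target when source = 1.
    state[1][0], state[1][1] = state[1][1], state[1][0]
    # X-basis measurement of the source: project onto <+| or <-|.
    s = 1 / math.sqrt(2)
    branches = []
    for x in (0, 1):
        sign = -1 if x else 1
        target = [s * (state[0][t] + sign * state[1][t]) for t in (0, 1)]
        prob = sum(abs(amp) ** 2 for amp in target)
        branches.append((x, prob, target))
    weights = [branch[1] for branch in branches]
    x, prob, target = rng.choices(branches, weights=weights)[0]
    target = [amp / math.sqrt(prob) for amp in target]
    # Pauli correction: apply Z to the target when the outcome is 1.
    if x == 1:
        target[1] = -target[1]
    return x, target


outcome, moved = long_range_move(0.6, 0.8j, random.Random(1))
```

For either measurement outcome the corrected target state equals the original source state, which is why the move costs only the depth of one long-range cnot plus one measurement layer.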
We have assumed that classical feedback is not present in the input circuit for clarity of presentation. EDPC can readily be extended to the setting of classical feedback in the input circuit to form a "just-in-time" surface code compilation algorithm. To do so, a larger computation would be broken up into a sequence of circuit executions without classical feedback, where prior measurement results specify the next circuit to compile and execute.

Figure 14: (a) A d = 5 surface code patch implemented in a grid of data physical qubits (black disks), and ancilla physical qubits (white disks). Error correction is implemented with single-qubit operations and cnots between pairs of qubits connected by a dashed edge. Z and X type stabilizers are associated with alternating red and blue faces. (b) A decoding graph that is defined by associating an edge with each qubit and a vertex with each stabilizer. If stabilizers are measured perfectly, Z errors on data qubits (marked in red) can be corrected by finding a minimum weight matching (green edges) of vertices associated with unsatisfied X stabilizers (yellow disks).

A Surface code architecture
Here we review some basic details of the surface code focusing on the elementary logical operations shown in Figure 1. This is intended as a high-level overview to provide some intuition of how the logical operations in Figure 1 arise and what their resource costs are. For more thorough reviews of surface codes see Refs. [Bom13;Fow+12;Bro+17].
To implement the surface code, we assume physical qubits are laid out on the vertices of a 2D grid, with nearest-neighbor interactions allowed. For concreteness, we will describe here an implementation of lattice surgery with the rotated surface code with half-moon boundary [LR14], although our EDPC algorithm can use other implementations. A single surface code patch encodes a single logical qubit in $2d^2 - 1$ physical qubits, where the odd parameter $d$ is known as the code distance, which corresponds to the level of noise protection; see Figure 14(a). For clarity, within this section of the appendix we refer to physical qubits and logical qubits explicitly; in other sections, however, we often drop the word "logical" when referring to logical qubits for brevity.
We designate every odd physical qubit as a data physical qubit in the patch, and every even physical qubit as an ancilla physical qubit to facilitate a stabilizer measurement; see Figure 14(a). The code space of a surface code consists of those states of the data physical qubits which are simultaneous +1 eigenstates of the set of stabilizer generators. The stabilizer generators can be associated with faces and are either $X \otimes X \otimes X \otimes X$ or $Z \otimes Z \otimes Z \otimes Z$ operators for the bulk (interior) of the code or $X \otimes X$ or $Z \otimes Z$ operators on the boundary. We can see that the logical Z operator, $Z_L$, defined as any path of single-qubit Z operators on physical qubits connecting the rough boundaries, commutes with all stabilizers. Similarly, the logical X operator, $X_L$, is a path of X operators connecting the smooth boundaries.
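The commutation claims above can be verified mechanically for a small example. The sketch below uses a hypothetical d = 3 patch whose nine data qubits are numbered row by row; the specific face and boundary assignment is an assumption chosen for illustration and need not match the orientation drawn in Figure 14. Two Pauli strings commute exactly when they differ on an even number of sites where both act non-trivially.

```python
def pauli(kind, support, n=9):
    """Pauli string with operator `kind` on `support`, identity elsewhere."""
    return "".join(kind if i in support else "I" for i in range(n))


def commute(p, q):
    """True if Pauli strings p and q (over I, X, Y, Z) commute."""
    anticommuting_sites = sum(
        1 for a, b in zip(p, q) if a != "I" and b != "I" and a != b
    )
    return anticommuting_sites % 2 == 0


# Assumed layout: data qubits 0..8 on a 3 x 3 grid, numbered row by row.
x_faces = [(1, 2, 4, 5), (3, 4, 6, 7), (0, 1), (7, 8)]
z_faces = [(0, 1, 3, 4), (4, 5, 7, 8), (3, 6), (2, 5)]
stabilizers = [pauli("X", f) for f in x_faces] + [pauli("Z", f) for f in z_faces]

# Logical operators as paths across the patch in this assumed layout.
logical_z = pauli("Z", (0, 1, 2))  # top row
logical_x = pauli("X", (0, 3, 6))  # left column

all_commute = all(commute(s, t) for s in stabilizers for t in stabilizers)
logicals_ok = all(
    commute(logical_z, s) and commute(logical_x, s) for s in stabilizers
)
```

In this layout the eight generators on nine data qubits leave one logical qubit, and the two logical paths overlap on a single qubit, so they anticommute as a logical X and Z should.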
For quantum error correction, it is necessary to repeatedly measure stabilizer generators. Stabilizer generators can be measured by running small circuits consisting of the preparation of the ancilla physical qubit, cnots between the ancilla physical qubit and the data physical qubits, followed by measurement of the ancilla physical qubit. Error correction can be performed by associating qubits with edges and stabilizer generators with vertices of a so-called decoding graph; see Figure 14b. A classical algorithm known as a decoder is used to infer a set of edges (specifying the support of the X or Z correction) given a subset of vertices (corresponding to unsatisfied Z or X stabilizers, that is stabilizer generators with measurement outcome -1). Figure 14b shows an example of this in the setting of perfect stabilizer measurements, although this can be generalized to handle faulty measurements by repeating measurements.
Logical operations can be implemented fault-tolerantly on logical qubits encoded in surface codes. For example, a destructive logical X measurement of a patch is implemented by measuring all data qubits in the X basis, and then using a decoder to process the physical outcomes and reliably identify the logical measurement outcome. Another important logical operation is the non-destructive measurement of a logical joint Pauli operator using an approach known as lattice surgery [Hor+12] as shown in Figure 15a. To simplify lattice surgery by lining up the boundary stabilizers of neighboring patches, we consider a tiling of the plane using two versions of distance $d$ surface code patches as shown in Figure 15b, which forms a grid of logical qubits. Logical $Z_L \otimes Z_L$ can be measured between vertical neighbor patches while $X_L \otimes X_L$ can be measured between horizontal neighbor patches.
The allowed fault-tolerant logical operations that we assume throughout the paper and the resources they require are listed in Figure 1. These are largely based on the rules specified in [Lit19]. Here we justify the resource requirements for the logical operations in Figure 1 not covered in [Lit19] on a distance-$d$ surface code. For space analysis, we work in units of full surface code patches such that if any qubits from a patch are needed to implement an operation the full patch is counted. We show how to implement the operations in terms of more elementary Pauli measurements. The move operation can be implemented in depth 1 with the target qubit as ancilla, as shown in Figure 16. The Hadamard can be implemented in depth 3 with three ancilla patches along with the move operation as shown in Figure 17. Finally, Bell measurement and preparation can be implemented in depth 1 as shown in Figure 18.
It is worth mentioning that there is considerable freedom in the detailed choice and implementation of the surface code which could have an impact on the space-time cost of logical operations, both at the physical level but also in some cases at the logical level. For example, the Hadamard could be performed using just one logical ancilla patch if each patch were padded with extra qubits. We do not explore these alternatives here, but note that our EDPC algorithm can still be applied if these alternatives are used.

Figure 15: (a) A logical $Z_L \otimes Z_L$ measurement is performed by lattice surgery in the following steps: (i) Stop measuring the weight-two stabilizers along the horizontal boundary between the patches. (ii) Reliably measure the bulk faces for a single vertically-extended patch. Note that $Z_L \otimes Z_L$ can be inferred from the product of the outcomes of the newly measured red faces. This temporarily merges the patches to form a single extended surface code patch. (iii) Reliably measure once more the weight-two faces along the horizontal boundary between the patches. This separates the pair of patches. (b) Two types of patches tile the plane, with $Z_L \otimes Z_L$ measurements possible between vertically neighboring patches, and $X_L \otimes X_L$ measurements possible between horizontally neighboring patches.

Figure 16: The move operation can be implemented in depth 1 by local and neighboring Pauli measurements. A horizontal move can be implemented by preparing a single-qubit patch in $|0\rangle$, applying a joint XX measurement, and then measuring the original patch in the Z basis (up to Pauli corrections). The vertical move follows from applying a Hadamard to the source qubit $|\phi\rangle$ and a Hadamard on the output. Simplifying the circuit gives the right-hand side in the figure, with a ZZ measurement that is available vertically.

Figure 18: We can implement Bell preparation and measurement in terms of single and two-qubit Pauli measurements in depth 1 as given in Figure 18 [Lit19]. (a) A Bell pair can be prepared from a (horizontal) joint XX measurement of $|00\rangle$ or a (vertical) joint ZZ measurement of $|{+}{+}\rangle$, up to Pauli corrections. (b) A destructive Bell measurement can be implemented by a joint XX measurement followed by individual Z basis measurements, or by a joint ZZ measurement followed by individual X basis measurements.

B Logical space time cost as a proxy for physical space time cost
Here we provide a justification for our use of logical space time cost as a proxy for physical space time cost. As we have seen in Figure 1 and Appendix A, logical operations implemented with the surface code require physical time that scales as $d$ and physical space that scales as $d^2$. For a logical circuit written in terms of a total of $A_{\text{logical}}$ elementary logical operations implemented using surface codes of distance $d$, the physical space-time cost $A_{\text{physical}}$ is approximately
$$A_{\text{physical}} \sim A_{\text{logical}} \, d^3.$$
The probability of any of these elementary operations resulting in a logical failure scales as $p_{\text{fail}} \sim (p/p^*)^{d/2}$, where the fixed system parameters are the physical error rate $p$ and the fault-tolerant threshold for the surface code $p^*$. Moreover, we assume $p_{\text{fail}} \sim 1/A_{\text{logical}}$ to ensure that the logical circuit is reliable with as small a code distance as possible. This suggests that the code distance behaves as
$$d \sim \frac{2 \log A_{\text{logical}}}{\log p^* - \log p}.$$
Therefore we see that the physical and logical space time costs are monotonically related, i.e.,
$$A_{\text{physical}} \sim A_{\text{logical}} (\log A_{\text{logical}})^3. \tag{10}$$
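As a worked illustration of this scaling argument, the sketch below turns the relation between the code distance and the logical cost into a small calculator. The error rate and threshold used in the example are placeholder values, not parameters from this paper, the function names are hypothetical, and all constant factors are ignored, so only the trend is meaningful.

```python
import math


def code_distance(a_logical, p, p_star):
    """Smallest odd d (at least 3) with (p / p_star) ** (d / 2) <= 1 / a_logical."""
    d = 2 * math.log(a_logical) / (math.log(p_star) - math.log(p))
    d = max(3, math.ceil(d))
    return d if d % 2 == 1 else d + 1


def physical_cost(a_logical, d):
    """Physical space-time cost up to constants: A_logical * d^3."""
    return a_logical * d ** 3


# Placeholder parameters (assumptions): physical error rate 1e-3 and
# threshold 1e-2, so each extra pair of distance units gains a factor 10.
small = code_distance(10 ** 6, p=1e-3, p_star=1e-2)
large = code_distance(10 ** 12, p=1e-3, p_star=1e-2)
ratio = physical_cost(10 ** 12, large) / physical_cost(10 ** 6, small)
```

Squaring the logical cost in this example roughly doubles the required distance, so the physical cost grows by the factor of the logical cost times a polylogarithmic overhead, consistent with the monotone relation above.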

C cnot via Bell operations
We list more variations of the standard cnot gate (Figure 3a) that use intermediate Bell preparation and measurements on ancillas in Figure 19. By choosing the right subcircuit, we see that the long-range operations in Figure 8 implement a cnot gate.

D Remote execution of diagonal gates
A gate $D$ diagonal on $k$ source qubits in the computational basis can be executed on $k$ ancillas by first entangling these ancilla qubits with the source qubits using cnots. We call this remote execution. Let the computational basis be $|\ell\rangle$ for $\ell \in [2^k]$; then $D|\ell\rangle = \exp(i\phi_\ell)|\ell\rangle$. We saw one use for remote gates in applying rotations at the boundary requiring magic states (Section 4). We execute $D$ remotely as follows (see Figure 20). First, we initialize the ancillas in the state $|0\rangle^{\otimes k}$. Let the source qubits be in some pure state $\sum_\ell \alpha_\ell |\ell\rangle$, for $\alpha_\ell \in \mathbb{C}$. Then we apply $k$ transversal cnot gates controlled on the source qubits so that the overall state becomes $\sum_\ell \alpha_\ell |\ell\rangle \otimes |\ell\rangle$. We now apply $D$ to the ancillas instead:
$$(1 \otimes D) \sum_\ell \alpha_\ell |\ell\rangle \otimes |\ell\rangle = \sum_\ell \alpha_\ell \exp(i\phi_\ell) |\ell\rangle \otimes |\ell\rangle.$$
We now disentangle the ancillas by measuring them in the X basis. Let the measurement give outcomes $x \in \{0, 1\}^k$; then the state on the source qubits is mapped to
$$\sum_\ell (-1)^{(x, \ell)} \alpha_\ell \exp(i\phi_\ell) |\ell\rangle,$$
where $(x, \ell)$ is the inner product modulo 2 between $x$ and the binary representation of $\ell$. Applying a Z correction to each qubit $j \in [k]$ controlled on measurement result $x_j$ maps the state to $\sum_\ell \alpha_\ell \exp(i\phi_\ell) |\ell\rangle$, as required. This technique can be extended to any unitary operator $U$ since it can be unitarily diagonalized as $U = V D V^\dagger$ by the spectral theorem, for $V$ unitary and $D$ diagonal operators. A particularly simple case is that of unitary operators that are diagonal in the Hadamard basis, where $V = H^{\otimes k}$. We write $U = H^{\otimes k} D H^{\otimes k}$ on the source qubits and apply remote execution of $D$ using our techniques above. We then simplify the circuit to obtain Figure 20b.

Figure 19: Various implementations of a cnot gate with intermediate ancilla qubits and Bell operations. In particular, we are able to apply the control and the target either before (green) or after (teal) Bell preparation and measurement steps, while keeping the depth at 2.
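The remote execution protocol can be simulated directly on basis amplitudes. The following sketch is a minimal illustration with hypothetical names (`parity`, `remote_diagonal`); it writes the basis index as `l`, treats the transversal cnots as ideal, and draws the X-basis outcome uniformly at random, which is the correct distribution here because every outcome occurs with probability $2^{-k}$.

```python
import cmath
import math
import random


def parity(x, l):
    """Inner product modulo 2 of the binary representations of x and l."""
    return bin(x & l).count("1") % 2


def remote_diagonal(alpha, phi, rng):
    """Apply D = diag(exp(i * phi[l])) to the source via k ancillas."""
    dim = len(alpha)  # dim = 2**k
    # Ancillas start in |0...0>; state maps (source, ancilla) -> amplitude.
    state = {(l, 0): amp for l, amp in enumerate(alpha)}
    # Transversal cnots copy each basis label: |l>|a> -> |l>|a XOR l>.
    state = {(l, a ^ l): amp for (l, a), amp in state.items()}
    # Apply D to the ancillas only.
    state = {
        (l, a): amp * cmath.exp(1j * phi[a]) for (l, a), amp in state.items()
    }
    # Measure all ancillas in the X basis; outcomes are uniformly random.
    x = rng.randrange(dim)
    source = [0j] * dim
    for (l, a), amp in state.items():
        source[l] += (-1) ** parity(x, a) * amp / math.sqrt(dim)
    norm = math.sqrt(sum(abs(amp) ** 2 for amp in source))
    source = [amp / norm for amp in source]
    # Z correction on source qubit j whenever bit j of x is 1.
    return [(-1) ** parity(x, l) * amp for l, amp in enumerate(source)]


alpha = [0.5, 0.5j, -0.5, 0.5]  # arbitrary normalized two-qubit state
phi = [0.0, 0.3, 1.1, 2.0]      # arbitrary diagonal phases
result = remote_diagonal(alpha, phi, random.Random(0))
```

Whatever outcome is drawn, the corrected source state matches the result of applying the diagonal gate to the source directly, which is the property the derivation above establishes.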

E EDPC implementation
Here we provide Algorithm E.1, which specifies the implementation of EDPC used for our numerical results presented in Section 6.4, here called EDPCI for clarity. EDPCI differs slightly from EDPC (Section 5) and we highlight the differences. Up until Line 7 in Algorithm E.1, EDPCI is the same as EDPC. Then, EDPCI greedily attempts to execute long-range cnots earlier than would occur in EDPC. In particular, EDPC only executes cnots after all available rotations have been executed, whereas EDPCI finds a set $P_c$ on Line 9 such that $P_c \cup P_m$ forms an EDP set. EDPCI then concurrently executes long-range cnots using any edges left over from remote rotations. Moreover, we note that EDPC uses the bounded T-operator set algorithm (Algorithm 3.2) to execute a parallel cnot circuit, which additionally finds a bounded T-operator set $Q_1$ on Line 1, whereas EDPCI only finds $Q_2$ from the bounded T-operator set algorithm if given a parallel cnot circuit. As a consequence of this difference, while EDPC is guaranteed to compile a parallel input cnot circuit to an output circuit with depth upper bounded by $O(\sqrt{n})$, EDPCI does not have this guarantee.
Algorithm E.1: EDPC implementation (EDPCI). Our implementation differs from the EDPC algorithm (Algorithm 5.1) in that it greedily tries to execute cnots earlier.
Input: Circuit C with Paulis commuted to the end and merged with measurement
    remove edges in each P ∈ P_m from G
9   P_c ← approximate max operator EDP set on G with G_c
10  execute concurrent remote rotations along P_m and long-range cnots along P_c using EDP subroutine
11  remove executed rotations from G_m and cnots from G_c