Scalable surface code decoders with parallelization in time

Fast classical processing is essential for most quantum fault-tolerance architectures. We introduce a sliding-window decoding scheme that provides fast classical processing for the surface code through parallelism. Our scheme divides the syndromes in spacetime into overlapping windows along the time direction, which can be decoded in parallel with any inner decoder. With this parallelism, our scheme can solve the decoding throughput problem as the code scales up, even if the inner decoder is slow. When using min-weight perfect matching and union-find as the inner decoders, we observe circuit-level thresholds of $0.68\%$ and $0.55\%$, respectively, which are almost identical to $0.70\%$ and $0.55\%$ for the batch decoding.

Fault-tolerance theory allows scalable and universal quantum computation provided that the physical error rates are below a threshold. Aharonov and Ben-Or, in their seminal work [1], give a fault-tolerance scheme with no classical operations. However, the resulting threshold is far from feasible. Multiple fault-tolerance architectures have since been proposed [2][3][4], but their estimates of the threshold and resource overhead mostly assume instantaneous classical computation.
One prominent architecture [5,6] uses the surface code [7] and achieves universality via magic-state distillation [8]. In particular, the implementation of a non-Clifford gate using a magic state typically involves a classically controlled Clifford correction. Therefore, the error correction between consecutive non-Clifford gates should be fast enough to keep up with the rapidly decohering quantum hardware, so that the error syndromes do not backlog [9]. This requires an adequately high decoding throughput, i.e., the amount of error syndromes that a decoder can process per unit time.
Many decoding schemes for the surface code have high thresholds [10,11], yet they do not address the problem of inadequate decoding throughput. On the other hand, local decoding schemes [12][13][14][15][16][17][18][19] are fast and scalable to a certain degree, but their speed comes at the expense of accuracy. The accuracy of local decoders can be improved by appending a global decoder [20][21][22][23][24][25] while still pursuing relatively high decoding throughput. Other schemes based on specialized hardware [26][27][28][29] have also been proposed. However, to the best of our knowledge, none of the approaches mentioned above has demonstrated adequate accuracy, throughput, and scalability simultaneously.
In this work, we introduce the sandwich decoder for the surface code, which solves the throughput problem using parallelism. Our work is inspired by the idea of "overlapping recovery" in [5] (later rediscovered in [30]), which we reformulate as the forward decoder. Both the sandwich and forward decoders are sliding-window decoders, i.e., they divide the error syndromes in spacetime into overlapping windows in the time direction. However, the forward decoder needs to process these windows sequentially, which limits its throughput. In contrast, our sandwich decoder removes the dependency between the windows so that they can be handled in parallel, e.g., using separate classical processing units [31]. Adjacent sandwich windows may diagnose the same syndromes differently, which would compromise the fault tolerance of the decoding scheme. We reconcile such inconsistencies by decoding the controversial syndromes in a further subroutine.
* These two authors contributed equally.
The parallelism of the sandwich decoder is a great advantage for scalability. An inherently sequential algorithm like the forward decoder can hardly take advantage of parallel computational resources, and thus will have difficulty maintaining adequate throughput when the code distance increases. Meanwhile, the sandwich decoder can solve the throughput problem as the code scales up, as long as it is given enough parallel processing units. Little communication is needed between processing units as there is no dependency between windows. Thus the throughput requirement can be easily satisfied by adding more cores or processors, which is much easier than pushing the processor clock speed. Furthermore, the number of parallel processing units needed only scales with the speed of the quantum hardware and the code distance, not with the length of the quantum computation.
We benchmark the sandwich decoder with the memory experiment for the distance-$d$ rotated surface code [32] under circuit-level noise. In particular, we decode each window using the min-weight perfect matching [5] or union-find decoder [33], and observe numerical thresholds of 0.68% and 0.55%, respectively, for the logical error rate per $d$ cycles of syndrome extraction. These values are almost identical to the corresponding thresholds for the batch decoders. It is reasonable to expect similar preservation of accuracy when using other inner decoders. Consequently, our sandwich decoder may allow one to prioritize the accuracy of the inner decoder, as throughput is assured simply with adequate computational resources.
We focus on the memory experiment that preserves the logical state $|0\rangle$ for the $[[d^2, 1, d]]$ rotated surface code; the argument for other variants of the surface code or logical basis states proceeds analogously. Specifically, we first initialize all the data qubits into the state $|0\rangle$. Then, we repeatedly apply a syndrome-extraction circuit for $n$ cycles and obtain syndromes $\sigma^X_i, \sigma^Z_i \in \{0,1\}^{(d^2-1)/2}$ of the X- and Z-type check operators, respectively, for $i = 1, \dots, n$. Finally, we measure all the data qubits in the Z basis and obtain outcomes $\mu \in \{0,1\}^{d^2}$.
For the surface code, each cycle of detectors is the XOR of two consecutive cycles of syndromes. More precisely, let $\sigma^Z_{n+1}(\mu) \in \{0,1\}^{(d^2-1)/2}$ be the syndromes of the Z-type check operators evaluated from the data qubit measurement outcomes $\mu$. Define
$$\delta^Z_1 = \sigma^Z_1, \qquad \delta^Z_i = \sigma^Z_i \oplus \sigma^Z_{i-1} \ \ (i = 2, \dots, n), \qquad \delta^Z_{n+1} = \sigma^Z_{n+1}(\mu) \oplus \sigma^Z_n, \qquad \delta^X_i = \sigma^X_i \oplus \sigma^X_{i-1} \ \ (i = 2, \dots, n).$$
We further assume that the syndrome-extraction circuit is fault-tolerant [32,34] and that the whole circuit of the memory experiment is afflicted with stochastic Pauli noise. Specifically, each gate, qubit idling, and initialization (resp., measurement) is modeled as the ideal operation followed (resp., preceded) by a random Pauli, referred to as a fault, supported on the involved qubit(s).
Under our assumptions about the circuit and noise model, detectors are 0 in the absence of faults; thus, any defects (detectors with value 1) indicate the presence of faults. Furthermore, the occurrence of each fault flips at most two detectors of each type (X or Z) [32]. We define a detector to be open if there is a fault which flips that detector but no other detector of the same type; otherwise, it is closed. See [34] for examples.
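As an illustration of how detectors and defects are obtained from raw syndromes, the following minimal Python sketch computes the Z-type detectors as XORs of consecutive syndrome cycles. It assumes that the first detector layer equals the first syndrome cycle (the Z syndromes of the initial state are trivial) and that the last layer uses the syndromes computed from the final data-qubit measurement. The function names `z_detectors` and `defects` are hypothetical and not part of any library.

```python
def z_detectors(sigma_z, sigma_z_final):
    """Return the n+1 layers of Z-type detectors.

    sigma_z:        list of n syndrome cycles, each a list of bits
    sigma_z_final:  syndromes evaluated from the final data-qubit measurement
    Assumption: the layer before cycle 1 is all zeros (state |0...0>).
    """
    rounds = [list(r) for r in sigma_z] + [list(sigma_z_final)]
    previous = [0] * len(rounds[0])
    layers = []
    for current in rounds:
        # Each detector is the XOR of two consecutive syndrome cycles.
        layers.append([a ^ b for a, b in zip(current, previous)])
        previous = current
    return layers


def defects(layers):
    """List the (layer, check) coordinates of detectors with value 1.

    Layers are numbered from 1, checks from 0.
    """
    return [(t + 1, j)
            for t, layer in enumerate(layers)
            for j, bit in enumerate(layer) if bit]


# A single measurement fault on check 0 during cycle 2 of 3 flips one
# syndrome bit and therefore two consecutive detectors.
layers = z_detectors([[0, 0], [1, 0], [0, 0]], [0, 0])
print(defects(layers))
```

In this toy input the isolated syndrome flip produces a pair of defects in adjacent layers, matching the statement that each fault flips at most two detectors of a given type.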
Figure 1(b) illustrates a Z-type decoder graph constructed as follows. First, add one vertex for each Z-type detector. Then, add an edge between two vertices (detectors) if there is a fault that flips them both. Finally, for each open detector, add an imaginary detector and an edge connecting them. We also assign each imaginary detector a binary value, such that each fault flips either zero or two Z-type detectors. Each edge in the decoder graph thus represents an equivalence class of faults that flip the same two detectors.
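The construction above can be sketched in a few lines of Python. The sketch assumes that the faults are given as a list of "signatures", i.e., the tuple of Z-type detectors each fault flips (zero, one, or two of them); how these signatures are obtained from a circuit is outside its scope. The name `build_decoder_graph` and the `("imaginary", v)` vertex label are hypothetical conventions chosen for illustration.

```python
from collections import defaultdict


def build_decoder_graph(fault_signatures):
    """Group faults into edges of a decoder graph.

    fault_signatures: iterable of tuples of detector ids, one tuple per
                      fault, listing the detectors of one type it flips.
    Returns a dict mapping an edge (u, v) to the number of faults in its
    equivalence class.
    """
    edges = defaultdict(int)
    for signature in fault_signatures:
        flipped = tuple(sorted(set(signature)))
        if len(flipped) == 0:
            continue  # the fault flips no detector of this type
        if len(flipped) == 1:
            # Open detector: attach one imaginary detector to it.
            (v,) = flipped
            edge = (v, ("imaginary", v))
        elif len(flipped) == 2:
            edge = flipped
        else:
            raise ValueError("a fault may flip at most two detectors per type")
        edges[edge] += 1
    return dict(edges)


graph = build_decoder_graph([(0, 1), (1, 0), (0,), ()])
print(graph)
```

Counting the faults per edge is one simple way to keep the information needed to assign edge weights later; the source does not prescribe this particular bookkeeping.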
The notions "open" and "closed" can be extended to the boundaries of a decoder graph. For example, an X fault on any of the data qubits on the top and bottom boundaries of the lattice in Fig. 1(a) flips only one Z-type real detector (i.e., a bit in $\delta^Z$), which is open by definition. Hence, the top and bottom space boundaries of the 3D decoder graph in Fig. 1(b) can be referred to as open. In contrast, there exists no fault that flips only one detector in $\delta^Z_1$ or $\delta^Z_{n+1}$ (unless on a space boundary). Thus, the past (left) and future (right) time boundaries in Fig. 1(b) are closed.
One can similarly construct an X-type decoder graph; see Fig. 1(c). During the first (resp., last) cycle of syndrome extraction, a measurement fault on the ancilla of any X-type check operator flips only one detector in $\delta^X_2$ (resp., $\delta^X_n$). Hence, both time boundaries are open. When a fault flips only one real detector on an intersection of space and time boundaries, it might help to distinguish whether the fault afflicts, for instance, an ancilla measurement or a boundary data qubit. Thus, in Fig. 1(c) we connect such open real detectors to two imaginary detectors rather than one.
Given that we prepare the logical state $|0\rangle$, we focus on the Z-type decoder graph as depicted in Fig. 1(b).
Sliding-window decoders.-A window consists of a number of consecutive cycles of detectors in the decoder graph; see Fig. 2(a). Both the forward and sandwich decoders work on one window at a time. Within each window, an inner decoder finds a set of edges from the decoder graph that can annihilate all defects. (Formally, a defect is annihilated if it is incident to an odd number of edges in the set.) These edges are regarded as corrections. The corrections assembled from all windows should collectively annihilate every observed defect in the whole decoder graph. Besides the inner decoders, the efficacy of our sliding-window decoders also relies on a crucial feature: any two consecutive windows overlap. These overlaps allow each window to retain only a relatively trustworthy subset of the corrections for later assembly and to discard the rest.
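The annihilation condition can be checked mechanically. The following Python sketch toggles the endpoints of every correction edge and returns the defects that remain; an empty result means that the corrections annihilate all defects. The function name `residual_defects` and the convention of passing the imaginary detectors as a set are assumptions made for illustration.

```python
def residual_defects(defects, corrections, imaginary=frozenset()):
    """Return the defects left after applying a set of correction edges.

    defects:     iterable of real detectors with value 1
    corrections: iterable of edges (u, v)
    imaginary:   set of imaginary detectors, whose values are not observed
    A real detector incident to an odd number of correction edges has its
    value flipped: a defect is annihilated, a non-defect becomes a defect.
    """
    remaining = set(defects)
    for edge in corrections:
        for vertex in edge:
            if vertex not in imaginary:
                remaining ^= {vertex}
    return remaining


# Two defects joined by one edge are both annihilated.
print(residual_defects({(2, 0), (3, 0)}, [((2, 0), (3, 0))]))
# Matching only one of them to an imaginary detector leaves the other.
print(residual_defects({(2, 0), (3, 0)}, [((2, 0), "b")], imaginary={"b"}))
```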
Hereafter, we associate each window with a subgraph of Fig. 1(b) consisting of the real detectors enclosed in the window and certain imaginary detectors. Concretely, the first window contains detectors from $\delta^Z_1$ to $\delta^Z_w$, the second window from $\delta^Z_{1+s}$ to $\delta^Z_{w+s}$, and so forth. That is, windows, each with length $w$, proceed rightwards with step size $s < w$. (The final window contains detectors up to $\delta^Z_{n+1}$.) Analogous to the construction of Figs. 1(b) and 1(c), here imaginary detectors are used such that each fault flips zero or two detectors in a window graph.
For instance, the top and bottom space boundaries of each window are open. Importantly, in the initial window of both decoders, the past time boundary is closed; however, the vertices on the future time boundary are open and are connected with imaginary detectors to account for the faults that flip one detector in $\delta^Z_w$ and the other in $\delta^Z_{w+1}$. See Fig. 2(a). A key distinction between the two decoders is the time-boundary conditions of the remaining windows, which we now discuss separately.
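To make the window allocation concrete, the following Python sketch lists the first and last detector layer of each window for given $n+1$, $w$, and $s$. The rule used for the final window (it starts on the regular grid and simply ends at the last layer, so it may be shorter than $w$) is an assumption of this sketch; `window_ranges` is a hypothetical helper name.

```python
def window_ranges(num_layers, w, s):
    """Return 1-based inclusive (first, last) detector layers per window.

    num_layers: total number of detector layers (n + 1 for Z-type)
    w:          window length
    s:          step size, with 0 < s < w
    """
    if not 0 < s < w:
        raise ValueError("the step size must satisfy 0 < s < w")
    ranges = []
    start = 1
    # Regular windows: those that end strictly before the last layer.
    while start + w - 1 < num_layers:
        ranges.append((start, start + w - 1))
        start += s
    # Final window: contains detectors up to the last layer.
    ranges.append((start, num_layers))
    return ranges


print(window_ranges(10, 6, 3))
```

With this rule the final window always contains at least one layer that is not covered by the previous window, since a new window is opened only while the previous one ends before the last layer.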
Forward decoder.-Figure 2(b) illustrates the forward decoder. For each window before the final one that spans the detectors from $\delta^Z_{1+is}$ to $\delta^Z_{w+is}$, let its past boundary be closed and its future boundary be open. Also, let the core region denote the set of edges in the window that are incident to at least one vertex ranging from $\delta^Z_{1+is}$ to $\delta^Z_{(i+1)s}$, and let the buffer region denote the remaining edges in the window. For the final window, both time boundaries are closed and all edges belong to the core. The core regions in two adjacent windows overlap in exactly the vertices on the past boundary of the later window.
The forward decoder processes the windows in temporal order. Within each window starting from $\delta^Z_{1+is}$, the inner decoder first finds a set of corrections that can annihilate the existing defects but only retains the corrections in the core region. If the current window is the final one, all observed defects will have been annihilated and the decoder will terminate. Otherwise, only the defects from $\delta^Z_{1+is}$ to $\delta^Z_{(i+1)s}$ will be annihilated, and the detectors in $\delta^Z_{1+(i+1)s}$ will be updated and deferred to the next window.
Intuitively, the corrections found in the core become more reliable with a larger buffer region, as the future faults outside the window are less likely to affect the core. A window needs no buffer preceding the core because all past defects have been reliably annihilated, rendering its past time boundary closed. As far as we know, the idea of the forward decoder first appears in [5] with $w = 2s$, and is rediscovered in [30] with $s = 1$ but without specifying the future-boundary conditions of windows. A limitation of the forward decoder is that windows must be processed sequentially due to the defect updates.
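The sequential procedure can be sketched as follows. This is a toy illustration rather than the implementation used in our simulations: it assumes a decoder graph with a single detector per layer and only time-like edges, and it uses a naive inner decoder (`chain_inner_decoder`, hypothetical) that pairs defects in time order and sends a leftover defect to the open future boundary. Real detectors are tuples `(layer, index)`; the string `"future"` stands for an imaginary detector.

```python
def chain_inner_decoder(defects, first, last, future_open):
    """Toy inner decoder for one detector per layer (time-like edges only)."""
    layers = sorted(layer for layer, _ in defects)
    edges = []
    if len(layers) % 2 == 1:
        if not future_open:
            raise ValueError("odd number of defects in a closed window")
        t = layers.pop()  # latest defect goes to the open future boundary
        edges += [((u, 0), (u + 1, 0)) for u in range(t, last)]
        edges.append(((last, 0), "future"))
    for a, b in zip(layers[0::2], layers[1::2]):
        edges += [((u, 0), (u + 1, 0)) for u in range(a, b)]
    return edges


def forward_decode(defects, windows, inner_decoder):
    """Process windows in temporal order, retaining only core corrections.

    windows: list of (first, last) layers; the core of a non-final window
             consists of the layers before the next window starts.
    Returns the accepted corrections and the defects left at the end.
    """
    remaining = set(defects)
    accepted = []
    for k, (first, last) in enumerate(windows):
        final = k == len(windows) - 1
        core_last = last if final else windows[k + 1][0] - 1
        window_defects = {v for v in remaining if first <= v[0] <= last}
        for edge in inner_decoder(window_defects, first, last,
                                  future_open=not final):
            real = [v for v in edge if isinstance(v, tuple)]
            # Retain an edge if it touches at least one core vertex.
            if any(first <= v[0] <= core_last for v in real):
                accepted.append(edge)
                # Update detectors, including the first layer of the
                # next window (the deferred defect update).
                remaining.symmetric_difference_update(real)
    return accepted, remaining


accepted, remaining = forward_decode({(2, 0), (5, 0)},
                                     [(1, 4), (3, 6)], chain_inner_decoder)
print(accepted, remaining)
```

In this example the first window sees a single defect and sends it towards the future boundary, but only the edge touching its core is kept; the resulting defect update lets the second window complete the path between the two original defects.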
Sandwich decoder.-Next, we introduce the sandwich decoder whose windows can be processed effectively in parallel; see Fig. 2(c). Without loss of generality, assume $w + s$ is even and $s \geq 2$.
Except for the past boundary of the initial window and the future boundary of the final window, all the time boundaries of windows are open. For each window that spans the detectors from $\delta^Z_{1+is}$ to $\delta^Z_{w+is}$, the core region contains all edges incident to at least one vertex ranging from $\delta^Z_{(w-s)/2+1+is}$ ($\delta^Z_1$ if initial window) to $\delta^Z_{(w+s)/2-1+is}$ ($\delta^Z_{n+1}$ if final window). The core regions in two adjacent windows overlap in exactly one cycle of detectors, which we refer to as a seam.
The sandwich decoder has two subroutines. The first subroutine decodes all the windows. That is, each inner decoder finds corrections that can annihilate all defects within the window but only retains the ones in the core. Observe that the retained corrections collectively annihilate all defects in the whole decoder graph except on $\delta^Z_{(w \pm s)/2+is}$, i.e., the seams between adjacent core regions. The second subroutine decodes all the seams. Within each seam, it first updates the existing defects using the corrections retained in adjacent cores and then finds another set of corrections to annihilate the updated defects using the same heuristic as that of the inner decoder. Note that the second subroutine can always find valid corrections for each seam since the top and bottom space boundaries of the two-dimensional decoder graph (i.e., a three-dimensional decoder graph with only one layer of detectors and both time boundaries closed) are open. The output of the sandwich decoder consists of corrections from the cores in the first subroutine and the seams in the second subroutine.
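The index arithmetic of the core regions and seams is easy to get wrong, so a small Python sketch may help. It returns, for each window, the range of detector layers it spans and the range of layers whose incident edges form the core, together with the list of seam layers. The treatment of the final window (it ends at the last layer and may be shorter than $w$) is an assumption of this sketch, and `sandwich_layout` is a hypothetical helper name.

```python
def sandwich_layout(num_layers, w, s):
    """Return per-window layer ranges and the seam layers.

    num_layers: total number of detector layers (n + 1 for Z-type)
    Requires s >= 2, s < w, and w + s even, as assumed in the text.
    """
    if s < 2 or s >= w or (w + s) % 2:
        raise ValueError("need 2 <= s < w and w + s even")
    layout = []
    i = 0
    while True:
        first = 1 + i * s
        final = first + w - 1 >= num_layers
        last = num_layers if final else first + w - 1
        # Core vertices: strictly between the two seams of the window,
        # extended to the real time boundary for the first/last window.
        core_first = 1 if i == 0 else (w - s) // 2 + 1 + i * s
        core_last = num_layers if final else (w + s) // 2 - 1 + i * s
        layout.append({"window": (first, last),
                       "core": (core_first, core_last)})
        if final:
            break
        i += 1
    # One seam between each pair of adjacent windows.
    seams = [(w + s) // 2 + k * s for k in range(len(layout) - 1)]
    return layout, seams


layout, seams = sandwich_layout(12, 6, 2)
for entry in layout:
    print(entry)
print("seams:", seams)
```

In the printed example the core layers and the seam layers together cover every detector layer exactly once, which is the property the two subroutines rely on.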
Each window can be decoded independently, and each seam is ready to be decoded once the two adjacent windows have been decoded. Therefore, these windows and seams are naturally parallelizable. The seams can in fact be generalized to three-dimensional windows; see [34]. We also discuss in [34] other generalizations of the sandwich decoder, especially the potential to adapt it to the stability experiment [35] and lattice surgery [36].
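A minimal scheduling sketch of the two subroutines, using Python's standard `concurrent.futures` module, is given below. The callables `decode_window` and `decode_seam` are placeholders to be supplied by the user; their signatures are assumptions of this sketch. For simplicity, the sketch waits for all windows before starting any seam, which is stricter than necessary, since a seam only depends on its two adjacent windows.

```python
from concurrent.futures import ThreadPoolExecutor


def sandwich_schedule(windows, decode_window, decode_seam, max_workers=4):
    """Run the two subroutines of a sandwich-style decoder.

    windows:       list of window descriptions
    decode_window: callable(window) -> retained core corrections
    decode_seam:   callable(left_core, right_core) -> seam corrections
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Subroutine 1: all windows are independent of each other.
        cores = list(pool.map(decode_window, windows))
        # Subroutine 2: each seam needs only its two adjacent cores.
        pairs = list(zip(cores[:-1], cores[1:]))
        seams = list(pool.map(lambda pair: decode_seam(*pair), pairs))
    return cores, seams


# Dummy callables, only to show the data flow.
cores, seams = sandwich_schedule([(1, 6), (3, 8), (5, 10)],
                                 lambda win: win[0],
                                 lambda a, b: a + b)
print(cores, seams)
```

Threads are used here only to keep the sketch short; in CPython, CPU-bound inner decoders would typically need separate processes (e.g., `ProcessPoolExecutor`) or dedicated hardware to obtain an actual speed-up.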
Performance.-We benchmark our sandwich decoder for odd code distances $d = 3, 5, \dots, 17$ with step size $s = (d+1)/2$ and window size $w = 3s$. We employ the independent depolarizing noise model with rate $p$ for the entire circuit.
We take the ansatz that each cycle of syndrome extraction, after applying the corrections, flips the logical qubit independently with probability $p_L(1)$. Assuming perfect data qubit initialization and final measurements, the probability $p_L(i)$ of flipping the logical qubit after $i$ cycles of syndrome extraction satisfies
$$1 - 2p_L(i) = \left(1 - 2p_L(1)\right)^i.$$
We estimate the logical error rate per $d$ cycles $p_L(d)$ via Monte Carlo simulations and observe thresholds of physical error rates $p = 0.68\%$ and $0.55\%$ when using the min-weight perfect matching and union-find decoders, respectively, as the inner decoder. See Fig. 3. We also estimate $p_L(d)$ for batch decoding, i.e., decoding all the syndromes of a memory experiment as a single window, and observe thresholds of 0.70% and 0.55% for the respective inner decoders. See [34] for simulation details and numerical analysis such as varying step size and window size, comparison with the forward decoder, etc.
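Under the independence ansatz, an odd number of logical flips in $i$ cycles occurs with probability $p_L(i) = \frac{1}{2}\left[1 - (1 - 2p_L(1))^i\right]$. The following Python sketch uses this relation to convert a logical error rate measured over $n$ cycles into the rate per $d$ cycles; the function name is hypothetical, and the conversion is only as good as the ansatz itself.

```python
def logical_error_rate_per_d_cycles(p_L_n, n, d):
    """Convert the logical error rate after n cycles to the rate per d cycles.

    Assumes each cycle flips the logical qubit independently, so that
    1 - 2 p_L(i) = (1 - 2 p_L(1)) ** i.
    """
    if not 0 <= p_L_n < 0.5:
        raise ValueError("p_L(n) must lie in [0, 0.5)")
    per_cycle = (1 - (1 - 2 * p_L_n) ** (1 / n)) / 2   # p_L(1)
    return (1 - (1 - 2 * per_cycle) ** d) / 2           # p_L(d)


# Consistency check: start from p_L(1) = 1% and recover p_L(10) from p_L(30).
p30 = (1 - 0.98 ** 30) / 2
print(logical_error_rate_per_d_cycles(p30, 30, 10), (1 - 0.98 ** 10) / 2)
```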
Discussion.-We have validated the sandwich decoder for the surface code with the memory experiment. In the Supplementary Material [34], we provide a parallel divide-and-conquer formalism for our sandwich decoder, which applies to general stabilizer codes and possibly general logical operations. For example, for lattice surgery one may need to divide syndromes into windows along the space direction, as well as the time direction. However, further theoretical justification or numerical validation is needed to fully evaluate our sandwich decoder in real-time decoding for fault-tolerant computation.
We would like to thank Xiaotong Ni for the insightful discussion and Hui-Hai Zhao for providing additional computational resources to speed up the simulation. This work was supported by Alibaba Group through the Alibaba Research Intern Program, and was conducted when X.T. was a research intern at Alibaba Group USA.
Note added.-An independent work [37] was available to the public concurrently with ours.
We focus on the memory experiment for the $[[d^2, 1, d]]$ rotated surface code [1] that preserves the logical state $|0\rangle$; the argument for other variants of the surface code or logical basis states proceeds analogously. Specifically, we first initialize all the data qubits into $|0\rangle$ states. Then, we repeatedly apply a syndrome-extraction circuit for $n$ cycles and obtain syndromes $\sigma^X_i, \sigma^Z_i \in \{0,1\}^{(d^2-1)/2}$ of the X- and Z-type check operators, respectively, for $i = 1, \dots, n$. Finally, we measure all the data qubits in the Z basis and obtain outcomes $\mu \in \{0,1\}^{d^2}$.
For the surface code, each cycle of detectors is the XOR of two consecutive cycles of syndromes. More precisely,
$$\delta^Z_1 = \sigma^Z_1, \qquad \delta^Z_i = \sigma^Z_i \oplus \sigma^Z_{i-1} \ \ (i = 2, \dots, n+1), \qquad \delta^X_i = \sigma^X_i \oplus \sigma^X_{i-1} \ \ (i = 2, \dots, n), \tag{S1}$$
where $\sigma^Z_{n+1}$ are syndromes of the Z-type check operators evaluated from the outcomes $\mu$ of final measurements on the data qubits. We further assume that the syndrome-extraction circuit is fault-tolerant (e.g., see Fig. S1) and that the whole circuit of the memory experiment is afflicted with stochastic Pauli noise. Specifically, each qubit preparation, idle qubit, gate, and qubit measurement is modeled as the ideal operation followed or preceded by a random Pauli fault supported on the involved qubit(s).
Under our assumptions about the circuit and noise model:
1. All detectors are 0 in the absence of faults. We regard the detectors with value 1 as defects.
2. The occurrence of each fault flips at most two detectors of each type (X or Z). We define a detector to be open if there is a fault that flips that detector but no other detector of the same type; otherwise, it is closed.
The goal of a decoding procedure is to annihilate all defects by finding a proper set of corrections, i.e., edges that give rise to exactly the same defects. Formally, a defect is annihilated if it is incident to an odd number of edges in the set.

A. Open and closed boundary conditions
In our memory experiment, each three-dimensional decoder graph has six boundaries: four space boundaries and two time boundaries. Each space boundary consists of all the real detectors adjacent to the top-, bottom-, left-, or right-most data qubits (in the directions of Fig. S2(a)) of each layer, respectively. The first and last layers of real detectors are the time boundaries, representing the detectors at the time of data qubit initialization and final data qubit measurements.
Remark 3. One way to intuitively justify the words "open" and "closed" is by looking at the forms of undetectable errors. For codes without space boundaries (such as the toric code), an undetectable error always looks like a cycle or a combination of cycles, either topologically trivial (in which case it will never cause a logical error) or not (in which case it may be a logical operator). For codes with space boundaries, an undetectable error can also be a path with both ends at the open boundaries, as if the path goes into and out of the code patch through those boundaries.
To understand open and closed time boundaries, let us continue with the quantum memory experiment for a logical qubit $|0\rangle$. For the Z-type decoder graph, both time boundaries are closed since every fault flips exactly two Z detectors (including the imaginary detectors on the open space boundaries).
Example 4. Suppose there is only a Z stabilizer measurement error during the last cycle of syndrome extraction, i.e., $\sigma^Z_n$ has exactly one nonzero entry and $\sigma^Z_i = 0$ for all $i \neq n$. It follows from Eq. (S1) that there are only two defects, one in $\delta^Z_n$ and one in $\delta^Z_{n+1}$. The edge connecting these two defects indicates the Z stabilizer measurement error.
Example 5. Suppose there is only a data qubit measurement error at the end of the memory experiment. We have $\sigma^Z_i = 0$ for $i \in \{1, \dots, n\}$. As we calculate the final set of Z syndromes $\sigma^Z_{n+1}$ based on the data qubit measurement results $\mu$, one flipped data qubit affects all the check operators that involve it. Therefore, $\sigma^Z_{n+1}$ has 1 or 2 non-trivial syndromes and $\delta^Z_{n+1}$ has 1 or 2 defects. With the imaginary detectors on the open space boundaries, any single data qubit measurement fault flips exactly two detectors.
However, for the X-type decoder graph, the outcome of the first cycle of X stabilizer extraction is a random binary string even if there are no errors. Therefore, we need to make the initial time boundary open (i.e., allowing the bottom detection events to connect to some virtual vertices) in order to explain those non-trivial syndrome measurement results. Similarly, the ending time boundary also needs to be open, since the outcome $\mu$ of the final Z-basis measurement of the data qubits provides no information about the X syndromes.
We emphasize that the open and closed boundary conditions are not just some mathematical tricks that marginally improve the performance of the decoder; on the contrary, to get meaningful results from the quantum memory experiment, one must correctly close or open the boundaries according to the context (e.g., the code or the specific type of errors).
As we focus on the memory experiment of preserving logical $|0\rangle$, our goal is to prevent the logical Z operator from being flipped (an odd number of times). But if one of the time boundaries in the Z-type decoder graph is open, there will be low-weight (i.e., short) undetectable X errors with both endpoints on that time boundary. Such an undetectable X error can easily flip the logical Z operator, violating the principle that a failure requires at least $d/2$ physical errors. On the other hand, when both time boundaries are closed, the only open boundaries are the top and bottom space boundaries, and low-weight X errors starting and ending at one of those boundaries can only flip the logical Z operator an even number of times. To flip the logical Z operator, an error must cross from one open space boundary to the other, but then it is a logical operator with weight $\geq d$, as desired.
… will be offset by a certain number of layers, which we refer to as the step size. The following three steps for a general sliding-window decoder give an intuitive understanding of how it works: 1. allocate all the detectors to multiple windows, 2. decode each window, and 3. combine all the individual corrections such that the overall corrections are still valid.
The idea of sliding-window decoders seems valid based on the intuition that to decode any part of the entire decoder graph in spacetime, we only need a relatively small amount of local information on errors, e.g., detectors in a window of size O(d).However, there exist very few implementations which can properly handle all three steps for sliding-window decoders (particularly, the last two) and nicely preserve the logical information throughout the quantum circuit as time increases.
In this paper, we rigorously define and analyze the sliding-window decoder framework, including the insights and limitations of an existing implementation [2], and propose two variants: the forward decoder (Section S2 C) and the scalable sandwich decoder (Section S2 D). They not only address the aforementioned problems of a batch decoder, but also have other desirable advantages. In summary:
• (forward, sandwich) they scale well in code distance, with almost the same threshold as that of a batch decoder,
• (forward, sandwich) they scale well in time as the number of QEC cycles increases, with almost the same logical error rate as that of a batch decoder,
• (sandwich) all windows can be decoded in parallel, allowing multiple classical execution units to effectively increase the decoding throughput,
• (sandwich) it has a clear potential to perform logical operations on multiple qubits (in Section S5 A, we briefly discuss how one may approach the problem of generalizing the sandwich decoder to the case of lattice surgery).

B. Preliminaries
Before revealing more details of the sandwich decoder, we shall introduce some important concepts.
Inner decoder
In a sliding-window decoding scheme, each window is treated as an independent decoding task with its own decoder graph. Thus, each window is fed into an inner decoder, a subroutine that generates corrections for that window only. Therefore, designing a sliding-window decoder includes choosing an inner decoder (e.g., UF or MWPM) and an appropriate decoder graph (e.g., open and closed boundary conditions) for each window. We describe our design choice of the inner decoder for our experiments in Section S3 A.

Artificial time boundaries
The idea of allocating detectors in a decoder graph into windows naturally creates two artificial time boundaries for each window (except for the first and the last ones, each of which has one real time boundary and one artificial time boundary).Note that these artificial time boundaries do not represent any real initialization or termination of the memory experiment.Therefore, they should naturally be open, indicating that there may still be detectors at the other side of each artificial time boundary unknown to the current window.
From the perspective of the current window, an isolated defect on such a time boundary is more likely to be caused by some faults from the future or the past.However, if the inner decoder regards this boundary as closed, it will be forced to generate corrections of a higher weight within the current window, which is more likely to cause a logical error in the final result.
However, with some modifications, an artificial time boundary of a window may also become closed.For an example, see the syndrome propagation procedure described in Section S2 C.

Core regions and buffer regions
In both forward and sandwich decoders, each inner decoder takes all the detectors within a window as input and returns corrections for the entire window. However, since adjacent windows generally overlap, we do not need to apply all these corrections. Instead, each window only accepts the part of the corrections that is relatively reliable and disregards the rest. The region where corrections are accepted is called the core region, and the rest of the window is called the buffer region. The size of a core region in general equals the step size. The buffer regions are usually close to the open artificial time boundaries since their corrections are less trustworthy. We will be more precise about these two regions later.
Intuitively, the buffer regions between windows let them share precious contextual information on errors with each other.Therefore, having a large buffer is beneficial when we merge individual corrections generated by each inner decoder back to the entire decoder graph.The corrections accepted in the core region will be reliable enough only when each buffer region is large enough.
Correction consistency
One important principle of surface codes is that the "correct" corrections are not unique: Any two sets of corrections that differ by one or more stabilizers are logically equivalent. This fact poses a challenge for sliding-window decoders: Even if each decoder window individually finds a "correct" set of corrections, there may not be an obvious method to combine them into a consistent set of corrections for the entire decoder graph.
An example of this is illustrated in Fig. S7(b): The windows labeled ① and ② return inconsistent corrections along the "seam" where we want to merge. The possibility of such an inconsistency means that the combined corrections may not annihilate all defects and that even a low-weight fault may cause a logical error.

C. Forward decoder
Our forward decoder is a generalization of the "overlapping recovery" method introduced in [3] as well as the sliding-window scheme used in LILLIPUT (a Lightweight Low Latency Look-Up Table decoder) proposed by Das et al. in [2].The main idea is to sequentially decode each window one at a time, propagating necessary information from the current window to the next window with the syndrome propagation procedure implicitly introduced in [3] and more explicitly described in [2].

Syndrome propagation
The syndrome propagation procedure solves the inconsistency problem by forcing the next window to output consistent corrections with the current window.The detectors on the oldest layer of the next window are updated according to the decoding results for the current window, which only accepts corrections old enough to be outside of the next window.
Setting the last window aside for now, let $s$ be the step size. The core region of each forward window includes the edges within the oldest $s$ layers and the edges connecting the $s$-th oldest layer and the $(s+1)$-th oldest layer. In this case, we say that the core region for a forward window has size $s$. What remains is the buffer region, which consists exactly of the edges that the window shares with the next window. See Fig. S4(a) for an example. Hence, all the defects within the oldest $s$ layers will be annihilated. The detectors in the $(s+1)$-th oldest layer will be updated if some accepted corrections connect the $s$-th oldest layer to the $(s+1)$-th oldest layer. All other detectors in the remaining layers stay the same. Therefore, when the next window starts from the $(s+1)$-th layer of the current window, only this layer is updated (Fig. S7(a)).
This way, for each window, since all the defects prior to it have been annihilated by previously accepted corrections, it has a closed past boundary and an open future boundary.The closed boundary means that the inner decoder will not change any corrections already accepted in the past, and thus consistency between windows is ensured.
Note that in LILLIPUT, it is unclear whether the future time boundary for each window is open or closed. But according to the arguments in Section S2 B, this boundary should be open by its nature.

Handling the last window
The size of the last window depends on the total number of cycles in the memory experiment, so it may be smaller than the regular intermediate windows. For the Z-type decoder graph in the memory experiment preserving logical $|0\rangle$, both time boundaries of the last window are closed. All the generated corrections are accepted, i.e., the entire window is the core region.

Step size
Compared with LILLIPUT, which fixes the step size to be 1, our implementation of the forward decoder allows a flexible step size. The advantage of having a step size of 1 is that for any fixed window size w, the length of the buffer region b = w − 1 is maximized. However, moving forward only one QEC cycle at a time has an obvious disadvantage in terms of the total time complexity of decoding all windows. In an experiment with a total of n surface code cycles, the number of sliding windows needed is O(n), and each window needs at least $O(pd^2 w)$ time to decode, making the total time complexity $O(pd^2 nw)$. The extra factor of O(w) severely limits the code distance that can be implemented in practice and may force an implementation to use many more classical computational resources to achieve the desired throughput. This problem can be solved by moving forward s = O(w) cycles at a time, while still preserving the same amount of buffer b = w − s. Importantly, when b is fixed, increasing the size of each window w will increase the step size s and thus decrease the total time complexity (assuming a linear time complexity for the underlying decoder). For example, if we set s = b and w = 2b, then the total time complexity becomes
$$O\left(pd^2 w \cdot \frac{n}{s}\right) = O(pd^2 n),$$
i.e., only a constant overhead compared to the batch decoder.
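The effect of the step size on the total work can be checked numerically. The sketch below is an illustration under a stated assumption: the decoding cost of a window is taken to be proportional to its number of layers (the linear-time assumption mentioned above), so the overhead relative to batch decoding is simply the total number of layers handed to the inner decoder divided by n. The function name `decoded_layers` is hypothetical.

```python
def decoded_layers(n, w, s):
    """Total number of layers processed by forward windows of size w and step s."""
    if n < 1 or not 1 <= s <= w:
        raise ValueError("need n >= 1 and 1 <= s <= w")
    total, start = 0, 0
    while True:
        stop = min(start + w, n)
        total += stop - start
        if stop == n:
            return total
        start += s


n, b = 1000, 5
# Same buffer size b in both cases, different step sizes.
overhead_step_1 = decoded_layers(n, w=b + 1, s=1) / n   # s = 1, w = b + 1
overhead_step_b = decoded_layers(n, w=2 * b, s=b) / n   # s = b, w = 2b
print(overhead_step_1, overhead_step_b)
```

With the buffer held at b = 5, the step-1 layout processes roughly w = 6 times as many layers as the batch decoder, while the s = b layout processes roughly twice as many, which is the constant overhead mentioned in the text.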

Problem with parallelization
A notable disadvantage of the forward approach is the strict dependency among all windows: one cannot start decoding the next window until the current window is finished. Therefore, once the decoding time of each window fails to keep up with the s cycles of newly extracted syndromes, where s is the step size, the latency accumulates. This is one of the reasons why LILLIPUT employs look-up tables (LUTs) as its inner decoders. Because the LUTs are generated offline (in advance) with MWPM, they are fast and accurate when decoding each window; however, the huge amount of memory needed to store these LUTs also restricts LILLIPUT to code distances up to 5 and window sizes up to 3.
For sliding-window decoders, it would be a huge boost to the throughput if we could decode all windows in parallel. However, this is impossible for a forward decoder, since this data dependency causes the critical path to run through all windows.

D. Sandwich decoder
We propose an alternative approach to sliding-window decoders that also allows adequate buffer regions and solves the correction consistency problem, but has a much shorter critical path that does not increase with the total number of windows and thus enables parallelism between windows. We name this approach the sandwich approach, which comes with two interpretations, as we will mention below.
We shall introduce two types of sandwich windows. If we do not specify the type of a window, it should be clear from the context.

Type-1 windows: buffer regions in both directions
Recall that syndrome propagation is the key ingredient in solving the correction consistency problem in the forward decoders, but it is also what causes the long critical path, since it only proceeds in the forward direction. On the other hand, since the decoder graph formalism is symmetric with respect to the direction of time, each window should be capable of propagating syndromes in the backward direction as well.
It follows naturally that the core region of each window (we shall ignore the first and the last windows for now) can be "sandwiched" by two buffer regions, resulting in the definition of a type-1 sandwich window. If the step size is s, then the core region includes all edges within the middle s layers. There is also some freedom regarding the specific design of the core region for a type-1 sandwich window. For example, in the experiments reported in this paper, we also add to the core region the edges connecting the latest of the middle s layers to its next layer. We say that the core region has size s. What remains are two separate buffer regions, each of which has size b, and the window size is thus w = s + 2b. See Fig. S4(b) for an example. Each type-1 sandwich window overlaps with each of its next and previous windows in 2b layers of detectors. Both artificial time boundaries are open.
As with the forward decoders, the sandwich decoders also have a flexible step size. Most importantly, type-1 windows are not dependent on any other window, so all of them can be decoded in parallel.
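A simplified layout of type-1 windows can be sketched as follows. This is only an illustration under assumptions that are not taken from the paper: each detector layer is assigned to exactly one core (so shared seam layers are not modeled), ranges are half-open, and the first and last cores are simply extended to the ends of the experiment. The function name `type1_windows` is hypothetical.

```python
def type1_windows(n_layers, s, b):
    """Lay out type-1 sandwich windows of size w = s + 2b over n_layers layers."""
    w = s + 2 * b
    if s < 1 or b < 0 or n_layers < w:
        raise ValueError("need s >= 1, b >= 0 and n_layers >= s + 2b")
    windows = []
    k = 0
    while True:
        start = k * s
        stop = min(start + w, n_layers)
        is_first, is_last = k == 0, stop == n_layers
        # Interior windows keep the middle s layers; the first and last
        # windows have a buffer on one side only.
        core_start = start if is_first else start + b
        core_stop = stop if is_last else start + b + s
        windows.append({"window": (start, stop), "core": (core_start, core_stop)})
        if is_last:
            return windows
        k += 1


layout = type1_windows(n_layers=14, s=2, b=2)
for entry in layout:
    print(entry)
```

In this layout consecutive windows overlap in 2b layers, and no window needs any information from another window before it can be decoded.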

Type-2 windows: merging corrections
Since type-1 windows propagate syndrome information forwards and backwards, there must be another type of window that receives syndrome information from both ends, which we define as type-2 windows. Since syndrome propagation closes the corresponding time boundary of the receiving window, each type-2 sandwich window has both time boundaries closed, and the entire window is the core region.
Since each type-2 window is sandwiched between two type-1 windows, it is dependent on the decoding results of the two independent type-1 windows. But since all type-2 windows are independent of each other, they can be decoded in parallel.
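The resulting two-phase dependency structure can be sketched with Python's standard thread pool. This shows only the scheduling pattern: `decode_sandwich`, `toy_type1`, and `toy_type2` are hypothetical names, and the toy functions are placeholders for a real inner decoder. A thread pool is used for brevity; a practical implementation would more likely use processes or native threads.

```python
from concurrent.futures import ThreadPoolExecutor


def decode_sandwich(type1_windows, decode_type1, decode_type2, max_workers=4):
    """Decode all type-1 windows in parallel, then all type-2 windows in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Phase 1: type-1 windows do not depend on any other window.
        type1_results = list(pool.map(decode_type1, type1_windows))
        # Phase 2: the k-th type-2 window needs only type-1 results k and k + 1.
        pairs = zip(type1_results, type1_results[1:])
        type2_results = list(pool.map(lambda pair: decode_type2(*pair), pairs))
    return type1_results, type2_results


# Toy stand-ins for the inner decoder (placeholders, for illustration only).
def toy_type1(window_id):
    return {"left_seam": f"L{window_id}", "right_seam": f"R{window_id}"}


def toy_type2(former, latter):
    return (former["right_seam"], latter["left_seam"])


type1, type2 = decode_sandwich(range(4), toy_type1, toy_type2)
print(type2)
```

Whatever the number of windows, the critical path in this schedule is two inner-decoder calls long: one type-1 window followed by one type-2 window.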
After we apply the corrections from type-1 windows, there may exist defects not annihilated yet. The role of the type-2 windows is to reconcile this inconsistency by neutralizing all remaining defects. To illustrate this more clearly, we call the latest layer of detectors in a core region the right seam and the oldest layer the left seam. The core regions of two adjacent type-1 windows can have three patterns (Fig. S5), and we refer to this difference using a parameter called the seam offset.
• When the seam offset is 0, the right seam of the former core overlaps with the left seam of the latter core. The type-2 windows thus become two-dimensional and include the updated detectors in these pairs of seams.
• When the seam offset is positive, the core regions do not have any overlap. The 3D type-2 windows thus include the detectors that are in neither of the type-1 windows and the updated detectors in the pairs of right and left seams.
• When the seam offset is negative, the core regions overlap in more than one layer. The 3D type-2 windows thus include the corrections for the overlapping defects from type-1 windows.
In each case, the size of a type-2 window is |t| + 1, where t is the seam offset. After decoding all type-2 windows and applying every correction, all defects will be annihilated.

Handling the first and the last windows
The first window should correct not only the middle s layers, but also all the layers before them, since it does not have a previous window. Therefore, the first window has only one buffer of size b and the core region has size s + b. Similarly, the last window corrects the middle s layers as well as all the layers after them. It also has only one buffer of size b and the core region has size s + b.

Comparison with the forward decoder
Fig. S7 visualizes the difference between the forward approach and the sandwich approach. Note that, for simplicity of illustration and implementation, we set the seam offset to be 0, causing the type-2 windows to degenerate into 2D decoder graphs.
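Returning to the seam offset, the three cases for type-2 windows can be made concrete with a small helper. The sign convention used here, t = (left seam of the latter core) − (right seam of the former core) measured in layer indices, is an assumption chosen to match the three cases in the text; the function name `type2_layers` is hypothetical.

```python
def type2_layers(right_seam, next_left_seam):
    """Return the seam offset t and the layers spanned by the type-2 window.

    right_seam:     layer index of the right seam of the former core
    next_left_seam: layer index of the left seam of the latter core
    """
    t = next_left_seam - right_seam
    lo, hi = min(right_seam, next_left_seam), max(right_seam, next_left_seam)
    return t, list(range(lo, hi + 1))  # always |t| + 1 layers


for seams in [(7, 7), (7, 9), (7, 5)]:
    t, layers = type2_layers(*seams)
    print(f"seam offset {t:+d}: type-2 window spans layers {layers}")
```

An offset of 0 yields a single layer (the 2D case), while positive and negative offsets both yield |t| + 1 layers, covering the gap between the cores or their overlap, respectively.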

S3. MONTE CARLO SIMULATION OF THE THRESHOLD

A. Union-find inner decoders
In most of our experiments, we use a union-find (UF) decoder, as proposed in [4], to decode each individual window (including the forward windows and both types of the sandwich windows). In our implementation, we use the weighted growth version of the decoder described in Section 5 of [4], although due to implementation considerations, our definition of the "boundary size" may be slightly different from the definition used in that paper.
We chose the UF decoder due to its low time complexity both in theory and in practice, but it was unclear whether the UF decoder is the best fit for a sliding-window scheme. The reason is that, in terms of its theoretical foundation, the UF decoder does not try to approximate the minimum-weight correction; instead, it tries to find an equivalence class that is likely to contain the actual error, and then it chooses an arbitrary correction in that equivalence class with a simple peeling decoder. This means that the updated detectors at the right and left seams obtained by applying

Algorithm 1 provides one generalization of the sandwich decoder regarding disjoint core regions across windows (such as having non-negative seam offset). One can easily construct a similar variant for overlapping core regions. Let denote the union of disjoint sets.

Algorithm 1 Generalized Sandwich Decoder GS(V, E, D)
1: if (V, E) consists of disconnected subgraphs, each of a small enough size then
2: Apply the inner decoder to each subgraph in parallel, return the union of the outputs
3: Apply the partition method to choose "cores" $\{C_i \subseteq V\}_i$ with disjoint $\{\Delta(E, C_i)\}_i$, each of a small enough size
4: Apply the inner decoder in parallel to calculate corrections $K_i \subseteq \Delta(E, C_i)$, for all $i$, with $\partial K_i \cap C_i = D \cap C_i$

Algorithm 1 always terminates after finite recursions since |V| is strictly increasing. The emphasis of the algorithm is on how to parallelize the computation. It does not address the chance of failure, or the chance of logical error when valid corrections are output. Thus, its success and accuracy require additional theoretical or empirical justifications. When it does not fail, however, the output is valid. To see this, first observe that Step 5 defines a valid input instance, i.e., all vertices involved in E are in V, and D ⊆ V. The former follows from the definition of V and E. To see the latter, note
where $\bar{C}_i$ denotes the set complement of $C_i$. It then follows from $\partial K_i \cap C_i = D \cap C_i$, as required in Step 4, that
Note that for all $i$, $\bar{C}_i \cap \partial(K_i) \subseteq V$, for otherwise there will be a $j$, $j \neq i$, such that $\Delta(E, C_i) \cap \Delta(E, C_j)$ has an element in $\partial(K_i)$, a contradiction to the disjointness of $\{\Delta(E, C_i)\}_i$. Therefore D ⊆ V. By an inductive argument,
Thus, the algorithm always outputs valid corrections, provided it does not fail.

The algorithm also leaves significant freedom in choosing the inner decoder and the partition method. Our sandwich decoder partitions the input graph along the time direction, which disconnects the graph, resulting in a depth-2 recursion. It also guarantees success based on the graph properties of the windows. An alternative inner decoder to what we experiment with is pre-computed lookup tables, when one sets the base input size to be small enough. It would be interesting to explore and evaluate the many design choices.
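The condition imposed on each core in Step 4 can be checked mechanically. The sketch below rests on two assumptions that are not confirmed by the text above: ∂K is taken to be the set of vertices incident to an odd number of edges in K, and Δ(E, C) is taken to be the set of edges of E with at least one endpoint in C. The function names are hypothetical.

```python
from collections import Counter


def boundary(edges):
    """Vertices incident to an odd number of the given edges (assumed meaning of ∂)."""
    degree = Counter(v for edge in edges for v in edge)
    return {v for v, k in degree.items() if k % 2 == 1}


def incident_edges(E, C):
    """Edges of E touching the core C (assumed meaning of Δ(E, C))."""
    return {edge for edge in E if edge[0] in C or edge[1] in C}


def satisfies_step4(K, E, C, D):
    """Check K ⊆ Δ(E, C) and ∂K ∩ C = D ∩ C for a single core C."""
    return K <= incident_edges(E, C) and (boundary(K) & C) == (D & C)


# Path graph 0 - 1 - 2 - 3 - 4 with core {1, 2} and defects {1, 4}.
E = {(0, 1), (1, 2), (2, 3), (3, 4)}
C, D = {1, 2}, {1, 4}
print(satisfies_step4({(0, 1)}, E, C, D))          # pairs defect 1 outside the core
print(satisfies_step4({(1, 2), (2, 3)}, E, C, D))  # pushes the defect past the core
print(satisfies_step4({(1, 2)}, E, C, D))          # leaves a new defect at vertex 2
```

Both accepted corrections clear the defect inside the core but leave an odd-degree vertex outside it (vertex 0 or vertex 3), which is the kind of residual defect that a later recursion level has to handle.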

Stability experiment
One omission from this paper is the stability experiment as described in [6]. A technical problem we have encountered when trying to apply our scheme to the stability experiment is that, since the surface code patch used in the stability experiment has closed space boundaries on all sides, the decoder graph for a type-2 window would not have any open boundary at all. Yet it can still get an odd number of defects as the input if the two adjacent windows yield completely different corrections. Furthermore, it takes only O(w) errors to cause such an irreconcilable inconsistency, whereas the stability experiment is supposed to be able to tolerate any O(n) errors (recall that in our notation, n is the total number of cycles in the experiment and w is the number of cycles in a window).
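The reason an odd number of defects is irreconcilable in a graph without open boundaries follows from a standard counting argument (the handshake lemma), stated here as a supplementary remark rather than as part of the original analysis. Let $K$ be any set of edges in a decoder graph with no imaginary detectors, let $\deg_K(v)$ be the number of edges of $K$ incident to vertex $v$, and let $\partial K$ be the set of vertices with odd $\deg_K(v)$. Then

$$\sum_{v} \deg_K(v) = 2|K| \quad\Longrightarrow\quad |\partial K| \equiv 0 \pmod{2}.$$

Since every edge set annihilates an even number of defects, no correction $K$ can satisfy $\partial K = D$ when $|D|$ is odd.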
One may argue that this is probably an inherent disadvantage of sliding-window schemes: the forward-window scheme does not have a decoder graph without any open boundary, but O(w) errors can cause a logical error instead. However, the real and also more interesting question is whether it is meaningful to divide the stability experiment into different windows in the direction of time.
We note that one of the main motivations for the stability experiment is to emulate the "space-like parts" that arise in various useful logical operations with lattice surgery, such as moving a qubit or doing a two-qubit parity measurement. Ideally, each of those "space-like parts" should last only for O(d) surface code cycles, since adding more cycles has diminishing returns for suppressing time-like logical errors and is detrimental for suppressing space-like logical errors (of the opposite X/Z type). Therefore, it makes less sense to divide the stability experiment into windows by time, as opposed to the memory experiment, which in practical scenarios can last much more than O(d) cycles (depending on the number of logical operations applied on a logical qubit). On the other hand, it is more plausible that the spatial span of a "space-like part" is significantly larger than d, depending on the physical distance on the surface code lattice between the qubits involved. Thus, it may make more sense to consider stability experiments on an elongated rectangular code patch and divide it into windows in a spatial direction instead. Such a sliding-window decoder would probably be much easier to formulate too, as it is not fundamentally different from a sliding-window decoder for the memory experiment, only with the roles of the time dimension and one spatial dimension switched.

Lattice surgery
By the same token, it is not a stretch to generalize our sliding-window decoders to some more useful operations in lattice surgery, like the aforementioned qubit movements and two-qubit parity measurements. For example, the two-qubit parity measurement has an overall "H"-shaped decoder graph, as opposed to the rectangular box-shaped decoder graph for the memory experiment (where the box is elongated in the temporal direction) or the stability experiment (where the box is elongated in a spatial direction), but it is still straightforward to divide the graph into 3D windows, each with dimensions O(d) × O(d) × O(d), as shown in Fig. S14. Two "T"-shaped windows simply need to propagate seam syndromes in three directions instead of two. Other lattice surgery operations may add more complexity to the scheme (for example, a twist defect may require combining the X decoder graph and the Z decoder graph in some way), but it seems that the same principle should be able to handle everything.

B. Theoretical analysis
This paper analyzes sliding-window decoders mainly via simulation experiments accompanied by some intuitive justifications for some of our design choices. Many of those choices have been made because the alternative would cause the effective code distance to be less than d. For example, without any form of syndrome propagation, and assuming that the underlying decoder is a black-box MWPM decoder which may output any correction with minimum weight, a constant-weight error would be enough to make the correction inconsistent and flip any logical operator representative. Similarly, without an adequate buffer region between windows, there exists an error with weight d/4 + O(1) that causes a logical error, making the effective code distance d/2 + O(1) instead of d.
However, we have not proven that the effective code distance of our current proposed scheme is indeed d. A rigorous proof would not only further justify our scheme, but also potentially uncover some alternative choices that could have been made without breaking the effective code distance guarantee.
Of course, the effective code distance is only half of the story, and arguably it is the less realistic half, since surface codes are not usually supposed to work in the regime where the total number of faults is expected to be less than half of the code distance. Instead, the main strength of surface codes is their ability to correct most of the possible error configurations even when the total number of faults is much larger than the code distance. This can be well captured by the code threshold. A theoretical lower bound on the threshold of our scheme, even a very loose one, would be an interesting result.

FIG. 1. Decoder graphs of Z-type (b) and X-type (c) for a memory experiment with 3 syndrome-extraction cycles that preserves $|0\rangle$ of a distance-5 rotated surface code (a). Data qubits (black) reside on the plaquette corners. Check operators of Z-type (blue) and X-type (red) are measured with ancilla qubits (empty circles) on the plaquette centers. Blue and red vertices denote Z- and X-type real detectors, respectively; white vertices denote imaginary detectors. Each edge represents the set of faults that flip the incident detectors. Each decoder graph has two open space boundaries, whereas the X-type decoder graph also has two open time boundaries.
FIG. 2. (a) Decoder graph of a first window of length w = 3. As in Fig. 1(b), blue and white vertices denote Z-type real and imaginary detectors, respectively. Each edge represents the set of faults that flip the incident detectors. The past time boundary (red line) is closed; the top and bottom space boundaries (black dashed line) and the future time boundary (red dashed line) are open. Imaginary detectors near the future time boundary are equivalent to vertices in δ Z 4 . Forward decoder (b) and sandwich decoder (c). Observed defects (pink dots) are annihilated by the corrections (blue lines). Corrections in the cores (brown regions) are retained, whereas those in the buffers (transparent regions) are discarded. Retained corrections can create updated defects (orange dots) on the seams between adjacent cores. Imaginary detectors reside near the open boundaries (dashed lines), and only those incident to corrections are shown (empty circles). The forward decoder must decode the windows sequentially in the order labeled, whereas the sandwich decoder can in principle decode the windows in parallel.

[Plots for FIG. 3. Axes: physical error rate $p$ and logical error rate per $d$ cycles $p_L(d)$. Marked thresholds: $6.8 \times 10^{-3}$ in (a) and $5.5 \times 10^{-3}$ in (b).]
FIG. 3. Threshold plots for the sandwich decoder with min-weight perfect matching (a) and union-find (b) as the inner decoder. For fixed code distance $d = 3, 5, \ldots, 17$ and physical error rate $p$, we first vary the number of cycles of syndrome extraction $n$ and simulate each memory experiment for $10^5$ shots. Then, we collect the estimated logical error rates per shot for varying $n$ and calculate the logical error rate per $d$ cycles $p_L(d)$, depicted as dots. Error bars indicate 95% statistical confidence and dashed lines indicate the thresholds.
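The caption does not spell out how per-shot rates are converted into a rate per d cycles. One common convention, assumed here and not confirmed by the source, treats every block of d cycles as flipping the logical qubit independently with probability $p_L(d)$, so that a shot of n cycles fails with probability $\frac{1}{2}\left[1 - (1 - 2p_L(d))^{n/d}\right]$. Inverting this relation for a single value of n gives the sketch below; the function name `per_d_cycle_rate` is hypothetical, and the paper's actual procedure, which combines several values of n, may differ.

```python
def per_d_cycle_rate(p_shot, n_cycles, d):
    """Convert a per-shot logical error rate into a rate per d cycles.

    Assumes independent logical flips per block of d cycles (an assumption,
    not necessarily the procedure used for the figure).
    """
    if not 0.0 <= p_shot < 0.5:
        raise ValueError("p_shot must lie in [0, 0.5)")
    if n_cycles <= 0 or d <= 0:
        raise ValueError("n_cycles and d must be positive")
    return 0.5 * (1.0 - (1.0 - 2.0 * p_shot) ** (d / n_cycles))


# Round trip: start from a known per-d-cycle rate and recover it.
d, n_cycles, true_rate = 5, 50, 0.01
p_shot = 0.5 * (1.0 - (1.0 - 2.0 * true_rate) ** (n_cycles / d))
print(p_shot, per_d_cycle_rate(p_shot, n_cycles, d))
```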

FIG. S1. Rotated surface code $[[d^2, 1, d]]$ with d = 3 (a). Data qubits (black) reside on the plaquette corners. Check operators of Z-type (blue) and X-type (red) are measured using the circuits in (b) and (c), respectively, with ancilla qubits (empty circles) on the plaquette centers. First, prepare each ancilla in the $|0\rangle$ or $|+\rangle$ state; then, apply CNOT gates on qubit pairs connected by black links, in the order specified by the numbers on the plaquette corners; finally, measure each ancilla in the Z or X basis.

Example 1.
Figure S2(b) illustrates a Z-type decoder graph constructed as follows. First, add one vertex for each Z-type detector. Then, add an edge between two vertices (detectors) if there is a fault that flips both. Finally, for each open detector, add an imaginary detector and an edge connecting them. We also assign each imaginary detector a binary value, such that each fault flips either zero or two Z-type detectors. Each edge in the decoder graph thus represents an equivalence class of faults which flip the same two detectors. The goal of a decoding procedure is to annihilate all defects by finding a proper set of corrections, that is, edges which give rise to exactly the same defects. Formally, a defect is annihilated if it is incident to an odd number of edges in the set.
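The construction above can be sketched in Python for a toy fault model. This is a minimal illustration with assumptions made for this sketch: each fault is given directly as the tuple of one or two real detectors it flips, a detector is treated as open when some fault flips it alone, and the names `build_decoder_graph` and `annihilates` are hypothetical.

```python
from collections import Counter


def build_decoder_graph(faults):
    """Build decoder-graph edges from faults that flip one or two real detectors."""
    edges, imaginary = set(), {}
    for flipped in faults:
        if len(flipped) == 2:
            edges.add(frozenset(flipped))
        elif len(flipped) == 1:
            (det,) = flipped
            # An open detector gets one imaginary partner and an edge to it.
            imaginary.setdefault(det, ("imaginary", det))
            edges.add(frozenset((det, imaginary[det])))
        else:
            raise ValueError("each fault must flip one or two detectors")
    return edges, set(imaginary.values())


def annihilates(correction, defects, imaginary):
    """True if exactly the defects are incident to an odd number of correction edges."""
    degree = Counter(v for edge in correction for v in edge)
    odd_real = {v for v, k in degree.items() if k % 2 == 1 and v not in imaginary}
    return odd_real == set(defects)


# Three detectors in a row; the two end detectors are open.
faults = [("a",), ("a", "b"), ("b", "c"), ("c",)]
edges, imaginary = build_decoder_graph(faults)
through_bulk = {frozenset(("a", "b")), frozenset(("b", "c"))}
print(annihilates(through_bulk, {"a", "c"}, imaginary))
```

In this toy graph the defect pair {a, c} can be annihilated either through the bulk or by matching each defect to its own imaginary detector; choosing among such valid corrections is the job of the inner decoder.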
FIG. S2. Decoder graphs of Z-type (b) and X-type (c) for a memory experiment with 3 syndrome-extraction cycles that preserves $|0\rangle$ of a distance-5 rotated surface code (a). Blue and red vertices denote Z- and X-type real detectors, respectively; white vertices denote imaginary detectors. Vertices which represent detectors from the same cycle constitute a layer. These layers are arranged in temporal order from left to right. Each edge represents the set of faults that flip the incident detectors. Each decoder graph has two open space boundaries. In addition, the X-type decoder graph has two open time boundaries.

Example 2.
To understand open and closed space boundaries, consider the Z-type decoder graph with the layout specified in Fig. S2(b). Every X error on a data qubit in the top row flips only one real detector, which lies on the top space boundary. By our definition, each detector on the top space boundary is therefore open and is connected to an imaginary detector, so the top space boundary is an open boundary. Similarly, the bottom space boundary is also open. Meanwhile, every non-corner detector on the left and right space boundaries is closed, so the left and right space boundaries are closed.
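The classification in Example 2 can be illustrated with a minimal Python sketch. It assumes that a detector counts as open when at least one single fault flips that detector alone (this is how Example 2 reads, but the formal definition is given elsewhere in the text), and it uses the rule that a boundary is open only if every detector on it is open. The fault table, the detector names, and both function names are hypothetical and do not correspond to the actual graph in Fig. S2(b).

```python
def open_detectors(faults):
    """Return the detectors that some fault flips alone (assumed meaning of 'open')."""
    return {next(iter(flipped)) for flipped in faults.values() if len(flipped) == 1}


def boundary_is_open(boundary, faults):
    """A boundary is open if every detector on it is open; otherwise it is closed."""
    opened = open_detectors(faults)
    return all(detector in opened for detector in boundary)


# Hypothetical toy fault table: fault name -> set of real detectors it flips.
toy_faults = {
    "x_top_0": {"t0"},        # top-row data error: flips a single detector
    "x_top_1": {"t1"},
    "x_bulk_0": {"t0", "m0"},  # bulk errors flip two detectors
    "x_bulk_1": {"t1", "m1"},
    "x_bulk_2": {"m0", "m1"},
}

top_boundary = ["t0", "t1"]
left_boundary = ["t0", "m0"]  # contains the open corner t0 and the closed m0

print(boundary_is_open(top_boundary, toy_faults))   # top boundary
print(boundary_is_open(left_boundary, toy_faults))  # left boundary
```

In this toy table the left boundary is closed even though its corner detector is open, which mirrors the role of the non-corner detectors in Example 2.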
FIG. S4. Illustrations of the core region and buffer region(s) for a window. Only edges in the core region (shaded in brown) are drawn. (a) A forward window with size w = 5 and step size s = 2. (b,c) A type-1 sandwich window with size w = 6 and step size s = 2. Each buffer region has size b = 2. The sandwich decoder simulations in this paper adopt the design in (b), and the core and buffer design described in the main text corresponds to (c). Both (b) and (c) are valid, and the core regions across different windows in each design are disjoint edge sets.
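The window parameters in the caption of Fig. S4 can be made concrete with a small Python sketch that works at the level of whole layers. It assumes that a forward window commits its first s layers and treats the remaining w - s layers as buffer, that a sandwich window satisfies w = s + 2b, that consecutive windows are offset by s layers, and that windows are simply clipped at the first and last layer. The paper defines core regions as edge sets, so the distinction between designs (b) and (c) is not captured here. The `Window` class and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    start: int       # first layer of the window (inclusive)
    core_start: int  # first core layer (inclusive)
    core_end: int    # one past the last core layer (exclusive)
    end: int         # one past the last layer of the window (exclusive)


def forward_windows(n_layers, w, s):
    """Forward windows: core = first s layers, buffer = the following w - s layers."""
    assert 0 < s <= w
    windows = []
    core_start = 0
    while core_start < n_layers:
        end = min(core_start + w, n_layers)
        # Assumption: the window that reaches the last layer commits all remaining layers.
        core_end = n_layers if end == n_layers else core_start + s
        windows.append(Window(core_start, core_start, core_end, end))
        core_start = core_end
    return windows


def sandwich_windows(n_layers, s, b):
    """Sandwich windows: a core of s layers with a buffer of b layers on each side."""
    assert s > 0 and b >= 0
    windows = []
    for core_start in range(0, n_layers, s):
        core_end = min(core_start + s, n_layers)
        start = max(core_start - b, 0)      # clipped at the first layer
        end = min(core_end + b, n_layers)   # clipped at the last layer
        windows.append(Window(start, core_start, core_end, end))
    return windows


fw = forward_windows(n_layers=9, w=5, s=2)
sw = sandwich_windows(n_layers=10, s=2, b=2)
```

With the caption's values, an interior sandwich window such as `sw[2]` spans 6 layers with a 2-layer core in the middle, and the cores of consecutive windows tile the time axis without overlap.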
Fig. S6 describes a general workflow for implementing a sandwich decoder with multiprocessing units.

FIG. S6. The proposed workflow of a sandwich decoder in a real quantum computing system.
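As a rough illustration of the data flow only, the following Python sketch decodes overlapping windows concurrently and keeps just the core part of each window's output. It does not reproduce the workflow of Fig. S6. The function `toy_inner_decoder` is a hypothetical stand-in that returns one value per layer; a real inner decoder such as min-weight perfect matching or union-find operates on a decoder graph instead. The window layout (core of s layers, buffers of b layers clipped at the ends) and all names are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_inner_decoder(layers):
    """Hypothetical stand-in for an inner decoder: one output per layer (layer parity)."""
    return [sum(layer) % 2 for layer in layers]


def sandwich_decode(syndromes, s, b, inner_decoder=toy_inner_decoder, max_workers=4):
    """Decode overlapping windows in parallel and concatenate their core outputs."""
    n = len(syndromes)
    jobs = []
    for core_start in range(0, n, s):
        core_end = min(core_start + s, n)
        start = max(core_start - b, 0)   # past buffer, clipped at the first layer
        end = min(core_end + b, n)       # future buffer, clipped at the last layer
        jobs.append((start, core_start, core_end, end))

    def run(job):
        start, core_start, core_end, end = job
        output = inner_decoder(syndromes[start:end])
        # Discard the buffer layers; only the core output is committed.
        return output[core_start - start:core_end - start]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        core_outputs = list(pool.map(run, jobs))  # map preserves window order
    return [value for core in core_outputs for value in core]


toy_syndromes = [[0, 1, 1], [1, 0, 0], [1, 1, 1], [0, 0, 0], [1, 0, 1], [0, 1, 0], [1, 1, 0]]
committed = sandwich_decode(toy_syndromes, s=2, b=2)
```

Because the windows do not depend on one another's output, they can be dispatched to separate workers; the sketch uses threads for brevity, and a process pool or dedicated hardware would be the natural substitute for a slow inner decoder.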

FIG. S7. Comparison between the forward decoder (a) and the sandwich decoder (b). The red vertical lines represent closed time boundaries, the dashed vertical lines represent open time boundaries, and the dashed horizontal lines represent open space boundaries. The core region of each window is marked in brown. The buffer regions are the remaining white areas.
FIG. S14. Illustration of how our scheme can potentially be extended to lattice surgery operations such as the two-qubit measurement. (a) The shape of the 3D decoder graph for an experiment containing a two-qubit measurement. (b) A way to divide this decoder graph into windows, with the core and buffer regions for each window illustrated. Each window has size O(d) × O(d) × O(d).
Supplementary Material for 'Scalable surface code decoders with parallelization in time'

S1. DECODER GRAPHS WITH BOUNDARIES

A boundary in a decoder graph is called open if every detector on this boundary is open. A boundary is closed if it is not open.