Optimization tools for distance-preserving flag fault-tolerant error correction

Lookup table decoding is fast and distance-preserving, making it attractive for near-term quantum computer architectures with small-distance quantum error-correcting codes. In this work, we develop several optimization tools that can potentially reduce the space and time overhead required for flag fault-tolerant quantum error correction (FTQEC) with lookup table decoding on Calderbank-Shor-Steane (CSS) codes. Our techniques include the compact lookup table construction, the Meet-in-the-Middle technique, the adaptive time decoding for flag FTQEC, the classical processing technique for flag information, and the separated $X$ and $Z$ counting technique. We evaluate the performance of our tools using numerical simulation of hexagonal color codes of distances 3, 5, 7, and 9 under circuit-level noise. Combining all tools can result in more than an order of magnitude increase in pseudothreshold for the hexagonal color code of distance 9, from $(1.34 \pm 0.01) \times 10^{-4}$ to $(1.42 \pm 0.12) \times 10^{-3}$.


I. INTRODUCTION
Inside a future large-scale quantum computer, there will be a continuous battle against unwanted interactions with the environment. The main goal of fault-tolerant quantum error correction (FTQEC) protocols [1] is to create a robust channel to transfer quantum information from the past to the future. The threshold theorem states that it is possible to suppress the failure rate of this channel (the logical error rate) to an arbitrarily small value, provided that the physical error rates of the constituent operations are below the accuracy threshold [2][3][4][5][6][7]. Reducing both the space and time overhead (the numbers of qubits and gates) is essential for scalable quantum computing, since decreasing logical error rates requires increasing overhead [8][9][10][11], and the current leading proposals for FTQEC schemes have daunting requirements [12,13].
An FTQEC scheme is designed to be robust against propagating errors that emerge from faulty gates during the execution of the protocol [1]. The scheme also has to protect against ancilla preparation and measurement errors, usually through repeated syndrome measurements. For an [[n, k, d]] stabilizer code [14], which encodes k logical qubits into n physical qubits and has minimum distance d, Shor's solution [1] was to use a cat-state ancilla register that requires w ancilla qubits and $(d + 1)^2/4$ rounds of syndrome measurements, where w is the maximum weight of the stabilizer generators. In Steane-style syndrome extraction [15], the ancilla register requires n qubits and is encoded with the same quantum error-correcting code (QECC) as the data qubits. Similarly, in Knill-style error correction [16], the ancilla register consists of two blocks of n qubits encoded in the same QECC as the data qubits.
In contrast to complex ancilla structures, bare ancillas can also be used to fault-tolerantly extract the syndrome while preserving the minimum distance for some specific families of stabilizer codes [17][18][19] and subsystem codes [20][21][22], or by tolerating some loss of distance [23][24][25]. For a general stabilizer code, however, fault-tolerant generator measurements with bare ancillas might not be possible. A series of works aiming to reduce the size of the ancilla register resulted in increasingly lightweight constructions [26,27], which also led to flag FTQEC schemes for perfect codes of distance three that use only two ancillas per generator [28]: one flag qubit and one syndrome qubit. Flag schemes were later generalized to arbitrary codes of distance d, requiring d + 1 ancillas per generator [29], while schemes for some specific families of codes require fewer [30][31][32][33][34][35].
FTQEC schemes based on cat-state extraction circuits and flag FTQEC schemes both require repeated syndrome measurements, which can result in a large number of gates. Adaptive syndrome measurement schemes, in which subsequent measurement procedures depend on previous syndrome measurement outcomes, have been explored to reduce the number of rounds required for FTQEC schemes with Shor-type extraction circuits [36][37][38].
During the execution of an FTQEC protocol, faults can occur at any gate in any round of syndrome measurements. The only information about the error on the data qubits that we can obtain is a sequence of error syndromes, from which we want to find an appropriate recovery operator. An ideal strategy would use all syndrome bits from all rounds, that is, the full space-time record of measurement outcomes. For some codes with a nice structure, such as surface codes, an efficient space-time decoder exists [39]. However, constructing a space-time decoder for a general stabilizer code is not simple. To simplify the problem, we consider an error decoder composed of two parts: a space decoder and a time decoder. Because syndrome measurements can be faulty, the time decoder finds a round of syndrome measurements that has no faults and therefore gives a correct syndrome. The space decoder then uses this correct syndrome to construct a recovery operator.
Conventionally, flag FTQEC uses a lookup table decoder as the space decoder and relies on Shor-style repeated syndrome measurements as the time decoder, although there are exceptions [40]. Both decoders have pros and cons. The lookup table decoder is fast and distance-preserving. However, building a lookup table requires an exhaustive search over all possible fault combinations up to a certain number of faults, and the table requires a lot of memory to store. Thus, it might not scale well to codes of high distance (unless code concatenation is applied). The Shor-style time decoder is simple and compatible with any space decoder. However, the large time overhead of the repetition can result in a lower threshold.
In this work, we build several optimization tools for both space and time decoders that reduce their overhead and yield better-performing protocols for flag-qubit-based FTQEC. Most of our tools are applicable to general stabilizer codes, but for simplicity we primarily focus on self-orthogonal CSS codes (CSS codes in which the X- and Z-type generators have the same form) in which the number of physical qubits is odd, the number of logical qubits is 1, and the logical X and Z operators are transversal. Our main results are the following: (1) We develop a technique to build a lookup table more efficiently. Our compact lookup table leverages the structure of a self-orthogonal CSS code and requires an 87.5% smaller memory footprint than a lookup table designed for a generic stabilizer code. Our method also efficiently verifies whether a configuration of flag circuits preserves the code distance. This development also leads to the notion of a fault code, which can be useful for error sampling under the circuit-level noise model. (2) We introduce the Meet-in-the-Middle (MIM) technique, which can help the lookup table decoder correct more faults than the number of errors correctable by the underlying code. Although such corrections are not always successful, the higher success probability can significantly increase the pseudothreshold in our simulations. (3) We generalize previous work [38] on adaptive syndrome measurement schemes to flag FTQEC and introduce one-tailed and two-tailed adaptive time decoders, which are useful in different circumstances. We also develop a classical processing technique for flag information that makes our FTQEC protocols compatible with any fault-tolerant Clifford computation. (4) We use our optimization tools to perform numerical simulations of the hexagonal color codes [41] of distances 3, 5, 7, and 9. The results show that each of our tools can significantly reduce the logical error rates and increase the pseudothreshold for each code while preserving the code distance. For the hexagonal color code of distance 9, the pseudothreshold improves by one order of magnitude, from $(1.34 \pm 0.01) \times 10^{-4}$ to $(1.42 \pm 0.12) \times 10^{-3}$, when all techniques are applied.
This paper is organized as follows. In Section II, we define the noise model used in this work, review flag FTQEC, and provide definitions of fault-tolerant error correction. In Section III, we develop optimization tools for the space decoder, including an efficient method to build a compact lookup table and the MIM technique. In Section IV, we develop optimization tools for the time decoder, including the one-tailed and two-tailed adaptive time decoders and other extended techniques for CSS codes. In Section V, we provide numerical results for the hexagonal color codes and examine the effects of the MIM, adaptive time decoding, and separated X and Z counting techniques on the logical error rates. We discuss and conclude our results in Section VI.

II. BACKGROUND
Quantum systems are fragile and can easily lose their properties when interacting with the environment. To protect quantum information, one can use a QECC to encode the quantum data. Quantum error correction (QEC) is a process that identifies an error when it occurs and then applies an appropriate error correction (EC) operator to remove it. However, quantum operations in this process can be faulty and may introduce more errors into the system. For this reason, we want the QEC process to be fault tolerant, which provides robustness guarantees against the impact of noise on the error correction implementation.
In this section, we first describe the noise model that will be used in this work and provide the conventional definition of fault-tolerant error correction in Section II A. We then review flag FTQEC and provide a revised definition of fault-tolerant error correction, which is more suitable for flag FTQEC in Section II B.

A. Noise model and conventional definition of fault-tolerant error correction
An [[n, k, d]] stabilizer code [14] encodes k logical qubits using n physical qubits and can correct up to τ = ⌊(d − 1)/2⌋ errors, where d is the code distance.A stabilizer code is described by a stabilizer group, an Abelian group generated by r = n − k commuting Pauli operators whose elements are called stabilizers.The code space is the simultaneous +1 eigenspace of all elements in the stabilizer group.
The QEC process for a stabilizer code can be done by first measuring the eigenvalues of all stabilizer generators. An r-bit string of measurement outcomes is called the error syndrome (where bits 0 and 1 refer to the +1 and −1 eigenvalues of each generator). An example of a circuit for measuring an eigenvalue of a stabilizer generator is displayed in Fig. 1. After the syndrome is obtained, an appropriate recovery operator is found by a mapping called the error decoder. Finally, the recovery operator is applied to the data qubits. For Calderbank-Shor-Steane (CSS) codes [15,42], it is possible to correct X- and Z-type errors separately. In this work, we follow standard CSS decoding [43], meaning independent recovery for X- and Z-type errors, thus not taking into account X/Z correlations such as those caused by Y errors. If all gates in the syndrome measurement process (with an example circuit in Fig. 1) are perfect, a stabilizer code of distance d can correct up to τ errors as desired. However, the process above may not be fault-tolerant under circuit-level depolarizing noise. This is because a single faulty gate may lead to an error that propagates to multiple errors on the data qubits, often referred to as hook errors [39]. These errors can always be handled by complex ancillas [1,44,45] or flag circuits [28], and can sometimes be handled by the circuit order [17][18][19].

FIGURE 1: A syndrome extraction circuit with a bare ancilla for measuring a stabilizer generator of the form ZZZZ.
In this work, we use the circuit-level depolarizing noise model, in which a fault may occur on the support of each gate after the gate is applied. Every single-qubit gate is followed by a single-qubit Pauli operator $P \in \{X, Y, Z\}$ with probability p/3 each, and every two-qubit gate is followed by a two-qubit Pauli operator $P_1 \otimes P_2 \in \{I, X, Y, Z\}^{\otimes 2} \setminus \{I \otimes I\}$ with probability p/15 each. In addition, single-qubit preparations and measurements can also be faulty; this is modeled by a bit-flip channel with error probability p after a single-qubit preparation or before a single-qubit measurement.
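To make the probabilities of this noise model concrete, the following is a minimal Python sketch of how the fault placed after a single gate could be sampled. The function names are hypothetical, and this is only an illustration of the stated channel, not the simulator used in this work.

```python
import itertools
import random

# The 15 non-identity two-qubit Paulis P1 ⊗ P2, written as strings such as "XZ".
TWO_QUBIT_PAULIS = [
    a + b for a, b in itertools.product("IXYZ", repeat=2) if a + b != "II"
]


def single_qubit_gate_fault(p, rng):
    """Return "X", "Y" or "Z" (each with overall probability p/3), or None for no fault."""
    if rng.random() < p:
        return rng.choice("XYZ")
    return None


def two_qubit_gate_fault(p, rng):
    """Return one of the 15 non-identity Paulis (each with probability p/15), or None."""
    if rng.random() < p:
        return rng.choice(TWO_QUBIT_PAULIS)
    return None


def prep_or_meas_flip(p, rng):
    """Bit flip after a preparation or before a measurement, with probability p."""
    return rng.random() < p


rng = random.Random(7)  # seeded for reproducibility
faults = [two_qubit_gate_fault(0.01, rng) for _ in range(100_000)]
fault_rate = sum(f is not None for f in faults) / len(faults)  # close to p = 0.01
```

Drawing "fault or no fault" first and then a uniform non-identity Pauli gives exactly p/3 (or p/15) per Pauli, which keeps the sketch simple to check.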
One way to define FTQEC is by using the definition proposed by Aliferis, Gottesman, and Preskill:

Definition 1. Fault-tolerant error correction [6]. Let $t \le \lfloor (d-1)/2 \rfloor$, where d is the distance of a stabilizer code. An error correction protocol is t-fault tolerant if the following two conditions are satisfied:
1. Error correction correctness property (ECCP): For any input codeword with an error of weight r, if s faults occur during the protocol with r + s ≤ t, ideally decoding the output state gives the same codeword as ideally decoding the input state.
2. Error correction recovery property (ECRP): If s faults occur during the protocol with s ≤ t, regardless of the weight of the error on the input state, the output state differs from a valid codeword by an error of weight at most s.
When a QEC protocol satisfies Definition 1, it is guaranteed that the output error has weight ≤ t whenever the weight of the input error plus the total number of faults in the protocol is ≤ t. This means that if the next round of QEC has no faults, it can always correct the output error from the current round. Normally, we would like to construct an FTQEC protocol in which t is as close as possible to τ = ⌊(d − 1)/2⌋. If t = τ, we say that the FTQEC protocol preserves the code distance.

B. Flag technique and revised definition of fault-tolerant error correction
Before describing the flag technique for FTQEC, let us consider the well-known Shor FTQEC scheme [1] applied to a stabilizer code of distance d. In this scheme, a stabilizer generator of weight w is measured using a cat state of the form $\frac{1}{\sqrt{2}}(|0\rangle^{\otimes w} + |1\rangle^{\otimes w})$ and transversal CNOT gates; see Fig. 2. A circuit of this kind will be called a Shor syndrome extraction circuit. When the cat state is prepared fault-tolerantly, a single fault in the circuit can lead to an error of weight at most one on the data qubits, so the set of all possible errors arising from up to t faults is exactly the set of all possible errors on ≤ t qubits. Therefore, any syndrome uniquely identifies the error (up to multiplication by a stabilizer) when the number of faults in the protocol is ≤ t.
One drawback of the Shor syndrome extraction circuit is that the number of required ancilla qubits equals the maximum weight of the stabilizer generators. Also, fault-tolerant preparation of the ancilla cat state requires verification [1] or the DiVincenzo-Aliferis ancilla decoding circuit [26], which incurs additional space and time overhead. One technique that can reduce the number of ancillas required for FTQEC is the flag technique [28], in which each syndrome extraction circuit uses one ancilla qubit to record the syndrome measurement outcome and a few flag ancillas to locate where a fault might have occurred. A circuit of this kind will be called a flag circuit; see Fig. 3 for an example. The flag measurement outcomes give extra information that can be used to partition the set of all possible errors arising from a certain number of faults. Therefore, it is possible to distinguish between two non-equivalent errors that correspond to the same syndrome if their associated flag measurement outcomes differ, making error correction easier.
Here we define fault combination, fault set, and distinguishability of a fault set as follows:

Definition 2. Fault combination, combined data error, and cumulative flag vector [35]. A fault combination $\Lambda = \{\lambda_1, \lambda_2, \dots, \lambda_r\}$ is a set of r faults $\lambda_1, \lambda_2, \dots, \lambda_r$. Suppose that the Pauli error due to the fault $\lambda_i$ can propagate through the circuit and lead to data error $E(\lambda_i)$ and flag vector $\vec f(\lambda_i)$. The combined data error $E(\Lambda)$ and cumulative flag vector $\vec F(\Lambda)$ corresponding to the fault combination $\Lambda$ are
$$E(\Lambda) = \prod_{i=1}^{r} E(\lambda_i), \qquad \vec F(\Lambda) = \sum_{i=1}^{r} \vec f(\lambda_i) \pmod 2.$$

Definition 3. Distinguishable fault set [35]. Let S be the stabilizer group of a stabilizer code, and let the fault set $F_t$ denote the set of all possible fault combinations arising from up to t faults during the measurement of the stabilizer generators of S. We say that $F_t$ is distinguishable if for any pair of fault combinations $\Lambda_p, \Lambda_q \in F_t$, at least one of the following conditions is satisfied:
1. $\vec s(E(\Lambda_p)) \neq \vec s(E(\Lambda_q))$,
2. $\vec F(\Lambda_p) \neq \vec F(\Lambda_q)$, or
3. $E(\Lambda_p) = E(\Lambda_q) \cdot M$ for some stabilizer $M \in S$,

where $\vec s(E)$ is the error syndrome of a combined error E. Otherwise, we say that $F_t$ is indistinguishable.
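The three conditions of Definition 3 can be checked pair by pair. The sketch below is a minimal Python illustration under simplifying assumptions: only X-type errors of a CSS code are considered, each data error and each cumulative flag vector is a bitmask (bit q for qubit q + 1), and the fault combinations are supplied directly rather than derived from circuits. All names are hypothetical; for a self-orthogonal code the same rows serve as both the checks and the stabilizer generators.

```python
from itertools import combinations


def syndrome(error, check_rows):
    """Pack the syndrome bits <row_j, error> mod 2 of a bitmask error into an int."""
    s = 0
    for j, row in enumerate(check_rows):
        s |= (bin(row & error).count("1") & 1) << j
    return s


def gf2_basis(generators):
    """GF(2) basis where each new vector is reduced against the earlier ones."""
    basis = []
    for g in generators:
        for b in basis:
            g = min(g, g ^ b)  # clears the leading bit of b from g when it is set
        if g:
            basis.append(g)
    return basis


def in_span(vec, basis):
    """True if vec is a GF(2) combination of the basis vectors."""
    for b in basis:
        vec = min(vec, vec ^ b)
    return vec == 0


def is_distinguishable(fault_combos, check_rows, stabilizers):
    """fault_combos: list of (combined_data_error, cumulative_flag_vector) bitmasks."""
    basis = gf2_basis(stabilizers)
    for (e_p, f_p), (e_q, f_q) in combinations(fault_combos, 2):
        if syndrome(e_p, check_rows) != syndrome(e_q, check_rows):
            continue  # condition 1: different syndromes
        if f_p != f_q:
            continue  # condition 2: different cumulative flag vectors
        if in_span(e_p ^ e_q, basis):
            continue  # condition 3: E(Λ_p) = E(Λ_q) M with M in S
        return False
    return True


# Assumed Steane code generators (bit q <-> qubit q + 1).
STEANE_ROWS = [0b1111000, 0b1100110, 0b1010101]
weight_le_1 = [(0, 0)] + [(1 << q, 0) for q in range(7)]
hook_unflagged = weight_le_1 + [(0b0000011, 0)]  # weight-2 error, no flag
hook_flagged = weight_le_1 + [(0b0000011, 1)]    # same error, flag raised
```

In this toy fault set, the unflagged weight-2 error shares its syndrome with a single error on qubit 3 while differing from it by a logical operator, so the set becomes indistinguishable; raising a flag restores distinguishability through condition 2.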
Note that faulty flag-qubit measurements are included when the fault set is calculated for verifying fault-set distinguishability (see Section III A). Having a distinguishable fault set is key to successful error decoding. Given a set of syndrome extraction circuits (with or without flags), we can calculate the fault set $F_t$ and check whether it is distinguishable. If it is, all possible errors arising from up to t faults that correspond to the same syndrome and cumulative flag vector are logically equivalent. Therefore, if the syndrome measurements give a syndrome $\vec s$ and a cumulative flag vector $\vec F$, we can pick any error that corresponds to the pair $(\vec s, \vec F)$ to be a recovery operator. Using this idea, a decoding table and an FTQEC protocol can be constructed.

FIGURE 3: A flag circuit for measuring a stabilizer generator of the form ZZZZ.
With the notion of fault distinguishability, it is possible to further generalize the definition of FTQEC as follows:

Definition 4. Fault-tolerant error correction (revised) [35]. Let $t \le \lfloor (d-1)/2 \rfloor$, where d is the distance of a stabilizer code. An error correction protocol is t-fault tolerant if the following two conditions are satisfied:
1. ECCP: For any input codeword with an error that can arise from r faults before the protocol and corresponds to the zero cumulative flag vector, if s faults occur during the protocol with r + s ≤ t, ideally decoding the output state gives the same codeword as ideally decoding the input state.
2. ECRP: If s faults occur during the protocol with s ≤ t, regardless of the number of faults that can cause the input error, the output state differs from a valid codeword by an error that can arise from s faults and corresponds to the zero cumulative flag vector.
The main difference between these two definitions of FTQEC is that Definition 4 considers the number of faults that can cause the input (or the output) error instead of the weight of the error.An FTQEC protocol satisfying Definition 4 can be constructed if we can find syndrome extraction circuits that give a distinguishable fault set (see previous results [35] by one of the authors of this work and the discussion in the next section for more details).In fact, while the threshold theorem proved by Aliferis, Gottesman and Preskill [6] relied on the weight of the error to define fault tolerance (Definition 1), the theorem has been shown to hold [35] even if the definition of fault tolerance uses the number of faults (Definition 4) instead.For flag FTQEC, using Definition 4 can result in simpler FTQEC protocols, so we will use Definition 4 in the protocol development throughout this work.

III. OPTIMIZATION TOOLS FOR SPACE DECODING
In this work, the term space decoder refers to a process that finds a recovery operator from a given syndrome under the assumption that it is exactly the same as the syndrome of an error that occurred to the codeword.The decoder succeeds if multiplying the error and the recovery operator gives a trivial logical operator (a stabilizer), and it fails if the multiplication gives a nontrivial logical operator.Our goal is to develop a space decoder such that whenever the total number of faults in the whole protocol is ≤ t, the decoder always succeeds.In this work, we are interested in a lookup table-based space decoder for flag FTQEC, so the decoder will use both syndrome and flag information obtained during the syndrome measurements.Note that the ability to correct faults for a certain code depends on the structure of the circuits for syndrome extraction, such as the ordering of gates.
In this section, we develop optimization tools for space decoding. In Section III A, we discuss how to efficiently construct a lookup table for error decoding for a distinguishable fault set $F_t$, and introduce the notion of a fault code. In Section III B, we discuss the Meet-in-the-Middle technique, an additional technique that can help improve our space decoders and increase the accuracy of decoding.

A. Compact lookup table for minimum weight decoding and fault code
In this section, we discuss how to construct the fault set $F_t$, verify its distinguishability, and construct the lookup table for error decoding. With our method, we can reduce the memory footprint of the lookup table by 87.5% for self-orthogonal CSS codes compared to a lookup table designed for generic stabilizer codes. We also present the framework of fault codes, which enables fast construction using streamlined Pauli-frame simulation represented as matrix algebra operations over GF(2).
A brief summary of our methods is as follows. Let the weight of a fault combination be the number of faults that give rise to it. The decoding table maps each full syndrome $(\vec s, \vec F)$ to a recovery operator that corresponds to the combined data error of the minimum-weight fault combination resulting in that full syndrome. To construct the decoding table, we start by collecting all weight-1 fault combinations that may arise in the extraction circuits and map each resulting full syndrome to its corresponding data error. At this point, we say that the search radius of the lookup table is 1. Afterward, we combine pairs of weight-1 fault combinations to create all possible weight-2 fault combinations. The combined data error of each weight-2 fault combination is obtained by taking the product of the data errors, and its full syndrome is obtained by adding the full syndromes of the weight-1 fault combinations modulo 2. If the combining process leads to a new full syndrome, we store it in the table. If it leads to an existing full syndrome, we have a collision and do one of the following: (1) if the stored and the new combined data errors are the same up to a stabilizer, we do nothing; (2) if they differ by a nontrivial logical operator (up to a stabilizer), we raise an error, since this implies that $F_2$ is not distinguishable. If no combination causes the second case (that is, $F_2$ is distinguishable), we say that the search radius of the lookup table is 2. We can gradually increase the search radius in the same way until we reach the maximum search radius for which the fault set is distinguishable. Here, we rely on an efficient representation of the combined data errors using a decomposition of Pauli operators into pure errors, stabilizers, and logical operators [46].
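The radius-by-radius construction can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each weight-1 fault is already summarized as a pair (full syndrome packed into an int, logical class bit relative to a fixed linear choice of CRO), so that combining faults is a bitwise XOR and a collision "up to a logical operator" shows up as two different logical classes for the same key. Instead of raising an error, the sketch stops and reports the largest radius reached. The names and the toy repetition-code input are hypothetical.

```python
from itertools import combinations


def build_lookup_table(single_faults, max_radius):
    """Grow a lookup table one search radius at a time.

    single_faults: one (full_syndrome, logical_class) pair per weight-1 fault.
    Returns (table, radius): table maps full_syndrome -> (logical_class, weight),
    and radius is the largest w <= max_radius for which F_w is distinguishable.
    """
    table = {0: (0, 0)}  # the empty fault combination
    radius = 0
    for w in range(1, max_radius + 1):
        new_entries = {}
        for combo in combinations(single_faults, w):
            key, cls = 0, 0
            for full_syndrome, logical_class in combo:
                key ^= full_syndrome  # full syndromes add modulo 2
                cls ^= logical_class  # so do logical classes (linear CRO)
            stored = table.get(key, new_entries.get(key))
            if stored is None:
                new_entries[key] = (cls, w)  # first, hence minimum-weight, occurrence
            elif stored[0] != cls:
                return table, radius  # differ by a logical operator: F_w indistinguishable
            # otherwise the errors agree up to a stabilizer: keep the stored entry
        table.update(new_entries)
        radius = w
    return table, radius


# Toy input: 3-qubit bit-flip code, checks Z1Z2 (bit 0) and Z2Z3 (bit 1), no flags,
# with CROs X1 for syndrome 01 and X3 for syndrome 10 (so X2 has logical class 1).
REP3_FAULTS = [(0b01, 0), (0b11, 1), (0b10, 0)]
table, radius = build_lookup_table(REP3_FAULTS, 3)
```

For this distance-3 toy code the search stops at radius 1: the weight-2 combination X1X2 has the same syndrome as X3 but a different logical class.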
During sampling, the decoder receives a measured full syndrome. When the decoder finds this full syndrome in the lookup table, it returns the corresponding actual recovery operator (ARO). However, when the decoder cannot find it in the lookup table, it returns only a so-called canonical recovery operator (CRO). Each syndrome has a unique canonical recovery operator, which guarantees that applying it to the erroneous encoded state maps the state back to the code space, but possibly with a logical error.
The full description of our methods is presented below.

Reducing the memory footprint
To decode an [[n, k, d]] stabilizer code, we can construct a lookup table that, for all possible fault combinations of weight 0 to t (where $t = \lfloor (d-1)/2 \rfloor$), stores the full syndrome $\vec\sigma = (\vec s, \vec F)$ as the key and the combined data error as the recovery operator. While this approach works, it is expensive. Let $T_{\text{stab}}$ denote the number of distinct full syndromes for the fault combinations of weight 0 to t for a generic stabilizer code. As $T_{\text{stab}}$, and thus the size of the lookup table, grows exponentially in n, n − k (the number of generators), and the number of circuit locations, we want to store the data as efficiently as possible. For general stabilizer codes, n − k bits are required for the syndrome and n − k bits for the cumulative flag vector (assuming, for simplicity, flag circuits with a single flag ancilla). Meanwhile, the recovery operator requires 2n bits in the symplectic representation. Thus, the map contains $T_{\text{stab}}(4n - 2k)$ bits of data.
Leveraging the structure of CSS codes, we can significantly improve the memory footprint. Assume standard CSS decoding, in which two separate lookup tables are used for X and Z decoding, and denote by $r_X$ and $r_Z$ the numbers of X- and Z-type stabilizer generators, so that $r_X + r_Z = n - k$. The per-entry cost decreases, as the entries only need to cater for X- or Z-type operators and syndromes: each entry for the X- and Z-type syndromes has $2r_X$ and $2r_Z$ bits, respectively, for the syndrome and the cumulative flag vector, and n bits for the recovery operator. A self-orthogonal CSS code needs only one table, further decreasing the cost; see Appendix A for the total number of bits. Moreover, we can reduce the number of bits for the recovery operator to k using the following two key ideas:
1. In general, for an [[n, k, d]] code, each Pauli operator $P \in \mathcal{P}_n$ can be decomposed as a product $P = EML$ of a pure error E, a stabilizer $M \in S$, and a logical operator $L \in \mathcal{P}_k$ (where $\mathcal{P}_k$ is the logical Pauli group on k qubits) [46]. We define a fixed set of pure errors called canonical recovery operators (CROs), one CRO for each unique syndrome $\vec s$.
2. Given a syndrome $\vec s(E)$, the goal of decoding is to find a recovery operator R such that $RE \in S$, so that R converts the error into the logical identity operation. For any possible Pauli error, we only have to store its logical class, a value that indicates how the error is related to the CRO with the same syndrome. This allows each map value to be only 2k bits in general, and k bits when the code is a self-orthogonal CSS code. In the latter case with k = 1, the logical class is 0 if the product of the Pauli operator and the CRO with the same syndrome is in the stabilizer group; otherwise, the logical class is 1.
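With logical classes as the stored values, the decoding step described earlier (ARO when the full syndrome is in the table, CRO otherwise) reduces to one table lookup and at most one multiplication by a logical operator. The Python sketch below assumes, for k = 1, that errors are bitmasks of one Pauli type, that a function `cro_of` returning the CRO for a syndrome is available, and that `logical_op` is a bitmask representative of the logical operator (for a transversal logical, the all-ones mask). These names and the toy repetition-code data are hypothetical.

```python
def decode(full_syndrome, syndrome, table, cro_of, logical_op):
    """Return (recovery, found_in_table) using a table of logical classes (k = 1).

    table maps full_syndrome -> (logical_class, weight).
    """
    cro = cro_of(syndrome)
    entry = table.get(full_syndrome)
    if entry is None:
        return cro, False  # CRO only: back to the code space, possibly with a logical error
    logical_class, _weight = entry
    # ARO: the CRO itself for class 0, CRO times the logical operator for class 1.
    return (cro ^ logical_op if logical_class else cro), True


# Toy 3-qubit bit-flip code (no flags, so full syndrome == syndrome).
REP3_CRO = {0b00: 0b000, 0b01: 0b001, 0b10: 0b100, 0b11: 0b101}
REP3_TABLE = {0: (0, 0), 1: (0, 1), 3: (1, 1), 2: (0, 1)}
error = 0b010  # X on qubit 2, syndrome 0b11
recovery, found = decode(0b11, 0b11, REP3_TABLE, REP3_CRO.get, 0b111)
```

Here the CRO for syndrome 11 is X1X3, and the stored logical class 1 turns it into X1X3 · X1X2X3 = X2, which cancels the error exactly.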
Altogether, for a self-orthogonal CSS code with n ≫ k, the size of the table can be as small as 12.5% of the table obtained by viewing the code as a generic stabilizer code and storing full recovery operators instead of logical classes. For a CSS code that is not self-orthogonal, the gain is smaller but still significant. See Appendix A for detailed calculations of the savings for the lookup table. Note that if the lookup table is used for proving distinguishability, all unique syndromes are required. However, in a real-time decoding architecture, the entries corresponding to fault combinations of significantly low probability may be excluded, resulting in further reduction [47].
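The per-entry part of these savings follows directly from the bit counts stated above. The Python sketch below only reproduces those per-entry counts (assuming one flag ancilla per generator, as in the text); the overall 12.5% figure also depends on how many entries each table holds, which is derived in Appendix A and is not reproduced here. The function names are hypothetical.

```python
def bits_per_entry_generic(n, k):
    """Generic stabilizer code: (n-k) syndrome + (n-k) flag bits + 2n symplectic bits."""
    return 2 * (n - k) + 2 * n  # = 4n - 2k


def bits_per_entry_css(r, n):
    """One CSS table with r generators: r syndrome + r flag bits + n recovery bits."""
    return 2 * r + n


def bits_per_entry_self_orthogonal(n, k):
    """Self-orthogonal CSS: r = (n-k)/2 generators per type, k-bit logical class value."""
    r = (n - k) // 2
    return 2 * r + k


# Steane code [[7, 1, 3]] with r = 3 generators per type.
steane = (
    bits_per_entry_generic(7, 1),
    bits_per_entry_css(3, 7),
    bits_per_entry_self_orthogonal(7, 1),
)
large_n_ratio = bits_per_entry_self_orthogonal(1001, 1) / bits_per_entry_generic(1001, 1)
```

For the Steane code this gives 26, 13, and 7 bits per entry, and for n ≫ k the per-entry ratio n/(4n − 2k) approaches 1/4.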

Constructing the lookup table
We now explicitly describe an algorithm to construct the lookup table. During the construction, we have a systematic way to enumerate fault combinations with their full syndromes and combined data errors instead of running a circuit simulator for each case. The exhaustive enumeration of all possible fault combinations of weight 0 to t is done in two steps. First, we enumerate the single faults and capture the full syndrome and logical class of each in a single column of the fault check matrix $H_f$, using matrix algebra over GF(2) to represent the propagation of errors in our syndrome extraction circuits. Second, we combine these columns in all possible combinations of 0 to t faults ($\sum_{i=0}^{t} \binom{N}{i}$ combinations in total, where N is the number of possible single faults) while keeping track of the weight of each fault combination. This last step verifies whether $F_t$ is distinguishable (which is equivalent to verifying whether the protocol is distance-preserving) and, at the same time, builds a lookup table for the decoder.
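Both steps can be illustrated with a short Python sketch. Combining columns of $H_f$ is a sum over GF(2), that is, $H_f v$ for the indicator vector v of the chosen faults, which becomes a bitwise XOR if each column is packed into an integer (a hypothetical packing). The count $\sum_{i=0}^{t} \binom{N}{i}$ shows how quickly the enumeration grows.

```python
from math import comb


def num_fault_combinations(num_single_faults, t):
    """Number of fault combinations of weight 0..t: sum_{i=0}^{t} C(N, i)."""
    return sum(comb(num_single_faults, i) for i in range(t + 1))


def combine_columns(hf_columns, fault_indices):
    """XOR the chosen columns of H_f (each packed into an int), i.e. H_f v over GF(2)."""
    out = 0
    for i in fault_indices:
        out ^= hf_columns[i]
    return out


small = num_fault_combinations(10, 2)     # 1 + 10 + 45
larger = num_fault_combinations(100, 4)   # grows roughly like N^t / t!
combined = combine_columns([0b001, 0b011, 0b110], [0, 2])
```

With N = 100 single faults and t = 4, about four million combinations must already be enumerated, which is why a compact representation of each entry matters.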
Enumerating weight-1 faults. From here on, we only consider a self-orthogonal CSS code and denote its parity check matrix by H. To list all possible single faults under the circuit-level depolarizing noise model, it is sufficient to consider all possible weight-1 faults within a single round of syndrome measurements. Each column of the fault check matrix $H_f$ describes the full syndrome and the logical class of one possible weight-1 fault. As the logical class of each fault depends on how its CRO is defined, we define the fault check matrix relative to a right inverse $H^{-1}$ of H (for which $HH^{-1} = I_{(n-k)/2}$). The high-level structure of $H_f$ consists of three major groups of rows and three major groups of columns. The three groups of rows are the (n − k)/2 generator bits, the (n − k)/2 flag bits, and the k bits for the logical class. Each single fault, represented by a column of $H_f$, falls into one of the following three categories:
1. Pure data qubit errors, which result only in generator bits. They do not trigger flags, so their flag bits are all zero. The CRO R of each pure data qubit error E can be described by the corresponding column of $H^{-1}H$ (since the syndromes of the CROs are $H(H^{-1}H) = (HH^{-1})H = H$), so the product RE for each E is described by the corresponding column of $I_n \oplus H^{-1}H$ (where matrix addition, denoted by ⊕, and multiplication are over GF(2)). If E is an X-type (or a Z-type) error, the logical class of RE is described by a k-bit string whose ith bit indicates whether RE anticommutes with $\bar Z_i$ (or $\bar X_i$). That is, the logical classes of all pure data qubit errors are described by
$$\begin{pmatrix} J_1^T \\ \vdots \\ J_k^T \end{pmatrix} (I_n \oplus H^{-1}H),$$
where $J_i$ is the column vector representing $\bar Z_i$ (or $\bar X_i$).
2. Flag ancilla preparation or measurement errors, which do not propagate to the data qubits; each such error results in a single flag bit. Therefore, all errors of this type have the all-zero syndrome and logical class 0, while their flag bits are represented by the (n − k)/2 × (n − k)/2 identity matrix.
3. Gate faults that cause errors on the syndrome ancilla, which can propagate to the data and flag qubits. We order these faults top-down and left-to-right by their place of occurrence and capture their effects on the syndrome bits, flag bits, and logical class. The part of $H_f$ corresponding to this type of fault is denoted by $H_{f,\text{gate}}$.
Note that single measurement and reset errors on the syndrome ancilla are ignored during this analysis as their effects would be removed by the time decoder through the repetition of syndrome measurements.
Generalizing $H_f$ to a non-self-orthogonal CSS code is straightforward. In that case, the parity check matrices for X- and Z-type errors can be different, leading to different fault check matrices. Generalizing $H_f$ to a generic stabilizer code is more complicated but still doable, as all operators must be considered in symplectic form. In that case, the number of rows for the logical class is 2k. Also, instead of taking the inner product with $J_i$, whether each CRO commutes or anticommutes with each logical operator can be determined by the symplectic inner product between the symplectic bitstrings representing the CRO and the logical operator.
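In the case that the code is a self-orthogonal CSS code, n is odd, k = 1, and the logical X and Z operators are transversal, the fault check matrix is
$$H_f = \left( \begin{matrix} H & 0 \\ 0 & I_{(n-1)/2} \\ J^T (I_n \oplus H^{-1} H) & 0 \end{matrix} \;\middle|\; H_{f,\text{gate}} \right),$$
where the first two column groups correspond to pure data qubit errors and flag errors, and J is the all-one column vector of length n (representing $X^{\otimes n}$ or $Z^{\otimes n}$).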
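The first two column groups of $H_f$ depend only on H and the chosen right inverse, so they can be assembled directly. The Python sketch below does this with 0/1 lists for a self-orthogonal CSS code with k = 1. The Steane matrices are assumptions made for illustration: H is taken with column j equal to the binary representation of j, and $H^{-1}$ is one right inverse whose columns are weight-1 errors; this choice reproduces the numbers in the Steane example in this section, but the paper's own choice may differ. $H_{f,\text{gate}}$ is omitted because it depends on the extraction circuits.

```python
# Assumed Steane parity check matrix: column j is the binary representation of j (1..7).
STEANE_H = [
    [0, 0, 0, 1, 1, 1, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [1, 0, 1, 0, 1, 0, 1],
]
# One right inverse (H H^{-1} = I_3): its columns are single errors on qubits 4, 2 and 1.
STEANE_H_INV = [
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 0],
    [1, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
]


def data_and_flag_blocks(H, H_inv):
    """Rows of the data and flag column groups of H_f (self-orthogonal CSS, k = 1)."""
    r, n = len(H), len(H[0])
    columns = []
    for j in range(n):  # pure data qubit error e_j
        s = [H[i][j] for i in range(r)]  # syndrome bits = column j of H
        cro = [sum(H_inv[q][i] * s[i] for i in range(r)) % 2 for q in range(n)]  # H^{-1} s
        re = [(cro[q] + (q == j)) % 2 for q in range(n)]  # column j of I_n ⊕ H^{-1} H
        logical_class = sum(re) % 2  # J^T (RE) with J the all-ones vector
        columns.append(s + [0] * r + [logical_class])
    for f in range(r):  # preparation/measurement error on flag f
        columns.append([0] * r + [int(i == f) for i in range(r)] + [0])
    return [list(row) for row in zip(*columns)]  # transpose columns into rows


rows = data_and_flag_blocks(STEANE_H, STEANE_H_INV)
```

The last row gives the logical classes $J^T(I_n \oplus H^{-1}H)$; with these assumed matrices, the data errors on qubits 3, 5, and 6 have logical class 1.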
As an example, consider the first group of columns for the [[7,1,3]] Steane code [15], whose stabilizer generators can be defined by a parity check matrix H with a chosen right inverse $H^{-1}$. Each column of $H^{-1}$ gives a Pauli operator for each syndrome bit. For a data error E of any weight with syndrome $\vec s(E) = HE$, the CRO is defined by $R(\vec s(E)) = H^{-1}\vec s(E)$. For example, for $E = (0110000)^T$, $\vec s(E) = (001)^T$ and $R(\vec s(E)) = (1000000)^T$; thus $RE = (1110000)^T$, whose syndrome is trivial (as RE is a logical operator).
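The worked example can be checked numerically with a few GF(2) matrix-vector products. The Python sketch below assumes the common Steane parity check matrix whose column j is the binary representation of j, together with one right inverse whose columns are weight-1 errors; these matrices are an assumption chosen because they reproduce $\vec s(E) = (001)^T$ and $R = (1000000)^T$, and the `mat_vec_gf2` helper is hypothetical.

```python
# Assumed Steane parity check matrix and one right inverse (see lead-in).
STEANE_H = [
    [0, 0, 0, 1, 1, 1, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [1, 0, 1, 0, 1, 0, 1],
]
STEANE_H_INV = [
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 0],
    [1, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
]


def mat_vec_gf2(M, v):
    """Matrix-vector product over GF(2) for 0/1 lists."""
    return [sum(a * b for a, b in zip(row, v)) % 2 for row in M]


E = [0, 1, 1, 0, 0, 0, 0]
s = mat_vec_gf2(STEANE_H, E)          # syndrome s(E) = HE
R = mat_vec_gf2(STEANE_H_INV, s)      # CRO R(s) = H^{-1} s
RE = [(a + b) % 2 for a, b in zip(R, E)]
residual = mat_vec_gf2(STEANE_H, RE)  # syndrome of RE, expected to be trivial
```

Since RE has odd weight and a trivial syndrome, it is the logical operator rather than a stabilizer, so this error has logical class 1 relative to its CRO.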
FIGURE 4: (a) A flag circuit used in this work for measuring a $Z$-type stabilizer generator of weight $w$. A flag circuit for measuring an $X$-type stabilizer generator of weight $w$ can be obtained by replacing each CNOT gate that connects a data qubit to the syndrome ancilla with the gate in (b).

For errors of weight 1 on the data qubits, the operator $RE$ of each error is given by the corresponding column of $I_n \oplus H^{-1}H$.
Since the logical class of $RE$ can be determined by its weight parity, the logical classes of this type of errors are given by the row vector $L \equiv J^T(I_n \oplus H^{-1}H)$, where $J$ is the all-one column vector. That is, for the Steane code, the part of $H_f$ corresponding to pure data qubit errors is
$$\begin{pmatrix} H \\ \mathbf{0}_{3 \times 7} \\ L \end{pmatrix}, \qquad L = (0\;0\;1\;0\;1\;1\;0).$$

Constructing $H_{f,\mathrm{gate}}$. In this work, we focus on the case that each $Z$-type or $X$-type stabilizer generator of weight $w$ is measured using a flag circuit with a single flag ancilla, similar to the circuit in Fig. 4 (with slight modifications, a similar construction can also be made for a general flag circuit). For $H_{f,\mathrm{gate}}$, we are interested in how errors propagate from the syndrome ancilla to the data qubits and the flag ancilla. The error propagation is represented via binary matrices, an idea closely related to the "gate matrix", where the direction of propagation is the opposite, towards the ancilla from the data qubits [45,48]. Given single-flag syndrome extraction circuits for all stabilizer generators and the CNOT ordering for each circuit, $H_{f,\mathrm{gate}}$ can be calculated via the propagator matrix $P$ and the aggregator matrix $A$, defined as follows.

For an error correction protocol with $n$ data qubits and $r$ flag bits ($r$ equals the number of $X$- or $Z$-type stabilizer generators), the matrix $P$ has $n + r$ rows. The number of columns of $P$ is $\sum_{i=1}^{r} (w(g_i) + 2)$, where $w(g_i)$ is the Hamming weight of the $i$-th stabilizer generator $g_i$. This follows from the fact that, for each CNOT gate in the single-flag syndrome extraction circuits, the only fault that can lead to a unique data error after propagation is the fault that leads to a single $Z$ error on the target qubit of the CNOT (which is the syndrome ancilla). To simplify the construction, we construct a submatrix $P_i$ of size $(n + r) \times (w(g_i) + 2)$ for each row $g_i$ of $H$ (i.e., each stabilizer generator), then concatenate the submatrices to get
$$P = (P_1\; P_2\; \cdots\; P_r).$$
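The logical-class row $L = J^T(I_n \oplus H^{-1}H)$ can be computed column by column: for a single-qubit error $e_j$, column $j$ of $I_n \oplus H^{-1}H$ is $R(He_j) \oplus e_j$, and its weight parity gives $L_j$. The sketch below assumes a right inverse with weight-1 columns on qubits 4, 2, and 1; a different right inverse would change $L$ accordingly. Storing each column of $H_f$ as a (syndrome, flag, logical class) triple is our own illustrative representation.

```python
# Sketch: logical-class row L = J^T (I_n + H^{-1} H) (mod 2) for the Steane code,
# ASSUMING the right inverse H^{-1} with columns on qubits 4, 2, 1 (1-indexed).
H = [
    [0, 0, 0, 1, 1, 1, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [1, 0, 1, 0, 1, 0, 1],
]
H_INV_COLS = [
    [0, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0],
]
n, r = 7, 3  # data qubits, flag bits (one per Z- or X-type generator)


def cro(s):
    """R(s) = H^{-1} s (mod 2)."""
    out = [0] * n
    for bit, col in zip(s, H_INV_COLS):
        if bit:
            out = [(a + b) % 2 for a, b in zip(out, col)]
    return out


L = []
data_columns = []  # columns of H_f for single data-qubit errors
for j in range(n):
    s = [row[j] for row in H]  # H e_j is column j of H
    e_j = [int(q == j) for q in range(n)]
    rec_e = [(a + b) % 2 for a, b in zip(cro(s), e_j)]  # R(H e_j) + e_j
    L.append(sum(rec_e) % 2)  # J^T (...): weight parity
    data_columns.append((tuple(s), (0,) * r, L[-1]))

print(L)  # [0, 0, 1, 0, 1, 1, 0]
```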
As the order of the CNOT gates matters in subtle ways, for a given stabilizer generator $g_i$, we represent the CNOT ordering by the permutation $\pi_i : \{1, 2, \ldots, w(g_i)\} \to \mathrm{supp}(g_i)$, where $\pi_i(j)$ indicates the control (data) qubit of the $j$-th CNOT (the target qubit is always the syndrome ancilla). $\pi_i$ can also be represented by a list. For example, different CNOT orderings are possible in the syndrome extraction circuit for measuring $g_1 = (0001111)$; below we use $\pi_1 = [4, 6, 5, 7]$.

To construct $P_i$, we iterate from $j = 1$ to $w(g_i)$ and create, for each iteration, a column that is all zeros except for a 1 in row $\pi_i(j)$. We then insert two additional columns at the second-from-the-left and second-from-the-right positions (which represent the flag CNOTs); these are all zeros except for a 1 in row $n + i$. In our running example of $g_1 = (0001111)$ with $\pi_1 = [4, 6, 5, 7]$,
$$P_1 = \begin{pmatrix} 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 1&0&0&0&0&0 \\ 0&0&0&1&0&0 \\ 0&0&1&0&0&0 \\ 0&0&0&0&0&1 \\ 0&1&0&0&1&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \end{pmatrix}.$$
The aggregator matrix $A$ plays the role of propagating the errors to the end of the syndrome measurement circuits. For each $g_i$, we define $A_i$ to be a square matrix of size $(w(g_i) + 2) \times (w(g_i) + 2)$ whose lower triangle is set to all 1s, and define $A = \bigoplus_{i=1}^{r} A_i$ to be the direct sum of all $A_i$'s. In our example case of $g_1$,
$$A_1 = \begin{pmatrix} 1&0&0&0&0&0 \\ 1&1&0&0&0&0 \\ 1&1&1&0&0&0 \\ 1&1&1&1&0&0 \\ 1&1&1&1&1&0 \\ 1&1&1&1&1&1 \end{pmatrix}.$$
Multiplying the propagator and the aggregator matrices yields
$$PA = \begin{pmatrix} \Omega_1 & \Omega_2 & \cdots & \Omega_r \\ \Phi_1 & \Phi_2 & \cdots & \Phi_r \end{pmatrix},$$
where the columns of the submatrices $\Omega_i$ are the final Pauli operators on the data qubits, and the columns of the submatrices $\Phi_i$ are the cumulative flag vectors after measuring $g_i$, given a fault propagated from the syndrome ancilla to the data qubits at the location corresponding to that column. Next, we find the syndromes of these Pauli operators by multiplying them with the parity check matrix, $H\Omega$, where $\Omega \equiv (\Omega_1\; \Omega_2\; \cdots\; \Omega_r)$. Then, for each syndrome, we define the CRO based on the right inverse, $H^{-1}H\Omega$. Finally, we determine the logical class of each fault from the parity of the sum of the CRO and the propagated, final data error, $J^T(\Omega \oplus H^{-1}H\Omega) = L\Omega$. As a result, the part of $H_f$ corresponding to the gate faults is
$$H_{f,\mathrm{gate}} = \begin{pmatrix} H\Omega \\ \Phi \\ L\Omega \end{pmatrix}, \qquad \Phi \equiv (\Phi_1\; \Phi_2\; \cdots\; \Phi_r).$$
The relationship between the full syndrome, the data error, and the CRO of each fault is as follows. Suppose that the $i$-th column of $H_f$ (which represents a single fault at the $i$-th location) contains error syndrome $\vec{s}_i$, flag vector $\vec{F}_i$, and logical class $l_i$. The CRO of the fault is $H^{-1}\vec{s}_i$, while the data error of the fault is $l_i J \oplus H^{-1}\vec{s}_i$. That is, in the case of a single fault, the actual recovery operator (ARO) we need to apply when finding the full syndrome $(\vec{s}_i, \vec{F}_i)$ is $l_i J \oplus H^{-1}\vec{s}_i$.

Verifying distinguishability and building the lookup table. The fault set $F_t$ is distinguishable if and only if there is no fault combination of up to $2t$ faults that gives a non-trivial logical operator with trivial full syndrome [35]. As the fault check matrix already contains all possible single faults, in the case $t = 1$ we only need to extend the matrix by an all-zero column (which represents 0 faults) and check whether there is a pair of columns that are the same except for the logical class. If there is, the combined data errors of one or two faults add up to an undetectable logical operator, meaning that $F_1$ is not distinguishable.
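Returning to the propagator and aggregator matrices, the running example $g_1 = (0001111)$ with $\pi_1 = [4, 6, 5, 7]$ ($n = 7$ data qubits, $r = 3$ flag bits) can be reproduced with the pure-Python sketch below. The helper names are hypothetical, and we assume that the "lower triangle" of $A_i$ includes the diagonal, so that column $j$ of $P_iA_i$ is the mod-2 sum of columns $j, j+1, \ldots$ of $P_i$.

```python
# Sketch: P_1, A_1, and P_1 A_1 (mod 2) for g_1 = (0001111), pi_1 = [4, 6, 5, 7].
# ASSUMPTION: A_i is lower triangular with ones on and below the diagonal.
n, r = 7, 3


def propagator_block(pi, i):
    """P_i of size (n + r) x (w + 2); pi lists 1-indexed data qubits in CNOT order."""
    data_cols = []
    for q in pi:
        col = [0] * (n + r)
        col[q - 1] = 1  # CNOT between data qubit q and the syndrome ancilla
        data_cols.append(col)
    flag_col = [0] * (n + r)
    flag_col[n + i - 1] = 1  # flag CNOT acts on flag bit i (row n + i)
    cols = [data_cols[0], flag_col] + data_cols[1:-1] + [flag_col, data_cols[-1]]
    return [[col[row] for col in cols] for row in range(n + r)]  # columns -> rows


def aggregator_block(size):
    """A_i[k][j] = 1 if k >= j, else 0."""
    return [[1 if k >= j else 0 for j in range(size)] for k in range(size)]


def matmul_mod2(X, Y):
    return [[sum(X[a][k] * Y[k][b] for k in range(len(Y))) % 2
             for b in range(len(Y[0]))] for a in range(len(X))]


P1 = propagator_block([4, 6, 5, 7], i=1)
A1 = aggregator_block(len(P1[0]))
PA1 = matmul_mod2(P1, A1)
Omega1, Phi1 = PA1[:n], PA1[n:]  # final data errors / cumulative flag vectors

for j in range(len(P1[0])):
    data = [q + 1 for q in range(n) if Omega1[q][j]]
    print(f"location {j + 1}: data error on qubits {data}, flag = {Phi1[0][j]}")
```

Location 1 reproduces the full generator $g_1$ (a stabilizer), while locations 3 to 5 are the flagged hook errors. Multiplying $\Omega_1$ by $H$ and by $L$ would then give the syndrome and logical-class rows of $H_{f,\mathrm{gate}}$.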
TABLE I: Lookup table decoder metrics for the [[7,1,3]], [[19,1,5]], [[37,1,7]], and [[61,1,9]] codes. The number of columns of the fault check matrix counted in the first row results from the three-part structure of data errors, flag errors, and gate faults. These columns are not necessarily unique, as can be seen in the second row, which counts the number of unique columns. The time to verify distinguishability for the different codes on a single thread with our C++ code depends on the number of unique columns; hence, the verification for higher-distance codes takes longer than for lower-distance ones. All timings are reported using Intel Xeon Gold 6226R, 2.90 GHz processors. Some fault combinations have the same full syndrome, hence the cache size is smaller than the full number of fault combinations. The cache size in memory is reported from actual usage, including the overhead of the hash table implementation.

When $t \geq 2$, we populate the cache with the logical classes of higher-weight fault combinations by combining all possible lower-weight fault combinations while keeping track of the weights of the fault combinations. We describe the $i$-th fault combination as a key-value pair $[(\vec{s}_i, \vec{F}_i) : (l_i, w_i)]$, where $(\vec{s}_i, \vec{F}_i)$ is the full syndrome, $l_i$ is the logical class, and $w_i$ is the weight of the fault combination. Combining the $i$-th and the $j$-th fault combinations gives
$$[(\vec{s}_i \oplus \vec{s}_j, \vec{F}_i \oplus \vec{F}_j) : (l_i \oplus l_j, w_i + w_j)].$$
As we aim to check whether $F_t$ is distinguishable, we fill up the cache by combining any pair of fault combinations that satisfy $w_i + w_j \leq t$. If the process gives a new key (full syndrome) that already exists in the cache, we have a key conflict. This can be one of the following cases:
1. The new and the existing fault combinations have the same full syndrome and the same logical class but different weights. In this case, we store the fault combination with the smaller weight in the cache.
2. The new and the existing fault combinations have the same full syndrome but different logical classes. As the sum of the weights of these two fault combinations is at most $2t$, we raise an error: there exists a fault combination of up to $2t$ faults that gives a non-trivial logical operator with trivial full syndrome, that is, $F_t$ is not distinguishable.
If at the end we find that $F_t$ is distinguishable, we can construct a lookup table of search radius $t$ from the cache as follows: for each key-value pair $[(\vec{s}_i, \vec{F}_i) : (l_i, w_i)]$ in the cache, we store a new key-value pair $[(\vec{s}_i, \vec{F}_i) : l_i J \oplus H^{-1}\vec{s}_i]$ in the lookup table (the weights are not necessary for decoding, though they might be useful for estimating the number of faults that caused the full syndrome). That is, $l_i J \oplus H^{-1}\vec{s}_i$ is the ARO for the full syndrome $(\vec{s}_i, \vec{F}_i)$. When decoding, the ARO is applied if the full syndrome $(\vec{s}, \vec{F})$ obtained from measurements is found in the lookup table; otherwise, the CRO $H^{-1}\vec{s}$ is applied.
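The cache construction and the lookup table can be sketched in Python as below. This is a simplified illustration under our own assumptions, not the authors' C++ implementation: single faults are given as (syndrome, flag, logical class) triples, and instead of combining arbitrary pairs with $w_i + w_j \leq t$, the sketch grows the cache one fault at a time (weight $w$ from weight $w - 1$), which reaches the same full syndromes and detects the same key conflicts. The toy input contains only the seven single data-qubit errors of the Steane code, with logical classes $L = (0010110)$ obtained from the assumed weight-1 right inverse on qubits 4, 2, 1, so it is a code-capacity check rather than a circuit-level one.

```python
# Sketch: cache construction (distinguishability check) and lookup table decoding.
H = [[0, 0, 0, 1, 1, 1, 1], [0, 1, 1, 0, 0, 1, 1], [1, 0, 1, 0, 1, 0, 1]]
H_INV_COLS = [[0, 0, 0, 1, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0]]
L = [0, 0, 1, 0, 1, 1, 0]  # logical classes of single data errors (assumed H^{-1})
n = 7


class NotDistinguishable(Exception):
    pass


def xor(a, b):
    return tuple(x ^ y for x, y in zip(a, b))


def build_cache(faults, t):
    """Map full syndrome (s, F) -> (logical class, weight) for up to t faults."""
    zero = (tuple(0 for _ in faults[0][0]), tuple(0 for _ in faults[0][1]))
    cache = {zero: (0, 0)}
    frontier = [zero]
    for w in range(1, t + 1):
        new_frontier = []
        for key in frontier:
            logical = cache[key][0]
            for s, f, l in faults:
                new_key = (xor(key[0], s), xor(key[1], f))
                if new_key in cache:
                    # Key conflict with a different logical class: not distinguishable.
                    if cache[new_key][0] != logical ^ l:
                        raise NotDistinguishable(new_key)
                    # Same logical class: keep the existing entry (weight <= w).
                else:
                    cache[new_key] = (logical ^ l, w)
                    new_frontier.append(new_key)
        frontier = new_frontier
    return cache


def cro(s):
    out = [0] * n
    for bit, col in zip(s, H_INV_COLS):
        if bit:
            out = [a ^ b for a, b in zip(out, col)]
    return out


def build_lookup_table(cache):
    """(s, F) -> ARO = l J + H^{-1} s, with J the all-ones vector."""
    return {key: tuple(b ^ l for b in cro(key[0])) for key, (l, _w) in cache.items()}


def decode(table, s, flags):
    """Apply the ARO if (s, F) is in the table; otherwise fall back to the CRO."""
    return table.get((tuple(s), tuple(flags)), tuple(cro(s)))


faults = [(tuple(row[j] for row in H), (), L[j]) for j in range(n)]
table = build_lookup_table(build_cache(faults, t=1))
print(len(table), decode(table, (0, 1, 1), ()))
try:
    build_cache(faults, t=2)
except NotDistinguishable:
    print("F_2 is not distinguishable for the Steane code")
```

As expected for a distance-3 code, $F_1$ is distinguishable while $F_2$ is not; the ARO returned for the syndrome $(011)$ differs from the single-qubit error on qubit 3 by the stabilizer $g_1$.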
The lookup table can then be stored in an efficient binary format on disk or in memory as needed. Table I displays the metrics related to the lookup table decoder obtained by the algorithm above.
In summary, we perform an exhaustive search of fault combinations, which gives us a lookup table with search radius $t$; this is equivalent to verifying the distinguishability of $F_t$. If we can construct the lookup table with $t = \tau = \lfloor (d-1)/2 \rfloor$, we have a minimum-weight decoder that is distance-preserving under the circuit-level depolarizing noise model. As a hash table provides lookups in $O(1)$ amortized time, this decoder is also relatively fast for numerical simulations or real-time decoding compared to more complicated algorithms such as MaxSAT decoding [49], neural-network-based decoders [50], or the restriction decoder [51] with minimum weight perfect matching decoding [52], all of which have at least $O(n)$ complexity. However, the table size scales exponentially in the number of qubits, locations, and stabilizer generators; thus, constructing the lookup table may be impractical for a code of high distance.

The fault code
Any CSS code can be defined by its parity check matrix $H$, which maps a bitstring representing a combination of errors on the data qubits to the error syndrome of the error combination. In the case of flag FTQEC, where the circuit-level noise model is considered, we can use similar ideas and define a fault code by the fault check matrix $H_f$, which maps a bitstring representing a combination of possible faults to the full syndrome of the fault combination (which includes the error syndrome of the combined data error and the cumulative flag vector) and the logical class relative to the CRO for the syndrome. It should be noted that the distance of the fault code might be lower than the distance of the underlying CSS code; this depends on the syndrome extraction circuits, which affect the distinguishability of the fault set. We define the effective distance $d_{\mathrm{eff}}$ to be the minimum number of faults that can give a fault combination with a non-trivial logical operator and the trivial full syndrome. The number of faults that the fault code can correct is $t_{\mathrm{eff}} = \lfloor (d_{\mathrm{eff}} - 1)/2 \rfloor$ (this is the maximum value of $t$ for which $F_t$ is distinguishable). If the effective distance and the code distance are equal, we say that the error correction protocol is distance preserving. Calculating the distance of a classical code can be done by determining the spark of its parity check matrix $H$, which is known to be NP-hard in general [53]. However, the spark algorithm does not work for degenerate CSS codes, as it reports only the minimum weight of the stabilizers, which is a lower bound on the code distance [42]. Our algorithm described in this section can be viewed as a modified spark algorithm that uses the logical class information to calculate the distance of the code (based on $H$) and also the effective distance of the fault code (based on $H_f$).
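A brute-force version of this modified spark idea searches fault combinations in order of increasing weight for one with trivial full syndrome and nontrivial logical class. The sketch below is exponential in the number of fault locations and only illustrates the definition of $d_{\mathrm{eff}}$; the column format and the Steane code-capacity columns (with logical classes $L = (0010110)$ from an assumed weight-1 right inverse on qubits 4, 2, 1) are our own illustrative choices.

```python
from itertools import combinations

H = [[0, 0, 0, 1, 1, 1, 1], [0, 1, 1, 0, 0, 1, 1], [1, 0, 1, 0, 1, 0, 1]]
L = [0, 0, 1, 0, 1, 1, 0]  # assumed logical classes of single data errors


def effective_distance(columns, max_weight):
    """Smallest number of faults giving trivial full syndrome and logical class 1."""
    for w in range(1, max_weight + 1):
        for combo in combinations(columns, w):
            s = [0] * len(columns[0][0])
            f = [0] * len(columns[0][1])
            l = 0
            for cs, cf, cl in combo:
                s = [a ^ b for a, b in zip(s, cs)]
                f = [a ^ b for a, b in zip(f, cf)]
                l ^= cl
            if not any(s) and not any(f) and l == 1:
                return w
    return None  # no such combination up to max_weight


columns = [(tuple(row[j] for row in H), (), L[j]) for j in range(7)]
d_eff = effective_distance(columns, max_weight=4)
print(d_eff, (d_eff - 1) // 2)  # 3 1
```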
The perspective of the fault code can also be useful to extend a technique frequently used for error sampling (in qecsim [54], for example) from the code capacity noise model (memory errors only) and the phenomenological noise model (both memory and measurement errors) to the circuit-level noise model. Here, a randomly generated column vector of Hamming weight $w$ represents faults at $w$ locations instead of errors on $w$ qubits. Suppose that the vector $\vec{v}$ represents the fault combination and $H_f\vec{v}$ gives the full syndrome $(\vec{s}(\vec{v}), \vec{F}(\vec{v}))$ and the logical class $l(\vec{v})$. In an error correction simulation, the decoder predicts the recovery operator $\vec{r}$ based on the full syndrome. The predicted recovery operator causes a logical error if and only if $l(\vec{v})$ and $l(\vec{r})$ differ.
In principle, this method can lead to a higher sampling rate than running the full circuit simulation for each sample. However, the vectors representing fault combinations must be generated according to the correct probability distribution, as different single faults need not occur at the same rate.
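The sampling idea can be sketched as follows. Each single fault is a column of $H_f$ stored as a (syndrome, flag, logical class) triple, a fault combination is a set of column indices, and its full syndrome and logical class are mod-2 sums. The decoder is an assumed toy that looks up the predicted logical class in a dictionary and falls back to class 0 (the CRO) for unknown syndromes. Drawing $w$ locations uniformly at random, as `sample_failure` does, is a simplifying assumption that is only valid if all single faults are equally likely; the Steane code-capacity columns with $L = (0010110)$ are again an illustrative choice.

```python
import random

H = [[0, 0, 0, 1, 1, 1, 1], [0, 1, 1, 0, 0, 1, 1], [1, 0, 1, 0, 1, 0, 1]]
L = [0, 0, 1, 0, 1, 1, 0]  # assumed logical classes of single data errors
COLUMNS = [(tuple(row[j] for row in H), (), L[j]) for j in range(7)]

# Toy decoder table: full syndrome -> predicted logical class (radius t = 1).
TABLE = {((0, 0, 0), ()): 0}
TABLE.update({(s, f): l for s, f, l in COLUMNS})


def apply_faults(indices):
    """H_f v (mod 2) for the fault combination v supported on `indices`."""
    s, f, l = (0, 0, 0), (), 0
    for i in indices:
        cs, cf, cl = COLUMNS[i]
        s = tuple(a ^ b for a, b in zip(s, cs))
        f = tuple(a ^ b for a, b in zip(f, cf))
        l ^= cl
    return s, f, l


def is_logical_error(indices):
    s, f, l_v = apply_faults(indices)
    l_r = TABLE.get((s, f), 0)  # 0: fall back to the CRO
    return l_v != l_r


def sample_failure(w, rng):
    """Draw w distinct fault locations uniformly (simplifying assumption)."""
    return is_logical_error(rng.sample(range(len(COLUMNS)), w))


rng = random.Random(1)
print(sum(sample_failure(1, rng) for _ in range(1000)))  # 0: single faults are corrected
```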

B. Meet-in-the-Middle Technique
If the fault set $F_t$ of each code is distinguishable, the flag FTQEC protocol can correct up to $t$ faults with certainty. However, whenever $t + 1$ or more faults occur, successful error correction is not guaranteed; our decoder can either remove the error or cause a logical error on the encoded state. Although the probability of having $t + 1$ or more faults is $O(p^{t+1})$, being able to correct more cases of faults can lead to a higher pseudothreshold. In this section, we introduce the Meet-in-the-Middle technique, which can help correct errors when there are more than $t$ faults in our FTQEC protocol. Note that this technique is general and could help any FTQEC protocol with a table-based decoder correct more faults than its guaranteed capability, provided that the stabilizer code being used is not a perfect (or perfect CSS) code.
The Meet-in-the-Middle (MIM) technique is inspired by the bidirectional search algorithm [55] and improves the table-based decoder previously discussed in Section II B (see also Section III A) in the case that the decoder cannot find the full syndrome obtained from measurements in its lookup table. Consider the case that the fault set $F_t$ is distinguishable and a lookup table of search radius $t$ can be constructed. Suppose that more than $t$ faults occur and the full syndrome is $(\vec{s}_m, \vec{F}_m)$, which is not in the lookup table. The table-based decoder discussed in Section II B will return the canonical recovery operator, which may cause a logical error after correction. To make successful error correction more probable in such cases, one could, in principle, construct a lookup table with a search radius larger than $t$ by relaxing the distinguishability requirement for fault combinations with weights higher than $t$. However, this can be impractical, as the number of fault combinations grows too fast as the search radius increases.
To overcome this issue, we instead conduct a search during decoding starting from the missing full syndrome $(\vec{s}_m, \vec{F}_m)$. That is, we construct another decoding table, called the MIM table, with search radius at most $\rho \leq t$ using ideas similar to those for the original lookup table, but we add $(\vec{s}_m, \vec{F}_m)$ (modulo 2) to each map key before storing it in the MIM table, and check whether the resulting key is in the decoding lookup table. If a new map key in the MIM table is the same as some map key in the decoding lookup table, the search stops, and the decoder constructs a recovery operator from the two combined data errors from the MIM table and the decoding lookup table that correspond to the matching map keys. If the MIM search radius reaches $\rho$ and no matching syndrome is found, the decoder returns the CRO for the full syndrome. Using the recovery operator obtained from this method, we can correct up to $t + \rho$ faults with a higher probability than by using the CRO of the full syndrome alone.
An example of error decoding using a lookup table and the MIM technique is illustrated in Fig. 5.In our FTQEC protocols for hexagonal color codes of distance 3, 5, 7, and 9, we find that constructing the MIM table with search radius ρ = t is sufficiently fast to be used at runtime.Note that the MIM technique does not guarantee successful error correction due to potential degeneracy in syndromes above the guaranteed number of correctable faults.However, we do find numerically that the MIM technique has a positive impact on the performance of our decoders for the hexagonal color codes for distances 3, 5, 7, and 9.
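A brute-force sketch of the MIM search is given below. For simplicity, it enumerates fault combinations of weight up to $\rho$ directly instead of building a separate MIM hash table, and it returns only the logical class $l$ of the combined recovery; for the missing key $(\vec{s}_m, \vec{F}_m)$, the corresponding recovery operator would then be $lJ \oplus H^{-1}\vec{s}_m$, by the same reasoning as for the ARO. The toy fault list (Steane data errors plus three flag errors, with assumed logical classes $L = (0010110)$) is our own illustration, not a circuit-level fault model.

```python
from itertools import combinations

H = [[0, 0, 0, 1, 1, 1, 1], [0, 1, 1, 0, 0, 1, 1], [1, 0, 1, 0, 1, 0, 1]]
L = [0, 0, 1, 0, 1, 1, 0]  # assumed logical classes of single data errors
# Toy single faults: 7 data errors (no flag) and 3 flag errors (no syndrome).
FAULTS = [(tuple(row[j] for row in H), (0, 0, 0), L[j]) for j in range(7)]
FAULTS += [((0, 0, 0), tuple(int(k == i) for k in range(3)), 0) for i in range(3)]

# Radius-1 decoding table: full syndrome -> logical class.
TABLE = {((0, 0, 0), (0, 0, 0)): 0}
TABLE.update({(s, f): l for s, f, l in FAULTS})


def xor(a, b):
    return tuple(x ^ y for x, y in zip(a, b))


def mim_logical_class(missing_key, rho):
    """Search outward from a full syndrome that is missing from TABLE.

    Returns the logical class of the combined recovery (MIM part + table part),
    or None if nothing matches within radius rho (the caller then uses the CRO).
    """
    if missing_key in TABLE:
        return TABLE[missing_key]
    for w in range(1, rho + 1):
        for combo in combinations(FAULTS, w):
            s, f, l = missing_key[0], missing_key[1], 0
            for cs, cf, cl in combo:
                s, f, l = xor(s, cs), xor(f, cf), l ^ cl
            if (s, f) in TABLE:  # the two searches meet in the middle
                return TABLE[(s, f)] ^ l
    return None


# A data error on qubit 1 plus a flag error on flag 1: not in the radius-1 table.
missing = ((0, 0, 1), (1, 0, 0))
print(missing in TABLE, mim_logical_class(missing, rho=1))  # False 0
```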

IV. OPTIMIZATION TOOLS FOR TIME DECODING
In general, faults can happen at any point during syndrome measurements, and the syndrome obtained at each round of measurements may not be the correct syndrome (the syndrome of the combined data error at the end of that round). In particular, measurement errors can lead to a syndrome that differs from the correct syndrome by some bits. Errors on the data or ancilla qubits that happen in the middle of the syndrome extraction can also result in a syndrome that only captures some parts of the correct syndrome. Applying a space decoder to a faulty syndrome can lead to an incorrect recovery operation. For this reason, one must perform multiple rounds of syndrome measurements.
The goal of a time decoder is to identify at least one round with a correct syndrome during the whole syndrome measurement process. If this can be done, an FTQEC protocol satisfying both conditions in Definition 4 can be constructed. Note that, according to the definition, it is sufficient to consider only the case in which the total number of faults in the whole protocol is no more than $t$, where $t$ is the number of errors that the stabilizer code being used can correct. This is because the failure probability of the FTQEC protocol (the probability of having $t + 1$ or more faults in the protocol) will be $O(p^{t+1})$, similar to the failure probability of ideal error correction with the same stabilizer code. (Nevertheless, for better decoding accuracy, it is beneficial to correct some cases of $t + 1$ or more faults, as suggested by the MIM technique in Section III B.) In this section, we develop several types of time decoders for flag FTQEC, building on the ideas of adaptive decoders for Shor-style error correction [38]. Different time decoders use different fault count estimation procedures. In Section IV A, we describe a conventional way to perform repeated syndrome measurements for flag FTQEC in terms of difference vectors, which will be useful for the developments in later sections. In Section IV B, we develop one-tailed and two-tailed adaptive time decoders, which utilize flag information in the protocols. The one-tailed adaptive decoder is applicable to a larger family of codes, while the two-tailed adaptive decoder is better optimized for self-orthogonal CSS codes but needs to be used with an extended technique so that it becomes fully fault tolerant when applied to quantum computation. In Section IV C, we develop two extended techniques that can further improve the performance of our adaptive time decoders for FTQEC, given that the code being used is a self-orthogonal CSS code.
A. Shor time decoder for flag FTQEC

In Shor's original approach [1], the syndrome extraction is repeated until the same syndrome appears $t + 1$ times in a row. Observe that for $R$ repeated but untrustworthy syndromes, at least $R$ faults are required to make them all the same (for example, by having exactly the same measurement errors in each round). Therefore, to make sure that a round with a correct syndrome exists when considering the case with up to $t$ faults, it is sufficient to wait for $t + 1$ repeated measurements. A time decoder with this stop condition will be referred to as the Shor time decoder.
It is possible to rephrase the Shor time decoder using the notion of a difference vector. For a syndrome history $(\vec{s}_1, \vec{s}_2, \ldots, \vec{s}_m)$ of length $m$, we define the difference vector $\vec{\delta}$ to be the $(m-1)$-bit string in which
$$\delta_i = \begin{cases} 0 & \text{if } \vec{s}_i = \vec{s}_{i+1}, \\ 1 & \text{otherwise.} \end{cases}$$
As two repeated syndrome measurements are represented by a zero in the difference vector, Shor's method can be reformulated as waiting for $t$ consecutive zeros in $\vec{\delta}$.
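The difference vector and the Shor stop condition translate directly into code. In this minimal sketch, syndromes are tuples of bits, $t \geq 1$ is assumed, and the function names are our own.

```python
def difference_vector(syndromes):
    """delta_i = 0 if s_i == s_{i+1}, else 1 (length m - 1)."""
    return [0 if a == b else 1 for a, b in zip(syndromes, syndromes[1:])]


def shor_stop(delta, t):
    """Shor time decoder: stop once the last t bits of delta are all zero (t >= 1)."""
    return len(delta) >= t and not any(delta[-t:])


history = [(1, 0, 1), (0, 0, 1), (0, 0, 1), (0, 0, 1)]
delta = difference_vector(history)
print(delta, shor_stop(delta, t=2), shor_stop(delta, t=3))  # [1, 0, 0] True False
```

Three identical syndromes in a row give two trailing zeros, which is enough for $t = 2$ but not for $t = 3$.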
As we aim to correct no more than $t$ faults, the analysis of our time decoders can be made easier by thinking in terms of a budget of $t$ faults. Shor's method spends all of this budget on counting consecutive zeros in the difference vector and is completely oblivious to other parts of the syndrome history (because the counter is reset whenever a one appears). We call the parts of the syndrome history outside of the zero substring the context of the zero substring. As Shor's method does not take the context into account, we call this strategy "context-unaware". In the worst-case scenario for the Shor time decoder, $(t+1)^2$ rounds of syndrome measurements are performed before the stop condition is satisfied. The context of the zero substrings contains useful information, and not counting the faults in the context results in underestimating the number of faults that can cause a given syndrome history. Context-aware strategies, which have a better estimate of the number of faults, can stop earlier and execute fewer measurements, resulting in higher pseudothresholds.
As flag circuits are used in the syndrome extraction, we also obtain a flag vector history $(\vec{f}_1, \vec{f}_2, \ldots, \vec{f}_m)$ from $m$ rounds of syndrome measurements, which also leads to a cumulative flag vector history $(\vec{F}_1, \vec{F}_2, \ldots, \vec{F}_m)$. Note that the calculation of a difference vector does not involve flag vectors. Since faulty flag qubit measurements are already taken into account when we evaluate the distinguishability of a fault set, all flag measurement outcomes are considered correct and can be used for error decoding. Our goal is to find a round such that all syndrome bits are correct.
The correct syndrome will be used in conjunction with the flag information obtained right before the measurements of the correct syndrome. Suppose that the code being used is a CSS code, and the $X$-type generator measurements at round $i$ (which give $\vec{s}_{i,x}$, $\vec{f}_{i,x}$, and $\vec{F}_{i,x}$) are done before the $Z$-type generator measurements (which give $\vec{s}_{i,z}$, $\vec{f}_{i,z}$, and $\vec{F}_{i,z}$). If the syndrome from round $l$ is correct according to the Shor time decoder, $Z$-type (or $X$-type) error correction will be done using $\vec{s}_{l,x}$ and $\vec{F}_{l-1,z}$ (or $\vec{s}_{l,z}$ and $\vec{F}_{l,x}$). We use similar ideas for error correction with other time decoders.
Suppose that a table-based space decoder for flag FTQEC can be constructed (as discussed in Section III). Then, a flag FTQEC protocol with the Shor time decoder is as follows:

Protocol 1. Flag FTQEC protocol with Shor time decoder
Let $t = \lfloor (d-1)/2 \rfloor$ be the number of errors that a stabilizer code of distance $d$ can correct. Let $\vec{s}_i = (\vec{s}_{i,x}, \vec{s}_{i,z})$ and $\vec{f}_i = (\vec{f}_{i,x}, \vec{f}_{i,z})$ be the syndrome and flag vector obtained from the $i$-th round of full syndrome measurements with flag circuits. Let the cumulative flag vector at the $i$-th round be $\vec{F}_i = \sum_{j=1}^{i} \vec{f}_j \pmod 2$. After the $i$-th round with $i \geq 2$, calculate $\delta_{i-1}$. Repeat syndrome measurements until the last $t$ bits of $\vec{\delta}$ are zero or the total number of rounds reaches $(t+1)^2$. Suppose that the latest round is round $l$. Perform $Z$-type error correction using $(\vec{s}_{l,x}, \vec{F}_{l-1,z})$, and perform $X$-type error correction using $(\vec{s}_{l,z}, \vec{F}_{l,x})$.
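Protocol 1 can be replayed offline on a recorded syndrome and flag history, which is convenient for checking which round and which cumulative flag vectors would be passed to the space decoder. The sketch below is hypothetical: it takes pre-recorded rounds instead of driving a device, stores each round as a dictionary with keys `sx`, `sz`, `fx`, `fz` (our own convention), and only selects the decoder inputs without performing the correction.

```python
def difference_vector(syndromes):
    return [0 if a == b else 1 for a, b in zip(syndromes, syndromes[1:])]


def cumulative(flag_history):
    """F_i = f_1 + ... + f_i (mod 2) for each round i."""
    out, acc = [], None
    for f in flag_history:
        acc = tuple(f) if acc is None else tuple(a ^ b for a, b in zip(acc, f))
        out.append(acc)
    return out


def protocol1(rounds, t):
    """Replay Protocol 1 on recorded rounds (dicts with keys sx, sz, fx, fz).

    Returns (l, z_input, x_input): the stopping round l (1-indexed), the input
    (s_{l,x}, F_{l-1,z}) for Z-type correction, and (s_{l,z}, F_{l,x}) for X-type.
    """
    syndromes = []
    for l, rnd in enumerate(rounds, start=1):
        syndromes.append((rnd["sx"], rnd["sz"]))
        delta = difference_vector(syndromes)
        if (len(delta) >= t and not any(delta[-t:])) or l == (t + 1) ** 2:
            break
    else:
        raise ValueError("history ended before the stop condition was met")
    Fx = cumulative([rnd["fx"] for rnd in rounds[:l]])
    Fz = cumulative([rnd["fz"] for rnd in rounds[:l]])
    Fz_prev = Fz[l - 2] if l >= 2 else tuple(0 for _ in Fz[0])  # F_0 = 0
    return l, (rounds[l - 1]["sx"], Fz_prev), (rounds[l - 1]["sz"], Fx[l - 1])


rounds = [
    {"sx": (1, 0, 0), "sz": (0, 0, 0), "fx": (0, 0, 0), "fz": (0, 1, 0)},
    {"sx": (1, 0, 0), "sz": (0, 0, 0), "fx": (1, 0, 0), "fz": (0, 0, 0)},
]
print(protocol1(rounds, t=1))
```

With $t = 1$, the repeated syndrome in round 2 stops the protocol; $Z$-type correction then uses the $Z$-flag information accumulated up to round 1, and $X$-type correction uses the $X$-flag information up to round 2.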

B. Adaptive time decoder for flag FTQEC
Recently, FTQEC protocols with adaptive syndrome measurement techniques have been proposed by some of the authors of this work [38]. Instead of using flag qubits, in that work each stabilizer generator is measured using a syndrome extraction circuit with a cat state (similar to Shor's original circuits [1]). The authors show that, using the adaptive strong decoder, it is possible to reduce the number of syndrome measurement rounds in the worst-case scenario from $(t+1)^2$ to $(t+3)^2/4 - 1$. The resulting FTQEC protocol satisfies the error-weight-based definition of FTQEC (Definition 1) and is applicable to any stabilizer code. In this work, we extend the measurement techniques based on the adaptive strong decoder to flag FTQEC and develop protocols satisfying the revised FTQEC conditions that use the number of faults instead of the weight of errors (Definition 4). The main difference from Ref. [38] is that this work also uses flag information to estimate the number of faults that occurred in the protocol, leading to a faster procedure for finding a syndrome suitable for error correction. We start by describing the key ideas of Ref. [38] in terms of correlated and uncorrelated bit histories, which are useful for bounding from below the number of faults that occurred. Afterwards, we explain how each technique in Ref. [38] can be improved using the flag information.

Counting faults in correlated and uncorrelated bit histories
Let us first consider a way to estimate the number of faults that occurred from a given difference vector $\vec{\delta}$. A single fault can cause either one or two consecutive ones in $\vec{\delta}$ [38]. Thus, for each substring $\vec{\kappa}$ in $\vec{\delta}$, the number of faults that can cause such a substring is bounded from below by the number of non-overlapping 11 sequences plus the number of remaining 1s in $\vec{\kappa}$.
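Scanning left to right and pairing adjacent 1s greedily yields this lower bound: each maximal run of $k$ ones contributes $\lceil k/2 \rceil$, the fewest faults that can produce the run if each fault creates at most two consecutive ones. A minimal sketch, with a function name of our choosing:

```python
def min_faults(bits):
    """Lower bound on faults producing `bits` when one fault flips one bit or
    two consecutive bits: non-overlapping '11' pairs plus the remaining 1s."""
    count, i = 0, 0
    while i < len(bits):
        if bits[i] == 1:
            count += 1
            pair = i + 1 < len(bits) and bits[i + 1] == 1
            i += 2 if pair else 1
        else:
            i += 1
    return count


print(min_faults([1, 1, 0]), min_faults([1, 1, 1]), min_faults([1, 0, 1, 1, 0, 1]))  # 1 2 3
```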
Suppose that the difference vector is of the form $\vec{\delta} = \eta_1 1 \eta_2 1 \ldots 1 \eta_c$, where $\eta_j = 00\ldots00$ ($1 \leq j \leq c$) are zero substrings. For each $\eta_j$ of length $\gamma_j \geq 1$ with $2 \leq j \leq c - 1$, we define $\alpha_j$ to be the total number of non-overlapping 11 sequences plus the total number of remaining 1s before the substring $1\eta_j 1$, and define $\beta_j$ similarly but for the part after $1\eta_j 1$ (for $\eta_1$ and $\eta_c$, $\beta_1$ and $\alpha_c$ are defined similarly to those of the other $\eta_j$'s, and we let $\alpha_1 = 0$, $\beta_c = 0$). The zero substring $\eta_j$ of length $\gamma_j$ corresponds to $\gamma_j + 1$ consecutive rounds with the same syndrome, so the number of faults required to make all of these rounds give incorrect syndromes is at least $\gamma_j + 1$. Therefore, under the assumption that there are at most $t$ faults in the whole protocol, if we find that there exists $\gamma_j$ such that $t - \alpha_j - \beta_j < \gamma_j + 1$, the syndromes of the $\gamma_j + 1$ rounds that give rise to $\eta_j$ cannot all be incorrect. For this reason, at least one syndrome corresponding to $\eta_j$ is correct and can be used for error correction (see the full analysis in Ref. [38] for more details).
For example, assume that the total number of faults in the protocol is at most $t = 4$, and ten rounds of syndrome measurements give the following $\vec{\delta}$ (the bars delimit the substring $1\eta_j 1$):

Round: 1 2 3 4 5 6 7 8 9 10
$\vec{\delta}$: 1 1 0 | 1 0 0 1 | 0 1

Focusing on the substring $1\eta_j 1 = 1001$, we find that $\alpha_j = 1$ and $\beta_j = 1$, meaning that the patterns of $\vec{\delta}$ on the left and the right sides of 1001 arise from at least two faults, so the number of remaining faults is at most two. We can see that $\gamma_j = 2$ because of the two zeros in the substring 1001, which correspond to three rounds with the same syndrome. Since the number of remaining faults that can cause the pattern 1001 is less than the number of rounds with the same syndrome, the syndrome of at least one of these three rounds must be correct and can be used for error correction.

There are multiple, increasingly fine-grained ways of estimating the number of faults in the context around each zero substring in the difference vector. Here, we use the term bit history as a general term for a series of syndrome bits (measurement outcomes) from a given stabilizer generator, a given flag bit (the measurement outcome of a flag qubit), or bits in a difference vector (taken as the difference history of a group of bits). A key element in this discussion is the notion of correlated and uncorrelated bit histories.
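Returning to the ten-round example, the stop condition of Ref. [38] can be checked for every zero substring at once. The sketch below (function names are our own, flag information is not used) computes $\alpha_j$ and $\beta_j$ as non-overlapping 11 sequences plus remaining 1s, and reports the rounds behind each certified $\eta_j$, using the convention that bit $k$ of $\vec{\delta}$ compares rounds $k$ and $k+1$.

```python
def min_faults(bits):
    """Non-overlapping '11' pairs plus remaining 1s (greedy, left to right)."""
    count, i = 0, 0
    while i < len(bits):
        if bits[i] == 1:
            count += 1
            i += 2 if i + 1 < len(bits) and bits[i + 1] == 1 else 1
        else:
            i += 1
    return count


def zero_runs(delta):
    """Maximal runs of zeros in delta as (start, end), 0-indexed and inclusive."""
    runs, start = [], None
    for k, bit in enumerate(delta + [1]):  # sentinel 1 closes a trailing run
        if bit == 0 and start is None:
            start = k
        elif bit == 1 and start is not None:
            runs.append((start, k - 1))
            start = None
    return runs


def certified_rounds(delta, t):
    """(first_round, last_round), 1-indexed, for every eta_j with alpha + beta + gamma >= t."""
    result = []
    for start, end in zero_runs(delta):
        gamma = end - start + 1
        alpha = min_faults(delta[:start - 1]) if start >= 1 else 0  # before 1 eta_j
        beta = min_faults(delta[end + 2:])  # after eta_j 1 (empty at the end)
        if alpha + beta + gamma >= t:
            result.append((start + 1, end + 2))
    return result


delta = [1, 1, 0, 1, 0, 0, 1, 0, 1]  # the ten-round example, t = 4
print(certified_rounds(delta, t=4))  # [(3, 4), (5, 7)]
```

Besides rounds 5 to 7 behind the substring 1001, the same budget $t = 4$ also certifies the single zero at bit 3 (rounds 3 and 4, with $\alpha = 1$, $\beta = 2$, $\gamma = 1$). A run of $t$ zeros with no context is certified as well, which recovers the Shor condition.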
Under the assumed error model, two bit histories are uncorrelated if they are independent of each other. For example, in our case, the circuit-level depolarizing channel is memoryless, and each fault can cause either one or two consecutive ones. Thus, different sections of the same syndrome bit history that are at least two bits apart are uncorrelated, as they are independent in time. Similarly, in space, if there are no shared qubits between two generators, then their syndrome bit histories are completely independent. Also, flag qubits are always reset between rounds of measurements, and thus all flag bits are independent. However, when two stabilizer generators share at least one qubit, their syndrome bit histories are correlated. Similarly, due to hook errors, the bit history of a flag qubit and the syndrome bit history of the corresponding stabilizer generator are correlated.
Our goal is to estimate the number of faults that occurred from a given bit history in the case of flag FTQEC. Estimates from uncorrelated histories can be summed together. When two or more estimates are from correlated histories, the best we can do is to take the maximum of those estimates. Note that the total estimate must never exceed the actual total number of faults that occurred; otherwise, the error correction protocol will not be fault tolerant.
For the estimation in the previous work [38], described earlier in this section, the bits of the syndrome history before and after each substring $1\eta_j 1$ are uncorrelated with the bits within $\eta_j$ under the memoryless depolarizing channel assumption. This means that $\alpha_j$ and $\beta_j$, which are the minimum numbers of faults that can cause the parts before and after $1\eta_j 1$, can be estimated independently. The estimated number of faults in the context outside of the zero substring $\eta_j$ is, therefore, $\alpha_j + \beta_j$.
In this work, we further extend the fault counting idea to flag FTQEC in which flag circuits with a single flag qubit are used for syndrome extraction. Below, we discuss two types of adaptive time decoders with different stop conditions, namely one-tailed and two-tailed adaptive time decoders. Both protocols are applicable to any stabilizer code as long as flag circuits that give a distinguishable fault set can be found for the code. The flag FTQEC protocol with the one-tailed adaptive time decoder satisfies the FTQEC conditions in Definition 4; thus, it is applicable to any fault-tolerant quantum computation as long as the fault-tolerant implementations of the other operations (gates, state preparation, and measurement) also satisfy the revised definition of fault tolerance, which considers the number of faults instead of the weight of the error [35]. Meanwhile, the flag FTQEC protocol with the two-tailed adaptive time decoder does not satisfy the FTQEC conditions in Definition 4, as the output error may correspond to a nontrivial cumulative flag vector; hence, it is only applicable to quantum memory. Nevertheless, for a self-orthogonal CSS code, the FTQEC protocol with the two-tailed adaptive time decoder can be applied to any fault-tolerant Clifford computation if the cumulative flag vector is processed appropriately. An analysis of this extension is given in Section IV C.

Two-tailed adaptive time decoder
For the substring $1\eta_j 1$ in $\vec{\delta}$, suppose that the 1 on the left of $\eta_j$ is the $i_1$-th bit of $\vec{\delta}$ and the 1 on the right of $\eta_j$ is the $i_2$-th bit of $\vec{\delta}$. Let $\alpha_j$, $\beta_j$, $\gamma_j$ be defined as before, and let $\mu_j$ and $\nu_j$ be the total numbers of nonzero flag bits obtained from round 1 to round $i_1$ and from round $i_2 + 1$ onward, respectively. Also, let $\omega_j$ be the sum, over rounds $i_1 + 1$ to $i_2$, of the numbers of nonzero flag bits in excess of one per round. For example, consider the substring $1\eta_j 1 = 1001$ in the example below:

Round: 1 2 3 4 5 6 7 8 9 10
# flag bits: 1 0 2 0 | 0 2 1 | 0 0 1
$\vec{\delta}$: 1 1 0 | 1 0 0 1 | 0 1

In this example, $\alpha_j = 1$, $\beta_j = 1$, $\gamma_j = 2$, $\mu_j = 3$, $\nu_j = 1$, and $\omega_j = 1$. Since a single fault can cause both nontrivial flag bits and syndrome differences (that is, syndrome bits and flag bits are correlated), one has to make sure that the number of faults is not overcounted. The numbers of faults that can cause the bit histories before and after $1\eta_j 1$ are bounded from below by $\hat{\alpha}_j = \max(\alpha_j, \mu_j)$ and $\hat{\beta}_j = \max(\beta_j, \nu_j)$, respectively. So an estimate of the number of faults in the context outside of $\eta_j$ is $\hat{\alpha}_j + \hat{\beta}_j$.
Next, let us consider $\eta_j$ of length $\gamma_j$, which corresponds to $\gamma_j + 1$ consecutive rounds with the same syndrome. Making all syndromes in this region incorrect requires at least one fault per round. So if we find a round with more than one nonzero flag bit, the number of flag bits in excess of one per round can be added to the total estimate. That is, for each $\eta_j$, the total estimate is $\hat{\alpha}_j + \hat{\beta}_j + \omega_j$.
Under the assumption that there are at most $t$ faults in the whole protocol, if we find that there exists $\gamma_j$ such that $t - \hat{\alpha}_j - \hat{\beta}_j - \omega_j < \gamma_j + 1$ (or equivalently, $\hat{\alpha}_j + \hat{\beta}_j + \gamma_j + \omega_j \geq t$), we know that the syndrome of at least one round among the $\gamma_j + 1$ rounds that give rise to $\eta_j$ must be correct.
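The quantities $\hat{\alpha}_j$, $\hat{\beta}_j$, $\gamma_j$, and $\omega_j$ for the flagged example can be reproduced with the sketch below. Positions follow the text ($i_1$ and $i_2$ are 1-indexed bit positions in $\vec{\delta}$, and bit $k$ compares rounds $k$ and $k+1$); the per-round counts of nonzero flag bits are an input we assume is available from the flag measurements, and the function names are our own.

```python
def min_faults(bits):
    """Non-overlapping '11' pairs plus remaining 1s (greedy, left to right)."""
    count, i = 0, 0
    while i < len(bits):
        if bits[i] == 1:
            count += 1
            i += 2 if i + 1 < len(bits) and bits[i + 1] == 1 else 1
        else:
            i += 1
    return count


def two_tailed_estimate(delta, flag_counts, start, end):
    """(alpha_hat, beta_hat, gamma, omega) for the zero run delta[start..end].

    start/end are 0-indexed; flag_counts[k] is the number of nonzero flag bits
    in round k + 1. i1 and i2 are the 1-indexed positions of the bounding 1s.
    """
    gamma = end - start + 1
    i1, i2 = start, end + 2
    alpha = min_faults(delta[:i1 - 1]) if i1 >= 1 else 0
    beta = min_faults(delta[i2:])
    mu = sum(flag_counts[:i1])  # rounds 1 .. i1
    nu = sum(flag_counts[i2:])  # rounds i2 + 1 onward
    omega = sum(max(c - 1, 0) for c in flag_counts[i1:i2])  # rounds i1 + 1 .. i2
    return max(alpha, mu), max(beta, nu), gamma, omega


delta = [1, 1, 0, 1, 0, 0, 1, 0, 1]
flags = [1, 0, 2, 0, 0, 2, 1, 0, 0, 1]
est = two_tailed_estimate(delta, flags, start=4, end=5)  # the substring 1001
print(est, sum(est) >= 4)  # (3, 1, 2, 1) True
```

Here the flag count before the substring ($\mu_j = 3$) dominates the syndrome-based estimate ($\alpha_j = 1$), so the total $\hat{\alpha}_j + \hat{\beta}_j + \gamma_j + \omega_j = 7$ comfortably exceeds $t = 4$.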
Another way to find a correct syndrome is to estimate the total number of faults that can cause the whole syndrome and flag bit histories. Let $N_{11}$ be the total number of non-overlapping 11 sequences in the whole $\vec{\delta}$. Assuming that there are at most $t$ faults in the whole protocol, if $N_{11} \geq t$, the last round must have a correct syndrome.
Suppose that a table-based space decoder for flag FTQEC can be constructed. Then, a flag FTQEC protocol with the two-tailed adaptive time decoder is as follows:

Protocol 2. Flag FTQEC protocol with two-tailed adaptive time decoder
Let $t = \lfloor (d-1)/2 \rfloor$ be the number of errors that a stabilizer code of distance $d$ can correct. Let $\vec{s}_i = (\vec{s}_{i,x}, \vec{s}_{i,z})$ and $\vec{F}_i = (\vec{F}_{i,x}, \vec{F}_{i,z})$ be the syndrome and cumulative flag vector obtained from the $i$-th round of full syndrome measurements with flag circuits. After the $i$-th round with $i \geq 2$, calculate $\delta_{i-1}$. Repeat syndrome measurements until one of the following conditions is satisfied, then perform error correction using the error syndrome corresponding to that condition:
1. For each $\eta_j$ in $\vec{\delta}$, calculate $\hat{\alpha}_j$, $\hat{\beta}_j$, $\gamma_j$, and $\omega_j$. If at least one $\eta_j$ with $\hat{\alpha}_j + \hat{\beta}_j + \gamma_j + \omega_j \geq t$ is found, stop the syndrome measurements. Let $l$ be the last round of the $\gamma_j + 1$ rounds that correspond to $\eta_j$. Perform Z-type error correction using $(\vec{s}_{l,x}, \vec{F}_{l-1,z})$, and perform X-type error correction using $(\vec{s}_{l,z}, \vec{F}_{l,x})$.
2. Calculate $N_{11}$ from the whole syndrome and flag bit histories. If $N_{11} \geq t$, stop the syndrome measurements. Suppose that the latest round is round $l$. Perform Z-type error correction using $(\vec{s}_{l,x}, \vec{F}_{l-1,z})$, and perform X-type error correction using $(\vec{s}_{l,z}, \vec{F}_{l,x})$.
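The bookkeeping behind the two stop conditions of Protocol 2 can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`count_faults`, `n11`, `eta_estimates`, `two_tailed_stop`) are hypothetical; `count_faults` assumes that $\alpha_j$ and $\beta_j$ count non-overlapping 11 sequences plus the remaining 1s (mirroring the definition of $\alpha_{\mathrm{all},x}$ in Section IV C); and the zero runs at the two ends of $\vec{\delta}$ are treated as if bounded by virtual 1s at positions 0 and $|\vec{\delta}| + 1$.

```python
def count_faults(bits):
    """Fault estimate for a bit string: non-overlapping "11" pairs plus
    leftover 1s, via a greedy left-to-right scan (assumed reading of alpha/beta)."""
    count, k = 0, 0
    while k < len(bits):
        if bits[k] == 1:
            count += 1
            # A "11" pair is charged as one fault, as is an isolated 1.
            k += 2 if k + 1 < len(bits) and bits[k + 1] == 1 else 1
        else:
            k += 1
    return count


def n11(bits):
    """Maximum number of non-overlapping "11" substrings (greedy scan)."""
    count, k = 0, 0
    while k + 1 < len(bits):
        if bits[k] == 1 and bits[k + 1] == 1:
            count, k = count + 1, k + 2
        else:
            k += 1
    return count


def eta_estimates(delta, flags):
    """Yield (i1, i2, alpha_hat, beta_hat, gamma, omega) for each zero run eta_j.

    delta[k - 1] is bit k of delta (1-indexed); flags[r - 1] is the number of
    nonzero flag bits in round r, so len(flags) == len(delta) + 1.
    """
    n = len(delta)
    ones = [0] + [k for k in range(1, n + 1) if delta[k - 1] == 1] + [n + 1]
    for i1, i2 in zip(ones, ones[1:]):
        alpha = count_faults(delta[:max(i1 - 1, 0)])      # bits 1 .. i1-1
        beta = count_faults(delta[i2:])                    # bits i2+1 .. n
        mu = sum(flags[:i1])                               # rounds 1 .. i1
        nu = sum(flags[i2:])                               # rounds i2+1 onward
        omega = sum(max(f - 1, 0) for f in flags[i1:i2])   # rounds i1+1 .. i2
        yield i1, i2, max(alpha, mu), max(beta, nu), i2 - i1 - 1, omega


def two_tailed_stop(delta, flags, t):
    """Return the round l whose full syndrome would be used, or None to continue."""
    for _, i2, a_hat, b_hat, gamma, omega in eta_estimates(delta, flags):
        if a_hat + b_hat + gamma + omega >= t:   # condition 1
            return i2                            # last of rounds i1+1 .. i2
    if n11(delta) >= t:                          # condition 2
        return len(delta) + 1                    # latest round
    return None


delta = [1, 1, 0, 1, 0, 0, 1, 0, 1]
flags = [1, 0, 2, 0, 0, 2, 1, 0, 0, 1]
print(list(eta_estimates(delta, flags))[3])  # prints (4, 7, 3, 1, 2, 1)
```

On the worked example in the text, the fourth zero run is the substring 1001, giving $i_1 = 4$, $i_2 = 7$, $\hat{\alpha}_j = \max(1, 3) = 3$, $\hat{\beta}_j = 1$, $\gamma_j = 2$, and $\omega_j = 1$. The sketch checks condition 1 before condition 2, which is an arbitrary choice.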
The two-tailed adaptive time decoder for flag FTQEC developed in this work uses ideas similar to those of the adaptive strong decoder presented in previous work [38].
Therefore, the number of syndrome measurement rounds in the worst-case scenario is $(t+3)^2/4 - 1$ when $t$ is odd and $(t+2)(t+4)/4 - 1$ when $t$ is even. This can be proved by assuming that no fault causes a nonzero flag bit; the rest of the proof then follows that of Theorem 2 in Ref. [38].
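For concreteness, the worst-case formula can be evaluated for the two largest simulated distances and compared with the $(d+1)^2/4$ rounds of Shor's scheme quoted in the Introduction (a simple arithmetic check, taking $t = (d-1)/2$ for odd $d$):

```latex
\begin{aligned}
d = 7,\; t = 3 \ (\text{odd}):&\quad \frac{(t+3)^2}{4} - 1 = \frac{36}{4} - 1 = 8
  \quad\text{vs.}\quad \frac{(d+1)^2}{4} = 16, \\
d = 9,\; t = 4 \ (\text{even}):&\quad \frac{(t+2)(t+4)}{4} - 1 = \frac{48}{4} - 1 = 11
  \quad\text{vs.}\quad \frac{(d+1)^2}{4} = 25.
\end{aligned}
```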
If the syndrome $\vec{s}_l$ and cumulative flag vector $\vec{F}_l = \sum_{i=1}^{l} \vec{f}_i \pmod 2$ of round $l$ are used for error correction, any faults that occurred up to round $l$ will be corrected. However, because round $l$ may correspond to some $\eta_j$ in the middle of $\vec{\delta}$, an output error may correspond to a nontrivial cumulative flag vector. Therefore, Protocol 2 may not satisfy the FTQEC conditions in Definition 4 and cannot be directly applied to fault-tolerant quantum computation. Nevertheless, Protocol 2 is still applicable to a quantum memory. To do so, one needs to pass the remaining cumulative flag vector of the current FTQEC routine (the sum of the flag vectors from round $l + 1$ onward) to the next FTQEC routine and use it as the initial flag vector.
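Splitting the flag history into the part used for correction and the part handed to the next routine is plain modulo-2 addition. A minimal sketch, assuming each round's flag vector $\vec{f}_i$ is stored as a list of bits (the names `xor_sum` and `split_flags` are hypothetical):

```python
from functools import reduce


def xor_sum(vectors, length):
    """Sum of binary vectors modulo 2 (the zero vector if `vectors` is empty)."""
    return reduce(lambda acc, v: [a ^ b for a, b in zip(acc, v)], vectors, [0] * length)


def split_flags(flag_vectors, l):
    """Split per-round flag vectors f_1..f_m (list index i - 1) at round l.

    Returns (F_l, remaining), where F_l = f_1 + ... + f_l and
    remaining = f_{l+1} + ... + f_m, both modulo 2.
    """
    length = len(flag_vectors[0])
    return xor_sum(flag_vectors[:l], length), xor_sum(flag_vectors[l:], length)


rounds = [[1, 0, 0], [0, 1, 0], [1, 1, 1], [0, 0, 1]]
used, remaining = split_flags(rounds, 2)
print(used, remaining)  # prints [1, 1, 0] [1, 1, 0]
```

The next FTQEC routine would then start from `remaining` instead of the zero vector; by construction, `used` and `remaining` add up (mod 2) to the cumulative flag vector of the whole routine.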

One-tailed adaptive time decoder
One-tailed and two-tailed decoders use similar ideas to estimate the number of faults, except that in the one-tailed case, the syndrome and cumulative flag vector used for error correction must come from the very last zero substring in $\vec{\delta}$ (to ensure that the output error satisfies both conditions in Definition 4). Suppose that $\vec{\delta} = \eta_1 1 \eta_2 1 \ldots 1 \eta_c$ for some positive integer $c$, where $\eta_c$ has length $\gamma_c \geq 1$ and the 1 to the left of $\eta_c$ is the $i_1$-th bit of $\vec{\delta}$. We define $\alpha_c$ as usual and define $\mu_c$ to be the total number of nonzero flag bits obtained from round 1 to round $i_1$. Also, we define $\omega_c$ to be the sum of the numbers of flag bits in excess of one bit per round from round $i_1 + 1$ onward. Let $\hat{\alpha}_c = \max(\alpha_c, \mu_c)$. In this case, the total estimate of the number of occurred faults is $\hat{\alpha}_c + \omega_c$.
Assuming that there are at most $t$ faults in the whole protocol, if $\hat{\alpha}_c + \gamma_c + \omega_c \geq t$, at least one of the $\gamma_c + 1$ rounds that give rise to $\eta_c$ must have a correct syndrome. This is the first possible stop condition.
The second possible stop condition is similar to that of the two-tailed decoder. Let $N_{11}$ be the total number of non-overlapping 11 sequences in the whole $\vec{\delta}$. If $N_{11} \geq t$, the last round must have a correct syndrome.
Suppose that a table-based space decoder for flag FTQEC can be constructed. Then, a flag FTQEC protocol with the one-tailed adaptive time decoder is as follows:

Protocol 3. Flag FTQEC protocol with one-tailed adaptive time decoder
Let $t = \lfloor (d-1)/2 \rfloor$ be the number of errors that a stabilizer code of distance $d$ can correct. Let $\vec{s}_i = (\vec{s}_{i,x}, \vec{s}_{i,z})$ and $\vec{F}_i = (\vec{F}_{i,x}, \vec{F}_{i,z})$ be the syndrome and cumulative flag vector obtained from the $i$-th round of full syndrome measurements with flag circuits. After the $i$-th round with $i \geq 2$, calculate $\delta_{i-1}$. Repeat syndrome measurements until one of the following conditions is satisfied:
1. For the last zero substring $\eta_c$ of $\vec{\delta}$ with $\gamma_c \geq 1$, calculate $\hat{\alpha}_c$, $\gamma_c$, and $\omega_c$. If $\hat{\alpha}_c + \gamma_c + \omega_c \geq t$, stop the syndrome measurements.
2. Calculate $N_{11}$ from the whole syndrome and flag bit histories. If $N_{11} \geq t$, stop the syndrome measurements.
Suppose that the latest round when any condition is satisfied is round $l$. Perform Z-type error correction using $(\vec{s}_{l,x}, \vec{F}_{l-1,z})$, and perform X-type error correction using $(\vec{s}_{l,z}, \vec{F}_{l,x})$.

The number of rounds of full syndrome measurements in the worst-case scenario for Protocol 3, which is also the minimum number of rounds required to guarantee that error correction can be done, is given by the following theorem:

Theorem 1. Suppose that the flag circuits used in Protocol 3 give a distinguishable fault set $\mathcal{F}_t$, where $t = \lfloor (d-1)/2 \rfloor$ and $d$ is the distance of the stabilizer code. Performing $\frac{t(t+3)}{2} + 2$ rounds of full syndrome measurements is sufficient to guarantee that Protocol 3 is strongly $t$-fault tolerant; i.e., both conditions in Definition 4 are satisfied.
By induction, one can show that the maximum length of $\vec{\delta}$ for which no stop condition of Protocol 3 is satisfied is $\frac{t(t+3)}{2}$; here, $\vec{\delta}$ is of the form $00\ldots00\cdots$. The number of rounds that gives $\vec{\delta}$ of the maximum length is $\frac{t(t+3)}{2} + 1$. By performing one more round of syndrome measurements, $\vec{\delta}$ is extended by one bit, which must be 0 if the total number of faults is no more than $t$. In that case, $\hat{\alpha}_c + \gamma_c \geq t$ will be satisfied. Therefore, $\frac{t(t+3)}{2} + 2$ rounds of full syndrome measurements are sufficient to guarantee that flag FTQEC can be performed.
Note that there are other forms of $\vec{\delta}$ for which none of $\eta_1$, $\eta_1 1 \eta_2$, $\eta_1 1 \eta_2 1 \eta_3$, …, $\eta_1 1 \eta_2 1 \ldots 1 \eta_c$ satisfies any condition in Protocol 3 and the length of $\vec{\delta}$ is $\frac{t(t+3)}{2}$. For example, for $t = 3$, such forms include 001101011 and 001001111. In any case, one of the conditions in Protocol 3 will be satisfied if one more round of syndrome measurements is done, so the number of rounds needed to guarantee fault tolerance is still $\frac{t(t+3)}{2} + 2$. Note that the number given by Theorem 1 is larger than that of the two-tailed decoder because the one-tailed decoder is not allowed to check whether the syndrome of a round in the middle of $\vec{\delta}$ can be used for error correction.
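The worst case behind Theorem 1 can be probed by brute force for small $t$. The sketch below is our own check, not the authors' proof, and rests on three assumptions: $\alpha_c$ counts the non-overlapping 11 sequences plus remaining 1s among the bits before position $i_1$; flags are ignored ($\mu_c = \omega_c = 0$), since nonzero flags can only raise $\hat{\alpha}_c$ and $\omega_c$ and hence stop the protocol earlier; and histories that would need more than $t$ faults under that count are discarded. The function names are hypothetical. It searches for the longest $\vec{\delta}$ on which no stop condition of Protocol 3 has fired.

```python
def count_faults(bits):
    """Fault estimate: non-overlapping "11" pairs plus leftover 1s (greedy scan)."""
    count, k = 0, 0
    while k < len(bits):
        if bits[k] == 1:
            count += 1
            k += 2 if k + 1 < len(bits) and bits[k + 1] == 1 else 1
        else:
            k += 1
    return count


def n11(bits):
    """Maximum number of non-overlapping "11" substrings (greedy scan)."""
    count, k = 0, 0
    while k + 1 < len(bits):
        if bits[k] == 1 and bits[k + 1] == 1:
            count, k = count + 1, k + 2
        else:
            k += 1
    return count


def one_tailed_stops(delta, t):
    """Protocol 3 stop conditions for a flag-free history (mu_c = omega_c = 0)."""
    if n11(delta) >= t:                       # condition 2
        return True
    if delta and delta[-1] == 0:              # eta_c has length gamma_c >= 1
        last_one = max((k for k, b in enumerate(delta) if b == 1), default=-1)
        gamma_c = len(delta) - 1 - last_one   # number of trailing zeros
        alpha_c = count_faults(delta[:max(last_one, 0)])  # bits before i_1
        return alpha_c + gamma_c >= t         # condition 1
    return False


def max_unstopped_length(t):
    """Length of the longest delta on which no stop condition has fired,
    among histories explainable by at most t faults (depth-first search)."""
    best, stack = 0, [[]]
    while stack:
        delta = stack.pop()
        best = max(best, len(delta))
        for bit in (0, 1):
            ext = delta + [bit]
            if count_faults(ext) <= t and not one_tailed_stops(ext, t):
                stack.append(ext)
    return best


for t in (2, 3):
    print(t, max_unstopped_length(t), t * (t + 3) // 2)  # prints 2 5 5, then 3 9 9
```

Under these assumptions, the search returns $t(t+3)/2$ for $t = 2$ and $t = 3$, so one further round after the $t(t+3)/2 + 1$ rounds, i.e., $t(t+3)/2 + 2$ rounds in total, forces a stop. The two $t = 3$ examples in the text, 001101011 and 001001111, are among these longest unstopped histories.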
An advantage of the FTQEC protocol with the one-tailed adaptive time decoder is that it is applicable to any kind of fault-tolerant quantum computation, as long as the corresponding fault-tolerant implementation satisfies the revised definitions of fault tolerance, which consider the number of faults instead of the weight of errors [35]. This is possible because, when the syndrome and cumulative flag vector for error correction are taken from the last zero substring in $\vec{\delta}$, the output error is guaranteed to correspond to a zero cumulative flag vector.

C. Extended techniques for CSS codes
In this section, we discuss two additional techniques that can further improve our flag FTQEC protocols with adaptive time decoding. The first technique is separated X and Z counting, which is applicable to any CSS code. This technique is based on ideas from Refs. [37,38] and can be used to improve the pseudothreshold. The main difference from the technique developed in Ref. [38] is that this work also uses flag information to estimate the number of occurred faults, so the procedure for obtaining a syndrome for error correction terminates sooner. The second technique is the classical processing of the remaining cumulative flag vector. This technique makes our flag FTQEC protocol with the two-tailed adaptive time decoder applicable to any fault-tolerant Clifford computation.

Separated X and Z counting
For any CSS code, Z-type and X-type errors can be corrected separately. It is possible to reduce the number of measurements by separating the X-type and Z-type syndrome measurement rounds (which correspond to X-type and Z-type stabilizer generators). In this section, we introduce the XZ and ZX decoding strategies. In the XZ strategy, we first execute a time decoder (which can be the Shor, one-tailed, or two-tailed decoder) using only the X-type syndromes. The difference vector for this process is denoted by $\vec{\delta}_x$. After the decoder returns the X-type syndrome and the cumulative flag vectors for Z-type error correction, we estimate the number of faults $t_x$ that could cause $\vec{\delta}_x$: we define $\alpha_{\mathrm{all},x}$ to be the total number of non-overlapping 11 sequences plus the total number of remaining 1s in $\vec{\delta}_x$, define $\mu_{\mathrm{all},x}$ to be the total number of nontrivial flag bits obtained in the rounds that produce $\vec{\delta}_x$, and let $t_x = \max(\alpha_{\mathrm{all},x}, \mu_{\mathrm{all},x})$.

Given that we spend this number of faults from our fault budget $t$, we can reduce the target number of faults in the stop condition for the Z-type syndrome measurements. Afterward, we run a time decoder for Z-type syndromes with the target number of faults $t_z = t - t_x$. The ZX strategy is similar to the XZ strategy, except that the Z-type generators are measured first. When the separated X and Z counting technique is applied to a flag FTQEC protocol, syndromes for Z-type and X-type error correction can be found faster than with a conventional method in which the target number of faults for both types of error correction is $t$. However, a drawback is that the flag FTQEC protocol will only be compatible with quantum memory. This is because each type of error correction requires flag information of the opposite type. In particular, suppose that the time decoder for X-type syndrome measurements gives syndrome $\vec{s}_x$ and cumulative flag vector $\vec{F}_x$, and the time decoder for Z-type syndrome measurements gives syndrome $\vec{s}_z$ and cumulative flag vector $\vec{F}_z$. Z-type error correction will be done by applying a space decoder to $\vec{s}_x$ and the zero cumulative flag vector, while X-type error correction will be done by applying a space decoder to $\vec{s}_z$ and $\vec{F}_z$. The cumulative flag vector $\vec{F}_x$, which has not been used, will be treated as the remaining cumulative flag vector of the current FTQEC routine and used as an initial flag vector for Z-type error correction in the next FTQEC routine.

Remaining cumulative flag vector of the current FTQEC routine | Logical Clifford operation | Initial flag vector of the next FTQEC routine
TABLE II: A list of required classical processing operations on the remaining cumulative flag vector in case a logical Clifford gate is performed between two FTQEC routines. With these operations, a flag FTQEC protocol with the two-tailed adaptive time decoder or separated X and Z counting is applicable to any fault-tolerant Clifford computation, given that the CSS code is self-orthogonal.
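The fault-budget split of the XZ strategy takes only a few lines. A minimal sketch, assuming $\vec{\delta}_x$ is a list of bits and the flag information from the X-type rounds is given as per-round counts of nontrivial flag bits; the function name `z_fault_budget` and the clamping of $t_z$ at zero are our own choices, not taken from the text:

```python
def count_faults(bits):
    """Non-overlapping "11" sequences plus remaining 1s (greedy scan)."""
    count, k = 0, 0
    while k < len(bits):
        if bits[k] == 1:
            count += 1
            k += 2 if k + 1 < len(bits) and bits[k + 1] == 1 else 1
        else:
            k += 1
    return count


def z_fault_budget(delta_x, flags_x, t):
    """Return (t_x, t_z) for the XZ strategy.

    delta_x: difference vector of the X-type rounds (list of bits).
    flags_x: number of nontrivial flag bits in each X-type round.
    """
    alpha_all_x = count_faults(delta_x)   # alpha_{all,x}
    mu_all_x = sum(flags_x)               # mu_{all,x}
    t_x = max(alpha_all_x, mu_all_x)
    return t_x, max(t - t_x, 0)           # clamping at 0 is a safeguard (our choice)


print(z_fault_budget([1, 1, 0, 1], [0, 1, 0, 0, 0], 4))  # prints (2, 2)
```

Here $\vec{\delta}_x = 1101$ with one nontrivial flag bit gives $t_x = \max(2, 1) = 2$, so the Z-type time decoder runs with target $t_z = 2$ out of a budget of $t = 4$.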

Classical processing of the remaining cumulative flag vector
One drawback of a flag FTQEC protocol that uses the two-tailed adaptive time decoder or the separated X and Z counting technique is that it is only applicable to a quantum memory, not a general fault-tolerant quantum computation.This is because the output error at the end of each FTQEC routine may correspond to a nontrivial cumulative flag vector.To correct such an error, one needs to pass the flag information from each FTQEC routine (the remaining cumulative flag vector) to the next FTQEC routine.However, if there is some quantum computation between two FTQEC routines (as in an extended rectangle [6]), the error will be transformed and may not be correctable if the corresponding flag information is not processed properly.
Nevertheless, for any self-orthogonal CSS code, a flag FTQEC protocol with the two-tailed adaptive time decoder or separated X and Z counting (or both) can be made applicable to any fault-tolerant Clifford computation. For example, consider the application of a logical Hadamard gate $\overline{H}$ between two FTQEC routines. Suppose that the first FTQEC routine produces an output error $E_x \cdot E_z$ and that the remaining cumulative flag vector is $(\vec{F}_x, \vec{F}_z)$. Without the logical Hadamard gate, $E_x$ and $E_z$ can be corrected using $\vec{F}_z$ and $\vec{F}_x$, respectively. A logical Hadamard gate transforms an X-type error into a Z-type error of the same form, and vice versa. Because the X-type and Z-type generators are of the same form, the possible fault combinations for both types of errors are also of the same form. To correct the transformed error $\overline{H}(E_x \cdot E_z)\overline{H}^\dagger$ in the second FTQEC routine, one needs to swap the X-type and Z-type cumulative flag vectors; that is, the initial flag vector for the second FTQEC routine must be $(\vec{F}_z, \vec{F}_x)$.
We can apply similar ideas for flag information processing to logical S and logical CNOT gates. A summary of the classical processing operations for logical H, S, and CNOT gates is provided in Table II. Because {H, S, CNOT} generates the Clifford group, a flag FTQEC protocol with the two-tailed adaptive time decoder or separated X and Z counting is applicable to any fault-tolerant Clifford computation, given that the CSS code is self-orthogonal. Note that magic state distillation and injection [56,57] use only Clifford operations apart from the input magic states. Thus, our techniques are also applicable to fault-tolerant universal quantum computation, given that high-fidelity magic states are provided.
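The Hadamard rule described above can be tracked classically, much like a Pauli frame. A minimal sketch with a hypothetical `FlagFrame` class; only the $\overline{H}$ rule stated in the text is implemented, since the S and CNOT rules of Table II are not spelled out in the prose:

```python
from dataclasses import dataclass


@dataclass
class FlagFrame:
    """Remaining cumulative flag vector (F_x, F_z) handed to the next FTQEC routine."""
    f_x: tuple
    f_z: tuple

    def apply_logical_h(self):
        # On a self-orthogonal CSS code, logical H turns X-type errors into Z-type
        # errors of the same form (and vice versa), so the two vectors swap roles.
        self.f_x, self.f_z = self.f_z, self.f_x


frame = FlagFrame(f_x=(1, 0, 1), f_z=(0, 1, 1))
frame.apply_logical_h()
print(frame)  # prints FlagFrame(f_x=(0, 1, 1), f_z=(1, 0, 1))
```

Applying the swap twice restores the original frame, matching $\overline{H}^2 = I$.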

A. Methods
Our optimization tools for space and time decoders, including the compact lookup table construction, the MIM technique, and the adaptive time decoders for flag FTQEC, are applicable to any stabilizer code. However, we focus on a specific family of codes for which the aforementioned tools can be simplified and the extended techniques, including separated X and Z counting and classical processing of flag information, are applicable: the family of self-orthogonal CSS codes in which the number of physical qubits is odd, the number of logical qubits is 1, and the logical X and Z operators are transversal. To evaluate the performance of our tools, we simulate FTQEC protocols on the $[[(3d^2+1)/4, 1, d]]$ hexagonal color codes [41] of distances 3, 5, 7, and 9. These codes are planar topological codes with the configurations displayed in Fig. 6. For each code, stabilizer generators are measured using syndrome extraction circuits with a single flag ancilla, as depicted in Fig. 4. It has been proven that, for the hexagonal color code of any distance, flag circuits of this form preserve the code distance regardless of the gate ordering [32,35]. The simulation is implemented under the circuit-level depolarizing noise model specified in Section II A. As there is no idling noise in our error model, the syndromes can be extracted sequentially.
To construct a lookup table for space decoding and to verify that our circuit configurations preserve the code distance, we implement the algorithm described in Section III A in C++. The verification times and the statistics of the lookup tables can be found in Table I. Because the required time is short, the lookup tables for these codes can be generated on the fly before sampling starts.
Here we simulate the storage (i.e., the result of the logical identity operation) of the logical state $|\overline{0}\rangle$. We use the Pauli frame simulator in Stim [58] to collect measurement samples and use Cirq [59] to construct the circuits with the given noise model. After a perfect preparation of $|\overline{0}\rangle$, we perform noisy error correction and recovery. In the error correction process, full rounds of syndrome measurements are repeated until the stop condition of the time decoder is satisfied. The time decoder returns an accepted full syndrome (consisting of an error syndrome and a cumulative flag vector), and the space decoder then determines the recovery operation based on the accepted full syndrome. This recovery operation is applied to the data qubits afterward. Finally, we apply an ideal error correction and determine whether the output error is a logical X error (which corresponds to having $|\overline{1}\rangle$ as the output state).

FIGURE 6: The studied members of the hexagonal color code, for distances 3, 5, 7, and 9 (right to left). Qubits are on the vertices and stabilizer generators are the plaquettes. As the codes are self-orthogonal CSS codes, both the X and Z stabilizer generators are described by the same layout.

B. The overall effect of optimization tools
We first compare two protocols: (1) the FTQEC protocol with the Shor time decoder without the MIM technique (the protocol in which none of our optimization tools are applied), and (2) the FTQEC protocol with the MIM technique and the two-tailed adaptive time decoder with the ZX strategy (the best FTQEC protocol in this work that is compatible with any Clifford computation on a self-orthogonal CSS code). The logical error rate $p_L$ versus the physical error rate $p$ for hexagonal color codes of distances 3, 5, 7, and 9 is plotted in Fig. 7. Our results show that, for each code, applying the optimization tools significantly improves the pseudothreshold (the intersection between each plot and the $p_L = 2p/3$ line). Furthermore, the optimized decoder yields orders-of-magnitude improvements in the logical error rate in the $p = 10^{-4}$ regime.
Under a noise model parameterized by a single parameter $p$, the fault-tolerant threshold $p_{\mathrm{th}}$ is the error probability below which the logical error rate is guaranteed to decrease with increasing code distance for a specific code family and decoder. Our decoders can yield a $p_{\mathrm{th}}$ for concatenated code families using a level-by-level decoder, but they will not yield a threshold for topological code families, for two reasons. The practical reason is that our lookup-table space decoder is not scalable to the large-$d$ limit. The fundamental reason is that, when $d$ is large, the time decoder will almost always see a $\vec{\delta}$ in which all bits are one, because $\delta_j$ for each round is 1 with a probability exponentially close to 1 for finite $p$. The space decoder then acts on the final state but lacks the information about correlations needed to correct it properly. This is why an efficient space-time decoder is critical for achieving $p_{\mathrm{th}}$ for topological codes.
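A rough estimate makes the "exponentially close to 1" statement concrete. Assume (a simplification, not the paper's noise analysis) that one round contains $N(d)$ fault locations, each failing independently with probability $p$; that any fault flips $\delta_j$; and that each of the $n = (3d^2+1)/4$ data qubits takes part in at least one gate per round, so that $N(d) \geq n$. Using $\ln(1-p) \leq -p$:

```latex
\Pr[\delta_j = 0] \approx (1-p)^{N(d)} \le e^{-p\,N(d)} \le e^{-p\,(3d^2+1)/4}
\quad\Longrightarrow\quad
\Pr[\delta_j = 1] \gtrsim 1 - e^{-p\,(3d^2+1)/4} \xrightarrow{\;d \to \infty\;} 1 .
```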
We can define an effective threshold $\tilde{p}_{\mathrm{th}}$ as the error rate below which increasing the code distance improves the logical error rate for this finite set of codes. The optimized protocol yields $\tilde{p}_{\mathrm{th}} = 1.5 \times 10^{-3}$, while the unoptimized protocol yields $\tilde{p}_{\mathrm{th}} = 4.5 \times 10^{-5}$. We also note that the crossing point between the codes of distances $d$ and $d - 2$ drops quickly with the unoptimized decoder, while it is stable for the optimized decoder over this code set. Table III summarizes the effects of different optimization tools on the pseudothreshold of the $d = 9$ color code. In the next sections, we further discuss the contribution of each technique to this improvement.

FIGURE 7: The upper plot shows the curve of logical error rate $p_L$ versus physical error rate $p$ for the hexagonal color code family without any of our optimization techniques, using the Shor time decoder without MIM. The lower plot uses the best-performing combination of our techniques, including MIM and the two-tailed adaptive time decoder with the ZX strategy. Pseudothresholds for each curve (the error rate $p_{\mathrm{th}}$ that gives $p_L(p_{\mathrm{th}}) = 2p_{\mathrm{th}}/3$) are included in the labels and marked with vertical lines. The data points represent the number of logical errors divided by the total number of samples at each physical error rate $p$ and thus estimate the true logical error rates, which should lie within the shaded areas with high confidence. The dotted helper lines, $\alpha p^{t+1}$ with $\alpha = \frac{2}{3} p_{\mathrm{th}}^{-t}$ retroactively calculated for each curve from its pseudothreshold, show good agreement with distance preservation.
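The dotted helper lines in Fig. 7 can be reproduced from a pseudothreshold alone. A minimal sketch (the function name `helper_line` is hypothetical) that evaluates $\alpha p^{t+1}$ with $\alpha = \frac{2}{3} p_{\mathrm{th}}^{-t}$ for the distance-9 pseudothresholds quoted in the abstract; the output is an extrapolation of the helper-line model, not simulation data:

```python
import math


def helper_line(p, p_pseudo, t):
    """Helper line alpha * p**(t + 1) with alpha = (2/3) * p_pseudo**(-t),
    so that it meets the p_L = 2p/3 line exactly at p = p_pseudo."""
    alpha = (2.0 / 3.0) * p_pseudo ** (-t)
    return alpha * p ** (t + 1)


t = 4                                # d = 9, t = (d - 1) // 2
p_opt, p_unopt = 1.42e-3, 1.34e-4    # distance-9 pseudothresholds (abstract)
assert math.isclose(helper_line(p_opt, p_opt, t), 2 * p_opt / 3)

# Ratio of the two extrapolated logical error rates at p = 1e-4.
ratio = helper_line(1e-4, p_unopt, t) / helper_line(1e-4, p_opt, t)
print(f"{ratio:.3g}")  # prints 1.26e+04
```

Analytically, the ratio equals $(p_{\mathrm{th}}^{\mathrm{opt}}/p_{\mathrm{th}}^{\mathrm{unopt}})^{t}$, roughly $1.3 \times 10^4$ here; this is a property of the helper lines only.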

C. The effect of the Meet-in-the-Middle technique
In this section, we evaluate the performance of simulated storage using the space decoder with and without the MIM technique. We explore the effect for distances 3, 5, 7, and 9 and compare the effect when the time decoder is the Shor, one-tailed, or two-tailed time decoder. We observe a significant decrease in logical error rates and an improvement in the pseudothreshold when the MIM technique is applied. We also find that the benefit increases with the code distance. In Fig. 8, we show the improvement for the code of distance 9, where the benefit is the largest. The results for codes of other distances are provided in Figs.

D. The effect of the adaptive time decoders
In this section, we compare the performance of the simulated storage numerical experiments that use different time decoders when the MIM technique is applied.The results are displayed in Fig. 9 for the hexagonal code of distance 9, and we refer the reader to Fig. 17 in Appendix C for the results for the codes of other distances.
For the code of distance 9, in comparison with the Shor time decoder, the one-tailed adaptive time decoder improves the pseudothreshold by 40%, from $(2.79 \pm 0.07) \times 10^{-4}$ to $(3.91 \pm 0.07) \times 10^{-4}$. The two-tailed method achieves a pseudothreshold of $(5.96 \pm 0.71) \times 10^{-4}$, more than a 100% increase compared to the Shor time decoder. However, the gain of the one-tailed decoder vanishes at lower error rates: the performances of the Shor and one-tailed decoders become similar at around $p = 10^{-4}$. This is not surprising, as we expect all adaptive time decoders to converge to the Shor time decoder at lower error rates. The main reason for this convergence is that the performance gains of the adaptive techniques come from a decrease in the average number of syndrome measurement rounds, and this decrease converges to zero at low error rates. How fast the decrease converges does matter: in contrast to the one-tailed approach, the two-tailed time decoder preserves its performance gain over the Shor time decoder down to the lowest observed error rate of $5 \times 10^{-5}$.

FIGURE 8: The effect of the MIM technique on different time decoders at distance 9. The effect is the largest for the Shor time decoder, more than doubling the pseudothreshold. The MIM technique also gives a significant improvement of at least 70% for the adaptive time decoders.
We also provide plots of the average numbers of full rounds of measurements for all decoders. In the low-error-rate regime, all decoders have the same minimum number of measurement rounds, $t + 1$, which corresponds to the case in which all bits in the difference vector are zeros. The separation is clearer when the physical error rate is in the $10^{-3}$ range: the two-tailed time decoder requires the fewest rounds, followed by the one-tailed decoder, while the Shor time decoder requires the most. In the high-error-rate regime, all bits in the difference vector tend to be ones. In this case, the Shor time decoder requires $(t+1)^2$ rounds, while both the one-tailed and two-tailed decoders require $2t + 1$ rounds.
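The $2t + 1$ count follows from the $N_{11}$ stop condition: an all-ones $\vec{\delta}$ of length $m$ contains $\lfloor m/2 \rfloor$ non-overlapping 11 sequences, and $m$ bits of $\vec{\delta}$ require $m + 1$ rounds. For example, for $d = 9$:

```latex
N_{11}\bigl(\underbrace{1\,1 \cdots 1}_{m}\bigr) = \left\lfloor \frac{m}{2} \right\rfloor \ge t
\iff m \ge 2t
\;\Longrightarrow\; m + 1 = 2t + 1 \text{ rounds};
\qquad t = 4:\; 2t + 1 = 9 \ \text{vs.}\ (t+1)^2 = 25 .
```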

E. The effect of the separated X and Z counting technique
In this section, we observe the performance gains when the separated X and Z counting technique is applied.
Here we compare the FTQEC protocols that use the two-tailed adaptive time decoder with joint X and Z generator measurements (as in Section IV B 2), the two-tailed adaptive time decoder with the XZ strategy, and the two-tailed adaptive time decoder with the ZX strategy (as in Section IV C 1). The logical error rate is calculated from the number of samples in which the output error is a logical X error. The $p_L$ versus $p$ plots for the code of distance 9 are shown in Fig. 10 (the results for codes of other distances can be found in Fig. 18 in Appendix C).
In terms of the pseudothreshold, we observe that the decoder with separated X and Z counting performs best when Z-type generators are measured before X-type generators. Compared to the two-tailed decoder with joint X and Z generator measurements, the separated two-tailed ZX decoder improves the pseudothreshold by 140%, from $(5.96 \pm 0.71) \times 10^{-4}$ to $(1.44 \pm 0.20) \times 10^{-3}$. This is mainly because measuring generators of the first type (X or Z) requires more rounds, so these measurements are more likely to cause correlated errors of the same type as the generators being measured (which are more difficult to correct than uncorrelated errors, since they require flag information). Because our simulations measure the performance of storing the logical $|0\rangle$ state (so that logical X errors are counted), the decoder that measures X-type generators first performs worse. We also observe no significant difference between the two-tailed decoder with joint measurements and the two-tailed decoder with the XZ strategy.
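The relative improvements quoted here and in Section V D follow directly from the reported pseudothresholds, and the overall distance-9 comparison from the abstract can be checked the same way. The sketch below recomputes them; the error bars use first-order propagation assuming independent uncertainties, which is our assumption rather than the authors' stated method (values in units of $10^{-4}$):

```python
import math


def ratio_with_error(new, new_err, old, old_err):
    """Return new/old and its 1-sigma error from first-order propagation,
    assuming the two uncertainties are independent."""
    r = new / old
    return r, r * math.hypot(new_err / new, old_err / old)


# Pseudothresholds for d = 9 in units of 1e-4: ((new, err), (old, err)).
comparisons = {
    "one-tailed vs Shor (both with MIM)": ((3.91, 0.07), (2.79, 0.07)),
    "two-tailed vs Shor (both with MIM)": ((5.96, 0.71), (2.79, 0.07)),
    "two-tailed ZX vs two-tailed joint": ((14.4, 2.0), (5.96, 0.71)),
    "all tools vs no tools": ((14.2, 1.2), (1.34, 0.01)),
}
for name, ((a, da), (b, db)) in comparisons.items():
    r, dr = ratio_with_error(a, da, b, db)
    print(f"{name}: x{r:.2f} +/- {dr:.2f}")
```

The ratios come out at about 1.40, 2.14, 2.42, and 10.6, matching the quoted 40%, more-than-100%, 140%, and more-than-tenfold improvements.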
We also provide plots of the average number of full rounds of measurements for all decoders (where a full round of single-type generator measurements is counted as half a round of total measurements). In the low-error-rate regime, all decoders require $t + 1$ rounds. For the original two-tailed decoder, the average number of rounds increases with the physical error rate and reaches $2t + 1$ rounds in the high-error-rate regime. For both two-tailed decoders with separated X and Z counting, the average number of rounds increases near the pseudothreshold, dips after the pseudothreshold, and reaches $t + 1$ rounds in the high-error-rate regime. The dips arise because the measurements of generators of the first type (either X or Z) can stop at fewer than $(2t+1)/2$ rounds while the estimate of the number of occurred faults is already $t$, which then causes the measurements of generators of the second type to stop at $1/2$ round. In the high-error-rate regime, the decoders with separated X and Z counting require $t + 1$ rounds, since measuring generators of the first type requires $(2t+1)/2$ rounds while measuring generators of the second type requires $1/2$ round on average. Overall, the decoder that measures Z-type generators first performs better than the decoder that measures X-type generators first.

FIGURE 10: Logical error rates of the two-tailed time decoder with XZ and ZX strategies in comparison with the two-tailed adaptive time decoder with joint X and Z measurements (left) and the corresponding average numbers of rounds (right) for the hexagonal color code of distance 9.

VI. DISCUSSIONS AND CONCLUSIONS
In this work, we focus on flag FTQEC with lookup table decoding and on improvements to a decoder consisting of a time decoder and a space decoder. For the space decoder, we first develop a technique to build the lookup table more efficiently in Section III A. With our construction method, the lookup table for a self-orthogonal CSS code requires at least 87.5% less memory than the lookup table for a generic stabilizer code. The construction method also verifies the distinguishability of the fault set corresponding to the flag circuits for syndrome measurements. Our construction also leads to the notion of the fault code, a linear code corresponding to the faults under circuit-level noise, which simplifies the verification of the distance of the protocol. More efficient decoding schemes for the fault code could be an interesting avenue to explore in future work.
Another optimization tool for space decoding is the MIM technique in Section III B, which could improve decoding accuracy when the number of faults in the protocol is greater than t (where t = ⌊(d−1)/2⌋ for the code of distance d).The effect of the MIM technique on the simulated storage of the hexagonal color codes is discussed in Section V C (see also Fig. 14).We find that for any kind of time decoder, the logical error rates are reduced, and the pseudothresholds are improved when applying the MIM technique, with greater improvements at larger distances.
For the time decoder, we generalize the adaptive syndrome measurement technique from previous work [38] (which is applicable to Shor-style error correction [1]) to flag FTQEC and develop one-tailed and two-tailed adaptive time decoders in Section IV B. For a general stabilizer code for which flag FTQEC is possible, the one-tailed decoder is preferable, as it is compatible with any fault-tolerant quantum computation, while the two-tailed decoder is applicable to quantum memory only. Nevertheless, for self-orthogonal CSS codes, with the help of the classical processing technique on cumulative flag vectors developed in Section IV C, the two-tailed decoder is applicable to any fault-tolerant computation built from Clifford gates and T gates applied by gate teleportation using high-fidelity magic states. The effect of the adaptive time decoders on the simulated storage is discussed in Section V D. We observe that our adaptive time decoders can improve the pseudothresholds compared to the non-adaptive (Shor) time decoder while preserving the code distance. The two-tailed decoder also outperforms the one-tailed decoder.
The two-tailed adaptive decoder without MIM in this work is similar to the adaptive strong decoder in previous work [38], except that this work uses flag circuits instead of syndrome extraction circuits with cat states. The numerical results show that using flag circuits results in a 20–35% increase in the pseudothreshold for the hexagonal color codes of distances 3, 5, 7, and 9. This is mainly because flag circuits have fewer state preparation and qubit measurement locations, although they have more gates. The previous work [38] also assumes fault-tolerant preparation of cat states, which requires verification [1] or ancilla decoding circuits [26] that can result in higher space and time overhead. Thus, the pseudothresholds of the cat-state scheme could be even lower if these additional requirements were also taken into account. It should be noted that flag circuits may not outperform syndrome extraction circuits with cat states in general, as flag FTQEC for other codes may require more complicated flag circuits.
We can further improve the performance of adaptive time decoders on self-orthogonal CSS codes by using the separated X and Z counting technique described in Section IV C. Here, we estimate the number of faults that occurred during the measurement of generators of the first type (either X or Z) and then use that information in the measurement of generators of the second type. The effect of this technique can be found in Section V E. When the logical |0⟩ state is stored, we find that the protocol that measures Z-type generators before X-type generators performs best. We see no significant difference between the protocol that measures X-type generators before Z-type generators and the protocol that measures X-type and Z-type generators jointly. Thus, separated X and Z counting provides an advantage only for certain input states, depending on the measurement order.
Combining all techniques, we find a significant improvement in the pseudothreshold while the code distance is still preserved. For example, for the hexagonal color code of distance 9, the pseudothreshold increases from $(1.34 \pm 0.01) \times 10^{-4}$ to $(1.42 \pm 0.12) \times 10^{-3}$. We also find that, in comparison with the unoptimized decoder, the crossing points between the codes of distances $d$ and $d - 2$ lie much closer together when all techniques are applied (as shown in Fig. 7), leading to a higher effective threshold $\tilde{p}_{\mathrm{th}}$ for this set of codes.
While our techniques are applicable to a broader family of codes, it would be interesting to see how our results compare with other works that study error decoding on the hexagonal color codes under circuit-level noise. For example, Baireuther et al. [60] reported a pseudothreshold above $2 \times 10^{-3}$ (against $p_L = p$ instead of $p_L = 2p/3$) with a neural-network decoder, which also preserves the code distance empirically. However, they also reported that training decoders for $d > 7$ became too expensive. By adapting efficient color-code decoding algorithms known as the restriction decoder [61] and the projection decoder [51], Chamberland et al. [32] and Beverland, Kubica, and Svore [24] reported threshold values of $2 \times 10^{-3}$ and $3.7 \times 10^{-3}$, respectively. The difference between the threshold values is mostly due to different choices of syndrome extraction circuits: for each weight-six stabilizer generator, Ref. [32] used three flag qubits for connectivity considerations, while Ref. [24] did not use any flag qubits. However, both the restriction decoder and the projection decoder can only correct up to $d/3$ errors (see Fig. 15 in Sahay and Brown [62] for example failure modes) on the color code family considered in this paper¹. Recent preprints report distance-losing schemes that decode the color code with even higher thresholds of $4.7 \times 10^{-3}$ [25] and between $5 \times 10^{-3}$ and $7 \times 10^{-3}$ [63] without using flag qubits.
In contrast to the constructions that use the restriction decoder [32] and the projection decoder [24], our adaptive decoding method preserves the code distance (although the lookup table does not scale to codes with larger distances). We therefore expect our method to become advantageous for the codes of interest when the physical error rate is below a certain value. However, the noise models in Refs. [24,25,32,60,63] also include idling noise, while our noise model does not. Sequential syndrome extraction is expected to perform poorly in architectures where idling noise is dominant (see Appendix B for an analysis of the [[7,1,3]] code). To improve performance, our methods need to be combined with optimized schedules specific to the given code family. CNOT schedule optimization is involved: it requires enumerating the valid CNOT schedules that satisfy basic constraints and finding the best-performing one by exhaustive search, similar to how Beverland, Kubica, and Svore [24] found a well-performing schedule for hexagonal color codes with bare ancillas. It therefore remains an open question in which error regime our flag qubit-based adaptive methods are advantageous compared with the non-distance-preserving decoders. Answering it will require evaluation with code-specific optimizations under idling noise of different strengths, which we leave for future work.
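A minimal Python sketch of the enumerate-then-search idea follows. It models only the most basic constraint, namely that no data qubit takes part in two CNOTs in the same time step when several generators are measured in parallel. The functions `conflict_free`, `valid_schedules`, and `best_schedule` are hypothetical, and the cost in the usage example is a placeholder; in practice each candidate would be ranked by its simulated logical error rate.

```python
from itertools import permutations, product


def conflict_free(orders):
    """True if no data qubit appears in two CNOTs within the same time step."""
    depth = max((len(o) for o in orders), default=0)
    for step in range(depth):
        touched = [o[step] for o in orders if step < len(o)]
        if len(touched) != len(set(touched)):
            return False
    return True


def valid_schedules(generators):
    """Yield every combination of per-generator CNOT orders that is conflict-free.

    generators: tuples of data-qubit indices, one per generator measured in
    parallel; each generator's ancilla touches one data qubit per time step.
    """
    for orders in product(*(permutations(g) for g in generators)):
        if conflict_free(orders):
            yield orders


def best_schedule(generators, cost):
    """Exhaustive search for the valid schedule with the lowest cost."""
    return min(valid_schedules(generators), key=cost, default=None)


gens = [(0, 1), (1, 2)]
print(list(valid_schedules(gens)))  # [((0, 1), (1, 2)), ((1, 0), (2, 1))]
# Placeholder cost; a real search would rank schedules by simulated logical error rate.
print(best_schedule(gens, cost=lambda s: s[0][0]))  # ((0, 1), (1, 2))
```

The number of candidates is the product of $w!$ over the generators of weight $w$, so a brute-force search like this one is practical only for small codes or after symmetry reductions.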
Hierarchical decoding approaches also provide an interesting avenue to explore with lookup table-based and adaptive techniques [64,65]. We conjecture that our techniques may yield efficient pre-decoders. The lookup tables and the adaptive syndrome algorithms would have to be restricted to local sections of topological codes or to sparsely connected modules of other codes. Then, when the lookup table decoders cannot decode the local problems, a more expensive and more accurate decoder can attempt to decode the nonlocal problem.
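To make the pre-decoding idea concrete, the sketch below uses a toy repetition code: small radius-limited tables decode each local region, and any syndrome missing from a local table is handed to a more expensive decoder. The helpers `rep_code_table`, `hierarchical_decode`, and `decode_globally` are hypothetical, the regions are assumed disjoint, and a larger table merely stands in for the global decoder.

```python
from itertools import combinations


def rep_code_table(n, radius):
    """Lookup table for an n-qubit repetition code with checks Z_i Z_{i+1}.

    Maps the syndrome of every bit-flip pattern of weight <= radius to a
    correction; lower-weight patterns are inserted first and kept.
    """
    table = {}
    for w in range(radius + 1):
        for flips in combinations(range(n), w):
            syn = tuple(int((i in flips) != (i + 1 in flips)) for i in range(n - 1))
            table.setdefault(syn, frozenset(flips))
    return table


def hierarchical_decode(syndrome, regions, global_decoder):
    """Pre-decode region by region with small tables; defer on any miss.

    regions: list of (indices, table) pairs; indices select the syndrome bits
    of one local region (regions assumed disjoint), and table maps that local
    syndrome to a correction. Returns (correction, "local" or "global").
    """
    correction = set()
    for indices, table in regions:
        local = tuple(syndrome[i] for i in indices)
        if local not in table:
            return global_decoder(syndrome), "global"
        correction |= table[local]
    return correction, "local"


# One 5-qubit region with a radius-1 table; a radius-2 table stands in for the
# expensive global decoder.
regions = [(range(4), rep_code_table(5, 1))]
full = rep_code_table(5, 2)


def decode_globally(s):
    return full.get(tuple(s))


print(hierarchical_decode((0, 1, 1, 0), regions, decode_globally))  # ({2}, 'local')
print(hierarchical_decode((1, 0, 0, 1), regions, decode_globally))  # (frozenset({0, 4}), 'global')
```

The cheap path handles the common low-weight syndromes, and only the rare patterns outside the local tables incur the cost of the global decoder.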
Note that this work uses the adaptive syndrome measurement technique, which assumes fast qubit preparation and measurement. On architectures where qubit measurement and reset are slow, our method may require a large number of ancillas or may not be feasible at all. In that case, one may consider flag schemes that do not require fast qubit measurement and reset, such as the flag scheme for any distance-3 code [66] or the flag scheme in which the flag gadgets are constructed from classical BCH codes [67].
If the code is self-orthogonal, then the two tables coincide: $T := T_X = T_Z$ and $r := r_X = r_Z$. Thus, where we used the fact that $T_{XZ}$ must be at least 1 for $t > 0$ and a non-trivial encoding. This upper bound means that, for a zero-rate code, leveraging the structure of a self-orthogonal CSS code and the CROs can yield a memory footprint less than 12.5% of that of a lookup table that treats the code as a general stabilizer code.
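As a rough illustration of why a shared table helps, the sketch below counts the errors that a naive radius-$t$ lookup table must enumerate when the code is treated as a general stabilizer code (X, Y, or Z on each affected qubit), as a CSS code (separate X-type and Z-type tables), and as a self-orthogonal CSS code (one shared table). This is only an entry count under these assumptions, not the memory-footprint bound of this appendix, which also involves $T_{XZ}$ and the CROs; the function names are hypothetical.

```python
from math import comb


def stabilizer_table_errors(n, t):
    """Number of Pauli errors of weight <= t on n qubits (X, Y or Z per qubit)."""
    return sum(comb(n, i) * 3**i for i in range(t + 1))


def css_table_errors(n, t, self_orthogonal=False):
    """X-type errors of weight <= t, plus a separate Z-type table unless the
    code is self-orthogonal, in which case one table serves both."""
    one_table = sum(comb(n, i) for i in range(t + 1))
    return one_table if self_orthogonal else 2 * one_table


# [[7,1,3]] code, t = 1
print(stabilizer_table_errors(7, 1))                 # 22
print(css_table_errors(7, 1))                        # 16
print(css_table_errors(7, 1, self_orthogonal=True))  # 8
```

Since a lookup table is keyed by syndromes, degenerate errors can share an entry, so these counts are upper bounds on the number of table entries.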

Appendix B: The effect of idling noise
To demonstrate the effect of idling noise, we evaluate the [[7,1,3]] code under a naive interleaved schedule, depicted in Fig. 12a without noise terms and in Fig. 12b with gate noise terms of strength $p = 0.02$ and idling noise terms of strength $p_I = 0.01$. Note that idling in the circuit could be reduced further by doubling the number of flag qubits and ancilla qubits and measuring X and Z stabilizer generators in parallel, similar to the scheme by Beverland, Kubica, and Svore [24]. This would, however, only be possible for the two-tailed adaptive decoder; the separated X/Z decoder would not work by definition. Protocol-specific CNOT schedule optimization might also be possible, depending on the underlying quantum code. As we are not aiming to develop code-level tools, this investigation is out of the scope of this paper. It is also worth pointing out that using a single flag qubit and a single ancilla forces sequential execution of the gates within a generator, while multi-flag schemes such as that of Chamberland et al. [32] allow multiple CNOT gates to be executed in the same time step. While our methods here use single flag qubit-based protocols, this restriction can be relaxed if the strength of the idling noise requires it.
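The sketch below shows one way to attach the two noise terms to a time-stepped CNOT schedule: two-qubit depolarizing noise of strength $p$ after every CNOT, and single-qubit depolarizing noise of strength $p_I$ on every qubit that idles in that time step. The operation labels are borrowed from Stim's instruction names (`CX`, `DEPOLARIZE1`, `DEPOLARIZE2`), but the code does not call Stim; the function name is hypothetical, and the two-step schedule in the usage example is made up rather than taken from Fig. 12.

```python
def add_gate_and_idle_noise(schedule, qubits, p, p_idle):
    """Annotate a time-stepped CNOT schedule with gate and idling noise.

    schedule: list of time steps, each a list of (control, target) CNOTs.
    qubits: all qubit labels in the circuit.
    Each CNOT is followed by two-qubit depolarizing noise of strength p, and
    every qubit not used in that time step gets single-qubit depolarizing
    noise of strength p_idle. Returns a list of (operation, qubits, strength).
    """
    noisy = []
    for step in schedule:
        busy = set()
        for c, t in step:
            if c in busy or t in busy:
                raise ValueError(f"qubit reused within one time step: {(c, t)}")
            busy.update((c, t))
            noisy.append(("CX", (c, t), None))
            noisy.append(("DEPOLARIZE2", (c, t), p))
        idle = tuple(q for q in qubits if q not in busy)
        if idle and p_idle > 0:
            noisy.append(("DEPOLARIZE1", idle, p_idle))
    return noisy


# Labels follow Fig. 12 (data 0-6, ancillas 7-9, flags 10-11), but these two
# time steps are a made-up fragment, not the schedule shown in the figure.
steps = [[(0, 7), (1, 8)], [(10, 7)]]
noisy = add_gate_and_idle_noise(steps, range(12), p=0.02, p_idle=0.01)
for op in noisy:
    print(op)
```

Setting `p_idle=0` reproduces the noise model of the main text, while `p_idle=p` corresponds to the full depolarizing model of Fig. 11.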
Our numerical results, displayed in Fig. 11, show that at idling noise strength $p_I = p$, the pseudothreshold is 20 to 25 times smaller than in the case without idling noise ($p_I = 0$). However, as the ratio $p/p_I$ of gate noise strength to idling noise strength increases, the performance rapidly approaches the ideal case. Furthermore, our decoders still preserve the distance, which is expected given that the single-qubit depolarizing noise terms do not change the set of errors to be corrected but only change the strength of some terms.

FIGURE 2: A Shor syndrome extraction circuit for measuring a stabilizer generator of the form ZZZZ.

FIGURE 5: An illustration of error decoding using a lookup table and the MIM technique on the Hilbert space $\mathcal{H} = (\mathbb{C}^2)^{\otimes n}$ of the physical qubits. A code of distance 9 is considered in this example. Using a lookup table with search radius 4 only, any erroneous state lying on the green (or red) circles, which are up to 4 faults (circles) away from the logical state $|\psi_L\rangle$ (or $|\psi_L^\perp\rangle$), will be recovered to $|\psi_L\rangle$ (or $|\psi_L^\perp\rangle$). Consider an erroneous state $E|\psi_L\rangle$ that is not on any green or red circle. In (a), $E|\psi_L\rangle$ is 5 faults away from $|\psi_L\rangle$ and 6 faults away from $|\psi_L^\perp\rangle$. Using the MIM table of radius 1, the decoder finds the recovery operator $R_1$. Since $R_1E$ is a stabilizer, $R_1$ brings the state back to the original state $|\psi_L\rangle$. In (b), $E|\psi_L\rangle$ is 6 faults away from both $|\psi_L\rangle$ and $|\psi_L^\perp\rangle$. Using the MIM table of radius 2, the decoder finds either $R_1$ such that $R_1E$ is a stabilizer, or $R_2$ such that $R_2E$ is a nontrivial logical operator. In this case, the state after recovery can be either $|\psi_L\rangle$ or $|\psi_L^\perp\rangle$.
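The decoding steps of Fig. 5 can be sketched on a much smaller example. Below, a 5-bit repetition code stands in for the color code, a radius-1 main table plays the role of the radius-4 table, and a radius-1 MIM list reaches a weight-2 error that the main table alone cannot decode. The helpers `syndrome`, `build_table`, and `mim_decode` are hypothetical, and this is a simplified reading of the MIM lookup rather than the implementation used in this work.

```python
from itertools import combinations


def syndrome(H, flips):
    """Parity-check syndrome of the set of flipped bits under check matrix H."""
    return tuple(sum(row[q] for q in flips) % 2 for row in H)


def build_table(H, n, radius):
    """Map the syndrome of each error of weight <= radius to a lowest-weight error."""
    table = {}
    for w in range(radius + 1):
        for flips in combinations(range(n), w):
            table.setdefault(syndrome(H, flips), frozenset(flips))
    return table


def mim_decode(s, table, mim_errors, H):
    """Direct lookup if possible; otherwise meet in the middle.

    Tries candidate errors e1 from mim_errors (assumed sorted by weight) until
    the residual syndrome s + syn(e1) is found in the main table; the recovery
    e1 * table[residual] then has syndrome s. Returns None if nothing matches.
    """
    if s in table:
        return table[s]
    for e1 in mim_errors:
        residual = tuple((a + b) % 2 for a, b in zip(s, syndrome(H, e1)))
        if residual in table:
            return frozenset(e1) ^ table[residual]
    return None


# 5-bit repetition code with checks on neighbouring bits.
n = 5
H = [[int(q in (i, i + 1)) for q in range(n)] for i in range(n - 1)]
table = build_table(H, n, radius=1)
mim_errors = list(combinations(range(n), 1))  # MIM list of radius 1
print(mim_decode(syndrome(H, (1, 3)), table, mim_errors, H))  # frozenset({1, 3})
```

The loop returns the first match, so an ambiguous situation like case (b) would require collecting all matches and comparing the resulting recoveries.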

FIGURE 11: The effect of idling noise on a naive CNOT schedule for the [[7,1,3]] code at different idling noise strengths $p_I$ relative to the gate error strength $p$. In this setup, $p_I = p$ is the full, standard depolarizing noise model, and $p_I = 0$ is the one we used to evaluate our methods in the main text, while $p_I = p/2$, $p_I = p/5$, and $p_I = p/10$ lie between these two extremes.

FIGURE 12: An interleaved schedule for extracting the Z syndrome of the [[7,1,3]] code without (a) and with (b) noise terms at gate depolarizing strength $p = 0.02$ and idling noise strength $p_I = p/2 = 0.01$. Data qubits are 0 to 6, ancilla qubits are 7 to 9, and flag qubits are 10 and 11. Brackets above and below the circuit group together gates that are executed during the same time step. X-type syndrome extraction is similar.

TABLE I: Metrics of the lookup table.