Deep Q-learning decoder for depolarizing noise on the toric code

We present an AI-based decoding agent for quantum error correction of depolarizing noise on the toric code. The agent is trained using deep reinforcement learning (DRL), where an artificial neural network encodes the state-action Q-values of error-correcting $X$, $Y$, and $Z$ Pauli operations, occurring with probabilities $p_x$, $p_y$, and $p_z$, respectively. By learning to take advantage of the correlations between bit-flip and phase-flip errors, the decoder outperforms the minimum-weight-perfect-matching (MWPM) algorithm, achieving higher success rate and higher error threshold for depolarizing noise ($p_z = p_x = p_y$), for code distances $d\leq 9$. The decoder trained on depolarizing noise also has close to optimal performance for uncorrelated noise and provides functional but sub-optimal decoding for biased noise ($p_z \neq p_x = p_y$). We argue that the DRL-type decoder provides a promising framework for future practical error correction of topological codes, striking a balance between on-the-fly calculations, in the form of forward evaluation of a deep Q-network, and pre-training and information storage. The complete code, as well as ready-to-use decoders (pre-trained networks), can be found in the repository https://github.com/mats-granath/toric-RL-decoder.


I. INTRODUCTION
The basic building block of a quantum computer is the quantum bit (qubit), the quantum entity that corresponds to the bit in a classical computer, but which can store a superposition of 0 and 1 [1]. The main challenge in building a quantum computer is that the qubit states are very fragile and susceptible to noise. Surface codes [2][3][4][5] are two-dimensional structures of qubits located on a regular grid which provide fault tolerance by entangling the qubits. In the surface code, logical qubits are topologically protected, which means that only strings of bit flips that stretch from one side to the other of the code cause logical bit flips, whereas topologically trivial loops (contractable to a point) do not. In recent years, experiments have taken first steps in quantum error correction in several promising quantum-computing architectures, e.g., superconducting circuits [6][7][8][9][10][11][12][13][14][15], trapped ions [16][17][18][19][20], and photonics [21,22], and work continues towards large-scale implementation of surface codes.
Even though the surface-code architecture provides extra protection to logical qubits, the physical qubits are still susceptible to noise causing bit-flip or phase-flip errors. Such errors need to be monitored and corrected before they proliferate and create non-trivial strings that cause logical failure. The challenge with correcting quantum-mechanical errors is that the errors themselves cannot be detected (because such measurements would destroy the quantum superposition of states), but only the syndrome, corresponding in the surface codes to local 4-qubit parity measurements, can. An algorithm that provides a set of recovery operations for correction of the error given a syndrome is called a decoder.
As the syndrome does not uniquely determine the errors, the decoder needs to incorporate the statistics of errors corresponding to any given syndrome. Optimal decoders, which give the highest theoretically possible error-correction success rate, are generally hard to find, except for the simplest hypothetical types of noise.
Many types of decoder algorithms exist that deal in different ways with the lack of uniqueness in the mapping from syndrome to error configuration. Methods include Monte Carlo-based decoders [23,24], cellular automata [25,26], renormalization-group decoders [27], and various types of neural-network-based decoders [28][29][30][31][32][33][34][35][36][37][38][39][40], the last of which is also the tool used in the present paper. The benchmark algorithm for the decoding problem is Minimum Weight Perfect Matching (MWPM) [41][42][43], which is a graph algorithm for pairwise matching of syndrome defects that is based on the assumption that the most likely error configuration is one that corresponds to the minimum number of errors. However, this assumption does not take into account that different error channels may have different probabilities (biased noise), or that syndrome defects will in general be correlated.
For a decoder to be used for actual operation in a quantum computer, not only correction success rate, but also speed, is a crucial factor. A long delay for calculating error correcting operations will not only slow down the calculations, but also make the code susceptible to additional errors. For this reason, decoders based on algorithms that do extensive sampling of the configuration space on the fly, such as Monte Carlo-based decoders [23], may not be viable as practical decoders. Instead, using some level of pre-training to generate and store information for fast retrieval will likely be necessary. Tabulating the information of syndrome versus most likely logical error is expected to be prohibitively expensive in terms of both storage and training, and slow to access, for anything but very small codes. Given these constraints, the need for pre-training, the massive state space and corresponding amount of data, it is natural to consider machine-learning solutions, especially given the recent deep-learning revolution [44,45] and its applications within quantum physics [46][47][48].
In this paper, we use deep reinforcement learning [49,50], expanding on the framework for error correction in the toric code (i.e., surface code with periodic boundary conditions) introduced by Andreasson et al. [36]. Reinforcement learning and deep reinforcement learning (DRL) have recently emerged as promising tools for various quantum control tasks [51,52]. In Ref. [36], only uncorrelated noise (with independent bit- and phase-flip errors) was considered and it was found that the DRL decoder could achieve success rates of error correction on par with MWPM. In the present work, we consider depolarizing noise ($p_x = p_y = p_z$) and find that a similar decoder can outperform MWPM for moderate code size $d \leq 9$. The decoder trained on depolarizing noise is also found to be quite versatile, reaching close to MWPM success rates on uncorrelated noise and giving intermediate performance on biased noise. As in the previous work, we do not consider syndrome measurement errors, but focus on mastering the more elementary but nevertheless challenging task of efficiently decoding a perfect syndrome with depolarizing noise.
A decoder based on DRL has the potential to offer an ideal balance between calculations on the fly and pre-training. The information about the proper error correction string for a given syndrome is stored in a very efficient way, using two principles:
1) The step-by-step decoding using the pre-trained neural network generates an effective tree structure where many different syndromes will reduce to the same syndrome after one operation, such that subsequent correction steps will use the same information, iteratively reducing the complexity.
2) The deep neural network is a 'generalizer' which can spot and draw conclusions from common features of different syndromes, including syndromes that have not been seen during training.
The paper is organized as follows. In Sec. II, we give a brief introduction to quantum error correction for the toric code. In Sec. III, we introduce deep reinforcement learning and Q-learning, and discuss how these are implemented in training and utilizing the decoder. In Sec. IV, the performance of the DRL decoder is presented and benchmarked against both MWPM and analytic expressions valid for low error rates. We summarize the main results and give an outlook to further developments in Sec. V.

II. TORIC CODE
The toric code in the form considered here consists of a two-dimensional square grid of physical qubits with periodic boundary conditions. In this section, we provide a high-level summary of the main concepts relevant for our study and refer the reader to the literature for more details [2][3][4][5]. A $d \times d$ grid contains $2d^2$ qubits corresponding to a Hilbert space of $2^{2d^2}$ states, out of which four will form the logical code space. That is, it encodes a 4-fold qudit corresponding to two qubits, which we will nevertheless refer to as the logical qubit. It is a stabilizer code where a large set of commuting local parity check operators (the stabilizers) split the state space into distinct sectors.
FIG. 1. Circles represent physical qubits, with shading showing periodic boundaries. Bit flip X (red), phase flip Z (blue), and Y ∼ XZ (yellow) errors with corresponding plaquette and vertex "defects" as end points of error chains. The defects are measured by the plaquette (⊗Z) and vertex (⊗X) parity-check operators, respectively. Also shown are logical bit and phase flip operators corresponding to closed loops spanning the torus.
The stabilizers for the toric code are divided into two types, here represented as plaquette and vertex operators, consisting of products of Pauli Z or X operators on the four qubits on a plaquette or vertex (see Fig. 1), respectively. Eigenstates of the full set of stabilizers, with eigenvalue ±1 on each plaquette and vertex of the lattice, are globally entangled, which provides the basic robustness to errors. The logical qubit corresponds to the sector with eigenvalue +1 on all stabilizers. We will refer to a stabilizer with eigenvalue −1 as a plaquette or vertex defect. A single bit flip X or phase flip Z on a state in the qubit sector will produce a pair of defects on neighboring plaquettes or vertices, with Pauli Y ∼ XZ giving both pairs of defects, as shown in Fig. 1.
The set of stabilizer defects corresponding to any given configuration of X, Y, or Z operations on a state in the logical sector is called the syndrome. Logical operations, which map between the different states in the logical sector, are given by strings of X or Z operators that encircle the torus, corresponding to logical bit-flip and phase-flip operations, respectively (see Fig. 1). The shortest loop that can encircle the torus has length d; correspondingly, the code distance is d. For simplicity, we consider only odd d, as there is an odd-even effect in some quantitative aspects of the problem. The toric code is an example of a topological code, as the logical operations correspond to 'non-contractible' loops on the torus, whereas products of stabilizers can only generate 'contractible' loops. Figure 2(a) shows an example of an error configuration (also referred to as an error chain) on a d = 9 toric code together with the corresponding syndrome, generated randomly at an error rate p = 0.22. Only the syndrome is visible to the decoder [Fig. 2(b)]; based on it, the decoder should suggest a sequence of operations (a correction chain) that eliminates the syndrome in such a way that it is least likely to cause a logical bit- and/or phase-flip operation. To evaluate the success rate of a correction chain for a given syndrome, it should be complemented by the full distribution of error chains corresponding to that syndrome, to calculate which fraction of error+correction chains contain an odd number of logical operations of any type.
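As an illustration of how a syndrome follows from an error configuration, the following minimal sketch computes the vertex and plaquette defects on a $d \times d$ torus. The qubit indexing is an assumption made for this example and is not taken from the paper's repository: `(0, i, j)` labels the horizontal edge to the right of vertex `(i, j)`, `(1, i, j)` labels the vertical edge below it, and plaquette `(i, j)` has vertex `(i, j)` as its upper-left corner.

```python
def empty_lattice(d):
    """Return a d x d matrix of zeros (0 = no defect)."""
    return [[0] * d for _ in range(d)]


def syndrome(errors, d):
    """Compute vertex and plaquette defects for a dict {(k, i, j): 'X'|'Y'|'Z'}.

    Hypothetical convention: k = 0 is the horizontal edge right of vertex (i, j),
    k = 1 is the vertical edge below vertex (i, j).
    """
    vertex = empty_lattice(d)
    plaquette = empty_lattice(d)
    for (k, i, j), pauli in errors.items():
        if pauli in ("Z", "Y"):
            # A Z component anticommutes with the X-type checks on the two
            # vertices at the ends of the edge.
            ends = [(i, j), (i, (j + 1) % d)] if k == 0 else [(i, j), ((i + 1) % d, j)]
            for a, b in ends:
                vertex[a][b] ^= 1
        if pauli in ("X", "Y"):
            # An X component anticommutes with the Z-type checks on the two
            # plaquettes that share the edge.
            faces = [(i, j), ((i - 1) % d, j)] if k == 0 else [(i, j), (i, (j - 1) % d)]
            for a, b in faces:
                plaquette[a][b] ^= 1
    return vertex, plaquette


def count_defects(vertex, plaquette):
    """Total number of defects in the syndrome."""
    return sum(map(sum, vertex)) + sum(map(sum, plaquette))
```

In this sketch a single X or Z error creates one pair of defects, a Y error creates two pairs, and a row of X errors that winds around the torus leaves no defects at all, which is why such a loop acts as an undetectable logical operation.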

III. DEEP REINFORCEMENT LEARNING ALGORITHM
The DRL-based decoder presented in this paper is an agent utilizing reinforcement learning together with a deep convolutional neural network, called the Q-network, for approximation of Q-values. The agent suggests, step by step, a sequence of corrections that eliminates all defects in the system as illustrated in Fig. 3 (see also Figs. 17 and 18 in Appendix C).

A. Q-learning
The purpose of Q-learning [53] is for an agent to learn a policy, $\pi(s, a)$, that prescribes what action $a$ to take in state $s$. An optimal policy maximizes the future cumulative reward of actions within a Markov decision process, with the rewards $r_a(s, s')$ provided by the environment depending on the initial and final states and the action. In this paper, we use a deterministic reward scheme, as discussed below. To measure the future cumulative reward, the action value function, or Q-function, is given by
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \mid s_t = s,\ a_t = a \right], \quad (1)$$
where action $a_t$ is taken at time $t$, and subsequently following the policy $\pi$, with $\gamma \leq 1$ a discounting factor. The Q-function corresponding to the optimal policy satisfies the Bellman equation
$$Q(s, a) = r_a(s, s') + \gamma \max_{a'} Q(s', a'), \quad (2)$$
such that the optimal policy will self-consistently correspond to the action maximizing Q. As discussed in more detail in Sec. III B, we use one-step Q-learning, in which the current measure of $Q(s, a)$ is updated by explicit use of the Bellman equation with some learning rate $\alpha$, using $\epsilon$-greedy exploration.

FIG. 2. The objective of the DRL decoder is to find a correction string which is consistent with the syndrome and which takes the minimal number of qubit operations. The benchmark MWPM decoder instead treats the plaquette and vertex configurations as separate graph problems, suggesting the shortest independent correction chains of X and Z. The full decoding sequence for this syndrome using the DRL decoder is shown at github.com/mats-granath/toric-RL-decoder.
FIG. 3. Value functions $V(s) = \max_a Q(s, a)$ for a sequence of syndromes corresponding to a particular error chain, using the reward scheme in Eq. (3) with γ = 0.95. For this simple syndrome, the optimal sequence is three steps long and the theoretical state values are compared to those output by the Q-network. The error chain itself is irrelevant to the correction sequence; only the syndrome is important.
The reward scheme that we use, Eq. (3), gives a terminal reward when the last defects are eliminated and otherwise a reward $E_t - E_{t+1}$, where $E_t$ represents the number of defects in the syndrome at step $t$, such that X and Z operators can give reward −2, 0, or 2, whereas Y operators can give reward −4, −2, 0, 2, or 4. The terminal reward, given a discounting factor γ < 1, incentivizes the agent to correct the full syndrome in the minimal number of steps. The explicit reward for eliminating defects is implemented to speed up convergence; without it, the agent would have to find terminal states by completely random exploration. The reward scheme is not expected to give an optimally performing decoder [36,40]; rather than using the statistics of error chains in an unbiased fashion, it makes the assumption that the most likely error chain is the shortest. As expected (see Sec. IV), for biased noise this gives sub-optimal performance. Figure 3 shows an example of Q-network estimated and exact state values $V(s) = \max_a Q(s, a)$ for an example syndrome, showing that the Q-network gives a quantitatively accurate representation of Q-values. The numerical accuracy in general deteriorates the larger the syndrome is, i.e., the further it is removed from the terminal state.
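To make the interplay between the reward and the discounting concrete, the sketch below evaluates the discounted return along a given correction sequence and performs a single tabular one-step Q-learning update. The numerical value of the terminal reward is not stated in this text; the default `terminal_reward=100.0` is a placeholder assumption, as is the choice to give only the terminal reward on the final step.

```python
def step_reward(defects_before, defects_after, terminal_reward=100.0):
    """Reward for one correction step.

    Assumption: the final step (no defects left) earns only the terminal
    reward; every other step earns the reduction in the number of defects.
    """
    if defects_after == 0:
        return terminal_reward
    return float(defects_before - defects_after)


def state_values(defect_counts, gamma=0.95, terminal_reward=100.0):
    """Discounted return from each state along a correction sequence.

    defect_counts lists E_0, E_1, ..., E_T with E_T = 0. If the sequence is
    optimal, these returns equal the state values V(s_t).
    """
    values = [0.0] * len(defect_counts)  # the terminal state has value 0
    for t in range(len(defect_counts) - 2, -1, -1):
        r = step_reward(defect_counts[t], defect_counts[t + 1], terminal_reward)
        values[t] = r + gamma * values[t + 1]  # Bellman backup along the path
    return values


def q_update(q_old, reward, next_max_q, alpha, gamma):
    """One-step Q-learning update toward the Bellman target."""
    return q_old + alpha * (reward + gamma * next_max_q - q_old)
```

For a three-step correction of a single defect pair, `state_values([2, 2, 2, 0])` shows how each extra step multiplies the attainable value by γ, which is what pushes the agent toward short correction chains.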

Efficient Q-network representation
To improve the representational capacity of the Q-network, we use an efficient state-action space representation, which was suggested in Ref. [36] for bit-flip operations and which we now extend to general X, Y, and Z operations. It is built on three basic concepts:
• By having the Q-network only output action values for one particular qubit, the representational complexity can be reduced significantly.
• Due to the periodic boundary conditions of the toric code, only the relative positions of syndrome defects are important, i.e., arbitrary translations and four-fold rotations are allowed.
• The converged decoder will never operate on a qubit which is not adjacent to any syndrome defect. Consequently, we have no need to calculate Q-values for such actions.
The Q-network takes input in the form of two channels of $d \times d$ matrices, corresponding to the location of vertex and plaquette defects, respectively. The output is the three Q-values for X, Y, and Z operations on one particular qubit, in a fixed location $r_0$ with respect to an external reference frame, as indicated in Fig. 4. To obtain the full set of action values for a syndrome, we thus successively translate and rotate the syndrome to locate each qubit at location $r_0$. Each such matrix representation of the syndrome, with a particular qubit at $r_0$, is called a "perspective", and the whole set of perspectives makes up an "observation", as exemplified in Fig. 5. In the observation, we only include perspectives for qubits that are adjacent to a syndrome defect.
To obtain the full relevant Q-function of a syndrome, the Q-function of each individual perspective of an observation is calculated. In decoding mode, the agent chooses greedily the action with the highest Q-value. After the chosen action has been performed, a new syndrome is produced and the process repeats until no defects remain. As discussed in the introduction, and exemplified in Fig. 6, the DRL decoding framework gives a compact structure for information storage and utilization: using a neural network to generalize information between syndromes and using step-by-step decoding to successively reduce syndromes to a smaller subset.
FIG. 6. Schematic of the operation of the DRL decoder for several syndromes that successively reduce to a smaller subset of syndromes through step-by-step decoding. Top left are two syndromes that after one step of decoding reduce to the same syndrome, and similarly to the right. Both these branches in turn reduce to the same syndrome after the next decoding step. In this way, the complexity of the decoding problem is reduced, compared to decoding each high-level syndrome independently.
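The construction of perspectives and the greedy choice over an observation can be sketched as follows. This is an illustration under stated assumptions, not the implementation in the repository: the edge indexing is hypothetical (`(0, i, j)` is the horizontal edge right of vertex `(i, j)`, `(1, i, j)` the vertical edge below it, with plaquette `(i, j)` having vertex `(i, j)` as upper-left corner), a diagonal reflection (matrix transpose) is swapped in for the four-fold rotation because it needs no index offsets, and `q_function` is a stand-in for the trained Q-network.

```python
def transpose(mat):
    """Reflect a square matrix about its diagonal."""
    return [list(row) for row in zip(*mat)]


def roll(mat, di, dj):
    """Periodic shift that moves the entry at (di, dj) to (0, 0)."""
    d = len(mat)
    return [[mat[(a + di) % d][(b + dj) % d] for b in range(d)] for a in range(d)]


def perspective(vertex, plaquette, qubit):
    """Defect matrices as seen with `qubit` moved to the reference edge (0, 0, 0)."""
    k, i, j = qubit
    if k == 1:
        # The reflection maps the vertical edge (1, i, j) to the horizontal (0, j, i).
        vertex, plaquette = transpose(vertex), transpose(plaquette)
        i, j = j, i
    return roll(vertex, i, j), roll(plaquette, i, j)


def touches_defect(vertex, plaquette):
    """True if the reference edge borders a defect (assumes d >= 2)."""
    return bool(vertex[0][0] or vertex[0][1] or plaquette[0][0] or plaquette[-1][0])


def observation(vertex, plaquette):
    """All perspectives whose central qubit is adjacent to at least one defect."""
    d = len(vertex)
    obs = []
    for k in (0, 1):
        for i in range(d):
            for j in range(d):
                v, p = perspective(vertex, plaquette, (k, i, j))
                if touches_defect(v, p):
                    obs.append(((k, i, j), v, p))
    return obs


def greedy_action(obs, q_function):
    """Pick the (qubit, Pauli) pair with the largest Q-value over the observation.

    q_function(vertex, plaquette) is a hypothetical callable returning the
    three Q-values for X, Y, and Z on the reference qubit.
    """
    best = None
    for qubit, v, p in obs:
        for pauli, q in zip("XYZ", q_function(v, p)):
            if best is None or q > best[0]:
                best = (q, qubit, pauli)
    return best[1], best[2]
```

Because only qubits next to a defect generate a perspective, the number of network evaluations per decoding step grows with the number of defects rather than with the total number of qubits.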

B. Training the Q-network
The neural network is trained using the deep Q-learning algorithm utilizing prioritized experience replay [50,54]. To increase stability, two architecturally equivalent neural networks are used, the regular Q-network, with parameters $\theta$, and the target Q-network, with parameters $\theta_T$. The target network is synchronized with the Q-network at a set interval.
Experience replay saves every transition in a memory buffer, from which the agent randomly samples a mini-batch of transitions used to update the Q-network. Instead of sampling the mini-batch uniformly, as is done with regular experience replay, prioritized experience replay prioritizes importance when sampling. This importance is measured with the absolute value of the temporal difference (TD) error,
$$\delta_j = r_j + \gamma \max_{a} Q(s'_j, a; \theta_T) - Q(s_j, a_j; \theta),$$
where the state/syndrome $s'_j$ follows from action $a_j$ on state/syndrome $s_j$, and where the expression $Q(s, a; \theta)$ implies choosing the appropriate perspective for the Q-network that corresponds to action $a$ in syndrome $s$.
Following Ref. [54], the probability of sampling a transition $j$ from the memory buffer is given by $P_j = |\delta_j|^{\alpha} / \sum_k |\delta_k|^{\alpha}$, such that values with higher TD error are more likely to be sampled. Here, $\alpha$ controls the amount of prioritization used ($\alpha = 0$ corresponding to uniform sampling) and $k = 1, ..., M$, with $M$ the size of the memory buffer. Using non-uniform sampling in this way, however, skews the learning away from the probability distribution used to generate experiences. To partially compensate for this, importance-sampling weights are introduced according to $w_j = (M \cdot P_j)^{-\beta}$, with the product of the weights and TD error, $w_j \cdot \delta_j$, used as the loss during stochastic gradient descent training of the network. Here $\beta$ controls the extent of compensation of the prioritized sampling, with $\beta = 1$ corresponding to full compensation.
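The sampling probabilities and importance-sampling weights described above can be sketched in a few lines. The small offset `eps` is an addition commonly used in practice to keep transitions with zero TD error sampleable; it is an assumption of this example and does not appear in the formulas above (set `eps=0.0` to recover them exactly).

```python
import random


def sampling_probabilities(td_errors, alpha, eps=1e-6):
    """P_j proportional to (|delta_j| + eps)^alpha; eps is an assumed safeguard."""
    priorities = [(abs(delta) + eps) ** alpha for delta in td_errors]
    total = sum(priorities)
    return [pr / total for pr in priorities]


def importance_weights(probs, beta):
    """w_j = (M * P_j)^(-beta), with M the number of stored transitions."""
    m = len(probs)
    return [(m * pr) ** (-beta) for pr in probs]


def sample_indices(probs, batch_size, rng=random):
    """Draw a mini-batch of buffer indices with replacement, weighted by P_j."""
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)
```

With `alpha=0` every transition is equally likely and all weights equal one, so the scheme reduces to ordinary experience replay; with `beta=1` the weights exactly undo the sampling bias in expectation.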
The training can be divided into two stages: the action stage and the learning stage. Pseudo-code of the algorithm used for training is shown in Algorithm 1. The training starts with the action stage. Given a syndrome $s_t$, the agent suggests an action $a_t$ following an $\epsilon$-greedy policy, such that with probability $(1 - \epsilon)$ the agent takes the action with the highest Q-value; otherwise a random action is taken. The agent receives a reward, $r_t$, and the syndrome, $s'_t = s_{t+1}$, that follows from the action $a_t$. The transition is stored as a tuple, $T = (P_t, a_t, r_t, s_{t+1}, \Theta_{t+1})$, where $\Theta_{t+1}$ is a Boolean containing the information whether $s_{t+1}$ is a terminal state (there are no defects left) or not.
Algorithm 1: Training the DRL agent decoder
while defects remain do
    Get observation $O_t$ corresponding to syndrome $s_t$.
    With probability $\epsilon$ select random action $a_t$ and corresponding perspective $P_t$.
    Otherwise select $\{P_t, a_t\} = \arg\max_{P \in O_t,\, a} Q(P, a; \theta)$.
    Execute action $a_t$ and observe reward $r_t$ and syndrome $s_{t+1}$.
    Store transition $(P_t, a_t, r_t, s_{t+1}, \Theta_{t+1})$ in replay memory.
    Sample random mini-batch of transitions, $\{T_j\}_{j=1}^{N}$, from replay memory using prioritized sampling.
    Calculate weights $w_j$ used for weighted importance sampling.
    If terminal state reached, set $y_j = r_j$; otherwise, set $y_j = r_j + \gamma \max_a Q(s'_j, a; \theta_T)$.
    Perform gradient descent step on $w_j \cdot |y_j - Q(P_j, a_j; \theta)|$ with respect to the network parameters $\theta$.
    Every $C$ steps synchronize the target network with the policy network, $\theta_T = \theta$.
end

After the action stage, the agent continues with the learning stage. For that we use stochastic gradient descent (SGD) and the tuples stored in the replay memory. A mini-batch of $N$ transitions, $\{T_j = (P_j, a_j, r_j, s'_j, \Theta_j)\}_{j=1}^{N}$, is sampled from the replay memory with replacement. The training target value for the policy Q-network is given by $y_j = r_j$ if $\Theta_j = 1$, and $y_j = r_j + \gamma \max_a Q(s'_j, a; \theta_T)$ otherwise.
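The target values and the weighted loss of the learning stage can be written out explicitly. The sketch below works on plain lists; in an actual implementation the entries of `next_max_q` would come from a forward pass of the target network and `q_values` from the policy network, which is assumed here rather than shown. Averaging over the mini-batch is also an assumption of this example.

```python
def td_targets(rewards, next_max_q, terminal, gamma):
    """y_j = r_j for terminal transitions, else r_j + gamma * max_a Q(s'_j, a; theta_T)."""
    return [
        r if done else r + gamma * q_next
        for r, q_next, done in zip(rewards, next_max_q, terminal)
    ]


def weighted_loss(weights, targets, q_values):
    """Mini-batch mean of w_j * |y_j - Q(P_j, a_j; theta)|."""
    terms = [w * abs(y - q) for w, y, q in zip(weights, targets, q_values)]
    return sum(terms) / len(terms)
```

Keeping the target network fixed between synchronizations means the targets `y_j` do not shift with every gradient step, which is the stabilizing effect referred to above.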
The agents are initially trained at an error rate of 10% and, as the training proceeds, on syndromes with error rates up to 30%. Details of network architectures and hyperparameters are found in Appendix B.
FIG. 7. Error correction success rate, $P_s$, for the DRL decoder on depolarizing noise, as a function of total error probability p, for system sizes d = 5, 7, 9 (blue circles, orange squares, and green triangles, respectively), and compared to the corresponding results using the MWPM algorithm (blue solid curve, orange dotted curve, and dashed green curve, respectively). The DRL-based algorithm outperforms the MWPM-based algorithm for all these system sizes and error rates.

IV. RESULTS

A. Depolarizing noise
The main result of the paper is displayed in Fig. 7, where the error-correction success rate for depolarizing noise, $p_x = p_y = p_z = p/3$, is shown for decoders trained at three different code distances. This is compared to MWPM, which treats the plaquette and vertex defects as separate graph problems. See comment [55] for a discussion about the MWPM decoder for depolarizing noise. We thus find that the DRL decoder has a significantly higher error-correction success rate, which is achievable by learning to account for the correlations between plaquette and vertex defects.
From the crossing of the d = 5 and d = 7 error-correction success rates, we can identify a threshold of around 16.5% (for MWPM, the crossing is close to 15%), below which error correction can be guaranteed, were we able to increase d arbitrarily. The deduced threshold is significantly below the theoretical limit of 18.9% [23,56], but, as discussed in the introduction, for a practical decoder this may not be the most important measure. We anticipate that the success rate and threshold can be enhanced by further developing the reward scheme to be based on success rate rather than minimum number of operations (work along these lines was recently presented by Colomer et al. [40]).
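One simple way to read off such a crossing from sampled success-rate curves is linear interpolation of the difference between two curves. The sketch below is an illustration only: the function name is hypothetical, the method is an assumption about how a crossing could be located rather than a description of the procedure used here, and any numbers passed to it in examples are synthetic, not data from this work.

```python
def crossing_point(ps, rate_small_d, rate_large_d):
    """Estimate the error rate where two success-rate curves cross.

    ps lists the sampled error rates in increasing order; the two rate lists
    hold the success rates of the smaller and the larger code at those points.
    Returns the first crossing found by linear interpolation, or None.
    """
    diffs = [large - small for small, large in zip(rate_small_d, rate_large_d)]
    for k in range(len(ps) - 1):
        d0, d1 = diffs[k], diffs[k + 1]
        if d0 == 0:
            return ps[k]
        if d0 * d1 < 0:
            # The difference changes sign between ps[k] and ps[k + 1].
            return ps[k] + (ps[k + 1] - ps[k]) * d0 / (d0 - d1)
    if diffs and diffs[-1] == 0:
        return ps[-1]
    return None
```

Below the crossing the larger code has the higher success rate and above it the smaller code does, so the crossing serves as a finite-size estimate of the threshold.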
We also note that even though the d = 9 DRL decoder gives a significant improvement over MWPM, it has not fully converged to the optimal performance within the limitations of the algorithm, as indicated by the earlier crossing with d = 5 and d = 7. We do not anticipate that this is a fundamental limitation of the DRL-type decoder; rather, we expect that it could be improved by a more efficient training scheme.
In Fig. 8, we have employed the same DRL decoders, pre-trained on depolarizing noise, to decode pure bit-flip noise. Here, we find a performance for d = 5 and d = 7 which is very close to MWPM, thus reproducing the results of our first-generation DRL decoder from Ref. [36]. For d = 9, the decoder has slightly worse performance, confirming that this decoder has not yet converged to optimal algorithmic performance.

B. Asymptotic fail rates
In addition to the MWPM benchmark, we also benchmark the DRL decoders for small error rates $p \to 0$, by deriving analytical expressions (see Appendix A) for the fail rate for depolarizing noise to lowest non-vanishing order in p. We can derive such fail rates for both the MWPM algorithm and the algorithm based on finding the shortest correction strings. The latter is similar to, but not exactly equivalent to, what we expect for the DRL decoder based on our reward scheme. These algorithms both have a fail rate that scales as $P_L \sim p^{\lceil d/2 \rceil}$, but with different prefactors.
In Fig. 9, we confirm that the DRL decoder indeed performs ideally for d = 5 and d = 7 for short error chains, following very closely the algorithm based on minimal X, Y, Z chains. Because of the excessive time consumption to generate good statistics for d = 9, we have only compared the performance in the true asymptotic limit, i.e., the rate for only the shortest fallible error chains, as shown in Table I, again confirming the sub-optimal performance for d = 9. In this limit, data is generated by only considering the sub-group of error chains that are in a single row or column, in contrast to generating completely random error chains that will very rarely fail.

C. Biased noise
For the prospect of an operational decoder on a physical quantum computer, the noise is expected to be biased, such that phase-flip errors are relatively less or more likely [57][58][59][60][61][62]. To identify the exact error distribution is a challenging problem in itself (see, e.g., Ref. [63]), and the degree of bias can fluctuate in time [60][61][62], so a decoder that can adequately decode biased noise without retraining might be an alternative. To quantify the performance of the DRL decoder for biased noise, we consider the probability of an error of any type $p$, probability of phase-flip error $p_z = p_{\rm rel}\, p$, and consequently $p_x = p_y = (1 - p_{\rm rel})\, p / 2$. Thus for $p_{\rm rel} = 1$ the syndromes contain only Z errors, which corresponds to uncorrelated noise, whereas $p_{\rm rel} = 1/3$ corresponds to depolarizing noise.
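The parametrization of the bias translates directly into a sampler for error configurations. In the sketch below, the function names and the edge labels `(k, i, j)` with `k` in `{0, 1}` are hypothetical choices for this example; only the relations between $p$, $p_{\rm rel}$, and the three channel probabilities are taken from the text.

```python
import random


def channel_probabilities(p, p_rel):
    """Return (p_x, p_y, p_z) for total error rate p and relative Z weight p_rel."""
    p_z = p_rel * p
    p_x = p_y = (1.0 - p_rel) * p / 2.0
    return p_x, p_y, p_z


def sample_errors(d, p, p_rel, rng=random):
    """Draw an independent error on each of the 2 d^2 qubits of a d x d toric code."""
    p_x, p_y, p_z = channel_probabilities(p, p_rel)
    errors = {}
    for k in (0, 1):
        for i in range(d):
            for j in range(d):
                u = rng.random()
                if u < p_x:
                    errors[(k, i, j)] = "X"
                elif u < p_x + p_y:
                    errors[(k, i, j)] = "Y"
                elif u < p_x + p_y + p_z:
                    errors[(k, i, j)] = "Z"
    return errors
```

Setting `p_rel = 1/3` reproduces depolarizing noise with all three channels at `p/3`, while `p_rel = 1` yields pure phase-flip noise.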
In Fig. 10, we show the success rate for the decoder on biased noise. We find that the highest success rate is attained for depolarizing noise, which also is what the decoder is trained for. We can understand this as a consequence of the superlinear decline (for low p) in success rate with the number of defects, such that the majority species dominates the outcome. At $p_{\rm rel} = 1/3$ there is an equal mean number of vertex and plaquette defects, while away from this limit, the number of either one or the other grows. That the operation of the trained DRL decoder is sub-optimal is clear from the limit $p_{\rm rel} = 0$, corresponding to only X and Y errors, which should, in principle, be a simpler decoding problem, similar to uncorrelated noise with independent error rates p/2 [64]. Nevertheless, the decoder gives fair performance for the full range of biased noise, which may be an advantage over having a decoder which is specialized to a particular, potentially unknown, bias.

V. CONCLUSION
We have shown how deep reinforcement learning can be used for quantum error correction of depolarizing noise ($p_x = p_y = p_z$) in the toric code, with significantly improved performance compared to the standard MWPM algorithm. The advantage is gained by learning to account for the correlations between the vertex and plaquette defects. The super-MWPM performance for depolarizing noise was achieved for system sizes up to d = 9, corresponding to 162 qubits. However, by applying the trained decoder to decode pure bit-flip noise, ideal performance was only found for d < 9. For biased noise ($p_z \neq p_x = p_y$), the decoder gives fair, but sub-optimal, success rates.
Several improvements of the complete algorithm are being explored, or would be interesting to explore. This includes using distributed reinforcement learning [65] to enable the agent to explore the state space more efficiently and speed up the training. Moreover, it could be worth investigating the possibility of transferring the domain-specific knowledge (transfer learning) obtained from small grid instances to comparably larger grid sizes [66]. Combining the Q-learning with an element of active near-term exploration, such as that used by AlphaGo Zero [67,68], would also be an interesting approach to investigate.
The reward scheme used in this work is based on the heuristic to minimize the length of correction chains. This is a fair assumption for depolarizing noise, where X, Y , and Z errors are equally likely. For biased noise, with greater or smaller probability of phase flip errors, training the decoder based on this assumption gives suboptimal performance. Instead, the reward needs to be more closely linked to the actual distribution of error chains and syndromes.
In addition to improving the prowess for the problem discussed in this paper, further developments of the DRL decoder should include addressing syndrome measurement errors and non-toric topological codes [35]. Even though the DRL-type decoder presented in this paper and in Refs. [36,40] is still limited in scope, we have shown that it can flexibly address various types of noise, and in some regimes give super-MWPM performance. In addition, the information gathered from exploration is stored and used in an efficient and generalizable way using a deep neural network and step-by-step error correction, limiting both the complexity of concurrent calculations and the need for massive information storage, which may be instrumental for future operational decoders.

ACKNOWLEDGMENTS
Computations were performed on the Vera cluster at Chalmers Centre for Computational Science and Engineering (C3SE). We acknowledge financial support from the Knut and Alice Wallenberg foundation.

APPENDIX A
It is possible to derive a theoretical expression for the logical fail rate that becomes exact in the limit of low error probabilities, by considering only the shortest possible error strings that may lead to an error given the decoding scheme. Here we derive such expressions for depolarizing noise, $p_x = p_y = p_z = p/3$, for an algorithm which is based on correction using the minimum number of correction steps, and for an algorithm which is based on using MWPM separately on the graphs given by plaquette and vertex errors. The former algorithm, which we refer to as "minimal correction chain" (MCC), is similar to, but not exactly equivalent to, our trained decoder since our reward scheme, in addition to penalizing steps, also gives reward for annihilating syndrome defects. This additional reward will give a slight priority to using Y operators (which can annihilate two pairs of defects) at an early stage of the decoding sequence. Nevertheless, we expect that the MCC algorithm serves as a good benchmark for how well our DRL implementation of the algorithm works. In particular, we would like to see that our decoder outperforms the MWPM decoder also for low error rates.
The shortest error strings that can give an error with either of the algorithms are $\lceil d/2 \rceil$ long, aligned along one row or column [36,69]. This means that the fail rate for both types of decoders will scale as $P_L \sim (p/3)^{\lceil d/2 \rceil}$ for small p, but with different prefactors. We will only consider odd d; the scaling is true for even d, but prefactors are different. Figure 11 gives a demonstrative example of an error string, for d = 7, where the outcome differs between the two algorithms. Here MWPM will fail, solving the vertex defects with one Z and the plaquette defects with two X to generate a logical bit-flip consisting of a vertical X loop. In contrast, the MCC algorithm will only fail 50% of the time (we assume draws are settled by a coin flip), either using the MWPM-prescribed sequence or using the actual error string (YXXX) as the correction string. Interestingly, our specific decoder implementation should succeed 100% of the time for this particular error string, since it will prefer to use the Y, but it is not clear that this advantage is general.
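As a quick numerical illustration of this scaling, the snippet below evaluates the length of the shortest failing chain and the leading-order factor $(p/3)^{\lceil d/2 \rceil}$. The function names are hypothetical and the decoder-dependent prefactors are deliberately left out, so the numbers indicate only how fast the fail rate falls with code distance, not its absolute value.

```python
def shortest_failing_chain(d):
    """Length ceil(d/2) of the shortest error chain that can cause a logical failure."""
    return (d + 1) // 2


def leading_order_scale(p, d):
    """Leading-order factor (p/3)^ceil(d/2) of the fail rate, prefactor omitted."""
    return (p / 3.0) ** shortest_failing_chain(d)
```

For d = 7 the shortest failing chain has four errors, matching the one-Y-plus-three-X example of Fig. 11, and each increase of d by two multiplies the leading-order factor by a further p/3.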
To derive the general expressions for the asymptotic fail rates, we go through several examples of error chains. First, one has to keep in mind that we are interested in the minimum number of steps needed to annihilate all excitations. The order in which the errors are placed in the chain does not matter (see Fig. 12). Also, the errors do not have to be connected; it is a sufficient criterion that they all are in one column or row. Now we can investigate the different combinations that can make the decoder fail. Error chains of length $\lceil d/2 \rceil$ containing only $X$ or only $Z$ errors will always generate a non-trivial loop (see Fig. 13). Moreover, combinations of $X$ and $Y$ errors can lead to a failure. Figures 11 and 14 show that we have to consider syndromes with exactly one $Y$ error and the rest uniformly $X$ or $Z$ errors. For two or more $Y$ errors, the decoder will always succeed with the error correction. Finally, we have to find out how $X$ and $Z$ errors in combination behave. Figures 15 and 16 show that for exactly one $Z$ error and the rest being $X$ errors, the decoder succeeds with a 50% chance. Here again, the reward scheme of the actual DRL decoder would disfavor using a $Y$ if the $Z$ is isolated, giving a slight discrepancy between this and the MCC algorithm.
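The case analysis above can be condensed into a length comparison. The following is a minimal Python sketch, not the authors' code, under two assumptions that are ours: the only competitor to undoing the error string is the correction that closes an $X$ loop along the same row or column, and a single $Y$ is used wherever one site needs both an $X$ and a $Z$. Under these assumptions the loop-closing correction takes $d - n_X$ steps, where $n_X$ is the number of pure $X$ errors in the chain. The function names are hypothetical.

```python
from typing import Tuple


def correction_lengths(chain: str, d: int) -> Tuple[int, int]:
    """Return (steps to undo the error, steps to close an X loop instead).

    `chain` lists the Pauli errors sitting on distinct sites of one row or
    column of length d, e.g. "YXXX". Closing the loop needs an X on every
    site without an X component (d - n_X - n_Y sites, with a Y replacing
    the X on sites that carry a Z error) plus one Z per Y error, which
    adds up to d - n_X steps.
    """
    n_x = chain.count("X")
    undo = len(chain)
    close_loop = d - n_x
    return undo, close_loop


def mcc_fail_probability(chain: str, d: int) -> float:
    """Fail probability of the minimal-correction-chain rule, ties by coin flip."""
    undo, close_loop = correction_lengths(chain, d)
    if close_loop < undo:
        return 1.0  # the loop-closing correction is strictly shorter
    if close_loop == undo:
        return 0.5  # draw, settled by a coin flip
    return 0.0      # undoing the error is strictly shorter


if __name__ == "__main__":
    for chain in ("XXXX", "YXXX", "ZXXX", "YYXX"):
        print(chain, correction_lengths(chain, 7), mcc_fail_probability(chain, 7))
```

For $d = 7$ this reproduces the statements in the text: a pure $X$ chain always fails, chains with exactly one $Y$ or exactly one $Z$ are draws, and chains with two $Y$ errors are always corrected.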
We can convince ourselves that the cases presented here generalize to larger odd $d$, allowing for the derivation of an analytic expression for the logical fail rate. For the MCC algorithm, which we identify as close to the performance of our DRL decoder, the fail rate is given by

$$P_L = 4d \binom{d}{\lceil d/2 \rceil} \left(1 + \left\lceil \frac{d}{2} \right\rceil\right) \left(\frac{p}{3}\right)^{\lceil d/2 \rceil}.$$

FIG. 11. (a) The initial syndrome corresponding to one $Y$ error and three $X$ errors. (b) MWPM will always introduce a non-trivial loop and therefore fail. The "minimum correction chain" decoder has a 50% probability each for failure and success [correction chains (b) or (c), respectively].
FIG. 12. (a-c) For each of these syndromes, the shortest correction chains are of the same length (four steps in all cases). This is also true for other constellations of errors. The length of the error correction chain does not depend on the relative position of the syndrome defects in a row or a column.
The MWPM-based decoder uses only $X$ and $Z$ for correction. This decoder (similarly to any reasonable decoder) will always fail for chains of length $\lceil d/2 \rceil$ in a row or column containing all $X$ or all $Z$. It will also fail if one or more of the $X$ or $Z$ in such a chain are replaced by $Y$. This is clear from, e.g., correcting a $Y$ with a $Z$ in a chain $\{YXX\ldots\}$, which will reduce the chain to a pure $\{XXX\ldots\}$ of the type that always fails.
The MWPM fail rate thus contains, in addition to the pure chains, contributions from chains with increasing numbers of $Y$. In the general expression for $N_y \in \{0, 1, \ldots, \lceil d/2 \rceil\}$ $Y$ errors in a chain with $\lceil d/2 \rceil - N_y$ $X$ ($Z$) errors there is, compared to Eq. (A3), no factor $\frac{1}{2}$, as these chains always fail using MWPM, and the chain consisting purely of $Y$ is multiplied by a factor of 2 because it will fail on both types ($X$ or $Z$) of rows and columns. Thus, the complete expression for the MWPM asymptotic fail rate reads (after summation over $N_y$)

$$P_L^{\mathrm{MWPM}} = 4d \binom{d}{\lceil d/2 \rceil} \left(\frac{2p}{3}\right)^{\lceil d/2 \rceil}. \tag{A7}$$

(c) There is only one shortest correction chain with four steps. We can also conclude that with two or more $Y$ errors in the chain, the MCC algorithm (and DRL decoder) always succeeds with the error correction. In contrast, MWPM will fail, using the middle chain (panel b).

As expected, we find a higher fail rate for the decoder that uses MWPM compared to the decoder using the minimum number of correction steps, with $P_L / P_L^{\mathrm{MWPM}} = (1 + \lceil d/2 \rceil)/2^{\lceil d/2 \rceil} < 1$ for $d \geq 3$.
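The ratio $(1 + \lceil d/2 \rceil)/2^{\lceil d/2 \rceil}$ can be checked by brute force. The sketch below is an illustration and not part of the published code. It enumerates every chain of $\lceil d/2 \rceil$ Pauli errors on one row or column on which a closed $X$ loop is a logical error, and weights each chain with the failure probabilities quoted in the text. Treating every chain not mentioned in the text as successfully corrected is an assumption, and all function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def mcc_weight(chain):
    """Fail probability of one chain under the MCC rules quoted in the text."""
    n = len(chain)
    n_x = chain.count("X")
    if n_x == n:
        return Fraction(1)      # pure X chain: always fails
    if n_x == n - 1:
        return Fraction(1, 2)   # exactly one Y or one Z: coin flip
    return Fraction(0)          # assumed to be corrected successfully


def mwpm_weight(chain):
    """MWPM fails whenever every error carries an X component (X or Y)."""
    return Fraction(0) if "Z" in chain else Fraction(1)


def count_failing(d):
    """Return (chain length, MCC weight, MWPM weight) summed over all chains."""
    n = (d + 1) // 2  # equals ceil(d/2) for odd d
    mcc = Fraction(0)
    mwpm = Fraction(0)
    for chain in product("XYZ", repeat=n):
        mcc += mcc_weight(chain)
        mwpm += mwpm_weight(chain)
    return n, mcc, mwpm


if __name__ == "__main__":
    for d in (3, 5, 7, 9):
        n, mcc, mwpm = count_failing(d)
        print(d, mcc, mwpm, mcc / mwpm)
```

The common prefactor (number of rows and columns times the number of placements of the chain) is the same for both decoders and cancels in the ratio, so only the sum over Pauli labels is needed.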
We also note that the asymptotic fail rate for pure bit-flip (or phase-flip) noise with error rate $p$ is given by Eq. (A2) with $p/3 \to p$,

$$P_{L,X}(p) = 2d \binom{d}{\lceil d/2 \rceil} p^{\lceil d/2 \rceil}.$$

Thus, under the assumption of uncorrelated $X$ and $Z$ errors with probability $2p/3$ (corresponding to the rates for depolarizing noise), we find exactly that the total fail rate in Eq. (A7) is given by adding up two independent error channels: $P_L^{\mathrm{MWPM}} = 2P_{L,X}(2p/3)$.
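For numerical comparison with simulated fail rates, the asymptotic expressions can be evaluated directly. The sketch below is illustrative only. It assumes that the bit-flip prefactor is $2d\binom{d}{\lceil d/2 \rceil}$ (the number of rows and columns times the number of ways to place the chain), and it obtains the MCC rate from the MWPM rate through the ratio $(1 + \lceil d/2 \rceil)/2^{\lceil d/2 \rceil}$ stated above. The function names are hypothetical, and the results are only meaningful for small $p$ and odd $d$.

```python
from math import comb


def chain_length(d):
    """Length ceil(d/2) of the shortest failing error chain (odd d only)."""
    if d % 2 == 0:
        raise ValueError("the expressions are derived for odd d only")
    return (d + 1) // 2


def fail_rate_bitflip(p, d):
    """Asymptotic fail rate for pure bit-flip (or phase-flip) noise of rate p."""
    n = chain_length(d)
    return 2 * d * comb(d, n) * p ** n


def fail_rate_mwpm(p, d):
    """Depolarizing noise of total rate p, decoded as two independent channels."""
    return 2 * fail_rate_bitflip(2 * p / 3, d)


def fail_rate_mcc(p, d):
    """MCC estimate, using the ratio (1 + n) / 2**n relative to MWPM."""
    n = chain_length(d)
    return fail_rate_mwpm(p, d) * (1 + n) / 2 ** n


if __name__ == "__main__":
    for d in (3, 5, 7, 9):
        print(d, fail_rate_mwpm(1e-3, d), fail_rate_mcc(1e-3, d))
```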
Another useful representation is to calculate the ratio of error chains with $\lceil d/2 \rceil$ errors that lead to a failure compared to the total number of chains with $\lceil d/2 \rceil$ errors: