General framework for constructing fast and near-optimal machine-learning-based decoder of the topological stabilizer codes

Quantum error correction is an essential technique for constructing a scalable quantum computer. To implement quantum error correction on near-term quantum devices, a fast and near-optimal decoding method is required. A decoder based on machine learning is considered one of the most viable solutions for this purpose, since its prediction is fast once training has been done, and it is applicable to any quantum error-correcting code and any noise model. So far, various formulations of the decoding problem as a machine learning task have been proposed. Here, we discuss general constructions of machine-learning-based decoders. We find several conditions for achieving near-optimal performance and propose a criterion that should be optimized when the size of the training data set is limited. We also discuss preferable constructions of neural networks and propose a decoder that exploits the spatial structure of topological codes using a convolutional neural network. We numerically show that our method can improve the performance of machine-learning-based decoders for various topological codes and noise models.


I. INTRODUCTION
In order to build a scalable quantum computer, quantum error correction (QEC) [1][2][3] is a vital technique for achieving reliable computation. According to the theory of QEC, if the noise strength is smaller than a certain threshold value, we can protect logical qubits encoded in physical qubits from the noise. Supported by extensive experimental efforts, the noise level of quantum operations on arrays of qubits is now approaching or meeting the threshold value. Therefore, a demonstration of QEC in a fully fault-tolerant setting is considered a milestone for near-term quantum devices [4][5][6]. Topological codes [7][8][9] are a family of quantum error-correcting codes inspired by topological properties in condensed matter physics [7]. Since topological codes such as surface codes [8,10,11] have both high experimental feasibility and high performance [12][13][14][15], they are considered the most promising candidates for quantum error-correcting codes.
In QEC, information on the occurrence of physical errors is measured as a syndrome value. A suitable recovery operation is estimated from the syndrome so that the original state of the logical qubits is decoded with high success probability. Unfortunately, constructing an optimal decoder is computationally hard in general. Thus, considerable effort has been devoted to developing efficient and near-optimal decoders. One approach is to use the most likely physical error that is consistent with the observed syndrome value as a recovery operation. This scheme is called the minimum-distance (MD) decoder. Though this decoding method is not necessarily optimal, it shows almost optimal performance [13][14][15]. In the case of the surface codes, if we can assume that bit-flip (Pauli X) and phase-flip (Pauli Z) errors are uncorrelated, we can construct an efficient MD decoder using minimum-weight perfect matching. However, if bit-flip and phase-flip errors are correlated or if we use other codes, even MD decoding is not efficiently implementable [16]. Some of these problems can be avoided by using geometrically local features of the topological codes. For example, for color codes [17], we can perform decoding by projecting the color code onto a surface code [18]. Another approach is to use the renormalization group method [19], which is applicable to any topological code, including the surface and color codes. While these approaches have been improved, there is an unavoidable trade-off between the performance and the time efficiency of the decoder. For the first experimental realization of QEC on near-term devices, more efficient and near-optimal decoders are required.
In this article, we discuss a general construction of machine-learning-based decoders. Recently, machine learning has been applied to various theoretical and experimental studies in quantum physics, such as classification of readout signals in experiments [20], simulation of a quantum system [21], classification of phases of matter [22], data compression of quantum states [23], and decoding in QEC [24][25][26][27][28]. In a machine-learning-based decoder, we construct a prediction model that outputs a recovery operator from a given syndrome value. Before prediction, the model is trained with many pairs of syndrome values and correct recovery operations. While the training task may take a long time, it is required only once before many runs of prediction, and each prediction is expected to be fast. Thus, the machine-learning-based decoder is one of the best solutions for demonstrating experimental QEC on near-term quantum devices.
arXiv:1801.04377v1 [quant-ph] 13 Jan 2018
As a prediction model, an artificial neural network is believed to have large representation power and is suitable for constructing a machine-learning-based decoder. Recently, the performance of machine-learning-based decoders with various neural networks has been numerically studied, such as the restricted Boltzmann machine [24], multi-layer perceptron [25], recurrent neural network [26], and deep neural network [27]. A machine-learning-based decoder using a neural network is called a neural decoder [24]. All these existing methods numerically showed that the performance of the neural decoder is superior to that of the known efficient decoders when a sufficiently large training data set is supplied. However, the following three points have yet to be understood. The first is how the decoding problem should be translated into a machine learning task in order to obtain faster learning and better prediction. So far, each of the previous studies has introduced its own construction of the data set and neural network with little consideration of this point. Second, the spatial features of the topological codes have not been considered in the construction of the neural decoder, except in a very recent study [28] that was carried out independently of this work. While the performance of the neural decoder is expected to improve when the spatial arrangement of the syndrome is taken into account, the spatial information has not been given to the neural network explicitly. Finally, the applicability of the neural decoder to various topological codes is not known. The neural decoder has been benchmarked only with surface codes [24][25][26][27][28]. Therefore, it has not been known whether the neural decoder is applicable to other codes, such as color codes.
We address all of these points in this paper. First, we discuss how the decoding problem should be formulated as a machine learning task. We propose a general framework for constructing a neural decoder, the linear prediction framework, to elucidate the factors that determine the performance of the decoders. We propose a criterion called the normalized sensitivity, which should be optimized for constructing a near-optimal neural decoder. Then, we propose a specific construction of a training data set that minimizes the normalized sensitivity. We call this the uniform data construction. We also propose a construction of neural networks that explicitly utilizes the spatial structure of the topological codes. We show that the performance of the neural decoder is improved with these techniques, and that it outperforms a decoder using minimum-weight perfect matching with a training data set of size $10^6$ at distance $d = 11$ in the surface code under depolarizing noise. We show that the neural decoder is also applicable to the color codes. The performance of the neural decoder for the color codes also reaches that of the MD decoder at small distances.

Organization of the article
In Sec. II, we give an overview of preliminary topics. We review a scheme of QEC in the case of stabilizer codes. We explain specific constructions of the topological codes, namely the surface and color codes. We also review the basics of supervised machine learning with neural networks in this section. In Sec. III, we address the question of how the neural decoder should be constructed. We propose a general framework, the linear prediction framework, in this section. We introduce a quantity called the normalized sensitivity and argue that it serves as a criterion for better performance of decoders for topological stabilizer codes. We also propose the uniform data construction, which consists of specific instructions to optimize the normalized sensitivity for surface codes and color codes. We numerically confirm that the performance of the neural decoder is improved with this construction in the case of the surface and color codes. In Sec. IV, we propose a network construction that explicitly utilizes the spatial information of the topological codes. We confirm that this construction also improves the performance of the neural decoder. Finally, we summarize this paper in Sec. V.

II. PRELIMINARY
In this section, we review the basic concepts and introduce the notation used in this paper. We first review a scheme of QEC. We also introduce well-known topological codes and decoders. The scheme of supervised machine learning with neural networks and its terminology are also explained in this section.

A. Quantum error correction
We consider the case where $k$ logical qubits are encoded in $n$ physical qubits. We assume that any noise can be represented as a probabilistic Pauli operation on the $n$ physical qubits. We denote the Pauli operators on a single qubit as $\{I, X, Y, Z\}$, and the Pauli operator $A$ on the $i$-th physical qubit as $A_i$. When we consider operations on the $n$ physical qubits, we ignore the global phase of the state and operator. Then, we can represent any physical error as $E \in \{I, X, Y, Z\}^{\otimes n}$. The weight $w(E)$ of a Pauli operator $E$ on the $n$ physical qubits is defined as the number of physical qubits on which $E$ acts non-trivially.
In the framework of stabilizer codes [29], the code is defined by $2^{n-k}$ stabilizer operators $L_I$ generated by $n-k$ Pauli operators $\{S_i\}$, which commute with each other. The logical space of the code is defined as the subspace that has eigenvalue $+1$ for all the stabilizer operators, i.e., $S_i|\psi\rangle = |\psi\rangle$ for all $i$. We denote the normalizer of the stabilizer operators as $L$. We call the elements of $L \setminus L_I$ logical operators. Each stabilizer operator acts on the logical space trivially, and each logical operator acts on the logical space non-trivially. The distance $d$ of the code is defined as $d := \min_{L' \in L \setminus L_I} w(L')$. A code that encodes $k$ logical qubits in $n$ physical qubits with distance $d$ is called an $[[n, k, d]]$ code.
The occurrence of a physical error is detected as the outcome of the stabilizer measurement $s$, where $s^T \in \{0, 1\}^{n-k}$ and the $i$-th element $s_i$ is the measurement outcome of the $i$-th stabilizer operator $S_i$. We call $s$ the syndrome vector. To recover the original state of the logical qubits, we estimate a recovery Pauli operator $T(s) \in \{I, X, Y, Z\}^{\otimes n}$ from the observed syndrome vector $s$ so that the total operation including the physical error acts on the logical space trivially with high probability. The mapping from the syndrome $s$ to the recovery operator $T(s)$ is called the decoder $T$. The logical error probability $p_L$ is defined as the probability with which the total operation becomes logically non-trivial. Our purpose is to construct an efficient decoder $T$ that minimizes the logical error probability $p_L$.

B. Binary representation of stabilizer code
It is convenient to translate calculations in the stabilizer codes into binary calculations in GF(2). In GF(2), addition $\oplus$ is performed modulo 2. We relate the Pauli operators on the $i$-th physical qubit to another representation Then, a Pauli operator $P$ on the $n$ physical qubits can be described as where $\alpha \in \{\pm 1, \pm i\}$ and $v_i \in \{0, 1\}$ ($1 \le i \le 2n$). We define a binary mapping where We use $h(v)$ for the Hamming weight of $v$ as a binary string, namely, the number of indices $i$ ($1 \le i \le 2n$) such that $v_i = 1$. We denote the $i$-th row vector of a matrix $M$ as $(M)_i$. The length of a vector $v$ is represented as $|v|$. With this definition, the normalizer of the stabilizer operators $L$ is defined as since the normalizer of the stabilizer operators is equivalent to the centralizer of that in the current formalism.
Note that the stabilizer group can be defined with the normalizer $L$ as With this formalism, QEC is translated as follows. The physical error $E$ can be represented as a binary row vector $e := b(E) \in \{0, 1\}^{2n}$, which occurs with a certain probability $p_e$. The syndrome vector $s$ is given by the column vector $s(e) := H_c \Lambda e^T$, where $H_c$ is an $(n-k) \times 2n$ matrix whose $i$-th row vector $(H_c)_i$ is $b(S_i)$. The matrix $H_c$ is called the check matrix. In the binary representation, we denote a decoder as $r$, which maps a given syndrome vector $s^T \in \{0, 1\}^{n-k}$ to the binary representation of a recovery operator $r(s) \in \{0, 1\}^{2n}$. It is convenient to define the pure error $t(s)$ [30] to represent various vectors succinctly. The pure error is a function that maps a syndrome vector $s^T \in \{0, 1\}^{n-k}$ to a vector $t(s) \in \{0, 1\}^{2n}$ and satisfies $t(s(e)) \oplus e \in b(L)$ for an arbitrary $e \in \{0, 1\}^{2n}$. We also introduce a $2k \times 2n$ generator matrix $G$ such that the elements of $L$ are uniquely represented as follows: Note that the generator matrix $G$ satisfies $H_c \Lambda G^T = 0$. We define the cosets $L_w$ with $w \in \{0, 1\}^{2k}$ as Note that $L_0 = b(L_I)$. Given $t(s)$ and $G$, an arbitrary physical error $e \in \{0, 1\}^{2n}$ is uniquely decomposed as with $l(e) \in L_0$ and $w(e) \in \{0, 1\}^{2k}$. We call $w(e)$ the class of $e$.
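The syndrome map $s(e) = H_c \Lambda e^T$ can be illustrated numerically. The following is a minimal sketch, assuming the common convention (an assumption, since the defining equations are not reproduced here) that $b(E)$ lists the X components first and the Z components second, so that $\Lambda$ swaps the two halves. The three-qubit bit-flip code and the function names `symplectic_form` and `syndrome` are illustrative choices, not taken from the paper.

```python
import numpy as np

def symplectic_form(n):
    """Assumed form of Lambda: the 2n x 2n matrix that swaps the X and Z halves."""
    zero, eye = np.zeros((n, n), dtype=int), np.eye(n, dtype=int)
    return np.block([[zero, eye], [eye, zero]])

def syndrome(H_c, e):
    """Compute s(e) = H_c Lambda e^T over GF(2) for a binary error vector e."""
    n = H_c.shape[1] // 2
    return (H_c @ symplectic_form(n) @ e) % 2

# Toy example (hypothetical): three-qubit bit-flip code, stabilizers Z1Z2 and Z2Z3.
# Each row is b(S_i) in (x | z) ordering.
H_c = np.array([[0, 0, 0, 1, 1, 0],
                [0, 0, 0, 0, 1, 1]])

e_x1 = np.array([1, 0, 0, 0, 0, 0])  # Pauli X on the first qubit
s = syndrome(H_c, e_x1)              # only the first stabilizer anticommutes with X1
print(s)
```

In this toy case a stabilizer used as an error, such as `H_c[0]`, gives the all-zero syndrome, reflecting that stabilizers commute with each other.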
A logical decoder with a recovery operation $r(s)$ can correct an error $e$ if and only if $e \oplus r(s(e)) \in L_0$. Under an error model $\{p_e\}$, the logical error probability is given by

C. Optimal and near-optimal decoders
An optimal decoder is defined as the decoder that minimizes the logical error probability. Let us write the conditional probability of $w(e) \in \{0, 1\}^{2k}$ for a given syndrome vector $s$ as Since the decoder is only provided with $s$, and distinct recovery operators are needed for correcting errors with different values of $w(e)$, the maximum probability of successful correction given $s$ is $\max_{w \in \{0,1\}^{2k}} q_s(w)$. We thus say a decoder is optimal if $r(s)$ satisfies for any $s$ with Though the definition of $w(e)$ depends on the choice of $t(s)$ and $G$, the optimality of a decoder $r(s)$ is independent of the choice.
Another important example of a near-optimal decoder is the minimum-distance (MD) decoder. An MD decoder chooses as a recovery operation the most probable physical error $e^*(s)$, which satisfies
$p_{e^*(s)} \ge p_e \quad \forall e \in \{e \mid s(e) = s\} \qquad (14)$
Though the maximum-likelihood physical error $e^*(s)$ does not necessarily satisfy the condition Eq. (12), it is empirically known that the MD decoder achieves near-optimal performance. It is known that the MD decoder can be constructed efficiently in limited cases of the code and the error model. For example, we can construct an efficient MD decoder for the surface code under independent bit-flip and phase-flip errors. In this case, we can reduce the decoding problem to minimum-weight perfect matching (MWPM), which can be efficiently solved with the blossom algorithm [31]. When bit-flip and phase-flip errors are correlated, we can still construct a decoder with MWPM by ignoring the correlation, resulting in a sub-optimal decoder. We call such a decoder an MWPM decoder.
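For very small codes, the MD decoder can be written as an exhaustive search, which makes its definition concrete even though it is not efficient. The sketch below assumes i.i.d. depolarizing noise with a small error rate, so that the most probable error consistent with the syndrome is one of minimum Pauli weight, and assumes the (x | z) ordering of the binary vector. The toy code and the names `pauli_weight` and `md_decode` are hypothetical.

```python
import itertools
import numpy as np

def pauli_weight(e):
    """Number of qubits on which the Pauli operator with binary vector e = (x | z) acts non-trivially."""
    n = len(e) // 2
    return int(np.count_nonzero(e[:n] | e[n:]))

def md_decode(H_c, s):
    """Brute-force search for a minimum-weight error consistent with syndrome s.

    Exponential in n; intended only to illustrate the definition on toy codes.
    """
    n = H_c.shape[1] // 2
    best = None
    for bits in itertools.product((0, 1), repeat=2 * n):
        e = np.array(bits)
        swapped = np.concatenate([e[n:], e[:n]])  # Lambda e^T under the assumed ordering
        consistent = np.array_equal((H_c @ swapped) % 2, s)
        if consistent and (best is None or pauli_weight(e) < pauli_weight(best)):
            best = e
    return best

# Toy example (hypothetical): three-qubit bit-flip code, stabilizers Z1Z2 and Z2Z3.
H_c = np.array([[0, 0, 0, 1, 1, 0],
                [0, 0, 0, 0, 1, 1]])
recovery = md_decode(H_c, np.array([1, 0]))
print(recovery, pauli_weight(recovery))
```

Ties between equally likely errors are broken by enumeration order here; a practical decoder such as MWPM replaces the exhaustive loop with a polynomial-time matching step.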

D. Topological code
We consider two types of topological codes in this article: surface codes and color codes. The qubit allocation of the surface code is shown in Fig. 1. The color codes consist of a lattice that has three-colored faces: red, green, and blue. Two types of codes, the [4,8,8]-color code and the [6,6,6]-color code, are shown in Fig. 2(a) and (b), respectively. The physical qubits are also located on the vertices of the faces. Each colored face represents a stabilizer operator, which acts with non-trivial Pauli operators on its vertices. The [4,8,8]

E. Supervised machine learning
Supervised machine learning is a branch of artificial intelligence that requires a training data set $\{(x_1, y_1), \ldots, (x_N, y_N)\}$ consisting of feature data $x_i$ and the corresponding label data $y_i$. Its aim is to prepare a model that takes the feature data as input and outputs an inferred label for it. The model has a predetermined structure and trainable parameters $\theta$.
Unlike a simple dictionary, the model is expected to infer a label even for unseen feature data. This is achieved by optimizing the model parameters $\theta$ for the training data set. This process is commonly called training. Specifically, during training, the difference between the output $y'$ of the model for a feature and the correct label $y$ is evaluated with a real-valued loss function $L(y, y')$. The loss is minimized if and only if the prediction is exactly the same as the correct label. The training data is used to optimize the model parameters $\theta$ to reduce the loss. This can be done with standard optimization methods such as stochastic gradient descent:
$\theta \leftarrow \theta - \gamma \nabla_\theta L$,
where $\gamma \in \mathbb{R}$ is the learning rate and $L$ is calculated for a randomly chosen subset, called a batch, of the training data set. As we can see here, the loss function is required to be differentiable, as is the case for the squared L2 distance $\|y - y'\|_2^2$. Once trained, we can apply the model to unseen feature data and obtain its predicted label with simple calculations involving the network parameters and the input feature data.
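The stochastic gradient descent update can be sketched in a few lines. The example below is only an illustration under simplifying assumptions: the model is a linear map rather than a neural network, the loss is the mean squared L2 distance over a batch, and the data, learning rate, and batch size are arbitrary hypothetical choices.

```python
import numpy as np

def sgd_step(theta, x_batch, y_batch, lr):
    """One SGD step for the linear model y' = x @ theta with mean squared L2 loss."""
    residual = x_batch @ theta - y_batch              # y' - y for each sample in the batch
    grad = 2.0 * x_batch.T @ residual / len(x_batch)  # gradient of the batch-averaged loss
    return theta - lr * grad                          # theta <- theta - gamma * grad

rng = np.random.default_rng(0)
true_theta = np.array([[1.0], [-2.0]])  # parameters used to generate the labels
x = rng.normal(size=(256, 2))
y = x @ true_theta

theta = np.zeros((2, 1))
for _ in range(500):
    idx = rng.choice(len(x), size=32, replace=False)  # randomly chosen batch
    theta = sgd_step(theta, x[idx], y[idx], lr=0.1)
print(theta.ravel())
```

Because the labels here are noise-free, the iterates approach `true_theta`; with a neural network the same update is applied to all weights and biases, with the gradient obtained by back-propagation.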
FIG. 2: The qubit allocation of the [4,8,8]-color code and the [6,6,6]-color code. Each vertex corresponds to a physical qubit, and each face corresponds to a stabilizer operator.
Artificial Neural Network (ANN) is a machine learning model inspired by neural structures found in nature. Here, we assume that neurons are real-valued functions and a layer $h$ is a vector of neurons. The multilayer perceptron (MLP) is one of the simplest ANNs and, as its name suggests, consists of multiple layers of neurons including the input and output layers. Each neuron in a layer is connected to all neurons in the neighboring layers with trainable weights and biases, yet is completely independent of the other neurons in its own layer. Mathematically, this can be described as
$h^{(n)}_i = A\left(\sum_j W^{(n,n-1)}_{ij} h^{(n-1)}_j + b^{(n)}_i\right)$,
where $A$ is a nonlinear activation function, $b^{(n)}_i$ is the bias added to the $i$-th neuron in the $n$-th layer, and $W^{(n,n-1)}_{ij}$ is the weight connecting the $i$-th neuron in the $n$-th layer to the $j$-th neuron in the $(n-1)$-th layer. We illustrate this in Fig. 3. Here, the model parameters are the weights and biases. In this model, the input information propagates in the forward direction, i.e., from the input nodes to the output nodes. At the output nodes, the loss value is calculated from the model output and the correct label. In order to update the model parameters, the gradient of the loss function $\nabla_\theta L$ is evaluated with the back-propagation method. According to the universal approximation theorem [32], any continuous function can be approximated by an MLP model of finite size, even though its structure is simple and compact. Thus, we expect that a neural decoder with an MLP model can achieve near-optimal performance under an appropriate training process.
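The layer-by-layer propagation of an MLP can be written as a short loop. This is a minimal sketch of the forward pass only, assuming a tanh activation applied at every layer and hand-picked weights; the function name `mlp_forward` and all numerical values are hypothetical and serve only to show the structure.

```python
import numpy as np

def mlp_forward(x, weights, biases, activation=np.tanh):
    """Propagate x through the layers: h_n = A(W_n @ h_{n-1} + b_n)."""
    h = np.asarray(x, dtype=float)
    for W, b in zip(weights, biases):
        h = activation(W @ h + b)
    return h

# Two-layer toy network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
weights = [np.array([[1.0, -1.0], [0.5, 0.5]]), np.array([[1.0, 1.0]])]
biases = [np.zeros(2), np.zeros(1)]
out = mlp_forward([0.5, 0.5], weights, biases)
print(out)
```

In a neural decoder the input would be the syndrome vector and the output the predicted label; training adjusts `weights` and `biases` by gradient descent on the loss.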

III. CONSTRUCTION OF TASKS OF MACHINE-LEARNING-BASED DECODERS
In general, the achievable accuracy in machine learning with a given size of training data depends on the formulation of the prediction task. In order to construct a near-optimal neural decoder, it is vital to consider what formulation of the prediction task is preferable. However, this point has not been discussed from a unified viewpoint in the existing methods [24][25][26][27]. In this section, we discuss how the decoding problem should be formulated as a machine learning task in order to achieve near-optimal performance. To this end, we propose a general framework, which we call the linear prediction framework. In this framework, we can analytically study the behavior of the neural decoder and discuss requirements for achieving near-optimal performance. Based on this discussion, we propose a criterion, the normalized sensitivity, which should be optimized in defining the label for constructing a good decoder. We show specific constructions that minimize the normalized sensitivity for the surface codes and the color codes, which we call the uniform data construction. Then, we numerically confirm that the performance of the neural decoder is improved with the construction. We also confirm that this construction is applicable to the color codes.
FIG. 3: Feed-forward network. The $j$-th neuron in the $(n-1)$-th layer is connected to the $i$-th neuron in the $n$-th layer via the weight $W_{ij}$.

A. Linear prediction framework
In order to discuss the behavior of the neural decoder from a unified viewpoint, we consider a neural decoder with the following two specifications. First, the neural decoder uses the syndrome vector $s$ as the feature data to be fed to the trainable model. Second, the label data is a binary vector, and the correct label is linearly generated from the physical error vector $e$ in GF(2). We call a linearly generated label vector $g$ a diagnosis, and a matrix $H_g$ that generates the diagnosis $g := H_g \Lambda e^T$ a diagnosis matrix. The recovery operator $r$ is calculated from the predicted diagnosis $g$ and the syndrome $s$. We use an assumed physical error distribution $\{p_e\}$ only for generating a training data set $\{(s_i, g_i)\}$, and do not use it for constructing $H_g$ or in the calculation of the recovery operator $r$ from $g$ and $s$. Though this framework restricts the label to be linearly generated from the physical error, it is general enough to formulate all the constructions described in the existing methods [24][25][26][27][28] as special cases, with small technical exceptions.
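The generation of a training data set $\{(s_i, g_i)\}$ in this framework can be sketched as follows. This is an illustration under stated assumptions: the error distribution is i.i.d. depolarizing noise, the binary vector uses the (x | z) ordering so that $\Lambda$ swaps the two halves, and the check matrix and diagnosis matrix belong to a hypothetical three-qubit toy code. The names `sample_depolarizing` and `make_dataset` are not from the paper.

```python
import numpy as np

def sample_depolarizing(n, p, rng):
    """Sample e = (x | z): each qubit suffers X, Y, or Z with probability p/3 each."""
    pauli = rng.choice(4, size=n, p=[1 - p, p / 3, p / 3, p / 3])  # 0:I 1:X 2:Y 3:Z
    x = ((pauli == 1) | (pauli == 2)).astype(int)
    z = ((pauli == 2) | (pauli == 3)).astype(int)
    return np.concatenate([x, z])

def make_dataset(H_c, H_g, p, size, seed=0):
    """Return (syndromes, diagnoses), one row per sampled error."""
    rng = np.random.default_rng(seed)
    n = H_c.shape[1] // 2
    errors = np.array([sample_depolarizing(n, p, rng) for _ in range(size)])
    swapped = np.concatenate([errors[:, n:], errors[:, :n]], axis=1)  # rows of (Lambda e^T)^T
    syndromes = (swapped @ H_c.T) % 2   # s = H_c Lambda e^T
    diagnoses = (swapped @ H_g.T) % 2   # g = H_g Lambda e^T
    return syndromes, diagnoses

# Toy example (hypothetical): stabilizers Z1Z2, Z2Z3; diagnosis rows X1X2X3 and Z1.
H_c = np.array([[0, 0, 0, 1, 1, 0],
                [0, 0, 0, 0, 1, 1]])
H_g = np.array([[1, 1, 1, 0, 0, 0],
                [0, 0, 0, 1, 0, 0]])
S, Gd = make_dataset(H_c, H_g, p=0.1, size=100)
print(S.shape, Gd.shape)
```

The error distribution enters only through `sample_depolarizing`, matching the statement that $\{p_e\}$ is used for data generation and not for constructing $H_g$.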
Since the actual performance of the neural decoder depends on many factors, such as the configuration of the training process, the size of the training data set, and details of the network construction, we start by considering the problem in an ideal limit. We first consider the problem under the simple 0-1 loss function with an unlimited training data set. Then, we relax these impractical assumptions to practical ones. Though we numerically investigate the case of a single logical qubit ($k = 1$) later, we present the formalism for a general value of $k$.

The neural decoder with the 0-1 loss function and an unlimited training data set
We first consider a hypothetical decoder that can minimize any loss function with an unlimited training data set. Though such an assumption is not practical, it is convenient for revealing the conditions for performing optimal decoding with machine learning in the ideal limit. We choose the 0-1 delta function $\delta(g, g')$ as the loss function, which is zero if the predicted and the correct diagnosis are the same, and unity otherwise. Let us consider the portion of the training data set with a specific value of $s$ with $\Pr_{e \sim \{p_e\}}[s(e) = s] > 0$. If the neural decoder returns the diagnosis $g$ for the input $s$, the total loss for this portion is proportional to the following value, Let $g^{(\delta)}(s)$ be the output of the ideally trained neural decoder. Since it should minimize the total loss for every $s$, it satisfies We call this ideal decoder a delta diagnosis decoder and $g^{(\delta)}(s)$ a delta diagnosis vector.
We show the condition for a diagnosis matrix H g to guarantee that we can perform the optimal decoding with the delta diagnosis decoder.To this end, we define a property of the diagnosis matrix and introduce a set of diagnosis vectors as follows.
Definition III.1. faithful diagnosis matrix - Given a check matrix $H_c$, we say diagnosis matrix or equivalently, where
Definition III.2. faithful diagnosis vectors - Given a check matrix $H_c$, a pure error $t(s)$, and a faithful diagnosis matrix $H_g$, we define $2^{2k}$ faithful diagnosis vectors $\{g_s(w)\}$ ($w \in \{0, 1\}^{2k}$) associated with a syndrome vector $s$ by
Note that the faithful condition of $H_g$ implies that is injective and with $s = H_c \Lambda e^T$. As a result, when $H_g$ is faithful, we have from Eqs. (17) and (24). Then the injective property of $g_s(w)$ leads to where $q_s(w)$ is defined in Eq. (11).
When the diagnosis matrix is faithful, we can construct an optimal decoder as follows. From Eqs. (18) and (26), we see that the delta diagnosis vector $g^{(\delta)}(s)$ is one of the faithful diagnosis vectors. We can thus write it in the form Eqs. (18), (26), and (27) imply that Since $g_s(w)$ is injective, one can calculate $w^*(s)$ from the diagnosis $g^{(\delta)}(s)$ and the syndrome $s$. The recovery operator is then chosen as For the optimality, we have for any $s$ with $\Pr_{e \sim \{p_e\}}[s(e) = s] > 0$, which satisfies Eq. (12). We can also prove a converse statement for the cases where $H_g$ is not faithful (see Appendix A), arriving at the following lemma.
Lemma III.1. If the diagnosis matrix $H_g$ is faithful, there exists a map $r^*(g, s)$ such that the decoder with $r(s) = r^*(g^{(\delta)}(s), s)$ is optimal for an arbitrary distribution $\{p_e\}$. If the diagnosis matrix $H_g$ is not faithful, no such map exists.
This lemma implies that we can perform optimal decoding with the delta diagnosis decoder only when the diagnosis matrix $H_g$ is faithful. Note that the set of the faithful diagnosis vectors $\{g_s(w) \mid w \in \{0, 1\}^{2k}\}$ is independent of the choice of the generator $G$ and the pure error $t(s)$. Whether we can perform the optimal decoding or not depends only on the construction of $H_g$.

The neural decoder with the L2 loss function and an unlimited training data set
In this subsection, we replace the 0-1 loss function with a more practical one, which is the squared L2 distance.We still consider the limit of an infinite size of the training data set and the perfect loss minimization.In this case, the total loss for a fixed s under an unlimited training data set is proportional to the following value.
We define a decoder that is ideally trained with the L2 loss function as an L2 diagnosis decoder. We also call the output of the L2 diagnosis decoder an L2 diagnosis vector $g^{(L2)}(s)$. The L2 diagnosis vector satisfies the following equation.
When the chosen diagnosis matrix is faithful, we can analytically solve for $g^{(L2)}(s)$ by differentiating Eq. (31), and the L2 diagnosis vector can be written as follows.
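The two ideal decoders can be contrasted on a finite sample. For a fixed syndrome, the output minimizing the summed 0-1 loss is the most frequent diagnosis (the conditional mode), while the output minimizing the summed squared L2 loss is the average diagnosis (the conditional mean). The sketch below computes both empirically; the function name `ideal_diagnoses` and the tiny data set are hypothetical.

```python
from collections import Counter, defaultdict
import numpy as np

def ideal_diagnoses(syndromes, diagnoses):
    """Group diagnoses by syndrome and return (mode per syndrome, mean per syndrome)."""
    groups = defaultdict(list)
    for s, g in zip(syndromes, diagnoses):
        groups[tuple(s)].append(tuple(g))
    # Minimizer of the 0-1 loss: the most frequent diagnosis for each syndrome.
    delta = {s: Counter(gs).most_common(1)[0][0] for s, gs in groups.items()}
    # Minimizer of the squared L2 loss: the element-wise mean diagnosis.
    l2 = {s: np.mean(gs, axis=0) for s, gs in groups.items()}
    return delta, l2

syndromes = [[0], [0], [0], [1]]
diagnoses = [[0, 0], [1, 1], [1, 1], [0, 1]]
delta, l2 = ideal_diagnoses(syndromes, diagnoses)
print(delta[(0,)], l2[(0,)])
```

The mean is in general a real-valued vector rather than a binary one, which is why the L2 case needs an additional step to recover the class probabilities from it.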
Let us define a column vector of order $2^{2k}$ as It satisfies the following matrix equation: where We can solve it for $q_s$ if $D_s$ has a left inverse $D_s^{-1}$ such that $D_s^{-1} D_s = I$ in the real-valued calculation, namely, if the rank of $D_s$ as a real-valued matrix is $2^{2k}$. If the rank is smaller, the solution $q_s$ is not unique, and hence it is not always possible to determine the $w$ that maximizes $q_s(w)$, which implies that we cannot perform the optimal decoding. Though the rank condition apparently depends on the syndrome $s$, we can formulate it as a condition that is independent of $s$. Any faithful diagnosis $g_s(w)$ can be written as We define a transformation $\sigma_\delta$ by for $\delta \in \{0, 1\}^{2k}$ and $v \in \mathbb{R}^{2k}$. It is affine, isometric, and involutory. Since $g_s(w) = \sigma_{\delta(s)}(H_g \Lambda (wG)^T)$, we have We see that a transformation $\sigma_\delta$ is an affine transformation, and this transformation satisfies Thus, when we apply the transformation $\sigma_{\delta(s)}$ to Eq. (35), we obtain where Thus, we can uniquely calculate $q_s$ for an arbitrary $s$ if the matrix $D$ has a left inverse, which is equivalent to the condition that $\{H_g \Lambda (wG)^T \mid w \in \{0, 1\}^{2k}\}$ is affinely independent. We call a diagnosis matrix satisfying this condition decomposable:
Definition III.3. decomposable diagnosis matrix - Given a generator matrix $G$, we say a diagnosis matrix $H_g$ is decomposable if the set of real vectors $\{H_g \Lambda (wG)^T \mid w \in \{0, 1\}^{2k}\}$ is affinely independent, namely, the rank of the matrix $D$ defined in Eq. (44) is $2^{2k}$ when we consider $D$ as a real-valued matrix.
When $H_g$ is faithful, the above definition is independent of $G$, because the set We show a scheme to perform the optimal decoding using the L2 diagnosis decoder when a diagnosis matrix is faithful and decomposable. When $H_g$ is decomposable, there exists a left inverse $D^{-1}$ such that $D^{-1} D = I$ in the real vector space. When we observe a syndrome vector $s$, we obtain the L2 diagnosis $g^{(L2)}(s)$ using the trained L2 diagnosis decoder, and calculate $\delta(s) = H_g \Lambda t(s)$. Since the diagnosis matrix is faithful, the probabilities of the faithful diagnosis vectors are given by Then, we construct a recovery operator as where $w^*(s)$ satisfies With this recovery operator, we obtain and thus this decoder satisfies Eq. (12). We can also prove a converse statement for the cases where a faithful diagnosis matrix $H_g$ is not decomposable (see Appendix A), arriving at the following lemma.
Lemma III.2. If the diagnosis matrix $H_g$ is faithful and decomposable, there exists a map $r^*(g, s)$ such that the decoder with $r(s) = r^*(g^{(L2)}(s), s)$ is optimal for an arbitrary distribution $\{p_e\}$. If the diagnosis matrix $H_g$ is faithful but not decomposable, no such map exists.
We show a simple example of a faithful and decomposable matrix $H_g$ in the case of $k = 1$. We choose vectors $l_{01}$, $l_{10}$, and $l_{11}$ from $L_{01}$, $L_{10}$, and $L_{11}$, respectively. We construct $H_g$ and the generator $G$ as We see that $\mathrm{span}(\{(H_g)_i\}) = b(L)$, and thus $H_g$ is faithful. The set $\{H_g \Lambda (wG)^T \mid w \in \{00, 01, 10, 11\}\}$ is which is affinely independent, and thus $H_g$ is decomposable. We can verify the same by checking that the rank of the matrix $D$ built from $g(00)$, $g(01)$, $g(10)$, and $g(11)$ is 4 in the real vector space.

The neural decoder with the L2 loss function under a finite training data size
In practical cases, the size of the training data set is limited, and hence the loss is not perfectly minimized. This implies that the output diagnosis from the model deviates from the L2 diagnosis vector. In such a case, it is desirable to construct a decoder whose prediction is as robust against the deviations as possible. We introduce a slight modification to the optimal decoding scheme in the last subsection so that it is applicable to an output diagnosis that deviates from the L2 diagnosis vector.
We denote the predicted diagnosis as $g^P(s) \in \mathbb{R}^{|g|}$, which deviates from the L2 diagnosis vector. Note that $g^P(s)$ cannot be represented as a linear combination of the faithful diagnosis vectors in general. In order to construct a decoding scheme that is robust to a small deviation, it is natural to extend the scheme employed in Sec. III A 2 such that we project $g^P(s)$ onto the hyperplane formed by affine combinations of the faithful diagnosis vectors, and then extract the coefficients $q^P_s$ from the projected point. This projection and extraction are achieved as follows. We perform a QR decomposition of $D$ and obtain $D = QR$, where $Q$ is an orthogonal matrix and $R$ is an upper-triangular matrix. We construct $D^{-1} = R^{-1} Q^T$, which satisfies $D^{-1} D = I$. Then, we obtain a predicted vector $q^P_s$ as where $\delta(s) = H_g \Lambda t(s)$. We construct a recovery operator as where $w^*(s)$ satisfies Note that though the elements of $q^P_s$ may lie outside $[0, 1]$, the above procedure is still well-defined.
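The QR-based extraction of class probabilities can be sketched numerically. Several ingredients here are assumptions made for illustration rather than the paper's exact definitions: $D$ is taken to be the matrix whose columns are the vectors $H_g \Lambda (wG)^T$ with an extra leading row of ones (encoding that the coefficients sum to one), $\sigma_\delta$ is taken to flip $v_i \to 1 - v_i$ wherever $\delta_i = 1$, and the four column vectors are a hypothetical $k = 1$ example. The names `sigma` and `decode_class` are illustrative.

```python
import numpy as np

def sigma(delta, v):
    """Assumed affine flip: v_i -> 1 - v_i where delta_i = 1, identity elsewhere."""
    return np.where(delta == 1, 1.0 - v, v)

def decode_class(g_pred, delta, class_vectors):
    """Estimate q^P_s from a real-valued predicted diagnosis and return (argmax index, q).

    class_vectors: matrix whose columns are H_g Lambda (wG)^T for each class w (assumed given).
    """
    D = np.vstack([np.ones(class_vectors.shape[1]), class_vectors])
    Q, R = np.linalg.qr(D)                # reduced QR decomposition, D = Q R
    D_left_inv = np.linalg.solve(R, Q.T)  # R^{-1} Q^T, a left inverse of D
    target = np.concatenate([[1.0], sigma(delta, g_pred)])
    q = D_left_inv @ target
    return int(np.argmax(q)), q

# Hypothetical k = 1 example: columns correspond to classes w = 00, 01, 10, 11.
class_vectors = np.array([[0, 1, 0, 1],
                          [0, 0, 1, 1],
                          [0, 1, 1, 0]], dtype=float)
delta = np.array([1, 0, 0])          # stands in for delta(s) = H_g Lambda t(s)
g_pred = np.array([0.9, 0.8, 0.9])   # noisy prediction near sigma_delta applied to class 10
w_star, q = decode_class(g_pred, delta, class_vectors)
print(w_star, q)
```

Because only the largest entry of `q` matters, the result is unchanged by small deviations of `g_pred`, and entries of `q` outside $[0, 1]$ do not cause any difficulty.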

Criterion for diagnosis matrix
In practice, the size of the training data set is far smaller than the number of possible syndrome vectors $s$ when the distance $d$ is larger than about 7. For example, in the existing methods [24][25][26][27], the size of the training data set is at most $10^9$. On the other hand, the number of syndrome patterns, $2^{n-k}$, becomes larger than $10^9$ at the distance $d = 7$ for the $[[d^2, 1, d]]$ surface code. This implies that almost all the patterns of the syndrome vector $s$ obtained in experiments are not found in the training data set. The model should infer the L2 diagnosis vector $g^{(L2)}(s)$ for an $s$ that is not included in the training data set. The aim of this subsection is to propose a criterion for $H_g$ that we believe reflects the robustness of the prediction when such a sparsely sampled training data set is used.
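The counting argument can be checked with a few lines of arithmetic. The sketch below evaluates $2^{n-k}$ for the $[[d^2, 1, d]]$ surface code at odd distances and finds the first distance at which it exceeds $10^9$; restricting to odd $d$ and the name `syndrome_patterns` are choices made for this illustration.

```python
def syndrome_patterns(d, k=1):
    """Number of distinct syndrome vectors, 2^(n-k), for the [[d^2, k, d]] surface code."""
    return 2 ** (d * d - k)

TRAINING_SET_SIZE = 10 ** 9  # upper end of the training set sizes quoted in the text

# First odd distance at which the syndrome space outgrows the training set.
first_d = next(d for d in range(3, 100, 2) if syndrome_patterns(d) > TRAINING_SET_SIZE)

for d in (3, 5, 7):
    print(d, syndrome_patterns(d))
print("first odd distance exceeding the training set size:", first_d)
```

At $d = 5$ there are $2^{24}$ patterns, still below $10^9$, whereas at $d = 7$ there are $2^{48}$, so nearly every syndrome seen at prediction time is new to the model.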
Since the problem is to estimate the vector-valued function $g^{(L2)}(s)$ from a sparsely sampled set of values, its difficulty should depend on how rapidly the function changes its output value as the input value $s$ varies. From Eqs. (24) and (33), we see that the function is written as which shows that $g^{(L2)}(s)$ is implicitly determined from the two functions of errors, $g(e) = H_g \Lambda e^T$ and $s(e) = H_c \Lambda e^T$. In order to quantify how rapidly these functions change, let us introduce the sensitivity $m(H)$ of a binary matrix $H$ as Using the sensitivity, the variation of $s(e)$ is bounded as In the case of topological codes, $m(H_c)$ is a small constant. This is because each physical qubit is monitored by at most a constant number of stabilizer operators. Suppose that $g^{(L2)}(s)$ is close to one of the faithful diagnosis vectors $g_s(w^*)$, and let $S(s, w^*; 0)$ be the set of errors $e$ satisfying $w(e) = w^*$ and $s(e) = s$. We further define the set
$S(s, w^*; h) := \{e' \mid \exists e \text{ s.t. } e \in S(s, w^*; 0),\ h(e \oplus e') \le h\} \qquad (59)$
We see that any $e' \in S(s, w^*; h)$ produces a training data point $(s', g')$ such that The choice of $H_g$ also affects how precisely $g^{(L2)}(s)$ should be estimated in order to determine $w^*$ correctly.
To quantify this, we consider how far $g^P(s)$ can deviate from a faithful diagnosis $g_s(w)$ without affecting the decoding result of Eqs. (53) and (54). When the decoding result changes from $w^* = w$ to $w^* = w'$, the solution of Eq. (53) should satisfy $q^P_s(w) = q^P_s(w')$, which constrains the form of $g^P(s)$. We define the minimum boundary distance $M(H_g)$ so as to ensure that $w^* = w$ as long as $\|g^P(s) - g_s(w)\|_2^2 \le M(H_g)$; its explicit definition is given in Eq. (63). Note that this definition is independent of $s$, since the affine transformation $\sigma_{\delta(s)}$ is isometric. $M(H_g)$ is nonzero if and only if $H_g$ is decomposable.
Regarding $M(H_g)$ as the relevant length scale, we define the following quantity to be used as a criterion for a better construction of $H_g$.

Definition III.4. Normalized sensitivity. We define the normalized sensitivity $N(H_g)$ of a faithful and decomposable matrix $H_g$ as

$$N(H_g) := \frac{m(H_g)}{M(H_g)},$$

where $m(H_g)$ is the sensitivity of $H_g$ defined in Eq. (57), and $M(H_g)$ is the minimum boundary distance of $H_g$ defined in Eq. (63).
Eqs. (61) and (63) imply that an error belonging to $S(s, w^*; h)$ with $h \sim (m(H_g)/M(H_g))^{-1}$ leads to a training sample useful for the estimation of $g^{(L2)}(s)$. We thus expect that using a diagnosis matrix $H_g$ with a small normalized sensitivity $N(H_g)$ enables high-performance prediction with a small training data set.

Uniform data construction
We propose specific constructions which minimize the normalized sensitivity up to the order of $d$ in the case of $k = 1$. We first consider a lower bound on the normalized sensitivity. When a diagnosis matrix $H_g$ is faithful, each row vector of $H_g$ corresponds to a logical operator or a stabilizer operator. We denote the number of logical operators among the rows of $H_g$ as $n_L$. The minimum boundary distance $M(H_g)$ is upper-bounded in terms of $n_L$ [Eq. (65)]. Since any logical operator has at least $d$ one-elements in its binary representation, there are at least $d n_L$ one-elements in the diagnosis matrix. Denoting the number of one-elements in the diagnosis matrix $H_g$ as $\chi(H_g)$, we have $\chi(H_g) \ge d n_L$ [Eq. (66)]. Since there are $2n$ columns in $H_g$, at least one column has a Hamming weight no smaller than $\chi(H_g)/(2n)$ [Eq. (67)]. The sensitivity $m(H_g)$ is equal to the maximum Hamming weight of the column vectors of the diagnosis matrix [Eq. (68)]. From Eqs. (65)-(68), we obtain a lower bound on $N(H_g)$. In particular, when we focus on two-dimensional topological codes such that $n = \Theta(d^2)$, the normalized sensitivity is lower-bounded as $N(H_g) = \Omega(d^{-1})$. For surface codes and color codes with a single logical qubit, we found specific constructions of $H_g$ such that $N(H_g)$ scales as $\Theta(d^{-1})$. See Appendix C for the specific constructions. We call these constructions the uniform data construction of the data set, since the logical operators corresponding to the rows of $H_g$ are chosen uniformly to cover all the physical qubits.
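The counting step in the argument above can be illustrated with a small sketch. It takes the sensitivity to be the maximum column Hamming weight, as stated in the text, and checks the pigeonhole bound that some column must carry at least $\chi/(\text{number of columns})$ ones. The two toy matrices are made up for illustration and are not diagnosis matrices of any real code.

```python
from typing import List

Matrix = List[List[int]]  # binary matrix stored as a list of rows


def sensitivity(H: Matrix) -> int:
    """Maximum Hamming weight over the columns of a binary matrix."""
    return max(sum(col) for col in zip(*H))


def count_ones(H: Matrix) -> int:
    """Total number of one-elements (chi in the text)."""
    return sum(sum(row) for row in H)


# Three rows of weight 3 that all sit on the same columns.
H_SAME = [[1, 1, 1, 0, 0, 0]] * 3
# Three rows of weight 3 spread over all six columns.
H_SPREAD = [
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1, 1],
]

for H in (H_SAME, H_SPREAD):
    chi, num_cols = count_ones(H), len(H[0])
    # Pigeonhole: max column weight * number of columns >= total number of ones.
    assert sensitivity(H) * num_cols >= chi
    print(sensitivity(H), chi, num_cols)
```

Both toy matrices contain the same number of ones, but spreading the rows over the columns lowers the maximum column weight, which is the intuition behind choosing the logical operators uniformly.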

B. Construction of data set and example
Let us summarize the discussion in Sec. III A. Given the check matrix $H_c$ of the code and the error model $\{p_e\}$, the whole protocol can be described as follows.
• Preparation: We construct a faithful and decomposable diagnosis matrix $H_g$ with a small normalized sensitivity, possibly $N(H_g) = \Theta(d/n)$. We choose a pure error $t(s)$ and a generator matrix $G$. We perform a QR decomposition of the matrix built from the faithful diagnoses $g(w) = H_g \Lambda (wG)^T$, and obtain $Q$ and $R$. We then calculate the left inverse matrix $D^{-1}$.
• Data generation: We generate a set of physical errors $\{e_0, e_1, \ldots\}$ according to the probability distribution $\{p_e\}$, and generate the data set $\{(s_0, g_0), (s_1, g_1), \ldots\}$ from it, where $s_i := H_c \Lambda e_i^T$ and $g_i := H_g \Lambda e_i^T$.
• Training: The model is trained so that it can predict $g$ from $s$. The loss of the prediction is defined as the L2 distance between $g$ and $g^P$, where $g^P$ is the real-valued output vector of the model.
• Prediction: When an observed syndrome $s$ is given to the trained model, it predicts $g^P(s)$. In parallel, we calculate $\delta(s)$. We calculate the vector $q_s$ defined in Eq. (34) from the prediction, using the affine transformation $\sigma_{\delta(s)}$. We choose $w^P$ based on the elements $\{q^P_s(w)\}$ of $q^P_s$. Then, we obtain an estimated recovery operator

$$r(s) = w^P G \oplus t(s). \tag{77}$$

We emphasize that the choice of $t(s)$ and $G$ does not affect the performance of the decoder, since the success of the estimation is independent of them. Only the construction of $H_g$ affects the performance of the decoder.

We show a specific example of the decoding scheme. For simplicity, we consider the case where there are only bit-flip errors in the $[[2d^2 - 2d + 1, 1, d]]$ surface code. In this case, it is enough for QEC to consider the stabilizer operators consisting of Pauli Z operators. The simplified picture of the code is shown in Fig. 4.
In this picture, a bit-flip error on a physical qubit is represented by the color of the corresponding edge (green: no error, red: error), and the syndrome value is represented by the color of the circle (green: undetected, red: detected). As shown in Fig. 4(a), the matrix $H_g$ is constructed with logical operators, each of which is the product of the Pauli Z operators on the edges crossing the dotted line. In this case, we see ). Suppose that bit-flip errors occur on a set of the physical qubits as shown in Fig. 4(b). The physical error is detected with the syndrome values shown in the same figure. The diagnosis vector is calculated from the commutation relations between the chosen logical operators and the physical error. We show the calculated diagnosis on the right side of the lattice. In the training phase, the model learns the relation between the positions of the red circles and the values of the diagnosis vector. In the prediction phase, only the positions of the red circles are given. The trained neural network outputs a real-valued prediction of the diagnosis vector as shown in Fig. 4(c), for example. From this information, we extract the probabilities of the faithful diagnoses, and we choose the faithful diagnosis which is expected to be the most probable, as shown in Fig. 4(d). Since the chosen diagnosis vector is equivalent to the diagnosis vector generated by the actual physical error, this decoding trial is a success.
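The data-generation step of the protocol can be sketched for the bit-flip-only setting used in the example above. This is a minimal sketch under stated assumptions: for bit-flip errors checked by Z-type operators, the product with $\Lambda$ is taken to reduce to an ordinary parity check over GF(2), and the three-qubit repetition code with logical-Z representatives Z1, Z2, Z3 stands in for the surface code. The function names and the toy matrices are illustrative.

```python
import random
from typing import List, Tuple


def mat_vec_gf2(H: List[List[int]], e: List[int]) -> List[int]:
    """Matrix-vector product over GF(2)."""
    return [sum(h * x for h, x in zip(row, e)) % 2 for row in H]


def generate_dataset(H_check, H_diag, p: float, num_samples: int,
                     seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Sample i.i.d. bit-flip errors and return (syndrome, diagnosis) pairs."""
    rng = random.Random(seed)
    n = len(H_check[0])
    data = []
    for _ in range(num_samples):
        e = [1 if rng.random() < p else 0 for _ in range(n)]
        data.append((mat_vec_gf2(H_check, e), mat_vec_gf2(H_diag, e)))
    return data


# Toy stand-in: three-qubit repetition code (assumption, not the surface code).
H_CHECK = [[1, 1, 0], [0, 1, 1]]            # Z1Z2 and Z2Z3 checks
H_DIAG = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # logical-Z representatives Z1, Z2, Z3

for s, g in generate_dataset(H_CHECK, H_DIAG, p=0.1, num_samples=5):
    print(s, g)
```

A model would then be trained to map each syndrome `s` to the corresponding diagnosis `g`; the training and prediction steps are not sketched here.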

C. Relation to the existing methods
In this subsection, we explain how the existing methods [24][25][26][27] can be treated in the linear prediction framework. The method proposed by Varsamopoulos et al. [25] uses an approach similar to the example shown in Sec. III A 2 in the case of $k = 1$. In this method, a linear map, called a simple decoder, is used for the pure error. The pure error is then written in the form $t(s)^T = T s$, where $T$ is a $2n \times (n - k)$ matrix satisfying $H_c \Lambda T = I$. The label vector used in this method can essentially be regarded as being generated by a particular diagnosis matrix $H_g$. We see that this is a faithful and decomposable construction. With a suitable generator matrix $G$, the diagnosis generated from this diagnosis matrix is a function of $w(e) = (w(e)_0, w(e)_1)$. The method in Ref. [25] uses a different set of label vectors $g'$, called the one-hot representation, which has a one-to-one correspondence with $g$ as

$$g = (0, 0, 0)^T \to g' = (1, 0, 0, 0)^T, \tag{81}$$
$$g = (0, 1, 1)^T \to g' = (0, 1, 0, 0)^T, \tag{82}$$
$$g = (1, 0, 1)^T \to g' = (0, 0, 1, 0)^T, \tag{83}$$
$$g = (1, 1, 0)^T \to g' = (0, 0, 0, 1)^T. \tag{84}$$

The above relation can be written as an affine transformation between real vectors. Since it is an isometric affine transformation, we expect that this transformation has little effect on the performance of the supervised machine learning. The matrix $H_g$ is faithful and decomposable, but its normalized sensitivity is $O(1)$. We thus expect that this decoder becomes near-optimal when the training is ideally performed, but that the prediction is not robust when the size of the training data set is small.

The method proposed by Baireuther et al. [26] mainly focuses on a model applicable to quantum error correction with a varying number of repeated stabilizer measurements, utilizing a recurrent neural network. They use the commutation relation between the physical error and a logical Z operator as the label, since they are only concerned with the logical bit-flip probability for a fixed initial state in the logical space. We can thus consider this method as a case of the linear prediction framework.
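The correspondence of Eqs. (81)-(84), discussed earlier in this subsection for Ref. [25], can be checked numerically. Because the four three-component diagnoses are affinely independent, the affine map sending them to the four one-hot vectors is fixed by the table itself; the matrix `A` and offset `B` below were derived from that table for this sketch and are not copied from the paper.

```python
import math
from itertools import combinations

# The four diagnoses of Eqs. (81)-(84) and their one-hot labels.
DIAGNOSES = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
ONE_HOT = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

# Affine map g' = A g + B reproducing the table (derived for this sketch).
A = [
    [-0.5, -0.5, -0.5],
    [-0.5, 0.5, 0.5],
    [0.5, -0.5, 0.5],
    [0.5, 0.5, -0.5],
]
B = [1.0, 0.0, 0.0, 0.0]


def to_one_hot(g):
    """Apply the affine map to a real-valued three-component vector."""
    return [sum(a * x for a, x in zip(row, g)) + b for row, b in zip(A, B)]


def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


for g, label in zip(DIAGNOSES, ONE_HOT):
    assert to_one_hot(g) == label

# The map preserves squared L2 distances between the diagnoses.
for g1, g2 in combinations(DIAGNOSES, 2):
    assert math.isclose(sq_dist(g1, g2), sq_dist(to_one_hot(g1), to_one_hot(g2)))
print("table reproduced and distances preserved")
```

The columns of `A` are orthonormal, so the L2 loss in one representation equals the L2 loss in the other, in line with the expectation that the relabeling has little effect on training.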
Torlai et al. [24], Krastanov et al. [27], and Breuckmann et al. [28] took a different approach from the above two [25,26]. They used the binary representation of the physical error as the label vector. In the linear prediction framework, this corresponds to the choice $H_g = \Lambda$, for which the diagnosis is the binary representation of the physical error itself. Since $H_g$ is not faithful, it cannot constitute an optimal decoder even with the delta diagnosis decoder. Interestingly, the delta diagnosis decoder with this choice of $H_g$ works as an MD decoder, which can be shown by the following lemma.
Lemma III.3. If the matrix $H_{cg}$ has rank $2n$ over GF(2), there exists a map $r^*(g, s)$ such that the decoder with $r(s) = r^*(g^{(\delta)}(s), s)$ works as an MD decoder for an arbitrary distribution $\{p_e\}$. If $H_{cg}$ does not have rank $2n$, no such map exists.
Proof. If $H_{cg}$ has rank $2n$, there exists a left inverse binary matrix $H_{cg}^{-1}$ such that $H_{cg}^{-1} H_{cg} = I$. Then, we can recover the physical error $e$ from the pair of the syndrome and the diagnosis by using $H_{cg}^{-1}$. Thus, we can obtain the most probable physical error $e^*(s)$ from the most probable diagnosis. If $H_{cg}$ does not have rank $2n$ over GF(2), there exist two physical errors which generate the same pair of syndrome and diagnosis, and we cannot determine which of them is more probable. Thus, we cannot perform MD decoding when $H_{cg}$ does not have rank $2n$.
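The rank condition in the lemma can be tested with Gaussian elimination over GF(2). The sketch below is a generic rank routine; the stacked toy matrices are illustrative and use the bit-flip-only three-qubit repetition code, so full rank here means 3 columns rather than $2n$. Treating $H_{cg}$ as the row-wise stack of the check and diagnosis matrices is an assumption of this sketch.

```python
from typing import List


def gf2_rank(matrix: List[List[int]]) -> int:
    """Rank of a binary matrix over GF(2), by Gaussian elimination."""
    rows = [list(r) for r in matrix]
    num_cols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(num_cols):
        # Find a pivot row with a one in this column.
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        # Clear this column in every other row (XOR is addition in GF(2)).
        for i in range(len(rows)):
            if i != rank and rows[i][col]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rank


CHECKS = [[1, 1, 0], [0, 1, 1]]   # toy Z1Z2, Z2Z3 checks
FULL = CHECKS + [[1, 0, 0]]       # extra row makes the error recoverable
DEFICIENT = CHECKS + [[1, 0, 1]]  # extra row is the sum of the two checks

print(gf2_rank(FULL), gf2_rank(DEFICIENT))
```

In the rank-deficient case the errors `000` and `111` give the same syndrome and the same extra bit, so no map can tell them apart, which mirrors the second half of the proof.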
A drawback of this approach is the difficulty that arises when we replace the loss function with a practical one such as the L2 distance. In order to satisfy the decomposable property in MD decoding, the length of the diagnosis must be no shorter than $2^{n+k}$, since there are $2^{n+k}$ possible candidates for the most probable physical error. This is not practical when the distance is large, and thus it requires heuristics such as repetitive sampling.

D. Numerical result
We numerically show that the uniform data construction improves the performance of the neural decoder in the case of $k = 1$. We trained an MLP model with the uniform data construction, and compared it with other data constructions of the neural decoders. We also made a comparison with known decoders such as the MD decoder and the MWPM decoder. We chose the $[[d^2, 1, d]]$ surface code for the comparison, since most of the existing methods were benchmarked with this code. We calculated the performance for two types of error models, the bit-flip noise and the depolarizing noise. Both probability distributions are specified by $p$, the error probability per physical qubit, and $w(e)$, the weight of the physical error $e$ defined in Eq. (4). Note that the occurrences of the bit-flip and phase-flip errors are correlated in the depolarizing noise.

We first calculated the performance when the physical error probability is around the error threshold, namely, $p = 0.1$ for the bit-flip noise and $p = 0.15$ for the depolarizing noise. The tunable hyperparameters of the neural network, such as the number of layers, the number of neurons in each layer, and the learning rate, were optimized with a grid search for each noise model and for each size of the training data set. See Appendix B for the details of the parameter optimization and implementation.
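The two noise models can be sampled as in the sketch below. It assumes the common conventions, which are not spelled out in the surviving text: bit-flip noise applies X to each qubit independently with probability $p$, and depolarizing noise applies X, Y, or Z with probability $p/3$ each. The binary layout `(x | z)` of length $2n$ and the function names are also assumptions of this sketch.

```python
import random
from typing import List


def sample_bit_flip(n: int, p: float, rng: random.Random) -> List[int]:
    """Independent X errors; returns the binary vector (x | z) with z = 0."""
    x = [1 if rng.random() < p else 0 for _ in range(n)]
    return x + [0] * n


def sample_depolarizing(n: int, p: float, rng: random.Random) -> List[int]:
    """Each qubit suffers X, Y, or Z with probability p/3 each (assumed convention)."""
    x, z = [], []
    for _ in range(n):
        pauli = rng.choice("XYZ") if rng.random() < p else "I"
        x.append(1 if pauli in ("X", "Y") else 0)  # Y has both an X and a Z part
        z.append(1 if pauli in ("Z", "Y") else 0)
    return x + z


def weight(e: List[int]) -> int:
    """Number of qubits on which the Pauli error acts non-trivially."""
    n = len(e) // 2
    return sum(1 for i in range(n) if e[i] or e[n + i])


rng = random.Random(0)
print(weight(sample_bit_flip(25, 0.1, rng)), weight(sample_depolarizing(25, 0.15, rng)))
```

Under depolarizing noise a Y error sets both the X and Z bits of a qubit, which is the source of the correlation between bit-flip and phase-flip errors mentioned in the text.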
The performance of the neural decoder under the bit-flip noise is shown in Fig. 5(a). The solid lines are the performance of the neural decoder with the uniform data construction. The bottom dashed lines represent the logical error probability achievable with the MD decoder. The colors red, green, blue, and cyan correspond to distances 5, 7, 9, and 11, respectively.

Comparing these two types of decoders, we see that the logical error probability of the neural decoder is near-optimal with a data set of size $10^6$ at distance 11. On the other hand, there are gaps between the converged logical error probabilities of the neural decoder and those of the MD decoder when the distance is large. We speculate that these gaps are caused by imperfect learning of the spatial information of the topological codes, since they are partially reduced with the network construction discussed in the next section.
We also implemented the neural decoder with a short diagnosis, i.e., the construction with $N_{01} = N_{10} = N_{11} = 1$, where $N_w$ is the number of logical operators in the rows of $H_g$ corresponding to the class $w$. This is equivalent to the construction shown as an example in Sec. III A 2. We call this construction, which has a normalized sensitivity of $O(1)$, the short diagnosis construction; it is shown as the pale plots in Fig. 5(a). Note that the performance of this decoder depends on the choice of the logical operators. We tried this construction with various choices of the logical operators, and the plotted data are the best among our trials. Although both constructions become near-optimal in the limit of a large training data size, we see that the uniform data construction achieves a smaller logical error probability than the short diagnosis construction for any size of the training data set. We also confirmed that the performance of the neural decoder degrades when the row vectors of $H_g$ consist of the same $O(d)$ logical operators of X, Y, and Z. In this case, while the number of rows in $H_g$ is the same as that of the uniform data construction, the sensitivity $m(H_g)$ becomes $O(d)$, which makes the normalized sensitivity $m(H_g)/M(H_g)$ of order $O(1)$. Though these results are not plotted, the performance of this construction is almost the same as that of the short diagnosis construction. These results support our argument that minimizing the normalized sensitivity is essential for the performance of the neural decoder.
The results with the depolarizing noise are shown in Fig. 5(b). Note that for the surface code under correlated noise such as the depolarizing noise, it is not known how an efficient MD decoder can be constructed. We see that the performance of the neural decoder becomes near-optimal, and is superior to that of the MWPM decoder with $10^6$ training samples at $d = 5, 7, 9$.
We also calculated the logical error probability as a function of the physical error probability. The plots in the vicinity of the threshold value are shown in Fig. 6. We chose the size of the training data set as $10^6$, and calculated the performance for the distances $d = 5, 7, 9, 11$ and for the bit-flip and depolarizing noises. We used the hyperparameters that were the best in the calculation of Fig. 5 when the size of the training data set is $10^6$. For both noise models, the performance is near-optimal when the distance is small. On the other hand, when the distance becomes large, the logical error probability becomes larger than that of the MWPM decoder. The error threshold is usually estimated from the crossing point of the performance curves for different distances. We see that the error threshold estimated in this way is worse than that of the MWPM decoder, though the logical error probability is smaller than that of the MWPM decoder.

The actual experiment is expected to be performed with a physical error probability sufficiently smaller than the threshold value. Therefore, we calculated the performance of the decoder with a small physical error probability. The numerical results are shown in Fig. 7. Since the training data set generated with a small value of $p$ is highly imbalanced, we trained the model with $p = 0.08$ for the bit-flip noise model, and with $p = 0.11$ for the depolarizing noise model. Then, we tested the trained model with the data set generated with $p \le 0.1$. We see that the logical error probability is smaller than that of the MWPM decoder in this region for all the distances except $d = 11$.
We also calculated the performance of the neural decoder for two types of color codes. We chose the size of the training data set as $10^6$, and calculated the logical error probability for the distances $d = 3, 5, 7, 9$. Note that we cannot construct an efficient MD decoder for the color code even under independent bit-flip and phase-flip noise. The plots of the logical error probability against the physical error probability $p$ are shown in Fig. 8. The configurations of the plots and lines are the same as those for the surface code. In the case of the bit-flip noise, near-optimal performance is achieved. The performance is also near-optimal in the case of the depolarizing noise at distances except $d = 9$. We also see that the performance of the [4,8,8]-color code is better than that of the [6,6,6]-color code. We speculate that this is because the number of physical qubits required in the [4,8,8]-color code is smaller than that of the [6,6,6]-color code at the same distance. These results suggest that the neural decoder with the uniform data construction is also effective for the color codes.

IV. UTILIZING SPATIAL INFORMATION
In this section, we describe the construction of the neural network with convolutional layers. We first discuss how the required size of the data set is expected to be suppressed if the model can utilize the spatial information of the two-dimensional quantum codes. Then, we introduce a construction of the neural network with convolutional layers to utilize the spatial information of the topological codes. We finally show the numerical results, and show that the performance of the neural decoder is improved.

A. Importance of the spatial information
In this section, we utilize the spatial information of the syndrome by using a convolutional neural network (CNN) as the prediction model. When we use an MLP model, each layer is represented as a vector of neurons, and the neurons are densely connected from layer to layer. On the other hand, each layer of a CNN is matrix-shaped, and each element of the next layer is calculated only from a local region of the previous layer using a map called a filter. The local filtering with the same filters can be considered as a convolution. For the mathematical formulation, see the next subsection.
Though the CNN model is frequently used for image recognition, it can be used for any recognition task in which the spatial information of the features is essential. For example, the CNN model is expected to be effective for the classification of the phases of a spin glass [22]. In such a task, the local correlations of the spins are essential for prediction. Furthermore, since the patterns of local spins are translationally invariant, filters which extract a feature in one local region are expected to be reusable in the other regions. These properties match the premise of the CNN, and we can expect that the CNN model performs better than other models such as the MLP model for a fixed size of the training data set.
In this subsection, we explain why the CNN is also expected to be effective for decoding the topological codes. In the case of two-dimensional topological codes, the syndrome values have a natural two-dimensional arrangement. By carefully reshaping the syndrome values into a matrix-shaped arrangement of the feature vector elements, we can explicitly let the model use local correlations of the observed syndromes through the CNN model. In the topological codes, a flip of a single physical qubit invokes at most a constant number (2 in the surface code, 3 in the color code) of local bit-flips in the syndrome values. This implies that whether two (or three) flipped syndrome bits are found in a local region or not is useful for predicting the property of the physical errors. For intuitive understanding, we elaborate on the reason through examples. We consider the surface code under bit-flip errors. Suppose that a syndrome vector $s$ is given in the prediction phase, and that the model has encountered slightly different syndrome vectors $s_A$ and $s_B$ in the training phase, whose differences from $s$ are shown in Fig. 9.
The representation is the same as that of Fig. 4. We ignore boundary effects of the topological codes for simplicity. In both cases, the syndrome is at Hamming distance two from the original syndrome vector $s$, namely, $h(s \oplus s_A) = h(s \oplus s_B) = 2$. On the other hand, the two syndromes differ in whether they help the prediction of the diagnosis for the observed syndrome $s$. In Eq. (59), we introduced a set of physical errors $S(s, w^*; h)$ such that any $e' \in S(s, w^*; h)$ with $h \sim N(H_g)^{-1}$ produces a training sample useful for estimating the L2 diagnosis vector of $s$. For given $s$ and $s'$, if there is no vector $e_\delta$ such that $H_c \Lambda e_\delta^T = s \oplus s'$ and $h(e_\delta) \lesssim N(H_g)^{-1}$, we see that no error $e'$ with $s(e') = s'$ is contained in $\bigcup_{w \in \{0,1\}^{2k}} S(s, w; N(H_g)^{-1})$. In the case of $s_A$ and $s_B$, there is such a physical error $e_\delta$ with a small Hamming weight for $s_A$, but not for $s_B$. Thus, if the prediction model can distinguish the samples with $s_A$ from those with $s_B$, it can recognize that the samples with $s_B$ in the training data set are not relevant to the prediction for $s$. The CNN model can distinguish them since it naturally utilizes the spatial information of the syndrome values. On the other hand, the MLP model cannot easily distinguish them since the model is not provided with the relevant spatial structure before training. This discussion implies that the logical error probability for a fixed size of the training data set is expected to be improved with the use of a CNN model.

FIG. 9: Example of the difference of the syndrome values. Each node in the figure corresponds to a syndrome value, and each edge corresponds to the error status of a physical qubit. The color of the circle indicates whether the syndrome measurement detects an error.
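The distinction between the two kinds of syndrome difference can be made concrete on a one-dimensional toy model. The sketch below uses an open repetition chain (check $i$ compares qubits $i$ and $i+1$) as a stand-in for the two-dimensional surface code, which is an assumption made only for brevity, and finds by brute force the lightest error that explains a given syndrome difference.

```python
from itertools import product
from typing import List, Optional, Sequence


def chain_syndrome(e: Sequence[int]) -> List[int]:
    """Syndrome of a bit-flip pattern on an open chain: check i = e[i] XOR e[i+1]."""
    return [e[i] ^ e[i + 1] for i in range(len(e) - 1)]


def min_weight_explaining(diff: Sequence[int]) -> Optional[int]:
    """Smallest Hamming weight of an error whose syndrome equals `diff`."""
    n = len(diff) + 1
    best = None
    for e in product((0, 1), repeat=n):  # brute force, fine for small n
        if chain_syndrome(e) == list(diff):
            w = sum(e)
            best = w if best is None else min(best, w)
    return best


# Both differences flip exactly two syndrome bits (Hamming distance two).
DIFF_NEAR = [0, 0, 1, 1, 0, 0, 0]  # adjacent flipped checks
DIFF_FAR = [0, 1, 0, 0, 0, 1, 0]   # distant flipped checks

print(min_weight_explaining(DIFF_NEAR), min_weight_explaining(DIFF_FAR))
```

The adjacent pair is explained by a single flipped qubit, while the distant pair needs a chain of four, even though both differences have the same Hamming weight. A model that sees only unordered syndrome bits has no direct access to this difference.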

B. Construction of the network
A convolutional neural network extracts patterns from image data through trainable filters that activate (produce a high value) when there are specific local patterns in the input data. The network usually consists of multiple convolutional layers $C^{(n)}$, each of which consists of different filtered versions of the image data $C^{(n)}_p$, indexed by a channel number $p$. The $(n-1)$-th layer with $Q$ channels is filtered to the $n$-th layer with $P$ channels by $Q \times P$ filters, which can be represented by a matrix $f^{(n-1,n)}$. In this relation, $C^{(n)}_{i,j,p}$ is the $(i, j)$ element of the $p$-th channel in the $n$-th convolutional layer, and $f^{(n-1,n)}_{d_x,d_y,q,p}$ is the $(d_x, d_y)$ element of the $(q, p)$-th filter from the $(n-1)$-th layer to the $n$-th layer. The parameter $b^{(n)}_p$ is the bias added to the $p$-th channel of the $n$-th layer. A simple example is shown in Fig. 10, where one layer has three channels and the next layer has two channels.
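A single convolutional layer in the notation above can be sketched in plain Python. The sketch assumes the standard form $C^{(n)}_{i,j,p} = \sum_{q} \sum_{d_x, d_y} f^{(n-1,n)}_{d_x,d_y,q,p}\, C^{(n-1)}_{i+d_x,\, j+d_y,\, q} + b^{(n)}_p$, evaluated only where the filter fits inside the input and without an activation function; the exact index convention, padding, and activation used by the authors are not recoverable from the text.

```python
from typing import List


def conv_layer(prev: List[List[List[float]]],
               filters: List[List[List[List[float]]]],
               bias: List[float]) -> List[List[List[float]]]:
    """prev[i][j][q], filters[dx][dy][q][p], bias[p] -> out[i][j][p]."""
    height, width, num_q = len(prev), len(prev[0]), len(prev[0][0])
    fh, fw, num_p = len(filters), len(filters[0]), len(bias)
    out = []
    for i in range(height - fh + 1):
        row = []
        for j in range(width - fw + 1):
            cell = []
            for p in range(num_p):
                total = bias[p]
                # Sum over the local window and over all input channels.
                for dx in range(fh):
                    for dy in range(fw):
                        for q in range(num_q):
                            total += filters[dx][dy][q][p] * prev[i + dx][j + dy][q]
                cell.append(total)
            row.append(cell)
        out.append(row)
    return out


# 3x3 input with one channel, one 2x2 filter of ones, zero bias.
IMAGE = [[[1], [0], [0]], [[0], [1], [0]], [[0], [0], [1]]]
FILTER = [[[[1]], [[1]]], [[[1]], [[1]]]]
print(conv_layer(IMAGE, FILTER, [0]))
```

Each output element counts the ones in a 2x2 window of the input, which shows how one filter is reused at every position.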
To use a CNN in our decoding task, we have to express the syndrome vector s with an appropriate matrix representation.We reallocate the syndrome vector for the We have not tried it on color codes, since it is hard to interpret the allocation of the syndromes in color codes with rectangular shapes.
Our CNN decoder consists of three convolutional layers followed by a single fully-connected hidden layer, as shown in Fig. 12. At the last convolutional layer, the output channels are flattened to a single one-dimensional vector, which is then used as the input for the subsequent fully-connected hidden layer. The number of channels is chosen to be $10d$ for the first two convolutional layers and $5d$ for the last one. Details about the model architecture are described in Appendix B. It is worth noting that we used the same filters for decoding both X and Z flip errors, and that max-pooling is not used, as it was observed to reduce the performance of the decoder.
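To give a feel for the size of the convolutional part, the sketch below counts the filter weights and biases of three convolutional layers with $10d$, $10d$, and $5d$ channels. The $3 \times 3$ filter size and the single input channel are assumptions of this sketch, and batch-normalization parameters and the fully-connected layer are left out.

```python
from typing import List


def conv_channels(d: int) -> List[int]:
    """Output channels of the three convolutional layers for distance d."""
    return [10 * d, 10 * d, 5 * d]


def conv_param_count(d: int, kernel: int = 3, in_channels: int = 1) -> int:
    """Weights plus biases of the convolutional layers (assumed kernel size)."""
    total = 0
    c_in = in_channels
    for c_out in conv_channels(d):
        total += kernel * kernel * c_in * c_out + c_out  # filters + biases
        c_in = c_out
    return total


for d in (5, 7, 9, 11):
    print(d, conv_channels(d), conv_param_count(d))
```

Because the channel numbers grow linearly in $d$, the dominant middle layer grows quadratically in $d$ under these assumptions.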

C. Numerical result
We call a neural decoder with the MLP model an MLP decoder, and one with the CNN model a CNN decoder. We compare the performance of the CNN decoder with those of the MLP decoder, the MD decoder, and the MWPM decoder. Note that the training data set is generated with the uniform data construction.
First, we compare the performance of the CNN decoder with that of the MLP decoder in the case of the surface codes. The numerical results are shown in Fig. 13. In this figure, the solid lines and dashed lines are the logical error probabilities for the CNN decoder and the MLP decoder, respectively. The colors red, green, blue, and cyan correspond to distances $d = 5, 7, 9$, and 11, respectively. For both types of surface codes, the CNN decoder outperforms the MLP decoder at large distances. In particular, in the case of the $[[2d^2 - 2d + 1, 1, d]]$ surface code, the CNN decoder shows a significant improvement in the logical error probability. We see that the CNN model is effective for improving the performance of the neural decoder at large distances.
On the other hand, we see that the CNN decoder performs worse than the MLP decoder at small distances. We speculate that the reason is as follows. The CNN model assumes that the local features can be extracted by using the same filter everywhere. Such an assumption is not necessarily true when the distance is small, since almost all the filtered local regions, of size $3 \times 3$ for example, are on or near the boundary of the two-dimensional codes. Note that we tried to avoid this problem by padding the boundaries with various values, such as 0.5 or $-1$, but the performance at small distances did not improve.

Next, we compared the performance of the CNN decoder with those of the MD decoder and the MWPM decoder. The results are shown in Fig. 14. The solid lines, the dashed lines, and the dotted lines are the logical error probabilities for the CNN decoder, the MD decoder, and the MWPM decoder, respectively. The colors red, green, blue, and cyan correspond to distances $d = 5, 7, 9$, and 11, respectively. In the case of the bit-flip noise, we see that the logical error probability of the CNN decoder is equal to or slightly better than that of the MD decoder. In the case of the depolarizing noise, though there are gaps between the performances of the CNN decoder and the MD decoder, the performance of the CNN decoder is superior or comparable to that of the MWPM decoder even at distance $d = 11$.
We also calculated the logical error probability of the CNN decoder at small physical error probabilities $p$ in the case of the $[[2d^2 - 2d + 1, 1, d]]$ surface code. We trained the CNN decoder at $p = 0.08$ for the bit-flip noise model, and at $p = 0.11$ for the depolarizing noise model. Then, the decoder was tested with data sets generated with small physical error probabilities. The plots are shown in Fig. 15. In the case of the bit-flip noise, the CNN decoder achieves a performance close to that of the MD decoder also at small physical error probabilities. In the case of the depolarizing noise, the performance of the CNN decoder is superior to that of the MWPM decoder at $d = 9$, and comparable at $d = 11$. We can say that the CNN model is also effective for neural decoders operating at small physical error probabilities.

V. CONCLUSION
In this paper, we theoretically analyzed the mechanism of machine-learning-based decoders for QEC, and proposed a general direction for constructing the data set and the neural network. We then showed numerically that our direction is effective compared with the existing works.
Since the formalism of machine learning is flexible, there are many possible ways to reduce the decoding problem in QEC to a machine learning task. In order to clarify the best way of reduction, we introduced the linear prediction framework. This framework essentially includes the existing methods as specific cases, and enables us to discuss conditions for satisfying natural requirements for a good decoder for QEC. In particular, we derived the condition for performing the optimal decoding in the limit of a large training data size. We also introduced a measure, the normalized sensitivity, which represents a properly scaled bound on the deviation in the prediction target resulting from a small change in the physical error pattern. We proposed to use this measure as a criterion for constructing a better decoder. We then proposed a general direction for constructing the data set, the uniform data construction, which is applicable to general topological codes. We numerically confirmed that the performance of the neural decoder is improved with the uniform data construction. Our decoder was found to be superior to known efficient decoders, such as the neural decoders proposed in the existing methods and the decoder based on the reduction to minimum-weight perfect matching. We also confirmed that the performance of our neural decoder is near-optimal in various situations by comparing it with the minimum-distance decoder, which is known to be near-optimal but not efficient in general. We also confirmed that the neural decoder can achieve near-optimal performance not only for surface codes but also for color codes.
Another important factor of the neural decoder is the construction of the neural network. We discussed the importance of the spatial information of the syndrome measurement in letting the prediction model recognize useful samples in a given training data set. To utilize the spatial information, we proposed a neural decoder with a convolutional neural network. We numerically observed that the performance of the neural decoder is further improved with this network construction in the surface code. In particular, we showed that the proposed neural decoder achieves a smaller logical error probability than that of the decoder based on minimum-weight perfect matching even at distance $d = 11$ with a training data set size of $10^6$.

Since using machine learning for QEC is an emerging field, there are still many possible extensions and directions for the neural decoders. As we detailed in Appendix B, the prediction time of the neural decoders is smaller than that of the MD decoder, but larger than that of the MWPM decoder on our desktop PC. Since the prediction of the neural decoders can be done with simple matrix multiplications, the time for prediction can be further shortened by using optimized hardware such as a field-programmable gate array (FPGA), which is popularly used in experiments. While we have discussed only labels linearly generated in GF(2), the performance may be further improved by allowing labels generated nonlinearly from the physical error. For example, the relation between the syndrome values and the weight of the physical error, which cannot be generated linearly in GF(2), can be trained and predicted independently with a neural network. Then, the recovery map can be predicted from the syndrome values and the predicted weight with another neural network. The linear prediction framework also limits the samples in the training data set to those sampled from the assumed physical error distribution. However, the distribution which is the best for the training is not necessarily the same as the actual distribution. For example, we saw that the prediction model trained at a physical error probability around the threshold value shows high performance also at low physical error probabilities. There can be a more artificial way to construct the training data set that achieves the same performance with a smaller training data set. In the numerical investigation, we observed that the required amount of data becomes exponentially large in terms of the distance. This may be suppressed by renormalizing the matrix representation of the syndrome with trained filters, as done in the renormalization group decoder [19].

We separately pass X and Z syndrome values through the same convolutional layers, and concatenate them before feeding to the following fully-connected hidden layer.
We expect that the CNN is also applicable to the color codes by using non-rectangular filters. When the stabilizer measurements themselves suffer from noise, stabilizer measurements are often performed repetitively during QEC. In such a case, the length of the syndrome data is not fixed. In our construction, we need to train the neural network again whenever the length of the syndrome data changes. The studies of Refs. [26,28] focused on removing this drawback by utilizing a recurrent neural network and a convolutional neural network. Using the techniques proposed in Refs. [26,28], our neural decoder may also be applicable to the cases where we perform repetitive stabilizer measurements.

FIG. 14: The performance comparison between the CNN decoder (solid lines), the MD decoder (dashed lines), and the MWPM decoder (dotted lines) in the surface codes. We calculated the performance for distance $d = 5$ (red), 7 (green), 9 (blue), and 11 (cyan).

Here we prove the last statement of Lemma III.1. When Eq. (19) does not hold, either (i) there exists $e_1$ such that or (ii) there exists $e_1$ such that An optimal decoder for each case succeeds with probability 0.75 given $s = 0$. On the other hand, since $g^{(\delta)}(0) = 0$ in both cases, only the value of $r^*(0, 0)$ is relevant. Since $w(0) \neq w(e_1)$, any choice of $r^*(0, 0)$ leads to a success probability no greater than 0.25 for at least one of the cases. For (ii), choose $w \neq 0$, and if $H_g \Lambda (wG)^T = 0$, define

$$e_2 := wG. \tag{99}$$

An optimal decoder for each case succeeds with probability 0.6 given $s = 0$. On the other hand, since $g^{(\delta)}(0) = g_2$ in both cases, only the value of $r^*(g_2, 0)$ is relevant. Since $w(0) \neq w(e_2)$, any choice of $r^*(g_2, 0)$ leads to a success probability no greater than 0.4 for at least one of the cases.

Proof of the converse part in Lemma III.2
When the diagnosis matrix is not decomposable, there exists a non-empty subset $W \subset \{0, 1\}^{2k}$ such that From Eq. (107), the L2 diagnosis vector $g^{(L2)}(0)$ is identical for the two distributions. On the other hand, the most probable class $w$ is different for the two probability distributions. This means that a single decoder cannot perform the optimal decoding for both of the two distributions.

APPENDIX B: ADDITIONAL INFORMATION FOR THE IMPLEMENTATION OF THE DECODERS
We describe the details of the implementation of our model, training process, and decoders for reference. We chose the rectified linear unit ($\mathrm{ReLU}(x) = \max(0, x)$) and the sigmoid function ($S(x) = 1/(1 + e^{-x})$) as the activation functions for the hidden layer and the final output layer, respectively. Batch normalization was deployed in all of our models and was found to be effective. We also used L2 regularization to avoid over-fitting of the model. In the training phase, the Adam optimization method [33] was used. The learning rate was decreased exponentially, and its schedule was optimized by hand. The network was built with the TensorFlow v1.2 platform.
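As a minimal illustration of the two activation functions and of an exponentially decreasing learning rate, the following plain-Python sketch may help. The function name `exp_decay_lr` and the particular interpolation between a start and an end value are assumptions made for illustration only; the actual schedule was tuned by hand and is not specified here.

```python
import math

def relu(x):
    """Rectified linear unit, ReLU(x) = max(0, x)."""
    return max(0.0, x)

def sigmoid(x):
    """Sigmoid function, S(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def exp_decay_lr(step, total_steps, lr_start=1e-3, lr_end=1e-5):
    """Hypothetical exponential schedule: geometric interpolation
    from lr_start (step 0) to lr_end (step == total_steps)."""
    return lr_start * (lr_end / lr_start) ** (step / total_steps)
```

With these assumed defaults, the learning rate falls by a constant factor per step, which is one common reading of "exponentially decreased".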

Details about the multilayer perceptron
We optimized the following parameters of the multilayer perceptron using grid search: the number of neurons per layer (#unit), the number of hidden layers (#layer), the batch size (#batch), and the coefficient of the L2 regularization ($\beta$). The parameters were searched in the ranges #unit $\in \{d^2, d^3, d^4\}$, $\beta \in \{0, 0.01, 0.1\}$, #batch $\in \{100, 500\}$, and #layer $\in \{2, 3, 4\}$. Note that in the case of $d = 11$, we tuned #unit by hand since we could not choose #unit $= d^4$ due to the memory limit of the GPU. We started the training with a learning rate of $10^{-3}$, and it was decreased to $10^{-5}$ according to a schedule which was optimized by hand. We optimized these parameters for each construction of the diagnosis matrix, distance, physical error probability, error model, and size of the training data set. We chose the configuration which achieves the smallest logical error probability for an independently generated validation data set of size $10^5$. Then, the logical error probability was calculated using another test data set of size $10^6$.
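The grid described above can be enumerated as in the following sketch. The names `mlp_grid` and `select_best`, and the callable `validation_error`, are hypothetical; in practice the callable would train a model with the given configuration and return its logical error probability on the validation data set.

```python
from itertools import product

def mlp_grid(d):
    """Yield every hyperparameter combination of the grid for distance d."""
    units = [d**2, d**3, d**4]      # neurons per layer (#unit)
    betas = [0.0, 0.01, 0.1]        # L2 regularization coefficient
    batches = [100, 500]            # batch size (#batch)
    layers = [2, 3, 4]              # number of hidden layers (#layer)
    for unit, beta, batch, layer in product(units, betas, batches, layers):
        yield {"unit": unit, "beta": beta, "batch": batch, "layer": layer}

def select_best(configs, validation_error):
    """Return the configuration with the smallest validation error.
    validation_error is an assumed callable: config -> float."""
    return min(configs, key=validation_error)
```

For each distance this grid contains 3 x 3 x 2 x 3 = 54 configurations, which are searched separately for every combination of diagnosis matrix, error model, physical error probability, and training set size.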

Details about the Convolutional Neural Network
Our CNN model consists of three convolutional layers on top of a single fully-connected hidden layer. For each convolutional layer, the channel number was chosen to be $10d$ for the first two layers and $5d$ for the last layer. We chose a batch size of 100 in the training of the CNN model.
The network architecture was the same for both the bit-flip and depolarizing noise models in the $[[2d^2 - 2d + 1, 1, d]]$ surface code, and is described in TABLE I. The filter stride was set to 1 in all directions.
As for the $[[d^2, 1, d]]$ code with the bit-flip and depolarizing noise models, we used the network architecture described in TABLE II. Here, for the code distances 7, 9, and 11, the initial layer filter stride was set to 1 in the vertical direction and 2 in the horizontal direction. For the code distance 5, the stride was set to 1 in all directions.
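The channel counts and first-layer strides stated above can be collected in a small configuration helper. This is only a sketch: the function names and the string labels for the two code families are assumptions, and the filter sizes from TABLE I and TABLE II are deliberately not included.

```python
def cnn_channels(d):
    """Channel numbers of the three convolutional layers for distance d."""
    return [10 * d, 10 * d, 5 * d]

def first_layer_stride(code, d):
    """Return the (vertical, horizontal) stride of the first layer.
    `code` is a hypothetical label: "2d^2-2d+1" or "d^2"."""
    if code == "2d^2-2d+1":
        return (1, 1)               # stride 1 in all directions
    if code == "d^2":
        if d == 5:
            return (1, 1)
        if d in (7, 9, 11):
            return (1, 2)           # 1 vertically, 2 horizontally
        raise ValueError("stride not specified for this distance")
    raise ValueError("unknown code family")
```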

Implementation of the minimum-distance decoder
The minimum-distance decoder of the surface code under the bit-flip noise can be implemented by reducing the problem to minimum-weight perfect matching. Minimum-weight perfect matching can be solved efficiently with the Blossom algorithm [31]. We used Kolmogorov's implementation of the Blossom algorithm [34]. In the other cases, we reduced the problem to the following instance of integer programming.
$$\text{minimize } w(e) \quad \text{s.t.} \quad H_c \Lambda e^T = s \qquad (111)$$
This problem was solved with IBM ILOG CPLEX. We obtained at least $10^5$ samples for each plot. In all the cases, the solver reached the optimal solution.
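To make the optimization problem of Eq. (111) concrete, the following sketch solves it by exhaustive search for a very small code. This is not the CPLEX formulation used above; it is a brute-force stand-in that is only feasible for a handful of qubits. The convention that the first $n$ bits of a vector are the X part and the last $n$ bits the Z part is an assumption, and all function names are hypothetical.

```python
from itertools import product

def swap_halves(v, n):
    """Apply Lambda = [[0, I], [I, 0]] to a length-2n binary tuple."""
    return v[n:] + v[:n]

def syndrome(H, e, n):
    """Compute H * Lambda * e^T over GF(2); H is a list of length-2n rows."""
    le = swap_halves(e, n)
    return tuple(sum(h * x for h, x in zip(row, le)) % 2 for row in H)

def weight(v, n):
    """Number of qubits on which the Pauli operator acts non-trivially."""
    return sum(1 for i in range(n) if v[i] or v[n + i])

def md_decode(H, s, n):
    """Return a minimum-weight e with syndrome s, or None if none exists."""
    best = None
    for e in product((0, 1), repeat=2 * n):
        if syndrome(H, e, n) == tuple(s):
            if best is None or weight(e, n) < weight(best, n):
                best = e
    return best
```

For example, with the three-qubit checks $Z_1 Z_2$ and $Z_2 Z_3$ written as `H = [(0,0,0,1,1,0), (0,0,0,0,1,1)]`, calling `md_decode(H, (1, 0), 3)` returns a weight-one error on the first qubit. Ties between equal-weight solutions are broken by enumeration order.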

Time for single prediction, implementation and environment
We measured the time for a single decoding on the $[[2d^2 - 2d + 1, 1, d]]$ surface code with $d = 11$ and $p = 0.15$ under the depolarizing noise for the MD decoder, the MWPM decoder, and the proposed neural decoders with the MLP and CNN models. Note that the times of the MD decoder and the MWPM decoder depend on the physical error probability.
We used IBM ILOG CPLEX via its Python wrapper for constructing the MD decoder. The program was executed on an Intel Xeon E5-2687W v4 with default settings. The MD decoder took about 330 milliseconds per decoding. Note that the time may be improved by optimizing the settings of CPLEX. Kolmogorov's implementation of the Blossom algorithm [31,34] was used for the MWPM decoder. We compiled the code with Microsoft Visual C++ 2015 with the O2 option. The program was executed on an Intel Core i7-6700 without parallelization. The MWPM decoder took about 56 microseconds per decoding.
The proposed neural decoders were implemented with Python and TensorFlow. For the MLP model, we measured the time for a single prediction with the batch size set to 1, the number of layers set to 2, and the number of units per layer set to 7000. The configuration of the CNN model is shown in TABLE I. The computation was performed using an Intel Core i7-6700 and a GeForce GTX 1060 6GB. The proposed neural decoders with the MLP and CNN models took 2.2 milliseconds and 7 milliseconds, respectively, for feed-forwarding the input data and finding the most probable class $w$. Since the prediction of the neural decoders can be done with simple matrix multiplications, we expect that the time for a single prediction of the neural decoder can be shortened by using optimized hardware such as an FPGA.
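A per-decoding time of the kind reported above can be estimated by averaging over repeated calls. The helper below is a generic sketch and not the measurement code used for the reported numbers; `decode` stands for any hypothetical callable that maps a syndrome to a prediction.

```python
import time

def mean_time_per_call(decode, syndrome, repeats=1000):
    """Average wall-clock time in seconds of decode(syndrome)."""
    start = time.perf_counter()
    for _ in range(repeats):
        decode(syndrome)
    return (time.perf_counter() - start) / repeats
```

Averaging over many calls reduces the influence of timer resolution, which matters when a single decoding takes only tens of microseconds.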
is a row vector, for the Pauli operator $P = \alpha \prod_{i=1}^{n} \sigma_{v_i v_{n+i}}$. For two arbitrary Pauli operators $P$ and $P'$, $b(P) = b(P')$ means that the two Pauli operators are equivalent up to a global phase. The product of two Pauli operators $P$ and $P'$ is represented by the sum $b(PP') = b(P) \oplus b(P')$. With the $2n \times 2n$ matrix $\Lambda = \begin{pmatrix} 0 & I \\ I & 0 \end{pmatrix}$, the commutation relation of two Pauli operators $P$ and $P'$ is given by $b(P) \Lambda b(P')^T$, which is 0 if $P$ and $P'$ commute, and 1 if they anti-commute. We denote this commutation relation in terms of the binary representations $v, v' \in \{0, 1\}^{2n}$ as $c(v, v') := v \Lambda v'^T$. The weight of the binary representation of a Pauli operator $w(v)$ is defined so that $w(b(P)) = w(P)$, which is equivalent to defining the weight as the number of indices i (1 .
The $[[2d^2 - 2d + 1, 1, d]]$ code and the $[[d^2, 1, d]]$ code are shown in Fig. 1(a) and (b), respectively. In both figures, the physical qubits are located on the vertices of the colored faces. Each red face represents a stabilizer operator which is a product of Pauli X operators on the physical qubits of its vertices. Each blue face represents one with Pauli Z operators.
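The binary representation, the product rule, and the commutation relation $c(v, v')$ described above can be checked with a short sketch. The mapping of Pauli letters to bit pairs (X part in the first $n$ bits, Z part in the last $n$ bits) is an assumed convention, and the function names are hypothetical.

```python
# Assumed convention: each single-qubit Pauli maps to (x bit, z bit).
PAULI_BITS = {"I": (0, 0), "X": (1, 0), "Z": (0, 1), "Y": (1, 1)}

def binary_rep(pauli):
    """Binary row vector of length 2n for a Pauli string such as "XZI"."""
    xs = tuple(PAULI_BITS[p][0] for p in pauli)
    zs = tuple(PAULI_BITS[p][1] for p in pauli)
    return xs + zs

def xor(v, u):
    """Binary representation of the product of two Pauli operators."""
    return tuple(a ^ b for a, b in zip(v, u))

def commutation(v, u):
    """c(v, u) = v Lambda u^T over GF(2): 0 if commuting, 1 otherwise."""
    n = len(v) // 2
    swapped = u[n:] + u[:n]         # Lambda swaps the two halves of u
    return sum(a * b for a, b in zip(v, swapped)) % 2

def weight(v):
    """Number of qubits on which the operator is not the identity."""
    n = len(v) // 2
    return sum(1 for i in range(n) if v[i] or v[n + i])
```

For instance, `commutation(binary_rep("XI"), binary_rep("ZI"))` evaluates to 1, while `commutation(binary_rep("XX"), binary_rep("ZZ"))` evaluates to 0, since the two anti-commuting pairs cancel.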

FIG. 4: The figures show the decoding process based on the proposed scheme. Each picture shows only the Z lattice, in which each edge indicates whether there is a bit-flip error on the corresponding physical qubit, and each circle indicates whether an error is detected through the syndrome measurement. (a) Five logical Z operators which minimize the normalized sensitivity $m(H_g)/M(H_g)$. (b) The actual physical error is drawn as red edges, and the detected syndromes as red circles. The binary numbers shown on the right are the diagnosis vector of the physical error. The neural network learns the relation between the locations of the detected syndromes and the diagnosis vector. (c) The real-valued diagnosis vector is predicted by the neural decoder. (d) Given the syndrome pattern, the faithful diagnosis vector is either 10000 or 01111. The chosen faithful diagnosis vector is 10000. Accordingly, we choose the recovery operator shown in the figure. In this case, the decoding succeeds.

FIG. 5: The performance comparison between the neural decoder with the uniform construction (solid lines), that with the short diagnosis construction (pale lines), the MD decoder (dashed lines), and the MWPM decoder (dotted lines) in the case of the $[[d^2, 1, d]]$ surface code. The logical error probabilities are plotted against the size of the training data set with the physical error probability $p$ fixed. We calculated the performance for distances $d = 5$ (red), 7 (green), 9 (blue), and 11 (cyan). (a) The case of the bit-flip noise with $p = 0.1$. Note that there are no lines for the MWPM decoder since the MWPM decoder is equivalent to the MD decoder in this setting. (b) The case of the depolarizing noise with $p = 0.15$.

FIG. 6: The performance comparison between the neural decoder with the uniform construction (solid lines), the MD decoder (dashed lines), and the MWPM decoder (dotted lines) in the case of the $[[d^2, 1, d]]$ surface code. We calculated the performance for distances $d = 5$ (red), 7 (green), 9 (blue), and 11 (cyan) with the same training data set of size $10^6$. (a) The case of the bit-flip noise. (b) The case of the depolarizing noise.

FIG. 7: The performance comparison between the neural decoder with the uniform construction (solid lines), the MD decoder (dashed lines), and the MWPM decoder (dotted lines) in the case of the $[[d^2, 1, d]]$ surface code. The neural decoder is trained with a training data set of size $10^6$. We calculated the performance for distances $d = 5$ (red), 7 (green), 9 (blue), and 11 (cyan). (a) The case of the bit-flip noise. The training data set is generated at the physical error probability $p = 0.08$. (b) The case of the depolarizing noise. The training data set is generated at the physical error probability $p = 0.11$.

FIG. 8: The performance comparison between the neural decoder with the uniform construction (solid lines) and the MD decoder (dashed lines) in the color codes. We calculated the performance for distances $d = 3$ (black), 5 (red), 7 (green), and 9 (blue) with a training data set of size $10^6$. (a) The case of the bit-flip noise in the [4,8,8]-color code. (b) The case of the depolarizing noise in the [4,8,8]-color code. (c) The case of the bit-flip noise in the [6,6,6]-color code. (d) The case of the depolarizing noise in the [6,6,6]-color code.

FIG. 10: A simple case of a convolutional layer where the number of input channels is three and the number of output channels is two.

FIG. 11: The figure shows how to split and reallocate the syndrome vectors to the two input layers of the neural network. In the case of the $[[2d^2 - 2d + 1, 1, d]]$ code, the lattice is split into a $(d - 1) \times d$ array of syndromes and a π/4 rotated one. We input two $d \times (d - 1)$ matrices as the first layer of the neural network. In the case of the $[[d^2, 1, d]]$ code, we split the syndromes into two $(d - 1) \times \frac{d+1}{2}$ arrays.

FIG. 15: The performance comparison between the CNN decoder (solid lines), the MD decoder (dashed lines), and the MWPM decoder (dotted lines) in the case of the $[[2d^2 - 2d + 1, 1, d]]$ surface code, where the decoders are trained with a training data set generated at a fixed error rate. We calculated the performance for distances $d = 5$ (red), 7 (green), 9 (blue), and 11 (cyan). (a) The case of the bit-flip noise. The training data set is generated at the physical error probability $p = 0.08$. (b) The case of the depolarizing noise. The training data set is generated at the physical error probability $p = 0.11$.

FIG. 18: Logical operators used for the construction of a diagnosis matrix for the [6,6,6]-color codes. Each colored line corresponds to a chosen logical operator. The lines are colored only for visibility, and are not related to the colors of the color codes.

FIG. 19: Logical operators used for the construction of a diagnosis matrix for the [4,8,8]-color codes. Each colored line corresponds to a chosen logical operator. The lines are colored only for visibility, and are not related to the colors of the color codes.