Optimal noise estimation from syndrome statistics of quantum codes

Quantum error correction allows to actively correct errors occurring in a quantum computation when the noise is weak enough. To make this error correction competitive information about the speciﬁc noise is required. Traditionally, this information is obtained by benchmarking the device before operation. We address the question of what can be learned from only the measurements done during decoding. Such estimation of noise models was proposed for surface codes, exploiting their special structure, and in the limit of low error rates also for other codes. However, so far it has been unclear under what general conditions noise models can be estimated from the syndrome measurements. In this work, we derive a general condition for identiﬁability of the error rates. For general stabilizer codes, we prove identiﬁability under the assumption that the rates are small enough. Without this assumption we prove a result for perfect codes. Finally, we propose a practical estimation method with linear runtime for concatenated codes. We demonstrate that it outperforms other recently proposed methods and that the estimation is optimal in the sense that it reaches the Cramér-Rao bound. Our method paves the way for practical calibration of error corrected quantum devices during operation.


I. INTRODUCTION
Quantum error correction is an essential ingredient in quantum computing schemes.When employing active quantum error correction via stabilizer codes, the decoding generally requires information about the error rates of all qubits.These error rates can vary significantly between qubits [1] and they might even vary in time.In contrast to traditional benchmarking before operation, a new approach is to estimate error rates online from the syndrome statistics of the code itself [2][3][4][5][6][7][8].It should be stressed that the syndrome statistics is the only information that can be measured without destroying the encoded information.As pointed out by Fowler et al. [3], this results in a noise model that is directly applicable for the decoder.Furthermore, it allows for the tracking of time-varying error rates [4,7].Experimentally, online optimization of control parameters in a 9-qubit superconducting quantum processor has been demonstrated in a Google experiment [9].
However, apart from the work of Spitz et al. [7], there has been very little theoretical investigation of the estimation problem, see Section IV A for a detailed discussion.For example, it is not clear for what combinations of noise models and codes the unknown parameters are identifiable from the syndrome statistics.Evidently, some restrictions must apply since estimating completely general noise would require measurements which destroy the logical state.For some codes and noise models, including surface codes with independent Pauli noise on each qubit, the analytical method developed by [7] proves parameter identifiability.On the other hand, for many other important codes such as the 5-qubit code [10], the Steane code [11], and more general color codes [12] this method is not applicable.
In this work, we address this question by deriving a general condition for parameter identifiability, and using it to explicitly prove results for the 5-qubit code and the Steane code.Furthermore, we introduce an explicit error rates estimator, similar to techniques employed in classical distributed source coding [13], for concatenated codes and simulate it on the concatenated 5-qubit code.This estimator outperforms previously proposed methods [3][4][5] in this setting, because it does not require the assumption of very low error rates.

Stabilizer Codes
Let us introduce our notation while briefly summarizing stabilizer codes.The Pauli group P n on n qubits is the group of Pauli strings generated by the Pauli operators {X, Y, Z, I} with phases, The Pauli group modulo phases is called the effective Pauli group.We denote the i-th tensor factor of e ∈ P n as e i .The Pauli operator acting as e ∈ P 1 on qubit i and as the identity elsewhere is denoted e (i) ∈ P n .A stabilizer code encoding k = n − l qubits is defined by a commutative subgroup S of P n with generators g 1 , . . ., g l [14].The code-space is the simultaneous +1-eigenspace of the generators.Phases are generally not important for quantum error correction, so we consider data errors as elements of the effective Pauli group.For an error e ∈ P n , we define the syndrome S(e) ∈ F l 2 entry-wise by To correct an error e ∈ P n , a recovery r ∈ P n is applied based on the measured syndrome.Since errors that only differ by stabilizers act equivalently on the encoded information, the recovery is successful if the equivalence class [er] is trivial, i.e.

II. IDENTIFIABILITY CONDITIONS
In this section, we derive a general condition for the identifiability of the error rates.Then we prove for perfect codes that this condition is always fulfilled whenever the error rates are sufficiently close.

A. General Conditions
We consider a stabilizer code with n qubits and l stabilizer generators.Let us first define identifiability.Given is a parameterized noise model, mapping a vector of error rates θ to a vector (P [E] E∈Pn ) specifying the probability P [E] for each error E ∈ P n .This induces the map M : θ → (P [S]) S∈F l 2 , mapping a parameter vector to the corresponding syndrome statistics via where P [S] is the induced probability of observing the syndrome S ∈ F l 2 .Error rates are identifiable from the syndrome statistics if the map M is injective.This will usually not be the case, due to symmetry around error rates of 0.5.However, we can still hope that the parameters are at least identifiable if we restrict to some region in the space of parameters θ.
Definition 1 (Local identifiability).We say that error rates are locally identifiable at θ if there exists ε > 0 such that the map M is injective on the ball For ease of exposition, we will focus in this section on independent single qubit Pauli noise, which is a simple but widely studied error model.A substantial generalization of Proposition 1 and Theorem 1 to much more general error models, including measurement errors, can be found in Section IV B. For now let us assume that errors on the i-th qubit of the code are modeled by the Pauli channel The parameter vector θ for this error model is given by the error rates (θ i e ) i∈{1,...,n},e∈{X,Y,Z} of all non-trivial single qubit errors.By the inverse function theorem, local identifiability at θ holds if and only if the Jacobian matrix J = D θ M at θ has full (column) rank.We will label the rows of the Jacobian by syndromes S and the columns by parameters θ i e , and denote entries with square brackets, e.g. as J[S, θ i e ].In the limit of low error rates, it is intuitive that identification of error rates is possible since a syndrome always arises from the matching single qubit error, and no combined errors occur.Thus, the only requirement is that the single errors can be identified from the syndromes.This just means that the code has distance at least 3, i.e.only trivial codes are excluded.This leads to the estimators proposed in Refs.[3][4][5].We confirm this intuition by calculating the Jacobian of M and checking its rank: Proposition 1 (Identifiability for small error rates).For a quantum code subject to independent single qubit Pauli noise, error rates are locally identifiable at θ = 0 if and only if S(e) = S(e ) for every choice of two different single qubit errors e, e .
A proof is provided in Section IV B 2. Our first central result is an identifiability condition without the assumption of low rates.This establishes a connection between local identifiability and the posterior distribution of errors for each qubit.
Theorem 1 (General identifiability condition).Consider a quantum code subject to independent single qubit Pauli noise.Assume that all error rates are non-zero and that P [S] > 0 for all syndromes S ∈ F l 2 .Then error rates are locally identifiable at θ if and only if the matrix J with entries has full column rank.Here, P [E i = e|S] is the conditional probability that the i-th qubit is affected by the error e ∈ P 1 given that the observed syndrome is P [S].
The proof is provided in Section IV B 3.

B. Identifiability for Perfect Codes
We demonstrate the analytical application of Theorem 1 by considering the class of perfect codes.
Definition 2 (Perfect single error correcting quantum code [15]).A quantum code C on n qubits is called a perfect single error correcting code if there is a bijection between the set of non-trivial single qubit errors and the set of non-trivial syndromes, i.e. there exists a bijective map These codes are called "perfect" because they saturate the (quantum) Hamming bound.A well known example of such a code is the 5-qubit code [10].Other families of perfect codes are cyclic Hamming codes [16] and a class of twisted codes [17].The main result of this section is that error rates for such codes are locally identifiable around the points of equal rates, even for high error rates.This provides another concrete class of codes where identification of error rates is possible.
Theorem 2 (Identifiability for perfect codes).Let C be a perfect single error correcting quantum code on n qubits subject to independent single qubit Pauli noise.Then the error rates are locally identifiable around any point θ with equal error rates, i.e. if there exists p ∈ (0, 1) such that θ i e = p for all i and all e ∈ {X, Y, Z}.
Note that the condition on θ above does not mean that we restrict ourselves to a simple single parameter model.We still allow all estimated error rates to vary individually, but require that the actual error rates are close to being equal.In order to prove Theorem 2 via Theorem 1, we have to check the rank of the matrix J given in (6).Using Bayes Theorem, we can express its entries as The key insight, which might be of independent interest, is that most of the conditional probabilities in this expression are equal: Lemma 1.Consider a perfect single error correcting code on n qubits subject to independent single qubit noise where all error rates are equal.Let e, e ∈ P 1 .Then for any syndrome S ∈ F l 2 \ {0} and qubit i such that S = S(e (i) ) and S = S((e ) (i) ) we have The proof is provided in Section IV C.This Lemma immediately implies that if S = 0 and S = S(e i ), then J[S, θ i e ] = 0. We can ignore the case S = 0 due to normalization, and in the case S = S(e i ) we have J[S, θ i e ] = 0. Thus the columns of J are linearly independent unit vectors, i.e.J has full rank.This proves Theorem 2.
The arguments of the proof straightforwardly generalize to other noise models as long as the perfect code condition is fulfilled, i.e. there is a bijection between syndromes and elementary errors.For example one could consider simple noise models where Pauli X and Pauli Z errors occur independently.The rates of such a model are locally identifiable on the Steane code around points of equal rates, since the Steane code reduces to the classical Hamming code when only one type of errors is considered.The Hamming code is known to be a perfect code.Estimation of such a model on the Steane code was considered in Ref. [4].Thus, Theorem 2 also provides a theoretical background for the results presented there.

III. NUMERICAL ESTIMATION METHOD
In this section, we complement the previous results with a practical estimation method, which is based on the combination of belief propagation (BP) and expectation maximization (EM).In the limit of low error rates, methods based on "hard assignments" were proposed independently by [3][4][5].They use either the recovery output by a ("hard") decoder or the lowest weight error corresponding to a syndrome.Inspired by techniques from classical distributed source coding [13], we instead consider an estimation method that uses the full information about the distribution of errors given a certain syndrome, by combining a "soft" decoder [18] with the expectation maximization algorithm [19,20].

A. Concatenated Codes and Belief Propagation
Let us briefly summarize concatenated codes and their maximum-likelihood decoding [18].We consider independent single qubit Pauli errors.A concatenated quantum code is obtained by encoding each qubit of a quantum code again in the same code.This defines a tree structure, where the logical qubit of a code-block is a "physical qubit" in the next layer, as illustrated in Figure 1 for the 5-qubit code.We can view this as a graphical representation of the probability distribution over all possible errors given the measured syndrome, called a factor graph.The root node represents the total logical error.Maximum-likelihood decoding is done by computing its marginal distribution to find the most likely logical operator.Computation of marginal probabilities is efficiently possible using the BP algorithm (see e.g.Ref. [21]).BP works by passing messages along the edges of the graph.To compute the marginal of the root node it suffices to pass messages upwards, starting from the leaves.Doing an additional downwards pass, we can also calculate the marginals of the leaf nodes, i.e. the distribution of errors on a qubit given the measured syndrome.The computational effort of this method scales linearly in the number of qubits.

B. Error Rates Estimation and Expectation Maximization
Starting from an initialization θ (0) of the estimated error rates and given a set D of measured syndromes, we can calculate a new estimate of the error rates using the EM algorithm, i.e. iterating the following steps until convergence.
1. Expectation step: Compute the expected sufficient statistics based on the current estimate θ (k) of the error rates.
2. Maximization step: Compute a new estimate θ (k+1) of the error rates by normalizing the expected sufficient statistics: Computationally, the main effort is in calculating the conditional probabilities needed for the expectation step.
The key point is that this can be done efficiently using BP.In an online estimation setting, the first iteration of EM introduces almost no overhead, since the marginals calculated during decoding can be used.Further iterations require re-decoding of the syndromes and are thus roughly as expensive as decoding.We will also compare our estimator with the "hard assignment" method [3][4][5], which is the best known scalable method.We extend this method slightly by allowing for multiple iterations.It can then be expressed as a variant of EM, called hard assignment expectation maximization (HEM) (see Ref. [20]).It consists of iterating the steps: 1.For each syndrome S ∈ D, compute the most likely error Obtain the new error rates by counting how often each single qubit error appears: Here δ is the Kronecker-delta.
Instead of the marginals, only the most likely error for each syndrome is considered.It can be computed efficiently using the max-sum algorithm which works similar to BP, see e.g.[21].

C. Numerical Results
In the following, we present a numerical comparison of our estimator (EM) and the "hard assignment" estimator (HEM).In light of our previous identifiability results, we consider the 5-qubit code, concatenated with itself, subject to independent depolarizing noise with error rate p on each qubit.Extending the method to a phenomenological noise model with measurement errors is straightforward, and some results are shown in Section IV D. We initialize the algorithm randomly around the actual rates, with a precision controlled by a real parameter α (higher is more accurate).To be precise, for each qubit i, we sample error rates θ i from a Dirichlet distribution where is a normalization constant.Such an initialization could be obtained from previous benchmarking or an educated guess.We then run the estimator for n it iterations on a data set of n est syndromes generated from the actual distribution.Using a fixed initialization and random actual error rates was also tested for α = 20, n est = 1000 and did not significantly change the mean squared error (MSE) of the estimate of the parameter vector.We chose p = 0.13 which is close to the threshold of the code [18,22], both because we are interested in the regime of high error rates, and because estimating logical error rates is difficult in the regime of low rates.A comparison of logical error rates before and after the estimation, using a relatively bad initialization, is shown in Figure 2. We also compare with the "perfect knowledge decoder" that is given knowledge of the actual error rates.Logical error rates were estimated by decoding 10 5 −10 6 random errors, except for the perfect knowledge decoder where 10 8 random errors were used.A clear improvement is observed even after 1 iteration, and for 5 iterations EM was able to reach close to optimal error rates, while HEM showed no further improvement after the first iteration.We also confirmed that the MSE of the EM estimator is optimal in the sense that it reaches the Cramér-Rao bound, which lower bounds the MSE of any unbiased estimator (Figure 3, more details and results in Section IV D).The HEM estimator showed significantly higher MSE.Finally, we note that since it is a form of maximum-likelihood estimation we expect the estimator to be robust in case of model-misspecification [23] -quantifying the robustness is left for future research.The parameters were p = 0.13, α = 20, and n est = 1000.

IV. DETAILS AND PROOFS
In this section, we will give further details and generalizations on some topics and provide all the proofs that were previously omitted.Furthermore, we present more extensive numerical tests of our estimator.

A. Analytical Solution Under a Conditional Independence Assumption
Spitz et al. [7] have derived an analytical solution of the estimation problem for certain models.Here, we re-derive this solution in a slightly more general setting and discuss the underlying assumptions and limitations by giving examples of quantum codes that cannot be treated in this way.
The estimation method is considered for a circuit noise model, where errors can affect each part of the error correction circuit, including measurements.Definition 3 (Independent binary circuit noise).Let {X q } q=1,...,m denote a collection of (multi-qubit) Pauli errors, where each error may affect one or multiple sites in the error detection circuit.Under independent binary circuit noise, each error occurs independently, and the error X q occurs with probability θ q .
The errors in {X q } q=1,...,m will also be referred to as elementary errors.In such a model, the errors can be treated as binary variables, where X q = 1 with probability θ q and X q = 0 with probability 1 − θ q .Furthermore, the outcomes of the stabilizer generator measurements can be denoted by binary variables S i , where S i = 1 if the total error anti-commutes with the i th generator and S i = 0 otherwise.Then, the rates of errors that affect multiple stabilizers can be estimated using the following proposition.
Proposition 2. Consider a stabilizer code subject to independent binary circuit noise.Let S 1 ,S 2 be two syndrome bits and X be an elementary error such that the following three conditions are fulfilled: where ⊕ is addition modulo 2 and E[ • ] denotes expectation values.
The idea is that the correlation between S 1 and S 2 gives us the rate of the error X.Note that the first two conditions are automatically fulfilled for any error X that anti-commutes with both S 1 and S 2 .The third condition however is interesting.It essentially states that X is the only elementary error in our noise model that affects both S 1 and S 2 .
Proof of Proposition 2. Since the syndromes are binary variables, we have This can be rewritten using the law of total probability.
Now we regroup the second term.
Finally, we use assumptions 1,2 and 3 to finish the calculation.
where we also used P X = 1 − P [X].This is equivalent to (12).
Errors that only affect a single stabilizer can be estimated once the other rates are known, using the following proposition, Proposition 3. Let S be a stabilizer and let {X 1 , . . ., X k } be the set of all elementary errors in our noise model that anti-commute with S.Then, Proof.By assumption, the outcome of measuring S is completely determined by the errors X 1 , . . ., X k .Therefore, Since the elementary errors are independent we can factor out one of them.A connection between an error and a stabilizer means that they anti-commute. Applying and doing some algebra leads to The claim now follows by induction.
If e.g. the rates of X 2 , . . ., X k are already determined by using the estimation from the previous section, Proposition 3 can be used to estimate X 1 .
The estimation using propositions 2 and 3 is in closed form, however there are some limitations.First of all, the assumption of binary noise is relatively restrictive.For example, such a model does not include the commonly used depolarizing noise, since the probability of a Pauli Y error is not the product of the probabilities of X and Z errors.It is possible to work around this problem to some extent by modeling depolarizing noise as independent X,Z and Y errors with some effective rates, which works for low error rates.The second problem is that one only considers correlations between pairs of stabilizers, but not higher order correlations.This is generally not sufficient to fully characterize a code.For example, considering the well known 5 qubit code subject to independent Pauli noise on each qubit, there are 15 parameters to be estimated (the probabilities of each of the 3 non-trivial Pauli errors for each of the 5 qubits), while the two propositions provide at best 4  2 + 4 = 10 equations.However, we have shown in the main text that it is possible to estimate error rates of this code at least in certain parameter regimes (Theorem 2).Furthermore, Proposition 2 requires that one can find pairs of stabilizers that are only correlated by a single elementary error.It is not always possible to find such pairs.As an example, consider the 7 qubit Steane code subject to only independent Pauli X errors on each qubit.The stabilizers of this code are illustrated in Figure 4. We see that because of the central error node X 7 , there are no two stabilizers that are connected only through a single elementary error.Therefore we cannot apply Proposition 2 here.However, Theorem 2 implies that parameters of this model are identifiable at least in a certain regime.Note that similar problems occur for color codes, since the Steane code is the smallest example of a color code [12].The noise model decomposes into channels that act independently, as illustrated by the red boxes.For example, the channel N 1 applies an X error to the first qubit with some probability θ 1 X⊗I⊗I .N 2 applies the error X ⊗ X ⊗ I with probability θ 2 X⊗X⊗I and the error I ⊗ X ⊗ I with probability θ 2 I⊗X⊗I .N 4 flips the outcome of the measurement of S 1 with probability θ 4  (1,0) .

B. Generalized Identifiability Results
In this section, we provide generalized versions of Proposition 1 and Theorem 1 as well as their proof.Furthermore, we provide the proof of Theorem 2.

Formal Definition of Error Model
We consider a quite general error model that includes independent single qubit Pauli noise as a special case.There are two main underlying assumptions.The first is that errors on the data qubits and syndrome bits are stochastic Pauli errors and bit-flips, which is common in the treatment of quantum error correction codes.The second is that there is some independence between different kinds of errors, which is both of fundamental importance for error correction and often physically reasonable.The first assumption implies that errors can be modeled as elements of the group E l n = P n × F l 2 , where the first component represents a Pauli error on the data and the second component represents a bit-flip on the measured syndrome.The product in this group is thus (e, f ), (e , f ) → (ee , f ⊕ f ) and the identity element is I = (I Pn , 0).The syndrome of (e, f ) ∈ E l n is S(e) ⊕ f .Definition 4 (decomposable error model).
For each i ∈ {1, . . ., m}, let θ i = (θ i e ) e∈Ni∪{I} be a probability vector over N i ∪ {I}, and define θ = (θ i e ) i∈{1,...,m},e∈Ni by grouping together all these probability vectors and excluding the rates of trivial errors.An error model is decomposable with error sets N 1 , . . ., N m and parameters θ if errors from the different sets occur independently, i.e. the probability of a given error combination X ∈ (( where δ is the Kronecker-delta and An example of such a model is given in Figure 5. There, the error sets would be (Since errors here either only act on data qubits or only on syndrome bits we omitted the other trivial part of the errors.)For independent single-qubit Pauli noise the error sets would be given by N i = {X (i) , Z (i) , Y (i) }.We will refer to the elements of the individual error sets as elementary errors.Since there can be overlap between the supports of the different error channels, we often consider the vector X ∈ ((N 1 ∪ {I}) × • • • × (N m ∪ {I})), containing all the elementary errors that occurred.The combined error E ∈ P n on the qubits and syndrome bits is then the product of all elementary errors that occurred, i.e.E = i X i .For independent single qubit Pauli noise X and E coincide.The map M : θ → (P [S]) S∈F l 2 introduced in Section II A can now be written as Our identifiability conditions can now be straightforwardly generalized by considering the elementary errors as the new "single qubit errors".We also note that in the presence of measurement errors, it might be appropriate to include redundant stabilizer measurements such that the length l of a syndrome is larger than the number of stabilizer generators [24][25][26].Our results also apply to such a scheme.

Proof of Proposition 1
Explicitly, Proposition 1 is generalized as follows: Proposition 4. Consider a quantum code subject to a decomposable error model with error sets N 1 , . . ., N m and parameters θ.Then the parameters of the channel are locally identifiable at θ = 0 if and only if S(e) = S(e ) for every choice of two different elementary errors e, e ∈ m i=1 N i .
Proof.We have to show that the map M defined in Section II is locally invertible at 0. The probability of the error where P [X] is given in (14).The probability of observing syndrome S is Thus the map M decomposes as M = T • g, where describes the distribution of total errors, and describes the probability of each syndrome.Since T is linear, we have We begin by calculating the derivative of P [X].

∂P [X]
where X −i denotes X without the ith component.Since we consider θ = 0, P We then have where the last line follows because there is always at most one non-zero summand, since the different error sets are by definition disjoint.Therefore the derivative of g has a very simple form: where u ei denotes the unit vector associated to the corresponding elementary error e i , and k = m i=1 |N i |.Since the error sets N 1 , . . ., N m are disjoint, i.e. there are no duplicate elementary errors, this matrix has k independent columns and thus full column rank.As long as no two elementary errors have the same syndrome, the images of these columns under T are again linearly independent.Then D (θ=0) M = T • D (θ=0) g has full column rank, and the inverse function theorem completes the proof.

Proof of Theorem 1
Our general version of Theorem 1 is: Theorem 3. Consider a quantum code subject to a decomposable error model with error sets N 1 , . . ., N m and parameters θ.Assume that θ i e > 0 for all i and all e ∈ N i , and that P [S] > 0 for all syndromes S ∈ F l 2 .Then the error rates are locally identifiable at θ if and only if the matrix J with entries has full column rank.
Proof.We have to show that the map M defined in Section II is locally invertible at θ. Since we assume that the rates of all errors and syndromes are strictly greater than 0, the M will be locally invertible at θ if and only if the entry-wise logarithm ln(M) is locally invertible at θ. Thus we consider the derivative of the log-likelihood ln(P [S]) for each syndrome.Remember that the probability of a syndrome S can be expressed as where P [X] is given in (14).As in the proof of Proposition 1 we compute the derivative Using the fact that we obtain ∂ ln(P [S]) By the inverse function theorem, this completes the proof.
C. Proof of Lemma 1 We will now proof Lemma 1 in order to finish the proof of Theorem 2. Remember that the i-th tensor factor of E ∈ P n is denoted E i .Furthermore, the Pauli acting as e ∈ P 1 on qubit i and as the identity everywhere else is denoted e (i) .Finally, for E ∈ P n , we use E |i as a shorthand for (E i ) (i) .We define the weight of a Pauli error in the standard way.
Definition 5 (weight).The weight of a Pauli error In the case of equal error rates p, the probability of an error is determined by its weight.Let us denote p := 1 − p.We obtain a convenient expression for P [E i = e, S].For e = I we have and, analogously, for e = I By the definition of conditional probability we obtain Lemma 1 is thus equivalent to the following lemma: Lemma 2. Consider a perfect single error correcting code on n qubits.Let e, e ∈ P 1 .Then for any syndrome S ∈ F l 2 \ {0}, error rate p ∈ [0, 1] and qubit i such that S = S(e (i) ) and S = S((e ) (i) ) the following equality holds: In other words, we have to show that the sums in the expression do not depend on e except if S = S(e (i) ).This is the case if for all w = 0, . . ., n − 1 and e, e ∈ P 1 such that S = S(e (i) ), S((e ) (i) ) we have since then the coefficients for each of the exponents appearing in the expressions will be equal.Therefore, in the following, we will derive an expression for the "modified" weight distribution We will show that this distribution is independent of e if S = S(e (i) ).For the rest of this section, we fix a qubit q ∈ {1, . . ., n}, an error ê ∈ P 1 which will act on q and some syndrome 0 = S * ∈ F l 2 , and we denote k w := k w (ê, q, S * ) and K w := K w (ê, q, S * ).For now, we do not assume that S * = S(ê (q) ).
Notation 1 (Perfect Code Property).Since we consider a perfect single error correcting code, for each syndrome S there exists a unique single qubit error e (q) with S(e (q) ) = S.We denote this error by S −1 (S).
The core idea of the proof is to construct the sets K w iteratively.We can use the perfect code property to construct weight w errors with syndrome S * from weight w − 1 errors with any syndrome S by adding the unique single qubit error S −1 (S ⊕ S * ).We formalize this as follows.
Definition 6 (S * -modification and ê(q) -extension).Let E ∈ P n .We say an error E * is a S * -modification of E if S(E * ) = S * and there exists a single qubit error e (q) with E * = Ee (q) .
We say E * is an ê(q) -extension of E if E * is a S * -modification of E with wt(E * −q ) = wt(E −q ) + 1 and E * q = ê.
Note that this definition does depend on the choice of ê(q) and S * , which is fixed for the rest of this section.
It is simple to construct a S * -modification for each error.
Lemma 3.Each error E ∈ P n has a unique S * -modification.We denote it E * .
Because we consider a perfect code this implies e (q) = (e ) (q ) .Thus the S * -modification is unique.
However, it is possible that an error E does not have a ê(q) -extension.This happens for example if the unique single qubit error that needs to be added to obtain the S * -modification is already in E, or if it is on q.We formalize this in the following corollary.
Corollary 1.Let E ∈ P n be an error with E q = ê and wt(E −q ) = w.E does not have an ê(q) -extension if and only if one of the following mutually exclusive conditions is true: where as always E * is the unique S * -modification of E. If we write E * = Ee (q) , where e (q) is a uniquely determined single qubit error acting on qubit q, these conditions are equivalent to (i') e = I (ii') E q = e ∧ q = q ∧ e = I (iii') E q = e ∧ E q = I ∧ q = q ∧ e = I (iv') q = q ∧ e = I Proof.By definition E * = Ee (q) is an ê(q) -extension of E if and only if where we have used that E q = ê.Negating this statement and using that wt(E * −q ) ∈ {w − 1, w, w + 1} leads to the conditions above.
Since similar reasoning will be used repeatedly throughout this section, let us illustrate some of the cases in Corollary 1 with an example.Consider the 5 qubit perfect code with stabilizer generators For this example, let q = 1, ê = X, and S * = (1, 0, 0, 1).The error E = X ⊗ X ⊗ X ⊗ I ⊗ I has the syndrome (0, 1, 0, 1), and thus its S * -modification is obtained by applying S −1 ((1, 1, 0, 0)) = X (3) , resulting in E * = X ⊗ X ⊗ I ⊗ I ⊗ I.This is not a valid ê(q) -extension since the weight was reduced, corresponding to case (ii) in Corollary 1.The single qubit error we applied canceled with an existing error in E. On the other hand, the error E = X ⊗ I ⊗ Z ⊗ Z ⊗ I has the syndrome S(E) = (1, 0, 1, 0).Thus its S * -modification is obtained by adding e = S −1 ((0, 0, 1, 1)) = X (5) , resulting in E * = X ⊗ I ⊗ Z ⊗ Z ⊗ X.This is a valid ê(q) -extension.Notice that the additional single qubit error was applied on qubit 5 where E acts trivially, or equivalently, E * 5 = e.In Corollary 1 we categorized errors without a valid ê(q) -extension by their S * -modification.Now we characterize k w in terms of ê(q) -extensions.Lemma 4. For any w > 0 Proof.By definition of k w (35), we have to show that ⊇: By definition of ê(q) -extension.⊆: Let E ∈ P n be an error such that E q = ê, wt(E −q ) = w and S(E) = S * .Chose a qubit q = q such that E q = I.Then E is an ê(q) -extension of E := EE |q .Furthermore wt(E −q ) = wt(E −q ) − 1 and E q = ê by definition of E .
While this establishes a connection between the weight distribution k w and the concept of ê(q) -extension, it is difficult to count all errors that are valid ê(q) -extensions.A number easier to characterize is This is similar to the characterization in Lemma 4, but l w > k w because two different errors can have the same ê(q) -extension.We have to correct for this "double counting".
Proof.By Lemma 3 we have a well defined function g : P n → P n that maps an error E ∈ P n to its S * -modification E * ∈ P n .By Lemma 4 and the definition of L w , g maps L w to K w , and the restriction g |Lw : L w → K w is surjective.
Because the S * -modification is unique, the pre-images of two distinct elements of K w under g are disjoint.Thus, We want to determine the size of these pre-images.So let E ∈ K w .From the definition of L w and the definition of ê(q) -extension, it follows that E ∈ g −1 |Lw (E) if and only if there exists a qubit q = q such that E q = I and E = EE |q .Thus, since by definition wt(E −q ) = w, the pre-image has w elements.This concludes the proof.Thus, we can characterize the weight distribution k w through the numbers l w , for which we derive a recursive formula.Lemma 6.There exists a recursive formula relating k w to k w−1 and k w−2 Proof.We prove that for any 2 ≤ w ≤ n.Lemma 5 then gives the corresponding equation for k w .The total number of errors E ∈ P n with wt(E −q ) = w − 1 and E q = ê is 3 w−1 n−1 w−1 since there are n−1 w−1 ways to chose w − 1 positions in n − 1 positions, and 3 possible Paulis on each position.Next we count how many of them do not have an ê(q) -extension.The different conditions for this are given in Corollary 1, where errors without an ê(q) -extension are categorized by their S * -modification.We count the number of errors E ∈ P n with wt(E −q ) = w − 1 and E q = ê fulfilling each of these different conditions.We can group errors without a valid ê(q) -extension by their S * -modification, i.e.

{E ∈ P
where all the individual sets are disjoint because the S * -modification is unique.To do this, we have to consider the following cases, for each of which wt(E −q ) = w − 1 and E q = ê hold.
Case (i): E * = E.This condition is equivalent to S * = S(E).By definition there are k w−1 such errors.
Case (ii): wt(E * −q ) = w − 2 ∧ E * q = ê.For each error E fulfilling this condition, we have that E = E * e (q) for a Pauli e ∈ P 1 \ {I} and a qubit q = q with E * q = I.For a given error E with wt(E −q ) = w − 2, E q = ê and S(E ) = S * , there are n − 1 − (w − 2) = n − w + 1 possibilities to chose a qubit q = q with E q = I.For each of these, there are 3 different Paulis one could add to this position.Each of these gives a distinct error E with E * = E .The total number of errors E with wt(E −q ) = w − 2, E q = ê and S(E ) = S * is by definition k w−2 , and because the ê(q) -extension is unique they all give distinct contributions.Thus, Case (iii):: For each such error E it holds E = E * e (q) for a Pauli e ∈ P 1 \ {I, E * q } and a qubit q = q with E * q = I.For a given error E with wt(E −q ) = w − 1, there are w − 1 choices for q = q such that E q = I, and for each choice of q there are 2 possible choices of e ∈ P 1 \ {I, E q }.The total number of errors E with wt(E −q ) = w − 1, E q = ê and S(E ) = S * is by definition k w−1 , and again they give distinct contributions.Thus, Case (iv):: E * q = ê.For each such error E there exists a corresponding E = E * such that E = E e (q) for an appropriate Pauli e ∈ P 1 \ {I}.Note that wt(E −q ) = wt(E −q ) = w − 1.The total number of errors E with wt(E −q ) = w − 1, E q = ê and S(E ) = S * is by definition e ∈P1|ê =e k w−1 (e , q, S * ), and because the S * -modification is unique the different e give different contributions.
There is no double counting because the union in (41) is disjoint.Finally we obtain the number of errors that have a valid ê(q) -extension by subtracting the number of errors without a valid ê(q) -extension from the total number of errors, which yields the recursion (40).
With this recursive formula we can easily prove by induction that for a given qubit q, k w (e, q, S) is (almost) independent of e and S.
Proof of Lemma 2. We consider again a fixed qubit q and syndrome S * ∈ F l 2 , and prove that the numbers k w (e, q, S * ) are equal for any e ∈ P 1 such that S(e (q) ) = S * .Let ê ∈ P 1 with S((ê) (q) ) = S * .We consider two different cases, corresponding to different initial conditions for Lemma 6.The two cases are: 1. S * = S((e ) (q) ) for some e = ê Consider case 2 first.For w = 0, we have k 0 (ê, q, S * ) = 0 independent of (ê, S * ) because S * = S(e (q) ) ∀e ∈ P 1 .For w = 1, k 1 (ê, q, S * ) = 1 is independent of (ê, S * ) because the only error e (q) with S(ê (q) e (q) ) = S * is S −1 (S(ê (q) )⊕S * )) (and this error does not act on q because S * = S(e (q) )∀e ∈ P 1 .).For w > 1, the claim follows by induction since the right hand side of the recursive equation in Lemma 6 is now independent of ê and S * .This concludes the proof for case 2. In case 1 the initial conditions are k 0 = 0, k 1 = 0.The rest of the proof is analogous.The only caveat is that the last term of (40) now also contains a term k w−1 (e , q, S * ) for an error e with S(e (q) ) = S * .But this term can be computed using the same recursive equation, and does not depend on e.
This finally concludes the proof of Lemma 2, and thus also the proof of Theorem 2. As mentioned above, Lemma 6 can also be used to calculate the numbers k w (e, q, S * ) for the remaining case S * = S(e (q) ).The correct initial conditions are k 0 = 1, k 1 = 0.

D. Additional Numerical Results
Here, we provide data complementary to the results shown in Section III C. In particular, we consider the mean squared error (MSE) of the proposed estimator, and we show results with noisy measurements.

MSE of the Estimator
First, we will demonstrate that the EM estimator achieves the Cramér-Rao bound (CRB).The MSE of the estimator T of a parameter θ can expressed by the bias-variance decomposition MSE = bias(T ) 2 + var(T ) . (42) Assume we want to estimate the error rates θ of a code from m independent syndrome observations.Then the covariance of any unbiased estimator T of θ is bounded by the CRB i.e. cov θ (T ) − I(θ) −1 m is a positive semi-definite matrix; here, the Fisher information matrix I is defined by In particular, the variance in the estimate of a single parameter is bounded by the diagonal entries of the inverse of the Fisher information.The derivative ∂ ln(P [S|θ]   ∂θ i e of the log-likelihood with respect to a parameter θ i e was already computed in (28) as Since the probabilities P [e i |S] can be computed using BP, we can numerically evaluate this bound for concrete codes and noise models and compare our estimator to this bound.However, for concatenation levels beyond the first, it was necessary to approximate the expectation value over all syndromes by Monte-Carlo sampling.We used 10 6 samples to do this.As a side note, it is not sufficient to consider the CRB for direct observation of the errors (which is much easier to evaluate).It can be shown that the Fisher information always decreases when post-processing the data, and thus the bounds for syndrome observations must necessarily be higher than for direct measurements of the errors (in our cases the difference was about a factor 2). Finally, it should be noted that in our simulations we have access to the actual error rates which makes it possible to compute the MSE.In a real experiment, one could for example consider the variance instead.In our tests, the EM estimator exhibited a squared bias that was very small compared to the variance, such that the variance coincides with the MSE.However, the HEM estimator showed significant bias in some settings.In the following, we always consider the MSE in the estimation of θ 1 X .However, plots for the other parameters look similarly.The MSE was always determined over 10 3 simulations for each data point.We consider the MSE of the estimation at error rate p = 0.13.For a relatively bad initialization results were already shown in Figure 3. Here, we consider the situation where an accurate initialization is available, demonstrated by using α = 200.An example comparing the MSE at the first concatenation level is shown in Figure 6.For low data sizes, the initialization is more accurate than the estimate using the data set.In this case, HEM outperforms EM and even beats the CRB (remember that the CRB as it is used here only applies to unbiased estimators).The reason is that HEM has a strong bias towards the initial parameters, which did not decrease with the size of the data sets or the number of iterations in our simulations.At larger data sizes this bias is detrimental, and it can be seen that EM outperforms HEM.Especially for low numbers of iterations, EM also exhibits some bias towards the initialization.This can be desirable in case of a good initialization, since it explains why EM also slightly beats the CRB at low data sizes.In particular, we see that at n est = 100 and n est = 1000 EM performs better if a low number of around 3 iterations is used.Note that a small bias remains at higher iterations, which explains why EM also slightly beats the CRB.Especially interesting is the case n est = 1000, where EM both improves over the initialization and clearly beats the CRB at low numbers of iterations.Since we do not know beforehand after how many iterations the procedure should be stopped, it is sensible to instead regularize the estimator in such a setting, such that it does not converge away from the improved value at low iterations.The regularization is done by introducing a Dirichlet prior over the initialization, representing information on its accuracy (see Ref. [21]).Here, β i e = (1 − (θ (0) ) i e )β and the real hyper-parameter β controls the strength of the regularization.The effect of this regularization, using β = 200, is also demonstrated in Figure 6 (the cross-shaped markers).It can be seen that the regularized EM algorithm converges roughly to the minimum of the unregularized version, which was the desired effect.For large data sizes, the regularization introduces a minimal increase in the estimation error.We also tested regularizing the HEM version in the same manner, but no improvements were obtained.Similar results could be obtained for higher concatenation levels.The main difference is that HEM performs worse at higher levels.

Estimator with Measurement Noise
We consider a phenomenological noise model as described in the main text, where Pauli errors occur independently between qubits and bit-flips independently on each syndrome bit.The error rates can be different on each data qubit and syndrome bit.The maximum-likelihood decoder, described in the main text, can be easily modified to include these measurement errors.This is done simply by including the measurement errors as additional nodes, connected to the factor corresponding to the syndrome bit that they flip.This does not destroy the tree structure, and thus decoding and determination of marginals can still be done via BP.Using this adapted maximum-likelihood decoder, EM is only about 5, compared to about 30 for the bad initialization case.

V. CONCLUSION
We investigated the estimation of stochastic error models from the syndrome statistics of a quantum error correction code, establishing both theoretical results on parameter identifiability as well as a practical estimation method.The results do not rely on the limit of low error rates, and our estimator outperforms other recently proposed methods [3][4][5].Our work opens up a number of new research directions.On the theoretical side, it will be interesting to use our identifiability condition to prove results beyond perfect codes.It might also be possible to extend the result on perfect codes beyond the case of equal rates, since numerical results suggest that this assumption is not crucial.The proposed estimator could be straightforwardly applied to quantum low density parity check codes, although the problem arises that belief propagation is no longer exact in this scenario.One could also combine our estimator with methods from Refs.[4,27] to estimate time-dependent error rates and avoid the re-decoding overhead, or consider its application to fault-tolerant circuits as was done for the hard assignment method in Ref. [3].

VI. CODE AVAILABILITY
Our Python implementation of the estimator is available on GitHub.

Figure 1 :
Figure 1: The factor graph representation of the 2 times concatenated 5-qubit code.The circles depict variable nodes representing (logical) errors, i.e. each variable takes values in P 1 .The squares depict factor nodes representing the stabilizer checks.

Figure 2 :Figure 3 :
Figure 2: Logical error rate of the maximum likelihood decoder.Each point is a box-plot including 100 runs with random initializations and estimation data, except for the perfect knowledge decoder, where the error bars indicate a 95% Clopper-Pearson confidence interval.The boxes extend from the lower to the upper quartile of values, with a line at the median.The whiskers extend to the last data point within 1.5 interquartile ranges of the box in each direction.Outliers beyond this are shown individually as circles.The parameters were p = 0.13, α = 20 and n est = 10 3 .

Figure 4 :
Figure 4: Errors X i and stabilizers S i for the 7 qubit Steane code with only X errors (only the 3 relevant stabilizers are shown).A connection between an error and a stabilizer means that they anti-commute.

Figure 5 :
Figure 5: Representation of a simple decomposable error model on the repetition code.The circles represent the 3 qubits of the code.The green boxes represent the two stabilizer generators S 1 = Z ⊗ Z ⊗ I and S 2 = I ⊗ Z ⊗ Z.The noise model decomposes into channels that act independently, as illustrated by the red boxes.For example, the channel N 1 applies an X error to the first qubit with some probability θ 1 X⊗I⊗I .N 2 applies the error X ⊗ X ⊗ I with probability θ 2 X⊗X⊗I and the error I ⊗ X ⊗ I with probability θ 2 I⊗X⊗I .N 4 flips the outcome of the measurement of S 1 with probability θ4  (1,0) .

Figure 6 :
Figure 6: Comparison of the MSE in θ 1 X between EM HEM (triangles) and regularized EM (crosses) for different amounts of estimation data n est for a good initialization at the first concatenation level.The parameters were p = 0.13, α = 200, n concat = 1.β = 200 was used for the regularized version.