Learning-based quantum error mitigation

If NISQ-era quantum computers are to perform useful tasks, they will need to employ powerful error mitigation techniques. Quasi-probability methods can permit perfect error compensation at the cost of additional circuit executions, provided that the nature of the error model is fully understood and sufficiently local both spatially and temporally. Unfortunately these conditions are challenging to satisfy. Here we present a method by which the proper compensation strategy can instead be learned ab initio. Our training process uses multiple variants of the primary circuit where all non-Clifford gates are substituted with gates that are efficient to simulate classically. The process yields a configuration that is near-optimal versus noise in the real system with its non-Clifford gate set. Having presented a range of learning strategies, we demonstrate the power of the technique both with real quantum hardware (IBM devices) and exactly-emulated imperfect quantum computers. The systems suffer a range of noise severities and types, including spatially and temporally correlated variants. In all cases the protocol successfully adapts to the noise and mitigates it to a high degree.


I. INTRODUCTION
It is widely believed that we are entering the era when the computational power of quantum machines surpasses any classical resource for certain specific problems [1,2]. One of the main obstacles to achieving the practical application of quantum computing is the noise caused by decoherence and imperfect control. There exist well-understood solutions involving quantum error correction, which can suppress the computing error to an arbitrarily low level when the error rate of elementary gates is below the threshold. However, implementing this approach involves a multiplicative increase in the number of physical qubits, potentially by a factor of a thousand or more [3]. This appears prohibitive for the near future. Therefore, for noisy intermediate-scale quantum (NISQ) devices, alternative approaches, usually termed quantum error mitigation, have been developed.
At the base level, it is of course essential to minimise noise during the physical execution of a gate, through optimising control parameters et cetera [4], and here we take it as read that such measures have been taken. Above this level, one can use error extrapolation and probabilistic error cancellation [5][6][7]; here the estimator of the computing result is carefully constructed and optimised using knowledge of the error distribution, such that the impact of errors is minimised [8][9][10]. Similar ideas have been used to correct measurement errors [11][12][13]. By exploiting the symmetry of the quantum circuit, some errors in the circuit can be detected and eliminated using post-selection [14,15]. A number of related ideas such as subspace expansion [16] and continuous error mitigation [17], among others, are being explored.
In many potential NISQ applications, for example the use of variational quantum algorithms (VQAs) in eigensolvers or simulation [5,18,19], a key task is to evaluate the mean values of some observables; in essence, to measure the expected value of one or more qubits as the output of a circuit. The estimator of the mean is usually biased as a result of the noise. The role of error mitigation is then to remove the bias by modifying the estimator. Because of the linearity of quantum mechanics, a linear combination of noisy circuits with appropriate coefficients (both positive and negative) can be equivalent to a noise-free circuit [6]. One can implement such a combination by randomly sampling from a particular set of quantum circuits, derived from the primary circuit by (typically) the addition of certain gate(s), and taking a weighted average over the recorded outcomes. This can be called probabilistic error cancellation [6]. If the error model, i.e. a precise theoretical characterisation of the errors in the physical gates, is available, then it may be possible to analytically derive the ideal distribution of circuits: both their nature and the proper weightings with which their outputs should be combined. Then perfect compensation for errors is achievable [6,7]. However, for this to be practical, the error model must be determined through some form of tomography [7]; this may be difficult [20] or infeasibly costly unless error correlations (either spatial or temporal) involve only a few qubits. Nevertheless, when such conditions are even approximately met, the approach can be very valuable, as has been successfully demonstrated in small systems of superconducting qubits and trapped ions [9,10].
In this paper, we present a novel and intuitive way to mitigate the errors. Instead of determining the error model that afflicts the experimental system and deriving the proper circuit distribution (i.e. combination coefficients), the distribution is determined via an ab initio learning process. We choose the distribution by minimising the error in the final computing result for a set of training computing tasks. The efficiency of learning-based error mitigation is due to its simplicity and intuitiveness. All potential error correlations, i.e. spatial and temporal correlations, are automatically taken into account in the learning process. Therefore, it is a promising way to realise reliable quantum computing with deep circuits on large systems.
An obvious difficulty for a learning-based error mitigation process, if it is to be relevant to real quantum computers implemented at scale, is that one cannot determine the correct value of a given observable (the 'goal' of the mitigation) by any means other than the execution of an ideal quantum circuit! Here we show that learning-based error mitigation is indeed feasible because Clifford-circuit training tasks are sufficient to find an optimal circuit distribution, regardless of the error correlations. We derive suitable Clifford circuits from the original (primary) circuit, and for such circuits we can evaluate the correct result by using efficient simulations on a classical computer [21][22][23]. We note that in the present work, the sufficiency is proved under the assumption of negligible single-qubit gate errors. In most quantum computing systems, single-qubit gates do indeed attain a much higher fidelity than other gates, e.g. an average gate fidelity of 99.9999% has been achieved with trapped ions [24], whereas the record for two-qubit fidelity is three orders of magnitude lower at 99.9% [25,26].
In the following we will consider two types of quantum computers, and argue that they are practically equivalent. The distinction concerns the question of whether it is trivial (zero resource cost) to reconfigure the computer from one circuit to another. The more convenient theoretical assumption is that it is indeed cost-free to reconfigure, in which case the learning process is a structureless random sampling. In real systems an experimentalist may prefer to configure a circuit once and sample from it many times before reconfiguring. The learning-based error-mitigation has two stages: the learning, in which we need to evaluate a loss function, and the error-mitigated computation. If the quantum circuit can be updated after each run, we use the Monte Carlo summation in both the loss function evaluation and computing, in order to maximise the number of training circuits. For scenarios where reconfiguration is costly, we propose a method using significant-error interventions.
We demonstrate our protocol both with real quantum hardware and with exactly-simulated virtual devices. We consider various tasks that these devices are attempting to perform, including variational quantum algorithms. For the simulated machines we are of course able to specify the noise model as we wish. In order to compare with previously reported tomography-based methods [6,7], we specify that there are local (two-qubit) errors that conform to a known noise model, but that the real noise model also involves additional correlated errors: either spatial 'cross talk' or temporal correlations. The learning-based protocol outperforms the tomography-based protocol by a factor of approximately 4 to 5, depending on the task at hand. Indeed, in all cases that we explore, with the real or virtual quantum systems, we find that the learning-based protocol performs very well.
We comment on, and show numerical data for, the scalability of our method to larger systems; we consider various learning strategies (including single-parameter versus multi-parameter, and summation versus product ansatz); and we consider the distinction between ideal 'infinite time' learning and resource-constrained learning. Because of its simplicity, effectiveness and flexibility, we conclude that the learning-based approach is a promising way to realise reliable value from NISQ-era quantum computing.
This paper is organised as follows. In Section II and Section III we introduce and describe the general protocol. In Section IV we discuss the practical implementation of the protocol; focusing on a Pauli error model (Section VI), we describe three practical methods in detail in Sections VII, VIII and IX, with numerical results in the first two. Section V separately introduces an alternative way to establish the cost function.
In Section X we demonstrate our protocol on real quantum hardware. Finally, in Section XI we summarise the protocol, conclude the main results and discuss future directions.

II. THE GENERAL PROTOCOL
We consider the quantum circuit as shown in Fig. 1.
In the circuit, all qubits are initialised in the state |0⟩ and measured in the Z basis at the end. Most errors are caused by multi-qubit quantum gates, e.g. controlled-NOT and controlled-phase gates. We call these gates frame gates. Supposing that the circuit has n qubits and N layers of frame gates, we use G_j, where j = 1, ..., N, to denote the overall n-qubit gate of the j-th layer. We assume that these multi-qubit gates are all Clifford, which is the only requirement for frame gates. Between frame operations, single-qubit unitary gates R = (R_1, R_2, ..., R_{n(N+1)}) are performed [see Fig. 1(a)], which specify the quantum computation. We call them computing gates. To implement the error mitigation, we introduce single-qubit Pauli gates before and after each computing gate [see Fig. 1(b)], denoted by P = (P_1, P_2, ..., P_{2n(N+1)}). We call these Pauli gates error-mitigating gates. In our protocol, the frame gates G_j are fixed, and the other gates (i.e. R and P) are treated as variables. We remark that circuits composed in this way are universal for quantum computing, and our protocol can be generalised to other circuit configurations.
Let µ be a binary vector that represents the measurement outcomes of the n qubits. A specific computation is to evaluate the mean value of a function f(µ). For example, if the observable is Z of the first qubit, the function is f(µ) = 1 − 2µ_1, where µ_1 is the measurement outcome of the first qubit. We use com_ef(R, P) to denote the mean value when the circuit is error-free and com(R, P) to denote the mean value in the actual noisy circuit.

FIG. 1. Simple example of a circuit without and with error mitigation. Each circuit has four qubits and two layers of frame gates G_1 and G_2 (dashed boxes). Note that errors afflict the frame gates and may correlate over arbitrarily many qubits within a box and between boxes (i.e. spatial and temporal errors). Frame operations (orange) include the qubit initialisation, frame gates and measurement. There are three layers of computing gates R_i (blue) in each circuit. To implement the error mitigation, two layers of Pauli gates P_i (brown) are introduced before and after each layer of computing gates. Usually, single-qubit gates next to each other in the circuit can be combined into one single-qubit gate in the physical implementation.
In probabilistic error mitigation, we use a linear combination of computing results with different P to estimate the error-free result. Given the combination coefficients q(P), i.e. quasi-probabilities, the error-mitigated computing result is

com_em(R, I) = Σ_P q(P) com(R, P),     (1)

where I means that all error-mitigating gates are identity. Compared to the error-free result, the computing error is

Error(R) = com_em(R, I) − com_ef(R, I).     (2)

Our goal is to find an optimal distribution q(P) such that the error is minimised. We consider a loss function of the quadratic form

Loss = Σ_{R∈T} Error(R)²,     (3)

where T is a set of training computing tasks. To evaluate the loss function, we can compute com(R, P) using the actual noisy quantum computer and com_ef(R, I) using a classical computer. Because Clifford circuits can be efficiently simulated on a classical computer according to the Gottesman–Knill theorem [21][22][23], we choose the training set T as a subset of Clifford circuits, i.e. T ⊆ C ≡ {R | all R_j are Clifford}. By minimising the loss function, we can find the optimal distribution q_opt(P) for the training set. Then, we apply the same distribution q_opt(P) to our primary computing task(s) R. We remark that R will be non-Clifford in non-trivial quantum computations.
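The estimator and quadratic loss just described can be sketched in a few lines. The following is a toy single-qubit model of our own construction (not the paper's circuits): the noisy device shrinks the ideal expectation value ⟨Z⟩ by a factor (1 − p), and the only error-mitigating gates considered are the identity and a Z-flip. The names `com`, `com_ef`, `com_em` and `loss` mirror the notation of the text but the model itself is purely illustrative.

```python
import numpy as np

p = 0.1  # noise strength of the toy model (assumed, for illustration only)

def com_ef(z_ideal):
    """Error-free mean value (classically simulable for Clifford training tasks)."""
    return z_ideal

def com(z_ideal, flip):
    """Noisy mean value with error-mitigating gate P = identity or P = Z-flip."""
    z = (1 - p) * z_ideal
    return -z if flip else z

def com_em(z_ideal, q):
    """Error-mitigated estimate: quasi-probability-weighted sum over P."""
    q_id, q_flip = q
    return q_id * com(z_ideal, False) + q_flip * com(z_ideal, True)

def loss(q, training):
    """Quadratic loss over a set of training tasks."""
    return sum((com_em(z, q) - com_ef(z)) ** 2 for z in training)

training = np.linspace(-1, 1, 21)   # stand-ins for Clifford training results
q_naive = (1.0, 0.0)                # no mitigation: loss stays non-zero
q_opt = (1 / (1 - p), 0.0)          # inverts the toy channel: loss vanishes
assert loss(q_naive, training) > 1e-3
assert loss(q_opt, training) < 1e-12
```

Note that the optimal weight 1/(1 − p) exceeds 1: this is the hallmark of a quasi-probability rather than a probability distribution.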

III. KEY PROPERTIES
An optimal distribution q(P) that works for all R exists if single-qubit gates are ideal. The error-free computation result can be expressed as (see Appendix A)

com_ef(R, P) = Tr(S_S R P_L F_ef P_R).     (4)

Each term in the trace brackets is a map on n(N+1) qubits: F_ef is a tensor that describes the effect of all error-free frame operations, R represents the computing gates, P_L and P_R respectively represent the error-mitigating gates in odd and even layers, and S_S is a swap map. Here, [U](•) = U • U† is the completely positive map of the operator U, G_0^ef(•) = ρ_i^ef Tr(E_f^ef •) describes the qubit initialisation and measurement, and ρ_i^ef and E_f^ef are respectively the error-free initial state and measurement operator. The actual computation result with errors can be expressed in the same form,

com(R, P) = Tr(S_S R P_L F P_R),     (5)

where F describes the effect of all frame operations with errors. In general, F cannot be written as a tensor product similar to F_ef, specifically in the presence of correlated errors. The error-mitigated computation result is com_em(R, I) = Tr(S_S R F_em), where F_em = Σ_P q(P) P_L F P_R. Therefore, the error is zero for all R if q(P) is a solution of the equation F_em = F_ef. The solution always exists if, for every non-zero element of F_ef, the corresponding element of F is also non-zero in the Pauli transfer matrix representation [27,28]. It is very unlikely that this condition fails to hold, especially when the error rate is low. See Appendix B for the proof.
The training set T = C is sufficient for finding an optimal distribution q(P) that works for all R. The set C contains all Clifford R. A single-qubit unitary map [R_j] can be written as a linear combination of single-qubit Clifford maps [7]. For an arbitrary R, we have [R] = Σ_{R′∈C} α_{R,R′} [R′], where the α_{R,R′} are coefficients. See Appendix C for details. Therefore, if we find the optimal distribution q_opt(P) such that Loss = 0 with T = C, the error is zero for all Clifford and non-Clifford R after the error mitigation.
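As a concrete numerical check of the claim that any single-qubit unitary map decomposes over Clifford maps, consider the non-Clifford rotation R_z(θ). A short calculation on the Pauli basis gives real coefficients for the Clifford maps [I], [S] and [Z]; the sketch below (our own example, not from the paper's appendix) verifies the decomposition with superoperators.

```python
import numpy as np

# Verify: [R_z(theta)] = a[I] + b[S] + c[Z] for a non-Clifford angle theta.
I2 = np.eye(2, dtype=complex)
Z = np.diag([1.0 + 0j, -1.0])
S = np.diag([1.0 + 0j, 1j])  # phase gate

def umap(U):
    """Superoperator (column-stacking convention) of the map rho -> U rho U†."""
    return np.kron(U.conj(), U)

theta = 0.7  # an arbitrary non-Clifford angle
Rz = np.diag([np.exp(-1j * theta / 2), np.exp(1j * theta / 2)])

# Coefficients found by matching the action on the Pauli operators X, Y, Z
a = (1 + np.cos(theta) - np.sin(theta)) / 2
b = np.sin(theta)
c = (1 - np.cos(theta) - np.sin(theta)) / 2

assert np.allclose(umap(Rz), a * umap(I2) + b * umap(S) + c * umap(Z))
```

Since the coefficients can be negative (here c < 0 for small θ), sampling such a decomposition again requires quasi-probabilities.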
We remark that these two properties are proved under the condition of ideal single-qubit unitary gates but do not depend on the error model of frame operations. When single-qubit-gate errors are gate-independent, the proofs still hold after some adaptation. The protocol works for all Pauli, damping and coherent, uncorrelated and correlated errors.

IV. PRACTICAL ISSUES
The spaces of computing gates R and error-mitigating gates P increase exponentially with the circuit size. Therefore, it is impractical to compute Error(R) for every training circuit R ∈ C and to optimise the quasi-probability q(P) of each P. There are two approaches for the practical implementation, as follows.
In the first approach, we truncate the spaces of training circuits and error-mitigating gates. We then require some rationale for choosing truncated sets that can be expected to be effective. This can be called the significant-error approach. An effective strategy is to consider the Pauli error model. General errors can be converted into Pauli errors using Pauli twirling, which will be discussed later. Pauli errors are erroneous Pauli gates. Usually only a small subset of Pauli errors is significant, and these can be corrected by the corresponding error-mitigating gates. Letting SigE be the set of significant Pauli errors, we can take the quasi-probabilities q(P) for P ∈ SigE as optimisation parameters and set the rest, q(P) for P ∉ SigE, to zero. Then, the number of optimisation parameters is the same as the number of significant errors, which usually increases polynomially with the circuit size. Similarly, we choose a selected subset T ⊂ C as the training set. Later, we will show numerical evidence that the error mitigation works well when the size of T is three times the size of SigE.
In the second approach, instead of truncating the spaces of training circuits and error-mitigating gates, we consider an error ansatz whose distribution admits a product form. That is, we consider a case where each significant Pauli error has its own independent quasi-probability distribution that we optimise. An application of error mitigation then consists of applying chains of individual significant errors, with a corresponding quasi-probability distribution described as a product of independent quasi-probabilities for each significant error.
Finally, we may generalise the previous approach. We parameterise the quasi-probability distribution as a variational function and compute the loss using the Monte Carlo method. We take q(P) ∝ B(P, λ), where λ denotes a set of parameters that determine the distribution. Here, B(P, λ) can be any real-valued function describing the distribution on the large space of P while using only a relatively small number of parameters λ, e.g. a restricted Boltzmann machine [29,30]. Instead of the truncated training set, we can use the full set of Clifford circuits, i.e. T = C. The loss function can be efficiently computed using Monte Carlo summation. We find that the sampling cost scales polynomially with respect to the accuracy of the Monte Carlo summation, regardless of the size of C and of the space of P.
All three approaches will be discussed in this paper. We note that one can combine the approaches in different ways in a practical implementation. For example, we can use the significant-error approach to parameterise the distribution and use Monte Carlo summation to evaluate the loss function. Having obtained an optimised quasi-probability distribution q_opt(P) by any such method, we can implement the error-mitigated computation by using either the truncated space of error-mitigating gates or the Monte Carlo method.

V. MULTIPLE OBSERVABLES AND FIDELITY LOSS
So far, we have only considered the case of one observable f(µ). In some algorithms, e.g. the variational quantum eigensolver [18], we need to measure multiple observables. The loss function can be generalised accordingly. Let Loss_f be the loss of the observable f(µ). Then, we can take the loss of N_o observables f_1, ..., f_{N_o} as

Loss = Σ_{i=1}^{N_o} Loss_{f_i}.

A further option is to base the cost function on the output state fidelity, a measure of correctness that is independent of the observable. We can also use the fidelity to find an optimal quasi-probability distribution. Let |ψ(R)⟩ be the ideal final state (just before the measurement) of the circuit with gate sequences R and P = I. The quadratic fidelity loss function reads

Loss = Σ_{R∈T} [1 − F(R)]²,

where F(R) = ⟨ψ(R)|ρ_em(R)|ψ(R)⟩, the error-mitigated state is ρ_em(R) = Σ_P q(P) ρ(R, P), and ρ(R, P) is the actual noisy final state of the circuit with gate sequences R and P. We remark that F(R) is a pseudo-fidelity, because ρ_em(R) may not be positive. The training circuit R ∈ C is Clifford; therefore |ψ(R)⟩ is a stabiliser state [21]. Supposing S_R is the stabiliser group of the state |ψ(R)⟩, we have (see Appendix D)

F(R) = 2^{-n} Σ_{g∈S_R} Tr[g ρ_em(R)].

By measuring the group elements g, which are Pauli operators with ± signs, we can evaluate the fidelity and then the loss function. Compared with the loss of one observable, the fidelity loss has an additional summation over the stabiliser group, which can be realised using the Monte Carlo method.
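The stabiliser-group expression for the fidelity can be checked on a small example. The sketch below (our own, not from the paper) uses the two-qubit Bell state, whose stabiliser group is {II, XX, −YY, ZZ}, and a toy noisy state; averaging the stabiliser expectation values reproduces the direct overlap fidelity.

```python
import numpy as np

# Check F = 2^{-n} * sum_{g in S} Tr(g rho) for a 2-qubit stabiliser state.
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.diag([1.0 + 0j, -1.0])
I2 = np.eye(2, dtype=complex)

bell = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)
rho_ideal = np.outer(bell, bell.conj())
rho = 0.9 * rho_ideal + 0.1 * np.eye(4) / 4   # a toy noisy final state

# Stabiliser group of the Bell state, signs included
stabilisers = [np.kron(I2, I2), np.kron(X, X), -np.kron(Y, Y), np.kron(Z, Z)]
F_stab = sum(np.trace(g @ rho).real for g in stabilisers) / 2 ** 2
F_direct = (bell.conj() @ rho @ bell).real

assert np.isclose(F_stab, F_direct)  # both equal 0.925 for this rho
```

In practice each Tr(g ρ) is obtained by measuring the Pauli operator g, so the sum over the group is exactly the quantity accessible on hardware.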
To measure the operators g, usually we need to change the measurement basis. Given the physical measurement setup in the Z basis, we can effectively change the basis by adding single-qubit Clifford gates before the measurement, i.e. another layer of computing gates. We remark that single-qubit gates next to each other in the circuit can be combined into one single-qubit gate in the physical implementation. Therefore, an additional layer of computing gates does not increase the physical complexity.

VI. PAULI ERROR MODEL
In this section, we discuss the Pauli error model, which is the underlying picture of the protocol. By using the error-mitigating gates, we can convert general errors into Pauli errors. In the Pauli twirling method, stochastic Pauli gates are implemented before and after a Clifford gate. Because the gate is Clifford, the two sets of Pauli gates cancel each other if they are properly chosen. Therefore, the Clifford gate is unchanged if it is error-free, but the noise is symmetrised. In Eq. (5), we have Pauli gates before and after the frame-operation tensor, i.e. P_L F P_R, which is similar to the setup of Pauli twirling of a Clifford gate. Note that F_ef is a tensor product of Clifford gates except for G_0^ef. Errors in the qubit initialisation and measurement, i.e. G_0, can also be converted into Pauli errors. See Appendix E for details.
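The symmetrising effect of twirling can be seen already on one qubit. In the sketch below (our own illustration), a coherent over-rotation exp(−iθX/2) is averaged over conjugation by the Pauli group; the result is exactly the stochastic Pauli channel (1 − p)[I] + p[X] with p = sin²(θ/2).

```python
import numpy as np

# Twirl a coherent X over-rotation into a stochastic Pauli-X channel.
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.diag([1.0 + 0j, -1.0])

def umap(U):
    """Superoperator of rho -> U rho U† (column-stacking convention)."""
    return np.kron(U.conj(), U)

theta = 0.4
E = np.cos(theta / 2) * I2 - 1j * np.sin(theta / 2) * X  # coherent error

# Average over conjugation by the single-qubit Pauli group
twirled = sum(umap(P) @ umap(E) @ umap(P) for P in (I2, X, Y, Z)) / 4

p = np.sin(theta / 2) ** 2
pauli_channel = (1 - p) * umap(I2) + p * umap(X)
assert np.allclose(twirled, pauli_channel)
```

The twirl removes the off-diagonal (coherent) parts of the error's Pauli transfer matrix, which is what justifies assuming a purely Pauli error model below.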
In the following, we assume that errors are Pauli for simplification. We use [σ_1] to denote the initialisation error, which occurs after the qubit initialisation; [σ_{2j+1}] to denote the error of the j-th layer frame gate, which occurs after the corresponding frame gate G_j; and [σ_{2N+2}] to denote the measurement error, which occurs before the measurement. Here, the σ_j are n-qubit Pauli operators, with σ_1, σ_{2N+2} ∈ {I, X}^⊗n and σ_3, ..., σ_{2N+1} ∈ {I, X, Y, Z}^⊗n. Referring to Fig. 1(b) and its obvious generalisation to deeper circuits, we can understand [σ_j] as the j-th layer of Pauli gates describing errors.
We can use σ = σ_1 ⊗ σ_3 ⊗ ··· ⊗ σ_{2N+1} ⊗ σ_{2N+2} to describe the pattern of Pauli errors distributed in spacetime. If the probability of σ is p(σ), the error model can be written as a map N = Σ_σ p(σ)[σ]. Usually, there is an inverse map of N, which can be written as

N⁻¹ = Σ_σ q(σ)[σ].

The distribution q(σ) is a solution of the equation F_em = F_ef and, therefore, can correct all errors for all R. With the quasi-probability q(σ), we take the j-th layer of error-mitigating gates as [σ_j], where j = 1, 3, ..., 2N+1, 2N+2; error-mitigating gates in the other layers are set to identity.
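A minimal instance of this inverse map, on a single qubit (our own example), is the depolarising channel: the inverse is again a linear combination of the Pauli maps, but its weights are quasi-probabilities, with the non-identity terms negative.

```python
import numpy as np

# Inverse of a single-qubit depolarising channel as a quasi-probability sum.
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.diag([1.0 + 0j, -1.0])

def umap(U):
    """Superoperator of rho -> U rho U†."""
    return np.kron(U.conj(), U)

p = 0.05
N = (1 - p) * umap(I2) + (p / 3) * (umap(X) + umap(Y) + umap(Z))

f = 1 - 4 * p / 3            # Bloch-vector shrinking factor of the channel
q_I = (1 + 3 / f) / 4        # > 1
q_P = (1 - 1 / f) / 4        # < 0: a genuine quasi-probability
N_inv = q_I * umap(I2) + q_P * (umap(X) + umap(Y) + umap(Z))

assert np.allclose(N_inv @ N, np.eye(4))
assert q_P < 0  # negative weights become sign flips when sampling circuits
```

The overhead cost of sampling this inverse is C = |q_I| + 3|q_P| > 1, which is the price paid for unbiased mitigation.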
We can observe that, if the error model is Pauli, error-mitigating gates in layers j = 2, 4, ..., 2N are not in fact needed. These layers are only used for general Pauli twirling.

VII. SIGNIFICANT-ERROR APPROACH
The number of terms in the inverse map N⁻¹ = Σ_σ q(σ)[σ] increases exponentially with the circuit size, and so naively we would require an optimisation over exponentially many quasi-probabilities q(σ), which is impractical. In this section and the next three we describe three approaches to practically implement our protocol and provide convincing numerical and quantum-hardware experiments with various error models, circuits and tasks.
As mentioned in Section IV, one approach is to assume a Pauli error model N ≈ Σ_{σ∈SigE} p(σ)[σ], where SigE is the set of significant errors including the trivial error (i.e. the identity operator). Probabilities of other errors are negligible. If p(σ) ≪ 1 for all non-trivial errors, the inverse map is approximately N⁻¹ ≈ Σ_{σ∈SigE} q(σ)[σ], which is used as the ansatz in the learning process. This leaves us with a truncated set of optimisation parameters q(σ) and, by choosing an appropriate construction of the set SigE, it may be truncated to a degree where the number of parameters scales polynomially with the circuit size.
An example construction of a polynomially scaling set SigE, which we have used in our numerical simulations, is as follows: 1. Use gate set tomography to find the naive initialisation, measurement and two-qubit gate errors, and calculate the respective quasi-probabilities q(σ)_ini for all σ, assuming that the error model for the whole circuit is composed only of combinations of these Pauli errors. Here by 'two-qubit gate errors' we mean the error model inferred by an experimentalist purely from tomography of the two-qubit gate mechanism operating on two otherwise-isolated qubits. This task is tractable but will fail to capture the spatial (e.g. cross-talk to other qubits) and temporal correlations that will generally occur in the real, comprehensive noise model. Our learning procedure will then adapt the mitigation to encompass these more complex errors. Because the quasi-probabilities eventually used in the error mitigation are determined in the learning, a highly accurate gate set tomography is not required for this initialisation.
In our numerical simulations, we assume that the gate set tomography is accurate, up to the neglected time dependence and correlations, in order to compare with the learning-based approach. According to the quasi-probability decomposition, the error-correcting gate set is {σ | q(σ)_ini ≠ 0}. This set, however, still scales exponentially with the circuit size. This step draws parallels with the protocol introduced in the original probabilistic error cancellation works [6,7].
2. To restrict ourselves to a polynomially scaling set SigE, we truncate the error-correcting gate set by keeping only errors up to a constant order k, i.e. in any given instance σ there will be error-mitigating gates (P_1, P_2, ...) associated with at most k of the two-qubit gates. A straightforward extension would be to encompass the initialisation and measurement phases too, in order to adapt to correlated errors occurring there, but for our numerical simulations we focus on noise associated with the two-qubit operations.
Similarly, the loss function can be estimated by truncating the complete training set C. We show numerically that a randomly selected subset (i.e. truncated training set) T ⊂ C, of a size comparable to c|SigE| for some overhead constant c, is adequate for the learning process.
After the truncations, the loss function becomes

Loss = Σ_{R∈T} [Σ_{σ∈SigE} q(σ) com(R, σ) − com_ef(R, I)]².     (8)

Since the sizes of T and SigE scale polynomially with the circuit size, we may evaluate com_ef(R, I) for all R ∈ T and com(R, σ) for all R ∈ T, σ ∈ SigE using classical and quantum hardware, respectively. Finally, we optimise the truncated quasi-probabilities q(σ) using the method of least squares (see Appendix N 1). Error-mitigated computation with any circuit R is then implemented using q_opt(σ), with the error-mitigation overhead cost C = Σ_σ |q(σ)|. As previously mentioned, this can be implemented either by estimating each com(R, σ) for σ ∈ SigE or by Monte Carlo summation over SigE.
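Because the loss is quadratic in q(σ), its minimisation is an ordinary linear least-squares problem. The sketch below uses random toy data standing in for the hardware runs (rows index training circuits in T, columns index errors in SigE); the data dimensions follow the |T| = 3|SigE| rule of thumb from the text, but the numbers themselves are invented for illustration.

```python
import numpy as np

# Least-squares step of the significant-error approach (toy data).
rng = np.random.default_rng(0)
n_train, n_sig = 30, 10            # |T| = 3 |SigE|, as suggested in the text

# A[r, s] = com(R_r, sigma_s): noisy device results for each training circuit
# and each significant error. b[r] = com_ef(R_r, I): exact Clifford values.
A = rng.normal(size=(n_train, n_sig))
q_hidden = rng.normal(size=n_sig)  # pretend-perfect quasi-probabilities
b = A @ q_hidden                   # consistent toy targets

# Minimise sum_R (sum_sigma q(sigma) com(R, sigma) - com_ef(R, I))^2
q_opt, *_ = np.linalg.lstsq(A, b, rcond=None)

overhead = np.abs(q_opt).sum()     # error-mitigation cost C = sum_sigma |q|
assert np.allclose(q_opt, q_hidden)
```

In a real run, A is noisy and the system is overdetermined, so the least-squares solution is a best fit rather than an exact recovery; the overdetermination is precisely what averages out shot noise in the training data.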
Alternative ways to parameterise the quasi-probability distribution will be discussed later.

Numerical simulations
We present demonstrations of learning-based quantum error mitigation using the significant-error approach discussed above. We use exact classical simulations of quantum computers with 8 qubits and certain practically-motivated correlated error models. Our simulations are performed using QuESTlink, a Mathematica library which integrates the framework of the Quantum Exact Simulation Toolkit (QuEST) [31,32]. The circuits are n = 8 qubits wide and N = 8 layers deep, for a total of 100 gates in a pattern following Fig. 1(a), where all two-qubit gates are controlled-NOT gates. See Appendix G for the detailed circuit.
We test our error-mitigation scheme with two distinct correlated Pauli error models, one representing spatially and the other temporally correlated noise. In both of these models the local noise, i.e. the noise afflicting the two qubits that are nominally involved in the gate, is homogeneous (dephasing or depolarising) and is assumed to be fully characterised by the experimentalist (either by gate set tomography or pre-existing knowledge). No such assumption is made for the correlated part of the error model. For the detailed models, please refer to Appendix H.
The set of significant errors SigE is generated from the knowledge of the local noise model and truncated to the k = 1 order (|SigE| = 85 or 421 for the dephasing or depolarising noise model, respectively). In the loss function we use the deviation from the ideal expectation value of the observable Z = diag(1, −1) on the first qubit. The distribution q_opt(σ) is found as indicated above in this section using |T| = 3|SigE| filtered, randomly generated Clifford circuits (see Appendix I), where we have chosen the Clifford overhead constant c = 3 (see Appendix J for our rationale).
For a full assessment of the approach, we generate 500 pseudo-random circuits that satisfy |⟨Z_1⟩_ef| > 0.3 to represent a variety of computational tasks. The restriction to cases with substantial |⟨Z_1⟩_ef| focuses us on cases where noise can be fully impactful: typically the effect of unmitigated noise is to decrease expected values, and thus if a randomly generated circuit happens to produce an expected value close to zero even with zero noise, then the impact of noise will be minimal. This would obfuscate the performance difference between schemes that provide good mitigation and those that do not.
Each circuit is formed by drawing its single-qubit computing gates randomly from the circular unitary ensemble. Having performed the learning-based error mitigation once, we apply the same optimised solution to all 500 circuit instances. For direct comparison to earlier work, we execute each circuit M = 10000 times, selecting an appropriate σ ∈ SigE probabilistically and simply recording a +1 or −1 for the observable Z_1 in each case (inverted if the sign of q(σ) is negative). In this way we obtain ⟨Z_1⟩_em as a fairly sampled instance of the value that an experimentalist estimates after M samples. We record the absolute deviation for that circuit, and repeat the process for the alternative strategies (tomographic mitigation and no mitigation), before moving to the next of the 500 circuits. The results are displayed in Figs. 2 and 3. In the figures, the label 'tomography-based error mitigation' refers to the case where the experimentalist has knowledge only of the local error model (i.e. the errors that directly afflict the two qubits nominally involved in a gate) and samples according to q(σ)_ini for all σ ∈ SigE, generated with k = 2 and with the same sample size M.
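The sampling procedure just described can be sketched directly: draw σ with probability |q(σ)|/C, record a ±1 outcome, flip its sign when q(σ) < 0, and rescale the average by C. The quasi-probabilities and per-σ expectation values below are invented toy numbers, not the paper's data; the stand-in `outcomes` line plays the role of the circuit executions.

```python
import numpy as np

rng = np.random.default_rng(42)

q = np.array([1.12, -0.05, -0.04, -0.03])    # hypothetical optimised q(sigma)
means = np.array([0.50, 0.20, 0.10, -0.30])  # toy noisy <Z_1> for each sigma
C = np.abs(q).sum()                          # overhead cost
probs = np.abs(q) / C                        # sampling distribution
signs = np.sign(q)

M = 200_000
idx = rng.choice(len(q), size=M, p=probs)
# each execution yields +1 with probability (1 + <Z_1>)/2, else -1
outcomes = np.where(rng.random(M) < (1 + means[idx]) / 2, 1, -1)
z_em = C * np.mean(signs[idx] * outcomes)

target = q @ means  # the deterministic weighted sum being estimated
assert abs(z_em - target) < 0.02
```

The shot-noise standard deviation of this estimator is roughly C/√M, so the overhead cost C > 1 translates directly into extra circuit executions for a given accuracy.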
The results presented here are for multi-parameter learning, i.e. the elements q(σ) are independently adjusted during the learning process. In Appendix K we include results where the optimisation of q(σ) is constrained to a single adjustable parameter ε, which describes the severity of the local noise; q_opt(σ) is then completely defined just by ε_opt. Note that ε_opt is not necessarily equal to the severity of the local noise found from two-qubit tomography. From the results we can see that such an optimisation strategy yields no better results than tomography-based error mitigation with q(σ)_ini generated with k = 2, which is slightly above the lower bound on performance set by tomography-based error mitigation with q(σ)_ini generated with k = 1. However, for sufficiently random circuits and observables, we can expect its performance to increase beyond that of tomography-based error mitigation.

FIG. 3. Empirical cumulative distribution function of the estimated ∆⟨Z_1⟩ for 500 pseudo-random circuits with temporally correlated dephasing noise. Results for circuits without error mitigation (black), with tomographic error mitigation (red) and with learning-based error mitigation (green) are presented. Additionally, we include the results for learning-based error mitigation when the sample size M → ∞ (dashed green).
To further test our protocol, we apply it to a hardware-efficient variational circuit presented in Appendix L. The circuit has 8 qubits and consists of 8 layers of random single-qubit rotations around the y axis of the Bloch sphere and two-qubit controlled-Z gates, and we wish to extract the expectation value of the σ_Z observable on the bottom qubit, which we denote ⟨Z_1⟩. Qubits are assumed to be laid out in a cycle graph pattern, such that a local two-qubit gate may be applied between qubits i and i + 1 mod n or i − 1 mod n.
To this circuit we introduce an error model that mimics the errors of current NISQ devices more closely than the previous one: single-qubit gates are considered error-free relative to two-qubit gates, but are followed by relaxation with a small probability γ, while two-qubit gates are followed by an error channel D(ε) (Eq. 9) that mixes a reduced dephasing channel D*_Ph with a depolarising channel D*_Pol. Here η is the noise bias between the two, with η = 0 describing a fully depolarising channel and η = ∞ describing a fully dephasing channel. In our simulations we use η = 10, ε = 0.01 and γ = 0.001. Our protocol is particularly powerful at dealing with correlated noise. To that end, similarly to the previous numerical study, we introduce additional cross-talk errors that often go unnoticed in local tomographic noise characterisation. We simulate these errors by an error channel D = D(ε/10) which occurs after each two-qubit gate (and its respective error channel described above), acting between each qubit involved in the two-qubit gate and any locally connected qubit that is not involved (see Fig. 4 for the full error cycle after each controlled-Z gate; for completeness we also show the errors after every single-qubit gate layer).
FIG. 4. Error model after each set of single-qubit gates (blue) and two-qubit gates (black). The unitary single-qubit gates U may be either single-qubit Clifford gates or arbitrary rotations around the y axis of the Bloch sphere. γ (orange) denotes the amplitude damping channel, while D (red) denotes the biased dephasing and depolarising channel (described in the main text, Eq. 9).
With this error model we generate a single 8-qubit noisy circuit (Appendix L) which satisfies |⟨Z_1⟩_ef| > 0.5, the better to quantify the effect of our error mitigation protocol. We perform the learning part of the protocol to find q_opt(σ) ∀σ ∈ SigE using the significant-error approach, with SigE being the set of two-qubit Pauli gates after each controlled-Z gate, truncated to the k = 1 order. We compare 10^4 error-mitigated expectation values of this circuit to the expectation value ⟨Z_1⟩_ef from an error-free circuit, sampling M = 10^6 shots according to |q_opt(σ)| for each estimate of the expectation value (Fig. 5). For direct comparison we include non-mitigated expectation values as well as expectation values from previous work on probabilistic error cancellation. In the figure, similarly to the previous numerical study, the label 'tomography-based error mitigation' (TEM) refers to the case where the experimentalist has knowledge only of the local error model and she samples according to q_ini(σ) ∀σ ∈ SigE, generated with k = 2 and with the same sample size M. Because the learning set is truncated to the k = 1 order and the circuit involves non-Pauli error processes (relaxation), our protocol does not perfectly mitigate the error, but it offers a substantial improvement over the previously studied tomography-based error mitigation. Notice that the variance of the LBEM approach is lower than that of the TEM approach, while its expectation values are closer to the ideal value. This is because during the learning process the algorithm finds how the result depends on the gate errors; if some error does not affect the computational outcome (in this circuit, some two-qubit gate errors do not affect the resulting expectation value on the bottom qubit), then that error is not corrected, which reduces the total quasi-probability overhead and hence the variance. This is another feature of our protocol.

Performance with a variational quantum algorithm
In order to explore the efficacy of learning-based mitigation using the significant-error approach in a realistic setting, we employed it in the context of a quantum variational algorithm (QVA). The goal of our QVA is to find the ground state energy of a closed chain of four nearest-neighbour interacting spins, labelled 0 to 3, with coupling strength J and local field strengths A_i; here the σ are the Pauli matrices. We chose J = 1 and randomly selected the A values. For the data presented in Fig. 6 we used A = {0.270777, 0.192014, 0.0802803, 0.123018}; other simulations had very similar results. This class of system is believed to be classically hard to simulate as the system size grows [33]. We used a four-qubit 'ansatz circuit' containing 28 gates: 8 two-qubit phase gates and 20 single-qubit rotations (see Fig. 6). Each single-qubit gate was associated with a unique classical parameter (the rotation angle), and the QVA proceeded by adjusting these parameters to minimise the expected energy ⟨H⟩ of the output state, which is therefore the task's cost function. The optimisation method was a canonical gradient descent using the 'parameter shift' method [34,35] to estimate the gradient with respect to each parameter.
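The parameter-shift rule admits a compact sketch. For a gate generated by a Pauli operator, the derivative of an expectation value is exactly the difference of two shifted evaluations; here the closed form ⟨Z⟩ = cos θ for an Ry rotation acting on |0⟩ stands in for a real circuit evaluation, and the function names are illustrative only.

```python
import math

def expval_z(theta):
    # <Z> after Ry(theta)|0> is cos(theta); this closed form stands in
    # for a (noisy, sampled) circuit execution in the QVA loop
    return math.cos(theta)

def parameter_shift_grad(f, theta):
    # For gates generated by a Pauli operator, the gradient is exactly
    # [f(theta + pi/2) - f(theta - pi/2)] / 2
    return (f(theta + math.pi / 2) - f(theta - math.pi / 2)) / 2
```

In the QVA, one such gradient entry is obtained per adjustable rotation angle, and the circuit is executed twice per entry (once for each shift).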
Note that while the circuit noise severity and system size in this task are consistent with currently available quantum hardware 'in the cloud', the very large number of circuit executions required by the QVA makes it cost-prohibitive to use such a device in this context; instead we employed the QuESTlink emulation environment which, as mentioned earlier, has comprehensive and exact noise-modelling capabilities.
The noise model here is similar to the one used in the numerical analysis above, but with higher-severity errors: ε = 0.04 and γ = 0.002 (see Eq. 9 and Fig. 4). The noise severity was increased in order to achieve a higher contrast between the mitigation schemes, given the QVA's remarkably high resistance to general noise (as explored in e.g. [36]). We execute our learning-based QEM algorithm by optimising the quasi-probability distribution of a significant-error ansatz for each term in the Hamiltonian separately. In the learning process SigE is truncated to the k = 1 order and we take c = 3.
We would expect the QVA with learning-based mitigation to far surpass the performance of the same process without any mitigation; therefore, for a more meaningful appraisal, we compare the learning-based method with the most commonly used alternative mitigation method, i.e. 'extrapolation'. In this approach, the desired observables are evaluated both with the lowest possible error rates and with an intentionally boosted error rate, so as to estimate the impact of noise and thus extrapolate to the zero-noise limit. For the present case we assume that the dominant noise type, i.e. the biased mixedness-increasing channel, is fully controllable by the experimentalist in the sense that it can be boosted to any level with perfect accuracy. However, the minor correlated noise contribution is not under the experimentalist's control in this fashion, and is instead fixed.
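For reference, the simplest two-point variant of this extrapolation idea reads as follows. This is a generic linear (Richardson) zero-noise extrapolation sketch, not the specific fitting procedure used in our comparison; the toy noise model in the test is purely illustrative.

```python
def zero_noise_extrapolate(e_base, e_boost, c):
    """Two-point linear extrapolation to the zero-noise limit.

    e_base : expectation value at the native noise level (scale 1)
    e_boost: expectation value with the noise boosted to scale c > 1
    """
    # Fit E(lam) = E0 + a*lam through the two points and evaluate at lam = 0
    return (c * e_base - e_boost) / (c - 1)
```

When the observable really does depend linearly on the noise scale, the extrapolated value is exact; curvature in the true 'result vs. noise' relation is precisely what limits the method in practice.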
The orange line in panel (c) of Fig. 6 shows how the QVA performs when extrapolation-based mitigation is relied upon. The method works reasonably well considering the very high noise burden; the expected value of the output energy falls from an initial +4.58 to −7.57, whereas the true ground state energy of the target system is −8.002. Thus the extrapolation method has an absolute energy defect of 3.4% of the spectral width. The blue line indicates the performance when learning-based mitigation is activated at the point when the extrapolation method becomes slowly evolving. The abrupt downward shift is due to the change in the means of evaluating the energy, i.e. even without changing the ansatz parameters we immediately gain an advantage from switching the energy estimation method. There is then a further period of optimisation; ultimately the energy estimate drops slightly below the true ground state to −8.09, so that the absolute defect is 0.71% of the spectral range. The performance of both methods is quoted in the limit of high sampling, i.e. we presume that the experimentalist is willing to dedicate sufficient repetitions to the process to achieve these optimal trajectories.
It is notable that although the dominant noise component can be perfectly adjusted for extrapolation (an idealisation that favours that technique), and the non-adjustable component is an order of magnitude smaller, the ultimate output of the QVA when using learning-based mitigation is nevertheless nearly five times superior to the extrapolation protocol (achieving a defect of only 0.71% rather than 3.4%).

VIII. PRODUCT-FORM ANSATZ APPROACH
Another practical approach to implementing the learning-based error mitigation protocol is to consider an error ansatz whose distribution admits the product form described below.
We denote the quasi-probability of each Pauli gate P_i ∈ SigE as q_i, and we optimise the q_i in the learning process. According to the product-form ansatz, we have

q_b = ∏_{i=1}^{|SigE|} q_i^{b_i} (1 − q_i)^{1−b_i},

where b = (b_1, b_2, ..., b_|SigE|) is a binary vector, and b_i = 0, 1 denotes that the i-th Pauli gate P_i ∈ SigE is off or on. Here, q_b is the quasi-probability of the Pauli gate configuration P_b = ∏_i P_i^{b_i}. We note that we have used P P′ = (P_1 P_1′, P_2 P_2′, P_3 P_3′, . . .) to denote the product of two Pauli strings, and each P_j P_j′ is a Pauli operator up to a phase that can be ignored. In this approach we always have Σ_b q_b = 1.
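To make the bookkeeping concrete, the following sketch evaluates a product-form quasi-probability in which gate i is 'on' with quasi-probability q_i and 'off' with 1 − q_i. This explicit parameterisation is our assumption for illustration, chosen because it automatically satisfies Σ_b q_b = 1; the names are illustrative.

```python
from itertools import product

def product_form_q(b, q):
    """Quasi-probability of configuration b: gate i contributes a factor
    q_i when b_i = 1 (gate on) and 1 - q_i when b_i = 0 (gate off)."""
    out = 1.0
    for bi, qi in zip(b, q):
        out *= qi if bi else (1.0 - qi)
    return out

q = [-0.1, 0.25, 0.05]                                   # one q_i per gate in SigE
configs = list(product([0, 1], repeat=len(q)))
norm = sum(product_form_q(b, q) for b in configs)        # normalisation: always 1
cost = sum(abs(product_form_q(b, q)) for b in configs)   # overhead C = sum_b |q_b|
```

The overhead factorises as C = ∏_i (|q_i| + |1 − q_i|), which exceeds 1 whenever some q_i is negative.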
Details of evaluating and minimising the loss function, as well as the learning rates used for different circuit sizes, can be found in Appendix M.

Numerical simulations
We demonstrate the product-form ansatz approach by numerically simulating noisy quantum circuits of various sizes with the same layout as in Appendix G. We take the error model to be the spatially correlated depolarising model introduced in Section VII. That is, for each two-qubit gate on qubits i and i + 1, the error rate of the two-qubit depolarising channel on qubits i and i + 1 is ε = 2ε_tot/(Nn) for an n-qubit, N-layer circuit, where the total error rate ε_tot is given in Table I. To add spatially correlated noise, we also apply two-qubit depolarising channels on qubits i − 1 mod n and i, as well as on qubits i + 1 and i + 2 mod n, with the error rate ε/10.
To show the effect of error mitigation, we test the computation accuracy before and after error mitigation using random unitary single-qubit gates. The results are shown in Fig. 7. Because simulating quantum circuits with general single-qubit unitaries is costly, we only benchmark circuits with sizes up to 8 qubits and 8 layers. Note that in these simulations we do not require the pseudo-random circuits to satisfy a threshold on |⟨Z_1⟩_ef|.

Average error rescaling factor
To numerically demonstrate the effect of error mitigation for larger circuits (up to 20 qubits), we test the computation accuracy before and after error mitigation using configurations of Clifford computing gates, such that the circuit can be efficiently simulated on a classical computer. We use the average error rescaling factor to quantify the effect of error mitigation, defined as the computation error after mitigation divided by the error before mitigation, averaged over computing-gate configurations. The error rescaling factor as a function of the circuit size is plotted in Fig. 8. We find that the error rescaling factor does not increase with the circuit size once the size exceeds 9 qubits, indicating the efficient scalability of our protocol. The remaining error after mitigation is mainly due to the statistical fluctuations caused by the finite number of samples in the learning and error-mitigation stages.
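A minimal sketch of how such a benchmark could be computed, assuming the rescaling factor is the mitigated error divided by the unmitigated error averaged over Clifford computing-gate configurations (the function name and data layout are illustrative, not the paper's implementation):

```python
def average_error_rescaling(runs):
    """Average error rescaling factor over a benchmark ensemble.

    runs: list of (mitigated, unmitigated, error_free) computation results,
          one triple per Clifford computing-gate configuration R.
    """
    ratios = [abs(em - ef) / abs(raw - ef) for em, raw, ef in runs]
    return sum(ratios) / len(ratios)
```

A value well below 1 indicates that mitigation shrinks the residual error; a value that stays flat as qubits are added is the scalability signature discussed above.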

IX. VARIATIONAL DISTRIBUTION AND MONTE CARLO EVALUATION
In addition to the summation and product-form ansätze for the Pauli error model described above, we can potentially use variational functions such as the restricted Boltzmann machine to tackle error models with unknown features [37]. The restricted Boltzmann machine [29] can efficiently express distributions over a large state space, and can represent complex-valued wavefunctions by using complex weights [30]; in our case the quasi-probability distribution is real-valued.
FIG. 8. Average error rescaling factor for various circuit sizes. The circuit size is equal to the number of qubits and layers in the circuit. For each circuit layout, we randomly generate 1000 configurations of Clifford computing gates R, chosen such that the error-free computation result is non-zero. For each computing-gate configuration, M = 10^6 random configurations of error-correcting gates P are generated to evaluate the error-mitigated result.
In general, we can express the ansatz in the form q(P) = C B(P, λ)/A(λ), where C and λ are variational parameters that are optimised in the learning process. The function B(P, λ) must be computable on a classical computer, and A(λ) = Σ_P |B(P, λ)| is the normalisation factor. Even if we cannot compute A(λ), samples of the distribution |B(P, λ)|/A(λ) can be efficiently generated using the Metropolis method. The number C = Σ_P |q(P)| is the error-mitigation overhead cost [7]. Once we have the optimal parameters, we can implement the error-mitigated computation using Monte Carlo summation with samples of P generated according to the optimal distribution. The variance of the error-mitigated computation satisfies Var[ĉom_em] ≤ (1/M) |f|²_max C², where M is the number of samples and |f|_max is the maximum value of |f(µ)|. Here, we have assumed that the circuit runs only once (without repetition) for each sample of P.
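The Metropolis step referred to above can be sketched generically: given any non-negative, classically computable weight g(x) ∝ |B(P, λ)|, the chain draws samples from g without ever computing the normalisation A(λ). The function names are illustrative.

```python
import random

def metropolis_samples(g, propose, x0, n_samples, burn_in=500):
    """Markov-chain samples from the distribution proportional to g(x) >= 0.

    g       : unnormalised weight function (normalisation never needed)
    propose : callable x -> candidate next state
    x0      : initial state; burn_in steps are discarded before recording
    """
    x, out = x0, []
    for step in range(burn_in + n_samples):
        y = propose(x)
        # Accept with probability min(1, g(y)/g(x))
        if random.random() < min(1.0, g(y) / g(x)):
            x = y
        if step >= burn_in:
            out.append(x)
    return out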
The Monte Carlo method can also be used to compute the loss function. The loss function is quadratic in C, so it is straightforward to find the optimal C given the value of λ. To find the optimal λ, we usually need to evaluate the loss for different values of λ. Instead of generating samples for each value, we can compute the loss for λ using samples generated according to a different value λ′, which reduces the sampling cost in the learning process. Once the optimal λ is found, we generate samples according to it in order to compute the optimal C. The details of the Monte Carlo summation, including the variance of the loss and the application in the significant-error approach, can be found in Appendix N.
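The sample-reuse trick for evaluating the loss at λ with samples drawn at λ′ is a standard self-normalised importance reweighting; a hedged sketch, with illustrative names:

```python
def reweighted_mean(f, b_new, b_old, samples):
    """Estimate the mean of f under p_new ∝ |b_new| using samples drawn
    from p_old ∝ |b_old|, via self-normalised importance reweighting."""
    w = [abs(b_new(x)) / abs(b_old(x)) for x in samples]  # per-sample weights
    fw = [f(x) * wi for x, wi in zip(w and samples, w)]
    return sum(fw) / sum(w)
```

In the learning loop, `b_old` would be B(·, λ′) for the value λ′ at which the Metropolis samples were generated, and `b_new` would be B(·, λ) for each candidate λ whose loss is being evaluated.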

Two-qubit DQCp circuit
We demonstrate our learning-based QEM protocol on three IBMQ machines: ibmq_5_yorktown, ibmq_ourense and ibmq_santiago. In all three cases we observe an improvement of the computation result when executing the two-qubit DQCp circuit given in Fig. 9. Taking R = e^{−i(θ/2)Z} and P = I, the error-free result (which we take to be the mean of Z of the upper qubit) is com_ef(e^{−i(θ/2)Z}, I) = cos(θ), as shown in Fig. 10. To perform this demonstration we simplified our protocol so as to reduce the number of Pauli gates introduced into the circuit compared to the original protocol, in which whole layers of Pauli gates are used (Fig. 1). This is done by assuming a Pauli error model and noting that all Clifford gates map Pauli errors back to Pauli errors. Hence, we only need to introduce error-correcting Pauli gates before any non-Clifford gate (gate R in Fig. 9). In our circuit only a single Pauli gate is inserted, to correct the error labelled Error-1, while the errors labelled Error-2 and Error-3 either do not impact the computational result or act as a measurement error. The measurement errors can be corrected by modifying the original error mitigation formula, Eq. 1.
For the two-qubit DQCp circuit, the computation result with error mitigation can be written as

com_em(R, I) = Σ_{P∈{I,X,Y,Z}} q(P) com(R, P) + q_0,   (14)

where q_0 is the term associated with the measurement error. Suppose the measurement error can be modelled as follows: the measurement outcome is flipped with probability p_µ when the correct output state is |µ⟩, where µ = 0, 1. According to this model, the mean value ⟨Z⟩_c describing the correct measurement outcome and the mean value ⟨Z⟩_e describing the erroneous measurement outcome obey the simple relation ⟨Z⟩_c = (⟨Z⟩_e + p_0 − p_1)/(1 − p_0 − p_1). Hence, the optimal value of q_0 is (p_0 − p_1)/(1 − p_0 − p_1). We may include the factor 1/(1 − p_0 − p_1) in the quasi-probability distribution q(P): if the quasi-probability distribution for correcting Error-1 is q′(P), we have q(P) = q′(P)/(1 − p_0 − p_1).
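The readout-error inversion above is easy to verify numerically. Below, the forward model implements ⟨Z⟩_e = (1 − p_0 − p_1)⟨Z⟩_c − p_0 + p_1 and the correction inverts it; the function names are our own.

```python
def apply_readout_error(z_true, p0, p1):
    # Forward model: outcome flipped with probability p_mu when the
    # correct output state is |mu>, mu = 0, 1
    return (1.0 - p0 - p1) * z_true - p0 + p1

def correct_readout(z_noisy, p0, p1):
    # Inversion used to define q_0: <Z>_c = (<Z>_e + p0 - p1)/(1 - p0 - p1)
    return (z_noisy + p0 - p1) / (1.0 - p0 - p1)
```

Composing the two functions returns the original value exactly, confirming that the two relations are inverses of each other.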
In the learning part of the algorithm, we ran 24 different circuits on each of the three IBMQ quantum machines to evaluate com(C_i, I), where C_i is one of the 24 single-qubit Clifford gates. By minimising the loss function we obtain the optimal quasi-probability distribution q(P) and the optimal q_0. We note that com(C_i, P) = com(C_i P, I), and that C_i P is itself one of the 24 single-qubit Clifford gates; therefore all com(C_i, P) can be derived from the set {com(C_i, I)}. Since the loss is a quadratic function, the minimisation is straightforward. Next, we test our error mitigation protocol by taking R = e^{−i(θ/2)Z}, where θ = 2mπ/10 and m = 0, 1, ..., 9. The results are shown in Fig. 10. For each machine, we implement 40 circuits to evaluate com(e^{−i(θ/2)Z}, P) with θ taking the ten different values and P = I, X, Y, Z. We write com(e^{−i(θ/2)Z}, I) to denote computation results without error mitigation, which deviate from the error-free values because the quantum hardware is noisy. The error-mitigated results are computed according to Eq. (14), in which we take R = e^{−i(θ/2)Z} and the optimal values of q(P) and q_0 obtained by minimising the loss function. It is clear that the error mitigation reduces the computation error.
Potential causes of residual errors after error mitigation are statistical fluctuations and non-Pauli errors. For each circuit, we run 8192 shots to evaluate the mean value ⟨Z⟩. Pauli twirling is not used in this experiment, i.e. general errors are not converted into Pauli errors. Even so, our error mitigation protocol significantly improves the accuracy of the computation result, which demonstrates the robustness of the learning approach. We remark that error mitigation of the two-qubit DQCp circuit has been demonstrated in Ref. [9], in which gate set tomography is used to work out the quasi-probability distribution. Tomography of a two-qubit gate requires at least 256 circuits, while in our approach only 24 circuits are used in the learning part to determine the error mitigation parameters.

Variational quantum eigensolver
In addition to the two-qubit DQCp circuit, we also experimentally demonstrate our learning-based quantum error mitigation (LBEM) protocol by applying it to the variational quantum eigensolver (VQE) algorithm. We compute the ground state energy of the H_2 molecule at different nuclear separations on the IBMQ machine ibmq_santiago, with and without LBEM. The results are shown in Fig. 11. We find that LBEM significantly improves the accuracy of VQE.
In this demonstration, we compute the ground state energy of H_2 in the minimal (STO-3G) basis, which includes 4 spin-orbitals (each atom contributes the two spin-orbitals {1s_↑, 1s_↓}). The electronic wavefunction is projected onto these 4 spin-orbitals, and we then use the Jordan-Wigner transformation to map fermions to qubits. The corresponding qubit Hamiltonian reads H = H_1 + H_2 [Eqs. (17) and (18)], where H is split into two parts according to the commutation relations between Pauli operators: the Pauli operators in H_1 (H_2) commute with each other, and can therefore be measured using the same circuit. We use Qiskit to calculate the coefficients in Eq. (17) and Eq. (18). Similarly to the two-qubit DQCp circuit, we implement the error-mitigated VQE by randomly inserting a Pauli gate into the UCCSD circuit (unitary coupled cluster ansatz truncated to single and double excitations). We directly adapt the simplified UCCSD circuit given in Ref. [38], shown in Fig. 12(a), parameterised by a single rotation angle of the single-qubit gate R = e^{−i(θ/2)Z}. The Pauli gate P is inserted before the gate R. Without the gates in the dashed box, the circuit can be used to evaluate the mean of H_1, while the gates in the dashed box effectively change the measurement basis and transform H_2 so that the same circuit with those gates included can be used to evaluate the mean of H_2.
There are ten non-trivial Pauli operators in H_1 and four in H_2. We apply LBEM to each Pauli operator individually, i.e. com(R, P) is the mean of one Pauli operator; hence the error mitigation for one Pauli operator is the same as for the two-qubit DQCp circuit: the error-mitigated computation result of the Pauli operator is given by Eq. (14), and the loss function by Eq. (15). At the learning stage, the 24 Clifford gates are similarly implemented in place of the R gate for each of the two circuits, with and without the gates in the dashed box in Fig. 12. The data are used to obtain the coefficients q(P) and q_0; we note that the coefficients are different for each Pauli operator. To demonstrate the effect of LBEM, we take ten different R gates, each of which is the optimal gate in VQE that minimises the mean of the Hamiltonian for a given nuclear separation (Fig. 11).

XI. CONCLUSIONS
In this paper we present a novel way of mitigating quantum errors based on the probabilistic error cancellation technique. We introduce a new learning component which removes the need to reconstruct an error model in the experiment. The learning component exploits the efficient simulability of Clifford circuits and finds the optimal quasi-probability distribution, which then defines the subsequent probabilistic error cancellation step. Numerically, we have shown that the learning-based protocol can be practically implemented for circuit sizes comparable to those currently run on NISQ-era quantum computers. In the presence of correlated noise, it outperforms the tomography-based protocol for which tomography is only available on a small subset of qubits. We confirm that our protocol maintains its high performance on real quantum hardware by running multiple experiments on IBM quantum devices. Different tactics may be employed for learning the optimal quasi-probability distribution, depending on the quantum device at hand and the required computations. For example, if one wants to evaluate multiple observables of a computation, then the learning process needs to include these observables. Using fidelity as the cost function is also a valid strategy: fidelity is not difficult to estimate for Clifford circuits, and the error-mitigated expectation value can then be estimated for any observable. Similarly, procedures for estimating mean values of functions can be specifically tailored, for example in cases where modifying the circuit between consecutive runs is difficult or expensive.
Possible extensions to this work include specifically modifying the learning component of the protocol in cases where some information about the noise model is given or easily accessible, for example a scenario where the whole circuit undergoes an unknown global phase shift. Another extension would be to sample Clifford circuits according to the unitary circuits they replace, leading to circuit-specific error mitigation. The same learning approach based on Clifford circuit sampling can also be applied to finding the optimal physical parameters of a quantum computing system.
A number of quantum error mitigation techniques have of course already been proposed in the literature. Each of these protocols has its advantages and disadvantages; in some cases our method represents an alternative, and in other cases the methods can be concatenated. For example, given complete knowledge of the noise processes in a quantum system, one can theoretically compensate perfectly for errors using the quasi-probability error mitigation technique [6,7]. In the previous literature (see [7]), the required decomposition formula is worked out using gate set tomography, which in practice yields high performance only when the error correlations are negligible. The present protocol is an alternative in which adaptive learning is shown to be capable of replacing the exhaustive tomography, achieving a near-ideal outcome at profoundly reduced cost.
Another common quantum error mitigation protocol is the noise extrapolation technique, which can efficiently suppress errors by boosting the noise and extrapolating to the zero-noise limit [5,6,8]. However, due to the discrepancy between the fitting curve and the genuine 'computation result vs. noise' curve, and the challenge of homogeneously boosting noise, extrapolation cannot in practice be expected to compensate perfectly for errors [39]. Finally, error mitigation based on symmetry post-selection can correct errors that violate such symmetries; it is a powerful method where such symmetries exist [14,15].
One day, when we use quantum computers to solve meaningful practical problems, we may need to combine different error mitigation protocols in order to achieve the required accuracy. The learning-based approach is flexible and can serve as a framework for integrating different protocols. For example, it can be applied to noise extrapolation, e.g. using the loss function to determine how to choose the fitting curve and how to boost the noise. We can combine noise extrapolation, symmetry-based post-processing and quasi-probability decomposition techniques, and optimise the overall error mitigation strategy by again using the loss function. In this way, the advantages of all the mentioned protocols may be exploited.
Overall, our protocol paves a new way of implementing NISQ-era quantum error mitigation and is especially suitable for remote users without any access to information about the noise model. It is intuitively simple and can be readily implemented on current quantum computers.
Note added. Shortly after the original version of this work was posted online, a related work was posted by P. Czarnik et al. [40]. They exploit 'near-Clifford' circuits to obtain an error-mitigated estimator of some observable for a circuit of interest. Both papers reveal the power of Clifford variants in error mitigation; however, the methods differ fundamentally in the way error mitigation is applied, the noise assumptions made, and the necessity of non-Clifford gates in the learning part of the algorithm.
We consider a circuit with n qubits and N layers of frame gates between the qubit initialisation and measurement. All qubits are initialised in the state |0⟩ at the beginning and measured in the Z basis at the end. Each layer of frame gates is formed of multi-qubit Clifford gates, as shown in Fig. 1(a). Single-qubit unitary gates sit between the frame operations (the qubit initialisation, frame gates and measurement), and we call them computing gates. The frame operations are fixed, but the computing gates are treated as variables. For the error mitigation, single-qubit Pauli gates are introduced before and after each computing gate, as shown in Fig. 1(b).
We call these Pauli gates error-mitigating gates, which are also variables.
We use µ_k = 0, 1 to denote the measurement outcome of the k-th qubit, and µ ≡ (µ_1, µ_2, ..., µ_n) is the binary vector representing the outcomes of all qubits. The task is to compute the mean value of a function f(µ).
We use com(R, P ) to denote the mean value of the function f (µ) given the gate sequences R and P . com ef (R, P ) is the value of com(R, P ) when the entire computing is error-free.
We use q(P ) to denote a quasi-probability function, and the error-mitigated computing result is The error function is The loss function of the computing error is where T is a set of computing gate sequences. The training set T is a subset of Clifford gate sequences, i.e. T ⊆ C ≡ {R | All R j are Clifford}. We use U ≡ {R | All R j are unitary} to denote the set of unitary gate sequences, then T ⊆ C ⊂ U.

Quantum formalism
We use ρ_i^ef ≡ (|0⟩⟨0|)^⊗n to denote the error-free initial state, and [U]• ≡ U • U† to denote the completely positive map of a unitary operator U. If the frame gates are error-free, the overall map of the j-th-layer frame gates is G_j^ef ≡ [G_j], where G_j is an n-qubit Clifford gate, as shown in Fig. 1. We use E_µ^ef ≡ ⊗_{m=1}^n |µ_m⟩⟨µ_m| to denote the error-free POVM operator of the measurement outcome µ.
In our theoretical analysis, we assume that all single-qubit unitary gates are error-free. The overall map of the j-th-layer computing gates is R_j ≡ [⊗_{m=1}^n R_{(j−1)n+m}], and similarly the overall map of the j-th-layer error-mitigating gates is P_j ≡ [⊗_{m=1}^n P_{(j−1)n+m}]. Both R_j and P_j are error-free. The error-free computing result com_ef(R, P) is then obtained by composing these maps with the frame-gate maps G_j^ef, the initial state ρ_i^ef and the error-free measurement [see Fig. 13(a)], where G_{N+1}^ef = [1_S] is the identity map and 1_S is the identity operator of the n qubits. Here, R_j and P_j depend on R and P, respectively.
In order to describe temporally correlated errors, we introduce the environment in addition to the system (i.e. the n qubits in the circuit). We use ρ_i to denote the initial state of the system and the environment, G_j to denote the actual map acting on both the system and the environment for the j-th-layer frame gates, and E_µ to denote the actual POVM operator of the system and the environment corresponding to the measurement outcome µ. We define R_j ≡ R_j ⊗ [1_E] and P_j ≡ P_j ⊗ [1_E], where 1_E is the identity operator of the environment. The computing result with errors, com(R, P), is then given by the analogous composition of these maps [see Fig. 13(b)], where G_{N+1} = [1_S] ⊗ [1_E] is the identity map on both the system and the environment.

Tensor-product representation of quantum circuits
Let {|l⟩} be an orthonormal basis of the Hilbert space. The trace of a map M can be expressed in terms of the swap map S_{1,2} on two copies of the system: writing the identity map with the basis projectors |l_1⟩⟨l_1| and |l_2⟩⟨l_2| together with S_{1,2}, the trace of a product of M maps becomes a contraction of their tensor product with S ≡ S_{1,2} S_{2,3} ⋯ S_{M−1,M}. Here, we label the Hilbert spaces with M, ..., 2, 1 from left to right in the tensor product.
Let S_S be the swap map on the N + 1 systems. As shown in Fig. 14(a), the error-free computing result can then be written as a contraction of the error-free frame-operation tensor with S_S.

Erroneous frame-operation tensor
Similar to the error-free case, let S_E be the swap map on the N + 1 environments, and let S = S_S ⊗ S_E be the swap map on the N + 1 system-environment composite systems. Then the computing result with errors can be written as a contraction with the erroneous frame-operation tensor, which is defined as shown in Fig. 14(b).

Appendix B: Existence of a solution
Using the tensor-product representation, the error-mitigated computing result reads as a contraction with the error-mitigated frame-operation tensor. It is straightforward to prove that the error-mitigated computing is error-free, i.e.

Error(R) = com_em(R, I) − com_ef(R, I) = 0,   (B3)

for all R ∈ U, if there exists a quasi-probability distribution q(P) satisfying the corresponding tensor-level equation. To solve this equation, we introduce the Pauli transfer matrix representation, i.e. we express maps using Pauli operators as the basis of the operator space. Let τ be a Pauli operator of n(N + 1) qubits. The Pauli transfer matrix of an n(N + 1)-qubit map M is

M_{τ_1,τ_2} = 2^{−n(N+1)} Tr[τ_1 M(τ_2)].   (B5)

The Pauli transfer matrix of a Pauli gate is always diagonal, i.e. P_{L(R);τ_1,τ_2} = δ_{τ_1,τ_2} P_{L(R);τ_1,τ_1}, where P_{L(R);τ_1,τ_1} = ±1, so the equation can be rewritten in diagonal form. For the 4^{2n(N+1)} error-mitigating gate sequences P, the corresponding Pauli transfer matrices P_L ⊗ P_R are linearly independent. Therefore, a solution of the equation always exists.
One can check that the Pauli transfer matrices of Pauli gates are linearly independent diagonal matrices by computing the Pauli transfer matrices of single-qubit Pauli gates; those of multi-qubit Pauli gates are tensor products of the single-qubit matrices.

Appendix C: Information completeness
We have proven the existence of a quasi-probability distribution $q(P)$ satisfying The training set $T$ is information complete if $\mathrm{Error}(R) = 0$ for all $R \in U$ whenever $q(P)$ is a solution of $\mathrm{Loss} = 0$. The set $T = C$ containing all Clifford gate sequences is information complete. We only need to consider a subset of $C$, namely $B = \{R \mid R_j \in B_1\} \subset C$, where $B_1$ is a set of ten single-qubit Clifford gates The maps of these ten Clifford gates are linearly independent. An arbitrary single-qubit unitary map $[R]$ can be decomposed as Accordingly, where $\alpha_{R,R'} = \prod_{j=1}^{n(N+1)} \alpha_{R_j,R'_j}$. When $T = C$, $\mathrm{Loss} = 0$ if and only if $\mathrm{Error}(R) = 0$ for all $R \in C$, which means $\mathrm{com}_{\rm em}(R, I) = \mathrm{com}_{\rm ef}(R, I)$ for all $R \in B$. Then, we have for all $R \in U$. Therefore, $T = C$ and $T = B$ are both information complete.
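The underlying counting can be checked numerically: single-qubit unitary channels span a ten-dimensional space of Pauli transfer matrices, so ten linearly independent Clifford maps suffice to decompose any single-qubit unitary map. The sketch below (illustrative; it assumes only that the single-qubit Clifford group is generated by $H$ and $S$ modulo phase) builds the 24-element group and verifies that the span of its Pauli transfer matrices has dimension ten:

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.array([[1, 0], [0, -1]], dtype=complex)
paulis = [I2, X, Y, Z]

H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
S = np.array([[1, 0], [0, 1j]])

def canon(U):
    # remove the global phase: make the first largest-modulus entry real positive
    flat = U.ravel()
    idx = np.argmax(np.abs(flat))
    return U * (np.abs(flat[idx]) / flat[idx])

def key(U):
    # hashable fingerprint; adding complex zero normalises any -0.0 entries
    return (np.round(canon(U), 6) + (0.0 + 0.0j)).tobytes()

# breadth-first closure of <H, S> modulo global phase: 24 elements
group = {key(I2): I2}
frontier = [I2]
while frontier:
    nxt = []
    for U in frontier:
        for G in (H, S):
            V = G @ U
            if key(V) not in group:
                group[key(V)] = V
                nxt.append(V)
    frontier = nxt
cliffords = list(group.values())

def ptm(U):
    return np.array([[0.5 * np.trace(t1 @ U @ t2 @ U.conj().T).real
                      for t2 in paulis] for t1 in paulis])

# dimension of the span of single-qubit Clifford-channel PTMs
rank = np.linalg.matrix_rank(np.array([ptm(U).ravel() for U in cliffords]))
```

The rank is ten: one dimension for the fixed trace-preserving component plus nine for the $3 \times 3$ rotation block, which the octahedral rotations span.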

Appendix D: Fidelity measurement
The Pauli group of $n$ qubits is The stabiliser group is a subgroup of the Pauli group, which reads where $s_i = s_i^\dagger \in \mathcal{P}_n$ are $n$ independent operators, $[s_i, s_j] = 0$ for all $i$ and $j$, and $b_i = 0, 1$ are binary numbers.
The stabiliser state $|\psi_S\rangle$ of the stabiliser group $S$ is the common eigenstate of all generators with eigenvalue $+1$, i.e. $s_i |\psi_S\rangle = |\psi_S\rangle$. The density matrix of the state can be written as For a state $\rho$, the fidelity with respect to the stabiliser state is
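For concreteness, here is a small sketch with the two-qubit Bell state, stabilised by the generators $XX$ and $ZZ$ (an illustrative example, not one from the paper): the density matrix is the product of the projectors $(\openone + s_i)/2$, and the fidelity of a state $\rho$ can be evaluated as the average of $\mathrm{Tr}(\rho s)$ over all stabiliser-group elements.

```python
import numpy as np
from functools import reduce
from itertools import product

X = np.array([[0., 1.], [1., 0.]])
Z = np.diag([1., -1.])

# generators of the Bell-state stabiliser group: s_1 = XX, s_2 = ZZ
gens = [np.kron(X, X), np.kron(Z, Z)]
dim = 4

# |psi_S><psi_S| = prod_i (1 + s_i)/2
rho_S = reduce(np.matmul, [(np.eye(dim) + s) / 2 for s in gens])

def fidelity(rho):
    # F = <psi_S|rho|psi_S> = 2^{-n} * sum over all group elements of Tr(rho s)
    total = 0.0
    for bits in product([0, 1], repeat=len(gens)):
        s = reduce(np.matmul,
                   [np.linalg.matrix_power(g, b) for g, b in zip(gens, bits)],
                   np.eye(dim))
        total += np.trace(rho @ s).real
    return total / 2 ** len(gens)

bell = np.zeros(4)
bell[0] = bell[3] = 1 / np.sqrt(2)     # (|00> + |11>)/sqrt(2)
rho_bell = np.outer(bell, bell)
```

Only expectation values of Pauli operators enter the fidelity, which is why it can be estimated from measurements of the stabiliser generators alone.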

Appendix E: Pauli twirling and error model
We decompose the error-mitigating gates into Pauli-twirling gates and error-correcting gates, i.e.
where the Pauli-twirling gates are and the error-correcting gates are Here, the total frame gate is We define which are $n(N+2)$-qubit Pauli gates. We use to denote an $n(N+2)$-qubit Pauli operator, where the $\sigma_j$ are $n$-qubit Pauli operators. The Pauli-twirling gates are selected from the set To implement the Pauli twirling, we take $q(P) = q_c(P_c)/4^{n(N+1)}$, i.e. $P_t$ is uniformly distributed. Then, we have where the Pauli-error frame-operation tensor reads Here, we have assumed that the measurement is balanced (see Sec. E 1). See Sec. E 2 for the proof. The Pauli errors are denoted by $P_e$; $p(P_e) \ge 0$ is the probability of the error, and $\sum_{P_e \in \{[\sigma] \mid \sigma \in \mathrm{Errors}\}} p(P_e) = 1$.
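The effect of Pauli twirling, converting a general noise channel into a Pauli (stochastic) channel, can be illustrated on a single-qubit amplitude-damping channel (an illustrative channel and damping strength, not taken from the experiments): the twirled channel's Pauli transfer matrix is diagonal, whereas the raw one is not.

```python
import numpy as np

I2 = np.eye(2)
X = np.array([[0., 1.], [1., 0.]])
Y = np.array([[0., -1j], [1j, 0.]])
Z = np.diag([1., -1.])
paulis = [I2, X, Y, Z]

def apply_channel(kraus, rho):
    return sum(K @ rho @ K.conj().T for K in kraus)

# amplitude damping with illustrative strength g
g = 0.3
kraus = [np.array([[1, 0], [0, np.sqrt(1 - g)]]),
         np.array([[0, np.sqrt(g)], [0, 0]])]

def ptm(channel):
    return np.array([[0.5 * np.trace(t1 @ channel(t2)).real
                      for t2 in paulis] for t1 in paulis])

raw = ptm(lambda rho: apply_channel(kraus, rho))

def twirled(rho):
    # average of P N(P rho P) P over the four single-qubit Paulis
    return sum(P @ apply_channel(kraus, P @ rho @ P) @ P for P in paulis) / 4

tw = ptm(twirled)
```

The raw PTM has an off-diagonal entry $g$ (the non-unital part of the damping), while twirling keeps only the diagonal, i.e. a Pauli error model.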

Balanced measurement
We use $X^b = \bigotimes_{m=1}^n X^{b_m}$ to denote an $n$-qubit Pauli operator, where $b = (b_1, b_2, \ldots, b_n)$ is a binary vector. A balanced measurement is defined as a measurement that satisfies For a balanced measurement, we have Under the condition that Pauli gates are error-free, an arbitrary raw measurement with POVM operators $\{E^{\rm raw}_\mu\}$ can be converted into a balanced measurement by randomly applying the gate $X^b$ before the measurement and recording the outcome taking the applied gate into account, i.e. recording the outcome as $\mu$ if the raw measurement outcome is $\mu \oplus b$. As a result, the POVM operators of the effective measurement are One can find that $\{E_\mu\}$ is a balanced measurement.
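A one-qubit sketch of this symmetrisation, with illustrative asymmetric readout-error rates: averaging over the random $X^b$ and relabelling the outcome yields an effective POVM with the symmetry $X E_0 X = E_1$, a property consistent with the balanced condition.

```python
import numpy as np

X = np.array([[0., 1.], [1., 0.]])
e0, e1 = 0.05, 0.12                     # illustrative asymmetric readout errors

# raw POVM of a biased single-qubit measurement
E0_raw = np.diag([1 - e0, e1])
E1_raw = np.diag([e0, 1 - e1])

# randomly apply X^b (b = 0, 1 with probability 1/2) and relabel mu -> mu XOR b
E0 = (E0_raw + X @ E1_raw @ X) / 2
E1 = (E1_raw + X @ E0_raw @ X) / 2
```

The effective POVM still sums to the identity, and the error rates of the two outcomes are equalised to $(e_0 + e_1)/2$.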

Pauli error model
In this section we prove Eq. (E13).
Let $\rho_i = \sum_{a,b} |a\rangle\langle b| \otimes \rho_{E;a,b}$ be the initial state, where $a = (a_1, a_2, \ldots, a_n)$ is a binary vector, and $|b\rangle = X^b |0\rangle^{\otimes n} = \bigotimes_{m=1}^n |b_m\rangle$. Here, the $\rho_{E;a,b}$ are matrices acting on the Hilbert space of the environment and satisfy $\rho_{E;a,b}^\dagger = \rho_{E;b,a}$, $\rho_{E;b,b} \ge 0$ and $\mathrm{Tr}(\rho_E) = 1$, where the initial state of the environment is $\rho_E = \sum_b \rho_{E;b,b}$. By applying the twirling gates, we get the effective initial state For a gate $G_j$, because $G_j^{\rm ef}$ is a unitary map, we can always rewrite it in the form where $\mathcal{N}_j$ is the noise map acting on both the system and the environment, which is completely positive and trace-preserving. By applying the twirling gates, we get the effective gate where the effective noise maps $\mathcal{N}_{E;j,\sigma}$ are completely positive maps acting on the environment, and $\mathcal{N}_{E;j} = \sum_\sigma \mathcal{N}_{E;j,\sigma}$ is trace-preserving. To prove Eq. (E20), we consider a completely positive map acting on one qubit and an ancillary system $A$. The map reads $\mathcal{M} = \sum_K [K]$, with $K = \sum_{P=I,X,Y,Z} P \otimes K_P$, where the $K_P$ are matrices acting on the ancillary system. The effective map under the Pauli twirling reads where $\mathbb{1}_A$ is the identity operator of the ancillary system, and $\mathcal{M}_P = \sum_K [K_P]$ is a completely positive map acting on the ancillary system.
Because $\sum_K K^\dagger K = \sum_K \sum_{P,P'} P P' \otimes K_P^\dagger K_{P'}$, we have $\sum_K \sum_P K_P^\dagger K_P = \frac{1}{2} \mathrm{Tr}_{\rm qubit} \sum_K K^\dagger K$. Therefore, if $\mathcal{M}$ is trace-preserving, $\sum_P \mathcal{M}_P$ is also trace-preserving. By applying this approach to the qubits one by one, we obtain Eq. (E20). Now, we can see that the frame-operation tensor with the Pauli twirling is of the Pauli-error form given in Eq. (E13). For the Pauli error $P_e = [\sigma] = [X^b \otimes \sigma_{2N+1} \otimes \cdots \otimes \sigma_3 \otimes X^a]$, the corresponding error probability is We have $p([\sigma]) \ge 0$, because $\rho_{E;a,a}$ and $E_{E;b,b}$ are positive and the $\mathcal{N}_{E;j,\sigma}$ are completely positive. We also have because $\rho_E$ is normalised, $E_E$ is the identity, and the $\mathcal{N}_{E;j}$ are trace-preserving.
The second step truncates the set SigE by excluding variations with the lowest chance of being selected when randomly picking one of the circuit variations. For example, the circuit with $\sigma_{2j+1} = Z^{\otimes n}\ \forall j$ has an order constant $k = P$: there are non-identity Pauli gates directly after all $P$ two-qubit gates, and all circuits with that order constant have a probability $|\eta_2^P|/\gamma^P$ of being randomly chosen. In this step we exclude all circuit variations with the lowest probabilities up to some order constant $k = z$, meaning that all variations remaining in SigE have at least probability $|\eta_1^{P-z} \eta_2^z|/\gamma^P$ of being implemented. In our simulations we have set $k = 1$.
In this way we neglect the lowest-chance variations of the circuit in the optimisation stage of the protocol, and by doing so we limit the number of quasi-probabilities that we need to optimise to a number that scales polynomially with the circuit size.

Fig. 16: Normalised inner product $q_x \cdot q_{20}/(|q_x||q_{20}|)$ between quasi-probability distributions obtained in the learning part of the protocol, as described in Section VII, with Clifford overhead constant $c = x$ ($q_x$) and $c = 20$ ($q_{20}$). We plot the inner product for different circuit sizes $n \times n$, where $n$ is both the number of qubits and the number of layers of two-qubit gates. Here we have chosen $q_{20}$ as the reference for the other distributions, assuming that $q_{20}$ deviates only marginally from $q_{\rm opt}$, which is obtained from the full training set $T$. This allows us to estimate the impact on errors due to the truncation of the training set and hence lets us choose a suitable Clifford overhead constant $c$ for our simulations.

Appendix J: Clifford circuit overhead
Here we present a numerical study showing the effect of different Clifford overhead constants $c$ for circuits of up to 8 qubits using the significant-error approach (Fig. 16). While our data do not definitively determine the asymptotic scaling of $c$ with the circuit size, they suggest that for small systems a Clifford overhead constant $c = 7$ is sufficient to introduce only negligible errors due to the truncation of the training set $T$. In our simulations we chose $c = 3$, since the error due to finite sampling (shot noise) dominates the error due to the size of the training set, $|T| = 3|{\rm SigE}|$.

Appendix K: Single-parameter optimisation
For single-parameter learning we optimise $q(\sigma)_{\rm ini}\ \forall \sigma \in {\rm SigE}$ generated with $k = 1$, where the optimisation is constrained to a single adjustable parameter. The severity $\epsilon$ of the local noise is chosen as the parameter and, hence, the respective $q(\sigma)$ is then classically derived by inverting the local noise channel (see Appendix F). Finding $q_{\rm opt}(\sigma)$ is equivalent to finding $\epsilon_{\rm opt}$. The lower bound on the performance of single-parameter learning is then set by tomography-based error mitigation with $q(\sigma)_{\rm ini}$ generated with $k = 1$, assuming the optimiser can always find the global minimum.
As an example, here we present results for single-parameter learning compared to multi-parameter learning, for a 7-qubit, depth-7 circuit with the spatially correlated dephasing errors described in Appendices G and H (Fig. 17).

Fig. 17: Empirical cumulative distribution function of the estimated $\Delta_{Z_1}$ for 500 pseudo-random circuits with spatially correlated dephasing noise. Results are shown for circuits without error mitigation (black), with tomographic error mitigation with $k = 1$ (dashed red) and $k = 2$ (red), with single-parameter learning-based error mitigation (orange), and with multi-parameter learning-based error mitigation (green).
The numerical results indicate that single-parameter learning just marginally outperforms its lower bound and is comparable to a tomography-based error mitigation with q(σ) ini generated with k = 2. Results for the other two error models follow suit.
Fig. 19: Values of the loss function during the gradient descent. We label an $n$-qubit, $N$-layer circuit with $n \times N$. For each circuit size, we randomly generate $N = 1000$ computing-gate configurations $R$, and for each computing-gate configuration, we randomly generate $M = 1000$ configurations of the error-correcting gates $P$. Note that the estimated cost function does not converge to zero with a finite number of samples, due to a biased estimator. $\widehat{\mathrm{com}}_{\rm em}(R, I)$ and $\widehat{\mathrm{Loss}}$ are estimators of $\mathrm{com}_{\rm em}(R, I)$ and $\mathrm{Loss}$, respectively.
To minimise the loss function, we compute the gradient of the loss function with respect to the quasi-probability $q$, i.e. $\partial \widehat{\mathrm{Loss}}/\partial q$. Then, we update the quasi-probabilities according to the gradient-descent rule, where $\gamma$ is the learning rate. To make sure that the parameters are updated at a reasonable level, a trick we used for the gradient descent is a dynamical learning rate, and we take a fixed value of $\gamma'$, which is listed in Table II. The decrease of the estimated loss functions is plotted in Fig. 19.
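The update rule can be sketched on a toy quadratic loss of the form appearing in Appendix N, $\mathrm{Loss}(q) = q^{\rm T} a q - 2 b^{\rm T} q + \mathrm{const}$, with a synthetic positive-definite matrix $a$ and vector $b$ (both illustrative placeholders, not data from the experiments). Plain gradient descent with a fixed learning rate below the stability limit recovers $q_{\rm opt} = a^{-1} b$:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # number of quasi-probabilities (illustrative)
A = rng.normal(size=(d, d))
a = A @ A.T + np.eye(d)                 # synthetic positive-definite "a"
b = rng.normal(size=d)                  # synthetic "b"

def grad(q):
    # gradient of the quadratic loss  q^T a q - 2 b^T q + const
    return 2 * (a @ q - b)

# fixed learning rate chosen below the stability limit 1/lambda_max
gamma = 0.9 / (2 * np.linalg.eigvalsh(a).max())
q = np.zeros(d)
for _ in range(5000):
    q = q - gamma * grad(q)

q_opt = np.linalg.solve(a, b)           # closed-form minimiser for comparison
```

In the actual protocol the gradient itself is a Monte Carlo estimate, so each update is noisy; the quadratic toy model only illustrates the deterministic part of the dynamics.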

Appendix N: Monte Carlo summation
We consider two cases. In the first case, the quasi-probability $q(P)$ is non-zero only if $P \in {\rm SigE}$, where SigE is the set of significant Pauli errors including the trivial error $I$, and the value of each $q(P)$ is a variational parameter, i.e. the number of parameters is $|{\rm SigE}|$. In the second case, the quasi-probability is expressed as where $B(P, \lambda)$ is a real-valued function with an explicit and computable expression, and $A(\lambda) = \sum_P |B(P, \lambda)|$.
Here, $\lambda$ and $C$ are variational parameters: $\lambda$ is a set of parameters that determine the distribution, and $C = \sum_P |q(P)|$ is a real number that represents the overhead cost of the error mitigation. Let $f$ be the measurement outcome of the quantum circuit specified by $R$ and $P$, with distribution $\mathrm{Pro}(f|R, P)$. Then, the computing result, i.e. the mean value of $f$, reads Similarly, the error-free computing result can be expressed as In the Monte Carlo summation, the distribution $\mathrm{Pro}(f|R, P)$ is realised using the quantum computer, and all other distributions, including $\mathrm{Pro}_{\rm ef}(f|R, P)$, are realised on the classical computer.
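A minimal sketch of the Monte Carlo summation in the first case: sample the gate sequence $P$ with probability $|q(P)|/C$, run the corresponding circuit to obtain an outcome $f$, and average $C\,\mathrm{sgn}(q(P))\,f$. The quasi-probabilities and outcome means below are illustrative numbers, and the quantum circuit is replaced by simulated shot noise:

```python
import numpy as np

rng = np.random.default_rng(1)

# illustrative quasi-probabilities over four gate sequences P (sum to 1)
q = np.array([1.3, -0.2, -0.08, -0.02])
# illustrative noiseless means com(R, P) for each sequence
means = np.array([0.4, 0.9, -0.3, 0.7])
exact = q @ means                       # the target error-mitigated result

C = np.abs(q).sum()                     # sampling overhead of the mitigation
probs = np.abs(q) / C
signs = np.sign(q)

n_shots = 200_000
idx = rng.choice(len(q), size=n_shots, p=probs)
# stand-in for running circuit variant idx: outcome f with the right mean
f = means[idx] + rng.normal(scale=0.5, size=n_shots)
estimate = C * np.mean(signs[idx] * f)
```

The estimator is unbiased, but its variance is amplified by $C^2$ relative to an unmitigated average, which is the familiar sampling cost of quasi-probability methods.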

Significant-error parametrisation
We consider the first case. The error-mitigated computing result reads $\mathrm{com}_{\rm em}(R, I) = \sum_{P \in {\rm SigE}} q(P)\, \mathrm{com}(R, P)$ (N4). Now, we consider the loss function, which is The optimal quasi-probability distribution is where $q$ is a $|{\rm SigE}|$-dimensional column vector with elements $q(P)$, $a$ is a $|{\rm SigE}| \times |{\rm SigE}|$ matrix with elements $a_{P,P'}$, and $b$ is a $|{\rm SigE}|$-dimensional column vector with elements $b_P$. The minimum value of the loss function is The estimator of $a_{P,P'}$ is
The variance of the estimator is The estimator of $c$ is $\hat{c}$ The variance of the estimator is where $\delta a = \hat{a} - a$, $\delta b = \hat{b} - b$, $\delta c = \hat{c} - c$ and $|q_{\rm opt}|^2 = \sum_P q_{\rm opt}(P)^2$. The overhead cost of the error mitigation is $C = \sum_P |q_{\rm opt}(P)|$. Because $C^2 \ge |q_{\rm opt}|^2$, we have Given the optimal quasi-probability distribution $q_{\rm opt}(P)$, we can implement the error-mitigated computing accordingly. Taking $q(P) = q_{\rm opt}(P)$, we have is a normalised distribution.

Because the number of $P$ grows exponentially with the number of single-qubit Pauli gates in the circuit, it could be difficult to compute the normalisation factor $A(\lambda)$ given the explicit expression of $B(P, \lambda)$. Although we may not be able to compute $A(\lambda)$, we can sample the distribution Our purpose is to minimise the loss function and find the optimal $C$ and $\lambda$. Given the quadratic form of the loss function, the optimal value of $C$ is and the corresponding minimum value of the loss function is which is still a function of $\lambda$. We note that $c - b^2/a \ge 0$ always holds.
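The closed-form minimiser $q_{\rm opt} = a^{-1} b$ and minimum loss $c - b^{\rm T} a^{-1} b$ can be sketched with synthetic data standing in for the Monte Carlo estimates of $a$, $b$ and $c$ (the $\mathrm{com}(R, P)$ values below are random placeholders, not real circuit outputs; the ideal results are constructed to lie exactly in their span, so the recovered loss is zero):

```python
import numpy as np

rng = np.random.default_rng(2)
n_P, n_R = 5, 400                       # |SigE| and number of training circuits (toy)

# placeholder "computing results" com(R, P) for each circuit R and sequence P
com = rng.normal(size=(n_R, n_P))
w_true = np.array([0.9, -0.1, 0.1, 0.05, 0.05])
ideal = com @ w_true                    # stand-in error-free results com_ef(R, I)

a = com.T @ com / n_R                   # a_{P,P'} = <com(R,P) com(R,P')>_R
b = com.T @ ideal / n_R                 # b_P     = <com(R,P) com_ef(R,I)>_R
c = ideal @ ideal / n_R                 # c       = <com_ef(R,I)^2>_R

q_opt = np.linalg.solve(a, b)           # minimiser of q^T a q - 2 b^T q + c
loss_min = c - b @ q_opt                # minimum loss  c - b^T a^{-1} b
```

With real, finite-sample estimates $\hat a$, $\hat b$ and $\hat c$, the same linear solve is performed, and the residual loss is non-zero because of the estimator fluctuations discussed above.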
We can find that, in the expressions of $a$ and $b$, the coefficients are normalised distributions. Therefore, we can compute $a$ and $b$ using the Monte Carlo summation, generating samples with the Metropolis method.

a. Importance sampling
Usually, only a small subset of the Pauli errors is dominant. Accordingly, the optimal solution $q(P)$ is significant only for a small subset of error-mitigating gate sequences, and $q(P)$ is close to zero for most $P$. Therefore, if the variance of $f$ is finite, generating samples according to $q(P)$ (i.e. $B(P, \lambda)$) is sub-optimal for the Monte Carlo summation.
We evaluate the loss function in order to find the optimal distribution, so we usually need to actively update the distribution $q(P)$ (i.e. $\lambda$ and $C$). To utilise the samples efficiently, we use samples generated according to a distribution $B(P, \lambda')$ that is close to $B(P, \lambda)$ to compute $a$ and $b$. Then, it is not necessary to generate new samples every time we update $\lambda$.
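Reusing samples drawn at $\lambda'$ to estimate averages at $\lambda$ is standard importance reweighting. A minimal sketch with an illustrative one-parameter family $B(P, \lambda) = e^{-\lambda P}$, which is not the parametrisation used in the paper:

```python
import numpy as np

rng = np.random.default_rng(3)

def B(P, lam):
    # illustrative one-parameter family of (unnormalised) weights
    return np.exp(-lam * P)

P_vals = np.arange(8)                   # toy labels for the gate sequences P
lam_sample, lam_target = 0.5, 0.6       # samples drawn at lam', evaluated at lam

probs = B(P_vals, lam_sample)
probs = probs / probs.sum()
samples = rng.choice(P_vals, size=100_000, p=probs)

# reweighting: E_lam[g] = E_lam'[w g] / E_lam'[w]  with  w = B(P,lam)/B(P,lam')
w = B(samples, lam_target) / B(samples, lam_sample)
estimate = np.mean(w * samples) / np.mean(w)

target = B(P_vals, lam_target)
target = target / target.sum()
exact = target @ P_vals                 # exact mean of P under lam
```

The self-normalised ratio removes the unknown normalisation factors $A(\lambda)$ and $A(\lambda')$, which is why the reweighting remains usable even when $A(\lambda)$ cannot be computed; the price is a growing variance as $\lambda$ drifts away from $\lambda'$.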
The estimator of $\tilde a$ is The variance of the estimator is To compute $\tilde b$, we generate independent and identically distributed samples $\{(R_i, P_i, P'_i, f_i, f'_i) \mid i = 1, 2, \ldots, N_s\}$ according to the distribution $\mathrm{Pro}(R)\,\mathrm{Pro}(P)\,\mathrm{Pro}(f|R, P)\,\mathrm{Pro}_{\rm ef}(f'|R, I)$.

The estimator of $\tilde b$ is
The variance of the estimator is If $\lambda' = \lambda$, we have where $\delta\tilde a = \hat{\tilde a} - \tilde a$ and $\delta\tilde b = \hat{\tilde b} - \tilde b$. If $\lambda' = \lambda$, we have $C_{\rm opt} = \tilde b/\tilde a$ and $\mathrm{Var}\,\widehat{\mathrm{Loss}}_{\rm min}(\lambda) \lesssim \frac{1}{N_s} |f|_{\rm max}^4 (1 + C_{\rm opt}^4 + 4 C_{\rm opt}^2)$ (N47).

e. Computation of $\mathrm{com}_{\rm em}$

By minimising $\mathrm{Loss}_{\rm min}(\lambda)$, we can obtain the optimal value of $\lambda$, which is $\lambda_{\rm opt}$. Then, the optimal quasi-probability distribution is given by $\lambda_{\rm opt}$ and the corresponding $C_{\rm opt}$, and we can implement the error-mitigated computing accordingly. We remark that, to compute $C_{\rm opt} = \tilde b/\tilde a$, we need to generate samples with $\lambda' = \lambda$; then $a = \tilde a$ and $b = \tilde b$. Taking $\lambda = \lambda_{\rm opt}$ and $C = C_{\rm opt}$, we have