Unified approach to data-driven quantum error mitigation

Achieving near-term quantum advantage will require effective methods for mitigating hardware noise. Data-driven approaches to error mitigation are promising, with popular examples including zero-noise extrapolation (ZNE) and Clifford data regression (CDR). Here we propose a novel, scalable error mitigation method that conceptually unifies ZNE and CDR. Our approach, called variable-noise Clifford data regression (vnCDR), significantly outperforms these individual methods in numerical benchmarks. vnCDR generates training data first via near-Clifford circuits (which are classically simulable) and second by varying the noise levels in these circuits. We employ a noise model obtained from IBM's Ourense quantum computer to benchmark our method. For the problem of estimating the energy of an 8-qubit Ising model system, vnCDR improves the absolute energy error by a factor of 33 over the unmitigated results and by factors 20 and 1.8 over ZNE and CDR, respectively. For the problem of correcting observables from random quantum circuits with 64 qubits, vnCDR improves the error by factors of 2.7 and 1.5 over ZNE and CDR, respectively.


I. INTRODUCTION
Quantum computers are approaching the important milestone of having a demonstrable advantage over classical computers for practical applications, such as chemistry and materials science [1]. Such a quantum advantage is expected to be demonstrated with near-term devices that do not have the number of qubits or the gate fidelities required to implement full quantum error correction [2]. Nevertheless, the noise of such devices remains a serious obstacle to practical applications [3]. While near-term devices will not be able to completely remove errors caused by device noise, it is often possible to mitigate them.
Such so-called error mitigation (EM) techniques are sure to be an essential part of demonstrating the utility of quantum technologies, for example, for achieving chemical accuracy in chemistry applications. To this end, many distinct EM methods have been proposed [4,5]. One approach is to optimize quantum circuits using compiling and machine learning [6][7][8], while another employs variational quantum algorithms [9][10][11] to reduce circuit depth and potentially remove the effects of incoherent noise [12][13][14][15][16][17]. More recently, quantum phase estimation has been employed for error mitigation [18].
Zero-noise extrapolation (ZNE) is a classical postprocessing approach to EM that has received a significant amount of attention [4]. ZNE combines observables evaluated at several controlled noise levels, obtained by stretching gate times or inserting identities [19][20][21][22][23], enabling extrapolation to the zero-noise limit. Despite much success [24], this method is not without its limitations. Due to the uncertainty of the extrapolation, performance guarantees are difficult in general. In particular, ZNE struggles when a low-degree polynomial fit to the noisy expectation values fails to match the behavior in the zero-noise limit. For simple noise models or very low depth circuits this extrapolation can be well behaved. But in real devices using less trivial circuits, the lowest error points available are often too noisy for such fits to be helpful.
FIG. 1. The Variable Noise Clifford Data Regression (vnCDR) method. The first step constructs a set of near-Clifford training circuits that are close, in some sense, to the circuit of interest. The second step increases the size of the training set by adding variable amounts of noise to the circuits generated in the first step. The third step involves both classical simulation and quantum evaluation of the training circuits to generate the noise-free and noisy training data, respectively. The fourth step trains the parameters of an ansatz, which we take as a hyperplane, to fit the training data. Finally, one uses this fitted ansatz to predict the desired observable for the circuit of interest.
* The first two authors contributed equally to this work.
arXiv:2011.01157v2 [quant-ph] 2 Aug 2021
Recently, alternative mitigation methods have been developed that make use of learning from data sets constructed using Clifford quantum circuit data [25,26]. These methods are attractive based on their relative simplicity and scalability due to the classically simulable nature of quantum circuits comprised mainly of Clifford gates (gates that map Pauli operators to Pauli operators).
For example, the Clifford Data Regression (CDR) method [26] first chooses a training set of near-Clifford quantum circuits related to the circuit of interest. A scalable classical simulator of near-Clifford circuits and a noisy quantum computer are used to compute the noise-free and noisy data, respectively. A linear ansatz relating the noisy values to the noise-free ones is then fit to these data. Finally, the trained ansatz is used to predict the noise-free observable for the quantum circuit of interest.
Both ZNE and CDR are data-driven approaches to error mitigation, but they use different types of data. ZNE uses variable noise data while CDR uses variable Clifford circuit data. A natural question is whether combining these approaches could lead to a unified technique that is more powerful than the individual ones. In this work, we propose a novel method that answers this in the affirmative.
Our approach is called variable noise Clifford data regression (vnCDR). vnCDR considers a collection of near-Clifford training circuits like CDR, each evaluated at multiple noise levels as in ZNE. One can think of this process as either informing the extrapolation in ZNE about the zero-noise limit for similar circuits or as adding relevant features to the regression model in CDR. In the latter view, this is philosophically similar to data augmentation techniques that introduce artificial noise in machine learning [27]. The ansatz employed in vnCDR is motivated by Richardson extrapolation and by noting it perfectly removes the effects of global depolarizing noise (see Appendix A). We also comment that training on the set of Clifford circuits is sufficient as the Clifford gates span the space of single qubit unitaries (see Appendix B). Figure 1 gives a schematic illustration of vnCDR.
Below we first provide background on ZNE and CDR, and then we present our unified method. Using a noisy simulator based on a gate set tomography of IBM's Ourense quantum computer, we compare the performance of ZNE, CDR, and vnCDR for two tasks. The first task is estimating the energy of an 8-qubit transverse Ising model with the Quantum Alternating Operator Ansatz (QAOA). Correcting circuits of this form is relevant for both combinatorial optimization problems and condensed matter studies [28,29]. Our second task involves random quantum circuits for large qubit numbers (up to 64 qubits with 6 CNOT layers) and large circuit depth (up to 16 CNOT layers with 8 qubits). The lack of structure in these random circuits makes them a difficult use case for these EM methods, and they give us a notion of these methods' utility in more general settings.
For both use cases, vnCDR outperforms ZNE and CDR. For the QAOA task we analyze the absolute energy error and obtain with vnCDR a factor of 20 improvement over ZNE and a factor of 1.8 improvement over CDR. For the random circuit task we obtain in the case of 64 qubits factors of 2.7 and 1.5 improvement over ZNE and CDR, respectively, while for the case of 16 layers we obtain factors of 2.3 and 1.3 improvement over those methods.

II. BACKGROUND
A. Zero-noise extrapolation
ZNE [4] involves varying the noise level of a quantum circuit to infer the noise-free behavior. Assuming a dependence on a noise parameter ε, the correction is performed by taking linear combinations of the noisy expectation values in such a way that errors attributable to terms of order n or less are canceled, where n is the number of additional noise levels employed.
Following the presentation in Ref. [4], denote the noise-free expectation by $\mu$ and consider the task of correcting the expectation value obtained from a noisy quantum device with noise characterized by a parameter $\epsilon$. First, one chooses a set of noise levels $C = \{c_0, c_1, \ldots, c_n \mid c_0 = 1,\ c_j < c_{j+1}\}$ and runs the device with amplified noise $c_j \epsilon$ to obtain an estimate $\hat{\mu}_j$ for all noise levels $c_j \in C$. The final correction $\hat{\mu}$ can then be computed as
$$\hat{\mu} = \sum_{j=0}^{n} \gamma_j \hat{\mu}_j, \tag{1}$$
where the coefficients $\{\gamma_j\}$ are chosen to satisfy
$$\sum_{j=0}^{n} \gamma_j = 1, \qquad \sum_{j=0}^{n} \gamma_j c_j^{k} = 0 \quad \forall k \in \{1, \ldots, n\}. \tag{2}$$
This technique, known as Richardson extrapolation [4,30], ensures the error of the final estimate is of the order $O(\epsilon^{n+1})$. Though ZNE was originally proposed in a context where one can stretch gate times to achieve the various noise levels $c_j$, recent work has suggested a hardware-agnostic implementation based on identity insertions [19,20]. For example, two CNOT gates applied one after the other act as the identity in the noise-free circuit evaluation, but are likely to affect the output in the noisy case. In the fixed identity insertion method (FIIM), the noise level is set by the number of gates added in this manner: inserting pairs of additional CNOT gates after every CNOT in the original circuit implements the noise levels $c_j = 1, 3, 5, \ldots$.
As noted in Ref. [20], Richardson extrapolation is equivalent to performing a polynomial interpolation on the various noisy expectations, treating the noise levels $c_j$ as the independent variable. To see this, note that for any solution $\{b_k\}_{k=0}^{n}$ to the system of equations
$$\hat{\mu}_j = \sum_{k=0}^{n} b_k c_j^{k}, \qquad j = 0, \ldots, n, \tag{3}$$
it holds that
$$\hat{\mu} = \sum_{j=0}^{n} \gamma_j \hat{\mu}_j = \sum_{k=0}^{n} b_k \sum_{j=0}^{n} \gamma_j c_j^{k} = b_0.$$
A unique solution to (3) exists for $n + 1$ distinct noise levels $c_j$ when performing an order-$n$ interpolation (see, for example, Ref. [31]).
Alternatively, one can adopt a lower-degree polynomial fit, as done in our numerical experiments, by computing the least-squares solution to the resulting system of equations. For example, if one wishes to perform a linear fit on the data from $n + 1$ distinct noise levels, one may write
$$\hat{\mu}_j = b_0 + b_1 c_j$$
and express the system of equations as
$$\begin{pmatrix} 1 & c_0 \\ \vdots & \vdots \\ 1 & c_n \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \end{pmatrix} = \begin{pmatrix} \hat{\mu}_0 \\ \vdots \\ \hat{\mu}_n \end{pmatrix}.$$
Taking the y-intercept $b_0$ of the least-squares solution yields the extrapolated expectation value. Crucially, the value of the correction is completely determined by a fixed set of noisy expectations for any choice of the noise amplification and extrapolation techniques above. In particular, Eq. (1) enforces an $n$th-degree polynomial fit when one has data points at $n + 1$ noise levels, which may not provide a good approximation to the behavior near $\epsilon = 0$ when the data points that are experimentally accessible all reflect a fairly high amount of noise. While some authors have also successfully used lower-degree polynomial fits, there is evidence to suggest the resulting corrections can still be fairly inaccurate [5,19,26]. This motivates our proposal for a method based on learning from efficiently simulable circuits, avoiding some of the drawbacks of ZNE.
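As an illustration of the two extrapolation schemes described above, the following is a minimal sketch, assuming NumPy is available. The function names are hypothetical and are not taken from the authors' implementation. It solves the conditions on the coefficients γ_j directly and, separately, performs the linear least-squares fit and returns its intercept.

```python
import numpy as np

def richardson_coefficients(noise_levels):
    """Solve sum_j gamma_j = 1 and sum_j gamma_j c_j^k = 0 for k = 1..n."""
    c = np.asarray(noise_levels, dtype=float)
    size = len(c)
    # Row k of this matrix holds c_j^k; row 0 enforces the normalization.
    powers = np.vander(c, N=size, increasing=True).T
    rhs = np.zeros(size)
    rhs[0] = 1.0
    return np.linalg.solve(powers, rhs)

def richardson_extrapolate(noise_levels, noisy_values):
    """Combine the noisy expectations with the Richardson coefficients."""
    gamma = richardson_coefficients(noise_levels)
    return float(gamma @ np.asarray(noisy_values, dtype=float))

def linear_extrapolate(noise_levels, noisy_values):
    """Least-squares fit mu_j = b0 + b1 * c_j; the intercept b0 is the estimate."""
    c = np.asarray(noise_levels, dtype=float)
    design = np.column_stack([np.ones_like(c), c])
    target = np.asarray(noisy_values, dtype=float)
    (b0, _b1), *_ = np.linalg.lstsq(design, target, rcond=None)
    return float(b0)

# Toy data (illustrative only): expectations that are quadratic in the noise level.
levels = [1, 3, 5]
values = [0.9 - 0.1 * c + 0.01 * c**2 for c in levels]
print(richardson_extrapolate(levels, values))  # recovers the constant term of the quadratic
print(linear_extrapolate(levels, values))      # differs, since the data are not linear
```

With three noise levels the Richardson scheme is exact for any quadratic dependence on the noise level, whereas the linear fit trades that exactness for lower sensitivity to noise in the individual data points.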

B. Clifford data regression
In CDR [26] the expectation values obtained from a quantum device are corrected using a straightforward linear regression based on examples from circuits comprised mainly of Clifford gates. These Clifford circuits are efficiently simulable and generated in such a way as to remain similar to the original circuit of interest. Explicitly, the goal is to learn a function which takes noisy expectations to their error-mitigated values:
$$f(\hat{\mu}_0) = a_1 \hat{\mu}_0 + a_2, \tag{8}$$
where $\hat{\mu}_0$ is the noisy expectation and $a_1$, $a_2$ are parameters chosen optimally by least-squares regression on the Clifford circuit dataset, i.e., for a training set of $m$ noisy Clifford circuit expectations $\{x_i\}$ and corresponding targets $\{y_i\}$ obtained via classical simulation, one computes
$$(a_1, a_2) = \underset{(a_1, a_2)}{\arg\min} \sum_{i=1}^{m} \left( y_i - (a_1 x_i + a_2) \right)^2.$$
The form of the ansatz can be physically motivated using a simplified noise model. Let $\rho$ be the density matrix for the state of a device which has undergone some noise-free evolution and consider a global depolarizing noise channel $\mathcal{E}$ which acts on this state before a measurement of the observable $X$. It then holds that
$$\mathrm{Tr}(\mathcal{E}(\rho) X) = (1 - \epsilon)\,\mathrm{Tr}(\rho X) + \frac{\epsilon}{d}\,\mathrm{Tr}(X), \tag{10}$$
where $d$ is the dimension of the system and $\epsilon$ is a parameter characterizing the noise. Identifying $\hat{\mu}_0 = \mathrm{Tr}(\mathcal{E}(\rho) X)$ and
$$a_1 = \frac{1}{1 - \epsilon}, \qquad a_2 = -\frac{\epsilon\,\mathrm{Tr}(X)}{d\,(1 - \epsilon)},$$
we see that the desired quantity $\mathrm{Tr}(\rho X)$ can be recovered using Eq. (8).
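The regression step of CDR can be sketched as follows. This is a minimal illustration assuming NumPy. The function names and the toy depolarizing parameters are hypothetical, and the training data are generated synthetically rather than from near-Clifford circuits.

```python
import numpy as np

def fit_cdr(noisy_train, exact_train):
    """Least-squares fit of the ansatz f(x) = a1 * x + a2 on (noisy, exact) pairs."""
    x = np.asarray(noisy_train, dtype=float)
    y = np.asarray(exact_train, dtype=float)
    design = np.column_stack([x, np.ones_like(x)])
    (a1, a2), *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(a1), float(a2)

def cdr_correct(noisy_value, a1, a2):
    """Apply the trained ansatz to the noisy value of the circuit of interest."""
    return a1 * noisy_value + a2

# Toy model (an assumption for illustration): global depolarizing noise with
# strength eps acting on an observable with Tr(X) = trace_x in dimension dim.
eps, dim, trace_x = 0.2, 4, 1.0
rng = np.random.default_rng(0)
exact = rng.uniform(-1.0, 1.0, size=20)           # stand-in for simulated values y_i
noisy = (1 - eps) * exact + eps * trace_x / dim   # stand-in for device values x_i

a1, a2 = fit_cdr(noisy, exact)
print(a1, a2)  # compare with 1/(1-eps) and -eps*Tr(X)/(d*(1-eps))
```

In this idealized setting the fitted parameters coincide with the values identified from the depolarizing model, which is the sense in which the ansatz is physically motivated.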
When applied to a more realistic noise model obtained from an IBM quantum device, empirical results suggest CDR yields significantly more scalable corrections than those from ZNE [26], at least in the plausible setting of being limited to coarse-grained noise amplification. In the following sections, we improve upon the CDR method by incorporating data obtained at variable noise rates, which leads to more accurate predictions of the noise-free expectation values.

III. THE VNCDR METHOD
Let $U$ be a quantum circuit, $|0\rangle$ its initial state, and $X$ an observable of interest. Consider the task of estimating the expectation value $\mu = \langle 0| U^{\dagger} X U |0\rangle$ from measurements of a noisy quantum device. The variable noise Clifford data regression (vnCDR) method is performed with the following steps.

1. (Clifford data) Choose a set of circuits $S = \{V_i\}_{i=1}^{m}$ based on $U$ which will be used to form the training set $T$ in step 3. The circuits in $S$ must be efficient to simulate classically, which is ensured by constructing them primarily from Clifford gates. The number of non-Cliffords used is denoted by $N$. Note that $N$ is assumed to be a constant parameter here, so the simulations are classically tractable.
2. (Noise data) Choose a set of noise levels $C = \{c_0, c_1, \ldots, c_n\}$, where $1 = c_0 < c_1 < \cdots < c_n$, which will be used to form the training set $T$ in step 3. If the noise is characterized by a parameter $\epsilon$, then running the device with noise level $c_j$ means the new parameter is $c_j \epsilon$.
3. (Training set) For each of the $m$ circuits $V_i$ in $S$ and $n + 1$ noise levels $c_j \in C$, produce an estimate of the observable expectation called $x_{i,j}$. Also, for each of the $m$ circuits compute $y_i = \langle 0| V_i^{\dagger} X V_i |0\rangle$ using a classical simulation. The training set $T$ is then defined as $T = \{(\mathbf{x}_i, y_i)\}_{i=1}^{m}$, where $\mathbf{x}_i = (x_{i,0}, \ldots, x_{i,n})$ is the vector of noisy estimates originating from the $i$th circuit.

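4. (Learning) Learn a function $f : \mathbb{R}^{n+1} \to \mathbb{R}$ that takes a set of noisy estimates at the $n + 1$ different noise levels and outputs an estimate for the noise-free value. Specifically, we take the linear ansatz $g : \mathbb{R}^{n+1} \times \mathbb{R}^{n+1} \to \mathbb{R}$,
$$g(\mathbf{x}; \mathbf{a}) = \mathbf{a} \cdot \mathbf{x}.$$
We use least-squares regression on the dataset $T$ to pick optimal parameters $\mathbf{a}^*$, i.e.,
$$\mathbf{a}^* = \underset{\mathbf{a}}{\arg\min} \sum_{i=1}^{m} \left( y_i - g(\mathbf{x}_i; \mathbf{a}) \right)^2,$$
so that we expect $f(\mathbf{x}) = g(\mathbf{x}; \mathbf{a}^*)$ to output a good estimate for the noise-free expected value given a vector of noisy ones.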
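The learning step above can be sketched in a few lines. This is a minimal illustration assuming NumPy; the function names are hypothetical, and the training data are generated from a toy global depolarizing model rather than from near-Clifford circuits run on a device.

```python
import numpy as np

def fit_vncdr(noisy_train, exact_train):
    """Fit the linear ansatz g(x; a) = a . x (no constant term) by least squares.

    noisy_train has shape (m, n + 1): row i holds the estimates x_{i,j} of
    training circuit i at each of the n + 1 noise levels.
    """
    x = np.asarray(noisy_train, dtype=float)
    y = np.asarray(exact_train, dtype=float)
    a_star, *_ = np.linalg.lstsq(x, y, rcond=None)
    return a_star

def vncdr_correct(noisy_values, a_star):
    """Predict the noise-free value for the circuit of interest."""
    return float(np.dot(a_star, np.asarray(noisy_values, dtype=float)))

# Toy model (an assumption for illustration): at noise level c the exact value y
# becomes q**c * y + (1 - q**c) * t, with q = 1 - eps and t = Tr(X) / d.
q, t = 0.95, 0.25
levels = np.array([1, 3, 5])
rng = np.random.default_rng(1)
y_train = rng.uniform(-1.0, 1.0, size=40)
x_train = np.outer(y_train, q**levels) + t * (1 - q**levels)

a_star = fit_vncdr(x_train, y_train)
y_target = 0.3
x_target = y_target * q**levels + t * (1 - q**levels)
print(vncdr_correct(x_target, a_star))  # close to y_target in this idealized model
```

Note that the design matrix in this toy model is rank deficient, so the least-squares solution is not unique; `np.linalg.lstsq` returns the minimum-norm one, and every solution gives the same prediction on data of this form.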
Our method shares common features with both ZNE and CDR. Specifically, the functional form of the ansatz we choose resembles a Richardson extrapolation on the noisy expected values as shown in Section II A. However, the method differs in its approach to relating the noisy values to the final estimate. Namely, in ZNE, the output is a fixed function of the various $\hat{\mu}_j$, whereas vnCDR attempts to learn the best candidate from a family of functions parametrized by $\mathbf{a}$. In a certain sense, vnCDR is thus choosing the best possible extrapolation of the noisy data from the original circuit using examples from Clifford circuits which are similar in structure.
The method can also be viewed as adding relevant variables to the CDR method before performing a multiple linear regression on the new dataset. This description comes with the caveat that, in contrast with the CDR ansatz (see Eq. (8)), the new parametrization g(x; a) is a linear mapping without a constant term. There are two motivations for this. Firstly, such a parametrization corresponds well with the linear combination of noisy expectations that is utilized in the ZNE method (Eq. (1)). Secondly, restricting the class of functions we are searching over to be linear affords us an intuitively desirable property: if the function we arrive at achieves zero error on all circuits composed of Clifford gates, then it will predict the expectation values of arbitrary circuits with zero error. This result boils down to the observation that Clifford gates span the space of single-qubit unitaries (see Appendix B).
The form of the ansatz can be further motivated by considering a global depolarizing noise channel (see Eq. (10)). We note that this simple model was proposed recently to effectively describe dominant effects of the noise in real devices [32,33]. The vnCDR ansatz can be shown to completely mitigate the effect of such a channel on some observable of interest (see Appendix A), similar to CDR.

IV. NUMERICAL RESULTS
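A. Transverse-field Ising model
First, we consider the task of variational simulation of the ground state of a 1-D transverse-field Ising model using parameterized quantum circuits. The Hamiltonian of the system is given by
$$H = -\sum_{\langle j, j' \rangle} \sigma_Z^{j} \sigma_Z^{j'} - g \sum_{j} \sigma_X^{j},$$
where $\sigma_X$, $\sigma_Z$ are Pauli matrices and $j, j'$ are nearest-neighbor sites on the lattice. We assume open boundary conditions and consider the case $g = 2$, corresponding to the paramagnetic phase. We use the QAOA [28,29] ansatz
$$|\psi(\beta, \gamma)\rangle = \prod_{j=p,p-1,\ldots,1} e^{-i \beta_j H_X}\, e^{-i \gamma_j H_{ZZ}}\, |+\rangle^{\otimes Q}, \tag{15}$$
where $\beta$, $\gamma$ are the rotation angles to be optimized, $H_{ZZ} = -\sum_{\langle j, j' \rangle} \sigma_Z^{j} \sigma_Z^{j'}$, $H_X = -\sum_{j} \sigma_X^{j}$, $|+\rangle = (|0\rangle + |1\rangle)/\sqrt{2}$, and $Q$ is the number of qubits. A decomposition of (15) to a quantum circuit is described in Appendix C.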
We perform the optimization for Q = 8 qubits using a circuit depth p = 4. We minimize the energy evaluated with a noisy simulator using a MATLAB implementation of quasi-Newton gradient descent. The noise model we employ is obtained by gate set tomography of IBM's Ourense quantum computer and described in detail in Ref. [8]. Furthermore, we assume perfect measurement, as measurement errors can be mitigated by specialized techniques [34,35]. To carry out the benchmark of our method, we run 27 instances of the optimization and correct the resulting observable expectations using the ZNE, CDR and vnCDR methods. The corrections are realized on each of the 1- and 2-qubit terms which make up the Hamiltonian, from which we then estimate the ground state energy. The results are summarized in Fig. 2, showing that vnCDR outperforms ZNE and CDR with a factor of 33 improvement of the mean absolute energy error, while ZNE and CDR give factors of 1.7 and 19, respectively.
In the case of CDR and vnCDR for each of the circuits we construct training sets with 80 classically simulable circuits, setting the number of non-Cliffords to N = 16. We remark that there are 60 non-Clifford gates in total for the circuit of interest. For further information regarding the construction of our training sets, see Section V B.
For the vnCDR and ZNE corrections, we computed the expectation values using the set of noise levels C = {1, 3, 5} and the fixed identity insertion noise amplification method [20] which we elaborate upon in Section V A. The noise level is defined as the ratio of CNOT gates in the modified circuits compared to the original one. Note that this is a fairly coarse-grained set of noise levels, which may explain why the ZNE performance is quite poor. We also found that including in C noise levels higher than 5 did not improve the performance of the methods. For further details, see Sections V A and V C.

B. Random quantum circuits
Next we consider an implementation of the IBMQ hardware efficient ansatz with random parameters, see Fig. 3. The ansatz consists of layers of alternating nearest-neighbor CNOTs decorated with general one-qubit unitaries U(α, β, γ). We compute one- and two-qubit observables for 30 random instances and correct them with the ZNE, CDR and vnCDR methods, see details in the caption of Fig. 4. We analyze the scaling of the observables' absolute error with increasing Q = 8, 16, 32, 64 for p = 6 and the scaling with increasing p = 4, 8, 12, 16 for Q = 8. To simulate large-Q systems we employ a noisy simulator based on Matrix Product Operators (MPO) [36] with the same noise model as in the case of the Ising QAOA simulations. We discuss the simulator in more detail in Appendix E. Here, to simplify the presentation, we show results obtained in the limit of infinite shot number. In Appendix F we show that qualitatively the same results can be obtained using finite shot numbers feasible with current quantum computers.
The results are discussed in detail in Fig. 4. We find that vnCDR outperforms the ZNE and CDR methods. For fixed p = 6 we observe that the unmitigated mean absolute error does not grow with increasing Q in the limit of large Q. Such behavior can be explained by the existence of a threshold Q value for which the causal cones of the observables [37] stop increasing with Q. The causal cone is defined here as the set of gates which affect the expectation value of the observable. See Appendix D for an example of causal cone construction. We take the causal cone into account when forming vnCDR and CDR training sets, see details in Section V. For such an implementation we find that the vnCDR and CDR mitigated mean errors also do not increase with increasing Q. We remark that our noise model does not include cross-talk, which in principle may result in a faster increase in the number of gates in the noisy observables' causal cones. We leave investigation of the scaling in the presence of such noise to future work. With increasing p we find that the quality of the correction decreases for all methods. Nevertheless, even in the case of the deepest circuits, p = 16, we obtain a significant improvement when employing vnCDR.
FIG. 3. An example of the IBMQ hardware efficient ansatz with p = 2 layers for Q = 4 qubits. The layers, represented by gates within the dashed contours, act on a random product state created by general single-qubit unitaries U(α, β, γ).
To perform CDR and vnCDR for each observable of interest in each random circuit we construct a training set using 100 classically simulable circuits with N = 20 non-Clifford gates. Detailed discussion of the method used to construct the training sets is given in Section V B. We remark that in the case of p = 6, Q = 64 circuits the largest number of non-Clifford gates within the causal cone of an observable is 60, while for Q = 8, p = 16 it is 312.
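The saturation of the causal cone with Q can be illustrated with a small sketch. It assumes a simplified layout that may differ from the compiled circuits used here: layer l applies CNOTs on neighboring pairs (i, i+1) with i of the same parity as l, with open boundaries, and only two-qubit gates are counted. The function name is hypothetical.

```python
def causal_cone_cnots(num_qubits, num_layers, observable_qubits):
    """Return the CNOTs (layer, (control, target)) that can affect the observable.

    Layers are traversed from the last to the first while tracking the set of
    qubits (the support) on which the back-propagated observable acts.
    """
    support = set(observable_qubits)
    cone = []
    for layer in reversed(range(num_layers)):
        offset = layer % 2  # assumed alternating brickwork pattern
        touched = []
        for control in range(offset, num_qubits - 1, 2):
            pair = (control, control + 1)
            if support.intersection(pair):
                touched.append((layer, pair))
        # Gates within one layer act on disjoint qubits, so update afterwards.
        for _, pair in touched:
            support.update(pair)
        cone.extend(touched)
    return cone, support

for q in (8, 16, 32, 64):
    cone, support = causal_cone_cnots(q, num_layers=6, observable_qubits=[q // 2])
    print(q, len(cone), len(support))
```

For a single-qubit observable in the middle of the chain, the number of CNOTs in the cone stops growing once Q exceeds roughly twice the number of layers, in line with the threshold behavior discussed above.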
For the vnCDR and ZNE corrections, we increase the noise level by identity insertions as in the case of the QAOA Ising simulations. We find that in both cases it is beneficial to include higher noise levels than in the Ising case: C = {1, 3, 5, 7, 9}. We remark that the QAOA circuit, having 16 layers of CNOTs, is deeper than most circuits considered here. As ZNE is supposed to correctly capture noise effects for sufficiently small noise, this may explain why it is beneficial to use higher noise levels in the random quantum circuits case. We leave systematic investigation of this effect to future work. For a more detailed description of the ZNE and vnCDR implementations see Sections V A and V C.
FIG. 4. We define the absolute error per circuit as the mean of the observables' absolute errors. The absolute error of a corrected (noisy) observable is defined as the absolute value of its difference from the exact value. Panel (a) displays the scaling with Q of the mitigated and unmitigated absolute error per circuit for p = 6. The left panel shows the mean values (solid lines), while the right panel shows the maximal values (dashed lines). Panel (b) shows a bar plot of the error for Q = 64 and p = 6. Panel (c) shows the scaling with increasing p for Q = 8, and panel (d) the results for Q = 8, p = 16.

V. IMPLEMENTATION DETAILS
A. ZNE
We perform the noise amplification in our numerical experiments using identity insertions after each application of a CNOT gate [19,20]. We use the fixed identity insertion method (FIIM) of Ref. [20], which adds pairs of CNOT gates after each CNOT gate of the original circuit. The noise level is defined as the factor by which the number of CNOT gates in the circuit increases. In the first example, the QAOA optimization task, we employ noise levels C = {1, 3, 5}, whereas for random circuits we achieved better results with a higher maximum noise level, so we used the set C = {1, 3, 5, 7, 9}. We obtained corrected values of the observables of interest by an extrapolation using both a polynomial fit via Eq. (1) and a linear fit to the data, as explained in Section II A. In both the Ising and random quantum circuits cases, we found that a linear fit performed better than a polynomial regression for extrapolation, so we report those results here.
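The fixed identity insertion step can be sketched as follows. This is a minimal illustration in which a circuit is represented as a plain list of gate tuples; the representation and the function name are assumptions made for the example, not the data structures used in our simulations.

```python
def fiim_amplify(gates, noise_level):
    """Insert (noise_level - 1) / 2 pairs of CNOTs after every CNOT.

    gates: list of tuples such as ("CNOT", 0, 1) or ("RZ", 1, 0.3).
    noise_level: odd positive integer (1, 3, 5, ...), the factor by which
    the CNOT count increases.
    """
    if noise_level < 1 or noise_level % 2 == 0:
        raise ValueError("noise_level must be an odd positive integer")
    extra_pairs = (noise_level - 1) // 2
    amplified = []
    for gate in gates:
        amplified.append(gate)
        if gate[0] == "CNOT":
            # Each inserted pair is the identity in the noise-free circuit.
            amplified.extend([gate] * (2 * extra_pairs))
    return amplified

circuit = [("RZ", 0, 0.3), ("CNOT", 0, 1), ("RX", 1, 1.5707963), ("CNOT", 1, 2)]
for level in (1, 3, 5):
    amplified = fiim_amplify(circuit, level)
    cnots = sum(1 for gate in amplified if gate[0] == "CNOT")
    print(level, cnots)
```

The ratio of the CNOT count in the amplified circuit to that in the original circuit equals the requested noise level, matching the definition used above.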

B. CDR
To construct the training set for a circuit of interest we substitute most of the non-Clifford gates in the circuit by Clifford gates with two different substitution strategies, which are explained below. Such a procedure ensures that circuits in the training set are classically simulable and biased towards the circuit of interest. Here we consider circuits of interest which are compiled for the IBMQ quantum computers. The compiled circuits are built from CNOTs, $R_X(\pi/2)$ pulses and general $\sigma_Z$ rotations $R_Z(\beta) = e^{-i\beta\sigma_Z/2}$ with $\beta \in [0, 2\pi)$. The pulses and CNOTs are Clifford gates, while $R_Z(\beta)$ is a Clifford gate only for $\beta = n\pi/2$, where $n$ is an integer. Therefore, we substitute most of the $R_Z$ gates by $S^n$, where $n = 0, 1, 2, 3$ and $S = e^{i\pi\sigma_Z/4}$ is the phase gate. In both the Ising and random quantum circuits cases we find that substantial error reduction can be obtained using training sets built with approximately 100 near-Clifford circuits.
We consider two different substitution strategies. The first one substitutes a randomly chosen non-Clifford $R_Z(\beta)$ by the $S^n$ minimizing $d(\beta, n) = \|R_Z(\beta) - S^n\|$, where $\|\cdot\|$ is the Frobenius norm. This procedure is repeated until $N$ non-Clifford rotations are left in the circuit. We find that this very simple strategy works well for the Ising model, enabling us to obtain a factor of 33 improvement in calculating the energy.
In the more general and challenging case of random quantum circuit simulations we find that better results can be obtained with a more sophisticated substitution method. In such a case, to construct a training set we tailor our choice of classically simulable circuits to an observable of interest, substituting all non-Clifford gates outside its causal cone. By the causal cone definition such a replacement does not affect its expectation value. See Appendix D for a discussion of the causal cone construction. Taking into account the causal cone of the observable is especially important in the case of local observables and large-Q shallow circuits, because in such a case the causal cone contains only a small fraction of all non-Clifford gates of the circuit of interest. Furthermore, for the remaining non-Clifford rotations within the causal cone of the observable of interest we choose both which gate to replace and what gate to replace it with ($S^n$) according to a probability distribution $p(\beta_i, n) \propto e^{-d(\beta_i, n)^2/\sigma^2}$. Here $i$ numbers the remaining non-Clifford rotations in the causal cone. We repeat the procedure until $N$ non-Clifford rotations are left in the causal cone of the observable of interest. Here we use $\sigma = 0.5$. Such a choice of the probability distribution tends to leave unchanged those gates which would be most severely distorted by the replacement. At the same time it produces more diverse training sets than a direct replacement by the closest power of $S$. We observe that in the case of the random quantum circuits the correction is more challenging, as expectation values of the observable of interest become more clustered around 0 with increasing p. Furthermore, we observe that training sets created by the simple substitution method tend to have exact expectation values clustered around 0 more strongly than expectation values of the observable of interest. The more sophisticated procedure generates training sets with more diverse exact expectation values.
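Both substitution strategies can be sketched as below, assuming NumPy. To make the distance insensitive to global phase, the sketch compares the gates in the convention diag(1, e^{iβ}) and diag(1, i^n); this convention, like the function names, is an assumption made for the illustration and may differ from the one used in our numerics.

```python
import numpy as np

def phase_gate_distance(beta, n):
    """Frobenius distance between diag(1, e^{i beta}) and diag(1, i^n)."""
    return abs(np.exp(1j * beta) - 1j ** n)

def closest_clifford_power(beta):
    """Simple strategy: the power n of S that minimizes d(beta, n)."""
    return min(range(4), key=lambda n: phase_gate_distance(beta, n))

def sample_replacement(betas, sigma=0.5, rng=None):
    """Probabilistic strategy: draw (gate index, n) with p ~ exp(-d^2 / sigma^2)."""
    rng = np.random.default_rng() if rng is None else rng
    dist = np.array([[phase_gate_distance(b, n) for n in range(4)] for b in betas])
    weights = np.exp(-dist**2 / sigma**2)
    probs = weights / weights.sum()
    flat_index = rng.choice(probs.size, p=probs.ravel())
    gate_index, n = divmod(int(flat_index), 4)
    return gate_index, n, probs

# Angles of the remaining non-Clifford rotations (illustrative values only).
betas = [0.05, 0.8, 2.9]
gate_index, n, probs = sample_replacement(betas, rng=np.random.default_rng(0))
print(gate_index, n)
print(np.round(probs, 3))
```

Rotations that are already close to a power of S receive most of the probability mass, so gates that would be strongly distorted by a replacement tend to stay in the circuit, while the randomness keeps the training set diverse.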

C. vnCDR
To construct a vnCDR training set we choose the same classically simulable circuits which are used for a CDR training set. We also use the same choice of noise levels as used in the ZNE implementation, namely C = {1, 3, 5} for the Ising and C = {1, 3, 5, 7, 9} for the random quantum circuits mitigation. As in the case of ZNE, we observe that including noise levels higher than 5 does not improve results for the Ising case, while it is beneficial for the random quantum circuits case.

VI. CONCLUSIONS
Data-driven error mitigation involves collecting data from multiple different quantum circuits in order to inform the correction of errors in a particular circuit of interest. In this work, we conceptually unified two distinct, popular methods for data-driven error mitigation: zero-noise extrapolation (ZNE) and Clifford data regression (CDR). Our unified approach, called variable-noise Clifford data regression (vnCDR), appears to be more powerful than the individual methods of ZNE and CDR.
The vnCDR method generates training data from classically simulable near-Clifford circuits, whose noise levels are varied (e.g., by identity insertions). The method then learns how to correct observables on these training circuits. This involves fitting a multidimensional ansatz, which we assume is a hyperplane, to the training data. This enables a guided extrapolation to the noiseless expectation value for the circuit of interest, which dramatically improves the mitigation realized. Rather than doing uninformed extrapolation as in ZNE, the vnCDR method demonstrates that near-Clifford circuits provide an effective guide for the extrapolation process. The fitted ansatz can be further motivated by considering the effect of a global depolarizing channel on some observable of interest. The effect of such a channel is completely removed using the vnCDR ansatz.
We compared vnCDR to both ZNE and CDR on two tasks: correcting the energy of an Ising transverse spin chain and mitigating local observables of random quantum circuits. For both of them we used a realistic noise model obtained by gate set tomography of IBM's Ourense quantum computer. On each of these tasks, vnCDR outperforms both of these state-of-the-art error mitigation methods. Compared to ZNE, vnCDR was shown to tolerate the relatively high noise levels obtained via fixed identity insertions.
Though preliminary scaling results are promising, further testing on real quantum devices will help determine the number of non-Clifford gates and size of the training sets required to attain accurate predictions. It will also help determine limitations of the method while dealing with large and deep noisy circuits, which are challenging for error mitigation methods. Additionally, it would be interesting to apply the vnCDR method using more sophisticated or fine-grained noise amplification schemes such as random identity insertions or pulse stretching. This may enhance performance for the deep circuits necessary to obtain a quantum advantage. In this regime, we envision that vnCDR could play an important role in yielding quantum advantage for chemistry, materials science, and other applications. Finally, we note that further testing is necessary to determine the potential of our method for quantum computing architectures with gate sets other than IBM's gate set.

Appendix A: Mitigation of global depolarizing noise
To motivate the form of the vnCDR ansatz we consider the action of a global depolarizing channel (see Eq. (10)). Assuming this channel acts in our circuit $j$ different times, the final state can be written as
$$\rho_j = (1 - \epsilon)^{j} \rho + \left(1 - (1 - \epsilon)^{j}\right) \frac{\mathbb{1}}{d},$$
where $d$ is the dimension of the system and $\epsilon$ is a parameter characterizing the noise. Considering the effect of the above channel on $X$ leads to
$$\mathrm{Tr}(\rho_j X) = (1 - \epsilon)^{j} \mu + \left(1 - (1 - \epsilon)^{j}\right) \frac{\mathrm{Tr}(X)}{d},$$
where $\mu = \mathrm{Tr}(\rho X)$. As previously discussed, the vnCDR ansatz combines evaluations of the observable of interest at various noise levels:
$$\hat{\mu} = \sum_{j} a_j^{*}\, \mathrm{Tr}(\rho_j X),$$
where the parameters $a_j^{*}$ are chosen by fitting data produced by near-Clifford circuits. The above expression can be expanded as
$$\hat{\mu} = \mu \sum_{j} a_j^{*} (1 - \epsilon)^{j} + \frac{\mathrm{Tr}(X)}{d} \sum_{j} a_j^{*} \left(1 - (1 - \epsilon)^{j}\right).$$
Therefore, for the vnCDR ansatz to completely mitigate the effects of global depolarizing noise, such that $\hat{\mu} = \mu$, we require:
$$\sum_{j} a_j^{*} (1 - \epsilon)^{j} = 1, \qquad \sum_{j} a_j^{*} \left(1 - (1 - \epsilon)^{j}\right) = 0. \tag{A7}$$
The training circuit observables and the observable of interest will behave the same way under such a noise channel. As such, the fitted parameters $a_j^{*}$ will obey the above relations (Eq. (A7)). Therefore, vnCDR can be seen to perfectly mitigate global depolarizing noise for two or more noise levels.
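As a worked illustration of the two-level case, suppose (as an assumption made only for this example) that the depolarizing channel acts once and twice, so that the noisy expectation at level $j \in \{1, 2\}$ is $q^{j}\mu + (1 - q^{j})\,t$ with the shorthand $q = 1 - \epsilon$ and $t = \mathrm{Tr}(X)/d$. The two conditions for perfect mitigation can then be solved in closed form:

```latex
% Conditions for perfect mitigation with two noise levels (j = 1, 2):
%   a_1 q + a_2 q^2 = 1,   a_1 (1 - q) + a_2 (1 - q^2) = 0.
% Adding the two equations gives a_1 + a_2 = 1, hence
\begin{align}
  a_2 (q^2 - q) &= 1 - q
  \quad\Longrightarrow\quad
  a_2 = -\frac{1}{q}, \qquad a_1 = 1 + \frac{1}{q}. \\
% Check on the coefficient of mu:
  a_1 q + a_2 q^2 &= (q + 1) - q = 1, \\
% Check on the coefficient of t:
  a_1 (1 - q) + a_2 (1 - q^2)
  &= (1 - q)\left[\frac{q + 1}{q} - \frac{1 + q}{q}\right] = 0 .
\end{align}
```

The coefficients depend only on $q$ and not on the circuit, which is why parameters fitted on training circuits transfer to the circuit of interest under this noise model.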
It is interesting to consider how this contrasts with the ZNE implementation in this work. We used linear extrapolation to the zero noise limit and least-squares fitting of the noisy expectation values for the observable of interest. This extrapolation method is not expected to perfectly mitigate the effect of global depolarizing noise. An exponential extrapolation would be required in order to perfectly mitigate the effects of this channel and polynomial extrapolation is expected to perform better than linear. However, for our simulations we found in general a simple linear extrapolation gave better results.

Appendix B: Sufficiency of the Clifford training set
Consider a quantum circuit acting on $Q$ qubits which is represented by a noise-free unitary channel $\mathcal{U}$, and let $\rho_0 \in \mathbb{C}^{d \times d}$ be the initial state, where $d = 2^Q$. Also suppose we have some observable of interest $X$ and a collection of channels $\mathcal{E}_0, \mathcal{E}_1, \ldots, \mathcal{E}_n$ representing $n + 1$ different noise levels. Running the circuit with the $j$th noise channel returns the value $\mathrm{Tr}\left((\mathcal{E}_j \circ \mathcal{U})(\rho_0) X\right)$ in expectation.
For a given circuit $\mathcal{V}$, the vnCDR correction is then given by $f(\boldsymbol{\mu}(\mathcal{V})) = \boldsymbol{a} \cdot \boldsymbol{\mu}(\mathcal{V})$, where $\boldsymbol{a}$ is some optimal set of parameters obtained by training the model. Our goal is to show that if the vnCDR estimate $f(\boldsymbol{\mu}(\mathcal{C}))$ is fully accurate for all Clifford circuits $\mathcal{C}$, then the output of $f(\boldsymbol{\mu}(\mathcal{U}))$ is also accurate for estimating the value $\mathrm{Tr}(\mathcal{U}(\rho_0) X)$.
In other words, we achieve zero loss on arbitrary circuits if we obtain zero loss on training data comprising all possible Clifford circuits. We note that a similar remark is made in Ref. [25] to argue that Clifford circuits suffice for their learning-based approach to quasi-probability representation (QPR) error mitigation. However, unlike in Ref. [25], depending on the channels and initial state involved in the error mitigation, there may not exist a set of parameters which achieves zero loss on all Clifford circuits. Hence, the considerations in this section should serve as high-level motivation for the linear form of the ansatz we employ, rather than a rigorous demonstration of the practicality of Clifford-based training sets.

FIG. 6. A causal cone of a single-qubit observable is shown as gates within the dashed contour. Here we consider the case of the hardware-efficient ansatz with $Q = 8$ and $p = 2$. Note that the causal cone is shown for a noisy expectation value, assuming that the Kraus matrices of a noise channel associated with a single- or two-qubit gate act on the same qubits as the gate. This assumption is true for our noise model. The causal cone for a noisy circuit implementation contains the causal cone of the corresponding exact expectation value.
A matrix product operator (MPO) is an example of a tensor network, which corresponds to a 1-D array of tensors. In general, MPOs can describe any mixed state in the 1-D many-body Hilbert space, with the dimensionality of the tensors scaling exponentially with the system size. However, states with sufficiently small entanglement can be efficiently represented as MPOs, making them a convenient numerical tool. For a detailed introduction to MPO methods we refer the reader to Ref. [40]. Consider a $Q$-qubit density matrix, $\rho =$

enough to see a systematic improvement of vnCDR over ZNE for the shallow circuits. With $N_s = 10^4$ we see a systematic improvement of vnCDR over CDR for the shallow circuits and a systematic improvement of vnCDR over ZNE for the deep circuits. With $N_s = 10^5$ shots we also obtain a systematic improvement of vnCDR over CDR for the deep circuits. We note that for setups in which vnCDR does not outperform the other methods, it gives results of similar quality. We observe that increasing $N_s$ improves the performance of the methods only for sufficiently small $N_s$. In the case of ZNE, the results obtained with $N_s = 10^3$ are of similar quality to the ones obtained in the limit $N_s = \infty$. For vnCDR and CDR, $N_s = 10^5$ is needed to that end.
To give a full picture of the number of shots needed for the different methods, one also needs to consider the number of circuits required to mitigate the circuit of interest. ZNE requires only the execution of the circuit of interest at various noise levels. Assuming $n$ noise levels are needed with $N_s$ shots per circuit, the total shot cost for ZNE is $n \times N_s$. Both CDR and vnCDR require the execution of near-Clifford training circuits as well as the circuit of interest. Assuming a training set consisting of $m$ circuits, each run using $N_s$ shots, the total shot cost for CDR is $(m + 1) \times N_s$. vnCDR requires the near-Clifford training circuits and the circuit of interest to be implemented at various noise levels. With $n$ noise levels and $m$ training circuits, each evaluated using $N_s$ shots, the total shot cost for vnCDR is $(m + 1) \times n \times N_s$. For our RQC results, $n = 5$ and $m = 100$. Therefore, the shot cost is $5 \times N_s$ for ZNE, $101 \times N_s$ for CDR, and $505 \times N_s$ for vnCDR. To see a systematic improvement over ZNE and CDR, we therefore need approximately $5 \times 10^5$ to $5 \times 10^6$ and $5 \times 10^6$ to $5 \times 10^7$ shots in total, respectively. These shot counts are attainable on current devices, which supports the practical usefulness of vnCDR.
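The shot accounting above can be checked with a short sketch that evaluates the three cost formulas as stated, using $n = 5$ noise levels and $m = 100$ training circuits. The function and argument names are hypothetical; note that $(m + 1) \times n$ evaluates to 505 for these values.

```python
# Sketch of the total shot cost for each method, following the formulas in
# the text. Function and argument names are hypothetical.

def shot_costs(n_levels, m_train, shots_per_circuit):
    """Return the total number of shots needed by ZNE, CDR and vnCDR."""
    return {
        # circuit of interest at each noise level
        "ZNE": n_levels * shots_per_circuit,
        # m training circuits plus the circuit of interest, one noise level
        "CDR": (m_train + 1) * shots_per_circuit,
        # m training circuits plus the circuit of interest, each noise level
        "vnCDR": (m_train + 1) * n_levels * shots_per_circuit,
    }


per_shot = shot_costs(n_levels=5, m_train=100, shots_per_circuit=1)
totals = {ns: shot_costs(5, 100, ns)["vnCDR"] for ns in (10**3, 10**4, 10**5)}
print(per_shot)
print(totals)
```

For $N_s$ between $10^3$ and $10^5$ the vnCDR total lies between roughly $5 \times 10^5$ and $5 \times 10^7$ shots, consistent with the ranges quoted above.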