Practical Quantum Error Mitigation for Near-Future Applications

It is vital to minimise the impact of errors for near-future quantum devices that will lack the resources for full fault tolerance. Two quantum error mitigation (QEM) techniques have been introduced recently, namely error extrapolation [Li2017,Temme2017] and quasi-probability decomposition [Temme2017]. To enable practical implementation of these ideas, here we account for the inevitable imperfections in the experimentalist's knowledge of the error model itself. We describe a protocol for systematically measuring the effect of errors so as to design efficient QEM circuits. We find that the effect of localised Markovian errors can be fully eliminated by inserting or replacing some gates with certain single-qubit Clifford gates and measurements. Finally, having introduced an exponential variant of the extrapolation method we contrast the QEM techniques using exact numerical simulation of up to 19 qubits in the context of a `SWAP test' circuit. Our optimised methods dramatically reduce the circuit's output error without increasing the qubit count or time requirements.

It is vital to minimise the impact of errors for near-future quantum devices that will lack the resources for full fault tolerance. Two quantum error mitigation (QEM) techniques have been introduced recently, namely error extrapolation [1,2] and quasi-probability decomposition [2]. To enable practical implementation of these ideas, here we account for the inevitable imperfections in the experimentalist's knowledge of the error model itself. We describe a protocol for systematically measuring the effect of errors so as to design efficient QEM circuits. We find that the effect of localised Markovian errors can be fully eliminated by inserting or replacing some gates with certain single-qubit Clifford gates and measurements. Finally, having introduced an exponential variant of the extrapolation method we contrast the QEM techniques using exact numerical simulation of up to 19 qubits in the context of a 'SWAP test' circuit. Our optimised methods dramatically reduce the circuit's output error without increasing the qubit count or time requirements.

I. INTRODUCTION
Controlling noise in quantum systems is crucial for the development of practical technologies. Such noise can occur due to unwanted interactions of a passive qubit with the environment, or due to imperfections in the use of circuit elements that compose the algorithm (qubit initialisation, gates, and measurement). In all cases the result is errors occurring at the level of physical qubits. The theory of quantum fault tolerance (QFT) reveals that the introduction of logical qubits, composed of numerous physical qubits, can allow one to detect and correct errors at the physical level; however this capacity comes at an enormous multiplicative cost in resources. A recent estimate suggests that a Shor algorithm operating on a few thousand logical qubits would require several million physical qubits [3]. While it is encouraging to know that such techniques exist, hardware on this scale is probably at least a decade away. The timely (indeed, urgent) question is, to what extent can we control the impact of errors in computing devices that are too small to support full QFT?
It may prove to be the case that deep quantum algorithms, such as Shor's factoring algorithm and Grover's search algorithm, cannot be successfully executed on classically-intractable problems without the support of QFT. However, fortunately there are other algorithms of potential practical significance that focus on shallow circuits, with the output typically being fed into a classical supervising algorithm so as to form a hybrid system. Such approaches have been proposed for the simulation to aid discovery in chemistry and materials science; see Refs. [1,[4][5][6][7][8][9][10][11][12] for examples. Hybrid systems may be capable of yielding significant results, surpassing conventional computers, even when finite error rates are present because of their error resilience [13,14]. In order to achieve this it is desirable to suppress or mitigate errors to the greatest extent possible while keeping the qubit count ideally unchanged, or increasing only modestly compared to the high cost of full QFT.
Recently two techniques were introduced for quantum error mitigation (QEM) in generic hybrid quantum algorithms where the expected value of an observable -say, a z-basis measurement of a given qubit -is the quantity of interest. The goal is to estimate the value that this observable would take given an error-free circuit, despite the reality that the real experimental system cannot perform operations with less than a certain error rate. Ref. [1] introduced a hybrid algorithm simulating quantum dynamics, which featured an active error minimisation technique involving extrapolation. The experimentalist would execute the circuit with all errors at their minimum achievable severity, obtain the expected value of the observable, and then repeat the exercise having deliberately increased the physical error rate (or having applied additional quantum gates to achieve the same effect). By noting the effect of the increased errors on the observable, the experimentalist would be able to make an extrapolated estimate of the zero-error value, presuming that the error sources had scaled proportionately. The technique was found to be very advantageous in the numerical simulations of few-qubit experiments presented in that paper (see e.g. Fig. 5 in Ref. [1]).
A paper that appeared online at almost the same time was Ref. [2] by the IBM-based team of Temme, Bravyi and Gambetta. This paper presented a comprehensive analysis of the extrapolation technique, which the authors had independently conceived, and moreover it introduced a second technique with using (what we will call) a 'quasi-probability' formalism. The authors explained that by replacing operations in the quantum circuit and assigning parity ±1 to each operation following a certain probability distribution dependent on the noise, an experimentalist can obtain the unbiased estimator, at the cost of an increase in the variance. Their method was shown to be applicable to specific noise types including homogeneous depolarizing errors and damping errors. The authors found both methods to be promising in few-qubit numerical simulations (see e.g. Fig. 2 in Ref. [2]).
As exciting as these studies were, open questions re-mained to be answered before these two techniques could be considered to be fully practical. First, both techniques rely on the full knowledge of the error model, whereas an experimentalist will have imperfect knowledge and the real noise will generally differ from the canonical types considered in these first papers. Second, we need an explicit method to derive the QEM circuits, i.e. a specification of how to algorithmically increase the error rate in the error extrapolation or how to sample circuits in the quasi-probability decomposition. In this paper, we solve these two problems. We find that gate set tomography (GST) [15,16] provides sufficient information to enable full elimination of the impact of localised Markovian errors. As with other process tomography protocols, GST cannot determine the exact physical error model due to noise associated with state preparation and measurement. However, we determine that preparation and measurement noise in GST is not harmful to the overall QEM approach. We also find that single-qubit Clifford gates and measurements are universal in computing expected values. Each quantum operation is a linear map, and single-qubit Clifford gates and measurements yield a complete set of linearly-independent maps (quantum operations). Therefore, any error can be simulated or subtracted by decomposing the error using the complete operation set, which is the standard linear decomposition. We prove that, by combining GST and the complete set decomposition, any localised and Markovian errors in the quantum computer can be systemically mitigated, so that the error in the final computational output is only due to unbiased statistical fluctuation. For the quasi-probability method, we provide an upper bound of the cost in QEM, and we describe the utility of 'twirling' operations [17][18][19] in minimising this cost. For the extrapolation method, which is a relatively straightforward technique, our optimisation is to observe that typically for the classes of noise most common in experiments it is appropriate to assume that the expected value of the observable will decay exponentially with the severity of the circuit noise. Adopting this underlying assumption, rather than a polynomial (e.g. linear) fit, proves to be quite advantageous.
Having thus optimised both the quasi-probability and the extrapolation techniques, we make a series of numerical simulations to study their efficacy. We opt for a specific circuit, a realisation of the 'SWAP test' that is often employed in quantum algorithms as a means for estimating the similarity of quantum states [20,21]. Our SWAP test operates on 2n + 1 qubits, and we simulate a total of 15 qubits over a comprehensive set of cases as well as 19 qubits for specific cases. We numerically simulate the actions of the experimentalist, who must perform many circuit trials in order to make a single estimate of the observable (we choose 10 4 trials). But in order to evaluate our QEM techniques we must then repeat this entire process to determine the distribution of values that the experimentalist might obtain. We perform at least 10 3 repetitions so that the distribution becomes clear, thus at least 10 7 individual numerical experiments are performed for each of the curves that we presently report.

II. ERROR MITIGATION
In this paper, we focus on computing the expected value of an observable in a state (the final state of a quantum circuit) using a quantum computer. It is typical of a number of quantum algorithms and subroutines that the desired output is the expected value of a qubit or qubits -the SWAP-test [20,21] itself, which is a component of algorithms including the recently-introduced auto encoder [22], and several proposed hybrid algorithms for simulating chemical or materials systems [1,[4][5][6][7][8][9].
Without using QEM as shown in Fig. 1(a), the quantum circuit is repeated for many times, and the measurement outcome µ of each time is collected. Then, we can calculate the average µ as our best estimate of the expected value. Given that the number of repetitions is finite, the value of µ is a random variable with an associated distribution. Because the implementation of the quantum circuit is imperfect, it is likely that the distribution of µ is not even centered at the ideal value, i.e. the exact expected value when the quantum circuit is perfectly implemented without error.
When we use QEM as shown in Fig. 1(b), instead of the original quantum circuit, we implement a set of modified circuits. The scheme depicted in the figure is relevant to the quasi-probability method for QEM, but can also apply to the extrapolation method as a means to deliberately boost errors. Each modified circuit is determined by a set of random numbers l. The distribution of random numbers, i.e. modified circuits, depends on the error model, which is measured using GST before the quantum computing. In each run of the quantum experiment, firstly the random number set l is generated, then depending on l a specific circuit is implemented, and finally the measurement outcome µ is collected. Rather than calculating the average µ, we use both l and µ to calculate the average of an effective outcome µ eff (l, µ), which will be given explicitly later. If QEM is successful, the distribution of µ eff (l, µ) is centered at the ideal value, but the distribution is wider than µ. Thus only error due to the statistical fluctuation remains, although it is amplified. By repeating the quantum experiment enough times, we can obtain an accurate computing result of the expected value.
In Sec. IV, we explicitly give the effective outcome µ eff (l, µ). Modified circuits and their distribution are given in Sec. VIII.

Probability
Measurement outcome μ (a) Computing without error mitigation · · · · · · · · · · · · · · · · · · · · · In quantum theory, a quantum state is usually represented by a density matrix ρ, and an observable is represented by a Hermitian operator Q. The expected value of the observable quantity in the state is Q = Tr(Qρ). An operation is a map on the space of states, O(ρ) = k E k ρE † k , expressed in the Kraus form. Because an operation is a linear map, we can always express the operation O as a square matrix, e.g. using the Pauli transfer matrix representation, acting on the state expressed as a column vector |ρ . Similarly, an observable can be expressed as a row vector Q|, and the expected value is Q = Q|ρ . Throughout this paper, we use the Pauli transfer matrix representation, and see Appendix A for details. In quantum tomography, usually we focus on observables that are POVM operators, which is not necessary here.
In the Pauli transfer matrix representation, vectors representing states or observables and matrices representing operations are all real. For n qubits, vectors and matrices are 4 n -dimensional. The expected value of the observable Q in the state ρ going through a sequence of operations O 1 , . . . , O N reads as follows:

IV. QUANTUM COMPUTING BY SAMPLING CIRCUITS
We suppose that the initial state is ρ (0) , which goes through a sequence of operations O N , and in the final state the observable Q (0) is measured. Each time the experimentalist implements this circuit, the measurement returns an eigenvalue of Q (0) , and the probability distribution of eigenstates is determined by the final state. By repeating such a circuit for many times, she can estimate the expected value 1 , and µ (0) is the measurement outcome. Generally in this paper we will use the superscript 0 to denote the ideal noise-free realisation of a state, operation or observable quantity.
In the case that the quantum computation has errors, the actual initial state is ρ, actual operations are O 1 , . . . , O N , and the actually measured observable is Q. As a result, the estimation of the expected value converges to Q = Q|O tot. |ρ rather than Q (0) . Here, O tot. = O N · · · O 1 , and we have assumed that errors are Markovian.
The central idea introduced by the IBM team in Ref. [2] is that one can exactly compensate for the effect of errors by sampling from a set of (real, error-burdened) circuits, each labelled O (l) tot. for l = 1, 2..., provided that their outputs satisfy Ref. [2] describes how the real numbers {q l } which represent quasi-probabilities can be efficiently derived given specific error models, assuming that the experimentalist has full knowledge of the model. Note that each O (l) tot.
denotes the total operation composed by a sequence of operations in the l th circuit.
We can use the Monte Carlo method to compute Q (0) . We note that Q (l) |O is the measurement outcome in the l th circuit. Then To compute Q (0) , we randomly choose a circuit to implement, and the l th circuit is chosen with the probability p l = |q l |/C, where C = l |q l |. Then, the computing result is given by the expected value of effective measurement outcomes, i.e. Q (0) = CE[µ eff ], where the effective outcome is µ eff = sgn(q l )µ (l) if the l th circuit is chosen to be implemented, and µ (l) is the outcome directly obtained in the l th circuit.

V. PER-OPERATION ERROR CORRECTION
We can correct errors in each operation using the quasiprobability method, which will be the primary focus for the following several sections. We suppose that we have a set of initial states satisfying |ρ (0) = lin q When we sample circuits to compute Q (0) = CE[µ eff ], the initial state is |ρ (lin) with probability p li |/C i , and the observable is Q (lout) | with probability p lα |, and C = C in C 1 · · · C N C out accordingly. To calculate µ eff , we use sgn(q lout ).

VI. VARIANCE AMPLIFICATION IN QUASI-PROBABILITY DECOMPOSITION
The presence of quasi-probabilities taking negative values amplifies the variance of the expected value of the observable. We consider the case that Q (l) is a Pauli operator (maybe with error) and the measurement reports two kinds of outcomes denoted by ±1, respectively. In this case, the distribution is binomial. The standard deviation of the average of outcomes in the Monte Carlo Here, N r is the total number of samples, i.e. the total number of circuits of all kinds which the experimentalist performs is N r . We compare this to the error-free computing, i.e. the ideal original circuit Q (0) |O (0) tot. |ρ (0) is repeated for N (0) r times to estimate Q (0) . For the error-free computing, the standard deviation is given by r . Therefore, to achieve the  same accuracy, i.e. σ = σ (0) , the error-mitigated computation needs N r /N ) times more samples than the error-free computation. Here, we have used the fact that the error-mitigated computation and the error-free computation should converge to the same value of Q (0) , i.e. E[µ (0) ] = CE[µ eff ].
In order to limit the standard deviation to be σ ∼ , we can choose N r ∼ (C/ ) 2 . Therefore, if the factor C is larger, the computing takes longer.
Because C = C in C 1 · · · C N C out if errors are corrected for each operation, we call C α − 1 the cost for mitigating error in the corresponding operation. The overall cost therefore increases with the number of operations, thus it is important to reduce the operation number, e.g. in quantum computer with qubits fully connected [23], operations for communication are not required, which may significantly reduce the cost.

VII. UNIVERSAL OPERATION SET
The set of operations including measurement and single-qubit Clifford gates is universal in computing expected values of observables. The relevant measurement operation reads [π] = [ 1 2 (1 1 + σ z )], which projects a qubit to the state |0 . Here, [U ](ρ) = U ρU † denotes a superoperator. Such a non-destructive measurement can be realised using a destructive measurement followed by initialising the qubit in the state |0 The measurement superoperator [π] also means postselection, i.e. if the outcome of the measurement corresponding to [π] (which is not the final measurement on the observable Q (l) ) is |1 in a trial, the value of the observable Q (l) is noted as µ (l) = 0, but the trial is counted in the total number of samples in the Monte Carlo calculation. If Q (l) has two values ±1, we can estimate the value of Q (l) |O times the circuit does not pass postselections (i.e. µ (l) = 0), and for N (l) ±1 times the circuit passes all post-selections and reports Q (l) = ±1 (i.e. µ (l) = ±1). It is the same when we compute Q (0) using the Monte Carlo method. If the effect outcome is µ eff = 0 with the probability P 0 , then the standard deviation of the Monte Carlo calculation becomes σ = In Table I i |i = 1, . . . , 16} to denote these sixteen operations. Because they are linearly independent, any single-qubit operation O, which is a 4 × 4 real matrix, can be expressed as a linear combination of sixteen basis opera- i . Similarly, multi-qubit operations can be expressed as a linear combination of tensor products of basis operations. Using the quasi-probability method, any computation of expected values of observables can be realised using this operation set.
Note that these basis operations are universal, as one can verify by constructing a non-Clifford gate or an entangling gate: We can decompose T gate using our basis operations as for controlled-NOT as a second example). However this construction would not be used in practice -it is not an efficient means to actually implement a desired T in the basic circuit, since the corresponding cost C = √ 2 would imply and unacceptably steep exponential in the time overhead, as one would expect from e.g. Refs. [24][25][26]. Instead we rely on the assumption that the experimental system can directly implement a universal set of gates (including entangling and non-Clifford gates) with a reasonably high fidelity. Then rather than fully synthesising any of the basic gates using our basis, we need only compensate for slight imperfections. The cost for doing so, for each imperfect gate, is then ∼ C = 1 + δ as we presently discuss.
Having obtained the complete operation set in Table I we can use it in deriving the protocol that will compensate for errors. In this paper, we focus on the case that errors are localised: An (error-free) operation that is applied on a set of qubits S is a 4 |S| -dimensional real matrix, then the corresponding operation in real (i.e. errorburdened) O is also a 4 |S| -dimensional real matrix acting on the same set of qubits. The overall operation on the entire system can be expressed as 1 1S ⊗ O, where 1 1S is the identity acting on all other qubits. It is similar for the initialisation and measurement. If each qubit is initialised individually, the overall initial state is m |ρ m , where |ρ m is a 2-dimensional real vector representing the m th qubit's initial state. Similarly, individual measurement of qubits implies that the overall measured observable is m Q m |, where Q m | is a 2-dimensional real vector representing the measured observable for the m th qubit. In this case, a single-qubit operation with error can still be expressed using a 4 × 4 real matrix. We suppose that for a qubit, sixteen basis operations with errors are {B i |i = 1, . . . , 16}, which are all 4 × 4 real matrices. When errors are not significant, these sixteen bases should still be linearly independent, i.e. the set of basis operations with errors is still universal.
To make this statement more precise, we consider the 16 × 16 real matrix ( Here, (B i ) •,j denotes the j th column of the matrix of the basis operation B i . Sixteen basis operations are linearly 16} as the measure of the error severity in basis operations.
. We remark that even if max exceeds the threshold, basis operations are still likely to be linearly independent.

VIII. ERROR MITIGATION USING BASIS OPERATIONS
Given an operation with error O, we can use sixteen basis operations to correct the error, i.e. realise the operation without error O (0) . There are two ways for correcting the error.
Compensation method. The operation O is close to O (0) . Therefore, we can keep the correct component of O and only decompose the error component using basis operations. We decompose the operation without error as Error-mitigation circuits. The choice of a basis operation is determined by the corresponding random number i, j or k. Original gate that is identity (memory operation) also has to be error-mitigated, unless memory error is negligible. In the compensation method, ether the original gate or basis operations are applied depending on the random number. (b) The schematic of the linear extrapolation (orange curve) and exponential extrapolation (green curve).
the decomposition always exists, and there is only one solution of coefficients {q i }. However, the inverse method can only be applied if the matrix O is invertible.
For multi-qubit operations, the decomposition is performed using tensor products of basis operations, as described explicitly in Appendix D. Although basis operations are not entangling, we can use basis operations to efficiently mitigate multi-qubit errors and errors that can entangle qubits. As an example, we show how to decompose the controlled-NOT gate only using basis operations in Appendix B, which suffices to imply that any error in the form of the controlled-NOT gate can be mitigated using basis operations.
Initialisation and measurement errors can also be corrected using basis operations. Taking first the case of initialisation errors: If |ρ is the error-burdened initial state, and it is a non-zero vector, we can always find a transformation T that satisfies |ρ (0) = T |ρ where |ρ (0) is the error-free initial state. Thus by decomposing T using basis operations and applying it after the initialisation, we can prepare the initial state without er-ror. Actually, given an initial state that is close to |0 , we can generate a complete set of linearly independent vectors {|ρ k } using basis operations. With these vectors, we can decompose the initial state without error as A similar approach yields the corresponding result for measurement: For an observable Q| there will be some If an observable is close to σ z then a linearly independent set { Q j |} can be generated; then the error-free observable Circuits for QEM are shown in Fig. 2(a). Given quasiprobabilities, we can compute the corresponding probability in sampling circuits as shown in Sec. V. More details of QEM using basis operations are given in Appendix D.
Using the same technique, we can also increase the error in an operation, as required by the alternative error extrapolation method for QEM. Instead of decomposing the error-free operation O (0) using O and basis operations, we can also decompose the error-boosted operation using O and basis operations. It is similar for initial states and observables. We have noted that in the decomposition of an errorfree operation, there are always some negative quasiprobabilities, i.e. the C factor is greater than 1, which leads to greater time costs. But fortunately when we merely wish to decompose an error-boosted operation we can do so without introducing negative quasi-probability, e.g. by boosting Pauli errors using Pauli gates [1].

IX. QUANTUM GATE SET TOMOGRAPHY
We can measure a set of initial states {|ρ k }, observables { Q j |} and operations {Ō i } (including basis operations) using GST [15,16]. These vectors and matrices with the bar notation describe the actual physical system. Because there are errors in both initial states and observables, and initialisation and measurement errors cannot be distinguished, we may not obtain exactly these vectors and matrices describing the actual physical system. Instead, the vectors and matrices obtained using GST are If we know {|ρ k }, { Q j |} and {Ō i } because the physical system is well understood, we can directly use them in QEM. If our knowledge about the physical system is not enough, we can use GST to obtain {|ρ k }, { Q j |} and {Ô i }. We will show that, although the estimations may not be exact, we can exactly correct errors by using these estimations in QEM.
Using the protocol in Refs. [15,16] (also see Appendix E), the estimation of an operation and the actual physical operation are similar matrices, i.e.Ô i = TM in−1Ō iM in T −1 , whereM in is a matrix determined by initial states (i.e.M in σ,k = σ|ρ k ), and T is an arbitrary invertible matrix. We note that T andM in are independent of the operationŌ i , andM in cannot be determined by GST. By choosing T , we can obtain different estimations of the operation set. Similarly, All operations are transformed by the same similarity transformation, and initial states and observables are also transformed accordingly. As a result, these estimations obtained by GST can exactly predict the expected value of an observable, i.e. Q j |Ō N · · ·Ō 1 |ρ k = Q j |Ô N · · ·Ô 1 |ρ k . Therefore, we can directly use these estimations in QEM, and the similarity transformation does not lead to any computing error.
Using GST estimations in QEM, the actual operations realised in this way differ from operations without error, but the computing result is correct. To correctly obtain Q (0) |O (0) |ρ (0) , we decompose the initial state, the observable and the operation using the desired error-free output. The cost of this adaption lies in the potential increase to the number of samples required, as shown in Fig. 3 and discussed in the caption.
We would like to remark that, when errors in actual operations are small, errors in estimations of operations are also small. If we take a proper strategy for choosing T , and errors in initial states and observables are small, the estimation of an operationÔ is close to the operation without error O (0) when the actual operationŌ is close to O (0) . It is similar for estimations of initial states and observables. See Appendix F for details.

X. ESTIMATION OF THE COST
In general, when the error in an operation is more significant, there is a higher cost for mitigating the error (to a given level of suppression). We max as the measure of the error severity in the operation, where O (O (0) ) is the n-qubit operation with (without) error. An upper bound of the cost for correcting error in O is Here, max is the maximum error in all basis operations for all n qubits, and s min (A (0) ) = 1 2 (13 − 3 √ 17) ≈ 0.315. Similar upper bounds can be obtained for correcting er-rors in initial states and observables. See Appendix G for details.
There are several ways for reducing the cost. The upper bound of the cost is obtained using the compensation method and taking λ = 1. In general, we can optimise the value of λ or use the inverse method to minimise the cost. For example, for the depolarising error model (see Appendix H), the cost of using the inverse method is lower than using the compensation method [see Fig. 3(a)]. We remark that, to obtain data for the compensation method in Fig. 3, we have optimised the value of λ. If we use estimations obtained from GST to correct errors, we can optimise the T matrix to minimise the cost. In Fig. 3(a), we can find that, without optimising T matrices, the cost using estimations obtained from GST is higher than using actual operations. If we choose the matrix in the form T = ⊗ n m=1 T m , where T m is a 4-dimensional real matrix corresponding to the m th qubit, there are total 16n parameters to be optimised for a n-qubit quantum computer, which is a non-trivial task. Under some reasonable conditions, we can also use the Pauli twirling [17][18][19] to reduce the cost.

Pauli twirling
In many quantum computing systems, e.g. superconducting qubits [27] and ion traps [28][29][30], the fidelity of single-qubit gates is much better than the fidelity of twoqubit gates, and usually a state can be initialised with a high fidelity while the fidelity of measurement is worse. In this section, we focus on the case that error rates of initialisation and single-qubit gates are much lower than error rates of two-qubit gates and measurement.
If the error rate of initialisation is low (much lower than the error rate of measurement), we know how to choose T so that the estimation of an operation obtained from GST is close to the actual operation. We cannot exactly estimate operations using GST, because we cannot distinguish initialisation and measurement errors. If we treat all errors in the initialisation and measurement as measurement error (which corresponds to T = M in(0)⊗n in Appendix F), the difference between the estimation and the actual operation is only determined by the initialisation error. Therefore, if the initialisation is highfidelity, the estimation obtained in this way and the actual operation are close.
Because the set of basis operations includes Pauli gates, it is easy to use basis operations to correct Pauli errors. By using the Pauli twirling, we can convert the error in a two-qubit entangling Clifford gate to Pauli error [17][18][19], which is achieved by applying Pauli gates before and after the two-qubit gate. This treatment of the error is feasible only if the fidelity of Pauli gates is much better than the two-qubit gate, otherwise Pauli gates cause significant new errors, which may not be Pauli error, on the two-qubit gate. In Fig. 3(b), we can find that the cost can be significantly reduce by using the Pauli twirling for max. The x-axis illustrates the maximum distance over all operations in the universal set. In (b) and (c), we always use GST estimations. In (c), Pauli twirling and the inverse method are used for all the data. Pauli twirling is applied to the measurement and two-qubit gate, and the inverse method is only applied to the two-qubit gate, while errors in other operations are corrected using the compensation method. We remark that usually the maximum distance and the maximum cost are given by the two-qubit gate.
the over-rotation error model (see Appendix H).
In Fig. 3(c), costs of different error models are compared, including the depolarising model, pure-dephasing model, amplitude-damping model and the over-rotation model. We also randomly generated many other error models, please see Appendix H for details of these error models. For a random-operation model, we randomly generate an operation close to the ideal error-free operation, and we find that the cost is approximately the cost of the depolarising model. For a random-field model, we randomly generate a Hamiltonian that drives the erroneous evolution, and the cost is between the depolarising model and over-rotation model. From Fig. 3(c) we see that the cost of quantum error mitigation varies according to the error model but is generally upper-bounded by the case of depolarising noise, over the range of noise levels shown here. (Note that other models can exceed the cost of the the depolarising model if we use even lower fidelity gates). For the depolarising model, the cost for mitigating error in a two-qubit entangling gate is C − 1 a , where is the error rate and the factor a is between 2 and 3 [see Fig. 3(a)]. If errors in initialisation and single-qubit gates are negligible, or if the matrix T is optimised to minimise the cost, the factor a can approach 2. Accepting the depolarising model as an approximate upper bound, we can estimate the overall cost in a quantum algorithm. Suppose the total number of gates in a quantum algorithm is N , the overall amplification of the standard deviation (uncertainty of the computing result) is (1+2 ) N . Therefore, (1 + 2 ) 2N times more repetitions of the experiment are required in order to reduce the standard deviation. We are interested in the case that N is large but is small, therefore, (1 + 2 ) 2N ∼ e 4N . As a rule of thumb we might take N = 2 as a limit for acceptable scenarios, since then e 4N ≈ 3, 000. However, larger overhead factors may be acceptable depending on the speed of the quantum computer.

XI. NUMERICAL SIMULATION
In our numerical simulation, we apply QEM to the SWAP-test circuit [20] shown in Fig. 4, in which we realise each controlled-SWAP gate using Toffoli gates and realise each Toffoli gate using T gates, T † gates, Hadamard gates and controlled-NOT gates [31]. We note with interest that very recently, the implementation of a SWAP test using shallow circuit has been proposed [21]. However, for present purposes it is not essential to use an optimised realisation of the SWAP circuit; its role is simply to act as a real test case for our technique and indeed the is a probe qubit, and the expected value of Z gives the overlap between states of two qubit groups (denoted green and orange, respectively). Green qubits are prepared in the GHZ state (|00 · · · + |11 · · · )/ √ 2, and orange qubits are prepared in |00 · · · . Therefore, the ideal expected value of Z is 0.5. considerable depth of our non-optimal circuit is helpful here. The number of gates scales as 23N q − 21, where N q is the number of qubits (e.g. N q = 7 in Fig. 4). Without error, the expected value of the observable Z (σ z of the probe qubit) in the SWAP-test circuit in Fig. 4 is 0.5.
We consider error models according to which the same noise E is applied after the initialisation to the state |0 , before the measurement, and before and after each gate. For the controlled-NOT gate, the noise applied is E ⊗ E on two qubits. We remark that basis operations are also affected by noise likewise. We consider two types of noise: inhomogeneous Pauli error and leakage error, which can be respectively described as where p α is the probability of the error [σ α ], and p is the probability of the leakage error from the state |1 . It is worth mentioning that the leakage error is a nontrace-preserving error. In our simulations, we set p x = p y = 0.0001, p z = 0.0006, and p = 0.0008. Thus in both models the total error rate is 0.08% for initialisation and measurement, 0.16% for single-qubit gates and 0.32% for two-qubit gates, which is achievable with two-qubit gates in ion traps [29] and can be far surpassed for one-qubit gates [28]. Moreover, with these numbers the expected total number of error events in circuits of the depth and breadth that we consider here is approximately unity; this is a challenging domain for error mitigation.
In addition to quasi-probability decomposition (see Appendix I for an instruction of the implementation), we also study the extrapolation technique introduced in Ref. [1]. The expected value of Z obtained by running the SWAP-test circuit in a quantum computer with noise depends on the error rate, i.e. it is a function that can be denoted as Z ( ), where is the overall error rate. For our first set of numerical experiments we consider linear extrapolation to the error-free value Z (0) as follows: We obtain the expected value Z ( 0 ) with the lowest attainable error rate 0 , and by increasing error rate to r 0 with r > 1, we obtain another expected value Z (r 0 ). Using these two values, we can infer Z (0) = (r Z ( 0 ) − Z (r 0 ))/(r − 1) as shown in Fig. 2(b), which is the final estimation of Z . Here, we set r = 2.
The first set of numerical results are shown in Fig. 5. We assume that the experimentalist makes her overall estimate of the Z after she performs 10 4 individual experiments. We take this number of runs as a fixed constraint (effectively, we are constraining her overall time resource), and she may choose to employ those runs using one of three alternative approaches: no error correction, linear extrapolation, and quasi-probability decomposition (using basis operations and incorporating GST). In each experiment the SWAP-test circuit or its variant for the purpose of QEM is implemented. Because of the finite number of samples, the estimation is stochastic. Therefore, in our numerical simulation we perform the appropriate series of 10 4 experiments, mirroring the actions of the experimentalist, and then we repeat ≥ 1, 000 times in order to determine the distribution of final estimations that may be obtained. The distribution for each case is plotted in Fig. 5.
We can observe that both QEM approaches can improve the result, i.e. the corresponding distributions are shifted closer to the ideal value 0.5 compared to the approach without QEM. For the inhomogeneous Pauli error model, the means of distributions are at 0.1961, 0.3415, and 0.5011 for the three approaches, respectively. The distribution of the quasi-probability approach is centered at the ideal value, which clearly shows its desirable property of completely removing any systematic bias. However, the distribution is wider (as we expected) compared to other two approaches. A more fair metric would be the expected absolute error versus ideal value (i.e. | Z − 0.5|). Given an ideal error-free computer and 10 4 trials, this metric would evaluate to 0.006910. Using the error-prone computer with our three protocols the three corresponding values are 0.3039, 0.1853 and 0.0491. Similarly, for the leakage error model, the means for three approaches now lie at 0.3819, 0.4710, 0.5007, while the expected absolute error evaluates to 0.1181, 0.0294, and 0.0434.
From these results it may appear that (given a large but reasonable number of samples) the quasi-probability technique outperforms the extrapolation method, with the latter unable to approach the mean of the errorfree circuit. However, here the extrapolation method was limited to linear interpolation whereas the physical error rates are high enough that the linear assumption is poor. One could fit a higher order polynomial using more data points (here, we have only used two: one derived from the actual lowest possible error rate and one boosted to twice the error rate); however since we are limiting the total number of experimental runs to 10 4 this would lead to greater noise in each data point. Moreover, as we now argue, the underlying tend is likely to be well-approximated by an exponential decay rather than a polynomial one (i.e. the expected value of the observable falls exponentially with the physical error rate) and two data points will suffice to estimate the zero-error observable under that assumption.
In Fig. 6 we show the results when the experimentalist indeed assumes that the expected value Z ( ) changes exponentially with respect to the error rate and converges to 0 in the limit of → ∞. Then she will infer the error-free value as Here we take r = 2.
As shown in Fig. 6, the distribution of the final result using the exponential extrapolation approaches the ideal value of Z (which is 0.5 for the SWAP-test circuit) much better than the linear extrapolation. Given the same 10 4 experimental runs, the mean of the experimentalist's estimate is now 0.5111 for the inhomogeneous Pauli error model and 0.4986 for the leakage error model. These numbers almost rival those of quasi-probability technique but do so with a smaller variance. The expected absolute error for inhomogeneous Pauli error and leakage error are 0.06501 and 0.01882, respectively. For the latter, the expected absolute error comes within a factor of three of the shot-noise limit that would be achieved by error-free ideal hardware (0.00691). This is despite the fact that our error-burdened circuits have error rates corresponding to at least one error event per circuit. We emphasise that this suppression results purely from the QEM protocol i.e. it is achieved at no cost in terms of the qubit count or the total number of runs (constrained to 10 4 ).
Due to the limited power of classical computer we utilised, our exact numerical simulations did not involve more than 19 qubits. However, it is of course very interesting to assess the relevance of our techniques to quantum computing using over 50 qubits, which is in the so called 'quantum supremacy' regime. Therefore, we estimate the cost of quantum error mitigation in the SWAPtest circuit, using the same error models in our numerical simulation and error rates achievable in ion trap experiments [28,29], i.e. the error rate of two-qubit gate is 0.1% and error rates of single-qubit operations are 0.01%. Take for example the SWAP test with N q = 51 qubits (the number of gates is 1, 152). For the inhomogeneous Pauli error model, the overall cost is C = 2.956, which implies that we can attain the same computing precision as the ideal case if we have C 2 = 8.738 times more repetitions of the experiment, which is experimentally feasible. For the leakage error model, the cost for the 51-qubit SWAP test is C = 4.338, which means C 2 = 18.818 times more repetitions. A plot showing how the cost scales versus qubit count is shown in Fig. 7.
We also evaluate C 2 for a fully paralleled circuit, whose circuit depth is N q , and each layer has N q /2 single qubit gates and N q /4 controlled-NOT gates, which means the quantum circuit has N 2 q /2 single qubit gates and N 2 q /4 controlled-NOT gates. As single qubit gates, we use T gate, S gate and Hadamard gate, because these gates plus controlled-NOT gate constitute a universal gate set, and we equally assign the number of qubits to these three types of single qubit gates. We plot C 2 versus the number of qubits for the SWAP-test circuit and the fully paralleled circuit in Fig. 7. We observe that for the SWAP test circuit, it is feasible to venture into the 'supremacy' regime with today's best fidelities; for the more demanding case of full parallelism (so that the gate count scales as N 2 q ) we see that today's error rates would not suffice much beyond 50 qubits, but that error rates ten times lower would easily suffice for 80 qubits and beyond.

XII. INTUITION FOR EXPONENTIAL EXTRAPOLATION
Intuitively, the explanation for the success of the exponential extrapolation is as follows. We express the i th noise event occurring in the quantum circuit as where E ( ) is the error component. The only assumption is that the error component only weakly depends the error rate (see Appendix J). Now, for simplification we ignore the computing operations, which do not affect our general argument. The total noise that the entire quantum circuit experiences is Here, N is the total number of the noise-burdened operations. Expanding the overall noise, we get where X n = N n −1 × the sum of terms where E appears for n times . (7) Note that the coefficient of X n in the overall noise corresponds to a binomial distribution, which can be approximated by the Poisson distribution. We have We can find that the impact of the overall noise on the expected value of some observable is proportional to e −N , which implies that exponential extrapolation works better than linear extrapolation.

CONCLUSIONS
We have demonstrated that, following our protocol step by step, an experimentalist can derive an algorithm to run on a noisy quantum computer so as to estimate an output observable with zero bias versus the ideal observable. The experimentalist does not require any prior knowledge of the physical property of the noise, and the only condition is that the noise is localised and Markovian. For this purpose, we have shown that quantum gate set tomography is a perfect tool for measuring the noise in a quantum computer, if the aim is only to compensate the effect of the noise in quantum computing; and we also have shown that single-qubit Clifford gates and measurement can derive a complete set of operations that can compensate any noise in quantum computing. The price of using such a systematic method to negate computing errors is that the quantum computation needs to run for a longer time than an error-free system. We verify the protocol with numerical simulations of up to 19 qubits, in which an alternative method, i.e. exponential error extrapolation, is introduced and studied. We find that the estimation using exponential error extrapolation is also very accurate, while the computing time could be shorter. An approach combining two methods may optimise both accuracy and efficiency.
In Appendix I we describe in detail the steps that an experimentalist would take in order to realise the quasiprobability method. We hope that this compact summary, presented in a single section, will indeed be useful to researchers who are interested in demonstrating the QEM technique with their hardware.
Our general conclusion is that these quantum error mitigation techniques can dramatically enhance the performance of quantum computers, especially at the smallto-medium scale where full code-based quantum error correction is impossible. Our simulations have considered circuits up to 19 qubits, but with error rates considerably worse than the state of the art. Extrapolating from the trends that we observe in these smaller systems, we anticipate that hybrid algorithms involving 50+ qubits, i.e. beyond the reach of classical emulation, will benefit from QEM techniques if the hardware fidelity matches today's state-of-the-art error or modestly improves upon it. where the vector element is Here, we use notations ·| and |· to denote real row and column vectors, respectively. A physical operation O (i.e. O(ρ) = k E k ρE † k ) can be expressed as a real square matrix where σ, τ ∈ {1 1, σ x , σ y , σ z } ⊗n are Pauli operators. If ρ = O(ρ), we have |ρ = O|ρ .

Appendix B: Decomposition of controlled-NOT gate using basis operations
The controlled-NOT gate reads The controlled-NOT gate can be decomposed as Then, the corresponding cost is given by C = 9.

Appendix C: Error threshold of basis operations
For two real matrices A (0) and A and a non-zero real vector x, we have where s min (A (0) ) is the minimum singular value of A (0) . We also have Therefore, Now, A is the matrix formed by basis operations with error as defined in Eq. (2), and A (0) is the matrix formed by basis operations without error. Because det(A (0) ) = 16, A (0) is invertible, i.e. basis operations without error are linearly independent. The minimum singular value is s min (A (0) ) = 1 2 (13 − 3

Appendix D: Decomposition using basis operations
We consider the n-qubit operation E. For each qubit, there is a set of basis operations {B m,i |i = 1, . . . , 16}, where m = 1, . . . , n is the label of the qubit. For each set of basis operations, there is a matrix A as defined in Eq. (2). We use A m to denote the matrix of the m th qubit.
The operation E is decomposed as Coefficients form a 16 n -dimensional vector Therefore, the decomposition is given by We choose the order of Pauli operators, i.e. the order of bases of Pauli transfer matrices {B m,i |i = 1, . . . , 16}, as 1 1, σ x , σ y and σ z (which are also denoted as I, X, Y and Z, respectively). Then, to be consistent with A 1 ⊗ · · · ⊗ A n , we have Here, α m (α = I, X, Y, Z) is a Pauli operator of the m th qubit.
The state of a qubit is represented by a 4-dimensional real vector. To decompose the initial state of a qubit without error |ρ (0) , we need four linearly independent initial states. If the qubit can be initialised in the state |0 , we can choose the set of four states as {ρ Similarly, an observable of a qubit is also represented by a 4-dimensional real vector. To decompose the observable of a qubit without error Q (0) |, we need four linearly independent observables. If σ z can be measured, we can choose the set of four observables as Pauli operators {Q (0) j } = {1 1, σ x , σ y , σ z }. The operator 1 1 denotes a trivial measurement, i.e. the outcome is always +1. Measurements of other three Pauli operators can be obtained by applying basis-adjusting operations (Clifford gates)

Appendix E: Quantum gate set tomography
To measure a set of operations {Ō 1 , . . . ,Ō N } on n qubits using GST, we need to choose a set of 4 n linearly independent initial states {ρ k } and a set of 4 n linearly independent observables {Q j }. Given these initial states and observables, we measure expected values Here,Ō is one of operations {Ō 1 , . . . ,Ō N }.
The matrixÕ is equivalent toŌ up to a transformation. BecauseŌ σ,τ = σ|Ō|τ and σ |σ σ| = 1 1 (the sum is taken over all Pauli operators), we havẽ whereM in andM out are matrices defined asM in σ,k = σ|ρ k andM out j,σ = Q j |σ . We remark that initialisation error and measurement error are included inM in andM out , respectively. We cannot measure matrices M in andM out independently, therefore we cannot deter-mineŌ using GST. By takingŌ as the identity operation (i.e.Ō = 1 1) in Eq. (E1), we can measure The estimation ofŌ is given bŷ Here, g andÕ are obtained by measuring the expected values of observables, and T is an arbitrary invertible matrix. IfM in and T are different,Ô is different from O, but they are always similar matrices. The estimations of states |ρ k and observables Q j | are given by Here, M •,k (M j,• ) denotes the k th column (j th row) of the matrix M . We introduce matricesM in andM out defined aŝ M in σ,k = σ|ρ k andM out j,σ = Q j |σ , respectively.
For a sequence of operationsŌ 1 , . . . ,Ō N , becauseÔ i andŌ i are similar matrices up to the same transformation independent of the operation (i.e. the index i), we have Q j |Ō N · · ·Ō 1 |ρ k = Q j |Ô N · · ·Ô 1 |ρ k . (E7) Therefore, although estimations {|ρ k , Q j |,Ô i } may be different from their correspondences {|ρ k , Q j |,Ō i }, they can always provide the correct prediction for the expected value of an observable in an initial state going through a sequence of operations. invertible, i.e. g =M outM in is always invertible, ifε in < s min (M in(0) ) andε out < s min (M out(0) ). Choosing {ρ Here, we have used that 3.0204. The severity of the error in the estimation of a n-qubit operation iŝ as we will show next. Here, For an invertible matrix A, we have Then, using the inequality (C3), we have We have the expression First, we have Ō 2 ≤ O (0) 2 +ε O . Second, using we have Third, using the inequality (F10), we have

Appendix G: Upper bound of the cost
We consider compensation method and take λ = 1, i.e. the n-qubit operation without error is realised as where E is decomposed using basis operations as shown in Eq. D1. Then the cost for correcting the error in O is determined by Here, q and E are 16 n -dimensional vectors, and A 1 ⊗ · · · ⊗ A n is a 16 n -dimensional matrix. Therefore, for each element of q, Here, we have used that the maximum absolute value of Using the inequality (F10), we have Here, max = max{ A l − A (0) max |l = 1, . . . , n}. There are total 16 n decomposition coefficients, therefore Here, we have assumed that max < s min (A (0) )/16. We consider using the set of initial states with errors {|ρ k } to realise the initial state |ρ , which is in the set of initial states without error {|ρ (0) k }. The initial state can be decomposed as |ρ max as the measure of the error severity in initial states. Using |ρ k0 − |ρ we have the cost for correcting errors in initial states It is similar for observables. We consider using the set of observables with errors { Q j |} to realise the observable Q max as the measure of the error severity in observables. Then, the cost for correcting errors in measured observables is , which is equivalent to the controlled-NOT gate and controlled-phase gate up to single-qubit gates. The sixteen basis operations can be realised as shown in Table I. In order to perform GST, we choose initial states and observables as in Appendix D, and we choose the invertible matrix T = M in(0)⊗n for n qubits.
For the initialisation, the state prepared is ρ 0 rather than |0 0|. We can always express the initialisation operation with error as I = N i I, where N i (|0 0|) = ρ 0 .
A POVM is defined by a set of operators {E k } satisfying k E † k E k = 1 1. In a POVM, the state is mapped to E k ρE † k when the outcome is k. When the measurement has error, we may not be able to obtain all the information k. Usually there are only two outcomes corresponding to |0 and |1 , respectively. In this case, maybe several k values correspond to the same outcome ν = 0, 1. Therefore, we model the projective measurement [π] with error as Mρ = k∈K0 E k ρE † k , where K ν is the set of k corresponding to the measurement outcome ν.
For a gate without error G (0) , the gate with error can be expressed as G = N a G (0) N b . Any noisy gate can be expressed in this form: Because G (0) is invertible, we can always take N b = [1 1] and N a = GG (0)−1 .
We suppose that time costs of the measurement [π] and the two-qubit gate [Λ] are the same, and time costs of single qubit gates are negligible.
We distinguish the identity operation and the memory operation. Without error, both of them are the same operation [1 1]. In any case, the identity operation is [1 1], which means that the next operation is performed immediately, so it takes no time and there is not any memory error. When the memory operation is performed, the qubit waits for the next operation, so memory errors may occur on it. We apply the identity operation for measuring the matrix g (see Appendix E). In the basis operation set, the operation [1 1] is replaced by the memory operation.
We set the cycle time of the computing as the time cost of the measurement and the two-qubit gate. In one cycle, only one operation is performed on a qubit. If the operation is a single-qubit gate, the gate is performed at the middle of the cycle, i.e. the overall operation is N m GN m , where N m denotes memory noise. If no gate or measurement is performed in the cycle, the overall operation is N 2 m , which is the error version of the operation [1 1] in the basis operation set.
Similarly, we can obtain the 4 × 4 matrix g by choosing not to apply any gate to initial state, so that the matrix elements are g j,k = Tr Q jρk .
This process is implemented for each qubit and each type of single-qubit gate (including all basis operations). One will find that there is a freedom in the specification of the gateŌ. Legitimate variants can be obtained aŝ where T is an invertible 4 × 4 matrix. The matrix T can be different for different qubits but must be the same for all gates on the same qubit. We can choose T to minimise the cost in QEM. In the case that the error rate of preparing initial states is low, we can take which approximately minimise the cost according to our experience.
Estimations of the initial stateρ k and the observablē Q j are respectively Here, M •,k (M j,• ) denotes the k th column (j th row) of the matrix M .
To measure a two-qubit gate using GST, the procedure is basically the same. The only difference is that there are 16 initial states and 16 observables to be measured. Initial states are the tensor products of single-qubit initial states, i.e.ρ (1) k1 ⊗ρ (2) k2 , and observables are tensor products of single-qubit observables, i.e.Q (1) j1 ⊗Q (2) j2 . Here, the superscript is the label of the qubit. Accordingly, the matrix g = g (1) ⊗ g (2) , which is the tensor product of g matrices of two qubits, and similarly the matrix T = T (1) ⊗ T (2) , which is the tensor product of T matrices of two qubits. We need to implement two-qubit gate GST for each pair of qubits that the two-qubit gate may be performed on.

Quasi-probability decomposition
Using results obtained from GST, we can compute the quasi-probability decomposition. From GST we obtain estimations of initial states, observables to be measured, gates (including basis operations), and they are |ρ k : Initial state Q j | : Observablê O : Gatê B i : Basis operation These estimations are utilised to compute the quasiprobability decomposition. Now, we focus on the inverse method. We use O (0) to denote the Pauli transfer matrix of the ideal gate (without error). The estimation of the Pauli transfer matrix of the actual gate with noise (i.e.Ō) isÔ, which is obtained in GST. To compute the decomposition, first, we compute the ideal matrix O (0) ; second, we compute the inverse of the noise and finally, we solve the equation (for single-qubit gate) to determine quasi-probabilities q O,i of the gate O. We need to compute quasi-probabilities for each qubit and each gate. For example, for the SWAP-test circuit, we need to compute the decomposition for Hadamard gate, T gate, T † gate of each qubit and controlled-NOT gate of each pair of qubits that the controlled-NOT gate may be performed on.
For two-qubit gates, the procedure is the same but tensor products of single-qubit basis operations, i.e.B (1) i1 ⊗ B (2) i2 (where the superscribe is the label of the qubit), are used to decompose the inverse of the noise N −1 .
In order to mitigate errors in initial states and measurements of observables, we should solve the following equations for the quantities q for each qubit. Here, m is the label of the qubit, |ρ (0) is the column vector representing the ideal initial state |0 0|, and Q (0) | is the row vector representing the ideal observable σ z . Before implementing the quasi-probability decomposition on a quantum computer, we compute for each qubit and for each gate.

Monte Carlo implementation of the quasi-probability decomposition
It is vital to note that we use estimationsB i to decompose the inverse of the noise, but usually there is a difference betweenB i and the actual basis operationB i . This difference does not cause any error in the final computing result, because the computing result is invariant under a similarity transformation as we have explained in the main text. Now, we describe how to implement the quasiprobability decomposition on a quantum computer. We suppose the circuit is sequentially performing gates O 1 , O 2 , . . . , O N on the initial state |00 . . . 0 , and the first qubit is measured in the σ z basis to read the computing result. The procedure can be generalised to the case of measuring multiple qubits.
First, we generate a set of random integers: for each qubit m, we randomly select an integer k m such that each integer would be selected with corresponding probability |q  Second, on the quantum computer, we implement the following quantum computing for once: we initialise the qubit m in the stateρ Taking = p x +p y +p z , it is obvious that in the inhomogeneous Pauli error model, the error component does not depend on the error rate, i.e. E = −1 (p x [σ x ] + p y [σ y ] + p z [σ z ]). We remark that ratios p α / do not change with .