Optimal resource cost for error mitigation

One of the central problems for near-term quantum devices is to understand their ultimate potential and limitations. We address this problem in terms of quantum error mitigation by introducing a framework taking into account the full expressibility of near-term devices, in which the optimal resource cost for the probabilistic error cancellation method can be formalized. We provide a general methodology for evaluating the optimal cost by connecting it to a resource-theoretic quantifier defined with respect to the noisy operations that devices can implement. We employ our methods to estimate the optimal cost in mitigating a general class of noise, where we obtain an achievable cost that has a generic advantage over previous evaluations, as well as a fundamental lower bound applicable to a broad class of noisy implementable operations. We improve our bounds for several noise models, where we give the exact optimal costs for the depolarizing and dephasing noise, precisely characterizing the overhead cost while offering an operational meaning to the resource measure in terms of error mitigation. Our result particularly implies that the heuristic approach presented by Temme et al. [K. Temme, S. Bravyi, and J. M. Gambetta, Phys. Rev. Lett. 119, 180509 (2017)] is optimal even in our extended framework, putting fundamental limitations on the advantage provided by the extra degrees of freedom inherent in near-term devices for this noise model.


I. INTRODUCTION
Recent technological developments push us toward the realization of quantum information processing in a fully controlled manner, and a near-term cornerstone is to make use of noisy intermediate-scale quantum (NISQ) devices, which focus on manipulating tens to hundreds of qubits without a full error correction [1][2][3]. However, whether these near-term devices can provide advantages in useful applications is still elusive. In particular, rigorous theoretical analysis on the ultimate potential and limitations of NISQ devices has been largely missing.
In this work, we address this problem in terms of the capability to fight against noise. To deal with the critical noise effect without implementing an error-correcting code that is out of reach for the current technology, various error mitigation protocols have been proposed [4][5][6][7][8][9]. Among them, the probabilistic error cancellation method [4,8,10-15] stands as a promising candidate, as it can construct an unbiased estimator that faithfully estimates the expectation value of an observable under a known error model. Probabilistic error cancellation is an application of a more general technique known as quasiprobability sampling [16][17][18][19][20][21], and the relevant resource cost is the sampling overhead, i.e., the number of samples necessary to ensure a certain accuracy, which is characterized by how large the negative portion of the quasiprobability is. Thus, the capability of a given noisy device to mitigate errors with this method can be characterized by the minimum negativity in the quasiprobability, providing the optimal resource cost at which the probabilistic error cancellation can be run.
* ryuji.takagi@ntu.edu.sg
However, the "optimal" cost is not well-defined by itself. Since quasiprobability decomposition expresses a given gate as a linear combination of noisy operations implementable on a noisy device, the optimal cost can highly depend on the choice of implementable operations. Many previous studies, not only on error mitigation but on NISQ algorithms in general, chose Pauli and Clifford gates as building blocks for the implementable operations [2]. As for the probabilistic error cancellation, Clifford operations combined with a target unitary gate followed by a noise channel were heuristically considered for depolarizing and amplitude damping noise [4], and were later extended to a complete set of Clifford-based operations applicable to a wider class of noise models [8].
However, the necessity of considering the Pauli and Clifford gates has been rarely asked. It surely makes sense to give them special status in the fault-tolerant quantum computation with error-correcting codes where logical operations take place in a code space, for which Clifford operations admit simple logical gate constructions [22][23][24]. On the other hand, for NISQ devices that do not implement error-correcting codes, there is no clear reason for the Clifford gates to be preferred over non-Clifford gates at the level of physical operations on unencoded qubits. In this sense, NISQ devices are endowed with extra degrees of freedom, and this flexibility should be fully exploited so that their potential and limitations can be properly gauged.
Here, we introduce a framework that incorporates the full expressibility of noisy near-term devices that do not assume error correction. Our framework formalizes the optimal resource cost for the probabilistic error cancellation with respect to a continuous set of implementable operations, reflecting the flexibility of the NISQ devices. This consideration, however, also raises a demanding theoretical problem: since we need to take into account infinitely many implementable operations, obtaining the optimal resource cost can be extremely challenging. We address this problem by relating the optimal cost to a quantity studied in resource theories [25]. A major pillar of resource theories is the quantification of resources, and previous studies indicated intimate relations between quasiprobability representation and resource quantifiers [26][27][28][29]. In particular, a resource quantifier known as the robustness measure [30] has found several applications in the context of simulating quantum circuits [18,19] and quantum memory [31]. We consider our set of implementable operations as the accessible free resource and find that the optimal mitigation cost can be characterized by the robustness measure defined in our resource-theoretic framework, which can be evaluated by leveraging tools in general convex resource theories [32,33].
Then, we employ our method to obtain universal bounds for optimal error mitigation cost for a general noise model, showing that our framework provides a generic advantage over previous evaluations based on a discrete set of implementable operations, while placing a fundamental lower bound that must be observed by any device whose implementable operations are subsumed by the one introduced in our framework. We also study several specific noise models, for which we find that our bounds can be improved. Notably, our methods provide exact optimal costs for depolarizing and dephasing noise channels, precisely characterizing the error mitigation capability of noisy devices while offering an operational meaning to the robustness measure in the context of error mitigation. Our result particularly indicates that the heuristic decomposition for the depolarizing channel given in Ref. [4] is still optimal even in our extended framework, putting fundamental limitations on the enhancement enabled by a continuous set of implementable operations for this specific noise model.
Our results not only provide a systematic way to evaluate the ultimate resource cost for error mitigation crucial for running useful algorithms in practice, but also present an application of ideas in resource theories to address concrete problems [18,34-38], paving the way for a rigorous information-theoretical analysis of noisy near-term devices.

II. PRELIMINARIES
A purpose of quantum computation, in particular for many variational algorithms designed for near-term devices, is to obtain an expectation value $\langle A\rangle_{\rm ideal} = \mathrm{Tr}[A\rho]$, where $A$ is some observable and $\rho = \mathcal{U}_L\circ\cdots\circ\mathcal{U}_1(\rho_{\rm in})$ is a final quantum state with input state $\rho_{\rm in}$ fed into a quantum circuit composed of a sequence of unitary gates, $\{\mathcal{U}_l\}_{l=1}^{L}$. (The curly letter refers to a unitary gate as a quantum channel, i.e., $\mathcal{U}(\rho) := U\rho U^\dagger$.) However, since the application of quantum gates necessarily suffers from noise, the ideal gates $\{\mathcal{U}_l\}_{l=1}^{L}$ are not directly implementable on noisy devices. Instead, one can consider a set of noisy operations, $\mathcal{I}_{\mathcal{E}}$, for some noise channel $\mathcal{E}$, that is implementable on the device of interest. The idea of the probabilistic error cancellation method is to represent each quantum gate as a linear combination of the noisy implementable operations as $\mathcal{U}_l = \sum_i \eta_{l,i}\,\mathcal{O}_{l,i}$, $\mathcal{O}_{l,i}\in\mathcal{I}_{\mathcal{E}}$, where $\eta_{l,i}$ is a (not necessarily positive) real number. Then, for each gate $\mathcal{U}_l$ we sample an implementable operation $\mathcal{O}_{l,i}$ with probability $|\eta_{l,i}|/\sum_i|\eta_{l,i}|$, prepare a state $\tilde\rho = \mathcal{O}_{L,i_L}\circ\cdots\circ\mathcal{O}_{1,i_1}(\rho_{\rm in})$, where $\mathcal{O}_{l,i_l}$ is the implementable operation sampled for $\mathcal{U}_l$, and measure the observable $A$. Then, letting $\gamma_l := \sum_i|\eta_{l,i}|$, $\gamma_{\rm tot} := \prod_{l=1}^{L}\gamma_l$, and $\mathrm{sgn}_{\rm tot} := \prod_{l=1}^{L}\mathrm{sgn}(\eta_{l,i_l})$, one can check that this realizes an unbiased estimator of $\langle A\rangle_{\rm ideal}$ as $\langle A\rangle_{\rm ideal} = \gamma_{\rm tot}\,\langle \mathrm{sgn}_{\rm tot}\, a(\tilde\rho)\rangle_{\rm samp}$, where $a(\tilde\rho)$ is a random variable for the measurement outcome and $\langle\cdot\rangle_{\rm samp}$ refers to the expectation value for the sampling average taken for the above procedure.
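The sampling procedure above can be illustrated with a minimal single-gate sketch. It is not taken from the paper's text at this point: it assumes a qubit dephasing channel $\mathcal{F}_\epsilon(\rho) = (1-\epsilon)\rho + \epsilon Z\rho Z$, an ideal identity gate, the input state $|+\rangle$, the observable $X$, and the two-term decomposition $\mathrm{id} = \frac{1-\epsilon}{1-2\epsilon}\mathcal{F}_\epsilon - \frac{\epsilon}{1-2\epsilon}\mathcal{F}_\epsilon\circ\mathcal{Z}$. All function names are hypothetical.

```python
import random

def dephasing_decomposition(eps):
    """Assumed coefficients eta_i with id = eta_0 * F_eps + eta_1 * (F_eps∘Z), eps < 1/2."""
    return [(1 - eps) / (1 - 2 * eps), -eps / (1 - 2 * eps)]

def noisy_x_values(eps):
    """<X> on |+> after F_eps, and after F_eps∘Z (Z maps |+> to |->)."""
    return [1 - 2 * eps, -(1 - 2 * eps)]

def exact_pec_value(eps):
    """Exact mean of the estimator: sum_i eta_i * <X>_i."""
    return sum(e * x for e, x in zip(dephasing_decomposition(eps), noisy_x_values(eps)))

def sampled_pec_value(eps, shots, seed=0):
    """Monte Carlo version: sample an operation, measure X, reweight by gamma * sign."""
    rng = random.Random(seed)
    eta = dephasing_decomposition(eps)
    gamma = sum(abs(e) for e in eta)            # overhead constant for this gate
    xs = noisy_x_values(eps)
    total = 0.0
    for _ in range(shots):
        i = 0 if rng.random() < abs(eta[0]) / gamma else 1      # sample O_i
        outcome = 1 if rng.random() < (1 + xs[i]) / 2 else -1   # X measurement, +/-1
        sign = 1 if eta[i] > 0 else -1
        total += gamma * sign * outcome
    return total / shots
```

In this toy setting the noiseless value is $\langle X\rangle_{\rm ideal} = 1$, and the reweighted average recovers it at the price of a per-sample magnitude $\gamma = 1/(1-2\epsilon)$ instead of 1.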
Although it gives the desired expectation value, canceling the noise comes with a cost: one needs to pay more sampling cost than would be needed to estimate the desired expectation value with a noiseless circuit within the same accuracy. Hoeffding's inequality [39] ensures that a sufficient number of samples for estimating the true expectation value with error $\delta$ at probability $1-\varepsilon$ is given by $(2\gamma_{\rm tot}^2/\delta^2)\ln(2/\varepsilon)$. Thus, having small $\gamma_{\rm tot}$ is crucial to suppress the sampling overhead and, since $\gamma_{\rm tot}$ grows exponentially with respect to the number of gates, the problem is reduced to finding a good linear decomposition of each ideal gate $\mathcal{U}_l$ with respect to implementable operations $\mathcal{I}_{\mathcal{E}}$ with small $\gamma_l$.
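As a rough numerical illustration of this sample count, the following sketch evaluates $(2\gamma_{\rm tot}^2/\delta^2)\ln(2/\varepsilon)$ for a product of per-gate overheads. The function and parameter names are hypothetical, and measurement outcomes are assumed to lie in $[-1,1]$.

```python
import math

def total_overhead(gammas):
    """gamma_tot as the product of per-gate overhead constants gamma_l."""
    return math.prod(gammas)

def hoeffding_samples(gamma_tot, delta, failure_prob):
    """Sufficient sample count (2 * gamma_tot^2 / delta^2) * ln(2 / failure_prob)."""
    return math.ceil(2.0 * gamma_tot ** 2 / delta ** 2 * math.log(2.0 / failure_prob))
```

Because $\gamma_{\rm tot}$ enters quadratically, doubling it quadruples the number of samples, which is why small per-gate improvements compound quickly over a deep circuit.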
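Clearly, the best linear decomposition depends on the choice of $\mathcal{I}_{\mathcal{E}}$, and it has been heuristically chosen on a case-by-case basis. For instance, for the depolarizing noise model $\mathcal{D}_{d,\epsilon}$, where $d$ is the dimension of the system and $\epsilon$ is the noise strength, $\mathcal{I}_{\mathcal{D}_{d,\epsilon}} = \{\mathcal{D}_{d,\epsilon}\circ\mathcal{P}\circ\mathcal{U}\}$, where $\mathcal{P}$ is a Pauli operator, was considered, while for the single-qubit amplitude damping channel $\mathcal{A}_\epsilon$, a set of implementable operations that works for the linear decomposition was found to be $\mathcal{I}_{\mathcal{A}_\epsilon} = \{\mathcal{A}_\epsilon\circ\mathcal{U},\ \mathcal{A}_\epsilon\circ\mathcal{Z}^{1/2}\circ\mathcal{U},\ \mathcal{A}_\epsilon\circ\mathcal{Z}^{-1/2}\circ\mathcal{U},\ \mathcal{P}_{|0\rangle}\}$, where $\mathcal{Z}^{1/2}(\cdot) = Z^{1/2}\cdot Z^{1/2\dagger}$ with $Z^{1/2} := \mathrm{diag}(1,i)$ the phase gate and $\mathcal{P}_{|0\rangle}$ the preparation of the state $|0\rangle$ [4]. Later, this idea was extended to a Clifford-based universal basis set that works for any noise model with a sufficiently small noise strength [8].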

III. FRAMEWORK
Although the above sets of operations can realize some decomposition, there is no guarantee that these choices of $\mathcal{I}_{\mathcal{E}}$ achieve the smallest overhead among other possible choices of the set of implementable operations. To assess the ultimate potential and limitations of devices having access to a continuous set of physical operations, we need to give more freedom to noisy devices through their programmability, i.e., the set of implementable operations under the noiseless condition. Motivated by this observation, we introduce the set of implementable operations as
$$\mathcal{I}_{\mathcal{E}}(d) := \{\mathcal{E}\circ\mathcal{V} \mid \mathcal{V}\in\tilde{\mathcal{P}}(d)\} \qquad (1)$$
for a given noise channel $\mathcal{E}\in\mathcal{T}(d)$, where $\mathcal{T}(d)$ is the set of completely positive trace preserving (CPTP) maps with input and output systems being $d$-dimensional quantum systems, and $\tilde{\mathcal{P}}(d)$ is the set of programmable operations defined as
$$\tilde{\mathcal{P}}(d) := \Big\{\sum_i p_i\mathcal{V}_i \;\Big|\; \mathcal{V}_i\in\mathcal{T}_U(d)\cup\mathcal{S}(d)\Big\}, \qquad (2)$$
where $\{p_i\}$ is a probability distribution, $\mathcal{T}_U(d)$ is the set of unitary channels on $d$-dimensional systems, and $\mathcal{S}(d) = \{\mathcal{P}_{|\psi\rangle} \mid |\psi\rangle\in\mathcal{H}_d\}$, with $\mathcal{H}_d$ being the $d$-dimensional Hilbert space, is the set of state preparation channels. The operations in $\mathcal{I}_{\mathcal{E}}(d)$ are implementable on devices that can program any unitary gate and state preparation under the effect of noise $\mathcal{E}$, which is reasonable for small $d$ such as $d=2$ (single qubit) and $d=4$ (two qubits), and even if a given device is not powerful enough to realize the above operations, our results serve as its ultimate bounds. Note that although different operations may come with different noise channels in general, here we take a fixed noise channel after operations with the same size, which is a standard assumption in the quantitative analysis of noise effects [4,24,40,41]; extending our formalism to accommodate different noise channels is left for future work.

IV. RESOURCE COST AS ROBUSTNESS
Our goal is to find a decomposition for a given ideal unitary gate $\mathcal{U}\in\mathcal{T}_U(d)$ with respect to the set of implementable operations (1) with the minimum absolute sum of the coefficients. Namely, the optimal overhead constant $\gamma_{\rm opt}(\mathcal{U})$ is written as
$$\gamma_{\rm opt}(\mathcal{U}) := \min\Big\{\sum_i|\eta_i| \;\Big|\; \mathcal{U} = \sum_i\eta_i\mathcal{O}_i,\ \mathcal{O}_i\in\mathcal{I}_{\mathcal{E}}(d)\Big\}, \qquad (3)$$
where we assume $\gamma_{\rm opt}(\mathcal{U}) < \infty$. As noted in Ref. [4], Eq. (3) becomes a linear program when $\mathcal{I}_{\mathcal{E}}(d)$ is a discrete set. However, because we aim to exploit the full expressibility of the device and consider the continuous set $\mathcal{I}_{\mathcal{E}}(d)$ given in (1), Eq. (3) is no longer a simple linear program; even if one obtains some valid decomposition of $\mathcal{U}$, it is hard to see whether there exists another decomposition that gives a smaller overhead constant.
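Here, we provide a general strategy to approach this problem. First, we introduce the following quantity:
$$R_{\mathcal{I}_{\mathcal{E}}}(\mathcal{U}) := \min\Big\{s\ge0 \;\Big|\; \frac{\mathcal{U}+s\,\Theta}{1+s}\in\mathcal{I}_{\mathcal{E}},\ \Theta\in\mathcal{I}_{\mathcal{E}}\Big\}. \qquad (4)$$
Quantities of this type, defined for quantum states, have been used to quantify the amount of quantum resources (e.g., entanglement) contained in a given state and are known as the (standard) robustness measure in resource theories. In particular, the quantity in (4) can be considered as the robustness measure defined for quantum operations in the context of the resource theory of channels [33,42,43], where $\mathcal{I}_{\mathcal{E}}$ serves as the set of free channels. Then, one can show that (4) is equivalent to (3) with the relation
$$\gamma_{\rm opt}(\mathcal{U}) = 2R_{\mathcal{I}_{\mathcal{E}}}(\mathcal{U}) + 1, \qquad (5)$$
which we explain in Appendix A. (See also related arguments considered for other settings in Refs. [18,31,44].) Thanks to this relation, we can bring up ideas and tools developed in general convex resource theories [32,33] and apply them to our resource measure in (4). In particular, we obtain the following dual form of the robustness (details are given in Appendix B):
$$R_{\mathcal{I}_{\mathcal{E}}}(\mathcal{U}) = \max\big\{\mathrm{Tr}[WJ_{\mathcal{U}}]-1 \;\big|\; W\in\mathbb{H},\ 0\le\mathrm{Tr}[WJ_{\Xi}]\le1\ \ \forall\,\Xi\in\mathcal{I}_{\mathcal{E}}\big\}, \qquad (6)$$
where $J_\Lambda := \mathrm{id}\otimes\Lambda\big(\sum_{i,j=0}^{d-1}|ii\rangle\langle jj|\big)$ is the Choi matrix of channel $\Lambda$ and $\mathbb{H}$ denotes the set of Hermitian operators. The above dual form is known to provide operational meanings to robustness measures, and it particularly indicates that the robustness is physically observable [31,33]. Furthermore, these two expressions provide useful bounds for the optimal resource cost as
$$2\,\mathrm{Tr}[WJ_{\mathcal{U}}]-1 \;\le\; \gamma_{\rm opt}(\mathcal{U}) \;\le\; 2s+1, \qquad (7)$$
where $W\in\mathbb{H}$ and $s\ge0$ are any Hermitian operator and real number satisfying the conditions in (6) and (4), respectively. In addition, we show in Appendix C that if we find a decomposition $\mathcal{U} = \sum_i\eta_i\mathcal{O}_i = \sum_i\eta_i\,\mathcal{E}\circ\mathcal{V}_i$, which gives an upper bound in (7), then $W = d^{-2}J_{\mathcal{E}^{-1\dagger}\circ\,\mathcal{U}}$ with $\mathcal{E}^{-1} := \sum_i\eta_i\mathcal{V}_i\circ\mathcal{U}^\dagger$ satisfies the condition in (6) and provides $2\,\mathrm{Tr}[\Phi_d\,\mathrm{id}\otimes\mathcal{E}^{-1}(\Phi_d)]-1$ as a candidate for a good lower bound, where $\Phi_d$ is the maximally entangled state.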
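The interplay between the upper bound (any valid decomposition) and the systematic lower bound $2\,\mathrm{Tr}[\Phi_d\,\mathrm{id}\otimes\mathcal{E}^{-1}(\Phi_d)]-1$ can be sketched numerically. The example below is an illustration under stated assumptions rather than the paper's own computation: it takes the qubit dephasing channel $\mathcal{F}_\epsilon(\rho)=(1-\epsilon)\rho+\epsilon Z\rho Z$, represents channels by their Pauli transfer matrices in the basis $(I,X,Y,Z)$, and uses the standard identity $\mathrm{Tr}[\Phi_d\,\mathrm{id}\otimes\Lambda(\Phi_d)] = \mathrm{trace}(\text{PTM of }\Lambda)/d^2$. Function names are hypothetical.

```python
def lower_bound_from_inverse_ptm(ptm_inv, d=2):
    """2 * Tr[Phi (id ⊗ E^{-1})(Phi)] - 1, with the overlap computed as trace(PTM) / d^2."""
    trace = sum(ptm_inv[i][i] for i in range(len(ptm_inv)))
    return 2.0 * trace / d ** 2 - 1.0

def upper_bound_from_coefficients(etas):
    """Cost sum_i |eta_i| of a given quasiprobability decomposition."""
    return sum(abs(e) for e in etas)

def dephasing_bounds(eps):
    """Lower and upper bounds for dephasing noise with strength eps < 1/2."""
    s = 1.0 / (1.0 - 2.0 * eps)
    # Inverse of the dephasing PTM diag(1, 1-2eps, 1-2eps, 1).
    ptm_inv = [[1.0, 0.0, 0.0, 0.0],
               [0.0, s, 0.0, 0.0],
               [0.0, 0.0, s, 0.0],
               [0.0, 0.0, 0.0, 1.0]]
    # Assumed decomposition: id = eta_0 * F_eps + eta_1 * (F_eps∘Z).
    etas = [(1 - eps) * s, -eps * s]
    return lower_bound_from_inverse_ptm(ptm_inv), upper_bound_from_coefficients(etas)
```

For this channel the two numbers coincide at $1/(1-2\epsilon)$, so the sandwich in (7) closes and the decomposition is certified optimal.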

V. EVALUATION OF RESOURCE COST
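The generality of our method allows us to obtain bounds for the optimal resource cost for general noise channels. To see this, let us consider the noise channels of the form
$$\mathcal{E} = (1-\epsilon)\,\mathrm{id} + \epsilon_+\Lambda - \epsilon_-\Xi, \qquad \Lambda,\Xi\in\tilde{\mathcal{P}}(d),$$
which represents a large class of channels and indeed any CPTP map on up to two-qubit systems. In Appendix D, we show the latter claim by introducing a universal set of basis operations for CPTP maps that solely consist of programmable operations in $\tilde{\mathcal{P}}$, which can be of independent interest. Although we are usually interested in the cases of small error $\epsilon$, here we do not impose this assumption. This takes into account the fact that the decomposition is not unique, and the following theorem provides bounds for the optimal cost for any decomposition of this form. To state the result, let $S^k_m = \{\vec{s}\in\{0,1\}^k \mid \mathrm{wt}(\vec{s}) = m\}$ denote the set of $k$-bit strings which have $m$ 1's. Then, for $\vec{s}\in S^k_m$ we define $(\Lambda^m,\Xi^{k-m})_{\vec{s}}$ to be an operation that applies $\Lambda$ for $m$ times and $\Xi$ for $k-m$ times with a pattern specified by the binary string $\vec{s}$. For instance, if $\vec{s} = (0,1,1)$, then $(\Lambda^2,\Xi^1)_{\vec{s}} = \Xi\circ\Lambda\circ\Lambda$. Then, we get the following bounds that provide a systematic estimation of the optimal cost (the proof is given in Appendix E):
Theorem 1. Let $\mathcal{E} = (1-\epsilon)\,\mathrm{id} + \epsilon_+\Lambda - \epsilon_-\Xi$ with $\epsilon,\epsilon_\pm\ge0$ and $\Lambda,\Xi\in\tilde{\mathcal{P}}(d)$. If $1-\epsilon > \epsilon_+ + \epsilon_-$, then for any $\mathcal{U}\in\mathcal{T}_U(d)$,
$$2\sum_{k=0}^{\infty}\sum_{m=0}^{k}\frac{(-\epsilon_+)^m\,\epsilon_-^{\,k-m}}{(1-\epsilon)^{k+1}}\sum_{\vec{s}\in S^k_m}\mathrm{Tr}\big[\Phi_d\,\mathrm{id}\otimes(\Lambda^m,\Xi^{k-m})_{\vec{s}}(\Phi_d)\big]-1 \;\le\; \gamma_{\rm opt}(\mathcal{U}) \;\le\; \frac{1}{1-2\epsilon_+}. \qquad (8)$$
The upper bound, an achievable cost (see also Appendix F), becomes especially insightful when the given error channel is not expressed by Pauli or Clifford operations. For instance, consider the generalized dephasing channel $\mathcal{F}_{\hat{n},\epsilon}(\rho) := (1-\epsilon)\rho + \epsilon\, e^{i\pi\hat{n}\cdot\hat{\sigma}/2}\rho\, e^{-i\pi\hat{n}\cdot\hat{\sigma}/2}$, where $\hat{n}\cdot\hat{\sigma} = n_xX + n_yY + n_zZ$ and $\hat{n}$ is the unit vector that determines the rotation axis. As we show in Appendix G, the best systematic decomposition with the Clifford-based basis in Ref. [8] leads to a larger cost than what is achievable in our framework. This advantage comes from our flexible choice of decomposition basis, which takes advantage of our large set of implementable operations. In fact, we find that the advantage is generic.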
We compare two costs $\gamma_{\rm disc}$ and $\gamma_{\rm cont}$, where $\gamma_{\rm disc}$ is the optimal cost for the discrete basis in Ref. [8] and $\gamma_{\rm cont}$ is the achievable cost $(1-2\epsilon_+)^{-1}$ from Theorem 1 realized by our continuous basis, where we set $\mathcal{U} = \mathrm{id}$ for both cases. Figure 1 plots $\gamma_{\rm disc}/\gamma_{\rm cont}$ for randomly sampled single-qubit and two-qubit noise models of the form $\mathcal{E} = (1-\epsilon)\,\mathrm{id} + \epsilon\mathcal{V}$, where $\mathcal{V}$ is a Haar random unitary. We find that $\gamma_{\rm cont}$ is smaller than $\gamma_{\rm disc}$ in many cases for single-qubit error channels and in all the sampled cases for two-qubit error channels. The advantage for two-qubit errors is particularly significant: the improvement can reach around a factor of 2 for realistic noise strengths, which can drastically reduce the total sampling cost that grows exponentially with the number of gates. Since two-qubit noise will be the most demanding one in real experiments, our result may greatly ease the experimental challenges. At the same time, the advantage seen in Fig. 1 confirms the necessity of considering the extended class of implementable operations introduced here to properly assess the ultimate capability of error mitigation.
The lower bound, on the other hand, corresponds to the fundamental limitations on the error mitigation performance. It serves as a universal necessary cost even if we employ the continuous set of implementable operations. Notably, this bound readily applies to any subset of implementable operations contained in I E , encompassing a broad class of noisy devices of interest. Although it might look daunting to evaluate the expression for the lower bound, it can be analytically calculated for many cases, as we discuss in Appendix E.
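The achievable cost for noise of the form $\mathcal{E} = (1-\epsilon)\,\mathrm{id} + \epsilon\mathcal{V}$ (that is, $\epsilon_+ = \epsilon$, $\epsilon_- = 0$) can be checked with a small numerical sketch. The construction below is an assumption-laden illustration, not the paper's code: it inverts $\mathcal{E}$ by the truncated series $\sum_k(-\epsilon)^k(1-\epsilon)^{-(k+1)}\mathcal{V}^k$, takes $\mathcal{V}$ to be a qubit rotation about the $z$ axis by an arbitrary (generally non-Clifford) angle, and works with Pauli transfer matrices in the basis $(I,X,Y,Z)$. All names are hypothetical.

```python
import math

def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def z_rotation_ptm(theta):
    """Pauli transfer matrix of a unitary rotation about z by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [[1.0, 0.0, 0.0, 0.0], [0.0, c, -s, 0.0], [0.0, s, c, 0.0], [0.0, 0.0, 0.0, 1.0]]

def noise_ptm(eps, v_ptm):
    """PTM of E = (1 - eps) * id + eps * V."""
    eye = identity(len(v_ptm))
    return [[(1 - eps) * eye[i][j] + eps * v_ptm[i][j] for j in range(len(v_ptm))]
            for i in range(len(v_ptm))]

def series_inverse(eps, v_ptm, terms):
    """Truncated inverse sum_k (-eps)^k / (1-eps)^(k+1) V^k and its cost sum_k |coeff_k|."""
    n = len(v_ptm)
    inv = [[0.0] * n for _ in range(n)]
    power = identity(n)            # V^0; every V^k is again a unitary channel
    cost = 0.0
    for k in range(terms):
        coeff = (-eps) ** k / (1 - eps) ** (k + 1)
        inv = [[inv[i][j] + coeff * power[i][j] for j in range(n)] for i in range(n)]
        cost += abs(coeff)
        power = matmul(power, v_ptm)
    return inv, cost
```

Each term $\mathcal{V}^k$ is a programmable unitary, so the series is a decomposition into implementable operations, and its cost converges to $1/(1-2\epsilon)$ whenever $\epsilon<1/2$.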
Theorem 1 can also give insights into correlated noise models. If the noise of interest has correlation among a number of qubits, the effective noise channel gets large and a quasiprobability decomposition becomes numerically intractable even if the description of the noise is given and a discrete set of implementable operations is used. On the other hand, Theorem 1 immediately gives an effective evaluation of the optimal cost as long as the noise channel is provided in a certain form.
Interestingly, the bounds in (8) can be improved by focusing on specific noise models. We can even obtain the exact optimal costs for depolarizing and dephasing noise, providing a precise characterization of the devices' capability under these noise models (the proof is given in Appendix H). Together with (5), it also gives an operational meaning to the robustness measure in terms of quantum error mitigation. This result shows that for these noise models, the optimal cost can be achieved by the Pauli-based basis operations [4,8], and the continuous degrees of freedom do not help reduce the cost. This particularly implies that the heuristic linear decomposition for the depolarizing noise considered in Ref. [4] remains optimal even in our extended framework, putting fundamental restrictions on the error mitigation feasibility.
Our method also provides a tighter bound for the amplitude damping noise (the proof is given in Appendix I).

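Theorem 3. Let $\mathcal{A}_\epsilon$ be the qubit amplitude damping channel with Kraus operators $K_0 = |0\rangle\langle0| + \sqrt{1-\epsilon}\,|1\rangle\langle1|$ and $K_1 = \sqrt{\epsilon}\,|0\rangle\langle1|$. Then, for any unitary channel $\mathcal{U}\in\mathcal{T}_U(2)$ and $0\le\epsilon<1$, $\gamma_{\rm opt}(\mathcal{U})$ obeys the two-sided bound (11). The upper bound in (11), $(1+\epsilon)/(1-\epsilon)$ [cf. (I1)], can be achieved by an explicit decomposition, and the lower bound is obtained by finding a good witness operator in (6), or alternatively by directly evaluating the lower bound in (8).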
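The achievable cost $(1+\epsilon)/(1-\epsilon)$ for amplitude damping can be reproduced with a short numerical check. The decomposition used below is worked out here as an assumption and may differ in form from the one in Appendix I: it takes $\mathcal{U}=\mathrm{id}$, the standard Kraus operators $K_0 = |0\rangle\langle0|+\sqrt{1-\epsilon}|1\rangle\langle1|$, $K_1=\sqrt{\epsilon}|0\rangle\langle1|$, and the three operations $\mathcal{A}_\epsilon$, $\mathcal{A}_\epsilon\circ\mathcal{Z}$, $\mathcal{A}_\epsilon\circ\mathcal{P}_{|0\rangle}$, all written as Pauli transfer matrices in the basis $(I,X,Y,Z)$. Function names are hypothetical.

```python
import math

def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def amplitude_damping_ptm(eps):
    """PTM of A_eps: I -> I + eps*Z, X -> r*X, Y -> r*Y, Z -> (1-eps)*Z, r = sqrt(1-eps)."""
    r = math.sqrt(1 - eps)
    return [[1.0, 0.0, 0.0, 0.0], [0.0, r, 0.0, 0.0], [0.0, 0.0, r, 0.0], [eps, 0.0, 0.0, 1 - eps]]

Z_PTM = [[1.0, 0.0, 0.0, 0.0], [0.0, -1.0, 0.0, 0.0], [0.0, 0.0, -1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
PREP0_PTM = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]

def damping_coefficients(eps):
    """Assumed coefficients (a, b, c) with id = a*A + b*(A∘Z) + c*(A∘P_0), 0 <= eps < 1."""
    r = math.sqrt(1 - eps)
    a = (1 / (1 - eps) + 1 / r) / 2
    b = (1 / (1 - eps) - 1 / r) / 2
    c = -eps / (1 - eps)
    return a, b, c

def reconstructed_identity(eps):
    """PTM of a*A + b*(A∘Z) + c*(A∘P_0); should equal the 4x4 identity."""
    a, b, c = damping_coefficients(eps)
    amp = amplitude_damping_ptm(eps)
    terms = [(a, amp), (b, matmul(amp, Z_PTM)), (c, matmul(amp, PREP0_PTM))]
    return [[sum(w * m[i][j] for w, m in terms) for j in range(4)] for i in range(4)]

def damping_cost(eps):
    return sum(abs(x) for x in damping_coefficients(eps))
```

For a general target gate, composing each term with $\mathcal{U}$ from the right leaves the coefficients, and hence the cost, unchanged, since $\mathcal{P}_{|0\rangle}\circ\mathcal{U} = \mathcal{P}_{|0\rangle}$.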

VI. CONCLUSIONS
We investigated the ultimate potential and limitations of near-term noisy devices in terms of quantum error mitigation. We pointed out the wide programmability of NISQ devices and formalized their implementable operations to properly assess their full capability. We provided a general methodology to evaluate the optimal error mitigation cost with the probabilistic error cancellation method by establishing a connection to the robustness measure that naturally emerges as a resource quantifier in our framework. We applied our method to a general noise model and obtained universal bounds for the optimal mitigation cost, finding that our framework, which takes into account the flexible choice of implementable operations, leads to a generic advantage over the strategy based on discrete basis sets. We also obtained the exact optimal costs for depolarizing and dephasing noise, rigorously showing that the cost for depolarizing channels given in Ref. [4] is optimal even in our extended framework, and we obtained an improved bound for the amplitude damping channel.
Our consideration can be combined with other frameworks such as learning algorithms for unknown error models [13][14][15] and analog error mitigation [10], and may be extended to a broad class of problems for which the quasiprobability sampling is used, including the measurement of observables in variational algorithms and classical simulation of noisy quantum circuits. The generality of our approach also gives us the freedom to consider various sets of implementable operations, allowing one to tailor the analysis in this work to given physical devices. Another important future work is to extend our results toward a unified information-theoretic account of error mitigation and error correction, which will further clarify the true capability of noisy quantum devices.

ACKNOWLEDGMENTS
The author is grateful to Tennin Yan, Nobuyuki Yoshioka, and Xiao Yuan for fruitful discussions, and in particular to Suguru Endo, Kosuke Mitarai, and Yuya O. Nakagawa for valuable discussions and useful comments on the manuscript.

Appendix B: Dual form of the robustness
Here, we obtain the dual form of the robustness in (6) for general resource theories of channels (i.e., any finite-dimensional convex and closed set of free channels), in which our case is included as a special case. The discussion in this section is based on the argument given for another type of robustness measure (known as generalized robustness) for general resource theories of channels [33].
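Let $\mathcal{T}(A,B)$ be the set of quantum channels with input subsystem $A$ and output subsystem $B$. Given a convex and closed set of channels $\mathcal{O}_F\subseteq\mathcal{T}(A,B)$, we consider the (standard) robustness measure for a channel $\Lambda\in\mathcal{T}(A,B)$ with respect to $\mathcal{O}_F$ as
$$R_{\mathcal{O}_F}(\Lambda) := \min\Big\{s\ge0 \;\Big|\; \frac{J_\Lambda + sJ_\Theta}{1+s}\in J_{\mathcal{O}_F},\ \Theta\in\mathcal{O}_F\Big\}, \qquad (\mathrm{B1})$$
where $J_\Lambda = \mathrm{id}\otimes\Lambda(d_A\cdot\Phi_{d_A})$ denotes the Choi matrix for a channel $\Lambda$ and we assume $R_{\mathcal{O}_F}(\Lambda)<\infty$. Let $\mathbb{H}_{A(B)}$ be the subspace consisting of Hermitian operators defined on subsystems $A$ ($B$), and $J_{\mathcal{O}_F}\subseteq\mathbb{H}_A\otimes\mathbb{H}_B$ be the set of Choi matrices corresponding to channels in $\mathcal{O}_F$. Then, introducing the variable $J_{\tilde\Xi} = J_\Lambda + sJ_\Theta$, (B1) can be rewritten as the following optimization problem: minimize $\mathrm{Tr}[J_{\tilde\Xi}]/d_A - 1$ subject to $J_{\tilde\Xi}\in\mathrm{cone}(J_{\mathcal{O}_F})$ and $J_{\tilde\Xi}-J_\Lambda\in\mathrm{cone}(J_{\mathcal{O}_F})$, where $\mathrm{cone}(\mathcal{S}) := \{\lambda X \mid \lambda\ge0,\ X\in\mathcal{S}\}$. Writing the Lagrangian of this problem with dual variables for the two cone constraints and taking the dual, we reach the following equivalent formulation:
$$R_{\mathcal{O}_F}(\Lambda) = \max\big\{\mathrm{Tr}[WJ_\Lambda]-1 \;\big|\; W\in\mathbb{H}_A\otimes\mathbb{H}_B,\ 0\le\mathrm{Tr}[WJ_\Xi]\le1\ \ \forall\,\Xi\in\mathcal{O}_F\big\},$$
which results in (6) by taking $\mathcal{O}_F = \mathcal{I}_{\mathcal{E}}$.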
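Appendix C: Systematic lower bound for Eq. (7)
Here, we present a systematic lower bound that can be constructed from a specific decomposition used for the upper bound. Suppose $\mathcal{U} = \sum_i\eta_i\,\mathcal{E}\circ\mathcal{V}_i$ for $\mathcal{V}_i\in\tilde{\mathcal{P}}$ and define $\mathcal{E}^{-1} := \sum_i\eta_i\mathcal{V}_i\circ\mathcal{U}^\dagger$ so that $\mathcal{E}\circ\mathcal{E}^{-1} = \mathrm{id}$. Then, as long as $\mathcal{V}_i,\mathcal{E}\in\mathcal{T}(d)$, it also holds that $\mathcal{E}^{-1}\circ\mathcal{E} = \mathrm{id}$. This can be shown by considering a matrix representation of an operation $\Lambda\in\mathcal{T}(d)$ with a $d^2\times d^2$ matrix $M_\Lambda$, so that a matrix element of $\sigma := \Lambda(\rho)$ is written as $\sigma_{ij} = \sum_{kl}(M_\Lambda)_{ij,kl}\,\rho_{kl}$.
Then, $\mathcal{E}\circ\mathcal{E}^{-1} = \mathrm{id}$ implies $M_{\mathcal{E}}M_{\mathcal{E}^{-1}} = M_{\mathrm{id}} = I$. Since the right inverse of a square matrix is also the left inverse, we get $M_{\mathcal{E}^{-1}}M_{\mathcal{E}} = I$, i.e., $\mathcal{E}^{-1}\circ\mathcal{E} = \mathrm{id}$. Using this, we get
$$\mathrm{Tr}\big[J_{\mathcal{E}^{-1\dagger}\circ\,\mathcal{U}}\,J_{\mathcal{E}\circ\mathcal{V}}\big] = \mathrm{Tr}\big[J_{\mathcal{U}}\,J_{\mathcal{E}^{-1}\circ\mathcal{E}\circ\mathcal{V}}\big] = \mathrm{Tr}\big[J_{\mathcal{U}}\,J_{\mathcal{V}}\big],$$
which ensures $0\le d^{-2}\,\mathrm{Tr}[J_{\mathcal{E}^{-1\dagger}\circ\,\mathcal{U}}\,J_{\mathcal{E}\circ\mathcal{V}}]\le1$ for any $\mathcal{V}\in\tilde{\mathcal{P}}(d)$. Since $J_{\mathcal{E}^{-1\dagger}\circ\,\mathcal{U}}$ is also Hermitian, $W := d^{-2}J_{\mathcal{E}^{-1\dagger}\circ\,\mathcal{U}}$ is a valid operator satisfying the condition in (6). We can further obtain
$$\mathrm{Tr}[WJ_{\mathcal{U}}] = \mathrm{Tr}\big[(\mathrm{id}\otimes\mathcal{U})(\Phi_d)\,(\mathrm{id}\otimes\mathcal{E}^{-1}\circ\mathcal{U})(\Phi_d)\big] = \mathrm{Tr}\big[(\mathcal{U}^T\otimes\mathrm{id})(\Phi_d)\,(\mathcal{U}^T\otimes\mathcal{E}^{-1})(\Phi_d)\big] = \mathrm{Tr}\big[\Phi_d\,\mathrm{id}\otimes\mathcal{E}^{-1}(\Phi_d)\big],$$
where $\mathcal{U}^T(\cdot) = U^T\cdot(U^T)^\dagger$ is the unitary transposed with respect to the Schmidt basis of $\Phi_d$, and we also used that $I\otimes M|\Phi_d\rangle = M^T\otimes I|\Phi_d\rangle$ for any matrix $M$. Thus, (7) implies that $2\,\mathrm{Tr}[\Phi_d\,\mathrm{id}\otimes\mathcal{E}^{-1}(\Phi_d)]-1$ serves as a valid lower bound for $\gamma_{\rm opt}(\mathcal{U})$.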

Appendix D: Basis operations for CPTP maps
Although any noise channel can be represented by a CPTP map that possibly involves some external systems, one could consider CP trace nonincreasing maps as effective error channels acting on systems of interest. A trace nonincreasing map on $d$-dimensional quantum systems is an element of the $d^4$-dimensional vector space of linear maps, and thus it can be represented as a linear combination of $d^4$ linearly independent trace nonincreasing maps. A specific set of such maps $\{\mathcal{B}_i\}_{i=1}^{16}$ (Table I) that serves as a universal basis that can decompose any trace nonincreasing map was introduced in Ref. [8]. It is a set of 16 linearly independent maps, consisting of 10 Clifford unitaries and 6 trace nonincreasing projections. Then, any operation acting on qubit systems can be linearly decomposed with respect to this basis, and its tensor product also serves as a universal basis for multiqubit systems. They also showed that this basis remains linearly independent after suffering from a noise channel as long as the noise strength is sufficiently small and used this for the quasiprobability decomposition for probabilistic error cancellation.
Although a universal basis should contain trace nonincreasing maps if the noise model can also be trace nonincreasing in general, in many cases noise models of interest are described as trace-preserving maps. Then, intuitively, one should be able to find a universal basis only consisting of trace-preserving maps for decomposing an arbitrary CPTP map, which is more favorable for probabilistic error cancellation because using trace nonincreasing maps usually incurs a larger overhead constant due to the larger coefficients in the decomposition needed to complement the non-unit trace (see also the example at the end of this section). In addition, using many projective measurements is not preferred due to the relatively large measurement error rate.
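Here, we argue that if we restrict our attention to trace-preserving maps, the number of elements necessary for a universal basis is reduced to $d^4-d^2+1$, and these elements can themselves be taken to be trace preserving. We then provide a specific set of CPTP maps consisting of unitaries and state preparations that serves as a universal basis for CPTP maps on single-qubit and two-qubit systems. To this end, let $J_{\rm CPTP}(d)$ be the set of Choi matrices for CPTP maps with input and output being $d$-dimensional quantum systems, let $V(d)$ be its real linear span, and let $\tilde{V}(d) := \{X\in\mathbb{H}(A,B) \mid \mathrm{Tr}_BX\propto I\}$, where $\mathbb{H}(A,B)$ is the space of Hermitian operators on the input and output systems. Clearly $V(d)\subseteq\tilde{V}(d)$. Conversely, any $X\in\tilde{V}(d)$ becomes a positive multiple of the Choi matrix of a CPTP map after adding a sufficiently large multiple of the identity operator, which is itself proportional to the Choi matrix of a CPTP map. This means that there always exist two CPTP maps $\mathcal{E}_1,\mathcal{E}_2$ and real numbers $c_1,c_2$ such that $X = c_1J_{\mathcal{E}_1} + c_2J_{\mathcal{E}_2}$. This shows $\tilde{V}(d)\subseteq V(d)$, resulting in $V(d) = \tilde{V}(d)$. $\tilde{V}(d)$ is a subspace of $\mathbb{H}(A,B)$ with $\dim\mathbb{H}(A,B) = d^4$, and because of the constraints $\tilde{V}(d)$ has, the dimension of $\tilde{V}(d)$ is reduced to $d^4-d^2+1$. Since $V(d) = \tilde{V}(d)$, we have $\dim V(d) = \dim\tilde{V}(d) = d^4-d^2+1$. Thus, $d^4-d^2+1$ linearly independent CPTP maps are necessary and sufficient to linearly decompose an arbitrary CPTP map.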
Up to two-qubit systems, such sets of CPTP maps can be explicitly constructed. For single-qubit systems, we replace $\mathcal{B}_{11},\mathcal{B}_{12},\mathcal{B}_{13}$ with trace preserving state preparation channels and remove $\mathcal{B}_{14},\mathcal{B}_{15},\mathcal{B}_{16}$, leading to a set of linearly independent CPTP maps $\{\mathcal{B}'_i\}_{i=1}^{13}$ (Table II). For multiqubit systems, simply tensoring $\{\mathcal{B}'_i\}_{i=1}^{13}$ is not sufficient to construct a universal basis in general because $13^n < 16^n-4^n+1$ for $n\ge2$. For two-qubit systems, one needs to find $16^2-4^2+1-13^2 = 72$ additional independent operations to construct a complete basis, and they can be explicitly found as in Table III. An explicit construction beyond two-qubit systems is left for future work.
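The counting in this paragraph can be verified with a few lines of arithmetic. The sketch below uses hypothetical function names and only the formula $d^4-d^2+1$ with $d=2^n$ for $n$ qubits.

```python
def cptp_span_dimension(d):
    """Number of linearly independent CPTP maps needed on a d-dimensional system."""
    return d ** 4 - d ** 2 + 1

def missing_after_tensoring(n_qubits, single_qubit_basis_size=13):
    """Basis elements still missing when the single-qubit basis is simply tensored n times."""
    return cptp_span_dimension(2 ** n_qubits) - single_qubit_basis_size ** n_qubits
```

For one qubit the tensored set is already complete (13 elements), while for two qubits 72 further operations are required, in line with the text.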
Note that since all the operations in Tables II and III are unitaries or state preparations, they are elements of the programmable operations $\tilde{\mathcal{P}}$ in our framework. Thus, for any CPTP map $\mathcal{E}$ on single-qubit and two-qubit systems, there always exists some decomposition $\mathcal{E} = (1-\epsilon)\,\mathrm{id} + \epsilon_+\Lambda - \epsilon_-\Xi$, $\Lambda,\Xi\in\tilde{\mathcal{P}}$, up to unitary.
Using $\{\mathcal{B}'_i\}$ instead of $\{\mathcal{B}_i\}$ can save cost for error mitigation. For instance, one can check that the optimal cost $\gamma = \sum_i|\eta_i|$ to mitigate the amplitude damping channel with $\{\mathcal{B}_i\}$ results in $(1+2\epsilon)/(1-\epsilon)$, whereas the optimal cost with $\{\mathcal{B}'_i\}$ is $(1+\epsilon)/(1-\epsilon)$ as in (I1).
[Table I caption, fragment] 16 ∘ X is a trace nonincreasing projection onto the $|0\rangle$ state.
[Table III caption, fragment] The subscripts refer to the subsystems that the operations act on. "U + conjugation with V" refers to sandwiching $\mathcal{U}$ with $\mathcal{V}$ and its conjugation $\mathcal{V}^\dagger$ as $\mathcal{V}^\dagger\circ\mathcal{U}\circ\mathcal{V}$. Then, "U + conjugation with $K_{1,2}$, $K^\dagger_{1,2}$" collects all of the nine possible conjugations with $\mathrm{id}_{12}$,

Appendix E: Proof of Theorem 1
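Proof. We first obtain an upper bound. Let us define
$$\Psi_+ := \frac{1}{\gamma_+}\sum_{k=0}^{\infty}\sum_{m\ \mathrm{even}}\sum_{\vec{s}\in S^k_m}\frac{\epsilon_+^{\,m}\epsilon_-^{\,k-m}}{(1-\epsilon)^{k+1}}(\Lambda^m,\Xi^{k-m})_{\vec{s}}, \qquad \Psi_- := \frac{1}{\gamma_-}\sum_{k=0}^{\infty}\sum_{m\ \mathrm{odd}}\sum_{\vec{s}\in S^k_m}\frac{\epsilon_+^{\,m}\epsilon_-^{\,k-m}}{(1-\epsilon)^{k+1}}(\Lambda^m,\Xi^{k-m})_{\vec{s}},$$
where $\gamma_+,\gamma_-$ are normalization constants, which are finite because $\sum_{k=0}^{\infty}(\epsilon_++\epsilon_-)^k/(1-\epsilon)^{k+1}<\infty$ under the assumption $1-\epsilon>\epsilon_++\epsilon_-$. Noting that $\tilde{\mathcal{P}}$ is closed under concatenation, i.e., $\mathcal{V}_1,\mathcal{V}_2\in\tilde{\mathcal{P}}\Rightarrow\mathcal{V}_1\circ\mathcal{V}_2\in\tilde{\mathcal{P}}$, as well as the convexity of $\tilde{\mathcal{P}}$, we have $\Psi_+,\Psi_-\in\tilde{\mathcal{P}}$. Then,
$$\mathcal{E}\circ(\gamma_+\Psi_+-\gamma_-\Psi_-) = \mathcal{E}\circ\sum_{k=0}^{\infty}\frac{(\epsilon_-\Xi-\epsilon_+\Lambda)^k}{(1-\epsilon)^{k+1}} = \mathrm{id},$$
where the last equality follows from the convergence of the series on the $d$-dimensional system, with $d$ the dimension of the system. Thus, we have $\mathcal{U} = \gamma_+\,\mathcal{E}\circ\Psi_+\circ\mathcal{U} - \gamma_-\,\mathcal{E}\circ\Psi_-\circ\mathcal{U}$, and since $\Psi_+\circ\mathcal{U},\Psi_-\circ\mathcal{U}\in\tilde{\mathcal{P}}$, we get $\gamma_{\rm opt}(\mathcal{U})\le\gamma_++\gamma_- = \frac{1}{1-\epsilon-(\epsilon_++\epsilon_-)} = \frac{1}{1-2\epsilon_+}$, where we used $\epsilon = \epsilon_+-\epsilon_-$, which follows from trace preservation. Next, we obtain the lower bound using the dual form of the robustness (6). Take the following witness: $W := d^{-2}J_{(\gamma_+\Psi_+-\gamma_-\Psi_-)^\dagger\circ\,\mathcal{U}}$. One can check that this is a well-defined bounded operator by bounding its operator norm (E9), where, in the first inequality, we used the triangle inequality of the operator norm; in the second inequality, we used the property of the completely bounded infinity norm $\|(\mathrm{id}\otimes\cdot)(X)\|_\infty\le|||\cdot|||_\infty := \sup_{\|X\|_\infty\le1}\|(\mathrm{id}\otimes\cdot)(X)\|_\infty$; and in the first equality, we used the dual property between the completely bounded infinity norm and the diamond norm [46]. Thus, $W$ is a valid bounded Hermitian operator.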
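As argued in Appendix C, this choice of $W$ satisfies the condition in (6), and evaluating $\mathrm{Tr}[WJ_{\mathcal{U}}] = \mathrm{Tr}[\Phi_d\,\mathrm{id}\otimes(\gamma_+\Psi_+-\gamma_-\Psi_-)(\Phi_d)]$ term by term (E11) and using (7) results in the lower bound of the statement.
Although it might look daunting to evaluate the lower bound, it can actually be exactly obtained for many cases. An important observation is that if at least one state preparation is involved, the term $\mathrm{Tr}[\Phi_d\,\mathrm{id}\otimes(\Lambda^m,\Xi^{k-m})_{\vec{s}}(\Phi_d)]$ collapses to a constant. For instance, for the amplitude damping channel $\mathcal{A}_\epsilon$ [...]
[...] concluding the proof.
The case for the dephasing noise $\mathcal{F}_\epsilon(\rho) = (1-\epsilon)\rho+\epsilon Z\rho Z$ can be shown similarly. We first get an upper bound by providing a specific decomposition. Namely, consider the following decomposition:
$$\mathrm{id} = \frac{1-\epsilon}{1-2\epsilon}\,\mathcal{F}_\epsilon - \frac{\epsilon}{1-2\epsilon}\,\mathcal{F}_\epsilon\circ\mathcal{Z}, \qquad (\mathrm{H4})$$
which can be explicitly checked as $(1-\epsilon)\mathcal{F}_\epsilon - \epsilon\,\mathcal{F}_\epsilon\circ\mathcal{Z} = [(1-\epsilon)^2-\epsilon^2]\,\mathrm{id} + [\epsilon(1-\epsilon)-\epsilon(1-\epsilon)]\,\mathcal{Z} = (1-2\epsilon)\,\mathrm{id}$. By applying $\mathcal{U}$ to both sides of (H4) from the right and using (3), we get $\gamma_{\rm opt}(\mathcal{U})\le\frac{1}{1-2\epsilon}$. Next, we obtain a lower bound using (7) and the dual form of the robustness (6). Consider the following $W\in\mathbb{H}$:
$$W := d^{-2}J_{\mathcal{F}_\epsilon^{-1\dagger}\circ\,\mathcal{U}}, \qquad \mathcal{F}_\epsilon^{-1} = \frac{1-\epsilon}{1-2\epsilon}\,\mathrm{id} - \frac{\epsilon}{1-2\epsilon}\,\mathcal{Z}. \qquad (\mathrm{H6})$$
As argued in Appendix C, this choice of $W$ satisfies the condition in (6). A lower bound of the robustness can be computed using this as
$$\mathrm{Tr}[WJ_{\mathcal{U}}]-1 = \mathrm{Tr}\big[\Phi_d\,\mathrm{id}\otimes\mathcal{F}_\epsilon^{-1}(\Phi_d)\big]-1 = \frac{1-\epsilon}{1-2\epsilon}-1 = \frac{\epsilon}{1-2\epsilon}. \qquad (\mathrm{H7})$$
Thus, the dual form of the robustness (6) implies $R_{\mathcal{I}_{\mathcal{F}_\epsilon}}(\mathcal{U})\ge\frac{\epsilon}{1-2\epsilon}$, and using (7), we get $\gamma_{\rm opt}(\mathcal{U})\ge\frac{1}{1-2\epsilon}$. Combining it with the matching upper bound shown above concludes the proof of the statement.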