Virtual Distillation for Quantum Error Mitigation

Contemporary quantum computers have relatively high levels of noise, making it difficult to use them to perform useful calculations, even with a large number of qubits. Quantum error correction is expected to eventually enable fault-tolerant quantum computation at large scales, but until then it will be necessary to use alternative strategies to mitigate the impact of errors. We propose a near-term friendly strategy to mitigate errors by entangling and measuring $M$ copies of a noisy state $\rho$. This enables us to estimate expectation values with respect to a state with dramatically reduced error, $\rho^M/ \mathrm{Tr}(\rho^M)$, without explicitly preparing it, hence the name"virtual distillation". As $M$ increases, this state approaches the closest pure state to $\rho$, exponentially quickly. We analyze the effectiveness of virtual distillation and find that it is governed in many regimes by the behavior of this pure state (corresponding to the dominant eigenvector of $\rho$). We numerically demonstrate that virtual distillation is capable of suppressing errors by multiple orders of magnitude and explain how this effect is enhanced as the system size grows. Finally, we show that this technique can improve the convergence of randomized quantum algorithms, even in the absence of device noise.


I. INTRODUCTION
Performing meaningful calculations using near-term quantum computers is challenging because of the relatively high error rates of these devices. While quantum error correction promises to enable quantum computation with arbitrarily small levels of noise, the overhead required is too large to be currently practical [1,2]. The most plausible paths between today's quantum computers and a fault-tolerant device assume a modest decrease in error rates together with a large increase in the number of qubits [2,3]. We find it interesting to ask if these additional qubits can be used fruitfully without employing the full machinery of fault-tolerance. In this work, we explore an alternative to traditional quantum error correction that uses multiple independently-performed copies of a computation for error mitigation.
A variety of strategies exist to mitigate against errors on noisy intermediate-scale quatum (NISQ) devices, i.e., to efficiently approximate the output that would be produced in the absence of noise. One class of approaches uses data collected at a variety of error rates to characterize the function relating the measured value of an observable to the error rate and extrapolate to the zero noise limit [4][5][6]. An alternative strategy proceeds by assuming a particular noise channel and expressing its inverse as a quasiprobability distribution over modified copies of the original circuit [4]. Other techniques work by comparing * corresponding author: whuggins@google.com † corresponding author: jmcclean@google.com classically tractable simulations (tractable because they utilize a restricted set of gates) to evaluations of the same circuits on a noisy device [7][8][9]. These methods aim to learn enough about the impact of the noise to predict the noise-free expectation values for structurally similar circuits. In Ref. 10, O'Brien et al. put forward a version of quantum phase estimation algorithm that achieves protection against errors by inverting the state preparation procedure and verifying that the system has returned to a reference state at the end of the computation. Besides these methods, more specific tools have been developed for ground state calculations [11][12][13], for situations when the desired state possesses certain symmetries [14][15][16][17][18], and for treating errors during measurement [19][20][21].
Before the modern field of quantum error-correction was developed, an alternative proposal was put forward for stabilizing quantum computations [22][23][24]. The essence of this approach is to execute M redundant copies of a computation in parallel and use a measurement to project into the symmetric subspace between these copies. Similar measurement primitives (measurements of the swap operator and its generalizations) have been applied to measure Renyi entanglement entropies and other polynomial functions of the density matrix [25][26][27][28][29][30][31][32][33]. One well-studied way to perform such a measurement is to use a Clebsch-Gordon or Schur transform to rotate to a basis which diagonalizes the swap operator [34]. In Ref. 35, Cotler et al. built on these approaches to implement an idea they call "virtual cooling." By performing a joint measurement on M copies of a thermal state at inverse temperature β (ρ ∝ e −βH ), they were able to estimate expectation values with respect to the thermal state arXiv:2011.07064v3 [quant-ph] 2 Aug 2021 at inverse temperature M β (ρ M ∝ e −M βH ). In this paper, we apply the same kind of measurement techniques to the problem of mitigating errors in a noisy quantum computation.
Earlier work on using symmetrization to stabilize a noisy quantum computation focused on protocols that prepared an approximately purified state [22][23][24]. We abandon this goal, and instead aim to reconstruct expectation values with respect to an approximately purified state without explicitly preparing it. We refer to this approach as virtual distillation, using the word "virtual" to emphasize that we don't actually prepare a purified version of the state like a typical distillation scheme would [36][37][38]. To be specific, we use collective measurements of M copies of ρ to measure expectation values with respect to the state where ρ = i p i |i i| is a spectral decomposition of ρ. Under this approach, the relative weights of the nondominant eigenvectors are suppressed exponentially in M . This represents an improvement over approaches which demand that the approximately purified state is prepared explicitly, which achieve a suppression that is merely linear in M in the general case [22][23][24]39]. Our proposed error mitigation technique offers the opportunity to make use of additional qubits to enhance the quality of a noisy computation without the large overhead of traditional quantum error correction. Furthermore, the technique is simple to use and analyze. If we neglect the errors that occur during measurement, it is straightforward to obtain analytic expressions for the states whose expectation values we effectively measure and for the variance of the resulting estimator. In the limit where the level of noise is small, the number of additional measurements required by our approach goes to zero. Our error mitigation strategy, as we shall show, is capable of reducing the impact of stochastic errors arising from noise on a near-term device as well as stochastic errors inherent to randomized quantum algorithms implemented on an error-free device.
We begin in Section II by introducing the theoretical formalism of virtual distillation and presenting its simplest implementation. We continue in Section II B with an analysis of the sample complexity of the simple version of this technique along with a proof that there exist more efficient generalizations under certain circumstances. In Section III, we study the error mitigation performance of virtual distillation analytically by splitting the effect of errors into two components. We treat the shift of the leading eigenvector of the density matrix away from the target (error-free) state perturbatively (Section III B), and the shift of the noisy density matrix away from its dominant eigenvector using a phenomenological model of errors (Section III A). Although the second effect may be exponentially suppressed by increasing the number of states (M ), the same is not true for the first effect, which in the worst case limits the performance of virtual distillation to only providing a constant-factor improvement in error rate (as a function of the underlying physical noise rate). For purely coherent errors, this first effect is the only consideration and virtual distillation offers no protection. To complement this analysis, in Section IV we present numerical simulations of virtual distillation applied to various noisy quantum circuits. We observe here that for some range of noise levels, virtual distillation achieves a rate of error suppression exceeding the bounds suggested in Section III B. Finally, in Section V, we consider the performance of our technique when applied to the stochastic errors that arise during randomized algorithms for real-time evolution.

II. THEORY
Virtual distillation is a protocol for using collective measurements of M copies of a state ρ to suppress incoherent errors by measuring expectation values with respect to the state ρ M /Tr(ρ M ). Virtual distillation approximates the error-free expectation value of O as The resulting estimator converges exponentially quickly towards the closest pure state to ρ as M is increased. In this section, we lay out the basic theory behind virtual distillation. We present the simplest implementation in Section II A and an analysis of the measurement overhead in Section II B. In Algorithm 1 below, we present pseudocode for the implementation discussed in more detail in Section II A. We begin by establishing some assumptions and notation. Throughout this paper we deal with operations that act on multiple copies of the same state. We make the assumption that the noise experienced by the separate copies has the same form and strength. If we relax this assumption, then we still measure an effective state that corresponds to the product of the density matrices of the individual copies so long as the copies are not entangled prior to virtual distillation. We briefly explore this more general situation in Appendix H.
We use the letter N to indicate the number of qubits in an individual system and the letter M to indicate the number of copies (which we sometimes refer to as subsystems). Superscripts with parentheses indicate an operator that acts on multiple systems. For example, we shall denote the cyclic shift operator between M copies by S (M ) . We use bolded superscripts without parentheses to denote which copy an operator acts on, e.g., O 1 indicates the operator O acting on subsystem 1. We use superscripts without a bold-faced font or parentheses to indicate exponentiation as usual. Subscripts are used in two different ways. Subscripts on an operator generally indicate which qubit within a system the operator acts on. The exception is when the subscript is being used more generically as an index in a summation, which should always be clear from the context and the presence of the symbol.

Algorithm 1 Virtual distillation, basic implementation (see Section II A)
Input: A number of measurement repetitions K, 2K copies of the N qubit state ρ (provided two at a time). Output: An error-mitigated estimate of Zi for each qubit in ρ; Zi corrected ≈ Tr(Z i ρ 2 ) Tr(ρ 2 ) .
Set Ei = 0 for each qubit i ∈ 1..N . Set D = 0. for k ∈ 1..K do Perform any SWAP operations necessary to make it possible to couple each qubit in the first copy of ρ with the corresponding qubit in the second copy.
Apply the two-qubit gate B (2) i (defined below in Eq. 10 of Section II A) between each qubit i in the first copy and the corresponding qubit in the second copy.
Measure both states in the computational basis. Let z 1 i and z 2 i denote the measurement outcomes for the ith qubits in the first and second copies of ρ respectively. for In order to evaluate the numerator and denominator of Eq. 1, we can make use of the following equality [26,27,35], Here, O i indicates the observable O acting on (an arbitrary) subsystem i and S (M ) indicates the cyclic shift operator on M systems, i.e., This identity can be proven by expanding the right-hand side, carefully keeping track of the indices. Without loss of generality we choose i = 1, yielding Tr(Oρ M ).
In Figure 1 we present a diagrammatic representation of Eq. 3 for the case where M = 3 (note that we have commuted ρ ⊗3 with S (3) in the diagram). The quantities in the numerator and denominator of Eq. 2 can be evaluated in a number of different ways. and i = 1 using tensor network notation [40][41][42]. The blue square represents the operator O 1 , each red circle represent a copy of the state ρ, and the connections between the shapes indicate indices which are summed over. The cyclic shift operator S (3) is naturally represented as a product of two swap operators, which are themselves indicated by the crossed wires. Note that the top diagram actually corresponds to the expression Tr(O 1 ρ ⊗3 S (3) ); we commuted ρ ⊗3 with S (3) before producing the figure. Rearranging the wires to yield the bottom diagram is equivalent to the simplification of the summation in Eq. 5.
For simplicity, we focus our presentation one such approach Section II A. In that section, we roughly follow the work of Ref. 35, except that we use the language of qubits rather than bosonic systems. We discuss a variety of alternative protocols in Appendix A, Appendix B, and Appendix C. Figure 2 summarizes the differences between these variants. The practical utility of these techniques as error-mitigation tools will be partly determined by the number of samples necessary to evaluate the corrected expectation values to within some target precision . We address this issue in Section II B and also show that their exists generalizations of our approach that can further reduce the number of circuit repetitions for a desired precision.

A. Measurement by Diagonalization
In this section, we present a straightforward strategy applicable when the operator O is the Pauli Z operator acting on a single qubit and M = 2. Other single-qubit observables can be accessed by applying the appropriate single-qubit rotations before the virtual distillation procedure. This realization of our error mitigation technique Figure 2. A flowchart that describes the choices involved in selecting between the different variants of virtual distillation presented in this work. Blue boxes denote questions for the experimentalist to answer about the available quantum resources and problem to be studied, green boxes link to the relevant sections in the text and briefly summarize the main features of each variant. The flowchart provides direction to the most flexible variant given the answers provided in the blue boxes but the actual experimental performance will depend on many factors. requires only a single additional layer of two-qubit gates followed by measurement in the computational basis. We present a schematic of this approach in Figure 3.
Rather than using the relation in Eq. 3 directly, we instead define a symmetrized version of our observable, For the specific case we consider here, that means we take It is straightforward to use Eq. 3 to show that Using the symmetrized observable is advantageous because Figure 3. A circuit diagram of our approach applied to a six-qubit circuit. We use twice the number of qubits to independently perform two copies of the original circuit. We then apply a single layer of the two-qubit gates specified in Eq. 10 before measuring each qubit in the computational basis. This allows us to estimate the error-mitigated expectation values for all single-site Z operators.
or, in our case, [Z (2) k , S (2) ] = 0. Both S (2) and Z (2) k factorize into tensor products of operators that act separately on each pair of qubits, where the ith pair consists of the ith qubit from each system. Therefore, we may simultaneously diagonalize S (2) and Z (2) k S (2) using an operator that factorizes with the same structure. We denote the two-qubit unitary that performs this diagonalization on the ith pair B (2) i . We give a matrix representation for this gate below, noting that there is some freedom in the choice of phases for the matrix elements, We then define As desired, this unitary diagonalized the individual factors that make up the observables, This diagonalization is particularly easy to implement when each qubit from the first copy of ρ is adjacent to the corresponding qubit from the second copy. The procedure for measuring the observables required to estimate the numerator and denominator of Eq. 8 then reduces to applying a single layer of N two-qubit gates in parallel and measuring in the computational basis. In fact, because B (2) diagonalizes Z (2) k S (2) for all N values of k, we naturally collect the data required to estimate the errormitigated expectation values for all N of the operators Z k simultaneously. By applying the appropriate singlequbit rotations before performing virtual distillation, we could instead access an arbitrary single-qubit observable on each qubit. We capture this process diagrammatically in Figure 3.
In order to develop some intuition, it is helpful to express ρ ⊗2 using a spectral decomposition of ρ and consider two separate components of the resulting sum, The calculation of measurement probabilities and expectation values is a linear operation on the density matrix; we can therefore consider these two components separately. The component of the state with i = j is in the +1 eigenspace of S (2) and leads to measurements of S (2) which yield the +1 eigenvalue with probability p = i p 2 i = Tr(ρ 2 ). In the case where i = j, |i i| ⊗ |j j| is an even superposition of symmetric and anti-symmetric states, |i |j = 1 2 |i |j + |j |i + 1 2 |i |j − |j |i . (15) For this component of the state, measurements of S (2) yield +1 and −1 with equal probability and S (2) = 0. Combining these two cases, we have the expected equality, Tr(S (2) ρ ⊗2 ) = Tr(ρ 2 ). Measurements of S (2) O (2) follow a similar pattern. We find it interesting to contrast this behavior with the stabilizer theory of quantum error correction. In the stabilizer formalism, errors are detected by projecting through measurement into the −1 eigenspace of one or more symmetries. In our approach, we instead rely on errors being equally supported on the eigenspaces of the symmetry we measure.

B. Sample Efficiency
The number of circuit repetitions required to determine the error-mitigated expectation values within a precision depends on the variance of our estimator. In this section, we present expressions for this variance. We focus on the M = 2 case and the methods discussed in Section II A. The calculations are also applicable to the variant protocol we present in Appendix C. We also show how their exists an extension to our protocol that makes more efficient use of multiple copies when the noise level is sufficiently high.
We'd like to determine the variance of our estimator for the error-mitigated expectation value We leave the derivation to Appendix D and simply give an (approximate) expression for the variance, where R refers to the number of measurement repetitions. It's useful to consider what happens in the limit where ρ is a pure state. In that case, the second and third lines are zero and the variance reduces to exactly what one would expect when averaging 2R independent measurements of O. As the purity of ρ decreases, the variance, and the number of circuit repetitions, increases. The rest of this section focuses on laying the groundwork to improve the sample efficiency of these techniques. This is an important goal because the number of samples required can grow large given sufficiently noisy circuits. At high enough error rates, we are highly likely to find ourselves in a situation where We now make the assumption that the level of error mitigation offered by measuring ρ M is sufficient but we have 2K M copies of ρ available. For simplicity, we focus on the case where M = 2 and O is a Pauli operator acting on one or more qubits. We present a generalization of our approach involving a collective measurement of all 2K copies of ρ that performs better than a naive parallelization.
The naive approach we hope to beat consists of taking K pairs and running the protocol described above in parallel, averaging the results. For simplicity, we focus on the variance of our estimator for the quantity that appears in the numerator of Eq. 8 rather than the ratio itself. In Appendix D we show that the variance of our estimator for S (2) O (2) is 1 2 Tr(ρO 2 ) + 1 2 Tr(ρO) 2 − Tr(ρ 2 O) 2 . Therefore, the variance obtained when using 2K copies in parallel is exactly We prove below that it is possible in some situations to obtain a more sample-efficient estimator for the corrected expectation value by performing a joint measurement on all 2K copies. We do so by providing an operatorÕ with the desired expectation value and calculating its variance. First, we define the operator where we use S (i,j) to denote the swap operator specifically between subsystems i and j. It is simple to show that Tr(Õρ ⊗2K ) = Tr(Oρ 2 ).
We compute the variance ofÕ with respect to the state ρ ⊗2K in Appendix E, finding that When Tr(ρ 3 ) is small, the second term in Eq. 23 is suppressed and there is a regime where the variance of this operator shrinks quadratically with K. The naive approach, where we perform K independent calculations on separate pairs results in an estimator whose variance is suppressed only linearly in K. We do not suggest a particular strategy, let alone one that is NISQ-friendly, for implementing the measurement ofÕ. We hope that future work can address this issue. Furthermore, while we have established that generalizations of the simplest virtual distillation procedure can outperform a naive parallel strategy, we have not established a comprehensive theory on the limitations of virtual distillation. It would be useful to quantify the minimum number of samples required to resolve T r(Oρ M ) given access to a large number of copies of ρ under various assumptions about the spectrum of the density matrix.

III. PERFORMANCE UNDER DIFFERENT NOISE MODELS
In the numerical studies, we will present evidence that the performance of virtual distillation can be essentially predicted by the combination of two contributions. Here we find it instructive to consider them separately using simple analytical models. To understand the potential benefit of our approach using the minimal setup, we consider the fidelity of with the ideal state generated by noiseless evolution (neglecting error introduced by the measurement procedure). We first consider the performance under noise that maps the ideal state to states orthogonal to it, leaving the dominant eigenvector of the density matrix as the ideal state. We then turn towards the effect of errors that lead to states non-orthogonal to the ideal state, causing a drift in the dominant eigenvector of the density matrix. The essential behavior of virtual distillation is to remove errors of the first kind rapidly, while converging to a floor determined by the drift in the dominant eigenvector that enables a large constant factor improvement over the erred state.

A. Orthogonal Errors
We first consider idealized errors that leave the dominant eigenvector as the ideal state. We consider a phenomenological error model motivated by the assumption that we can think of errors as discrete events that occur locally in space and time with some probability. For simplicity, we model every gate as a stochastic quantum map where with probability p an error occurs, and we assume that every new error sends the quantum evolution to a new orthogonal state. The resulting density matrix for a circuit with G gates is The operator for ρ 2 is similar with all the coefficients squared, as all the states are assumed to be orthogonal. Therefore, The fidelity with the ideal state ρ 0 is Therefore we expect a quadratic suppression of errors in the most favorable case. The result is similar in the case of M copies: The other factor affecting the performance of virtual distillation besides the fidelity is the sample complexity. We analyze the general case in more detail in Appendix D, but it is instructive to briefly consider the performance under this simplified model of errors. For simplicity, we assume that we aim to measure the errormitigated expectation value of a Pauli operator O at the M = 2 level using R independent experiments to estimate the numerator and denominator of Eq. 16 (for a total of 2R experiments). Then the variance of our estimators for the numerator and denominator are upper bounded by 1, and we have Because of our assumption that O is a Pauli operator, and therefore ||O|| = 1, we have Tr(Oρ 2 ) ≤ Tr(ρ 2 ), implying Tr(Oρ 2 ) 2 ≤ Tr(ρ 2 ) 2 . Therefore, (32) When p is small, we can neglect the p 2 term in the denominator. Therefore, taking is sufficient to estimate O corrected to within a fixed additive error.

B. Non-Orthogonal Error Floor
The analysis of the previous section made the simplifying assumption that the dominant eigenvector of the density matrix, ρ 0 = |0 0|, corresponds exactly to the ideal state generated by noiseless evolution. In practice, errors will lead to population in states that may not be orthogonal to the target state, leading to a drift in the dominant eigenvector of the density matrix. We will see in our numerical studies that this drift limits the maximum potential upside of virtual distillation. In this second, we develop an understanding of this drift by using perturbation theory to consider the first-order change in the dominant eigenvector of the density matrix.
Let us consider a state ρ in the middle of a noisy preparation circuit, allowing for ρ to already be somewhat distorted by noise. Writing ρ in its eigenbasis, we have where we order the eigenvalues in descending order. Note that we use the symbol λ i for the ith eigenvector of the density matrix rather than p i throughout this section, reserving the symbol p for the coefficients associated with a Kraus operator decomposition of our noise channel. We wish to consider the impact of a subsequent noise channel defined in terms of a set of Kraus operators, Note that we have demanded a representation of the channel where K 0 is the identity matrix in order to simplify our analysis. Now let ∆V denote the change in the density matrix induced by this channel (ρ → ρ + ∆V ), where we define the scale ∆ by taking ||V || to be O(1). Now we make the assumption that we are in the lowerror regime. Specifically, we assume that λ 0 λ 1 and that ∆ |λ 0 − λ 1 |. Under this assumption, we satisfy the necessary conditions for applying matrix perturbation theory to the dominant eigenvector [43]. We can therefore proceed by expressing the dominant eigenvector of ρ + ∆V as a convergent power series in ∆. This yields where |0 denotes the dominant eigenvector of ρ + ∆V , |0 (0) denotes the dominant eigenvector of the unperturbed ρ, and |0 (i) denotes the correction at ith order. Likewise, we can also express the eigenvalue corresponding to the dominant eigenvector as a power series in ∆, We can then proceed in the usual way, expanding the eigenvalue equation, and equating terms order by order. This leads to a familiar expression for the first order correction to the dominant eigenvector in terms of the zeroth order eigenvalues and eigenvectors, At this point, it's useful to carefully consider the normalization of |0 . Let |D denote the normalized form of |0 , where we have made use of the fact that the first and second order corrections are both orthogonal to the unperturbed eigenvector.
We can now compute the trace distance between |D and the dominant eigenvector of the unperturbed state, Now let us expand 0 (1) 0 (1) in terms of the Kraus operators of our noise model.
where we omit the (0) superscripts of the eigenvalues and eigenvectors on the right-hand side for readability.
We can see that, in the general case, we expect a nonzero contribution to the trace distance at first order in ∆. Because ρ 2 /Tr(ρ 2 ) ≈ |D D| in the low-noise regime, this will effectively set a floor for how well our method can correct errors. Therefore, without further constraints on the state, the noise model, or the observables being measured, our method will not achieve a quadratic suppression in errors in the low noise limit but rather a constant factor improvement whose magnitude depends on the typical size of a quantity we denote by the symbol γ, Interestingly, when we examine the data from our numerical simulations, we do obtain an improvement consistent with a quadratic suppression of errors at intermediate error rates. Additionally, γ has no lower bound; it can in some cases be zero, in which case we expect to recover the quadratic suppression of error predicted from Section III A. As the trace distance is an upper bound for the error in any observable, particular observables of particular states may recover this performance even when γ = 0.
In order to shed some light on the error floor set by the drift in the dominant eigenvector of the density matrix, it can be helpful to ask when we might expect γ to be near zero. It is clear that this quantity must be zero if one of two conditions hold: One way that this can occur is if the state and the circuit have a natural set of symmetries. The first condition holds if the error is drawn from such a symmetry group, while the second is satisfied if it violates it strictly.
For an example of the second case, consider a bit-flip or amplitude-damping error channel acting on a state with a definite number of excitations. There are other situations where the second equality is approximately satisfied. For example, in circuits exhibiting the limits of quantum chaos, apart from a small light cone at the end of the circuit, any local errors lead to a state nearly indistinguishable from a Haar random state. Therefore, the matrix elements in Eq. 46 are exponentially small in the number of qubits. This sensitivity to local perturbations in random circuits is used in the cross-entropy benchmarking technique [44], and explains the improved behavior of our technique in numerical tests on random circuits.

IV. NUMERICAL EXPERIMENTS
In this section, we present numerical simulations of virtual distillation applied to three model systems. We first consider two classes of random circuits, chosen because they are simple limits where the behaviour of virtual distillation is easy to analyze. We then turn towards the application of virtual distillation to the simulation of the dynamics of a one dimensional spin chain following a quantum quench. This example allows us to study the behaviour of virtual distillation in the context of quantum simulation, an application which is a promising candidate for the eventual demonstration of practical quantum advantage in the NISQ era. We choose to focus on time evolution rather than the ground state problem mainly because ground states have additional structure which enables specialized error mitigation techniques and we are interested in how virtual distillation behaves in the absence of this structure.
We find it illuminating to characterize the effectiveness of our approach as a function of the expected number of errors in a particular circuit. This tends to allow more universal prediction of performance when trading between error rate per gate and number of gates.
We consider a noise model that focuses on stochastic errors in two-qubit gates. Specifically, after each twoqubit gate, we apply a single-qubit depolarizing channel to both qubits acted on by the gate. The expected number of errors (E) can be expressed simply as a function of the number of two-qubit gates in the circuit (G) and the single-qubit depolarizing probability (p, defined in the usual way in Eq. F3), To quantify the error, we focus mainly on the trace distance between the ideal state that would be obtained with noise-free evolution and the effective state accessed by virtual distillation The trace distance leads to a natural bound in the error for the expectation value of an arbitrary observable, where O is an observable with operator norm ||O||, and T (−, −) denotes the trace distance.

A. Scrambling Circuits
Both classes of random circuits that we simulate are related to the scrambling circuits used to demonstrate beyond classical computation in Ref. 44. The first class is essentially a one-dimensional version of the circuit family considered in that work. The second class of circuits is exactly the same as the first class, except that we remove the two-qubit gates. We provide some additional details in Appendix F. For these non-entangling random circuits, we still perform the noisy simulations of these circuits by applying single-qubit depolarizing channels in the same locations where the two-qubit gates would have been.
Because the behaviour of the non-entangling random circuits is particularly simple to understand, we consider this class of circuits first. In the absence of entangling gates, we can commute the applications of the singlequbit depolarizing channel to the end of the circuit. We can then combine them together into a single application per qubit with a larger effective error rate. We carry this procedure out analytically in Appendix F, showing that the dominant eigenvector of the density matrix corresponds exactly the ideal state. This leads us to expect behaviour similar to that of the phenomenological noise model we considered in Section III A.  (M = 2, 3) for a variety of non-entangling random circuits at two different system sizes (differentiated by thickness of markers). We plot the error, quantified by the trace distance to the state obtained from noiseless evolution, as a function of the expected number of single-qubit depolarizing errors, resulting from varying both the error rate and number of gates. Unlike other cases, for these non-entangling circuits, the eigenvalue floor vanishes and we see exponential suppression in the number of copies.
In Figure 4, we plot the trace distances between the ideal states generated by noiseless evolution and the states obtained by noisy evolution of these nonentangling random circuits (blue curve). We consider a variety of different circuit depths and error rates for both six-qubit systems (thin curves) and ten qubit systems (thick curves). For each of these simulations, we also calculate the trace distance between the ideal state and the states we are effectively accessing by using virtual distillation with M = 2 (orange dotted curve, ρ 2 /Tr(ρ 2 )) or M = 3 (green dotted curve, ρ 3 /Tr(ρ 3 )) copies. For each particular setting of circuit depth and error rate, we consider a single randomly chosen member from the ensemble of non-entangling scrambling circuits described above.
We see that the data from this variety of simulations collapses together when we plot the error (in terms of trace distance) as a function of the expected number of gate errors. When the expected number of errors is not too large, the curves for M = 1, M = 2, and M = 3 are nearly linear with slopes 1, 2, and 3 respectively. Although the noise model in this case does not exactly match the phenomenological model of Section III A, the results are broadly consistent. For these non-entangling random circuits, we observe a level of error suppression that is exponential in M .
In Figure 5 and Figure 6, we present plots that explore the behaviour of entangling random circuits on a one-dimensional line of qubits. When we considered the non-entangling random circuits, we found that error (quantified by the trace distance to the ideal state) depended mostly on the system size and the expected number of gate errors. This was true regardless of whether . We see that the dominant eigenvector determines the noise floor beyond which we cannot improve, independent of the number of copies, and that this floor drops as the size of the system increases.
or the expected number of errors was varied by changing the circuit depth or by changing the error rate pergate. Here we observe slightly different behaviour between these two cases, and therefore consider them separately. These two figures also differ from Figure 4 in that they include a red dashed curve corresponding to trace distance between the dominant eigenvector of the density matrix (lim M →∞ ρ M /Tr(ρ M )) and the ideal state, a quantity which is non-zero for the richer family of circuits we now consider. Figure 5 presents two plots that show the effects of varying the circuit depth at two different fixed error rates. We see that the error in the dominant eigenvector effectively sets a floor for the minimal error achievable by virtual distillation for any value of M . This floor grows slowly with increasing circuit depth. Furthermore, both the absolute magnitude and the rate of growth appear to be suppressed with system size. In Section III B we showed that the leading order contributions to the trace distance between the dominant eigenvector and the ideal state can be understood in terms of the matrix elements of the Kraus operators (see Eq. 44). As the circuit depth of the random circuit increases, we expect a 1D random circuit to approach a Haar random circuit at a depth proportional to the number of qubits N . Once this ap- proximation is sufficient, all but a small fraction of errors in the lightcone of the observable at the end of the circuit will lead to matrix elements that contribute to the drift in the dominant eigenvector that are exponentially small in the number of qubits. This observation may explain the scaling we see in Figure 5.
In Figure 6 we plot the error in terms of trace distance as we vary the expected number of gate errors by varying the per-gate error rate for a fixed circuit. At low error rates, we see that the errors in the dominant eigenvectors (red dashed curves) scale linearly with the error rate but are orders of magnitude smaller than the errors in the unmitigated state. This matches the behaviour we would expect from the analysis of Section III B. As in Figure 5, the error in the dominant eigenvector sets a floor for the performance of our method at finite M and that this floor is suppressed as the system size increases.

B. Heisenberg Quench
The properties of random circuits can be somewhat unique in their ability to scramble errors. It is thus important to consider how the approach works for other circuits of interest, like the quantum simulation of physical systems. In this section, we explore the performance of our approach applied to the simulation of time evolution following a quantum quench in a spin model. We initialize the system in an antiferromagnetic state, e.g., |0101 , , and large purple diamonds (M = 2, noisy distillation) alongside bounds determined by the trace distance to the ideal state using Eq. 48 (various curves). We plot these quantities as a function of the expected number of single-qubit depolarizing errors, which we vary by varying the number of gates, fixing the single-qubit depolarizing probability to 5 × 10 −3 . We see that for this specific observable, the trace distance bounds are pessimistic by roughly an order of magnitude, though generally respect the behavior of the eigenvector floor. Furthermore, we notice an almost perfect coincidence between the orange crosses and purple diamonds, indicating that performing the virtual distillation circuits of Section II A with noise has a negligible effect on the corrected expectation value.

and simulate the time evolution under the Hamiltonian
Here we have chosen the parameters, J x = J y = 1.0, J z = 1.5, h = 1.0, in order to match a previously studied family of non-integrable models [45], although we take open boundary conditions rather than periodic ones. We approximate the time evolution under this Hamiltonian by Trotterization with a timestep of ∆t = 0.2. Specifically, we use alternating layers of single-qubit gates, twoqubit gates between odd-even pairs of qubits, and twoqubit gates between even-odd pairs of qubits. As above, we simulate the resulting circuits with single-qubit depolarizing noise applied after every two-qubit gate. In Figure 7, we plot the bounds on the error of an arbitrary observable (normalized so that ||O|| = 1) derived from the trace distance to the noiseless state. We vary the expected number of errors by varying the circuit depth of a six qubit system with a fixed single-qubit depolarizing probability of 5 × 10 −3 . Alongside these bounds Heisenberg model. We plot the trace distance to the state obtained from noiseless evolution as a function of the expected number of single-qubit depolarizing errors. The expected number of errors is varied by changing the error rate per-gate, fixing the number of two-qubit gates to be 450. We show this data for 6 and 10 qubit systems (differentiated by the size of the markers). As we increase the system size from 6 qubits to 10, we observe that the error (quantified by the trace distance to the ideal state) decreases for the error-mitigated states (M > 1).
(plotted using solid and dashed curves) we also plot the actual average error in the single-site magnetization (averaged over the 6 sites) at various points throughout the circuit. For the two-copy (M = 2) version of our proposal, we plot the error calculated directly from the state ρ 2 /Tr(ρ 2 ) using yellow crosses and the error we would obtain by applying the destructive measurement described in Section II A using large purple diamonds. For this second calculation, we simulate the application of the six two-qubit gates (Eq. 10) required to diagonalize the observables using the same noise model as the rest of the circuit. From the nearly perfect overlap of the yellow crosses with the purple diamonds, we can see that circuit noise during the diagonalization step has barely any effect on the reconstructed expectation values. It is also apparent that, although the average error in the magnetization does not saturate the bounds implied by the trace distance, our approach suppresses the errors in the actual expectation values to a similar degree that it suppresses the trace distance. Figure 8 offers a different look at the same system. As with Figure 6, in this figure we fix the number of twoqubit gates to be 450 and we vary the expected number of gate errors by sweeping over a range of per-gate error rates. Here we clearly see the impact of the floor set by the drift in the dominant eigenvector (red dotted curve). At low gate error rates, the error (in terms of trace distance to the ideal state) for the virtually distilled states (M > 1) is suppressed by a constant factor relative to the error in the unmitigated state (M = 1). The constant factor improvement is substantial and ap- Figure 9. The average overhead in the number of measurement repetitions required to measure the single-site magnetization (of a Heisenberg model) with virtual distillation (M = 2). The overhead is calculated by taking the ratio of the time required using our technique to achieve a fixed target precision and the time required when measuring the same quantity to the same precision with respect to the unmitigated noisy state (M = 1). We vary the expected number of errors by varying the error rate per-gate, fixing the number of two-qubit gates to be 450, and plot the overhead as a function of the expected number of single-qubit depolarizing errors. We show data for 6 and 10 qubit systems, denoting the larger system using larger markers. As the error rate grows beyond O(1) errors, the overhead increases dramatically, with the larger system seeing the greatest inflation.
pears to increase with system size. Both the size of the improvement and its sensitivity to the system size are smaller than we observed for the one-dimensional scrambling circuits of Figure 6.
In Figure 9, we consider the cost of performing the twocopy (M = 2) version of virtual distillation for the same systems considered in Figure 8. We do this using the expression for the variance presented in Eq. 17 and derived in Appendix D. Using this expression, we calculate the average variance of our error-mitigated estimators for the single-site magnetization, {Z i }. We consider the ratio of this average variance (for the error-mitigated expectation values) with the average variance of the same measurements without error mitigation. Because the number of measurements required for some fixed precision scales linearly with the variance, this ratio is also the ratio between the number of measurements required to use virtual distillation and the number of measurements required to measure the unmitigated expectation values (assuming the same target precision). This quantity therefore encapsulates the overhead incurred by our scheme.
When the expected number of errors is small, we see that virtual distillation barely increases the number of measurements required. It is only as the number of errors grows larger than one that the measurement cost rises dramatically. We note that Eq. 17 implicitly assumes that we perform a number of measurements R such that This assumption will break down when the target precision is low and the expected number of errors in Figure 9 is large, but the qualitative conclusion remains the same.

V. MITIGATING ALGORITHMIC ERRORS
To date, most error mitigation methods have focused on the reduction of errors caused by imperfections in a device implementation, such as decoherence or control errors. Here explore the idea that some of these techniques can be applied to algorithmic errors incurred during otherwise noise-free implementations of randomized algorithms. Previous works have used extrapolation [46] or randomized symmetry application [47] to mitigate coherent errors in evolution; we extend this concept to incoherent errors.
Recent developments in Hamiltonian simulation have led to the development of randomized evolution methods such as qDRIFT [48], randomized Trotter [49], and combinations thereof [50], which have benefits in some situations over their deterministic counterparts. As these methods are randomized, they output mixed states rather than pure states, even in the absence of noise. Moreover, they depend on an approximation parameter with a natural limit in which they converge to the pure state generated by exact evolution. In this section, we show numerically that virtual distillation applied to qDRIFT can suppress this deviation from the exact evolution. For the particular model system we consider, we find that virtual distillation can reduce the coherent space-time volume required to reach a particular accuracy threshold by a factor of 8 or more compared with the standard qDRIFT.

A. qDRIFT
We briefly introduce some background on the qDRIFT method. qDRIFT simulates time evolution under a Hamiltonian H, by constructing product formulae using a randomized selection rule. Terms are chosen from H at random, with a selection probability proportional to their interaction strength in the Hamiltonian. One then evolves the system forwards in time under this Hamiltonian term, for a fixed timestep and repeats this process a number of times, generating a product formula that provides an approximation to the time evolution operator. When averaged over the classical randomness (in the choice of interaction terms), qDRIFT generates a quantum channel that closely approximates the exact evolution more closely than the individual product formulae. Importantly, unlike most deterministic Trotter methods, the scaling of this approach does not depend explicitly Figure 10. qDRIFT coherent cost reduction through virtual distillation in the Heisenberg model. Here we show the number of coherent qDRIFT steps required to reach a target trace distance, with and without virtual distillation using 2 copies. We see that there is a consistent reduction of at least 16x in the number of required steps. When accounting for the overhead of using two copies, this amounts to a 8x reduction in the coherent space-time volume used to reach the same error rate.
on the number of terms in the Hamiltonian, but rather than 1-norm of the coefficients.
More precisely, we consider a Hamiltonian that we may decompose as H = i h i H i , where all h i are made real and positive by absorbing signs into H i , and the spectral norm of H i is bounded by 1. Defining λ = i h i , the diamond norm distance between the qDRIFT channel and the true time evolution is bounded by where η is the number of qDRIFT selection steps performed to generate each instance of the qDRIFT channel, and hence controls the amount of coherent evolution required. As η increases, the resulting quantum channel converges to the unitary corresponding to the exact evolution. It will be our aim to understand how our virtual distillation technique can reduce the coherent space-time volume required, by reducing this factor η required to achieve the same error in practice.

B. Virtual distillation applied to qDRIFT
Here we study the application of virtual distillation to qDRIFT numerically. Specifically, we investigate how virtual distillation can impact the number of coherent steps, η, required to reach a target accuracy. For this, we choose a Heisenberg Hamiltonian with up to 6 qubits per copy. The Hamiltonian is given by where h i ∈ {−h, h} are randomly chosen Z magnetic field strengths, and periodic boundary conditions are applied such that site N + 1 is site 1. For our studies here, we choose a time evolution length of t = N and let h = 1. We numerically investigate the number of coherent qDRIFT steps required to achieve a trace distance of 0.01 to the ideal state for evolutions under such a Heisenberg model. The results of this analysis are shown in Fig. 10. We see for this system that virtual distillation consistently reduces the required number of coherent steps to achieve the desired trace distance by a factor of more than 16x. If we account for the space overhead of using two copies, this still amounts to a space-time advantage of 8x. These results suggest that the use of error mitigation techniques may be further developed to yield practical algorithmic improvements for real systems, especially in the NISQ regime.

VI. CONCLUSION
In this work, we showed how techniques for using multiple copies of a state to access polynomials of the density matrix can be used to mitigate incoherent errors. We studied the effectiveness of this approach for two analytically tractable noise models and characterized its limit in terms of the dominant eigenvector of the noisy density matrix. We numerically demonstrated reductions in the error (quantified by the trace distance to the noise-free state) of up to three orders of magnitude for a collection of small model systems. Furthermore, we showed that this error suppression is enhanced as the system size or the speed of information scrambling grows. We also considered the application of our error mitigation approach to the incoherent algorithmic error that arises when approximating time-evolution using the qDRIFT algorithm, finding a substantial constant factor improvement.
Our proposed strategy for error mitigation, which we refer to as virtual distillation, is simple to use and analyze. It provides a natural way to take advantage of the surplus of qubits that we expect to have available as the NISQ era continues to suppress the effects of incoherent errors. We expect that our technique will prove complementary to other error mitigation and calibration techniques, especially those capable of addressing coherent errors. The effective state accessed by virtual distillation approaches the dominant eigenvector of the density matrix exponentially quickly as the number of copies, M , increases. Therefore, the utility of our technique depends mainly on the error in the dominant eigenvector and the number of samples required. In particular, we have devoted a signficant amount of attention to the question of sample complexity. This is due to the fact that proposed NISQ applications, especially in quantum simulation, already face a daunting cost in this regard [17]. Our analytical work and numerical simulations indicate that our strategy is most effective and affordable when the num-ber of errors expected in the circuit is O(1). The reach of our approach will therefore naturally grow throughout the NISQ era as hardware platforms continue to improve.
There are several other important considerations relevant to the performance of this technique in the NISQ era besides the drift in the dominant eigenvector and the sample complexity. First of all, virtual distillation requires collective measurements that couple the qubits of one copy with the corresponding qubits of additional copies. For hardware platforms based on a 2D grid of qubits, these measurements are easiest to perform when the connectivity of the original circuit is linear and would require a substantial number of additional gates in the general case. Fortunately, a linearly connected array of qubits is known to be sufficient to achieve the optimal gate complexity in some cases, including certain approaches for the simulation of quantum chemistry [51,52]. A second important consideration is that we're often interested in measuring observables with support on more than one qubit. We discuss some options for performing such measurements in Appendix A, Appendix B, and Appendix C, but these options come with their own overheads in gate complexity or the number of measurements repetitions. Thirdly, and perhaps most importantly, the assumption that we're able to measure expectation values with respect to ρ M /Tr(ρ M ) is violated when the individual copies are not in the same state or when errors occur during execution of the measurements. We discuss some aspects of the breakdown of this assumption in Appendix H. In general, we do not expect virtual distillation to be able to correct errors during the measurement process but we note that we saw a reasonable level of robustness to such errors in our numerical simulations (see Figure 7).
Even as we begin to leave the NISQ era behind and approach devices that start to incorporate quantum error correction, in the early days the desire to use as many logical qubits as possible means we may perform some computations that still have an appreciable number of logical errors. Given that our technique can provide a substantial improvement in error at negligible overhead compared to traditional quantum error correction techniques, there may be some advantageous interplay between the two, where small distance codes are used in conjunction with this technique before more qubits are available. We explore this opportunity in more detail in Appendix G.
Our technique builds upon a long tradition of work that uses the symmetric group for stabilizing quantum computations and mitigating errors. We believe that this research direction continues to hold promise, and we identify a few directions that we find particularly intriguing. Virtual distillation is based on a simple collective measurement of M copies of ρ. In Section II B, we show that there exists more sophisticated collective measurements whose sample complexity improves quadratically upon our approach in some regimes. It would be interesting to investigate this further, both from a fundamental perspective, and with an eye towards practical implementation. Besides this potential improvement, another clear question arises from our work. The drift in the dominant eigenvector of the density matrix is a coherent error. As with the coherent errors that occur directly in the execution of circuits on a NISQ device, we are hopeful that variational optimization or other, complementary, error mitigation techniques may prove useful in addressing this additional source of coherent error. Studying this in the context of the real noise experienced in hardware will be especially important and illuminating.
In the process of preparing this work, a related paper, Ref. 53 appeared in the literature. Our work highlights different conclusions than theirs in error scaling due to our explicit consideration of errors that induce a drift in the dominant eigenvector of the density matrix. a circuit to localize an observable of interest to a single qubit before performing virtual distillation. Here we present an alternative solution that doesn't require the use of ancilla-assisted measurement or circuit depth.
The challenge arises due to the use of Eq. 8, in particular, the choice to use the symmetrized version of an observable, a notion defined in Eq. 6. Using the symmetrized version of a multi-qubit observable means that it is not possible to perform the required diagonalization using a tensor product of separate unitaries across each pair of qubits. As an example, we consider the operator O = Z i Z j . Our arguments hold equally well for any other operator composed of a tensor product of (more than one) single-qubit Pauli operators. Taking the product of the symmetrized observable and the swap operator yields This operator does not factorize into a tensor product of operators with support on the individual pairs of qubits, nor can it be diagonalized by an operator that factors this way. However, instead of using Eq. 8 to determine the corrected expectation value of O, we can instead use the nonsymmetrized form introduced in Eq. 2. Returning to our example where O = Z i Z j , we see that we need to estimate the numerator and denominator of Unlike the symmetrized observable of Eq. A1, the operator Z 1 i Z 1 j S (2) factorizes into a tensor product over the N pairs of qubits (a pair being one qubit from the first system and the corresponding qubit from the second system).
is not Hermitian, but because it is unitary, we can still estimate Tr(Z 1 i Z 1 j S (2) ρ ⊗2 ) by applying a circuit to diagonalize it and measuring in the computational basis. As Z 1 i Z 1 j S (2) factorizes into a product of two-qubit operators, the circuit that diagonalizes it does as well.
Note that because Z 1 i Z 1 j does not commute with S (2) , we will be unable to simultaneously estimate the numerator and denominator of Eq. A2. We will also be unable to simultaneously measure the corrected expectation value corresponding to different choices of i and j. More generally, we are able to measure any single tensor product of one-qubit operators at a time, regardless of the number of qubits it acts on. The details of the diagonalization will, of course, depend upon the operator to be measured. We do not carefully analyze the number of measurements required by this flavor of virtual distillation, but we note that the inability to parallelize the measurement of commuting multi-qubit observables would make it challenging to profitably combine this approach with sophisticated NISQ measurement strategies, such as the one presented in Ref. 17. In particular, individual multi-qubit operator must be measured separately using this approach, even if they commute. This fact increases the overall number of circuit repetitions required for many applications. In Section II A and Appendix A we described protocols for measuring the expectation value of O with respect to the state ρ 2 Tr(ρ 2 ) by diagonalizing S (2) and either O 1 S (2) or O (2) S (2) . Here we describe how these approaches can be generalized to higher powers of ρ in a natural way. Like the swap operator, the cyclic shift operator between M N -qubit systems, S (M ) , factorizes into a tensor product of N M -qubit gates. Specifically, it factorizes into the tensor product of N single-qubit cyclic shift operators. The symmetrized operator Z The same concerns about correcting the expectation values of multi-qubit observables that we discussed in Appendix A for the two-copy (M = 2) case apply to this generalized proposal. The tools developed so far allow us to simultaneously estimate Tr(Z k ρ M ) for all values of m and also Tr(ρ M ). If we are interested in reconstructing Tr(P ρ M ) for some multi-qubit Pauli operator P , we can do so using a generalization of Eq. A2, but we would be limited to measuring the operators required for one particular P at a time.
For the specific case of M = 3, we have numerically optimized the quantum circuit of Figure 11 to simultaneously diagonalize Z k . We obtained parameters for the four two-qubit gates that allow for an error (measured in the Frobenius norm of the difference between the exact and approximate matrices) of approximately 5E − 5 when k . The two-qubit gates parameterized by the θis are arbitrary two-qubit gates. We performed the numerical optimization using the Julia language.
the following equations are used, In this section, we present the approach one may take if ancilla-assisted measurement is feasible in the experimental setup. This is a simplified version of the proposal for ancilla-assisted measurement protocol found in Ref. 35 for estimating the expectation value of an observable O with respect to the state ρ 2 /Tr(ρ 2 ). As with the method we discussed in Section II A, we do this by approximating the numerator and denominator of Eq. 8. Unlike that method, this approach uses a non-destructive measurement of the swap operator (S (2) ). The main reward for this added complexity is that this variant of virtual distillation doesn't restrict the form of the operators being measured, nor does it prevent simultaneous measurement of operators acting on overlapping subsets of qubits. Therefore, it is compatible with some of the recently developed techniques for efficiently measuring a large collection of commuting operators [17,54,55]. While we focus on the M = 2 copy version here, we also briefly discuss the generalization to M ≥ 3 copies.
To use this method, we begin with two system registers, each in the state ρ, as well as an ancilla qubit in the |0 state. We then perform a non-destructive measurement of S (2) in the standard way, using the so-called swap or Hadamard test [25,26,56]. Specifically, we apply a Hadamard gate to the ancilla qubit, apply S (2) conditioned on the ancilla qubit being in the |1 state, and measure the ancilla qubit in the X basis. The expectation value of X on the ancilla qubit is then equal to S (2) . Because S (2) factorizes into a tensor product of two-qubit swap gates, its controlled version likewise factorizes into a series of N Fredkin (controlled-swap) gates. Compiling this circuit may necessitate some extra steps (such as expanding the single ancilla qubit into a GHZ state using a series of CNOT gates) in order to deal with the restricted connectivity of a near-term device.
It isn't technically necessary, but it simplifies the analysis and reduces the variance of the resulting estimator to focus on the symmetrized form of O, O (2) = 1 2 (O 1 +O 2 ). As in Section II A, this is beneficial because the symmetrized observable O (2) commutes with S (2) . We can therefore measure the product O (2) S (2) by first measuring S (2) using the Hadamard test described above and then measuring O (2) on the system registers. This protocol does not require a separate estimation of Tr(ρO) like the original proposal of Ref. 35. Furthermore, it allows us to make us to make use of measurements of O on both copies of ρ and also simultaneously estimate the numerator and denominator of Eq. 8, leading to a relatively sample-efficient scheme. Now let us consider the case with three or more copies of ρ. S (N ) is not Hermitian for N > 2 but the natural generalization to the above strategy still works as expected. Specifically, we can use a controlled version of the cyclic shift operator, S (N ) , to sample an observable whose expectation value is equal to Re(Tr(S (N ) ρ ⊗N )) [26,27,35].
Because the symmetrized observable O (N ) commutes with S (N ) , it also commutes with the observable measured by this generalization of the swap test. Therefore, we can sample from an observable whose expectation value is Tr(S (N ) O (N ) ρ ⊗N ) by first performing the higher-order swap test and then a measurement of O (N ) .

Appendix D: Variance of the Corrected Expectation Value Estimator
In this section, we calculate the variance of the estimator for the corrected expectation value obtained by applying Eq. 8 with M = 2. Specifically, we consider the estimation of the expectation value of an observable O with respect to the state ρ 2 Tr(ρ 2 ) constructed by repeatedly measuring the operators S (2) O (2) and S (2) , We assume that both operators are simultaneously measured by averaging over R repetitions of state-preparation and measurement. This assumption applies to the measurement by diagonalzation method presented in Section II A of the main text and also to the ancilla-assisted measurement approach of Appendix C. The outcomes obtained from measurements of these operators are classical random variables, and we can proceed by determining the variance of these two random variables and their covariance. We begin by calculating the variance of the numerator (with respect to the state ρ ⊗2 ).
The variance of the random variable in the denominator follows by taking O = I, We'll also need the covariance between the random variables representing measurements of the operators which estimate the numerator and the denominator.
There isn't a closed-form expression for the variance of the ratio of two random variables [57], but we can take the standard approximation based on a Taylor series expansion, We estimate the expectation values for the numerator and denominator of Eq. D1 by averaging over a series of R experiments. This scales the variances calculated above by a factor of 1 R . If R is sufficiently large, then the approximation presented in Eq. D11 will be a good one. Applying this expression to determine the variance of the estimator from Eq. D1 yields Var(Estimator) ≈ 1 R In Section II B, we claimed that there exists a joint measurement on 2K copies of ρ that allows us to estimate Tr(Oρ 2 ) with a lower variance than performing K copies of the basic virtual distillation procedure in parallel. Specifically, we claimed that the operatorÕ (whose definition we reproduce below) exhibits a lower variance than the simple alternative under certain conditions. In this appendix, we prove this claim. First, we recall the definition, where we S (i,j) denotes the swap operator between subsystems i and j and O is an arbitrary Pauli operator. Linearity ensures that a calculation of the expectation value of this operator can be reduced to the virtual distillation procedure applied to two copies, yielding Tr(Õρ ⊗2K ) = Tr(Oρ 2 ).
We now bound the variance of measurements of this operator with respect to the state ρ ⊗2K ; Note that we have made use of Eq. E2 to replace O 2 by Tr(Oρ 2 ) 2 . We proceed by breaking the summation up into three cases. In the first case, i = a and j = b. In the second case, there are only three distinct values amongst the indices i, j, a, b. In the fourth case, all four of the indices take distinct values.
Consider the first case where i = a and j = b. Then we can simplify and bound the sum, Here we have used the properties that S Now we treat the case where the indices take three distinct values. Actually, there are four sub-cases here. We could have any one of the four possibilities, i = a, i = b, j = a, or j = b. We work out the details for the i = a case below, noting that the others behave symmetrically.
Here we have used the property that O i is self-inverse. Now we note that the product S (i,j) S (i,b) is a cyclic shift between the subsystems i, j, b and that this product commutes with the operator Computation using a tensor network diagram (see Figure 12) establishes that In order to bound this quantity, let us denote the projector onto the +1 eigenspace of O by P + and the projector onto the −1 eigenspace by P − . Then we can expand Eq. E15 in terms of these projectors, yielding For simplicity, let us define the indicator function Now we can combine the bounds from the three different cases and simplify the expression for the variance to yield Note that we have simplified by subtracting the Tr(Oρ 2 ) 2 term that arose from evaluating O 2 .

Appendix F: Details Regarding the Numerical Experiments with Scrambling Circuits
In Section IV A we briefly described the random circuits that we simulated to produce Figure 4, Figure 5, and Figure 6. Here we expand upon that description.
The first class of circuits are essentially the one-dimensional analogues of the random circuits of Ref. 44. They are constructed by alternating between layers of two-qubit gates and single-qubit gates. The two-qubit gate layers consist of 'Sycamore gates,' two-qubit gates that enact the unitary The two-qubit gate layers themselves alternate between layers that have Sycamore gates on every even-odd pair and every odd-even pair. The ensemble of random circuits is defined by adding a layer of randomly chosen single-qubit gates between every layer of two-qubit gates in this fixed structure. These single-qubit gates are drawn from the set X, Y, Z, We can now write an expression for the density matrix at the end of the computation, A strict analysis would consider that we need to round these to integer distances, but for this approximate analysis, this should suffice. If we consider the ratio between the implied error rates c s = (1 − f 1 )/(1 − f 2 ), we can find the required constant factor for a given number of qubits per logical qubit and number of gates to make using virtual distillation advantageous. Past a certain number of qubits, this constant factor is enormous, but we find that up to distance 10 − 15 the empirical improvements measured in the text are sufficient to justify the use of virtual distillation in place of additional qubit protection. In particular, at distance 10 with n = 200 physical qubits per logical qubit, performing G = 1000 gates on the logical qubit, the respective error rates are about are 10 −1 and 10 −5 , and hence a constant improvement is about of about 10 4 is sufficient to justify the use of virtual distillation, which is on par with some improvements seen in the main text. To be fair, one might argue that an overall error rate of 10 −5 would already suffice, and by a distance of 15, the required improvement is on the order of 10 7 which is at the upper limit of what we imagine can be achieved with this technique. At smaller distances and numbers of gates, the required constant factors decrease as well. If we assume that we will consistently push the limits of the number of logical qubits we use, reducing the number of physical qubits per logical qubit available, this may imply a regime in early fault tolerance where this technique is applicable. Further studies will be required to identify precisely under what conditions this may be the case. Figure 13. The error in the unmitigated noisy states, and the states accessed by virtual distillation, for the same 10 qubit Heisenberg evolution under two different noise models. We consider a bit-flip error model (ρ bit , teal curve) and phase-flip error model (ρ phase , orange dashed curve). We show the error for virtual distillation applied in the usual way to two identical copies of each noisy state (ρ 2 bit and ρ 2 phase , purple dotted curve and pink dashed curve). We also consider the error when virtual distillation is applied to one copy of each state state (ρ bit ρ phase , green dotted curve). We quantify the error in terms of the trace distance to the state obtained from noiseless evolution as a function of the expected number of single-qubit gate errors. The expected number of errors is varied by changing the error rate per-gate, fixing the number of two-qubit gates to be 450. Ultimately, we find that the performance of virtual distillation is barely affected when the two input states are generated with different noise processes.
Our dephasing channel is parameterized by γ 2 , which represents the probability that an unintended interaction between a qubit and its environment entangles the two, effectively performing a measurement in the computational basis. One standard way of expressing this channel in terms of Kraus operators is given below, It can also be convenient to use the equivalent representation, The parameters γ 1 and γ 2 can be related to the related to the T 1 and T 2 times frequently used to characterize relaxation in two-level systems [61,62]. In our simulations, we consider λ 1 = λ 2 , a choice with a physical model where the T 1 and T 2 times are comparable. This allows us to plot the error (quantified by the trace distance to the state that would be obtained in the absence of noise) as a function of the expected number of error events. The expected number of errors (E) can be expressed in terms of the number of two-qubit gates in the ciruit (G) and the two error probabilities, E = 2G(λ 1 + λ 2 ). (I4) In Figure 14 and Figure 15 we present the results of two sets of simulations under this error model. Figure 14 closely follows Figure 6 from the main text, examining the performance of virtual distillation applied to a random circuit on a one-dimensional array of qubits (see Section IV A). Likewise, Figure 15 considers the same Heisenberg evolution treated in Figure 8 and described in Section IV B. In both cases, we show the error in the unmitigated noisy state, the states accessed by virtual distillation with M = 2 and M = 3 copies, and the dominant eigenvector of the noisy density matrix. We quantify the error in terms of the trace distance to the ideal state that would be obtained in the absence of noise, plotting this trace distance as a function of the expected number of errors.
Comparing Figure 14 and Figure 15 with Figure 6 and Figure 8 from the main text, we note a few things. First of all, the qualitative behaviour of virtual distillation is largely unaffected by the change in the noise model. When the expected number of errors is small enough we still see that virtual distillation decreases the error by orders of magnitude. The performance is still limited by the drift in the dominant eigenvector and we find that in many regimes M = 2 copies is sufficient to maximize the potential benefit. The largest differences are observed when one compares the two figures that analyze the performance of virtual distillation on random circuits ( Figure 14 and Figure 6). In this case, the potential benefit of virtual distillation is smaller (roughly two orders of magnitude instead of three for the ten qubit system) under the noise model based on amplitude damping and dephasing. Additionally, and perhaps more interestingly, we see that the dependence on system size mostly vanishes for the random circuits when we switch to the noise model considered in this appendix, while it persists when we consider the Heisenberg evolution. We plot the trace distance to the state obtained from noiseless evolution, as a function of the expected number of single-qubit amplitude damping and dephasing errors, for 6 and 10 qubit systems (demarcated by the thickness of the symbols). We vary the expected number of errors by varying the error rate per-gate, fixing the number of two-qubit gates to be 450. This figure is constructed to parallel Figure 6, except that we consider an error model based on single-qubit amplitude damping and dephasing rather than a depolarizing channel. We see that the system size dependence vanishes at low error rates in this case, in contrast with the data from Figure 6. . We plot the trace distance to the state obtained from noiseless evolution as a function of the expected number of single-qubit amplitude damping and dephasing errors. The expected number of errors is varied by changing the error rate per-gate, fixing the number of two-qubit gates to be 450. We show this data for 6 and 10 qubit systems (differentiated by the size of the markers). This figure mirrors Figure 8, except that we consider an error model based on single-qubit amplitude damping and dephasing rather than depolarizing noise. The two figures display similar behaviour, indicating that our conclusions about virtual distillation have some robustness to changes in the noise model.