Scalable mitigation of measurement errors on quantum computers

We present a method for mitigating measurement errors on quantum computing platforms that does not form the full assignment matrix, or its inverse, and instead works in a subspace defined by the noisy input bit-strings. The method accommodates both uncorrelated and correlated errors, and allows for computing accurate error bounds. Additionally, we detail a matrix-free preconditioned iterative solution method that converges in $\mathcal{O}(1)$ steps, is performant, and uses orders of magnitude less memory than direct factorization. We demonstrate the validity of our method, and mitigate errors in a few seconds on numbers of qubits that would otherwise be intractable.

In the canonical situation where initialization noise is minimal, measurement errors over $N$ qubits can be treated classically and satisfy

$$\vec{p}_{\rm noisy} = A\,\vec{p}_{\rm ideal}, \qquad (1)$$

where $\vec{p}_{\rm noisy}$ is a vector of noisy probabilities returned by the quantum system, $\vec{p}_{\rm ideal}$ is the vector of probabilities in the absence of measurement errors (but still including e.g. gate errors), and $A$ is the $2^N \times 2^N$ complete assignment matrix ($A$-matrix) whose element $A_{\rm row,col}$ is the probability of bit-string col being converted to bit-string row by the measurement error process [see App. (A) for examples].
While computing $A$ exactly requires executing $2^N$ circuits, it is often the case that errors on multiple qubits can be well approximated using at most $O(N)$ calibration circuits; the $A$-matrix can be approximated efficiently. Equation (1) has a solution $\vec{p}_{\rm ideal}$ readily found using direct LU-factorization. However, due to finite sampling, direct methods necessarily generate quasi-probability distributions that contain negative values, but still sum to one, and are thus incompatible with the requirement that $\vec{p}_{\rm ideal}$ be a probability vector. Consequently, a bounded-minimization approach solving $||A\vec{p}_{\rm ideal} - \vec{p}_{\rm noisy}||_2^2$, where $\vec{p}_{\rm ideal}$ is constrained to be positive, is often used in place of a direct solution [12-14, 16, 28]. Although physically appealing, the run times of these methods are orders of magnitude longer than those of direct techniques. Alternatively, it has been shown that quasi-probabilities can be used provided that one mitigates expectation values [4,11,29]. As proven in Ref. [11], these quasi-probabilities provide an unbiased estimate for the expectation value $\xi$ of an operator $O$, with a spectral radius of one, that is diagonal in the computational basis:

$$\xi = \sum_{x \in \{0,1\}^N} O(x)\, \vec{p}_{\rm ideal}(x). \qquad (2)$$

Near-term algorithms such as the ubiquitous Variational Quantum Eigensolver (VQE) [20,30] and quantum machine learning [22,27] rely on the computation of expectation values, making the correction of measurement errors in these quantities an important step along the road to quantum advantage.
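As a concrete illustration of this expectation-value mitigation, the estimate $\xi$ can be computed directly from a quasi-probability vector. A minimal sketch, assuming a hypothetical two-qubit quasi-probability distribution and $O = Z \otimes Z$, whose diagonal is $(+1,-1,-1,+1)$:

```python
import numpy as np

def expval_diagonal(quasi_probs, diag):
    """Estimate <O> = sum_x O(x) p(x) for an operator diagonal in the
    computational basis; quasi_probs may contain negative entries but
    should still sum to one."""
    return float(np.dot(diag, quasi_probs))

# Hypothetical two-qubit data: diagonal of Z (x) Z over 00, 01, 10, 11
zz_diag = np.array([1.0, -1.0, -1.0, 1.0])
# Hypothetical mitigated quasi-probabilities (sum to one, some negative)
quasi = np.array([0.52, -0.02, -0.01, 0.51])
print(expval_diagonal(quasi, zz_diag))  # → 1.06
```

Note that because the quasi-probabilities carry negative weight, the mitigated estimate is unbiased only for expectation values, not for individual probabilities.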
Current measurement mitigation techniques utilize the full $2^N$-dimensional probability space, and thus do not scale beyond a handful of qubits. A truncation scheme was developed in [23], however it came at the loss of measurement information, and still required explicit construction of the full $A$-matrix. Creating a scalable mitigation strategy requires reducing the dimensionality of the linear system in Eq. (1) without the need for computing $A$ itself. Fortunately, present day cloud-accessible quantum computing systems have measurement error rates of a few percent or less, see Table I. To good approximation, the solution is contained within $\vec{p}_{\rm noisy}$, and we can mitigate errors in a re-normalized subspace defined by these bit-strings. In the worst case, this subspace dimension is equal to the number of times the input circuit is sampled. For cloud-accessible quantum computers, the number of times a circuit can be sampled is limited, typically 8192 times on IBM Quantum systems, and therefore the dimensionality of the corresponding reduced assignment matrix $\tilde{A}$ can be markedly smaller than the full $A$-matrix for $N$ qubits. In practice $\tilde{A}$ is often small enough that the solution is amenable to standard LU-factorization, returning a vector of quasi-probabilities for use in a reduced version of Eq. (2). However, for situations where explicitly forming $\tilde{A}$ is still prohibitive due to large numbers of unique samples, it is possible to use preconditioned matrix-free iterative linear solution methods. Such methods have also been explored in numerical solutions of large-scale steady-state density matrices [36]. In practice this gives quick convergence, typically in $\mathcal{O}(1)$ steps, and is competitive with direct solution run times while requiring orders of magnitude less memory. Although the methods introduced here return quasi-probabilities, it is possible to find the nearest probability distribution, in terms of the L2-norm, in linear time [37].
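The nearest-distribution step can be sketched with the standard sort-and-truncate projection onto the probability simplex; the sort makes this sketch $O(n \log n)$, dropping to linear time when the entries arrive pre-sorted. This is an illustrative sketch, not the reference implementation of [37]:

```python
import numpy as np

def nearest_prob_dist(quasi):
    """Project a quasi-probability vector (entries sum to one) onto the
    probability simplex, minimizing the L2 distance."""
    v = np.asarray(quasi, dtype=float)
    order = np.argsort(v)[::-1]      # indices sorted by value, descending
    sorted_v = v[order]
    out_sorted = sorted_v.copy()
    acc = 0.0                        # accumulated negative mass
    for i in range(len(v) - 1, -1, -1):
        # If spreading the accumulated deficit would push this entry
        # below zero, zero it out and carry its value forward.
        if sorted_v[i] + acc / (i + 1) < 0:
            acc += sorted_v[i]
            out_sorted[i] = 0.0
        else:
            out_sorted[: i + 1] = sorted_v[: i + 1] + acc / (i + 1)
            break
    out = np.empty_like(v)
    out[order] = out_sorted          # undo the sort
    return out

print(nearest_prob_dist([0.7, 0.4, -0.1]))  # → [0.65 0.35 0.  ]
```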
In this paper, we detail this efficient mitigation method beginning with Sec. (II), which motivates the subspace reduction procedure and describes how it is performed. Section (III) shows how preconditioned matrix-free methods can be utilized for a performant and memory-efficient solution technique. In Sec. (IV) we show that one can obtain bounds on the variance of the computed expectation values in a similarly efficient manner with an overhead of $\mathcal{O}(1)$ in terms of additional run time. We demonstrate our technique in Sec. (V), showing the validity of our method, and mitigating readout errors out to numbers of qubits that would be intractable on even the largest of supercomputers using previous methods.

II. SUBSPACE REDUCTION
We aim to construct $\tilde{A}$ without necessarily forming the full $A$-matrix. To this end we look to compute elements $A_{\rm row,col}$ directly from the bit-string values of the input noisy counts and a small set of calibration data matrices. For concreteness, we study the case of tensored measurement errors, as IBM Quantum systems are calibrated for high-fidelity quantum non-demolition (QND) measurements where uncorrelated errors are nominally dominant. Other quantum hardware vendors also report the same [26]. The full-dimensional tensored $A$-matrix $A^{(T)}$ over $N$ qubits can be constructed from $N$ $2 \times 2$ calibration matrices, $A^{(T)} = S_{N-1} \otimes \dots \otimes S_1 \otimes S_0$, where $S_k$ is the calibration matrix for the $k$-th qubit with the form

$$S_k = \begin{pmatrix} P^k_{0,0} & P^k_{0,1} \\ P^k_{1,0} & P^k_{1,1} \end{pmatrix}, \qquad (3)$$

where $P^k_{i,j}$ is the probability of the $k$-th qubit being in state $j \in \{0,1\}$ and measured in state $i \in \{0,1\}$. Here we use the convention that qubit 0 corresponds to the least-significant bit. For two bit-strings ${\rm row, col} \in \{0,1\}^N$, the matrix element $A^{(T)}_{\rm row,col}$ can be computed using

$$A^{(T)}_{\rm row,col} = \prod_{k=0}^{N-1} P^k_{{\rm row}_k,\,{\rm col}_k}, \qquad (4)$$

where ${\rm row}_k$ and ${\rm col}_k$ denote the $k$-th bit of each string. Therefore it is possible to compute individual matrix elements directly from bit-string values and a number of calibration matrices that scales at most linearly with the number of qubits. Accommodating correlated errors in our method simply amounts to finding an equivalent expression to Eq. (4). We give an example in App. (B), where we intentionally induce correlated errors into the measurement process. As a corollary of grabbing elements individually, it is possible to select only those elements within a given Hamming distance, $d({\rm row,col}) \le D$, where $D$ is the desired maximum distance. This allows for varying the sparsity of $\tilde{A}$, and examining the effect of low-distance approximations.
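Eq. (4) translates directly into code; a minimal sketch, with hypothetical single-qubit calibration matrices standing in for measured data:

```python
import numpy as np

def tensored_element(S, row, col):
    """A^(T)_{row,col} = prod_k P^k_{row_k, col_k}, Eq. (4).

    S: list of N single-qubit 2x2 calibration matrices (qubit 0 is the
    least-significant bit). row, col: N-character bit-strings written
    most-significant bit first."""
    N = len(S)
    elem = 1.0
    for k in range(N):
        r_k = int(row[N - 1 - k])   # k-th bit of row
        c_k = int(col[N - 1 - k])   # k-th bit of col
        elem *= S[k][r_k, c_k]
    return elem

# Hypothetical calibration matrices for two qubits
S0 = np.array([[0.98, 0.03], [0.02, 0.97]])  # qubit 0
S1 = np.array([[0.99, 0.05], [0.01, 0.95]])  # qubit 1
print(tensored_element([S0, S1], "01", "00"))  # P^1_{0,0} * P^0_{1,0} → 0.0198
```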
$\tilde{A}$ is defined to be the assignment matrix over only those bit-strings observed in $\vec{p}_{\rm noisy}$. To understand the validity of this reduction, consider the simulation presented in Fig. (1a), where we compare the distributions for $\vec{p}_{\rm ideal}$ and $\vec{p}_{\rm noisy}$. In effect, measurement errors take blocks of probability, as defined by the finite number of samples, from $\vec{p}_{\rm ideal}$ and redistribute them amongst other bit-strings to get $\vec{p}_{\rm noisy}$. This redistribution is predominantly to those bit-strings that are close in Hamming distance. Importantly, as seen in Fig. (1a), when measurement errors are small, and a sufficient number of circuit samples has been performed, it is unlikely that a given bit-string in $\vec{p}_{\rm ideal}$ is completely redistributed; the bit-strings in $\vec{p}_{\rm ideal}$ are also contained in $\vec{p}_{\rm noisy}$. Therefore, mitigating measurement errors requires only elements $A_{\rm row,col}$ that correspond to transitions between bit-strings in $\vec{p}_{\rm noisy}$. An example is shown in Fig. (1b). Because we are grabbing only select elements of the full $A$-matrix, the columns of $\tilde{A}$ must be renormalized such that they once again sum to one. That is to say, given any two bit-strings ${\rm row, col} \in \vec{p}_{\rm noisy}$, the reduced matrix element $\tilde{A}_{\rm row,col}$ is given by

$$\tilde{A}_{\rm row,col} = \frac{A_{\rm row,col}}{\sum_{\substack{{\rm row}' \in \vec{p}_{\rm noisy} \\ d({\rm row}',{\rm col}) \le D}} A_{{\rm row}',{\rm col}}}, \qquad (5)$$

where $D$ is the desired Hamming distance [38].

FIG. 1. a) Simulated discrete probability distribution for 30 samples showing the ideal distribution for a Greenberger-Horne-Zeilinger (GHZ) state subject to gate errors only, as well as noisy data affected by both gate and measurement errors. Observed bit-strings are in bold. The same random number seed was used for both simulations such that the difference between ideal and noisy distributions is solely due to measurement errors. b) Elements of the assignment matrix $A$ that are used for constructing the reduced assignment matrix $\tilde{A}$ that corresponds to the noisy distribution from part (a). Columns of $A$ must be renormalized such that they again sum to one, Eq. (5).

Performing a finite number of circuit executions, even when
measurement errors are weak, may result in completely redistributing probability away from some small-magnitude elements in $\vec{p}_{\rm ideal}$ such that those elements are not present in $\vec{p}_{\rm noisy}$; the solution vector will be missing these elements. However, as we will show, for typical numbers of circuit samples this effect is minimal.
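Combining Eqs. (4) and (5), the reduced matrix can be assembled from the observed bit-strings alone; a sketch for the tensored model with hypothetical calibration data:

```python
import numpy as np

def reduced_A(S, bitstrings, D=None):
    """Build the reduced assignment matrix over the observed bit-strings,
    keeping only elements within Hamming distance D and renormalizing
    each column to sum to one, Eq. (5).

    S: list of N 2x2 single-qubit calibration matrices (tensored model).
    bitstrings: unique bit-strings observed in the noisy distribution."""
    N, dim = len(S), len(bitstrings)
    A = np.zeros((dim, dim))
    for j, col in enumerate(bitstrings):
        for i, row in enumerate(bitstrings):
            if D is not None and sum(a != b for a, b in zip(row, col)) > D:
                continue  # outside the Hamming-distance cutoff
            elem = 1.0
            for k in range(N):  # Eq. (4), qubit 0 = least-significant bit
                elem *= S[k][int(row[N - 1 - k]), int(col[N - 1 - k])]
            A[i, j] = elem
    return A / A.sum(axis=0)  # renormalize columns over the subspace

S0 = np.array([[0.98, 0.03], [0.02, 0.97]])
A_tilde = reduced_A([S0, S0], ["00", "01", "11"], D=2)
print(A_tilde.sum(axis=0))  # each column sums to one
```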

III. MATRIX-FREE SOLUTION
Although $\tilde{A}$ is much smaller than the original, when sampling circuits with wide probability distributions many times, or executing on systems with appreciable error rates, it is possible that $\tilde{A}$ itself may become too costly to explicitly construct. Fortunately, being able to grab elements of $A$ individually, Eq. (4), allows us to take advantage of matrix-free iterative techniques [39]. The time to solution for iterative methods greatly depends on the properties of $\tilde{A}$. $A$-matrices obtained from present day cloud-accessible platforms nominally have strict diagonal dominance, $|\tilde{A}_{\rm row,row}| > \sum_{\rm col \neq row} |\tilde{A}_{\rm row,col}|\ \forall\,{\rm row}$, and are readily solved by simple iterative methods such as Jacobi iteration [39]. However, this condition does not hold in general for systems with large error rates. Moreover, as the number of qubits grows, so does the number of possible error channels (i.e. the number of states at low Hamming distance), and it becomes harder to satisfy this stringent condition. Thus general purpose methods such as the generalized minimal residual (GMRES) [40] or biconjugate gradient stabilized (BiCGSTAB) [41] methods must be used. Importantly, these Krylov subspace methods require computing only matrix-vector products with $\tilde{A}$, but never $\tilde{A}$ itself [39,42]. However, having strict diagonal dominance, or close to it, suggests that we can increase the rate of convergence by using a simple Jacobi preconditioner $P^{-1}$ to solve

$$P^{-1}\tilde{A}\,\vec{p}_{\rm ideal} = P^{-1}\vec{p}_{\rm noisy}, \qquad (6)$$

where $P^{-1}$ is a diagonal matrix with $P^{-1}_{i,i} = 1/\tilde{A}_{i,i}$ [39]. In practice, Eq. (6) gives rapid convergence, requiring only $\mathcal{O}(1)$ iterations for an absolute tolerance value of $10^{-5}$, while simultaneously dramatically reducing the memory requirements for mitigation.
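A sketch of the matrix-free solve using SciPy's GMRES, with a LinearOperator that generates tensored elements on the fly and a Jacobi preconditioner as in Eq. (6). The calibration data and noisy distribution are hypothetical, and column renormalization is omitted since all bit-strings are retained in this toy example:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

# Hypothetical tensored calibration data for two qubits
S = [np.array([[0.98, 0.03], [0.02, 0.97]])] * 2
strings = ["00", "01", "10", "11"]
N, dim = len(S), len(strings)

def element(row, col):
    """Tensored matrix element, Eq. (4); no renormalization is needed
    here since all 2^N bit-strings are kept."""
    e = 1.0
    for k in range(N):
        e *= S[k][int(row[N - 1 - k]), int(col[N - 1 - k])]
    return e

# Matrix-free action x -> A x: elements are generated on demand
A_op = LinearOperator((dim, dim), matvec=lambda x: np.array(
    [sum(element(r, c) * x[j] for j, c in enumerate(strings))
     for r in strings]))

# Jacobi preconditioner, Eq. (6): 1/A_ii on the diagonal
d = np.array([element(s, s) for s in strings])
P_inv = LinearOperator((dim, dim), matvec=lambda x: x / d)

p_noisy = np.array([0.46, 0.04, 0.03, 0.47])
p_ideal, info = gmres(A_op, p_noisy, M=P_inv, atol=1e-10)
print(p_ideal.sum())  # quasi-probabilities still sum to ~1
```

At production scale the double loop over bit-strings would be replaced by compiled element generation, but the LinearOperator pattern is unchanged: only the action of the matrix on a vector is ever required.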

IV. UNCERTAINTY ESTIMATES
Mitigating measurement errors does not come for free. Rather, it results in an increase in the uncertainty of repeated measurement outcomes that must be compensated for by increasing the number of times the circuit is sampled. This mitigation overhead $\mathcal{M}$ is determined by the one-norm of the inverse of the $A$-matrix, $\mathcal{M} = ||A^{-1}||_1^2$ [11], and gives an upper bound on the standard deviation of an observable, $\sigma_O \le \sqrt{\mathcal{M}/s}$, where $s$ is the number of samples. Not wanting to construct $\tilde{A}^{-1}$, here we use the iterative Hager-Higham algorithm [43,44] for estimating $||\tilde{A}^{-1}||_1$ using only linear systems of equations involving $\tilde{A}$ and $\tilde{A}^T$. When using direct factorization, the LU decomposition of $\tilde{A}$ can be cached, and thus the overall runtime is $\sim 2$x longer than mitigation alone. However, for iterative methods, the overhead is between 4-10x longer depending on how many steps the Hager-Higham routine requires. Although this method gives a lower bound on $\sqrt{\mathcal{M}}$, in practice it is often exact or nearly so (see related discussion in Ref. [44]). Because our truncation method selects only those rows and columns from $\vec{p}_{\rm noisy}$, the one-norm $||\tilde{A}^{-1}||_1$, and thus the mitigation overhead, is dependent on the circuit being executed and the noise properties of the device on which it is run.
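SciPy's `onenormest` implements the Hager-Higham estimator, so the overhead can be computed from cached LU solves without ever forming the inverse; a sketch with a hypothetical reduced assignment matrix:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve
from scipy.sparse.linalg import LinearOperator, onenormest

# Hypothetical small reduced assignment matrix (column-stochastic and
# diagonally dominant), standing in for the reduced A-matrix
A = np.array([[0.95, 0.02, 0.01, 0.01],
              [0.02, 0.96, 0.01, 0.01],
              [0.02, 0.01, 0.97, 0.01],
              [0.01, 0.01, 0.01, 0.97]])

# Cache the LU factorization once; the inverse is never formed.
lu, piv = lu_factor(A)
Ainv = LinearOperator(A.shape,
                      matvec=lambda x: lu_solve((lu, piv), x),
                      rmatvec=lambda x: lu_solve((lu, piv), x, trans=1))

est = onenormest(Ainv)          # Hager-Higham estimate of ||A^-1||_1
mitigation_overhead = est ** 2  # M = ||A^-1||_1^2
print(mitigation_overhead)
```

The `rmatvec` (transposed solves via `trans=1`) supplies the linear systems in the transpose of the matrix that the Hager-Higham iteration requires.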

V. DEMONSTRATIONS
Our method [45] is implemented with NumPy [46], SciPy [47], and Cython [48], and makes use of Qiskit [28] for calibration circuit construction and execution. All timing data is taken on a quad core Intel i3-10100 system with 32 GB of memory, with NumPy and SciPy compiled against OpenBLAS [49].
We begin by showcasing the accuracy of our method by comparing expectation values for the circuit shown in Fig. (2a) computed with the full-space tensored method from Qiskit Ignis [28] along with our method, called M3 [50], while varying the number of samples taken per circuit.
Here the circuits are executed on the 27 qubit IBM Quantum Kolkata system, mapping the virtual circuit qubits to physical qubits [1, 4, 7, 10, 12, 13, 14, 11, 8, 5, 3, 2]. The calibration data and raw input samples are identical for both mitigation methods. In Fig. (2b) we see that, despite using at most 371 bit-strings (9% of the full dimensionality), M3 closely matches the full-dimensional $A^{(T)}$ results over the entire range of samples, and agrees remarkably well with the Ignis results for the numbers of executions typically employed in practice, $> 1000$. An example $\tilde{A}^{(T)}$ generated by M3 is presented in App. (A 3). The sub-space reduction results in a run-time performance improvement as well, with Qiskit Ignis mitigation taking $\sim 3$ s per circuit, whereas M3 took at most $\sim 7$ ms. Additionally, we see that the uncertainty bound given by $\mathcal{M}$ closely matches the experimental values, and verifies the use of this technique in reporting faithful error bounds. The inset of Fig. (2b) also shows that, for a tolerance of $\le 10^{-5}$, $\tilde{A}^{(T)}$ is well-approximated by keeping only terms out to $D = 3$.
We now demonstrate the scalability of our M3 method by mitigating GHZ states out to 42 qubits on the 65 qubit IBM Quantum Brooklyn system. Details of this experiment are in App. (C). In Fig. (3) we compare M3 with the Qiskit Ignis tensored and bounded least-squares methods [51]. Only M3 allows for mitigating errors beyond 14 qubits, due to algorithmic breakdown or extreme run times for the tensored and least-squares methods, respectively, and this shows the importance of performing measurement mitigation for large-scale experiments. The overall expectation values drop as the circuit depth increases, where gate errors and decoherence, effects measurement mitigation cannot resolve, start to dominate.
The mitigation overhead, inset of Fig. (3a), shows exponential scaling at small numbers of qubits, after which the overhead begins to plateau. The exponential scaling arises because the diagonal elements of $\tilde{A}$, formed from the product of $N$ probabilities, Eq. (4), are effectively being inverted when computing $\tilde{A}^{-1}$, with additional contributions coming from elements close in Hamming distance. Provided that the bit-strings in $\vec{p}_{\rm noisy}$ sample sufficient portions of these short Hamming distance elements, the re-normalization used in obtaining $\tilde{A}$ is small, and one recovers the exponential scaling shown for the full-dimensional $A$-matrix [11]. However, if this is not the case, then re-normalization increases the magnitude of the elements in $\tilde{A}$ (in particular the diagonal elements), suppressing the exponential growth in $\mathcal{M}$. Figure (3b) details the timing across the different mitigation methods. We see that M3 greatly improves the computed expectation values while taking at most 1.2 seconds to compute at 42 qubits. When computing the mitigation overhead, the total time increased to 2.4 and 4.5 seconds for the direct and iterative solutions at 42 qubits, respectively (not shown); an extraordinary improvement upon the exponential run times observed for the Qiskit methods. As with the example in Fig. (2), a $D = 3$ Hamming approximation well captures the full mitigation process to the desired tolerance, and performs best at large numbers of bit-strings, where the overhead from computing the Hamming distance between elements starts to become less than the cost of additional floating-point multiplications. For the M3 results, the rate at which the run times increase begins to slow down at larger numbers of qubits, following a similar slow down in the number of unique bit-strings observed, Fig. (3a), when additional qubits are added.
Finally, we note that at 42 qubits, storing a full $2^N$-vector of single-precision floating-point values for $\vec{p}_{\rm noisy}$ requires 16 TiB of memory; well beyond the limits of the computer on which our mitigation is implemented, but amenable to storage on a supercomputer. Compare that with storing a sparse representation of $A$ using, for example, compressed sparse column (CSC) format out to $D = 3$. This requires $\sim 580$ PiB of memory, which is 120x more than that available in the Fugaku supercomputer [52] [see App. (C 2) for details]. In contrast, the M3 iterative method uses $\sim 1$ MiB of storage, highlighting the benefit of the techniques presented here for mitigating measurements at scales amenable to demonstrations of quantum advantage.
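The CSC storage estimate can be reproduced with a few lines of arithmetic; the per-column count follows from summing binomial coefficients out to distance $D = 3$:

```python
from math import comb

N, D = 42, 3
cols = 2 ** N
# Number of bit-strings within Hamming distance D of a given column
nnz_per_col = sum(comb(N, k) for k in range(D + 1))  # 12,384
nnz = nnz_per_col * cols

values = 4 * nnz            # single-precision values, 4 bytes each
row_idx = 8 * nnz           # 64-bit row indices (2^42 > 2^32)
col_ptr = 8 * (cols + 1)    # CSC column pointers
PiB = 2.0 ** 50
print(values / PiB, row_idx / PiB, (values + row_idx + col_ptr) / PiB)
# → 193.5 PiB of values, 387 PiB of row indices, ~580.5 PiB total
```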
Appendix A: Example A-matrices

Complete A-matrix
An example complete $A$-matrix, computed by running $2^N$ circuits, one for each computational basis state, on the IBM Quantum Kolkata system, is given in Fig. (4). Because of finite sampling, the matrix is nominally sparse, and only those elements close in Hamming distance have appreciable transition probabilities. The matrix has strict diagonal dominance and thus is guaranteed to be invertible.

Tensored A-matrix
The $A$-matrix corresponding to tensored measurement errors, $A^{(T)}$, is constructed by taking the tensor product of single-qubit calibration matrices given by Eq. (3) in the main text. Unlike the complete $A$-matrix, $A^{(T)}$ contains only non-zero elements, unless one or more qubits has no reported measurement error for some $P^k_{i,j}$. This matrix is also strictly diagonally dominant. For each circuit execution, the number of elements in $\tilde{A}^{(T)}$ may vary, as can their associated amplitudes.

Appendix B: Correlated errors
Although we have focused on tensored mitigation techniques, our method is equally capable of handling correlated errors provided that matrix elements $\tilde{A}_{\rm row,col}$ can be obtained from their bit-string values, as done in Eq. (4). IBM Quantum systems operating normally are dominated by uncorrelated measurement errors, and are thus well mitigated by the tensored $A$-matrix methods presented in the main text. It is possible, however, to purposely induce correlated errors into the readout process and explore correlated mitigation strategies. To take into account pairwise-correlated errors between qubits $k$ and $l$, we can modify Eq. (4), grabbing elements using

$$A_{\rm row,col} = \Big(\prod_{m \neq k,l} [S_m]_{q'_m,\,q_m}\Big)\, [C_{kl}]_{a_{kl},\,b_{kl}}, \qquad ({\rm B1})$$

where ${\rm row} = q'_{N-1} q'_{N-2} \dots q'_0$ and ${\rm col} = q_{N-1} q_{N-2} \dots q_0$. Here, the $S_m$ are defined as in Eq. (3), and $C_{kl}$ is a $4 \times 4$ stochastic matrix (local noise matrix) between qubits $k$ and $l$, where $a_{kl} = 2q'_k + q'_l$ and $b_{kl} = 2q_k + q_l$, respectively. The elements of $C_{kl}$ are obtained following Section V of Ref. [11].
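A sketch of evaluating a pairwise-correlated element of this form, with hypothetical calibration matrices; as a consistency check, choosing $C_{kl} = S_k \otimes S_l$ recovers the uncorrelated product of Eq. (4):

```python
import numpy as np

def correlated_element(S, C, k, l, row, col):
    """Assignment-matrix element with a 4x4 stochastic matrix C for the
    correlated pair (k, l) and tensored S_m for the remaining qubits,
    a sketch of Eq. (B1); qubit 0 is the least-significant bit."""
    N = len(S)
    rbit = lambda m: int(row[N - 1 - m])  # q'_m, bit of row
    cbit = lambda m: int(col[N - 1 - m])  # q_m,  bit of col
    # Correlated pair: indices a_kl = 2 q'_k + q'_l, b_kl = 2 q_k + q_l
    elem = C[2 * rbit(k) + rbit(l), 2 * cbit(k) + cbit(l)]
    for m in range(N):
        if m not in (k, l):
            elem *= S[m][rbit(m), cbit(m)]
    return elem

# Hypothetical single-qubit calibration matrices for three qubits
S0 = np.array([[0.98, 0.03], [0.02, 0.97]])
S1 = np.array([[0.99, 0.05], [0.01, 0.95]])
S2 = np.array([[0.97, 0.04], [0.03, 0.96]])
# With C = S2 (x) S0, the correlated pair (2, 0) factorizes again
C = np.kron(S2, S0)
print(correlated_element([S0, S1, S2], C, 2, 0, "101", "011"))
```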
IBM Quantum Kolkata, a Falcon 5.11 series system, has readout output multiplexing ratios ranging from 3:1 to 5:1, with readout frequencies in a shared output typically separated by 50-60 MHz. This separation is much larger than the average cavity linewidth $\kappa$ (5.6 MHz) and dispersive shift $\chi$ (1.6 MHz), which when combined enable short 330 ns readout pulses for all qubits. Using default readout pulse amplitudes, calibrated to optimize fidelity while maintaining QND readout, Fig. (7a) shows expectation values produced by executing an 8 qubit GHZ circuit 100 times using qubits [8, 5, 3, 2, 1, 4, 7, 6] that span two multiplexed readout groupings. Importantly, we perform mitigation using the complete $A$-matrix, $A^{(C)}$, as well as $\tilde{A}^{(T)}$, with the results in agreement, consistent with uncorrelated errors largely dominating the readout process.
Intentionally increasing readout pulse amplitudes by approximately 2x from the optimized values results in correlated readout errors. With these larger readout amplitudes, while there is no appreciable change in average readout fidelity (< 0.2%) compared to the default setting, there is a substantial uptick in non-QNDness.

Appendix C

Experiments are run on the 65 qubit IBM Quantum Brooklyn system. GHZ states were prepared starting with a Hadamard gate on qubit 11 and entangling additional qubits as shown in Fig. (8). After entangling the first 6 qubits, this pattern allows for increasing the GHZ state by 4 qubits per layer. The average assignment and CNOT error rates across the qubits used are 2.15% and 1.01%, respectively.

Out to a Hamming distance of $D = 3$, there are $\sum_{k=0}^{3}\binom{42}{k} = 12{,}384$ non-zero elements in each of the $2^{42}$ columns. Storing these values using single-precision floating-point numbers, 4 bytes per entry, requires 193.5 PiB of memory. In addition, we must also specify the row and column indices for these values. In compressed sparse column (CSC) format we need $2^{42} + 1$ column pointers, the differences of which specify the number of non-zero elements in each of the columns. Lastly, we also need the row index for each non-zero matrix element. At 42 qubits, these indices cannot be stored using 32-bit integers, and we must use 64 bits per entry. Storing them requires an additional 387 PiB of memory. The total memory required is therefore 580.5 PiB, $\sim 120$x larger than the memory available in the Fugaku supercomputer [52].

[1] S. Bravyi, D. Gosset, and R. König, Quantum advantage with shallow circuits, Science 362, 308 (2018).