Scalable Measurement Error Mitigation via Iterative Bayesian Unfolding

Measurement error mitigation (MEM) techniques are postprocessing strategies to counteract systematic read-out errors on quantum computers (QCs). Currently used MEM strategies face a tradeoff: methods that scale well with the number of qubits return negative probabilities, while those that guarantee a valid probability distribution are not scalable. Here, we present a scheme that addresses both of these issues. In particular, we present a scalable implementation of iterative Bayesian unfolding, a standard mitigation technique used in high-energy physics experiments. We demonstrate our method by mitigating QC data from experimental preparation of Greenberger-Horne-Zeilinger (GHZ) states on up to 127 qubits and implementation of the Bernstein-Vazirani algorithm on up to 26 qubits.


I. INTRODUCTION
Measurement errors are a significant source of error [1-4] for today's programmable quantum computers (QCs). Currently, QCs typically implement quantum algorithms with a measurement at the end of the quantum circuit. However, advanced quantum operations, including quantum error correction [5,6], involve measurements not only at the end of the quantum computation but throughout the computational scheme. Thus, measurement accuracy is vital to obtaining accurate computational results. Measurement error mitigation (MEM) techniques are a collection of classical post-processing techniques that aim to address the measurement errors inherent to QC hardware. Most MEM methods are unable to achieve both scalability and a guarantee of valid probability distributions after mitigation. Tackling this challenge, we present a scalable MEM strategy whose time and memory costs do not scale exponentially with the number of qubits while also guaranteeing valid probability distributions.
Most MEM methods assume that the true classical probability distribution θ obtained from measuring a quantum state in the computational basis is modified by a response matrix R to yield p, a noisy probability distribution distorted by measurement occurring in a faulty basis, i.e.,

p = Rθ. (1)

Eq. (1) assumes that measurement errors are systematic, linear, and time-independent. To obtain the response matrix, most MEM techniques further assume that state preparation errors can be ignored. These simplifying assumptions have been verified for IBM Quantum Experience (IBMQE)
superconducting devices [7,8], provided the readout calibration data is up to date. Assuming that systematic measurement errors manifest as in Eq. (1), MEM schemes require two steps. The first step is quantum detector tomography (QDT) [7] to acquire the response matrix R (see Fig. 1). The entries of the response matrix, R_ij, are conditional probabilities: the probability of measuring a bitstring b_i given that the computational basis element |b_j⟩ was prepared. In the second step, the classical probability distribution p acquired from measurements after running a quantum circuit U is used to compute a new probability distribution θ which approximates the true distribution (for any future experiment). The most popular strategy, response matrix inversion, defines

θ = R⁻¹ p. (2)

Acquiring R and implementing MEM face two challenges: scaling and negative probabilities. First, R is a 2^n × 2^n matrix for an n-qubit system and in general requires 2^n calibration experiments. In other words, the dimensions of R are exponential in n, and QDT requires exponentially many experiments. Second, while the response matrix R is stochastic, its inverse is not. As most MEM methods rely on applying R⁻¹, this leads to a 'negative probability' issue: the distributions obtained after measurement error mitigation are quasiprobabilities, whose elements can be negative or greater than 1 as long as they sum to 1.
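As a minimal numerical sketch of the negativity issue (with hypothetical single-qubit error rates, not hardware calibration data), applying the inverse of a stochastic response matrix to a noisy finite-shot distribution can return entries below zero:

```python
import numpy as np

# Hypothetical single-qubit readout errors: P(read 1 | prepared 0) = 0.02,
# P(read 0 | prepared 1) = 0.08 (asymmetric errors are typical on hardware).
r = np.array([[0.98, 0.08],
              [0.02, 0.92]])

# Two-qubit response matrix under the no-crosstalk assumption: R = r (x) r.
R = np.kron(r, r)

# A noisy empirical distribution from a finite number of shots, e.g. a
# GHZ-like state measured as mostly 00/11 with a few error counts.
p = np.array([0.46, 0.05, 0.03, 0.46])

# Response matrix inversion (Eq. 2): R is stochastic, its inverse is not.
theta = np.linalg.inv(R) @ p

print(theta)        # one entry is negative: a quasiprobability distribution
print(theta.sum())  # the entries still sum to 1
```

The mitigated vector sums to 1 but contains a negative entry, which is exactly the quasiprobability behavior described above.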
Various strategies have been proposed to address the scalability of the response matrix inversion method. Assuming k-locality for measurement crosstalk [7,9] allows R to be represented as n/k different 2^k × 2^k matrices. This assumption has been verified in Refs. [7,8] and is the first step in addressing the scalability issue. The unmitigated probability distributions are already sparse, as the experiments required to construct these distributions are only performed for S shots, generally between 500 and 100000. In practice, S does not scale exponentially with the number of qubits due to time and computational constraints. Ref. [8] relied on this sparsity to make MEM tractable. In particular, only the matrix elements of R whose bitstrings are close to the observed ones (as measured by Hamming distance) are considered, while others are ignored. This subspace reduction of R means that a user can perform QDT after, and with the knowledge of, the unmitigated experimental results. In other words, the order of QDT and the experiment in Fig. 1 is flipped. Similarly, Ref. [10] suggested placing a precision threshold on R and MEM, such that any probabilities below 1/(10·S) are ignored. These approaches to scaling MEM have been quite effective. In particular, Refs. [8,10,11] demonstrated Greenberger-Horne-Zeilinger (GHZ) state preparation on 27, 42, and 65 qubits, respectively. MEM plays a crucial part in these demonstrations by significantly improving success metrics.
The methods listed above, while scalable, aim to invert R and apply R⁻¹, and the non-stochasticity of R⁻¹ necessarily leads to quasiprobabilities. Numerous methods [1,12,13] have been proposed to address the negativity issue, but no clear consensus has emerged. One strategy is to use constrained least-squares optimization to find a probability distribution with no negative entries that is close to the mitigated quasiprobability distribution (in terms of some norm-based distance between probability distributions). However, it is unclear how to scale this approach. Alternatively, the negative terms can be canceled systematically by projecting onto the nearest valid probability distribution, as detailed in Ref. [13]. However, this zeroes out all negative probabilities, and the result can no longer be considered an unbiased estimator of the true distribution without measurement errors. Here, Ref. [12] is particularly notable. It highlighted that error-prone detectors are standard across experimental fields, and that unfolding techniques used in high-energy physics experiments can also be used to mitigate readout errors on quantum devices. In particular, Ref. [12] showed that iterative Bayesian unfolding (IBU) [14] can mitigate readout errors while preserving non-negativity.
While IBU does not return any quasiprobabilities in its basic form, it requires iterative matrix multiplications with R and is challenging to scale to many qubits without further modification. In this paper, we demonstrate a scalable implementation of IBU. The original IBU formulation starts with the response matrix R, the noisy empirical distribution p, and an initial guess θ^0. It then iteratively applies Bayes' rule to find a mitigated probability distribution θ^k:

θ^{k+1}_j = Σ_i ( R_ij θ^k_j / Σ_m R_im θ^k_m ) p_i. (3)

A well-known technique in the high-energy physics community, IBU is also known as Richardson-Lucy deconvolution [15,16]. While IBU uses Bayes' rule and starts with an initial guess, sometimes referred to as a 'prior,' it does not report a 'posterior' in any sense and is not a Bayesian method. It is more appropriate to call it iterative expectation-maximization unfolding [17], as it converges to the maximum likelihood estimate [18] in the limit of many iterations. Since this method may not be familiar to the quantum computing community, we provide a derivation of IBU as expectation maximization in the quantum computing setting. Our main challenge was to show that, under reasonable assumptions about measurement crosstalk and precision, IBU can be implemented scalably. To demonstrate the scalability of our method, we mitigate errors on QC data acquired by implementing the Bernstein-Vazirani algorithm [19] on the 27-qubit IBMQ Montreal [20] and by preparing GHZ states on the 127-qubit IBMQ Washington. Our MEM implementation performs as well as or better than the current state-of-the-art M3 method [8] without producing negative probabilities, albeit with slightly higher computational time.
The rest of the paper is structured as follows. In Section II, we describe our implementation of scalable iterative Bayesian unfolding. We discuss the connection between IBU and expectation maximization in Section III. Section IV demonstrates our MEM method on two quantum datasets: the Bernstein-Vazirani algorithm on up to 26 qubits (Section IV A) and GHZ state preparation on up to 81 qubits (Section IV B). We give some concluding remarks in Section V.

II. SCALABLE IBU IMPLEMENTATION
Here, we discuss the assumptions and computational approaches we use to scalably implement IBU (as in Eq. 3) for measurement error mitigation (see Algorithm 1).
a. Vectorized Implementation First, we show how to vectorize the IBU update rule given in Eq. (3) to enable fast parallel computation on GPUs. Let Z be a random variable representing the true distribution over bitstrings without any measurement error and Z' be a random variable representing the distribution over bitstrings with measurement error. With c(z'_i) the counts of bitstring z'_i and S shots, let p ∈ R^{2^n} be the noisy distribution with p_i = c(z'_i)/S, let θ^k ∈ R^{2^n} be the kth-iteration guess of the error-mitigated distribution with θ^k_j = P_{θ^k}(Z = z_j), and let R ∈ R^{2^n × 2^n} be the response matrix with R_ij = P(Z' = z'_i | Z = z_j). Then, the IBU update rule is

θ^{k+1} = θ^k ⊙ ( R^T ( p ⊘ ( R θ^k ) ) ),

where ⊙ is elementwise multiplication and ⊘ is elementwise division. This expression fully vectorizes the IBU update rule by writing it solely in terms of elementwise operations and matrix products. It also optimizes the order of operations to avoid constructing large intermediate matrices; only the additional memory for storing θ^{k+1} is needed.
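The vectorized update above can be sketched in a few lines of NumPy (the response matrix and counts below are illustrative toy numbers, not device data):

```python
import numpy as np

def ibu_update(theta, p, R):
    """One vectorized IBU iteration: theta <- theta * (R^T (p / (R theta))).
    Only elementwise operations and matrix-vector products are used, so no
    large intermediate matrices are formed."""
    return theta * (R.T @ (p / (R @ theta)))

# Toy 2-qubit example with a column-stochastic response matrix.
r = np.array([[0.95, 0.10],
              [0.05, 0.90]])
R = np.kron(r, r)
p = np.array([0.40, 0.12, 0.08, 0.40])

theta = np.full(4, 0.25)  # uniform initial guess
for _ in range(100):
    theta = ibu_update(theta, p, R)

print(theta)  # stays non-negative and normalized at every iteration
```

Because R is column-stochastic and p sums to 1, every iterate remains a valid probability distribution, in contrast to the inversion approach of Eq. (2).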
b. Tensor Product Structure of Noise Model Despite the efficient update scheme above, the memory requirements are still quite unfavorable. As previously discussed, the first big memory bottleneck is that the response matrix R grows exponentially in the number of qubits: 2^n × 2^n. To get around this issue, we assume that there is no measurement crosstalk, and hence that R has a convenient tensor product decomposition [7-9], i.e., R = R^(1) ⊗ R^(2) ⊗ ... ⊗ R^(n). While we restrict ourselves to this scenario, our implementation can be extended to cases where measurement crosstalk is restricted to k qubits.
With this tensor product structure on R, we need only 4n parameters to represent R, as opposed to the original 4^n. This structure also allows the matrix product Rx to be computed very quickly using only small matrix products and reshapes [21], without any additional memory requirement. This is the simplest structure we can impose on R to obtain tractability, although it can be relaxed to other tensor network decompositions of R (such as matrix product states (MPS)).
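A minimal sketch of the reshape trick: `kron_matvec` below is a hypothetical helper (not the paper's code) that applies a Kronecker product of single-qubit matrices to a vector without ever materializing the 2^n × 2^n matrix.

```python
import numpy as np

def kron_matvec(mats, x):
    """Compute (mats[0] (x) ... (x) mats[n-1]) @ x via reshapes and small
    matmuls, never forming the exponentially large Kronecker product."""
    n = len(mats)
    t = x.reshape((2,) * n)                      # view x as an n-way tensor
    for i, m in enumerate(mats):
        t = np.tensordot(m, t, axes=([1], [i]))  # contract qubit i's axis
        t = np.moveaxis(t, 0, i)                 # restore the axis ordering
    return t.reshape(-1)

# Sanity check on 3 qubits against the explicit Kronecker product.
mats = [np.array([[0.97, 0.05], [0.03, 0.95]]),
        np.array([[0.95, 0.10], [0.05, 0.90]]),
        np.array([[0.99, 0.02], [0.01, 0.98]])]
x = np.arange(8, dtype=float)
R_full = np.kron(np.kron(mats[0], mats[1]), mats[2])
assert np.allclose(kron_matvec(mats, x), R_full @ x)
```

Each factor touches only a 2 × 2 matrix, so the cost is O(n 2^n) operations and O(2^n) memory for the vector itself, rather than O(4^n) for the full matrix.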
c. Subspace Reduction for Tractability While we managed the exponential scaling of R with a tensor product decomposition, the parameters of the true latent distribution θ ∈ R^{2^n} also scale exponentially in the number of qubits.
To manage this, we use the subspace reduction of Ref. [8]. In particular, instead of maintaining and updating a vector of probabilities over 2^n bitstrings, we only maintain and update an M-dimensional vector over select bitstrings. Specifically, we pre-select M bitstrings (e.g., all bitstrings within Hamming distance d of any bitstring in our dataset) and initialize θ^0 with non-zero values in the entries corresponding to those M bitstrings; all other entries are zero. Under such an initialization, the IBU update rule guarantees that all entries that start at zero remain zero. Thus, we can ignore those entries and track only the M × 1 sub-vector corresponding to non-zero entries. We refer to this subspace-reduced implementation as IBU (Reduced), and to IBU without subspace reduction as IBU (Full). Under this subspace reduction trick, we can no longer easily leverage the Kronecker product structure of R to compute Rx for arbitrary x. Instead, we must construct a reduced matrix R whose rows and columns correspond to the M selected bitstrings. Unlike Ref. [8], we are not solving a least-squares problem, so matrix-free methods like the generalized minimal residual method (GMRES) will not help; we need to construct the reduced matrix R. Even though this matrix does not grow exponentially with the number of qubits, it may still be prohibitively large to store in memory. Our computational solution is to build the product Rx by weighting each column R_j by x_j and keeping only the running sum, so the reduced matrix is never stored in full. We also use the JAX library's built-in accelerated linear algebra routines and just-in-time compilation, vectorize as many operations as possible, and use a GPU for parallelization. Although we incur the cost of repeatedly computing the columns of R, there is a tradeoff between this cost and the memory cost of explicitly storing R.
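The running-sum idea can be sketched as follows; `reduced_matvec` is a hypothetical helper name, and each column of the reduced response matrix is built on the fly from the single-qubit matrices under the no-crosstalk assumption.

```python
import numpy as np

def reduced_matvec(rs, bitstrings, x):
    """Compute y = R_reduced @ x over a list of kept bitstrings. Column j is
    constructed from the single-qubit response matrices (no-crosstalk
    assumption), weighted by x_j, and accumulated into a running sum; the
    full reduced matrix is never stored."""
    B = np.array(bitstrings)                  # M x n array of bits
    y = np.zeros(len(bitstrings))
    for j, col_bits in enumerate(bitstrings):
        col = np.ones(len(bitstrings))        # column j: R[i, j]
        for q, r in enumerate(rs):
            col *= r[B[:, q], col_bits[q]]    # product over qubits
        y += x[j] * col
    return y

# With all 2^n bitstrings kept, this must agree with the full matrix.
r = np.array([[0.9, 0.2], [0.1, 0.8]])
rs = [r, r, r]
bits = [tuple(int(c) for c in format(i, "03b")) for i in range(8)]
x = np.linspace(0.0, 1.0, 8)
R_full = np.kron(np.kron(r, r), r)
assert np.allclose(reduced_matvec(rs, bits, x), R_full @ x)
```

The memory footprint is one length-M column at a time, at the cost of recomputing entries of the reduced matrix on every matrix-vector product, which is exactly the compute/memory tradeoff described above.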
d. Worst-case Computational Complexity Suppose we run an experiment with S shots on an n-qubit system and track strings up to Hamming distance d from the measured bitstrings. In the worst case, there will be S distinct measurements, so subspace-reduced IBU tracks S · Σ_{k=0}^{d} C(n, k) ∼ O(S n^d) bitstrings. The three main computational subroutines are 1) constructing the subspace-reduced response matrix, 2) matrix-vector multiplication, and 3) elementwise multiplication and division. First, the subspace-reduced response matrix has O(S^2 n^{2d}) cells to construct. Each cell requires a single call to each of the n single-qubit response matrices, so constructing the matrix takes O(S^2 n^{2d+1}) operations. Second, multiplying this matrix with a vector of O(S n^d) entries takes O(S^2 n^{2d}) operations. Third, the elementwise multiplication and division take O(S n^d) operations. Overall, a single IBU iteration scales as O(S^2 n^{2d+1}), better than the exponential scaling of the naive approach.
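The worst-case bitstring count above is a simple binomial sum; a one-function sketch (the function name is ours, and the count ignores deduplication across overlapping Hamming balls):

```python
import math

def worst_case_tracked(n, S, d):
    """Worst-case number of bitstrings tracked by subspace-reduced IBU:
    S distinct measured bitstrings, each expanded to its Hamming ball of
    radius d: S * sum_{k=0}^{d} C(n, k) ~ O(S * n^d)."""
    return S * sum(math.comb(n, k) for k in range(d + 1))

# E.g., a 26-qubit experiment at 32000 shots with d = 1:
print(worst_case_tracked(26, 32000, 1))  # 32000 * (1 + 26) = 864000
```

Even at d = 1 the tracked set is far smaller than the 2^26 ≈ 6.7 × 10^7 bitstrings a full vector would require.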

III. ITERATIVE BAYESIAN UNFOLDING AS EXPECTATION-MAXIMIZATION
Here we summarize the argument that iterative Bayesian unfolding is an instantiation of the well-known expectation-maximization (EM) algorithm from machine learning. See Appendix A for the complete derivation, as well as a 'true' Bayesian extension that computes the maximum a posteriori (MAP) estimate of θ given Dirichlet priors. Shepp and Vardi [18] noted connections between IBU and EM for Poisson variables. In our setting, we have at most 2^n discrete observations, each with an associated probability. So our measurements come from a multinomial distribution parameterized by θ ∈ R^{2^n}, and we show the IBU-EM connection for the multinomial distributions typical in quantum computing.

Algorithm 1: Scalable IBU
Input: p (vector of normalized counts for each bitstring), {R^(i)}_{i=1}^{n} (n single-qubit response matrices), θ^0 (initial guess of the MEM distribution), tol (convergence tolerance)
1: repeat
2:   Compute x_1 = R θ^k using [21] if no subspace reduction, else by keeping a running sum of the columns R_j weighted by θ^k_j
3:   Compute x_2 = p ⊘ x_1
4:   Compute x_3 = R^T x_2 using [21] if no subspace reduction, else by keeping a running sum of the columns (R^T)_j weighted by (x_2)_j
5:   Compute θ^{k+1} = θ^k ⊙ x_3
6: Return θ^{k+1} once the change between iterations is below tol
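For small systems, Algorithm 1 without subspace reduction can be sketched directly (the max-norm convergence check is our assumption, as the paper does not pin down a norm; at scale, the two dense matvecs would be replaced by the reshape trick of [21]):

```python
import numpy as np

def scalable_ibu(p, rs, theta0, tol, max_iter=10000):
    """Sketch of Algorithm 1 (no subspace reduction): iterate the IBU
    update built from single-qubit response matrices until successive
    guesses differ by less than tol in the max norm."""
    R = rs[0]
    for r in rs[1:]:
        R = np.kron(R, r)              # small n only; see Sec. II b at scale
    theta = theta0
    for _ in range(max_iter):
        x1 = R @ theta                 # step 2: R theta^k
        x2 = p / x1                    # step 3: elementwise division
        x3 = R.T @ x2                  # step 4: R^T x2
        theta_new = theta * x3         # step 5: elementwise multiplication
        if np.max(np.abs(theta_new - theta)) < tol:
            theta = theta_new
            break
        theta = theta_new
    return theta

r = np.array([[0.95, 0.10], [0.05, 0.90]])
p = np.array([0.40, 0.12, 0.08, 0.40])
theta = scalable_ibu(p, [r, r], np.full(4, 0.25), tol=1e-8)
print(theta)  # a valid probability distribution: non-negative, sums to 1
```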
EM is a standard maximum-likelihood approach to estimating the parameters of latent-variable models, where optimizing the likelihood analytically and directly is not possible. The EM algorithm alternates between an E-step, which computes a conditional expected log-likelihood under the current model parameters, and an M-step, which updates the model parameters to maximize that conditional expectation. EM maximizes a lower bound of the likelihood at every iteration, and thus every parameter update increases the likelihood. The method is guaranteed to converge to a local maximum.
a. Setup Assume all measurements are made in a fixed basis. Let Z be the discrete random variable denoting length-n bitstrings, with P(Z) the distribution over bitstrings prepared by the quantum computer. Due to measurement error, the distribution over bitstrings that are actually measured is different, and we denote it with the random variable Z' with distribution P(Z'). Suppose we collect a dataset of S shots Z' = {z'_1, ..., z'_S} sampled IID from P(Z'), and that we have access to the error model P(Z'|Z). Our goal is to estimate P_θ(Z), where θ is a 2^n-dimensional vector summing to 1 that parameterizes the true distribution over bitstrings.
b. Expectation-Maximization Using Jensen's inequality, we can lower-bound the log-likelihood of the data Z' under parameter θ:

ℓ(θ; Z') = Σ_{i=1}^{S} log Σ_{j=1}^{2^n} P(z'_i, z_j; θ) ≥ Σ_{i=1}^{S} Σ_{j=1}^{2^n} P_{θ^k}(z_j|z'_i) log [ P(z'_i, z_j; θ) / P_{θ^k}(z_j|z'_i) ] =: ℓ̂(θ|θ^k; Z').

We can then find θ that locally maximizes this lower bound ℓ̂(θ|θ^k; Z'), which is equivalent to maximizing Q(θ|θ^k; Z') := Σ_{i=1}^{S} Σ_{j=1}^{2^n} P_{θ^k}(z_j|z'_i) log P(z'_i, z_j; θ), since the denominator inside the log in ℓ̂(θ|θ^k; Z') does not depend on θ. To locally maximize Q(θ|θ^k; Z'), we take the derivative, using a Lagrange multiplier λ to enforce the normalization constraint Σ_j θ_j = 1:

∂/∂θ_j [ Q(θ|θ^k; Z') − λ(Σ_j θ_j − 1) ] = (1/θ_j) Σ_{i=1}^{S} P_{θ^k}(z_j|z'_i) − λ.
Setting this derivative to zero gives θ_j = (1/λ) Σ_{i=1}^{S} P_{θ^k}(z_j|z'_i), and summing over j under the constraint Σ_j θ_j = 1 yields λ = S. Substituting λ = S back in, we get the update rule for each entry θ_j of θ that locally maximizes the lower bound given guess θ^k:

θ^{k+1}_j = (1/S) Σ_{i=1}^{S} P_{θ^k}(z_j|z'_i).

Using Bayes' rule, we have P_{θ^k}(z_j|z'_i) = P(z'_i|z_j) θ^k_j / Σ_m P(z'_i|z_m) θ^k_m. Instead of summing over every shot in the dataset, we can rewrite the sum as one over all possible bitstrings using the bitstring counts c(z'_i). Doing so, we end up with the IBU update rule, Eq. (3). c. Convergence Finally, we note that EM converges to a local stationary point of the log-likelihood. This is because the optimization objective ℓ̂(θ|θ^k; Z') lower-bounds the log-likelihood of the data ℓ(θ; Z') and equals it at θ = θ^k. Consequently, any choice of θ^{k+1} that improves ℓ̂(θ^{k+1}|θ^k; Z') over the current value ℓ̂(θ^k|θ^k; Z') must also improve the log-likelihood ℓ(θ^{k+1}; Z'). The log-likelihood thus increases at every iteration of IBU by at least ℓ̂(θ^{k+1}|θ^k; Z') − ℓ̂(θ^k|θ^k; Z'), and the iterations stop at a stationary point of ℓ̂(θ|θ^k; Z'). Since the objective ℓ̂(θ|θ^k; Z') and the log-likelihood ℓ(θ; Z') are equal at θ = θ^k and have the same slope with respect to θ there [22], this termination coincides with a stationary point of the log-likelihood.
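Collecting the steps above in standard notation (we use ℓ for the log-likelihood, ℓ̂ for the Jensen lower bound, and Q for the expected complete-data log-likelihood; the symbol choices are ours):

```latex
\ell(\theta; Z')
  = \sum_{i=1}^{S} \log \sum_{j=1}^{2^n} P(z'_i, z_j; \theta)
  \;\ge\; \sum_{i=1}^{S} \sum_{j=1}^{2^n} P_{\theta^k}(z_j \mid z'_i)
     \log \frac{P(z'_i, z_j; \theta)}{P_{\theta^k}(z_j \mid z'_i)}
  \;=:\; \hat{\ell}(\theta \mid \theta^k; Z'),

\theta_j^{k+1}
  = \frac{1}{S} \sum_{i=1}^{S} P_{\theta^k}(z_j \mid z'_i)
  = \sum_{i} \frac{c(z'_i)}{S}\,
    \frac{R_{ij}\,\theta_j^{k}}{\sum_{m} R_{im}\,\theta_m^{k}},
```

which is exactly the IBU update of Eq. (3).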

IV. DEMONSTRATIONS
We demonstrate our method on two datasets: measurements after the Bernstein-Vazirani (BV) algorithm (up to 26 qubits) and measurements of the GHZ state (up to 127 qubits). We compare our method with M3 and focus on three metrics: accuracy (measured as success probability for BV and via the ℓ1 error for GHZ), runtime on our machine, and total negative probability in the mitigated distribution. Our algorithm is implemented using JAX 0.3.10 [23] and run on an NVIDIA GeForce RTX 3090 GPU with 24576 MB of memory.¹

FIG. 2. Circuit diagram for the Bernstein-Vazirani algorithm with the hidden bitstring b = 1...1 [5,19].
FIG. 3. Example of a GHZ generation circuit. Given a 7-qubit architecture (right), the GHZ generation circuit is created by first identifying a vertex with maximum degree (q1) and then using breadth-first expansion, performing CNOTs accordingly to create a fully-entangled state. For our experiments, we generated GHZ states on a 127-qubit device, but here we demonstrate the scheme on a 7-qubit device for simplicity.

A. Bernstein-Vazirani algorithm
a. Problem Setting BV [19] is an oracular algorithm in which a hidden bitstring b ∈ {0,1}^n is known to the oracle; when queried with a bitstring x ∈ {0,1}^n, the oracle returns the inner product b·x (mod 2). The goal is to determine the hidden bitstring b by making queries to the oracle. Classically, this requires O(n) queries, but the BV algorithm solves the problem with O(1) queries. Theoretically, regardless of the problem size or the hidden bitstring, the success probability for BV is always 1 (see Fig. 2). This expectation is violated on a real quantum computer, which is subject to decoherence. The data, originally collected by Ref. [20], was taken from a publicly available repository. That work demonstrated quantum speedup over the classical strategy for single-shot BV, where the hidden bitstring changes after every oracle query. While the original work computed time-to-solution as the success metric, here we focus on how the raw success probabilities are affected by MEM.

¹ Our code is accessible at: https://github.com/sidsrinivasan/PyIBU.
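The O(n) classical query count can be illustrated in a few lines (the oracle and hidden string below are hypothetical examples): querying the unit vector e_q returns b·e_q (mod 2) = b_q, so n queries recover b bit by bit.

```python
def classical_bv(oracle, n):
    """Learn the hidden bitstring classically, one bit per query, by
    probing with the unit vectors e_0, ..., e_{n-1}."""
    bits = []
    for q in range(n):
        e_q = [1 if i == q else 0 for i in range(n)]
        bits.append(oracle(e_q))  # oracle(e_q) = b . e_q (mod 2) = b_q
    return bits

# Hypothetical hidden bitstring; the oracle returns the inner product mod 2.
b = [1, 0, 1, 1, 0, 1]
oracle = lambda x: sum(xi * bi for xi, bi in zip(x, b)) % 2
print(classical_bv(oracle, len(b)))  # [1, 0, 1, 1, 0, 1]
```

The quantum BV circuit, by contrast, recovers all n bits from a single oracle query.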
b. Experimental Details In Ref. [20], the BV experiment was performed on the 27-qubit IBMQ Montreal. Experiments at each problem size used 32000 shots, and we use 1/32000 as the convergence tolerance for IBU. The primary metric is the success probability, i.e., the probability that a single guess yields the right answer. This problem employs up to 26 qubits in a bona fide quantum algorithm, and we explore whether MEM can improve the success probability.
c. Results Our results are shown in Fig. 4 (additional results in Appendix B 1); we see that MEM yields a significant improvement in the success probability. In terms of success probabilities, our IBU implementation performs about equal to or marginally better than the state-of-the-art M3 method [8]. Notably, the distribution produced by IBU contains no negative probabilities, while negative probabilities are present in the quasiprobability solutions provided by M3 (see Fig. 4). In terms of runtime, as the problem size increases, the full IBU implementation, i.e., the one without subspace reduction, is noticeably faster than M3, while the subspace-reduced version, set to track bitstrings up to Hamming distance 1, is slightly slower. This is because, compared to full IBU, subspace-reduced IBU is less memory-intensive but can be somewhat slower. For smaller numbers of qubits, relying on full IBU is reasonable, but memory constraints become the bottleneck at larger problem sizes.
B. Generating GHZ states

a. Data Generation In this experiment, we prepared GHZ states on the 127-qubit ibmq_washington (see Appendix B for the device architecture). GHZ states are prepared by entangling n qubits. As long as there is a single connection allowing a multi-qubit entangling operation like CNOT, it is possible to create a GHZ state with linearly increasing circuit depth. To reduce circuit depth, the entangling operations are parallelized as much as the device architecture allows. Starting from the most connected qubit (the qubit with the greatest number of connections/edges to neighboring qubits) in the superposition state |+⟩, we use breadth-first search to entangle every additional qubit with CNOT gates. In effect, this amounts to prioritizing the distance-1 neighbors for entangling, followed by the distance-2 neighbors, and so on. We give an example of this strategy in Fig. 3.

FIG. 5. GHZ state preparation on the 127-qubit IBMQ Washington: The left figure shows the normalized ℓ1 score. We visualize results for the full range of GHZ states we attempted to create, up to 127 qubits. Circuit errors dominate beyond n = 81, and MEM is not of much help. At all problem sizes, IBU returns a higher value for the performance metric 1 − (1/2)(ℓ1 error). M3 achieves a negative score due to negative probabilities. The top right figure shows the total negative probability in the error-mitigated distribution; here, M3 yields significant negative probabilities. The bottom right figure shows the run time on our machine. IBU takes longer to run than M3 but is reasonable in absolute terms.
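The breadth-first construction described above can be sketched as follows; `ghz_cnot_schedule` is a hypothetical helper operating on an adjacency-list coupling map, and the per-layer parallelization of the real device is omitted.

```python
from collections import deque

def ghz_cnot_schedule(coupling, n):
    """Pick the highest-degree qubit as the root, breadth-first search the
    coupling graph, and return the CNOTs (control, target) in BFS order,
    so distance-1 neighbors are entangled first, then distance-2, etc."""
    degree = {q: len(coupling[q]) for q in range(n)}
    root = max(degree, key=degree.get)
    visited = {root}
    cnots = []
    queue = deque([root])
    while queue:
        q = queue.popleft()
        for nb in coupling[q]:
            if nb not in visited:
                visited.add(nb)
                cnots.append((q, nb))  # CNOT from an entangled qubit onto nb
                queue.append(nb)
    return root, cnots

# 7-qubit example coupling map (hypothetical architecture, as in Fig. 3).
coupling = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1, 4], 4: [3, 5, 6], 5: [4], 6: [4]}
root, cnots = ghz_cnot_schedule(coupling, 7)
print(root, cnots)
```

Applying a Hadamard to the root and then the n − 1 scheduled CNOTs yields the fully-entangled GHZ state.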
b. Experimental Details We generated GHZ states up to n = 127. The experiments were repeated for 100,000 shots, and the results were bootstrapped using the observed counts. We use a convergence tolerance of 1/10000. The primary error metric is a normalized ℓ1 score: the ℓ1 error between two probability distributions p and q is Σ_i |p_i − q_i|, which lies in [0, 2], and we normalize it as 1 − (1/2)(ℓ1 error) so that it lies in [0, 1], where a score of 1 implies zero ℓ1 error. We compute the ℓ1 error Σ_i |θ_i − q_i| between the error-mitigated distribution θ and the theoretical distribution q of bitstrings for a GHZ state measured in the Z basis (i.e., q = [0.5, 0, 0, ..., 0, 0.5]). Due to memory costs, we can only run IBU with subspace reduction. Furthermore, we use a Hamming distance of zero, meaning MEM modifies the probability mass only on the measured bitstrings.
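The normalized ℓ1 score is straightforward to compute; a small sketch with an illustrative (made-up) mitigated distribution for 3 qubits:

```python
import numpy as np

def normalized_l1_score(theta, q):
    """Normalized l1 score: 1 - 0.5 * sum_i |theta_i - q_i|. Lies in [0, 1]
    for valid probability distributions, but can go negative when theta is
    a quasiprobability distribution with negative entries."""
    return 1.0 - 0.5 * np.sum(np.abs(theta - q))

# Ideal 3-qubit GHZ distribution: half the mass on 000, half on 111.
q = np.zeros(8)
q[0] = q[-1] = 0.5

# An illustrative error-mitigated distribution (not experimental data).
theta = np.array([0.45, 0.02, 0.01, 0.02, 0.0, 0.02, 0.03, 0.45])
score = normalized_l1_score(theta, q)
print(score)  # 0.9
```

The same formula applied to a quasiprobability vector with large negative entries can yield a negative score, which is the M3 behavior seen beyond n = 81 in Fig. 5.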
c. Results Fig. 5 shows the results of our GHZ state preparation experiment. First, IBU attains a significantly higher normalized ℓ1 score than the raw data or M3. Beyond n = 81, the true solution bitstrings (0^n and 1^n) were no longer the modal bitstrings due to circuit errors, so MEM (M3 and IBU) is not of much help. However, while the normalized ℓ1 score for IBU goes to zero, the score for M3 becomes negative due to negative probabilities. We found that the negative probabilities produced by M3 can be quite substantial, up to ∼30% of the probability mass. We also found that M3 places higher probability mass on the 'correct' bitstrings (0^n and 1^n) but compensates with negative probabilities on the wrong bitstrings, leading to overall worse performance on the normalized ℓ1 score (see Appendix B 2). Finally, we note that the cost of achieving better ℓ1 error is a longer run-time: IBU is noticeably slower than M3. Nevertheless, in absolute terms, IBU can perform MEM on 81 qubits in a little over 2 minutes on our machine. IBU run-time declines past n = 81 because iterations no longer modify the guess and there is little room for progress. One can always choose a higher tolerance to reduce the number of IBU iterations and hence the run-time. In Appendix B 2, we also provide a graph showing empirically that the per-iteration run-time scales linearly in the number of qubits.

V. CONCLUSION
We showed that IBU, an expectation-maximization algorithm, can be implemented scalably and without compromising the benefits of error mitigation. Our implementation scales to QCs with hundreds of qubits while avoiding negative probabilities. The mitigation accuracy and runtime are as good as, and often better than, state-of-the-art MEM methods. We demonstrated our method by mitigating results from the 26-qubit BV algorithm and a 127-qubit GHZ preparation experiment. Using an ℓ1-based metric, we showed that up to 81 qubits, the GHZ experiment had a non-zero overlap with the ideal GHZ probability distribution. To our knowledge, this is the largest number of qubits to be entangled on programmable QCs. Overall, our results highlight the need to mitigate readout errors and provide an efficient way to do so.
As we stress above, MEM strategies rely on certain physical assumptions about measurement errors. In our case, we assume linear errors with minimal measurement crosstalk. The runtime and efficacy of various strategies depend on the validity of these assumptions. As QCs become more sophisticated, frequently validating these assumptions will be necessary. In the future, we expect such sophisticated quantum detector tomography to be native to QC operation and to inform MEM strategies.

Appendix A: Derivation of IBU as Expectation Maximization

Define Q(θ|θ^k; Z') := Σ_{i=1}^{S} Σ_{j=1}^{2^n} P_{θ^k}(z_j|z'_i) log P(z'_i, z_j; θ), and note that maximizing ℓ̂(θ|θ^k; Z') is equivalent to maximizing Q(θ|θ^k; Z'). EM consists of an E-step, which computes Q(θ|θ^k; Z'), and an M-step, which computes arg max_θ Q(θ|θ^k; Z'). For computational reasons, we combine these into a single step updating the parameter θ^k to θ^{k+1}. To locally maximize Q(θ|θ^k; Z'), we take the derivative, using a Lagrange multiplier λ to enforce the normalization constraint Σ_j θ_j = 1:

∂/∂θ_j [ Q(θ|θ^k; Z') − λ(Σ_j θ_j − 1) ] = (1/θ_j) Σ_{i=1}^{S} P_{θ^k}(z_j|z'_i) − λ.
Setting this to zero, we get θ_j = (1/λ) Σ_{i=1}^{S} P_{θ^k}(z_j|z'_i), and the requirement Σ_j θ_j = 1 gives λ = S. Substituting this back in, we get the update rule for each element θ_j of θ that locally maximizes the log-likelihood given guess θ^k:

θ^{k+1}_j = (1/S) Σ_{i=1}^{S} P_{θ^k}(z_j|z'_i).

Note that by Bayes' rule, P_{θ^k}(z_j|z'_i) = P(z'_i|z_j) θ^k_j / Σ_m P(z'_i|z_m) θ^k_m. Additionally, instead of summing over every shot in the dataset, we can rewrite the sum as one over all possible bitstrings using the bitstring counts c(z'_i). Doing so, we end up with the IBU update rule:

θ^{k+1}_j = Σ_i (c(z'_i)/S) · R_{ij} θ^k_j / Σ_m R_{im} θ^k_m.

Additional Results for GHZ state preparation experiment
In Fig. 9, we visualize the probabilities assigned to the 'correct' bitstrings (0^n and 1^n) by IBU, M3, and no MEM. Even though IBU achieved a higher normalized ℓ1 score (Fig. 5), M3 places greater probability on the 'correct' bitstrings and compensates with negative probabilities on 'wrong' bitstrings. While this may appear advantageous, the negative probabilities show that M3's error-mitigated distribution deviates from the true solution, which assigns zero probability to those bitstrings. Hence, the normalized ℓ1 score gives a more reliable measure of whether the error-mitigated distribution matches the true distribution. Indeed, in Fig. 9 we see that M3 assigns negative probabilities to the 'correct' bitstring 0^n at n > 81.
In Fig. 10, we show the per-iteration runtime of IBU, disregarding the effect that the convergence tolerance might have on the run-time. As expected, the graph is essentially linear in the number of qubits.

FIG. 9. GHZ state preparation on the 127-qubit IBMQ Washington: The probabilities assigned to the 'correct' bitstrings (0^n on the left and 1^n on the right) by IBU, M3, and no MEM. Even though IBU achieved a higher normalized ℓ1 score (Fig. 5), M3 places greater probability on the 'correct' bitstrings. This is because M3 compensates for higher probabilities on the correct answer by assigning negative probabilities to 'wrong' bitstrings.

FIG. 10. GHZ state preparation on the 127-qubit IBMQ Washington (per-iteration run-time): To disregard the effect of a varying number of iterations until convergence, we give the time taken by a single IBU iteration under Hamming distance 0. The run-time scales linearly with the number of qubits and the number of shots, matching theoretical expectations and showing that IBU is scalable to higher numbers of qubits. Error bars indicate 95% confidence intervals.

FIG. 4. Bernstein-Vazirani algorithm results: The BV algorithm was implemented on the 27-qubit IBMQ Montreal, and we compare IBU, M3, and the raw (no-MEM) result. The left figure shows the log average success probability and is restricted to larger problem sizes, as the mitigated and unmitigated distributions are nearly identical at smaller sizes (see Appendix B 1). Full IBU and subspace-reduced IBU have identical performance. The top right figure shows the total negative probability in the error-mitigated distribution, and the bottom right figure shows the run-time on our machine. Error bars indicate 95% confidence intervals.
VI. ACKNOWLEDGEMENTS

This work was partly supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, and Accelerated Research in Quantum Computing under Award Number DE-SC0020316. This research used resources from the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.

FIG. 8. Additional Bernstein-Vazirani results: Bernstein-Vazirani results for all problem sizes. The left figure shows the log average success probability for the full range of problem sizes from 1 to 26. The figure on the right shows the average success probability for small problem sizes. The performance of IBU (both full and subspace-reduced) and M3 is nearly identical at all problem sizes. Error bars indicate 95% confidence intervals.

TABLE I. Device specifications for Montreal and Washington. 1QG and 2QG denote 1-qubit and 2-qubit gates, respectively. RO denotes readout.