Probing context-dependent errors in quantum processors

Gates in error-prone quantum information processors are often modeled using sets of one- and two-qubit process matrices, the standard model of quantum errors. However, the results of quantum circuits on real processors often depend on additional external "context" variables. Such contexts may include the state of a spectator qubit, the time of data collection, or the temperature of control electronics. In this article we demonstrate a suite of simple, widely applicable, and statistically rigorous methods for detecting context dependence in quantum circuit experiments. They can be used on any data that comprise two or more "pools" of measurement results obtained by repeating the same set of quantum circuits in different contexts. These tools may be integrated seamlessly into standard quantum device characterization techniques, like randomized benchmarking or tomography. We experimentally demonstrate these methods by detecting and quantifying crosstalk and drift on the publicly accessible 16-qubit ibmqx3.

In this paper we propose and demonstrate a practical, statistically rigorous toolkit for detecting whether a quantum circuit's observable behavior depends on external variables. The underlying statistical tasks here are old and well studied [34][35][36][37], so we make no claims of statistical novelty. Instead, our focus is on choosing and harnessing established statistical techniques for detecting context dependence in quantum characterization, verification, and validation (QCVV), using the type of data most often found in quantum device characterization and circuit-based experiments. Almost all such experiments generate count data: the aggregated outcomes of $N$ repetitions of one or more quantum circuits that each begin with a state preparation and end with a measurement. Usually, all the measurement results for a single circuit are collected into a single "pool". This precludes testing for variation, because a single pool of counts is always perfectly consistent with a single underlying set of probabilities for the observed outcomes. However, some data have additional structure, such as time stamps, that define a natural division into two or more pools that are each associated with a different "context". Then, we can look for significant variation in the circuit behavior between contexts (Fig. 1). For example, flipping each of two coins 100 times and getting 49 heads for one coin and 55 for the other is intuitively consistent with the claim that the coins are identically biased; the variation is typical of random finite-sample fluctuations. Observing instead 28 heads for one coin and 72 heads for the other is strong evidence that the coins actually have different biases. We can address this question formally using statistical hypothesis testing, a standard framework for rigorously deciding if there is sufficient evidence to reject a base assumption, known as a null hypothesis. In the tools we propose, our null hypothesis is that there is no context dependence, and we seek statistically significant evidence in the data to the contrary.
FIG. 1. An illustration of how to detect and quantify context dependence in a quantum information processor by repeatedly performing a quantum circuit in two or more contexts. In this simple example, a Bell state is prepared during two different time periods (am/pm), to test for time variation; or while an adjacent pair of qubits is or is not being driven, to test for crosstalk. The measurement outcome frequencies for the two contexts are compared to determine if the circuit behavior is the same across contexts. If not, the change is quantified. Multiple test circuits and a physical model of the device can sometimes enable identification of the underlying cause and indicate the size of the effect.
This paper is structured as follows. In Section II we present hypothesis testing techniques for detecting context dependence in count data from one or more circuits. In Section III we adapt these context dependence detection tools to the task of context dependence quantification. In Section IV we simulate applying these techniques to detect drift, demonstrating that these methods can clearly highlight context-dependent errors. In Section V we apply our techniques to drift and crosstalk detection and quantification on the ibmqx3 [38], a publicly accessible superconducting quantum processor. In Section VI we discuss the relationship between our tools and simultaneous RB [7], a popular crosstalk quantification technique, and we conclude in Section VII.

A. Single circuit data
First, we consider how to detect context dependence in a single quantum circuit. Suppose this circuit has $M \geq 2$ possible measurement outcomes, indexed by $m = 1, 2, \ldots, M$. In general, if a circuit has $n$ qubits (and all $n$ qubits are read out at the end of the circuit), then $M = 2^n$. Note that we could also choose to measure only a subset of the qubits in the system, or marginalize multi-qubit data over some of the qubits. Let this circuit be performed repeatedly in each of $C$ different contexts, indexed $c = 1, 2, \ldots, C$. For example, the contexts might correspond to distinct time intervals, or to driving (or not driving) neighboring qubits (see Fig. 1). For each context $c$, the circuit defines a probability distribution over the possible measurement results
$$ p_c = (p_{c,1}, p_{c,2}, \ldots, p_{c,M}). \tag{1} $$
These are probabilities for obtaining each of the $M$ measurement outcomes, after averaging over any other unaccounted-for contexts that might vary within a $c$-indexed context. For example, time is a continuously varying context variable, and a time period context is a coarse-graining over time. Thus, in this example each $p_c$ is the probability distribution after this time-averaging. An experiment consists of running our circuit $N_c$ times in each context $c$ and recording the total counts for each measurement outcome $m$. This effectively samples from each of the $p_c$ distributions, producing measurement results $x = \{x_c\}$. Here
$$ x_c = (x_{c,1}, x_{c,2}, \ldots, x_{c,M}) \tag{2} $$
is a vector of non-negative integers summing to $N_c$, representing the observed counts from $N_c$ repeats of the circuit in context $c$. In terms of the data, context independence holds iff all of the data were drawn from the same underlying probability distribution $p_0$. To detect context dependence we therefore ask whether the measurement results in different contexts are consistent with being drawn from a single distribution. This is a hypothesis testing problem: we are looking for evidence to reject the null hypothesis that the underlying distributions are context independent.
In general, hypothesis testing is the following procedure:
1. Choose a statistic. This is a function $\Lambda$ from the space of all possible experimental results to $\mathbb{R}$.
2. Choose a significance level $\alpha$.
3. Collect the data $x$ and compute $\Lambda(x)$.
4. Calculate the p-value ($p$) of $\Lambda(x)$. This is the probability of observing a value of $\Lambda$ that is at least as extreme as $\Lambda(x)$ if the null hypothesis is true.
5. Reject the null hypothesis if $p < \alpha$. Here, rejecting the null hypothesis means detecting context dependence.
Any procedure of this form ensures that the probability of falsely detecting context dependence is at most $\alpha$. Within this constraint, it is desirable to choose a procedure (i.e., a statistic) with high power to detect context dependence if it is present. For general hypothesis testing, there is no universally optimal statistic except for the simplest problems [35], but the log-likelihood ratio (LLR) statistic is canonical and popular, and we have found it to be convenient and powerful.
For data $x$, a statistical model parameterized by $\theta \in H$ for some parameter space $H$, and a null-hypothesis subspace $H_0 \subset H$, the LLR is defined as
$$ \lambda = -2 \log\left( \frac{L(\hat{\theta}_0)}{L(\hat{\theta})} \right), \tag{3} $$
where $L(\theta) = \Pr(x \mid \theta)$ is the likelihood function, $\hat{\theta}_0$ is the maximum likelihood estimate of $\theta$ over the null-hypothesis subspace $H_0$, and $\hat{\theta}$ is the maximum likelihood estimate of $\theta$ over the full parameter space $H$ [34][35][36]. For our problem, we have
1. $H_0$: the null hypothesis that $p_c = p_0$ for all $c$. The maximum likelihood estimate over the null hypothesis space is $\hat{p}_0 = N^{-1}(x_1, x_2, \ldots, x_M)$, with $x_m = \sum_c x_{c,m}$ the counts obtained by aggregating over contexts, and $N = \sum_c N_c$.
2. $H$: the alternative hypothesis that each $p_c$ is independent. The maximum likelihood estimate under the alternative hypothesis is $\hat{p}_c = x_c / N_c$.
Via basic multinomial statistics, the LLR is then
$$ \lambda = -2 \sum_{c=1}^{C} \sum_{m=1}^{M} x_{c,m} \log\left( \frac{\hat{p}_{0,m}}{\hat{p}_{c,m}} \right) = 2 \sum_{c=1}^{C} \sum_{m=1}^{M} x_{c,m} \log\left( \frac{x_{c,m} N}{N_c x_m} \right), \tag{4} $$
where terms with $x_{c,m} = 0$ contribute zero. To compute p-values, we appeal to Wilks' theorem [36]. It states that if the null hypothesis holds, as the number of samples $\to \infty$, the LLR converges to a $\chi^2_k$ random variable, where $k = l - l_0$ and $l$ (resp., $l_0$) is the number of free parameters in the full (resp., null) model [34][35][36]. Each probability vector contains $M - 1$ free parameters ($M$ probabilities summing to 1), so $l = C(M - 1)$ and $l_0 = (M - 1)$. If $N_c \gg 1$, then under the null hypothesis
$$ \lambda \sim \chi^2_k, \qquad k = (C - 1)(M - 1). \tag{5} $$
The p-value of an observed $\lambda$ is therefore approximated by
$$ p = 1 - F_k(\lambda), \tag{6} $$
where $F_k$ is the $\chi^2_k$ cumulative distribution function. For pre-specified $\alpha$, we say that context dependence has been detected at significance $\alpha$ if $p < \alpha$. We call this simple primitive the individual circuit test (ICT), because it applies to data from a single circuit.
Here is a simple example of how the ICT can be used to detect context dependence. Consider a 1-qubit circuit comprising preparation of $|0\rangle$, application of $X_{\pi/2} = \exp(-i\pi\sigma_x/4)$, and measurement of $\sigma_z$. It is performed in two contexts: (1) while a neighbor qubit sits idle; (2) while the neighbor is driven in some fashion. Now, suppose the operations are perfect under Context 1, but the driving in Context 2 causes the $X_{\pi/2}$ gate to over-rotate: $X_{\pi/2} \to \exp(-i\,1.1\pi\sigma_x/4)$. We chose a significance level of 5%, and simulated 200 repetitions of the circuit in each context, observing 99 "0" outcomes in Context 1 and 131 in Context 2. Putting this data into Eqs. (4)-(6) with $C = 2$ and $M = 2$, we find that the p-value is $p \approx 0.1\%$. This is easily significant at the 5% level ($p < 5\%$), so context dependence was detected in this simulated experiment. We also simulated a scenario where driving did not cause any change, and this time obtained 108 "0" counts in Context 1 and 107 in Context 2. Calculated in the same way, the p-value for this data was $p \approx 92\%$, so context independence was not rejected. If we repeated this simulation many times, in the latter case where there is no context dependence we'd expect to erroneously detect context dependence in 5% of the trials.
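The worked example above can be reproduced with a few lines of code. The following is a minimal sketch, not the authors' implementation: the function names `chi2_sf`, `llr`, and `ict_p_value` are illustrative, natural logarithms are assumed, and the $\chi^2_k$ tail probability is computed with a hand-written standard-library helper for integer $k$ rather than a statistics package.

```python
import math


def chi2_sf(x, k):
    """Tail probability 1 - F_k(x) of a chi-squared variable with integer k >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    # Start from Q(1/2, h) = erfc(sqrt(h)) for odd k, or Q(1, h) = exp(-h) for even k,
    # then step up using Q(a + 1, h) = Q(a, h) + h^a exp(-h) / Gamma(a + 1).
    if k % 2:
        q, a = math.erfc(math.sqrt(h)), 0.5
    else:
        q, a = math.exp(-h), 1.0
    while a < k / 2.0:
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))
        a += 1.0
    return min(q, 1.0)


def llr(counts):
    """Log-likelihood ratio of Eq. (4); counts[c][m] is the count of outcome m in context c."""
    n_c = [sum(row) for row in counts]
    n = sum(n_c)
    x_m = [sum(col) for col in zip(*counts)]
    lam = 0.0
    for row, nc in zip(counts, n_c):
        for x, xm in zip(row, x_m):
            if x > 0:  # a zero count contributes nothing to the sum
                lam += 2.0 * x * math.log(x * n / (nc * xm))
    return lam


def ict_p_value(counts):
    """Approximate p-value of Eq. (6) with k = (C - 1)(M - 1) degrees of freedom."""
    k = (len(counts) - 1) * (len(counts[0]) - 1)
    return chi2_sf(llr(counts), k)


alpha = 0.05
p_drive = ict_p_value([[99, 101], [131, 69]])   # 99 vs. 131 "0" outcomes out of 200
p_idle = ict_p_value([[108, 92], [107, 93]])    # 108 vs. 107 "0" outcomes out of 200
print(p_drive, p_drive < alpha)
print(p_idle, p_idle < alpha)
```

Run on the counts quoted in the text, the two p-values should come out close to the 0.1% and 92% reported above, so only the first data set leads to a detection at the 5% level.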

B. Multi-circuit data
Many quantum circuit based experiments involve collecting data from multiple distinct circuits, as is the case for most QCVV techniques, including all RB protocols [5][6][7][8][9][10][11][12][13][14][15][16][17][18][19][20][21], GST [1][2][3][4] and other tomographic methods [39,40]. We now extend the context dependence detection method presented above to the multi-circuit scenario. Consider $Q$ circuits indexed $q = 1, 2, \ldots, Q$, each with $M$ possible outcomes, indexed $m = 1, 2, \ldots, M$ [41]. These circuits are all implemented in each of $C$ contexts, again indexed by $c$ for $c = 1, 2, \ldots, C$. Slightly generalizing the notation of Eq. (1), let
$$ p_{q,c} = (p_{q,c,1}, p_{q,c,2}, \ldots, p_{q,c,M}) \tag{7} $$
denote the underlying probability distribution for circuit $q$ in context $c$. As before, a particular circuit is context independent iff all $p_{q,c} = p_{q,0}$ for some circuit-dependent $p_{q,0}$. All of the circuits are context independent if this holds for all circuits $q$. Consider data generated by $N_{q,c}$ repeats of circuit $q$ in context $c$. Let $x_{q,c,m}$ denote counts data for outcome $m$ of circuit $q$ in context $c$, with the full set of data denoted by
$$ x = \{x_{q,c,m}\}. \tag{8} $$
There are many ways to test for context dependence with multi-circuit data of this sort. Most obviously, we could apply the ICT defined above to the data from each circuit, to separately test for context dependence in each circuit. However, implementing all $Q$ ICTs involves implementing multiple statistical hypothesis tests, and it is necessary to take this into account. If the null hypothesis is true, and we naively implement $T$ independent hypothesis tests all at some fixed significance $\alpha$, then we expect approximately $\alpha T$ of the tests to falsely reject the null hypothesis just by random chance. In fact, the probability of falsely rejecting the null hypothesis in at least one test will converge to 1 as $T$ increases.
To keep the probability of false detection in one or more tests, known as the family-wise error rate (FWER) [35,42], to at most $\alpha$, it is necessary to adjust the significance of the individual tests. The simplest solution is the generalized Bonferroni correction [35,42]: for any tests implemented together, a FWER of at most $\alpha$ can be obtained by setting the "local" significance level of test $i$ to $\alpha_i = \alpha w_i$ for any $w_i \geq 0$ satisfying $\sum_i w_i = 1$. Implementing all $Q$ ICTs with each significance set to $\alpha/Q$ is therefore sufficient to maintain a global significance of $\alpha$. However, the Bonferroni correction is unnecessarily conservative, so we will use a strictly more powerful correction.
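The need for a correction can be made concrete with a short calculation. The sketch below assumes independent tests with every null hypothesis true, so that the FWER is $1 - (1 - \alpha_{\text{local}})^T$; the function name `fwer_independent` is illustrative.

```python
def fwer_independent(local_alpha, num_tests):
    """Probability of at least one false rejection among independent tests,
    each run at significance local_alpha, when every null hypothesis is true."""
    return 1.0 - (1.0 - local_alpha) ** num_tests


alpha = 0.05
for t in (1, 10, 100, 1405):
    naive = fwer_independent(alpha, t)           # every test run at alpha
    bonferroni = fwer_independent(alpha / t, t)  # every test run at alpha / T
    print(t, round(naive, 4), round(bonferroni, 4))
```

Without a correction the FWER approaches 1 as the number of tests grows, whereas the equal-weight Bonferroni choice keeps it at or below $\alpha$.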
Because the $\lambda_q$ are independent under the null hypothesis, where $\lambda_q$ is the LLR for circuit $q$, we can implement the ICTs with a Hochberg correction [42, 43][44]. In this setting, the Hochberg correction keeps the FWER to at most $\alpha$ using the following procedure:
1. Order the $Q$ p-values from smallest to largest: $p_{(1)}, p_{(2)}, \ldots, p_{(Q)}$.
2. Find the largest $l$ such that $p_{(l)} \leq \alpha/(Q - l + 1)$, denoting this integer by $l_{\max}$.
3. Reject the null hypothesis (context independence) for all circuits with p-values smaller than
$$ p_{\text{threshold}} = \frac{\alpha}{Q - l_{\max} + 1}. \tag{9} $$
Hereafter, this is the multi-test correction procedure used for the ICTs. Note that $p_{\text{threshold}}$ is not a true threshold for the statistical significance of a p-value, in the sense that it depends on the data. We therefore refer to it instead as a "pseudo-threshold". Sometimes it is convenient to convert this to a pseudo-threshold above which the LLR of a circuit is significant. Inverting Eq. (6), this is given by
$$ \lambda_{\text{threshold}} = F_k^{-1}(1 - p_{\text{threshold}}), \tag{10} $$
where $k$ is the degrees of freedom per circuit, in Eq. (5), and $F_k^{-1}$ is the inverse cumulative distribution function for the $\chi^2_k$ distribution.
The ICTs are often not the most sensitive for deciding whether there is context dependence in at least one circuit. In particular, there are tests that are more sensitive to context dependence that is distributed uniformly over all the circuits. A complementary test statistic, powerful for detecting uniformly distributed context dependence, is the aggregate LLR
$$ \lambda_{\text{agg}} = \sum_{q=1}^{Q} \lambda_q, \tag{11} $$
where, again, $\lambda_q$ is the LLR for circuit $q$. This is the LLR between the null hypothesis of context independence in all circuits and the full context dependence model. That is, it is the LLR between the model whereby $p_{q,c} = p_{q,0}$ for some $p_{q,0}$ and all $q$, and the model whereby all the $p_{q,c}$ are independent. Therefore, when the null hypothesis holds, $\lambda_{\text{agg}}$ approximately follows a $\chi^2_{k_{\text{agg}}}$ distribution with
$$ k_{\text{agg}} = Q(C - 1)(M - 1). \tag{12} $$
For $k \gg 1$, the $\chi^2_k$ distribution is approximately normal with mean $k$ and variance $2k$. Therefore, in the common situation of $Q \gg 1$, a convenient and intuitive way to express the statistical significance of $\lambda_{\text{agg}}$ is as the number of standard deviations by which it exceeds its expected context-independent value. This is given by
$$ N_\sigma = \frac{\lambda_{\text{agg}} - k_{\text{agg}}}{\sqrt{2 k_{\text{agg}}}}. \tag{13} $$
In our experience, the p-value of the aggregate LLR is often vanishingly small (see, e.g., Sec. IV), so $N_\sigma$ provides an alternative measure of statistical significance that is on a more convenient scale. It is sometimes useful to have a threshold for $\alpha$ significance of the $N_\sigma$, and this is given by
$$ N_{\sigma,\text{threshold}} = \frac{F_{k_{\text{agg}}}^{-1}(1 - \alpha) - k_{\text{agg}}}{\sqrt{2 k_{\text{agg}}}}. \tag{14} $$
When $Q \gg 1$, this is essentially identical to the standard significance thresholds for standard deviations above the mean with a normal distribution.
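The aggregate statistic and its threshold are simple to evaluate. The sketch below is illustrative only: the function names are hypothetical, and because the Python standard library has no $\chi^2$ quantile function, the exact $F_k^{-1}$ in the threshold is replaced by the Wilson-Hilferty approximation, which is an assumption of this sketch and not part of the method described above. The values $Q = 1405$, $C = 5$, $M = 2$ and $\alpha = (5/22)\%$ are taken from the simulated example of Sec. IV.

```python
import math
from statistics import NormalDist


def n_sigma(lambda_agg, k_agg):
    """Standard deviations by which lambda_agg exceeds its null-hypothesis mean k_agg."""
    return (lambda_agg - k_agg) / math.sqrt(2.0 * k_agg)


def aggregate_n_sigma(per_circuit_llrs, num_contexts, num_outcomes):
    """N_sigma for the aggregate LLR, with k_agg = Q (C - 1)(M - 1)."""
    k_agg = len(per_circuit_llrs) * (num_contexts - 1) * (num_outcomes - 1)
    return n_sigma(sum(per_circuit_llrs), k_agg)


def chi2_quantile_wh(prob, k):
    """Wilson-Hilferty approximation to the chi-squared quantile (sketch assumption)."""
    z = NormalDist().inv_cdf(prob)
    return k * (1.0 - 2.0 / (9.0 * k) + z * math.sqrt(2.0 / (9.0 * k))) ** 3


def n_sigma_threshold(alpha, k_agg):
    """Threshold on N_sigma at significance alpha, using the approximate quantile."""
    return n_sigma(chi2_quantile_wh(1.0 - alpha, k_agg), k_agg)


k_agg = 1405 * (5 - 1) * (2 - 1)   # Q = 1405 circuits, C = 5 contexts, M = 2 outcomes
print(n_sigma_threshold(0.05 / 22, k_agg))
```

With these inputs the approximate threshold should land close to the value of about 2.9 quoted for the simulated example, slightly above the corresponding normal quantile because the $\chi^2$ distribution is skewed.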
Although the aggregate LLR test is often more sensitive, the ICTs are useful because they indicate which circuits vary. This can constitute helpful diagnostic information, as demonstrated later. We can strike a balance between these tests by implementing the set of ICTs and the aggregate test, with significance levels adjusted appropriately. A reasonable strategy, which we adopt for the simulations and experiments in this paper, is the following. For a user-specified global significance $\alpha$:
1. Implement the aggregate LLR test at a significance of $\alpha/2$.
2. Implement the ICTs using a Hochberg correction at a significance of $\beta$.
This type of multi-test compensation is based on the closed test principle (a generalization of the Bonferroni correction), and it controls the FWER to be at most α [45].
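The Hochberg-corrected ICTs can be sketched as follows. This is a minimal illustration with a hypothetical function name and made-up p-values; it rejects circuits whose p-value is at most the pseudo-threshold, which selects exactly the $l_{\max}$ smallest p-values.

```python
def hochberg_pseudo_threshold(p_values, alpha):
    """Return (pseudo-threshold, indices of rejected circuits) for the Hochberg procedure.

    Returns (None, []) when no p-value satisfies the step-up condition.
    """
    q = len(p_values)
    ordered = sorted(p_values)
    l_max = 0
    for l in range(1, q + 1):
        # Step-up condition: p_(l) <= alpha / (Q - l + 1); keep the largest such l.
        if ordered[l - 1] <= alpha / (q - l + 1):
            l_max = l
    if l_max == 0:
        return None, []
    p_threshold = alpha / (q - l_max + 1)
    rejected = [i for i, p in enumerate(p_values) if p <= p_threshold]
    return p_threshold, rejected


p_vals = [0.001, 0.30, 0.012, 0.04, 0.20]  # made-up p-values for five circuits
threshold, flagged = hochberg_pseudo_threshold(p_vals, alpha=0.05)
print(threshold, flagged)
```

For these made-up inputs the second-smallest p-value (0.012) passes its step-up condition while an equal-weight Bonferroni cut at $\alpha/Q = 0.01$ would miss it, which illustrates why the Hochberg correction is the more powerful of the two.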

C. Choosing the circuits
The context dependence detection methods that we have proposed in this section can be applied to data from almost any set of circuits. They can be bolted on to almost any device characterization protocol. However, if context dependence detection is a high priority, it is often useful to choose circuits that are sensitive to all the parameters that might vary with context. GST circuits [1][2][3][4] are one reasonable choice, because they are informationally complete for tomography of gates, state preparations and measurements (SPAM). If context dependence manifests as an observable dependence of gate or SPAM process matrices on the context, at least one GST circuit will be sensitive to it. We use GST circuits in our examples below.
Using our tools on data from GST circuits does not require implementing the tomographic reconstructions of GST. Tomographic reconstructions using the data from each context are nevertheless clearly possible with GST data. This naturally raises the question of what our tools add that couldn't be achieved as easily with tomography. Our tools have three distinct advantages over tomography, which highlight how they complement any tomographic data analysis. First, precise tomography requires large amounts of data and many individual circuits, whereas detecting context dependence can often be achieved using few circuits and/or less data. Second, tomographic methods are based on fitting a model, and become unreliable if this model does not accurately describe the system [25]. In contrast, these direct context dependence detection tools require no model of the underlying operations (the gates and SPAM). Finally, tomography is computationally expensive, but the tools here require only very simple classical computation.

III. QUANTIFYING CONTEXT DEPENDENCE
The detection methods presented in the previous section test whether or not there is statistically significant evidence of context dependence; when used rigorously they only report "yes" or "no". In general, the value of a test statistic will not necessarily quantify the "strength" of a detected effect. Neither the magnitude of the LLR for each circuit, nor the aggregate LLR, nor the associated p-values, nor the aggregate $N_\sigma$ directly quantify the strength of context dependence. Instead, they quantify our confidence that context dependence exists. If there is any context dependence in one or more circuits then, as we take more data, both $\lambda_{\text{agg}}$ and $N_\sigma$ will increase without bound. Arguably, the most interesting metrics of context dependence "strength" would describe the variation of an underlying gate/SPAM error rate, but this is the domain of specific QCVV protocols (e.g., RB or GST). In the very general framework of this paper, the most we can do is to quantify the strength of each individual circuit's context dependence. This is equivalent to estimating how much the circuit's outcome probabilities change between contexts, and there are many ways to do this.

A. Jensen-Shannon Divergence
The simplest way to quantify context dependence is to rescale the per-circuit LLRs to
$$ \text{JSD}_q = \frac{\lambda_q}{2 N_q}, \tag{15} $$
where $N_q = \sum_c N_{q,c}$. As suggested by this notation, $\text{JSD}_q$ provides an estimate of the Jensen-Shannon divergence (JSD) of the underlying probability distributions. For probability distributions $P_c$ over $M$ events, with $c = 1, 2, \ldots, C$, and some weightings $\pi_c$ with $\sum_c \pi_c = 1$, the JSD is defined by [46]
$$ \text{JSD}_{\{\pi_c\}}(P_1, \ldots, P_C) = H\left( \sum_{c=1}^{C} \pi_c P_c \right) - \sum_{c=1}^{C} \pi_c H(P_c), \tag{16} $$
where $H(P)$ is the Shannon entropy of the probability distribution $P$ given by
$$ H(P) = -\sum_{m=1}^{M} P(m) \log P(m). $$
The $\text{JSD}_q$ quantity defined in Eq. (15) is in fact the JSD (with a particular weighting) of the maximum likelihood estimates of the $p_c$, so we call $\text{JSD}_q$ the observed JSD. This can be shown directly by letting $P_c(m) \to x_{c,m}/N_c$ and taking $\pi_c = N_c/N$ (where $N = \sum_c N_c$), in the definition of JSD. The observed JSD is an estimate of the JSD of the underlying probability distributions for circuit $q$. Even if there is no context dependence, however, each $\text{JSD}_q$ will almost always be non-zero due to ordinary finite-sample fluctuations. Thus $\text{JSD}_q$ is significantly different from zero only if it is greater than
$$ \text{JSD}_{\text{threshold}} = \frac{\lambda_{\text{threshold}}}{2 N_q}, \tag{17} $$
where $\lambda_{\text{threshold}}$ is the LLR pseudo-threshold of Eq. (10). Implicit in this relation is the fact that $\lambda_q$ and $\text{JSD}_q$ are entirely equivalent test statistics.
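The equivalence between the rescaled LLR and the entropy-based definition of the JSD can be checked numerically. The sketch below uses illustrative function names, assumes natural logarithms (so the JSD is in nats), and reuses the counts from the single-circuit example of Sec. II A.

```python
import math


def entropy(probs):
    """Shannon entropy in nats, with zero-probability terms contributing zero."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def observed_jsd(counts):
    """JSD of the observed frequencies, with weights pi_c = N_c / N; counts[c][m]."""
    n_c = [sum(row) for row in counts]
    n = sum(n_c)
    p_c = [[x / nc for x in row] for row, nc in zip(counts, n_c)]
    mixture = [sum(col) / n for col in zip(*counts)]
    return entropy(mixture) - sum(nc / n * entropy(p) for nc, p in zip(n_c, p_c))


def llr(counts):
    """Per-circuit log-likelihood ratio between the pooled and per-context models."""
    n_c = [sum(row) for row in counts]
    n = sum(n_c)
    x_m = [sum(col) for col in zip(*counts)]
    return sum(
        2.0 * x * math.log(x * n / (nc * xm))
        for row, nc in zip(counts, n_c)
        for x, xm in zip(row, x_m)
        if x > 0
    )


counts = [[99, 101], [131, 69]]
n_total = sum(sum(row) for row in counts)
print(observed_jsd(counts), llr(counts) / (2 * n_total))
```

The two printed numbers should agree to floating-point precision, which is the content of the statement that $\lambda_q$ and $\text{JSD}_q$ are equivalent test statistics.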

B. Total variation distance
JSD quantifies statistical distinguishability between probability distributions and their average [46], so an estimate of the underlying JSD is a well-motivated measure of the context dependence of a circuit. However, there are other metrics with other meanings. One commonly used in quantum information is the total variation distance (TVD) [47]. The TVD between two distributions $P_1$ and $P_2$ over $M$ events is
$$ \text{TVD}(P_1, P_2) = \frac{1}{2} \sum_{m=1}^{M} |P_1(m) - P_2(m)|. \tag{18} $$
The observed TVD for circuit $q$ ($\text{TVD}_q$) is naturally defined by
$$ \text{TVD}_q = \frac{1}{2} \sum_{m=1}^{M} \left| \frac{x_{q,1,m}}{N_{q,1}} - \frac{x_{q,2,m}}{N_{q,2}} \right|. \tag{19} $$
Here the contexts are indexed "1" and "2", because the TVD is only defined between two contexts, i.e., when $C = 2$. Even if there is no context dependence, observed TVDs between two contexts are generally non-zero because of finite-sample fluctuations. It is often useful to correct for this. Unlike the observed JSD, however, the observed TVD is not simply related to the LLR so there is no simple pseudo-threshold for $\text{TVD}_q$. Instead, we introduce the statistically significant total variation distance (SSTVD). If statistically significant variation is detected for circuit $q$ using the ICTs, we report $\text{SSTVD}_q = \text{TVD}_q$ for that circuit; when no statistically significant context dependence is detected, the circuit has no SSTVD. That is,
$$ \text{SSTVD}_q = \begin{cases} \text{TVD}_q & \text{if } \lambda_q > \lambda_{\text{threshold}}, \\ \text{NULL} & \text{if } \lambda_q \leq \lambda_{\text{threshold}}. \end{cases} \tag{20} $$
Note that we do not define $\text{SSTVD}_q$ to be zero when $\lambda_q \leq \lambda_{\text{threshold}}$. Failure to detect context dependence does not imply that this circuit is probably context independent. This is because not rejecting a null hypothesis in a hypothesis test does not imply anything about whether that null hypothesis is true. For example, one or more $\lambda_q$ could be just below the pseudo-threshold at a global 5% significance and above the pseudo-threshold at a global significance of 6%. Those circuits are therefore quite probably context dependent, meaning that a $\text{SSTVD}_q$ of zero could be misleading.
When analyzing data from many circuits ($Q \gg 1$), it is often useful to summarize any observed context dependence with a single number. One such candidate is the maximum SSTVD over all circuits
$$ \text{max SSTVD} = \max_q \{\text{SSTVD}_q\}, \tag{21} $$
and we will use this statistic in our examples later. The motivation for max SSTVD is that it partially captures worst-case context dependence. For example, without context dependent SPAM, the maximum over gates of the diamond distance between the process matrix for each gate in the two contexts is lower bounded by the maximum true TVD over the circuits, divided by the number of gates in the maximizing circuit. The max SSTVD is an estimate of this maximal TVD. (This link to diamond distance suggests an interesting alternative to max SSTVD: $\max_q [\text{SSTVD}_q / l(q)]$, where $l(q)$ is the length of circuit $q$.) It is also important to note that the value of max SSTVD is, in general, strongly dependent on the choice of circuits, even when divided by circuit length, as the most context dependent circuit might not be in the set of circuits chosen.
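The observed TVD, the SSTVD, and the max SSTVD can be sketched in a few lines. The function names are illustrative, `None` stands in for a circuit with no SSTVD, and the LLR values and pseudo-threshold passed in below are made-up numbers used only to exercise the logic.

```python
def observed_tvd(counts_1, counts_2):
    """Observed TVD between the outcome frequencies of one circuit in two contexts."""
    n1, n2 = sum(counts_1), sum(counts_2)
    return 0.5 * sum(abs(a / n1 - b / n2) for a, b in zip(counts_1, counts_2))


def sstvd(counts_1, counts_2, lam, lam_threshold):
    """Observed TVD if the circuit's LLR exceeds the pseudo-threshold, else None."""
    if lam > lam_threshold:
        return observed_tvd(counts_1, counts_2)
    return None


def max_sstvd(values):
    """Largest SSTVD over circuits, ignoring circuits with no SSTVD."""
    significant = [v for v in values if v is not None]
    return max(significant) if significant else None


lam_threshold = 8.0  # made-up LLR pseudo-threshold, for illustration only
values = [
    sstvd([99, 101], [131, 69], lam=10.5, lam_threshold=lam_threshold),
    sstvd([108, 92], [107, 93], lam=0.01, lam_threshold=lam_threshold),
]
print(values, max_sstvd(values))
```

Returning `None` rather than zero for undetected circuits mirrors the point made above: failing to reject the null hypothesis is not evidence that the circuit is context independent.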
There are some subtleties to SSTVD, which can become important in slightly unusual circumstances. Perhaps the most significant of these is that the SSTVD of a circuit can sometimes significantly over-estimate the true TVD of the circuit. For example, consider a situation whereby the TVD between contexts is the same and fairly small for all circuits, and context dependence is detected in only some of the circuits (because the effect is small, so the chance that it is detected in any particular circuit is low). The circuits in which SSTVD is reported as non-null must have an observed TVD large enough so that the LLR test triggers, and the minimum such observed TVD could be significantly larger than the true TVD. If this is the case, any non-null SSTVD is a significant over-estimate of the true TVD. Subtleties of this sort can be accounted for by looking at additional properties of the observed TVD distribution. However, this is not to suggest that looking at the full observed TVD distribution is always preferable in practice: the SSTVD is a convenient tool for highlighting the rough size of any detected context dependence without requiring subtle, case-specific analysis of a distribution.

IV. SIMULATED DRIFT DETECTION
In this section we present a simulated example showing how to use the tools presented above to detect slow drift. This example uses data from GST circuits, but alternatives such as RB circuits could equally have been used. We consider long-sequence GST (LSGST) circuits [1] built from two gates: $\pi/2$ rotations around $\sigma_x$ and $\sigma_y$. Each LSGST circuit begins with one of six short state-preparation sequences, followed by one of six short "germ" sequences repeated $O(K)$ times, and concludes with one of six short pre-measurement sequences. These building blocks are chosen so that the collection of LSGST circuits is informationally complete [1,40]. Here, $K$ ranges from 0 to 256 with logarithmic spacing, yielding 1405 unique quantum circuits. Below, the size of $K$ is referred to as the "core" circuit length. The specific circuits used are given in Appendix A.
We simulated repeating these circuits $N = 100$ times in each of 5 consecutive time periods $t = 1, 2, \ldots, 5$ (the contexts). In addition to small time-independent unitary errors in the gates, we simulated slow drift by adding over-rotations of $(t - 1) \cdot 10^{-3}$ radians in time period $t$ to both gates. We tested for drift (context dependence between time periods) using a global significance level of $\alpha = 5\%$.
There are five contexts (the five time periods), so there are many ways to test for drift: we can implement the tests introduced earlier on all the data (jointly comparing the five contexts) and/or we can implement up to 10 pairwise comparisons between pairs of different time periods (comparing pairs of contexts). We'll demonstrate all of these analyses, resulting in 11 comparisons between contexts in total. Therefore, to guarantee a global significance of 5% we perform each comparison between contexts at a significance of $(5/11)\% \approx 0.45\%$ (this is a Bonferroni correction), with the aggregate LLR test and the ICTs performed for each comparison using the particular multi-test correction procedure specified earlier (so, for example, each aggregate LLR test is performed at $(5/22)\% \approx 0.23\%$ significance). For the joint comparison of all five time periods, we find that the signed standard deviation of the aggregate LLR $N_\sigma$, defined in Eq. (13), is $N_\sigma \approx 21$; the threshold for drift detection is only $N_\sigma \approx 2.9$ (as given by Eq. (14) with $\alpha \approx 0.23\%$). Thus we have detected drift with extremely high confidence. The ICTs also detect drift, finding 21 circuits to be significant.
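The significance bookkeeping in this paragraph is plain arithmetic, spelled out in the short sketch below (variable names are illustrative).

```python
from math import comb

contexts = 5
alpha_global = 0.05

pairwise = comb(contexts, 2)                    # 10 pairwise comparisons of time periods
comparisons = pairwise + 1                      # plus the joint five-way comparison
alpha_comparison = alpha_global / comparisons   # Bonferroni share per comparison
alpha_aggregate = alpha_comparison / 2          # share used by each aggregate LLR test

print(comparisons, f"{alpha_comparison:.4%}", f"{alpha_aggregate:.4%}")
```

This gives 11 comparisons, each at about 0.45%, with about 0.23% allotted to each aggregate LLR test, matching the figures quoted in the text.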
To obtain more detailed, diagnostic information, we turn to the pairwise time period comparisons. These results are summarized in Fig. 2. The upper triangle in the upper plot of Fig. 2 shows $N_\sigma$ for each pairwise comparison. For the longest time difference comparison $N_\sigma \approx 34$ (the threshold for drift detection is still $N_\sigma \approx 2.9$). The lower triangle in the upper plot of Fig. 2 shows the number of circuits that were found to have statistically significant drift for each pairwise comparison. If this is zero and the $N_\sigma$ is not statistically significant then drift is not detected for that pairwise comparison; otherwise it is. Therefore, none of the comparisons between neighboring time periods detect drift, but all other comparisons do detect drift. Drift is thus detected whenever the difference in rotation angle between time periods is at least $2 \cdot 10^{-3}$ radians. As expected, the statistical significance of the observed effect, as quantified by $N_\sigma$, increased with time delay. Note that, while no drift was detected between neighboring time periods, we know that drift was present (because we designed the model). This drift could have been made visible to our tools in either of two ways. Firstly, we could have included longer sequences that would be more sensitive to small rotations. Alternatively, we could simply have collected more data.
Fig. 2 also demonstrates that these tools allow for a rough diagnosis of the drift, without requiring computationally expensive parameter estimation. The lower plot of Fig. 2 shows the distribution of the per-circuit observed JSDs, as defined in Eq. (15), versus "core" circuit length (see above), for the longest delay period $t = 1$ vs. $t = 5$. This shows that the magnitude of the drift grows with circuit length, implying that the gates are drifting, rather than the SPAM. Note that only those circuits with an observed JSD above the pseudo-threshold for statistical significance, given by Eq. (17), have been flagged up by our tests as being context dependent at 5% global significance (there are 25 of them, as shown in the upper plot). Looking, however, at the trend in the observed JSD distribution versus sequence length also provides additional, if less rigorous [48], evidence of an increase in the underlying JSD with length (without context dependence, the observed JSD would be uncorrelated with circuit length). This highlights the utility of further data analysis, after context dependence has been first detected with statistically rigorous hypothesis testing.
FIG. 2. An example using our techniques for drift detection on simulated data. Data was obtained by repeating the same 1405 circuits 100 times in each of five time periods. The circuits contain $\pi/2$ rotations around $\sigma_{x/y}$ and are informationally complete, meaning that they are collectively sensitive to drift in every aspect of gates and SPAM. Drift was modeled as time-dependent over-rotations in both gates, by $(t - 1) \cdot 10^{-3}$ radians in time period $t = 1, 2, \ldots, 5$. Upper plot, upper triangle: $N_\sigma$ of total model violation for pairwise comparisons between the five pools. Upper plot, lower triangle: the number of circuits that were found to contain statistically significant drift. Lower plot: A violin plot of the estimated Jensen-Shannon divergence (JSD) for each circuit vs. core circuit length for the $t = 1$ to $t = 5$ time period comparison ("core" circuit length is defined in the main text). Any JSD above the pseudo-threshold is significantly non-zero, at 5% global statistical significance, implying that drift has been rigorously detected in the associated circuits. As discussed in the main text, by looking at which circuits have a high JSD it is possible to infer the form of the errors.
Looking at the specific details of the circuits, we observe that the largest observed JSDs are seen in circuits where the same gate is repeated sequentially many times. This strongly suggests that the gate rotation angles are drifting, rather than the rotation axes (which those circuits would not amplify sensitivity to) or the stochastic error rates (changes in which would manifest in all longer sequences). This is, of course, consistent with the simulated error model. Jupyter notebooks that contain this more detailed analysis, and which can be used to repeat and extend these simulations, are included as supplemental material [49].
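The amplification of a rotation-angle drift by repeated gates can be illustrated with a deliberately simplified model. The sketch below is not the LSGST simulation described above: it assumes a single circuit made of one $x$-rotation repeated many times on $|0\rangle$, perfect SPAM, no errors other than the drifting over-rotation, and hypothetical function names. It evaluates the true TVD between time periods $t = 1$ and $t = 5$ as the number of repetitions grows.

```python
import math


def prob_zero(num_gates, over_rotation):
    """Probability of outcome "0" after num_gates x-rotations by (pi/2 + over_rotation)
    applied to |0>, followed by a sigma_z measurement (ideal SPAM assumed)."""
    total_angle = num_gates * (math.pi / 2 + over_rotation)
    return math.cos(total_angle / 2) ** 2


def tvd_between_periods(num_gates, t1, t2, step=1e-3):
    """True TVD between time periods t1 and t2 for this two-outcome circuit,
    with an over-rotation of (t - 1) * step radians in period t."""
    p1 = prob_zero(num_gates, (t1 - 1) * step)
    p2 = prob_zero(num_gates, (t2 - 1) * step)
    return abs(p1 - p2)


for length in (1, 17, 65, 257):
    print(length, tvd_between_periods(length, 1, 5))
```

In this toy model the change in outcome probability is about 0.002 for a single gate and grows by more than two orders of magnitude for 257 repetitions, which is the kind of length dependence that points to drifting rotation angles rather than drifting SPAM.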

V. EXPERIMENTAL DRIFT AND CROSSTALK DETECTION
To further demonstrate the practical utility of our tools, we applied them to detect and quantify drift and crosstalk in the publicly accessible ibmqx3 [50] [38,51]. This is a 16-qubit superconducting device with connectivity on a 2 × 8 grid, shown schematically in Fig. 3, resembling a ladder. We ran circuits over {I, H, S} gates on a single qubit (Q15) to see whether:
(I) The behavior of this qubit was affected by simultaneous CNOT gates applied to various "rungs" of the "ladder".
(II) The behavior of this qubit drifted in time.
To do this, we implemented the circuits of linear inversion GST (LGST) [52] over {I, H, S} on Q15 in multiple contexts. LGST is the simplest, least experimentally intensive form of GST, requiring only 40 unique circuits for these gates. The exact circuits are listed in Appendix A, and all the circuits are depth 7 or less. For each rung, we compare the output of LGST circuits on Q15 in the following time-ordered contexts:
(a) All other qubits idle.
(b) The CNOT on the rung is applied whenever a gate is applied to Q15.
(c) All other qubits idle.
This experimental design was chosen to enable detection and isolation of both drift and crosstalk. If no context dependence is detected between (a) and (c), then we can safely rule out drift. Any context dependence between (a) and (b) may then be ascribed to crosstalk (modulo caveats discussed later). Access constraints prohibited running all the circuits for a rung in one submission. Therefore, for each rung, we submitted the circuits for each context [(a)-(c)] in sequential batches. The delay between executed batches ranged from a few seconds to several minutes, depending on machine availability.

FIG. 3. Quantifying the effect of CNOT gates on the performance of qubit Q15 in ibmqx3 [38]. Top: a schematic of ibmqx3 with Q15 highlighted. Circles indicate qubits and arrows denote CNOT gates, pointing from the control to target. Bottom: The effect of driving each of the seven "ladder-rung" CNOT gates on short circuits run on qubit Q15, as quantified by max SSTVD, which is an empirical, total-variation-distance based measure that we propose for estimating worst-case context dependence over circuits (see main text). The max SSTVD from driving each CNOT is plotted immediately below the corresponding rung in the schematic. The CNOT between qubits Q14 and Q3 has a large effect on the behavior of circuits on Q15, which corresponds to changing the outcome probabilities of a set of short circuits on Q15 by 27.7% in the worst case. The circuits run on Q15 were those of linear-inversion gate set tomography, and are discussed in the main text.

To implement the tests, we picked a global significance of 5%. To maintain this global significance level, a Bonferroni correction was used to split this 5% evenly over the comparisons for the seven rungs and the (a) to (b) and (a) to (c) comparisons for each rung (we do not compare (b) to (c) so as to avoid additional local significance dilution). This results in implementing each pairwise context comparison at a significance of (5/14)%, i.e., approximately 0.36%, noting that each pairwise comparison itself contains 40 per-circuit comparisons (the ICTs) and an aggregate comparison, as described earlier. (The resulting data, along with the full analysis, is provided in supplemental material [53].)
We detected no drift. That is, for all seven rungs, no change was detected between any (a) and corresponding (c) context. This is interesting in its own right, but it is also critical for the crosstalk detection. This is because it implies that any variation between any (a) and (b) contexts is probably not due to random drift, and thus, if differences are detected, that they are almost certainly due to the CNOT gate on the rung in question.
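The Bonferroni split described above amounts to simple arithmetic, sketched here in Python. The variable names are illustrative only, and the further division of each pairwise comparison's significance among its 40 per-circuit tests and the aggregate test is not modeled.

```python
# Bonferroni split of the 5% global significance over the pairwise
# context comparisons: seven rungs, two comparisons per rung.
global_significance = 0.05
num_rungs = 7
comparisons_per_rung = 2  # (a) vs (b), and (a) vs (c)

num_comparisons = num_rungs * comparisons_per_rung
per_comparison_significance = global_significance / num_comparisons

print(f"{num_comparisons} comparisons, each at "
      f"{100 * per_comparison_significance:.3f}% significance")
# prints: 14 comparisons, each at 0.357% significance
```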
Our results comparing contexts (a) and (b) for each rung are summarized in Fig. 3, where we plot the max SSTVD for each rung (see Eq. (21)). In all cases, the application of CNOT gates on the other qubit pairs influences the behavior of Q15 to a statistically significant degree, as the max SSTVD is non-zero (the SSTVD of a circuit is "null" if context dependence was not detected for that circuit; see Eq. (20)). The observed maximum SSTVD broadly decreases with the connectivity graph distance between Q15 and the driven rung. Thus closer CNOT gates generally affect Q15 more. For the CNOT between Q3 and Q14, one of the two closest rungs to Q15, we observed a max SSTVD of around 28%, corresponding to the gate sequence HSSSSH. For this circuit, out of 1024 measurement results, just 2 "1" outcomes were observed in context (a), while 286 "1" outcomes were observed in context (b). This suggests that applying the CNOT gate to this rung changed the outcome probabilities of this circuit on Q15 by about 28%.
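The 28% figure can be checked directly from the counts quoted above. The Python sketch below computes the total variation distance between the observed outcome frequencies in contexts (a) and (b) for the HSSSSH circuit. The function name and the dictionary layout are assumptions made for illustration, and the "0" counts are inferred from the stated 1024 results per context; the significance test that decides whether the SSTVD is "null" is not reproduced here.

```python
def total_variation_distance(counts_a, counts_b):
    """TVD between the empirical outcome frequencies of two contexts."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    outcomes = set(counts_a) | set(counts_b)
    return 0.5 * sum(
        abs(counts_a.get(o, 0) / total_a - counts_b.get(o, 0) / total_b)
        for o in outcomes
    )


# Counts for the HSSSSH circuit on Q15, 1024 results per context.
context_a = {"0": 1022, "1": 2}    # all other qubits idle
context_b = {"0": 738, "1": 286}   # CNOT on the Q3-Q14 rung driven

tvd = total_variation_distance(context_a, context_b)
print(f"observed TVD = {tvd:.3f}")  # prints: observed TVD = 0.277
```

For a single-qubit measurement this reduces to the difference in the "1" frequencies, (286 - 2)/1024 ≈ 0.277, in agreement with the 27.7% quoted in the caption of Fig. 3.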
The obvious cause of changes from contexts (a) to (b) is crosstalk, but there is an important caveat that needs to be addressed before we can conclude this. The circuits on Q15 took longer when applying a CNOT to a rung (context (b)) than when implemented in isolation (context (a) or (c)). This is because CNOT gates take substantially longer to implement than 1-qubit gates on ibmqx3 [38], and in context (b) a single CNOT was applied in parallel with every gate acting on Q15. Thus a change in the output probabilities of Q15 from context (a) to (b) could simply be due to the circuits taking longer, allowing more decoherence to build up on Q15.
This effect, however, will be independent of the rung being tested, which allows us to bound it. The max SSTVDs between context (a) and (b) for the three furthest rungs are all approximately equal (see Fig. 3), and much lower than the max SSTVDs for the other rungs. These max SSTVDs provide a rough baseline for the maximal amount of the context dependence that can be attributed to this timing difference; any excess in the max SSTVD above this level is almost certainly due to crosstalk.
To fully isolate the crosstalk caused by a CNOT from any change in circuit performance caused by increased circuit duration, the time for each circuit layer should be fixed for all contexts, which could be more easily incorporated into experiments with lower-level access to a device.This is illustrative of the need to carefully account for all "nuisance contexts" that may be unintentionally or unavoidably changing with the context of interest.These nuisance contexts should be removed if possible, or, as here, accounted for when not.

VI. DISCUSSION
To our knowledge, the tools we have presented and demonstrated herein are the first designed for detecting and characterizing generic context dependence in generic quantum circuits. However, one particularly important example of context dependence is crosstalk, and there is already a widely used tool for characterizing crosstalk: simultaneous randomized benchmarking (SRB) [7,9]. For this reason, we now briefly discuss the relationship between our tools and SRB. In essence, SRB involves comparing a qubit's RB error rates in two contexts, corresponding to (1) leaving neighbor qubits idle, and (2) driving them. This then provides a quantification of crosstalk in terms of the increase in the RB error rate caused by driving neighboring qubits.
Our methods complement those of SRB: our tools are not restricted to RB circuits, but unlike SRB they cannot directly provide a "crosstalk error rate" for the gates. Moreover, our methods cannot be applied directly to SRB data, because SRB uses independently sampled (and so almost certainly different) random sequences in each context. Our methods can, however, be used in concert with the SRB analysis if SRB is modified slightly, so that each random sequence appears in both the driven- and undriven-neighbor(s) contexts. With data from circuits of this sort, our tools complement the standard SRB analysis; they provide statistically rigorous crosstalk detection, something not directly addressed by the SRB analysis. Moreover, our tools allow for the testing of each individual random SRB sequence for sensitivity to driving, and this can potentially help to identify the main sources of crosstalk (particularly if using varied-sampling-distribution RB methods such as those in Ref. [8]).
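The modification to SRB suggested above, in which each random sequence is reused in both contexts, can be sketched in Python as follows. This is only a schematic under simplifying assumptions: the function names, gate labels and context labels are hypothetical, and sequences are drawn uniformly from a list of labels rather than from the Clifford group with a final inversion gate, as a real RB experiment would require.

```python
import random


def sample_shared_sequences(gate_labels, lengths, sequences_per_length, seed=0):
    """Sample random gate sequences once, so they can be reused in every context."""
    rng = random.Random(seed)
    sequences = []
    for length in lengths:
        for _ in range(sequences_per_length):
            sequences.append(tuple(rng.choice(gate_labels) for _ in range(length)))
    return sequences


def build_paired_experiment(sequences,
                            contexts=("neighbors_idle", "neighbors_driven")):
    """Schedule every sequence in every context, giving matched data pools."""
    return [(context, seq) for seq in sequences for context in contexts]


# Placeholder gate labels; a real experiment would use Clifford gates.
sequences = sample_shared_sequences(["G0", "G1", "G2"], lengths=[2, 4, 8],
                                    sequences_per_length=5)
experiment = build_paired_experiment(sequences)
```

Because the same sequence appears in both pools, the outcome counts for each sequence can be compared across the two contexts circuit by circuit, which is what the tests presented here require.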

VII. CONCLUSIONS
Improving the performance of future quantum processors will require quantifying, understanding, and eventually mitigating a wide variety of context-dependent errors, such as crosstalk [7][8][9] and drift [23]. The techniques presented and demonstrated here are simple, general, and statistically rigorous ways to detect and quantify context-dependent errors, independent of their underlying physical causes. These methods are also computationally lightweight, and can be applied to any collection of quantum circuits on any number of qubits. We therefore recommend that almost all device characterization protocols be augmented with these tools. They can even be applied to archived data if any context-identifying information, such as time stamps, was kept. We expect that these techniques will contribute to the toolkit for calibrating and debugging next-generation qubits. For easy use, they have been integrated into (and documented in) the open-source pyGSTi software package [54].
FIG. 1. An illustration of how to detect and quantify context dependence in a quantum information processor by repeatedly performing a quantum circuit in two or more contexts. In this simple example, a Bell state is prepared during two different time periods (am/pm), to test for time variation; or while an adjacent pair of qubits is or is not being driven, to test for crosstalk. The measurement outcome frequencies for the two contexts are compared to determine if the circuit behavior is the same across contexts. If not, the change is quantified. Multiple test circuits and a physical model of the device can sometimes enable identification of the underlying cause and indicate the size of the effect.