Experimental Characterization of Crosstalk Errors with Simultaneous Gate Set Tomography

Crosstalk is a leading source of failure in multiqubit quantum information processors. It can arise from a wide range of disparate physical phenomena, and can introduce subtle correlations in the errors experienced by a device. Several hardware characterization protocols are able to detect the presence of crosstalk, but few provide sufficient information to distinguish various crosstalk errors from one another. In this article we describe how gate set tomography, a protocol for detailed characterization of quantum operations, can be used to identify and characterize crosstalk errors in quantum information processors. We demonstrate our methods on a two-qubit trapped-ion processor and a two-qubit subsystem of a superconducting transmon processor.

Crosstalk is a leading source of failure in multiqubit quantum information processors. It can arise from a wide range of disparate physical phenomena, and can introduce subtle correlations in the errors experienced by a device. Several hardware characterization protocols are able to detect the presence of crosstalk, but few provide sufficient information to distinguish various crosstalk errors from one another. In this article we describe how gate set tomography, a protocol for detailed characterization of quantum operations, can be used to identify and characterize crosstalk errors in quantum information processors. We demonstrate our methods on a two-qubit trapped-ion processor and a two-qubit subsystem of a superconducting transmon processor.

I. INTRODUCTION
Quantum information processors have demonstrated oneand two-qubit quantum operations with error rates below the threshold required for fault-tolerant quantum computation [1][2][3][4][5][6][7][8][9][10]. One of the biggest obstacles to achieving similarly low error rates in large, integrated quantum processors is the appearance of a large class of errors known collectively as crosstalk [11][12][13][14][15]. Crosstalk can increase error rates on individual qubits, and can also cause errors on different qubits to become correlated with one another. These correlations are particularly damaging for error correction [16][17][18], and optimizing the power of quantum error correction requires understanding and strictly controlling crosstalk errors.
A number of techniques to characterize [15,[36][37][38][39][40][41][42][43][44][45][46][47][48][49][50][51][52] the impact of crosstalk have been developed and implemented. Randomized methods -such as simultaneous FIG. 1. Detecting and measuring crosstalk errors with gate set tomography. To probe crosstalk between two qubits, we execute a set of two-qubit quantum circuits (a) whose layers comprise parallel, single-qubit gates. These circuits consist of initialization into ρ ≈ |00 00|, a short statepreparation operation, an n-fold repeated germ operation, a short measurement-preparation operation, and finally measurement in the computational basis. The measured outcomes of these circuits are used to fit each of three models (b-d): (b) Crosstalk-free models assume that each elementary gate can be described by a single-qubit process matrix. (c) Contextdependent models assume that each gate can be described by a single-qubit process matrix conditioned on the neighboring qubit operation. (d) General crosstalk models assume each two-qubit layer can be described by a full, two-qubit process matrix with no additional constraints. (e) Non-Markovian behavior cannot be described by two-qubit process matrices. We use tools from statistical model selection to decide which model offers the best balance between accuracy (describing the data) and simplicity (using the fewest parameters). randomized benchmarking (RB) [41], correlated RB [42], cycle benchmarking [43], and Pauli noise learning [44] are among the most popular, as they are generally simple to implement and analyze. However, these methods are typically sensitive to coherent errors at only second order [45][46][47], and rely on twirling techniques that obfuscate the underlying physical sources of observed errors. Model-free methods based on hypothesis testing of probability distributions [15,48] can identify the presence of crosstalk errors, but cannot characterize those errors. Methods based on quantum process tomography, such as selective process tomography [49] and direct characterization of quantum dynamics [50], can be adapted to provide insight into certain types of crosstalk (such as coherent two-qubit interactions) but are not designed to detect others (such as how gate errors depend on neighboring operations), and inherit many of the well-known problems of quantum process tomography [51]. Finally, specialized techniques, like cross-Rabi oscillations or direct capacitance measurements [52] are useful for learning specific physical parameters, but they are generally unable to detect other crosstalk errors that may be present, or even dominant, in a system.
In this article we demonstrate how to use gate set tomography (GST) [5,51,53] to perform a detailed investigation of crosstalk errors between two subsystems of a quantum information processor. GST is a protocol designed to provide detailed characterization of qubit dynamics by estimating a set of process matrices describing the various operations of a processor. So in principle, we could simply perform GST on a multi-qubit system, obtain multi-qubit process matrices describing the gates, and look for the presence of crosstalk errors in these process matrices. But such process matrices are large and unwieldy. It is not clear what constitutes conclusive evidence for crosstalk errors, nor how to reliably distinguish real effects from statistical noise (finite-sample fluctuations).
Instead, we take a two-stage approach. First, we construct three parameterized models that, by design, allow for different degrees and kinds of crosstalk errors (see Table I and Fig. 1 b-d). We fit them to data, and we evaluate how well they explain the data [54]. We can infer a surprising amount of information just from this evaluation, because we know exactly what kinds of crosstalk each model can describe, and for each model we can quantify the amount of observed error that it failed to describe. We also use this analysis to select the simplest model that fits the data well, estimate its parameters to obtain "best fit" process matrices describing the gates and their errors, and analyze those process matrices in detail to understand the nature of the crosstalk errors. When the best-fit model only includes certain effects, we can apply customized analysis techniques that are tailored to those particular crosstalk errors. Finally, by comparing the measured errors and their magnitudes against candidate physical device models, we draw conclusions about the underlying physical causes of the ob-served crosstalk errors -and, potentially, how to mitigate them.
In this work we focus on identifying and analyzing crosstalk induced by parallel single-qubit gates on two single-qubit subsystems. However, all of our methods are straightforwardly extensible to larger systems, e.g., to crosstalk between two-qubit subsystems, induced by entangling gates. We demonstrate our methods on two DOE-sponsored quantum computing testbed platforms -the transmon-based Advanced Quantum Testbed (AQT) [55] and a prototype of the trapped-ionbased Quantum Scientific Computing Open User Testbed (QSCOUT) [56,57] -and we discuss and compare the errors observed on these two devices.

II. MODELING CROSSTALK ERRORS
Crosstalk in quantum information processors can arise from a wide range of disparate physical mechanisms. Our goal in this work is to characterize the crosstalk errors in quantum logical operations induced by those mechanisms. The standard model for any Markovian error, including crosstalk errors [15], is a process matrix -a completely positive, trace preserving map on density matrices. Multi-qubit process matrices can describe a wide variety of crosstalk effects. To disaggregate different categories of crosstalk errors, we construct a hierarchy of process matrix models, each of which can only represent certain kinds of crosstalk errors. We will fit these models to data, compare their ability to describe the observations, select the best model, and use it to draw quantitative conclusions about the crosstalk errors present in the device. In this section, we present (1) those models, (2) our methods for fitting them to data and evaluating fit quality, (3) how we analyze an estimated model, and (4) the data/model analysis pipeline used in our experiments.

A. Families of crosstalk error models
Reference [15] defines two conditions that must be satisfied for a quantum processor to be free of crosstalk: (i) Locality of operations -the process matrices decompose as tensor products; (ii) Independence of local operations -each component in the tensor product depends only on what gate is acting on that subsystem.
Each condition defines a constraint on multi-qubit process matrices, which can otherwise describe many forms of crosstalk. By systematically enforcing or relaxing constraints (i-ii), we can construct the three families of models shown in Table. I. They describe increasingly complex crosstalk phenomena.

Model
Decomposition 683   TABLE I. Three nested families of process matrix models for quantum logic operations. Each successively larger model can capture richer forms of crosstalk between qubits than the previous ones. Also shown are the decompositions of the process matrices as tensor products of local operations when operations A and B are applied simultaneously to subsystems 1 and 2, respectively. Np is the number of free parameters required to describe the model.
The crosstalk-free model cannot model any type of crosstalk in the system -each gate is required to act locally and independently. The process matrix representing each layer of gates is required to have a tensorproduct form, and the local process matrix on each subsystem is independent of its context (i.e., which operations are applied to other subsystems).
The context-dependent model relaxes the independence constraint. Each gate still acts locally -a layer's process matrix must have tensor-product form -but the local operations on a subsystem can vary from layer to layer (i.e., depend on context). This model can capture pulse-spillover effects, where the operations on one subsystem are perturbed by operations on the other. But, like the crosstalk-free model, the context-dependent model cannot model entangling Hamiltonians or correlated errors that create correlations between the two subsystems.
The general crosstalk model relaxes both constraints, and represents each parallel, single-qubit layer as a full two-qubit process matrix. This model can capture all Markovian crosstalk effects, including context dependence, entangling operations, and classical correlation.
In principle, a complete description of the system's crosstalk errors could be extracted just by estimating the parameters of the general model. In practice, analyzing those process matrices and deciding which effects are statistically significant is difficult and time-consuming. We let the model-fitting process do that work for us. If, e.g., the effects of crosstalk are not statistically significant, then model selection criteria (see below) will indicate that the data are consistent with a crosstalk-free model. Conversely, if only the general model fits well, then this is unambiguous evidence of nonlocal interactions, which we can track down by detailed analysis of process matrices.
Because our models utilize process matrices, they are all explicitly Markovian. But real devices are often non-Markovian -experiments on them yield data inconsistent with the predictions of any process matrix model. This has implications for the use of process matrix mod-els to diagnose crosstalk. As discussed in [15], correlations induced by non-Markovianity can easily be mistaken for crosstalk errors, so finding some way to acknowledge and incorporate non-Markovian errors is extremely important! Later in this section we discuss techniques for quantifying non-Markovianity, and what to do when it (almost inevitably) appears.
To predict data, our models also need to describe errors in state preparation and measurement (SPAM). We only consider systems that initialize and measure all qubits simultaneously. In each of our models, the density matrices and POVM effects that represent SPAM operations are required to respect the same constraints (e.g. tensor product decomposition) as the gate operations. Thus, both the crosstalk-free and context-dependent models incorporate initial states of the form ρ = ρ (1) ⊗ ρ (2) and POVM effects of the form F i,j = F (1) i ⊗F (2) j . SPAM operations in the general crosstalk model are unconstrained, and can describe entangled or correlated initial states, and arbitrarily correlated measurements.

B. Fitting models to data
Fitting process matrix models to data is typically done using some variant of process tomography [50]. While standard quantum process tomography can be used to characterize quantum gate operations, it is not selfconsistent, i.e., it assumes access to input states and measurements which are already highly accurately characterized [51]. We use GST, a protocol introduced to solve this self-consistency problem [5,51,53]. GST experiments include all the circuits required for state, measurement, and process tomography, plus a few additional circuits that establish a mutually consistent reference frame. This allows GST to fully characterize all state preparation, measurement, and gate operations on a processor, concurrently and self-consistently.
As illustrated in Fig. 1, GST relies on circuits of the form where p i is a fiducial state preparation subcircuit, m k is a fiducial measurement preparation subcircuit, and g k is an n-fold repeated "germ" subcircuit. Data from running these circuits can be used to estimate a process matrix for each distinct circuit layer [53]. In our experiments, we use and estimate the nine 2-qubit layers formed by all possible parallel combinations of a 3-element single-qubit gate set: where X π/2 and Y π/2 are π/2 rotations around X and Y , respectively, and I is an idle gate. Our experiments use no entangling gates. As in standard GST [5,53], the fiducial operations are chosen to be informationally complete, and the germs are chosen to amplify all observable components of a full two-qubit process matrix model for each layer. See Table VI for the specific subcircuits used in our experiments. Once data has been obtained, we use maximum likelihood estimation to find a best-fit estimate for each of the three crosstalk models, by varying its parameters θ to maximize the likelihood function L(θ) = Pr(data|θ). We denote the maximum likelihood estimate of the ith model by θ i . We denote the maximum value of the likelihood function for the ith model by L (i) = L( θ i ), and refer to it as the likelihood of model (i).
To construct, fit, and analyze these models, we use pyGSTi [58,59], a Python implementation of GST that includes robust routines for fitting predictive models of quantum information processors to data, and analyzing the resulting estimates.
C. Comparing, selecting, and validating models How well a candidate model (i) fits data is captured by its likelihood, L (i) . Extracting useful information requires some simple manipulations. We measure the quality of model (i)'s fit by the log-likelihood ratio between it and a "maximal model" that has no structure at all, and can assign an independent probability to each measurement outcome a of each circuit b, where the maximum likelihood of the maximal model, is achieved by predicting the observed frequency f a,b after N b measurements of circuit b. The log-likelihood ratio is a standard hypothesis-testing statistic. Wilks' theorem [60] states that when the data are actually generated by model (i) with some parameters θ, λ (i) is a χ 2 k random variable, with k equal to the difference between the number of free parameters (N p ) for model (i) and the maximal model. Under this null hypothesis, λ (i) = k and ∆λ (i) = √ 2k. The data are inconsistent with model (i) if and only if λ (i) is inconsistent with a χ 2 k distribution. In particular, when the data are not consistent with model (i), λ (i) will be larger than its expected value under the null hypothesis, k. How much model (i) is violated can be quantified by the number of standard deviations by which λ (i) exceeds its expected value under the null hypothesis, Ideally, we would simply choose the smallest model that fits the data -i.e., the smallest model for which N σ is negligible. But in practice, most systems display enough non-Markovian behavior that no model -not even the general (process matrix) model -fits the data that well. In this case, we need a criterion for identifying how much better (or worse) one model is than another.
To derive such a criterion [54], we observe that Wilks' theorem implies that removing exactly n "useless" parameters from a model increases λ by exactly n. So, if we consider two equally valid models (i) and (j), with (j) nested within (i) and having n fewer parameters than it, then we expect λ (j) − λ (i) ≈ n. If we observe λ (j) − λ (i) n, this suggests that the extra parameters in the larger model (i) are not useless -i.e., they describe real effects. But Akaike's derivation of his eponymous AIC [61] demonstrates a scenario where using the larger model to fit those effects actually decreases predictive accuracy, unless λ (j) − λ (i) ≥ 2n. We conclude that although there are multiple criteria for deciding which model is "better" in a given situation, they share a simple form: Is λ (j) − λ (i) ≥ αn for some α?
To compare two nested models 1 , we use a quantity that we call the evidence ratio of (i) against the smaller model (j) [54]: where N (i) p and N p (j indicate the number of parameters in models (i) and (j), respectively. If γ (i,j) ≤ 1, then there is no evidence against the smaller model (j) -the larger model (i)'s extra parameters are functionally useless -and so we always choose (j). If 1 < γ (i,j) ≤ 2, then the data provides weak evidence against the smaller model, but the AIC suggests its predictions would still be more accurate. Even when γ (i,j) > 2, we may still choose the smaller model if we prioritize simplicity, but for any use case there will be some threshold beyond which the smaller model must be rejected. In general, γ (i,j) normalizes the weight of evidence against the smaller model, on a per-parameter basis, and provides a quantitative measure for comparing two models.

D. Quantifying unmodeled error
Statistical measures of model violation like N (i) σ and γ (i,j) quantify the amount of evidence for errors outside a given model. They do not quantify the magnitude of those errors. For example, they depend strongly on the amount of data taken. In many circumstances, we care more about the size of unmodeled errors than about the amount of evidence that they exist. In this work, we quantify the size of unmodeled errors using wildcard error [62].
Wildcard error can quantify the per-gate deviation between a model's predictions and observed data. To do this, we assign a minimal wildcard model to an estimate. A wildcard model assigns to each estimated gate g a number w g ≥ 0, and to each circuit C the total w for all the gates in it: w C = g∈C w g . Adding a wildcard model explicitly relaxes the estimate's prediction for each circuit C: if the estimate originally predicted outcome distribution p C , then the wildcard-augmented estimate predicts only that C's outcomes will be drawn from some p C such that p C − p C ≤ w C . A minimal wildcard model is an assignment {w g } that just barely makes the estimate statistically consistent with the data. We only use singleparameter wildcard models that assign a single wildcard error rate (W (i) ) to all gates in the estimate of a model (i).
The minimal amount of W required to reconcile an estimate with data tells us whether unmodeled errors are dominant or negligible. In this work we use very simple wildcard models that assign a single wildcard error rate (W (i) ) to all gates in a model (i). If the W assigned to the gates in a model's estimate is significantly less than their average diamond error (¯ ) [63][64][65], that model explains most of the observed error. But if W ≥¯ , unmodeled errors may be dominant, and the model should probably be discarded or not taken seriously.
Unmodeled errors -heralded by significant W -can appear in our analysis from two distinct causes. If the general model cannot fit the data, then its unmodeled errors constitute some sort of non-Markovian dynamics, since the general model (by construction) can model all Markovian errors on the gates. Any non-Markovian effect will also go unmodeled by the smaller models (crosstalkfree and context-dependent). But certain crosstalk errors are also excluded by those models (again, by construction).
When data show evidence of non-Markovianity (as is often the case), none of the three models will fit the data well. But we can use wildcard error analysis to roughly estimate the magnitude of crosstalk errors even in the presence of non-Markovianity. To do so, we assign wildcard error to each model. The general model serves as a baseline; only non-Markovian errors contribute to its W . A smaller model's W accounts for both non-Markovian errors and the crosstalk errors excluded by that model. If the smaller model's W is significantly higher, that indicates the presence of crosstalk errors that are not dominated by non-Markovianity. We will see examples of this scenario in the experimental data.

E. Metrics
The analysis in the preceding section can provide extensive high-level information about whether whole classes of crosstalk error are present or absent, and about their overall magnitude. But it also lets us select one of the three models as the best fit to the data -i.e., the one that best balances simplicity and explanatory power. Once this model has been selected, we examine the process matrices that it assigns to each gate. We can extract detailed performance metrics, identify dominant error channels, and/or use the process matrices to predict the processor's performance on specific tasks and benchmarks.
Reductive gate error metrics like diamond distance or entanglement infidelity provide rough summaries of system performance, but to probe the details of estimated error models we transform process matrices to error generators [66]: where G is a gate's process matrix, H is a Hamiltonian superoperator that generates a perfect unitary implementation of the gate, ∆H is a Hamiltonian error generator that generates the gate's unitary errors 2 , and L is a nonunitary error generator that describes all non-unitary errors in the gate (see Ref. [66] for extensive discussion). Gate sets like the ones we analyze here have a gauge freedom [53,67,68]; some of their parameters have no physical consequences and are unobservable. Gauge degrees of freedom appear in error generator representations as unobservable linear combinations of error generator coefficients. When we construct and examine estimates in this article, we manifest the gauge freedoms explicitly as unobservable constant offsets, and we measure crosstalk errors using strictly gauge-invariant properties constructed as differences between two coefficients with identical gauge freedoms. Gauge-invariant parameters of the Hamiltonian error generator include: 1. The coefficient (rate) of any error generator that commutes with the target gate, including: (a) The entire ∆H for an idle operation, (b) Over/under rotation angles of any active gate, 2. The angle between the rotation axes of any two active gates, 3. The change in ∆H between the same gate acting in two different contexts.
These will be sufficient for our analysis.

F. Testing quantum information processors for crosstalk
The previous sections each discussed an important element of a robust method for identifying and characterizing crosstalk errors in a quantum information processor.
In this section we outline the steps we took for end-toend characterization of crosstalk errors in the two experimental platforms discussed in the next section. For each processor, we collected all data in one contiguous experiment (details are given below), but here we present a step-by-step procedure for clarity.
First, to obtain a rough estimate of local and crosstalk error rates, we performed and analyzed a form of simultaneous RB [41]. Simultaneous RB involves three distinct RB experiments: running RB on subsystem (1) while idling subsystem (2); idling (1) while running RB on (2); and running RB on (1) and (2) simultaneously. This yields two error rates (r i and r s , from the idle and driving contexts) for each subsystem. The change in each subsystem's RB error rate (r s − r i ), when the other subsystem is driven instead of idled, provides an estimate of how much error the gates on one subsystem induce on its neighbor. Simultaneous RB can be implemented with any variant of RB; we used direct RB (DRB), a variant of standard RB in which the Clifford RB circuits [69] are replaced by uncompiled circuits over a system's native gates [70].
Next, we analyzed GST data. We fit our three models to this data (using pyGSTi), and evaluated their fit quality using N σ , evidence ratios, and wildcard error. We used this information to deduce which forms of crosstalk were present, and to estimate their magnitude. In each use case, we then selected the model that best explained the observed data, for further analysis using the techniques of Section II E. The details of this analysis depends on the model: • Crosstalk-free model: If this model fits, there is no evidence for crosstalk. Each single-system gate is represented by a local process matrix, independent of context. It can be evaluated in the usual way.
• Context-dependent model: If this model is selected, the gates still act locally, but their action is contextdependent. The model specifies the action of each single-system gate in several contexts, indexed by which operation is performed on the neighboring subsystem. The most relevant object of study is the variation in a gate's action between contexts, which is gauge-invariant and easily extracted from the local process matrices that describe it in different layers.
• General model: If this model is selected, at least one layer is inducing nonlocal (e.g. entangling or correlated) errors. Each process matrix must be analyzed to see if it produces correlated errors, and if so, what their kind and magnitude are.

III. EXPERIMENTAL DEMONSTRATION
We used our nested crosstalk models to investigate and characterize crosstalk errors on the two DOE Quantum Testbed platforms, AQT [55] and the QSCOUT prototype [56,57]. AQT is a transmon-based platform housed at Lawrence Berkeley National Laboratory and UC Berkeley. QSCOUT is a trapped-ion quantum computing platform housed at Sandia National Laboratories. By characterizing both devices, we are able test the performance of our methods against vastly different physical error sources and experimental limitations.
We report experimentally measured/estimated values throughout this section. When possible, uncertainties are given, using concise notation, e.g., 1.234(5) indicates 1.234 ± 0.005. All uncertainty intervals are 95% ≈ 2σ confidence intervals, obtained using either bootstrapping or likelihood ratio confidence interval methods.

A. The Advanced Quantum Testbed
Experiments on the AQT platform were performed using an eight-qubit superconducting transmon processor (AQT@LBNL Trailblazer8-v5.c2). The qubits are encoded as the |0 and |1 states of the transmons, and are coupled to their nearest neighbors in a ring geometry. The demonstrations here focus on two next-nearestneighbor transmons, labeled Q4 and Q6, whose fundamental transition frequencies range from 5.2 to 5.5 GHz, with anharmonicities around 270 MHz. Each qubit in the device has its own control line for applying X π/2 and Y π/2 gates, while any necessary Z gates are applied virtually through discrete phase shifts of subsequent X π/2 and Y π/2 gates [71].

Potential (and actual) sources of crosstalk in AQT
Crosstalk is a well known problem in many transmonbased quantum processors. Two important crosstalk effects are coherent ZZ interactions, induced by shared microwave resonator modes, and pulse spillover. Microwave drive signals are often poorly localized on superconducting chips, so they rely on resonance mismatch to mitigate the impact of control spillover between qubits. However, this does not work perfectly, and the lingering interactions can manifest locally on spectator qubits as unwanted Rabi oscillations (if the drive is close to resonant with the spectator) or dispersive AC Stark shifts (if the drive is off-resonant). For neighboring qubits, such spillover crosstalk might even cause unwanted crossresonance entangling interactions [72].
Using spectroscopic analysis we found that the |0 -|1 transition frequency of Q4 is nearly resonant with the |1 -|2 transition of Q6. When microwave drive tones are applied to Q4, some of this power impinges on Q6 and this can therefore cause an AC Stark shift in the |0 -|1 transition frequency of Q6. The size of this shift can be measured by driving Q4 on resonance while monitoring the frequency of Q6 via Ramsey spectroscopy.
We can correct for this drive-dependent Stark shift by adding an explicit crosstalk compensation pulse on Q6 that interferes destructively with the spillover pulse. This compensation pulse is optimized by first identifying the phase shift for which the Stark shift of Q6 is maximal, and then finding the relative amplitude that minimizes the error on Q6. This compensation tone is then built into each active operation that is applied to Q4.

Experiment design
To characterize crosstalk in the AQT platform, we ran simultaneous DRB and GST experiments, both with and without crosstalk compensation. The GST circuit family we used is summarized in Table VI. We used circuits up to depth 32, resulting in 20,577 GST circuits in total. Our DRB circuits were constructed by sampling 30 two-qubit, simultaneous DRB circuits at each of 8 exponentially spaced depths up to a maximum depth of 256. From each of these DRB circuits, we created two additional independent DRB circuits for which all of the gate operations on one or the other qubit were replaced with idles, yielding 710 unique DRB circuits in total (if a circuit was duplicated we simply gathered twice as much data to avoid undersampling).
We combined the GST and DRB circuit sets into a single list and randomized its order (effectively interleaving the RB and GST circuits). Because neither GST nor RB are designed to be reliable when there are large drifts in a processor's behaviour over the duration of the experiment, we gathered data in four batches. Crosstalk compensation was employed in alternate batches, and each batch consisted of running every circuit in the combined circuit list 500 times. Using simple statistical tests described in [48], we confirmed that the data in similar batches were statistically consistent. We then aggregated data from similar batches (with and without crosstalk compensation) into a single dataset containing 1000 shots per circuit. All data was taken over a period of approximately 10 hours.
Our measurement was able to distinguish between |0 , |1 and the leaked state |2 . However, the GST and RB techniques used in this work are not designed to characterize leakage, so we discarded any measurements for which either qubit was found in the |2 state. This required discarding approximately 0.3% of all measurements when crosstalk compensation was used, and approximately 0.5% when it was not.

Randomized benchmarking
We used simultaneous DRB on AQT's Q4 and Q6 to estimate (1) each qubit's typical error-per-gate, and (2) how much that error rate changed when the other qubit was left idle or driven with random gate sequences. We call these "contexts". Simultaneous DRB reveals each qubit's error-per-gate in two contexts, the "spectator driven" context (r s ) and the "spectator idle" context (r i ).

FIG. 2.
Simultaneous direct randomized benchmarking (DRB) on the AQT platform. We ran simultaneous DRB experiments (details in main text) on the AQT device's Q4-Q6 subsystem both with (top) and without (bottom) crosstalk compensation enabled. Estimates of each qubit's average error-per-gate were extracted in two contexts: "idling spectator" (ri) and "driving spectator" (rs). The increase in each qubit's r due to driving the spectator is a high-level measure of crosstalk-induced errors on that qubit. Uncertainties are 95% confidence intervals obtained from bootstrapping. Violin plot regions around each point indicate the distribution of the 30 individual DRB circuits whose average success probability is represented by the point. There is no statistically significant evidence of crosstalk-induced errors from Q6→Q4. However, there is significant evidence of crosstalk from Q4→Q6, and it is reduced (though not eliminated) by crosstalk compensation.
Its main result is the difference between these error rates (r s − r i ), which we call the context-to-context variation of a qubit's error-per-gate. Figure 2 summarizes the results, showing success probability decay curves for each qubit, in both contexts, and the estimated values of each r s and r i , with and without crosstalk compensation.
Without crosstalk compensation, we observe a striking asymmetry between the two qubits. Q4's error-per-gate is approximately 0.07% regardless of context (r s ≈ r i and the estimated change of 0.006(6)% is within error bars of zero). But Q6's error-per-gate shows significant contextto-context variation -it jumps from 0.082(5)% when Q4 is idled, to 0.13(2)% when Q4 is driven. We conclude that there is significant crosstalk from Q4 to Q6, but no evidence for crosstalk in the opposite direction.
Simultaneous RB demonstrates that crosstalk is present, and that it is significant. It suggests that crosstalk effects are asymmetric (Q4→Q6), and that crosstalk compensation reduces them. But it is hard to draw further conclusions, because simultaneous RB does not reveal how each gate's action depends on context, nor does it distinguish context-dependent errors from entan-gling or correlated ones. It is possible to extract some of this information from a more sophisticated analysis of RB data [42], but we can answer these questions definitively with GST data and analysis.

Comparing and selecting crosstalk models
We began our analysis of the GST data by fitting all three models -crosstalk-free, context-dependent, and general -to the two datasets (with and without crosstalk compensation), and evaluating their fit quality. Table II displays the results.
None of the three models fit either dataset perfectly. We evaluated each model's fit quality by its loglikelihood ratio λ with respect to the maximal model [see Eq. (3)]. All six fits displayed N σ > 45, i.e., at least 45σ of model violation. This constitutes strong statistical evidence of non-Markovian behavior that cannot be modelled by twoqubit process matrices. Under these circumstances, neither GST nor RB is guaranteed to be be reliable, and caution is required when interpreting results.
However, large N σ does not imply that non-Markovian errors are dominant. To compare the rates of modeled and unmodeled errors, we assigned a wildcard error (see Sec. II D) to each model's best-fit estimate. The general model's wildcard error rate is less than 0.2%, both with and without crosstalk compensation. This is much smaller than the GST model's average diamond distance error rate (¯ = 1.670(3)% without crosstalk compensation, or¯ = 1.097(3)% with crosstalk compensation). We conclude that Markovian errors dominate, and are captured by the largest GST model.
Since the general GST model explains most of the errors observed in the GST data, we investigate whether smaller models -context-dependent and crosstalk-free -are equally consistent with the data. We can evaluate their fit relative to the general model using two criteria: evidence ratios, or the change in wildcard error.
The context-dependent model is conclusively accepted by both criteria -it fits the data as well as the general model despite having more than 1400 fewer parameters. The evidence ratio between them is γ = 0.9 (γ = 0.58) without (with) crosstalk compensation, and the wildcard error increases by only 0.014%. We conclude there is no evidence for entangling or correlated crosstalk errors; the context-dependent model's tensor product process matrices describe the observed data as well as possible by any (two-qubit) process matrices.
The crosstalk-free model, however, does not fit the data well. In the absence of crosstalk compensation, it is overwhelmingly rejected by both criteria -the evidence ratio between the context-dependent and crosstalk-free models is γ = 495, and the wildcard error required to reconcile it with the data is increased by more than 5× (W crosstalk−free = 0.85% W context−dependent = 0.15%). This constitutes overwhelming evidence that crosstalk errors are present, and their magnitude is large (i.e., they make a substantial contribution to the total gate error rates).
When crosstalk compensation is enabled, the crosstalkfree model fits the data much better. Its wildcard error drops substantially (W crosstalk−free = 0.23% ≈ W context−dependent = 0.20%), and the evidence ratio in favor of the context-dependent model drops to γ = 15.6. This constitutes clear evidence that crosstalk is still present, but its magnitude and significance are greatly reduced by crosstalk compensation.
Our conclusions from this analysis are: • Crosstalk errors are clearly present, but only local context-dependent errors.
• Non-Markovian errors are present at the 0.2%/gate level, but are dominated by Markovian errors (and by Markovian crosstalk errors).
• Crosstalk compensation reduces crosstalk errors significantly, but it does not eliminate them.
• Both with and without crosstalk compensation, the best estimate (process matrices) to examine in detail is the context-dependent estimate.

Extracting detailed crosstalk error rates
We now examine the best GST estimates (with and without crosstalk compensation) in detail. A GST estimate of the context-dependent model specifies a 2-qubit process matrix for each of the 9 parallel-gate layers (e.g. X π/2 ⊗ I, Y π/2 ⊗ X π/2 , etc). Each 2-qubit process matrix is the tensor product of two 1-qubit process matrices. Each 1-qubit process matrix describes one of 3 gates (X π/2 , Y π/2 , or I) acting on one of 2 qubits (Q4 or Q6), in one of 3 contexts (X π/2 , Y π/2 , or I applied to the other qubit). We represent each 1-qubit process matrix using error generators (Sec. II E), and examine the context-tocontext variation of the error generators for each of the 6 single-qubit gates (X π/2 , Y π/2 , and I gates on Q4 and Q6).
The non-unitary parts of each error generator -e.g., rates of stochastic errors -have almost no statistically significant context-to-context variation. The very small number of variations that are statistically significant are comparable in magnitude to the non-Markovian errors (measured by wildcard error at W ≈ 0.2%). In contrast, context-to-context variations in the unitary part of the gates' error generators (∆H in Eq. 7) are both statistically significant and large. Table II presents the coefficients of coherent X, Y , and Z Hamiltonians (in milliradians) for each gate in each context, both without (top) and with (bottom) crosstalk compensation enabled. The target rotation angle θ 0 = π/2 × 10 3 mrad is included where appropriate. This gate set has 6 gauge freedoms (corresponding to 3-parameter unitary changes of basis on each qubit), which are reflected in this table by the unobservable constants c 1 . . . c 6 . The key to interpreting these error rates is that constants (θ 0 and c i ) appear identically in each column. So context-to-context variations in each gatei.e., differences between entries in the same row -are gauge-invariant.
When no crosstalk compensation is applied, every gate displays some statistically significant context-to-context variation. However -as suggested by simultaneous RB results -the variations are significantly larger for Q6. Its idle operation (I) experiences phase (Z) errors that change by 13.1(2) mrad when the spectator qubit is driven. Active gates (X π/2 ,Y π/2 ) on Q6 display similar variations in their rotation angles [e.g. Y errors on the Y π/2 gate, which vary by 9.5(1) mrad] or axes [e.g. Y errors on the X π/2 gate, which vary by 14.4(1.0) mrad]. In contrast, the largest variation observed in a Q4 gate is the rotation angle of the X π/2 gate, which varies by 4.6(5) mrad. Figure 3 illustrates the variation in the effective Hamiltonians that generate X π/2 and Y π/2 gates on Q6, in the X-Y plane.
The context-to-context variations do not follow a simple pattern. For the idle gates, only phase (Z) errors are observed, and they depend significantly only on whether the spectator qubit is driven or not (rather than on which gate is performed on the spectator). For the active gates, however, both rotation angles and axes vary, and they depend not just on whether the spectator is driven, but on whether an X π/2 or Y π/2 gate is performed on the spectator. This detailed information about the nature of the crosstalk could in principle be compared to -or used to inform -physical models of gate context dependence, but we do not currently have such a model.
Enabling crosstalk compensation reduced the overall crosstalk significantly, but not uniformly. It had little effect on Q4's gates (which were already relatively good), but eliminated (a) essentially all context dependence for Q6's Y π/2 and I gates, and (b) essentially all variation in the rotation angle of Q6's X π/2 gate. Interestingly, variations in the X π/2 gate's rotation axis were not eliminated (see Fig. 3).

Discussion of crosstalk in AQT
The asymmetric crosstalk errors identified by our GST experiments are consistent with AC Stark shifts induced by the control fields. As discussed in Sec. III A 1, spectroscopic data predicted that driving Q6 should not influence Q4 because of the large discrepancy in relevant transition frequencies. Conversely, that same analysis predicted Q6 should experience phase shifts when Q4 is driven, because the Q4 drive tone is near-resonant with an excited state transition of Q6. The crosstalk errors we observed are consistent with these predictions.

FIG. 3.
Coherent errors in X π/2 and Y π/2 gates on AQT Q6 depend on context. We used GST to estimate 1-qubit process matrices describing the X π/2 (top) and Y π/2 (bottom) gates on Q6 conditional on {I, X π/2 , and Y π/2 } gates performed at the same time on Q4, both without (left) and with (right) crosstalk compensation. We extracted the effective Hamiltonian that generates the unitary part of each estimated process (see main text), represented it as H eff = h· σ [with σ = (X, Y, Z)], and plotted the projection of each one into the X-Y plane (all hZ components are negligible). Each of the four panels shows a small region of the X-Y plane; each effective Hamiltonian is represented by a line from the origin to (hX , hY ). Uncertainty regions are shown as ellipses, which are too small to be visible here. Both rotation angles and axes vary significantly from context to context; crosstalk compensation reduces this effect, and essentially eliminates it for the Y π/2 gate.

B. The Quantum Scientific Computing Open User Testbed
Experiments on the QSCOUT platform were performed using a prototype system that is nearly identical to the deployed testbed. It was configured to use two qubits encoded as the hyperfine clock states of a pair of 171 Yb + ions [73] that were held in a Sandia-fabricated HOA 2.0 surface ion trap [74]. The ions were trapped together in a single pseudopotential and were spaced 4.5 µm apart. The trap frequencies were approximately 1 MHz axially and 2 MHz radially. The ions' hyperfine states were manipulated using a two-photon Raman transition via a pair of phase-locked co-propagating frequency combs generated by a frequency-tripled Nd:YAG pulsed laser at 355 nm [75]. Each ion was individually addressed by first splitting a single laser beam into multiple beams, and sending each beam through a dedicated channel of a multichannel acousto-optic modulator (AOM) which allows for independent frequency, phase, and amplitude  (2) TABLE II. Fit quality metrics for three models on AQT device. We tabulate metrics of fit quality and unmodeled error for three distinct models' when fit to GST data from the AQT platform (both with and without crosstalk compensation). The performance of these models can be compared in detail using the log-likelihood ratio score λLR, the Nσ of model violation, and the residual wildcard error W . Each of these models further predicts an average diamond error¯ for the gate set. Without crosstalk compensation, the context-dependent model is strongly preferred over the crosstalk-free model by the evidence ratio test γ, and the wildcard error is significantly reduced by moving to the larger model. The full Markovian model, however, requires nearly an order of magnitude more parameters, achieves only a small improvement in the likelihood ratio and the wildcard, and is rejected by the evidence ratio test. When crosstalk compensation in applied, the context-dependent model is again preferred by the evidence ratio test, but much more weakly, and the wildcard error nearly constant across models. Crosstalk compensation further results in a 35% reduction in the average gate error.
control. Each beam was then tightly focused using custom optics to a 0.8 µm axial waist radius, ideally impinging on only a single target ion. During the detection cycle, light from each ion was focused to a different core of a multicore fiber. Each core was then sent to its own photomultiplier tube (PMT), allowing for distinguishable detection of the ions.

Potential (and actual) sources of crosstalk in QSCOUT
Several physical phenomena are expected to manifest as crosstalk errors in the QSCOUT hardware. The first is straightforward: control lasers targeted at one ion have a non-zero beam waist, allowing some light to spill over onto neighboring ions. Because all ions in the system are at the same resonant frequency, this light can cause slow, coherent Rabi oscillations on the neighbor ions. The laser waist is relatively small relative to the ion spacing, so one might expect this effect to be relatively small. But its magnitude can be significant if the system is poorly aligned. A more subtle source of potential crosstalk is the internal dynamics of the multichannel AOM. The nonlinear optical crystals in the AOM are not perfectly isolated from one another, so acoustic drive tones applied to a target crystal can cause neighboring crystals to ring sympathetically. The neighboring channel will then be activated (or perturbed), and its target ion will experience an error. The signals that implement these drive tones can also electrically couple into neighboring channels. The wiring layout of the AOM predicts that next-nearest-neighbor correlations will be greatest. Correlated errors arising from common causes can also manifest as crosstalk. In the trapped-ion hardware we expect amplitude (or phase) fluctuations of the driving laser to result in correlated amplitude (or phase) errors at the qubits. Similarly, magnetic field fluctuations can result in correlated phase errors.

Experiment design
We used the same family of GST circuits for QSCOUT as for AQT (Table VI). Because trapped-ion operations are slower, we took less data on the QSCOUT device. The number of counts per circuit was reduced from 1000 to 80, and the GST circuits were limited to depth 8 (rather than 32). We ran simultaneous DRB circuits generated in exactly the same way as for AQT (maximum depth 256, 30 circuits at each of 8 logarithmically spaced depths), and interleaved them with the GST circuits in the same way. A total of 12,514 unique circuits were run in the QSCOUT experiment, including 11,813 GST circuits and 701 unique DRB circuits. Data was taken in 8 batches, each comprising 10 shots of every circuit. The device was recalibrated between each batch. Approximately one million individual circuit shots were performed over approximately four hours.
The circuits we ran on the QSCOUT platform were implemented using simple pulses, with no compensating pulse sequences of any kind. The QSCOUT hardware can implement gates using, e.g., BB1 composite pulse sequences [76] that can yield significantly reduced error rates. However, they also require more time to implement, complicate the interpretation of unitary error generators, and can reduce the magnitude of certain crosstalk errors below detectable thresholds. So, for the purposes of this work, we restricted gate operations to simple, bare pulses.
Only the X π/2 operation was calibrated. Y π/2 gates use the same calibrated pulses, but phase-shifted by π/2, so effectively Y π/2 was also calibrated. This calibration procedure optimizes the X π/2 and Y π/2 gates on FIG. 4. Simultaneous direct randomized benchmarking (DRB) on the QSCOUT platform. We ran simultaneous DRB experiments (see caption to Fig. 2) on a 2-qubit QSCOUT processor. No crosstalk compensation was used in this experiment. Q0 performs well independent of context (i.e., whether Q1 is driven). In contrast, Q1 performs at least 5× worse in both contexts, and its error-per-gate appears to depend on whether Q0 is driven. Oddly, however, driving Q0 actually reduces Q1's error rate, suggesting a negative rate of crosstalk-induced errors. We identify the particular calibration protocol used as the probable cause (see main text).
each qubit for performance in the very specific context where the same gate is performed on the other qubit. Other circuit layers -e.g.
π/2 I (1) are not necessarily well-calibrated, and may experience different and/or larger crosstalk errors. Independent DRB probes the performance of layers that are strictly uncalibrated (e.g. π/2 I (1) , and I (0) I (1) for Q0). Approximately one third of the circuit layers used in simultaneous DRB have been calibrated, predicting that the simultaneous circuits should perform better -which is what we observe. We shall see further consequences of this calibration protocol in the GST data.

Comparing and selecting crosstalk models
We fit all three gate set models to the data (just one dataset in this case, since no crosstalk compensation was performed) and evaluated their fit quality. Table IV displays the results.
No model fits the data perfectly. The general model displays about 9σ of model violation, suggesting that the gates are somewhat non-Markovian. However, the wildcard error assigned to the general model is almost zero, suggesting that this non-Markovianity is barely visible. These results should not be compared directly to AQTfar less data was taken on QSCOUT, so the experiment is less able to detect and quantify non-Markovianity.
The evidence ratio between the general model and the (smaller) context-dependent model is extraordinarily small (γ = 0.29), indicating no evidence whatsoever of entangling or correlating crosstalk errors between the qubits 3 . The observed data can be described more or less 3 Evidence ratios less than 1 are typically caused by extremely perfectly by context-dependent local errors.
However, the evidence ratio between the contextdependent model and the crosstalk-free model clearly rejects the crosstalk-free model (γ = 134). This constitutes clear and overwhelming evidence of crosstalk (as expected, given the simultaneous RB results). We can also estimate the magnitude of the crosstalk errors from the wildcard error required to reconcile the best crosstalk-free estimate with the data (W = 2.4%). Since the contextdependent model fits the data well, we can ascribe all crosstalk errors to context-to-context variation of local errors, which we proceed to examine in detail.

Extracting detailed crosstalk error rates
The GST estimate of the context-dependent model yields 1-qubit process matrices describing each of the 6 gates in 3 different contexts. We analyze the error generators for these processes. As in the AQT analysis, the non-unitary error generators reveal almost no significant context-to-context variation, so we focus on unitary (coherent) errors. Table V presents the coefficients of coherent X, Y , and Z Hamiltonians (in milliradians) for each gate in each context. The target rotation angle θ 0 = π/2 · 10 3 mrad is included where appropriate. This gate set has 6 gauge freedoms (corresponding to 3-parameter unitary changes of basis on each qubit), which are reflected in this table by the constants c 1 . . . c 6 , which are unobservable. θ 0 and c i appear identically in each column, so context-to-context variations in each gate -i.e., differences between entries in the same row -are automatically gauge-invariant.
We can immediately draw the following conclusions from Table V: • Errors on both idle gates are small (≤ 3 · 10 −3 radians) in all contexts, and almost exclusively indistinguishable from zero. The I gate on Q0 has no evident errors at all. The I gate on Q1 shows barely significant context-dependent rotations by about 3 mrad.
• Z Hamiltonian error rates on each qubit's X π/2 and Y π/2 gates are also effectively negligible -they are all < 4 mrad in all contexts, and mostly indistinguishable from zero.
The remaining errors, with magnitudes of up to 0.1 radians (100 mrad), fall into two categories: 1. Over/under-rotation errors, i.e. X Hamiltonian errors on X π/2 gates and Y Hamiltonian errors on Y π/2 gates.
small sample counts, when many events appear 0 or 1 times, and stem from breakdown of the Gaussian ansatz used in deriving Wilks' theorem. Since each circuit was repeated N = 80 times, it is unsurprising to see this for the QSCOUT data.
2. "Tilt" errors that change a gate's rotation axis, i.e. Y Hamiltonian errors on X π/2 gates and X Hamiltonian errors on Y π/2 gates.
All of these errors show significant context-to-context variation. Figure 5 illustrates this variation by depicting each gate's angle and axis in the X-Y plane, for each context. Gates on Q0 have rotation angles that vary by up to 33(7) mrad, and rotation axes that vary by up to 28(2) mrad. For gates on Q1, rotation angles vary by up to 170(40) mrad, and rotation axes vary by up to 150(1) mrad. The context dependence of errors on Q1's gates are approximately 3-5× larger than Q0's. Because these errors are coherent, the error-per-gate observed in RB scales as θ 2 , and so in RB experiments we expect to see 10-25× more context-dependent error on Q1 than Q0. This is consistent with the observed results of simultaneous RB (Fig. 4). The GST estimate confirms the conjecture we stated in the discussion of simultaneous RB results: errors are minimized when the same active gate (X π/2 or Y π/2 is performed on both qubits. These are the only layers that are explicitly calibrated. Active gates performed in other contexts show clear and significant calibration errors in both their rotation angles and their rotation axes (relative to the calibrated operation).

Discussion of QSCOUT results
Simultaneous RB experiments clearly demonstrate the existence of Q0→Q1 crosstalk, but also demonstrate the counterintuitive result that gates on Q1 perform better when Q0 is driven. GST experiments revealed what was actually happening: each active gate's behavior depends not just on the "spectator driven" and "spectator idled" contexts, but on exactly what gate is performed on the spectator. Most of this effect is due to the calibration protocol, and the fact that only X π/2 X π/2 and Y π/2 Y π/2 gates were specifically calibrated. However, Table V shows additional variations between the other two contexts.
The crosstalk errors we observe in the QSCOUT system reflect the fact that the two qubits have identical energy splitting. Pulse spillover onto a spectator ion is therefore resonant with its qubit transition, leading to coherent Rabi oscillations around an axis on the equator of the Bloch sphere. In contrast, the AQT qubits have different frequencies, so spillover crosstalk acts very differently. Instead of coherent Rabi oscillations, it induces an AC Stark shift on the spectator qubit, which manifests as coherent rotation about the Z axis. The asymmetry in the QSCOUT error rates (particularly the errors in the idle gates) support a hypothesis that the crosstalk arises from beam pointing errors, rather than internal AOM phenomena, which are more likely to result in crosstalk errors that are symmetric between qubits.
Restricting device calibration to parallel operations has serious, observable impacts on the performance on non- FIG. 5. Coherent errors in X π/2 and Y π/2 gates on QSCOUT qubits depend on context. We used GST to estimate 1-qubit process matrices describing the X π/2 (left) and Y π/2 (right) gates on Q0 (bottom) and Q1 (top), conditional on {I, X π/2 , and Y π/2 } gates performed at the same time on the other qubit, in the QSCOUT system. Exactly as for Fig. 3, we extracted the effective Hamiltonian that generates the unitary part of each estimated process (see main text), represented it as H eff = h · σ [with σ = (X, Y, Z)], and plotted the projection of each one into the X-Y plane (all hZ components are negligible). Each of the four panels shows a small region of the X-Y plane; each effective Hamiltonian is represented by a line from the origin to (hX , hY ). Uncertainty regions are shown as ellipses. Q0's gates show essentially no context dependence. In contrast, the rotation angles and axes of the X π/2 and Y π/2 gates on Q1 show extremely strong dependence on the gate performed on the spectator. We also observed small context-dependent phase errors in the I gate (not shown here). parallel layers. This suggests that the active gates' error rates could be reduced significantly and made less context-dependent by explicitly calibrating all of the layers simultaneously. This would require a more complicated tune-up process.

IV. DISCUSSION
This paper presents both a novel device characterization method, and the results of its deployment on two experimental platforms. We separately discuss our conclusions about this method and our insights into the experimental results. TABLE IV. Details of the three best-fit gate models on the QSCOUT device. The performance of these models can be compared in detail using the log-likelihood ratio score λLR, the Nσ of model violation, and the residual wildcard error W . Each of these models further predicts an average diamond error¯ for the gate set. The context-dependent model is strongly preferred over the crosstalk-free model by the evidence ratio test γ, and the wildcard error is significantly reduced by moving to the larger model. The general crosstalk model, however, requires nearly an order of magnitude more parameters, achieves only a small improvement in the likelihood ratio and the wildcard, and is rejected by the evidence ratio test. Notably, W is nonzero for the general model despite being zero for the smaller context-dependent model. This can occur because the wildcard error calculation requires models with more parameters to fit better that those with fewer.

A. Protocols and methods
GST is best known as a replacement for process tomography, and for its use in constructing full-dimensional process matrices that represent gates' action on their target space. But GST is inherently flexible [54], and we used that flexibility here to test multiple error models -ranging from full-dimensional process matrices down to a highly restricted crosstalk-free model -to data generated using parallel gates on two subsystems. This targeted adaptation of GST to probe crosstalk allowed us to extract a lot of information at many levels of detail, ranging from "Yes, there's crosstalk, but it's not entangling" down to the exact details of how much over-rotation each gate induced on the spectator.
Our results should be easy to reproduce, extend, and deploy in many experimental systems. They are explicitly platform-agnostic, and they require only userlevel access to a device and the ability to run simple circuits composed of device-native quantum operations. The data analysis routines are all built from free, widely available software tools, and can often be implemented with just a few lines of Python code (e.g., pyGSTi). All of the experimental data and analysis code necessary to reproduce the results shown here are available upon request as supplementary material.
The investigation and results presented here highlight the complexity of crosstalk errors. If we had stopped after running simultaneous RB, we would have concluded simply that (1) the AQT qubits had a little bit of crosstalk and (2) the QSCOUT qubits had a small negative amount of crosstalk. The results of our detailed GST investigation do not contradict those findings -but they illustrate that crosstalk is not described by a single number. Crosstalk changes errors -e.g., depending on their context -which can cause unexpected harm irrespective of which context (e.g. idle neighbor or driven neighbor) induces the worse error. Eliminating crosstalk means removing all forms of context-dependence entirely, not just ensuring that "idle" and "driven" randomized benchmarking experiments yield the same error-per-gate.
The approach presented here complements high-level crosstalk benchmarks, such as simultaneous RB, by constructing detailed models that can identify specific errors. This low-level diagnostic information can elucidate the physics of quantum information processors, enable better calibration, and inform design of next-generation systems. Detailed error models also enable more accurate estimates of device performance on real tasks, and noise-optimized decoders for quantum error correction.
These advantages do have a cost. Although GST experiments constituted the overwhelming majority of the circuits we ran, the simultaneous DRB experiments were actually more sensitive to certain errors. GST and RB circuits are equally sensitive to the average stochastic error of a gate set, and they both amplify it proportional to circuit length L. We ran GST circuits up to L = 8 or L = 32 (depending on platform), but RB circuits up to L = 256. As a result, the simultaneous RB analysis provides sharper information about the context-dependence of average stochastic error than GST estimates do; GSTbased predictions of the simultaneous DRB results are within error bars, but those error bars are larger than the estimated effect. The GST estimates provide far more information about coherent errors (and individual stochastic error rates), but simultaneous RB extracts its single summary benchmark with unmatched efficiency.
Extensions of our methods to many-qubit systems and two-qubit gates are possible. One obvious and immediately practical extension is to study the crosstalk induced by 2-qubit gates (on a 2-qubit subsystem) on neighboring 1-and 2-qubit subsystems. This requires probing at most 4 qubits at once, which is feasible with existing analytic machinery. Straightforwardly scaling our models to subsystems of 2 qubits will quickly become impractical. Recent advances in many-qubit gate set tomography [77] suggest a path to reducing this overhead and enabling GST for efficient crosstalk characterization of large-scale quantum systems.

B. Experimental results
Our investigation revealed similar crosstalk errors in both platforms -the gate errors were significantly context-dependent, but there was no evidence for entangling or correlated errors. And in both platforms, the dominant crosstalk errors are consistent with simple pulse spillover.
Our experiments found no statistically significant evidence for entangling or correlated stochastic errors in either experiment. We confirmed using numerical simulations that our methodology is sensitive to entangling errors of this type (see Appendix), so our failure to detect them indicates their rate is below the detection threshold of this experiment. GST circuits' sensitivity to coherent errors (of all types) increases proportional to their length, so in future work we intend to search for weak coherent ZZ errors using longer GST circuits. More sensitive GST experiments could also reveal context dependence in the non-unitary errors, and/or correlated stochastic noise.
In contrast, the absence of entangling errors in the QSCOUT platform is unsurprising. This platform's single-qubit gates are not expected to induce the ion/phonon couplings required to couple two ion qubits. Correlated stochastic errors are plausible and could be produced by a wide array of environment effects, but we saw no evidence for such errors in the data. However, this experiment's detection threshold was relatively high -the trapped-ion hardware has a relatively low data rate that limited both the circuit depth (L = 8) and the number of counts (N = 80). As a result, our estimates of stochastic error rates have low precision, and correlated errors could be hiding in the noise. More precise probing of these errors will require streamlined experiments.
One of the most impactful results of our analysis is that -in both platforms -the dominant crosstalk errors are restricted to (1) context-dependent local errors, and (2) coherent unitary rotations. These restrictions single out a very small subset of all the possible crosstalk errors. It is always easier to characterize and track errors that lie in a constrained, low-dimensional set. Perhaps more importantly, context-dependent unitary errors are among the easiest to eliminate. Dynamical decoupling, optimal control, or simple context-dependent calibration can all remove such errors. As demonstrated by AQT's crosstalk compensation pulses, even simple techniques to cancel pulse spillover can improve device performance. Similarly, the dominant errors observed in QSCOUT experiments stemmed directly from how the gates were calibrated. Minor changes to that protocol -e.g., independent calibrations for "idle neighbor" and "driven neighbor" contexts -could reduce crosstalk errors below detectable thresholds.
Our analysis demonstrated the utility and performance of crosstalk compensation in the AQT system. Simultaneous GST could be used to enable continued, iterative reduction of crosstalk errors (via iterative calibration), but we don't think this is the right idea. Full simultaneous two-qubit GST requires too much overhead to be used in an active optimization loop. However, we believe it is possible to construct simpler, more targeted characterization protocols that focus on a particular type of crosstalk. These will run much faster, and be suitable for inclusion in active feedback loops. We propose that the role of "heavy" protocols like GST is to identify which crosstalk errors are dominant, so that specialized "lightweight" protocols can be deployed to tame them. The fiducials and germs were chosen numerically via pyGSTi [58] such that the fiducials generate an informationally complete set of states and measurements, and the germs are sensitive to all parameters in the general crosstalk model. Additionally, the germs and fiducials contain the circuits necessary to run isolated, single-qubit GST on each component qubit. This last requirement did not increase the number of fiducials required, but did necessitate the addition of two germs (the last two in Table VI).

Appendix B: Simulation
For all of the experiments presented in the main text, context-dependent models happened to provide the best balance of explanatory power and simplicity. This is evidence that correlated and entangling errors, which are specifically excluded from context-dependent models, are unnecessary to explain the observed data. Of course, not all experiments can be explained with these models. For instance, as a superconducting processor, the AQT platform might have been expected to experience weak ZZtype coupling between the qubits [28]. In this section, we provide numerical evidence that our methods are capable of detecting a coherent entangling error and selecting the general crosstalk model, as long as the magnitude of the error is above a certain threshold, for a given amount of data gathered.
For our simulations, we use the same target gate set and GST circuit family as in our experiments. We set L (sim) max = 8 and simulate 1000 measurements per circuit. Each circuit layer experiences an identical Z (1) ⊗ Z (2) entangling error of strength , described by a Hamiltonian error generator (which is applied concurrent with each gate operation): For each of 10 distinct, exponentially-spaced values of ε, we simulate data and fit our models. Results of these simulations are presented in Figs. 6 -8.

Analysis of simulations
For the parameters used in our simulations, we identify a threshold value for the rate of the entangling error ε * 4.6 × 10 −3 , above which we successfully detect the injected crosstalk and choose the crosstalk-containing TABLE VI. Building blocks of the GST circuits used in our experiments to investigate crosstalk between two qubits j and k. Operations are applied sequentially from left to right, and parentheses indicate operations on separate qubits that are intended to be applied simultaneously. The full circuits of all possible choices of a preparation fiducial, an n-fold repeated germ operation, and a measurement fiducial (see main text). The particular set of germs and fiducials listed here was selected to enable high-precision estimation of all physical parameters of a general crosstalk model. model as the best fit. As shown in Fig. 6, for simulations with ε ≥ ε * , N σ for the context-dependent and crosstalkfree models grows linearly with ε, but remains approximately constant for the general model. In these cases, the evidence ratio strongly (and correctly) selects the general model. Below the threshold, the context-dependent model is weakly preferred to the crosstalk-free model. The precision of GST scales with the number of measurements N and the maximum length L max as O(1/(L √ N )), and we expect similar scaling in the threshold ε * .
From each fit to the general crosstalk model we can extract an estimate of the strength of the entangling ZZ error present in any gate estimate. We do this for the general crosstalk model's idle gate estimate, and plot it against in Fig. 7. We see in all cases considered that the general crosstalk model is able to accurately reconstruct this entangling error, whether or not the general crosstalk model's evidence ratio indicates its selection over the other models.
In Fig. 8, we show the wildcard error W for the Here, Nσ quantifies fit violation for each model. When ε < ε * = 4.6 × 10 −3 , the predictions of all three estimated models are statistically consistent with the simulated data. As ε increases above this threshold, Nσ for the contextdependent and crosstalk-free models increases rapidly, while for the general crosstalk model remains relatively constant. In this regime, the general crosstalk model is preferred by the evidence ratio test.
context-dependent and crosstalk-free models as a function of ε. For ε > ε * , the wildcard error is proportional to, but non-trivially smaller than, the magnitude of the error. This indicates that the wildcard is capturing some of the unmodeled error, but not all of it. In fact, this is entirely expected behavior. The wildcard error is intended to capture the extra error per gate that is required for the model to be consistent with the data. For the  Here we show the wildcard error assigned to fits of the contextdependent and crosstalk-free models when an entangling error is present. For ε ≤ ε * = 4.6 × 10 −3 , the wildcard error budget for all three models is zero. However, for ε > 4.6 × 10 −3 , the wildcard error for the context-dependent and crosstalk-free models increase approximately linearly with ε in the regime shown. The wildcard error for the general crosstalk model is zero for all values of ε, as this model always properly fits the data.
wildcard to be approximately equal to the true value of the ZZ coherent error rate, there would need to exist a circuit of depth d for which the total variational distance between the observed and predicted probabilities was δ TVD (p o , p p ) = d . No such circuit exists, though some get close. For simulations with ε < ε * , all models are consistent with the data, so W is zero.