Integrated tool-set for Control, Calibration and Characterization of quantum devices applied to superconducting qubits

Efforts to scale-up quantum computation have reached a point where the principal limiting factor is not the number of qubits, but the entangling gate infidelity. However, the highly detailed system characterization required to understand the underlying error sources is an arduous process and impractical with increasing chip size. Open-loop optimal control techniques allow for the improvement of gates but are limited by the models they are based on. To rectify the situation, we provide an integrated open-source tool-set for Control, Calibration and Characterization, capable of open-loop pulse optimization, model-free calibration, model fitting and refinement. We present a methodology to combine these tools to find a quantitatively accurate system model, high-fidelity gates and an approximate error budget, all based on a high-performance, feature-rich simulator. We illustrate our methods using simulated fixed-frequency superconducting qubits for which we learn model parameters with less than 1% error and derive a coherence limited cross-resonance (CR) gate that achieves 99.6% fidelity without need for calibration.

Scaling up quantum processing units (QPUs) is a monumental task, that requires the community to make progress on multiple fronts, most importantly improving gate fidelities and increasing the number of qubits. Over the past few years, significant emphasis has been placed on creating larger devices, yielding great success [1,2]. However, the number of qubits has outstripped the limits that fidelity places on their utility: In [1], a record quantum volume [3] of 64 was demonstrated, loosely translating to the device being able to perform log 2 (64) 2 = 36 entangling gates before fidelity drops below 2 /3, a relatively small number of gates for an array of six qubits; In [2] the circuit fidelity was 0.1% thus requiring 30 million repetitions to achieve the desired statistics. One could even argue that the two-qubit gate fidelities demonstrated in isolation in 2014 [4] (0.994) are comparable with those in 2019's [2] (0.9938), even though the latter are for simultaneous gates in a large 2D qubit array.
The relatively slow progress in improving gate fidelities can be traced back to an incomplete understanding of the sources of error. Indeed, characterization and calibration of QPUs to the desired accuracy is impractical and cumbersome, and operating on devices of increasing * These two authors contributed equally; c3@q-optimize.org qubit number requires entangling gates to be fine-tuned for each individual pair to account for slightly varying properties. The resulting lack of detailed models makes it harder to identify where efforts must be focused to achieve higher fidelity gates [5,6].
Given that "all models are wrong, but some are useful" [7], we describe a Good Model as follows: A Good Model is one that predicts the behavior of the system, for the operations we wish to perform, to accuracies we care about.
For a QPU, a Good Model has to have predictive power for the range of feasible gate-generating pulses and for long sequences of such gates, to a fidelity accuracy of the order of 10 −5 . To the authors' knowledge, no such Good Model for a superconducting QPU has ever been published.
Since models serve as the basis to derive high-fidelity gates in open-loop optimal control [8][9][10][11][12][13][14][15], any inaccuracies of the model will inevitably degrade the experimental accuracy of the resulting gates. This problem is only partially ameliorated by the first-order insensitivity of optimized pulses to model inaccuracies [16,17]. Methodologies such as the adaptive hybrid optimal control (Ad-HOC) protocol [18] -which combines a modelbased open-loop optimization with a closed-loop experimental calibration [12,19] -address this issue but leave one in an unsatisfactory position as the need for calibration proves the inadequacy of the model: the root causes of the remaining infidelities are unexplained.
Conversely, if a Good Model is known, gates generated by open-loop optimal control will, by definition, work on the experiment, not requiring further closed-loop calibration. This enables the use of complex pulses that would otherwise require time-consuming calibration. Such a Good Model would also provide an error budget through a process of exploratory interrogation -evaluating the potential performance of the system where certain limitations have been removed, i.e. asking "what if ...?". Therefore, extracting a Good Model efficiently and in a highly automated manner is key to improving fidelities and a crucial step of QPU scale-up.
In this work we present C 3 , our proposed approach to control, calibrate and characterize QPUs. The paper is organized as follows: We present the conceptual steps of C 3 in Sec. II and illustrate the methodology by example in Sec. III, showing how these steps are implemented. Sec. IV includes a detailed description of the modeling, optimization procedures, the data comparison function and relevant prior work. We conclude in V with a discussion of the effort's current status and long-term directions.

II. C 3 -CONTROL, CALIBRATION, AND CHARACTERIZATION
Current methodology relies on tailored routines to extract individual parameters of the system's model (characterization) or fine-tune specific parameters of pulses used (calibration) [20,21]. This approach becomes cumbersome and impractical as the number of model and pulse parameters increases. With C 3 we propose a different paradigm: Optimizing a figure of merit that is sensitive to the set of parameters we care about. This eliminates the need to design per-parameter measurements, and thus provides a more general approach. C 3 at its core is composed of three separate optimizations, respectively implementing the tasks of control, calibration, and characterization: C 1 : Given a model, find the pulse shapes that maximize fidelity with a target operation. Pulse shapes may be constrained by an ansatz or allow direct arbitrary waveform generator (AWG) parameterization.
C 2 : Given pulse shapes, calibrate their parameters, if possible simultaneously, to maximize a figure of merit measured by the actual experiment, thus improving beyond the limits of a deficient model. The tasks of open-loop optimal control, C 1 , and calibration, C 2 , are fairly established in the community [8-15, 18, 19]. To characterize the system and provide us with a Good Model, we introduce C 3 , a tool to optimize model parameters by comparing model prediction to experimental data. We refer to this task as model learning. For this purpose, one requires an experimental dataset containing information about the implemented pulses and the corresponding measurement outcomes. To test the model accuracy, we reproduce the data-set, applying the same pulses to a simulation of the experiment, and compare the resulting outcomes: This provides a model match score to optimize. Initially, a candidate model is formulated based on previous information or intuition. If the model is suitable to explain the experiment, the optimization will converge to a near perfect match, thus providing numeric values for the model parameters. Instead, if the match is poor, the user supplies a new model, that is either an extension or modification of the previous candidate, and the optimization is repeated. Depending on user choice, learned values are carried over to the parameters of the new model or discarded.
As heterogeneous experimental data is the foundation for model learning, we suggest using the three tasks of C 3 in sequence, as shown in Fig. 1. However, their applica-  1. Diagram of the C 3 tool-set in an integrated characterization loop. C1 is a tool for obtaining optimal pulses by finding the control parameters α that minimize a goal function f1(α) in simulation. The gate-set G includes all the operation that one wishes to perform on the experiment, including the information of the ideal logical operations and the optimal pulse parameters α that implement them. C2 is a model-free experimental calibration procedure that optimizes pulse shapes with a gradient-free search to minimize an infidelity function f2(α) by varying all parameters at once. A data-set is a collection of experiment/result pairs, including information about the pulses parameters used α, the sequences S k performed and the final outcomes measured, m k . C3 is a tool for model learning that determines the model parameters β that best explain the data-set. It minimizes a goal function f3(β|D) obtained by recreating experiments S k (α) in simulation and comparing the results to the ones in the experiment. In C 3 different parameterized models can be provided to represent various elements of the experiment to find the one that best describes it. After the learning, the resulting model can be the basis for another characterization loop, refining both model and controls. tion is by no means limited to this use case and one may choose to view them as stand-alone routines. The same tools used to realize C 1 , C 2 and C 3 can also be used to further interrogate the system to obtain a sensitivity analysis of the optimized model in light of the experimental data and a breakdown of possible error sources.
We note that the intertwining of control and characterization has been raised in the more general context of control theory [22][23][24][25]. In quantum technology, there are some works which combine two of the three tasks: Ad-HOC [18] calls for optimal control followed by calibration; a combination of model-based gradient calculations and experimental calibrations is proposed by [26,27], but the data gathered is not used to improve the system model; in [28] pulses are designed specifically for the purpose of reconstructing the noise spectrum.

III. SYNTHETIC APPLICATION EXAMPLE
The following synthetic example illustrates how C 3 is used to obtain a Good Model in a realistic setting. We simulate a two-qubit QPU device using an underlying model, labeled the "real" model, which includes control discretization effects, electronics transfer functions, Markovian noise, and state preparation and measurement (SPAM) errors.
In this example, the simulated device is treated as a black-box, which we interrogate with C 3 . We derive (C 1 ) and calibrate (C 2 ) optimal control pulses and use the resulting data to extract a Good Model (C 3 ) by comparing the black-box to three candidates: Simple model: Two uncoupled qubits, closed system dynamics; Intermediate model: Two coupled qubits, closed system dynamics; Full model: Two coupled qubits, open system dynamics, including SPAM errors. Same structure as the "real" model, but a priori undetermined parameter values.
We systematically enrich the model until it reproduces the behavior of the device observed in C 2 . The recovered model is then used to design a two-qubit gate that performs well on the black-box without the need for tune-up.
A. The black-box device ("real" model) The "real" model is composed of two coupled three-level Duffing oscillators, labeled by A and B, each directly driven by an external field c i (t). Initialization, dynamics and readout are performed in the dressed basis by solving the master equation in Lindblad form [29,30]. with where ω i is the frequency of qubit i, δ i is the anhar- is the lowering (raising) operator, and g is the coupling strength. Open-system effects are expressed by the dephasing and relaxation Lindblad opera- , where the transfer function ϕ [31] accounts for discretization introduced by the AWG, bandwidth limitations of hardware, and for a constant scaling ϕ 0 , which translates input voltages to field amplitudes. We implement state preparation errors due to a non-zero initial temperature T by starting each experiment from the thermal state where Z = 2 k=0 exp{−E k /k B T } is the partition function with energies E 0,1,2 = 0, ω q , (2ω q + δ), and k B is the Boltzmann constant. readout misclassification is included, measuring state |n as state |m with probability p n→m . For example the probability of measuring a state ρ ψ = |ψ ψ| as |0 0| is Similarly to experiment, populations are estimated by averaging the results of multiple projective measurements, simulated as a multinomial draw from the distribution with probabilities {Π n }, thus introducing noise stemming from a finite number of measurement repetitions (commonly known as 'shot noise'). The values of the "real" model parameters are summarized in Tab. I.

B. Open-loop Optimal Control, C1
We assume that at the start of the C 3 procedure the parameters of the system are only known to a rough precision, with its qubit frequencies and anharmonicities chosen to be within a few MHz of their "true" values. In the simple model, the qubits are uncoupled three-level Duffing oscillators, evolution follows closed systems dynamics, and state preparation and measurement are assumed perfect. The Hamiltonian is Assuming this model, we design pulses for single-qubit gates using C 1 . To mitigate leakage, we choose a pulse ansatz with a Gaussian shape and a correction given by the derivative removal by adiabatic gate (DRAG) method [14], Here, Ω Gauss is a Gaussian envelope,Ω Gauss (t) is its time derivative, A is the amplitude of the drive, ω off is a frequency offset and the DRAG parameter η can be adjusted to reduce leakage into the second excited state [14,33]. The rotation axis can be freely chosen in the x-y plane by changing the phase of the drive signal ω d t → ω d t + φ xy , implementing the unitary rotations R(φ xy , θ) = exp{−i(cos φ xy σ x + sin φ xy σ y )θ}. By setting φ xy = n π 2 with n = 0, 1, 2, 3 and changing α = (A, η, ω off ) we aim to realize the single qubit gate-set for each qubit separately, eight gates in total, where X π/2 = {R(0, π/2)} and so on. With C 1 we use a gradient-descent method to find the parameters α that minimize the mean average gate infidelity where, χ 0,0 is the (0, 0)-th element of the Chi matrix representation of the gate error U † • U (α) between the ideal gate U and the implemented gate U (α) [34]. We optimize Gaussian pulses with a gate length of t g = 7 ns, for both qubits, using the gradient-based L-BFGS algorithm [36]. The obtained optimal pulses yield a mean infidelity of f 1 (α) = 6.6 × 10 −4 and f 1 (α) = 4.9 × 10 −4 on the simple model for qubit A and qubit B respectively, realistic values for fast gates using this simple parametrization. Next, we compare the performance of these pulses on the black-box device, where the gates instead yield a mean infidelity of 2.4×10 −3 for qubit A and 1.5×10 −3 for qubit B. In fact, performing an experimentally realistic randomized benchmarking (RB) [37][38][39][40] measurement on the device yields an error per gate of 2.3×10 −3 and 1.3×10 −3 comparable with the theoretical average infidelity. The degradation of performance from optimal control simulation (≈ 10 −4 ) to experiment (≈ 10 −3 ) shows a clear mismatch between the device and the simple model. The larger blue diamonds (larger red circles) highlight the best of 25 points generated and sampled at each iteration. In experimental practice, this batching helps reduce the overhead of loading pulses in AWG programming [32]. Both calibrations achieve the same final fidelity, however the optimal gates derived from the learned model provide a better initial guess. Assuming no SPAM errors the ORBIT value can be translated into an error per gate, indicated on the right axis. This is only meant to provide a rough estimate of the performance of the gate, noting that an ORBIT value of 0.5 represents maximum error per gate, i.e. completely depolarizing channels.

C. Calibration, C2
The next step is to calibrate the pulses derived by C 1 and improve their performance on the device. We use C 2 and employ a closed-loop, model-free, gradient-free optimization algorithm on an experimentally accessible figure of merit f 2 . Since the goal is to evaluate a gateset, we choose f 2 to be the ORBIT [19] (single-length RB) function averaging over N sequences. The survival probability, , is the probability to measure the state |0 (see Eq. 4) after applying random sequences composed of L Clifford gates [19], to the initial thermal state ρ init . The C k,j are the random gates sampled from the Clifford group C (for a single qubit |C| = 24), and C inv is chosen so that S k ≡ I in the ideal case. We use the atomic operations G from Eq. (7) to construct the set of Clifford gates, e.g. C 6 = X −π/2 • Y −π/2 • X π/2 , and from them construct N = 25 RB sequences of length L = 100. The survival probabilities m k are estimated by performing s = 1000 projective measurements and averaging.
To minimize f 2 , we employ the CMA-ES [41] algorithm, a gradient-free search that samples the loss function in batches, and is fairly robust to local minima and noise [42]. See [32] for an experimental demonstration. The optimal pulse parameters from C 1 are used as the starting point of the optimization, and the parametrization is kept as in Eq. (6). We perform the calibration for each qubit independently, with similar results. See Fig. 2 for the ORBIT calibration data of qubit B. The initial point suggested by C 1 has an ORBIT infidelity of 0.50 and is improved by the optimization to 0.12. To account for SPAM errors, we perform a full RB measurement and estimate the infidelity of the gates before and after as 1.3 × 10 −3 and 3.4 × 10 −4 respectively. Qubit A shows a similar improvement of RB estimated error from 2.3 × 10 −3 to 7.5 × 10 −4 .
FIG. 4. Sensitivity analysis of selected parameters for different models and data sets. In (a) we sweep the qubit frequency ωB and evaluate the goal function fLL on the learning data at each value. The star represents the optimal value returned by C3. Intermediate (blue, dot-dashed) and Full (red, solid) models, show the same frequency value, while the value of the Simple model (green, dashed) is dispersively shifted, as expected. In (b) and (c) we perform the same sweep for the chip temperature and T * 2 of qubit B respectively, evaluated on the full model for different learning data-sets. The ORBIT data is the same used in the full model in (a). Introducing the Quantum Process Tomography (QPT) data (purple, dotted), allows a more precise definition of the temperature. To determine T * 2 we use relaxation and dephasing data (orange, dot-dashed). Match values below 0 can occur because of noisy data, finite sampling and deviation from the assumption of Gaussian distribution of the data. More sensitivity plots are shown in the Supplemental Material [35].
For the purpose of learning we define the data-set D := {S k (α j ) → m j,k }, the collection of the experiments conducted during the C 2 calibration, consisting of pulse parameters α j and gate sequences S k (α j ), and the corresponding measurement outcomes m j,k .

D. Characterization, C3
In C 3 , we use the data-set D obtained during ORBIT calibration to improve the model of the system. For each measurement result m j,k we compute the equivalent simulation result m j,k (β) by calculating the dynamics of the sequence S k (α j ) given a set of model parameter values β = (ω i , δ i , ...). Since simulating the whole data-set is computationally costly, for the purpose of model learning we make a selection of eight pulse parameter sets j per qubit from the full data-set. Each parameter set includes k = 1, .., 25 sequences, meaning that we learn from a total of N = 400 measurement results, relabeled as m n . We then construct a goal function that captures how well the model prediction m n , with standard deviation σ n , agrees with the recorded values m n . Because of the finite number of measurements, the averaged m n are noisy estimates of the population, with a mean µ n and standard deviation σ n . Thus, they cannot be matched perfectly even when all model parameters are exact. However, we can determine the expectation value of the goal function f LL in the scenario where all m n are exactly a given number of standard deviations away from the underlying true value µ n . A detailed mathematical discussion is presented in Sec. IV C. To provide a more intuitive measure, we express the match f LL in terms of the number of standard deviations that would result in the same score.
To minimize f 3 (β), we use a combination of two algorithms: Gradient-free (CMA-ES) to avoid local minima and gradient-based (L-BFGS) to converge quickly once the right minimum has been identified. Fig. 3 shows the convergence of the C 3 optimization for different models. The simple model is not able to reproduce the device's results, as the optimization ends at approximately 8 standard deviations away. This demonstrates that the experiment on the device includes behavior not captured by the simple model. Spectator effects might be significant even when performing only single qubit experiments, making the completely uncoupled model insufficient. Another source of this inconsistency might be SPAM errors not accounted for in the model, that might play a large role in actual measurement results. The parameter values resulting from this C 3 process and all following ones are shown in Table I.
Going forward an informed decision has to be made about how to enhance the model. Since the true values of the parameters are not known in an experimental setting, we require a tool to determine the precision to which they are learned. We estimate the sensitivity to changes of model parameters around the optimal values β by performing one-dimensional scans and observing the degradation in model match score, f LL (D|β + δβ). Fig. 4(a) shows that sweeping the value of frequency of qubit B produces a highly irregular landscape of the match score f LL . Learning begins using just ORBIT data (left white section) that fixes qubit frequencies, anharmonicities, coupling and line transfer function to their true values. Then tomography data from a two qubit experiment is added (right gray section), which allows better identification the chip temperature T and the misclassification constants p0→1, and p1→0.
The simple model is then extended by adding the static coupling g of unknown exact value, resulting in the intermediate model. When repeating C 3 , we initialize model parameters from the initial, rough values. We do not carry over the learned parameters from the simple model to the intermediate model because, by introducing a coupling, we expect slightly shifted frequencies compared to the bare frequencies of the uncoupled qubits. Nonetheless, convergence of the match score shows no improvement from the simple model, still only reaching within approximately 8 standard deviations from experiment results (Fig. 3) and resulting in a similar sensitivity landscape in Fig. 4(a). This suggests that the simple model is a close dispersive approximation of the intermediate model. Indeed, we observe a dispersive shift [43] of 593 KHz, consistent with the expected g 2 /(ω B − ω A ) 666 KHz, given the coupling of g 20 MHz and the frequency difference ω B − ω A 600 MHz.
Finally, model complexity is increased by adding three relevant features: Markovian noise simulated by Lindblad master equation, initialization errors due to finite operating temperature and measurement errors in the form of misclassification. The system model is now of the same structure as the "real" model of the device. Starting from the best intermediate model parameters, the C 3 procedure converges satisfactorily, approaching the 0 standard deviations mark (Fig. 3) In Fig. 5(a) we show the value of each parameter of the full model during optimization, as we introduce different learning data (in the next sections), and compare with their true value (dashed lines). By learning the model parameters with the ORBIT data (white left section of each plot) the model frequencies ω A/B , anharmonicities δ A/B , coupling g and line transfer function ϕ 0 converge to their true value. The temperature and misclassification parameters are not recovered, and we believe this is due to an extra degree of freedom that is not resolved by the experiments we have performed, as the effects of misclassification, Eq. (4), and initial thermal distribution, Eq. (3), are similar and can be partially exchanged. Dephasing and relaxation times (not shown) are also not recovered. Indeed, in Fig. 4(c) we show that the sensitivity of the data to dephasing time T * 2 of qubit B is minimal. RB sequences perform an effective random dynamical decoupling [44], providing a possible explanation to this result. The data-set is the so-called validation set: The data points k that were not used in optimization of the model parameters. The simple and intermediate models show poor correlation as the simulation predicts a wide distribution of measurement outcomes for each recorded value. They also exhibit a tilt that can be attributed to SPAM errors not considered in the models. Only the full model produces a consistently high density distribution centered around the diagonal, with minimal spread due to the noisy data.

E. Validation of the learned model
After model matching on a subset of the data in the C 3 step, we now evaluate the predictive power of the learned models by computing the score on the rest of the data set (this is also known as a validation set in machine learning). This verifies that the selected subset captures all relevant behavior and alleviates the danger of overfitting.
Figures 6(a) to 6(c) depict the correlation between calibration data points m j,k and their model-based reconstructions m j,k . We evaluate the goal function f LL (β) over the validation set for the Simple, Intermediate and Full models and obtain values of 36.5 (≈ 8.4σ), 42.0 (≈ 9.2σ) and 0.028 (≈ 0.2σ) respectively. The conclusion is that, even though some parameters were not recovered by C 3 , the learned full model is indeed a Good Model, as it reproduces the behavior of the system on all previously recorded data points to satisfying accuracy. This does not prevent additional measurement data to expose new behavior in the system: The notion of the Good Model is always tied to the underlying data-set.
Furthermore, we now repeat the C 1 procedure on the Good Model (yielding average gate infidelities of 6.3 × 10 −4 and 1.1 × 10 −3 for qubit A and B respectively) and show that the resulting pulses give a near optimal performance on the actual device and allow for faster C 2 convergence, as seen in Fig. 2. One would expect the pulses derived from the Good Model to be exactly optimal on the actual device. Even though it is not the case here, this is not because of an inaccurate model, but rather because of a disparity between the figures of merit used in C 1 (average infidelity) and C 2 (single-qubit ORBIT). Average fidelity captures effects of the whole system, including, in this case, an effective ZZ-coupling between the two qubits caused by a slight repulsion of the |02 and |11 states, that are 300 MHz apart. Minimizing a single-qubit ORBIT infidelity does not adjust for this effect, as we can verify by evaluating both RB (which captures only one qubit at a time) and average infidelity before and after calibration. Indeed, the average infidelity of qubit B is 1.2 × 10 −3 (compatible with the performance of 1.1 × 10 −3 on the Good Model) but the error per gate is estimated by RB as 4.1×10 −4 . After the calibration the RB estimate is improved to 2.9×10 −4 but the average infidelity is worsened to 1.9 × 10 −3 . Performing simultaneous RB could resolve this issue.

F. Entangling gate
We further investigate the Good Model which was determined using only single-qubit calibration data by deriving a two-qubit cross-resonance (CR) gate [45,46] with C 1 . Both qubits are driven simultaneously at ω B , the resonant frequency of qubit B, to accumulate a phase ±π/2 conditioned on the state of qubit A [20]. Both drives are parameterized by flattop Gaussians. The resulting CR pulse has a gate infidelity of f av = 3.8 × 10 −3 . When evaluated on the "real" model the gate has an infidelity of f av = 4.3×10 −3 , again showing that the learned model predicts device behavior to high accuracy. Notably, the model learned using only single-qubit data was sufficient to accurately predict the performance of the two-qubit gate on the device. We suspect this to be caused by exchange interactions due to coupling and finite temperature: Even when performing only single-qubit gates, the finite temperature causes a partial excitation of higher states, which are then exchanged with the other qubit via the coupling and thus visible in the ORBIT data.
The performance of the gate on the device is verified with Quantum Process Tomography (QPT): We apply the CR gate preceded and followed by single-qubit gates to prepare and measure in the basis states, e.g. S = X π/2 ⊗ Y π/2 • CR • X −π/2 ⊗ Y π/2 [47], and again collect these measurements into our data-set. We believe that the entangling gate lifts the degree of freedom between misclassification and initial thermal distribution discussed before, hence we now perform another C 3 optimization, using the QPT data (256 sequences) and one ORBIT parameter per qubit (2 × 25 sequences) as the learning data. Parameter convergence is shown in the gray areas of Fig. 5(a), where temperature and confusion matrix values are adjusted closer to the true values. Fig. 4(b) substantiates the claim that the entangling gate data allows for a more precise learning of the chip temperature, exhibiting a narrower valley at the true value. However, we are still not able to learn the T 1 and T * 2 parameters, since the sequences in QPT are too short to be sensitive.

G. Relaxation and dephasing
To demonstrate how a specialized measurement is formulated within C 3 we determine the values of T 1 and T * 2 , using simple established sequences that are known to be sensitive to these parameters. The decay lifetime T 1 is determined by preparing the excited state of the qubit, followed by increasing wait times and then measuring the ground state population. We write the sequence as S (n) where X π/2 is our previously optimized π/2 gate and I n signifies n repetitions of the identity gate I. Similarly
Using this data-set we perform another C 3 optimization, freezing all model parameters learned until now and varying only the values of T 1 and T * 2 . By doing so we manage to determine their values to within 1µs of the true values (Fig. 7). This procedure is the C 3 equivalent of a common exponential decay fit to the data. However, with C 3 one does not require prior knowledge on the expected structure of the experimental results, i.e. an exponential decay. Hence, when matching the data C 3 also accounts for SPAM errors without the need to adjust the fitting function. Fig. 4(c) shows the sensitivity of f LL to the value of T * 2 of qubit B. The new data shows a clear improvement in the accuracy of the value obtained and the minimum is better defined. For increased sensitivity one would require more decay data to learn from.

H. Sources of error
The Good Model allows us to break down which of the model properties are preventing higher gate fidelities. To this end, we investigate the Good Model for components limiting the performance of the CR gate by idealizing aspects of the model. We investigate whether the Gaussian ansatz is limiting gate fidelities by further refining the optimal pulses using a piece-wise constant optimization with one pixel per AWG sample (as is done in [32]). Average infidelity improves only marginally from f DRAG av = 3.8 × 10 −3 to f PWC av = 3.6 × 10 −3 , suggesting other factors are limiting fidelities.
To find out if performance is limited by decoherence effects, we re-optimize the CR gate while disabling Lindbladian dynamics. By open-loop optimization in this idealized coherent setting the error is decreased from 3.8 × 10 −3 to 1.3 × 10 −5 . Thus, the 100 ns CR gate considered here is coherence limited, as is the case in most experimental implementations [20,48], making improvements in gate time essential [49].

IV. C 3 IN-DEPTH
Following is a detailed description of the C 3 tool-set, its modeling capabilities and a general formulation of the optimization problems discussed in the previous section.

A. Experiment modeling
To combine control and characterization, C 3 provides a detailed simulation that endeavors to encompass all relevant practical considerations of the experiment such as signal processing, SPAM errors, control transfer functions and Markovian noise. The simulator is used as the basis of the open-loop optimal control optimization (C 1 ) and the model parameter optimization (C 3 ). In both cases it is desirable to use gradient-based optimization algorithms [10,50]. However, it is extremely cumbersome to manually derive the full analytical gradients of the quantum dynamics, especially when it includes the properties described above. Instead, C 3 uses a numerics framework [51] which allows for automatic differentiation [52], making the tool-set more flexible and easily extendable. A similar approach is also used by [28,53] for control and characterization.

Signal processing
The simulation allows for the specification of control signals ε(t) as either analytical functions or as direct, piecewise constant AWG parameterization. Analytic parametrizations are sampled at the resolution of the waveform generator producing the envelope signal ε i = ε(t i ), representing voltages being applied to the control line, where the {t i } are the AWG sample times. The resulting signal will exhibit a rise time τ , due to the finite bandwidth of the control electronics. We model this by applying a convolution with modeling a Gaussian filter, and interpolating the sampled signal to higher resolution for simulation. An I/Q-Mixer combines this envelope with a local oscillator signal of frequency ω lo to where the in-phase and quadrature components are assigned by a control parameter φ xy , and modulated to introduce a frequency offset ω off on the drive. As noted in [54], in practice there will be additional errors during the mixing, which are not currently modeled. In transmitting this signal to the experiment, various distortions can occur, modeled by a response function ϕ, which also converts the field from line voltage to an amplitude c(t) = ϕ[u(t)].

Time evolution
The system Hamiltonian is with a drift H 0 and optional control Hamiltonians H k . The dynamics of the system are described by the timeordered propagator given by solving the time-dependent Schrödinger equation, and approximated numerically by U (t) 0 i=N U i . Here, U i = exp − i H(t i )∆t , and the total time is divided into N slices of length ∆t that are fine enough so that the Hamiltonian can be considered constant in the interval.
In application, we will rarely perform a single gate or pulse in isolation. Experiments such as randomized benchmarking or the various flavors of tomography involve long pulse sequences, that are inefficient to simulate as a whole. Instead, the C 3 simulator computes each propagator G of a defined gate-set G individually and compiles these matrix representations into sequences. This avoids the need to solve the equations of motions multiple times for the same exact pulses. As the propagators are calculated in the dressed laboratory frame (as opposed to the single-particle rotating frame), consecutive gates need to be adjusted to realign with the rotating frame of the drive signal, by applying a Z rotation with an angle of (ω lo + ω off )t g [54].
To include open-system effects, we apply the equivalent procedure to obtain the process matrix by solving the master equation in Lindblad form, where H is the Hamiltonian from Eq. (19), the L j s are Lindblad operators, and L is the generator in superoperator form [55]. The evolution of a state is obtained by applying the propagator as ρ f = U (t g )ρ i U † (t g ) for coherent evolution or ρ f = E(t g )[ρ i ] for incoherent evolution.

Initialization and readout
Given the temperature T of the device, the system is initialized in a mixed state where {|φ k } is the eigenbasis of H 0 and the normalization is given by the canonical partition function Z = We simulate readout by post-processing the final state ρ f : From the density matrix, represented in the dressed basis, we obtain a vector of populations p = (p k ) by taking the absolute square of the diagonal. This is consistent with a slow (or weak) readout scheme in experiment. Measurement and classification errors are modeled with a misclassification (confusion) matrix (p i→j ) ij [56] such that the measured populations are To simulate an experimental measurement with an average of l repetitions, we draw from a multinomial distribution of l trails and with probabilities p j .

B. Optimizations
For open and closed-loop optimal control as well as model learning, performing optimization processes is required.

Open-loop Model-based Control: C1
In the typical setting of open-loop optimal control [8,9], given a model of a system, we search for the optimal control pulses to drive the system to a desired state or generate a certain gate. Pulses are parameterized by an analytic ansatz (e.g. Gaussian pulse with DRAG correction [14] to remove Fourier components coupling to leakage levels), or by direct AWG samples. Constraints may be imposed to conform with experimental feasibility, such as power and bandwidth limitations. The goal function to be minimized is selected depending on the specified optimal control task, e.g. state infidelity for state transfer problems, or unitary trace infidelity for quantum gates [9,34]. We suggest the use of average gate infidelity as the goal function, as it is experimentally accessible by performing RB or QPT, allowing comparison of performance in simulation and experiment.
Formally, the controls are parameterized as a vector of real numbers α. Given a goal function f 1 (α), we search for min α f 1 (α). Optimal control methods such as GRAPE [11], Krotov [13,[57][58][59], and GOAT [10] have been devised to determine the gradient ∂ α f 1 (α) in order to facilitate convergence. These methods require a specific formulation of the problem and the analytical calculation of the gradient any additional elements in the model, whereas in C 3 , automatic differentiation allows to systematically account for any model feature, including, for example, line response functions or SPAM error.
The disadvantage of gradient-based algorithms is their propensity to get trapped in local minima. The severity of the problem is reduced by using a hierarchy of progressively more complex control ansätze. If this is insufficient, a short preliminary gradient-free search to find the convergence basin most often resolves the problem.

Closed-loop Model-free Calibration: C2
In calibration, a given pulse is optimized to improve a figure of merit f 2 (α), computed from experimental measurement results. In addition to gradient-free optimization algorithms, there are methods to approximate the gradients (e.g. [60]), however, such approaches are generally less efficient than gradient-free algorithms [10,61] as they require a high number of evaluations [62]. If the initial point of the optimization is given by C 1 , this implements the already established Ad-HOC [18] method. During calibration, sets of control parameters α j are sent to the experimental setup, alongside instructions of how to evaluate the current controls. For evaluating gate-sets, we suggest the ORBIT figure of merit, as it naturally performs a twirling of all sources of error, providing a single number to optimize. However, protocols tailored to specific needs can also be used, e.g. to obtain a desired conditional phase [5]. C 2 then optimizes the control parameters α j to minimize a figure of merit.
While specialized measurements provide a straightforward way to fine-tune controls related to specific device properties, they do not generally account for interdependency. For more complex setups with many parameters, such calibrations cannot be done without extraordinary effort [63]. In contrast, C 3 employs modern gradient-free optimization algorithms, such as CMA-ES (see App. B for further discussion), capable of optimizing dozens of parameters simultaneously, automating the task.

Model Learning: C3
Extracting the model from a data-set D can be thought of formally as analogous to the C 1 optimization task, where one varies the model parameters instead of the control parameters. For each measurement outcome m k in the data-set, the corresponding gate or pulse sequences S k (α j ) with control parameters α j are used to simulate the model's prediction m j,k = m(S k (α j ), β). The model learning goal function quantifies the match between the data-set and the simulation of a system with parameters β. In this paper, we use a rescaled log-likelihood where the σ k is the standard deviation of a binomial distribution with mean m k , resulting in a variation of the Mahalanobis distance [64]. This function is strictly correct under the Gaussian assumption and a two-level readout. See Sec. IV C for the extension for a multiple outcome readout. The measurement process on any physical device is noisy, i.e. each m k is an estimate of a true underlying µ k . Therefore, a realistic data-set D cannot be matched exactly by a deterministic simulation. The function f LL is designed such that, for n data points, its expectation value is 0 when the model predicts the means µ k correctly, and 1 2 n 2 if the distance is µ k − m k = nσ k for all ks, according to Eq. (33). This provides a more intuitive measure of model match than the abstract value of f LL , i.e. it allows to make a statement like "the model differs from the experiment by approximately n standard deviations".
Due to the complexity of the physical systems, a potentially high number of interdependent parameters and complex features of the landscape, it is difficult for the optimization to converge to the global optimum. Therefore, we take the tried-and-tested experimental approach of starting with a simple model and iteratively refining it. We modify the model and repeat the C 3 fit, optionally retaining the optimized parameters which are shared by the previous and new model. Alternatively, we collect additional data and repeat the optimization on the same model. We emphasize that at each of these steps the physicists' insights are required to evaluate the optimization's results, extend or discard models and decide whether collecting additional data is required. Furthermore, employing a gradient-based algorithm can, depending on the initial point, result in a local minimum. The optimizations presented here were successful when starting with a gradient-free CMA-ES search, known to be robust against local minima, switching over to the faster converging gradient-based L-BFGS method when a promising parameter region is identified. However, further research is required to find the best optimization strategy.
In contrast, we note that in C 3 a model takes explicit values for all its parameters, and is not represented as a high-dimensional distribution over model parameter space. This choice is driven by classical computationload considerations: Because the C 3 model is highly de-tailed, and, as consequence, associated simulations are non-trivial, we believe a full Bayesian approach to any of C 3 optimizations is not computationally viable at this time.

C. C3 Model Fitting Goal Function
When performing a series of experiments, k ∈ [1, . . . , K], on a quantum device, each experiment k is repeated a number of times and the normalized occurrences of the measurement outcomes are store in a result vector m k . These are collected in {m k } k=1,...,K , or {m k } for shorthand. Given a model and its parameters β, we aim to quantify how likely it is that it underlies the observed data with a function f 3 ({m k }|β). Hence, we need to determine the distance between the experimental result, m k , and the model prediction, m k (β). We define p k (m k |β) be the model-predicted probability distribution function (PDF) for the result of experiment k. As the m k are sampled from the readout distribution, we do not expect m k = m k (β). Rather, we aim to define the function f such that its expectation value, E[f ({m k } , β)], is zero if the underlying distributions from which the {m k } are drawn are the same as the model-predicted PDFs.

The Gaussian Assumption
To simplify calculation of E[f ({m k } , β)], we can make some assumptions regarding the underlying distributions. The natural p k (m k |β) PDF is multinomial, determined by the dimension of the qubit Hilbert space d k (or binomial if dealing with a single qubit with no leakage levels). Under the assumption that for a large number of shots all possible readouts values are likely to appear, then by the central limit theorem (De Moivre-Laplace theorem), we can approximate p k with a multivariate normal distribution. Although, the multinomial distribution has a non-diagonal covariance matrix, one can diagonalize the distribution and decompose it as a product of onedimensional Gaussian distributions. Thus, we write the PDF as a sum of k (d k − 1) such distributions, redefine K to equal the previous k (d k − 1) and the {m k } to be their means.

The model match distribution
We shall use { µ k (β)} and { σ k (β)} to denote the mean and standard deviation of the model-predicted PDFs (after Gaussian assumption and multinomial diagonalization), and {µ k } and {σ k } to denote the commensurate experimental values. We note that {µ k } and {σ k } are unknown and unmeasured, and {m k } only provides an estimate of the mean. The simulation values { m k } on the other hand are deterministic and thus represent an exact estimate of the mean, hence { m k ≡ µ k }.
The model-predicted PDF is given by the product of normalized Gaussian distributions, and gives the likelihood of the {m k } given the model parameters β as are the individual Gaussian distributions. We then construct the goal function as the average log-likelihood, rescaled to give the desired expectation value, Here the K √ · gives the average of the log-likehoods, the √ 2π σ k removes the normalization of the Gaussians, such that they take value 1 when m k − µ k = 0, and the loglikelihood is zero, and the √ e removes the residual part of the expectation caused by the noise in the {m k }.
Then, in the general case, when the Gaussians determined by the model are not the same as the Gaussians in the experimental data: In the limit that both distributions have the same stan- Var f σ← σ . Such values indicate the standard deviation expected by the model, σ k is larger than the standard deviation observed experimentally, σ k .

D. Model analysis
Both during and after the learning process, it is beneficial to interrogate the model to estimate its properties and their impact on the system behavior. As part of the C 3 tool-set we perform sensitivity analysis for system parameters: Sweeping a single parameter, e.g. qubit frequency, across the range of interest, while keeping other parameters at their current best value, evaluating the model match score at each point, as seen in the example ( Fig.  4(a)). The result is a 1-D cut through the optimization landscape that may exhibit a well-defined minimum, multiple extrema indicating a difficult optimization, or even appear flat in the case when a parameter does not affect the behavior of the current experiment. This landscape depends on both the selected model and data it is compared to. Depending on the ruggedness of the sensitivity, one might choose to utilize a gradient-based algorithm from the start or to first perform a gradient-free exploratory search to avoid local minima. In the case of a flat sensitivity, there are two courses of action: If the parameter is of little interest for successive experiments, it may be removed or set to a convenient value within the flat range; otherwise, one needs to design an experiment producing additional data that is sensitive to the parameter. The physicists' knowledge of common experimental practices (e.g. Rabi, Ramsey, Hahn echo sequences) and intuition guides the decision for the experiment design. When suspecting correlations between parameters, cuts in single dimensions are not enough and higher dimensional sweeps are necessary. After a successful learning process, the sensitivity analysis gives an estimate of the precision to which each parameter has been determined.
Furthermore, the simulation allows insight into the behavior of the system. Using well established methods such as time-resolved state and process tomography, it is possible to identify coherent errors, such as leakage out of the computational subspace, over-rotations, and the effects of noise. A Good Model also provides the basis for an error budget, as it contains the same limitations as the experiment it accurately predicts. The model can be used for extrapolation by idealizing certain aspects suspected as causes of infidelity (e.g. T 1 setting to infinity), and re-deriving control pulses using a C 1 optimization. The respective gain in fidelity gives an estimate of the error that this aspect is responsible for, suggesting areas of improvement for future devices.

V. DISCUSSION AND OUTLOOK
In conclusion, we have described C 3 , an integrated methodology to improve quantum device performance that combines characterization, calibration and control. We have detailed its approach and implementation, demonstrating, on a synthetic QPU device, the individual methods and how they are synthesized into a more integrated concept. Analyzing single-qubit calibration data we successfully extracted an accurate model of the device, including realistic experimental considerations: line transfer functions, limitations of control electronics, readout error and finite operating temperature. From this model we were able to derive a working high-fidelity twoqubit gate, without requiring any further calibration.
This approach represents a holistic theoretical take on the experimental workflow of a complex quantum computing experiment, that takes into account interactions between different tasks of an experimental lab. C 3 provides a path to achieve, starting from an incomplete understanding of the system, both high-fidelity pulses and an accurate model. It integrates the tasks of open-loop control (that would require an already accurate model) and of calibration (that would require an experimentspecific fine-tuning procedure). Most notably, it provides the tools to reflect on the experiment outcome and gate performance, improving the model description of the system and providing insight into its behavior. C 3 is not a "black-box" experiment controller that replaces physicists or engineers -rather, it reduces tedious tasks allowing for interaction with the quantum device on a more structural level. Instead of simply producing high-fidelity operations, C 3 provides meaningful output in the form of a Good Model of the system, and other insights such as an error budget and a sensitivity analysis. In this sense, C 3 is not to be confused with any single optimal control or benchmarking technique, as it includes results from decades of research in these fields aimed at making controls that allow to actually reach high fidelities efficiently [8,50], unifying them into one framework.
We expect that the application of C 3 will first benefit scalable implementations of quantum processors based on manufactured solid-state systems, such as superconducting and semiconducting qubits. There, the dependence of model parameters on fabrication means that many elements of the model are in fact uncertain. On the other hand, other scalable implementations of quantum computers contain such elements directly in their quantum description: Ion trap gates involve degrees of freedom of the trap, impurity spins depend on their detailed position etc. -thus we expect that C 3 will also play a key role in those types of systems.
In the near future, we intend on extending the initial version of the C 3 tool-set to the generation of robust controls, automatic experiment design, multi-parameter sensitivity analysis, active model learning, and more. The simulator will be enhanced to include non-Markovian noise, a detailed simulation of the readout process, echoes on control lines and all other phenomena needed to produce a Good Model of real-world systems. C 3 is currently being integrated with existing quantum computing software stacks, which would allow users to study custom pulse schedules [47,94] and perform model learning based on data gathered from quantum computers, for example, with Qiskit Ignis [95]. Experimental application of C 3 is ongoing (e.g. [32]).
It is our hope that C 3 will not only provide insights into, and assist optimization of, current experiments, but also help guide the design of next-generation quantum devices, be it manually or by integration into automatic hardware design frameworks [76,77,96].
The task of characterization of quantum devices in general has received extensive attention. It would be presumptuous of us to even attempt a complete survey, therefore, we shall limit ourselves to a very limited look at a subset of model-specific methods we have subjectively found informative to our own work.
The standard approach at addressing the lack of a Good Model, as defined above, is to perform a long list of model-specific characterization experiments, each designed to measure a different parameter of the model: measure parameters of the readout resonator using frequency sweeps; qubit frequency measurements and relaxation time T 1 require Rabi experiments [97] (and with some extra effort the higher levels can be extracted); Ramsey [98] and Hahn echo measurements [99] provide dephasing data (under the Markovian assumption, which is known to be an over-simplification [100,101]); measuring the control line response functions [102][103][104][105][106], the noise spectra [28,101,107,108], continuous drifts in system parameters [109][110][111][112], and discontinuous jumps in parameters such as T 1 [113,114]; state Preparation and Measurement Errors (SPAM) can be extracted from Randomized Benchmarking (RB; e.g. [37]) or dedicated procedures, such as [115]; qubit cross-talk can be measured by the method described in [116,117] and many more. Model specific methods also exist for learning spin chain, lattice Hamiltonians and other multi-particle systems with a predefined network topology under limited access [118][119][120][121][122].
Appendix B: Choice of Gradient-Free Algorithms Naturally, using an automated approach to solve the gradient-free optimization problem of calibration seems promising, and many experiments are currently using the Nelder-Mead algorithm [123,124] for their calibration tasks. As an example, Fig. 9 shows an underlying parameter landscape of an optimization, obtained by simulating DRAG corrected pulses. Calibration can be rather difficult, as depending on the starting position the optimizer, algorithms will have to overcome local minima and deal with intrinsic noise. It is further noteworthy that this landscape is rather unique to the used parametrization, the chosen goal function and ultimately the properties of the physical system.
Research of gradient-free optimizers is a vast and active field, with hundreds of published black-box optimizers, including Evolutionary Strategies (ES), Particle Swarm Optimization (PSO), Differential Evolution (DE), Random Search, Simultaneous Pertubation Stochastics Approximation (SPSA), Nelder-Mead Method, Bayesian Optimization, and more. From a preliminary investigation, we recommend evolutionary strategies such as CMA-ES, which is currently the default in C 2 . This algorithm performed well in most cases, exhibited good robustness to noise, can handle local extrema and requires relatively few evaluations. Similar conclusions have been reached independently by [125]. We make no claim as to the optimality of the optimizer chosen, and defer a more detailed discussion of the subject to future publications.
Appendix C: Open-source implementation C 3 is implemented as an open source project available at https://q-optimize.org under the Apache 2.0 license. The software is written in Python to interface conveniently with common experiment controllers, and has already been used in tandem with PycQED [126], Labber [127] and LabView [128]. The interface can occur at various levels of abstraction, from sharing control parameters to sampled waveform values. A modular design allows for Hamiltonian or Lindbladian descriptions of common physical systems (fixed and flux-tunable qubits, res-

FIG. 9.
A 2D cut through the landscape of a calibration goal function. The system chosen is a multi-level qubit, with the control sequence being a single length RB-sequence. Control parameters αj were the Gaussian pulse amplitude, DRAG coefficient and frequency offset of a DRAG-corrected pulse. The RB-sequence was chosen to implement the identity operation, ideally leaving the qubit in the ground-state. The infidelity (seen as the color ranging from 0 to 1) is defined as the overlap of the final state with the system's ground state. (a) The landscape of the simulated system, as a cut through the plane of pulse amplitude A and DRAG coefficient η, for a fixed RB-sequence. Multiple local minima can be observed. (b) A higher resolution plot of the same landscape. Further local minima can be observed in the neighborhood of the global minimum.
onators, different types of coupling), specification of a list of devices to model the signal chain of the experiment (local oscillator, AWG, mixers, distortions and attenuations), different types of readout processing, and various fidelity functions. All components can be edited by the user or taken from reference libraries, accommodating to different needs. Configurations and data are stored as JSON files, and the full capabilities are accessible as command-line scripts, allowing for easy automation.
Numeric calculations are performed using TensorFlow [51]: The simulation of the dynamics and the pre and post processing are formulated as a network, with welldefined inputs (e.g. control and model parameters) and outputs (goal function values), connected by many nodes, each performing a relatively simple operation (e.g. matrix exponentiation). TensorFlow enables the numerical computation of the Jacobian of a calculation -the gradient of each of the network outputs with respect to the network inputs (this capability is the evolution of what is known as back-propagation learning process in neural networks [129]). This process of automatic differentiation facilitates the modular structure, as any new component inherits this property, removing the need to analytically derive its gradient. Furthermore, the TensorFlow simulator is scalable, allowing deployment on a variety of high-performance computing hardware.
Each component of the control stack and model needs to conform to a general boilerplate that specifies what parameters it contains and how they are used. In this modular design, each class represents a component of the experiment that takes an input applies some parameter-dependent function to it and returns a result. For example, an envelope function for pulses would have this structure: The only requirement to this code is that mathematical functions have to be taken from the TensorFlow package to allow for automatic differentiation. As an example of a control stack element, the finite rise time of an AWG is realized with the following code: A signal processing chain is represented by putting the output of one control stack element into the next. In calculating figures of merit, the user can choose from a library of functions or supply their own. For example, the infidelity of a state transfer process from |ψ 0 to |ψ ideal , implemented by the simulated propagator U as follows: def s t a t e _ t r a n s f e r _ i n f i d (U , psi_ideal , psi_0 ) : psi_actual = tf . matmul (U , psi_0 ) overlap = tf_abs ( tf . matmul ( tf . linalg . adjoint ( psi_ideal ) , psi_actual ) ) infid = 1 -overlap return infid