Experimental Deep Reinforcement Learning for Error-Robust Gateset Design on a Superconducting Quantum Computer

Quantum computers promise tremendous impact across applications -- and have shown great strides in hardware engineering -- but remain notoriously error prone. Careful design of low-level controls has been shown to compensate for the processes which induce hardware errors, leveraging techniques from optimal and robust control. However, these techniques rely heavily on the availability of highly accurate and detailed physical models, which generally achieve sufficient representative fidelity only for the simplest operations and generic noise modes. In this work, we use deep reinforcement learning to design a universal set of error-robust quantum logic gates on a superconducting quantum computer, without requiring knowledge of a specific Hamiltonian model of the system, its controls, or its underlying error processes. We experimentally demonstrate that a fully autonomous deep reinforcement learning agent can design single-qubit gates up to $3\times$ faster than default DRAG operations without additional leakage error, while exhibiting robustness against calibration drifts over weeks. We then show that $ZX(-\pi/2)$ operations implemented using the cross-resonance interaction can outperform hardware default gates by over $2\times$ and likewise exhibit superior calibration-free performance, by various metrics, up to 25 days after optimization. We benchmark the performance of deep reinforcement learning derived gates against other black-box optimization techniques, showing that deep reinforcement learning can achieve comparable or marginally superior performance, even with limited hardware access.


I. INTRODUCTION
Large-scale fault-tolerant quantum computers are likely to enable new solutions for problems known to be hard for classical computers. The quantum information community has recently made great strides towards realizing such systems; a significant step towards quantum advantage (when a quantum computer can solve a practically relevant problem "faster" than a classical computer) was demonstrated by Google [1] and the Chinese Academy of Sciences [2]; and cloud-based quantum computing offerings are now commercially available [3-6]. However, demonstrating a computational advantage on a problem of practical importance remains an outstanding challenge for the sector.
The extreme sensitivity of today's hardware to noise, combined with fabrication variability and imperfect quantum logic gates, is the key factor limiting the community's ability to reliably perform quantum computations at scale [7]. Fortunately, it has been repeatedly demonstrated that the use of robust and optimal control techniques for gateset design may lead to dramatic improvements in hardware performance and downstream computational capabilities, as a complement to both ongoing hardware improvements and future application of quantum error correction. The design process is straightforward in cases where Hamiltonian representations of both the physical and the noise models in the underlying system are precisely known, but has proven considerably more difficult in state-of-the-art large-scale experimental systems. A combination of effects introduces challenges not faced in simpler systems, including: unknown and transient Hamiltonian terms [32] with nonlinear dependence on applied control signals [33]; control signal distortion in transmission [34]; undesired crosstalk [35] and frequency collisions between neighboring qubits; and time-varying environmental noise. In all cases, complete characterization of Hamiltonian terms [35], their functional dependencies, and their temporal dynamics becomes unwieldy as the system size grows.
In this manuscript, we overcome this fundamental challenge by demonstrating the experimental use of Deep Reinforcement Learning (DRL) for the efficient design of an error-robust universal quantum-logic gateset on a superconducting quantum computer without any a priori hardware model. In our approach, we design an agent which iteratively and autonomously constructs a model of the relevant effects of a set of available controls on quantum computer hardware, incorporating both targeted responses and undesired effects such as cross-couplings, nonlinearities, and signal distortions. We task the agent with learning how to execute the high-fidelity constituent operations of a universal gateset -- an R_X(π/2) single-qubit driven rotation and a ZX(−π/2) multiqubit entangling operation -- by allowing it to explore a space of piecewise-constant (PWC) operations executed on a superconducting quantum computer programmed using Qiskit Pulse [36], and with constrained access to measurement data, in stark contrast to previous theoretical studies [37]. First, we demonstrate that single-qubit gates can be designed which outperform the default DRAG gates in randomized benchmarking with up to a 3× reduction in gate duration, down to ≈ 12.4 ns, and without evident increases in leakage error. Due to the training process, these gates exhibit robustness against common system drifts (characterized in [13]), providing up to two weeks of outperformance with no need for intermediate re-calibration of the gate parameters; the daily performance variation of these gates is consistent with limits imposed by measured fluctuations in device T_1. Next, we show that the DRL agent is able to create novel implementations of the cross-resonance interaction which show up to ≈ 2.38× higher fidelity than calibrated hardware defaults.
We characterize gate performance using both interleaved randomized benchmarking and gate repetition to better reveal coherent errors, achieving F ZX > 99.5% across multiple hardware systems, and maintaining F ZX > 99.3% over a period of up to 25 days with no additional recalibration. Finally, we demonstrate the use of these DRL defined entangling gates within quantum circuits for the SWAP operation and show 1.45× lower error than calibrated hardware defaults. Across these demonstrations we benchmark the DRL designed two-qubit gates against gates created using a black box automated optimization routine leveraging a custom implementation of Simulated Annealing (SA) and observe comparable performance between the two methods, suggesting incoherent processes as the current bottleneck.

II. OPTIMIZED QUANTUM LOGIC DESIGN WITH DEEP REINFORCEMENT LEARNING
We begin by defining the problem of quantum logic gate optimization and its measures of success. Consider a conventional Hamiltonian description for coupled transmons written using a control Hamiltonian, H_ctrl(t)[{Ω^ω_{I,j}(t), Ω^ω_{Q,j}(t)}]. This Hamiltonian possesses some functional dependence on applied time-varying microwave control signals Ω^ω_{I/Q,j} targeting qubits labeled by index j and written in a conventional I/Q decomposition for a drive at frequency ω. All terms that are not in our control Hamiltonian are contained in an additional term H_system(t), including the bare qubit Hamiltonian, deterministic "drift" terms, and stochastic noise terms, so that H_tot(t) = H_ctrl(t) + H_system(t).
The aim of the control problem is to find the optimal set of functions {Ω^ω_{I,j}(t), Ω^ω_{Q,j}(t)} such that the resultant unitary evolution, U = 𝒯 exp(−i ∫_0^T dt' H_tot(t')), matches a target unitary U_target. Here T is the total evolution time and 𝒯 is the time-ordering operator, which defines the calculation procedure. Typically, numerical optimization techniques [13,38-40] may be employed in appropriate reference frames in order to construct the controls via minimization of the gate infidelity, 1 − |Tr(U_target^† U)|²/D², for a D-dimensional Hilbert space. Performing useful gate optimization in this way -- especially when trying to exceed state-of-the-art experimental fidelities in realistic systems -- requires a comprehensive understanding of the relevant terms in the system Hamiltonian.
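The objects in this optimization problem can be made concrete with a minimal numpy sketch: a time-ordered propagator for piecewise-constant segments and the infidelity measure above. The two-segment resonant X drive at the end is a toy illustration with made-up amplitudes, not a pulse from the paper.

```python
import numpy as np

def u_step(h, dt):
    """exp(-i h dt) for a Hermitian h, via eigendecomposition (hbar = 1)."""
    evals, evecs = np.linalg.eigh(h)
    return (evecs * np.exp(-1j * evals * dt)) @ evecs.conj().T

def pwc_propagator(h_segments, dt):
    """Time-ordered product U = U_N ... U_1 for piecewise-constant segments."""
    u = np.eye(h_segments[0].shape[0], dtype=complex)
    for h in h_segments:
        u = u_step(h, dt) @ u  # later segments act from the left
    return u

def gate_infidelity(u, u_target):
    """1 - F with F = |Tr(U_target^dag U)|^2 / D^2 for a D-dimensional space."""
    d = u.shape[0]
    return 1.0 - abs(np.trace(u_target.conj().T @ u)) ** 2 / d ** 2

# Toy example: two equal segments of a resonant X drive, H = (Omega/2) sigma_x,
# with Omega chosen so the total rotation angle is pi/2 (illustrative values).
sx = np.array([[0, 1], [1, 0]], dtype=complex)
omega = np.pi / 4
u = pwc_propagator([omega / 2 * sx] * 2, dt=1.0)
u_target = u_step(np.pi / 4 * sx, 1.0)  # ideal R_x(pi/2) = exp(-i pi/4 sigma_x)
```

Model-based optimizers minimize `gate_infidelity` through exactly this propagator; the argument of the paper is that on real hardware the `h_segments` entering the true evolution are not the ones you wrote down.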
In practice, the real Hamiltonian of the system may include nonlinear distortions of the control fields (F_{I/Q}(·)), nonlinear coupling of the distorted control fields into new Hamiltonian terms (H^nl_ctrl), and additional hidden terms that may change with application of the control fields, yielding an effective total Hamiltonian H̃_tot(t) = H_ctrl(t)[{F_I(Ω^ω_{I,j}), F_Q(Ω^ω_{Q,j})}] + H^nl_ctrl(t) + H_system(t). It is not generally tractable to identify this diverse range of Hamiltonian terms in large interacting systems, and even small simplifying approximations will typically lead to catastrophic failure of numerically optimized controls due to the sensitive manner in which the system is steered through its Hilbert space.
An alternative approach to quantum logic design based on iterative, interactive learning obviates the need for an accurate representative model of the physical system -- in particular the various new terms appearing in H̃_tot. Deep reinforcement learning techniques stand out in their ability to handle high-dimensional optimization problems in the absence of labeled data or an underlying noise model [41,42]. In DRL, an agent interacts with its environment by taking actions and receiving feedback in the form of state observables and rewards. By learning to maximize these rewards, the agent learns a targeted behavior such as coordinated robotic motion [43], autonomous driving [44], or, in the present case, how to perform high-fidelity quantum logic operations. The uniqueness of DRL comes from the fact that, unlike in other closed-loop optimization methods, intermediate information is extracted in addition to a final measure of reward in order to construct and refine an internal model of the system's most relevant dynamics (this model need not be interpretable under human examination).
The basic ingredients of DRL are states, actions, and rewards (Fig. 1a). The state is a snapshot of the environment at a given time; in the context of gate optimization, it can be any suitable representation of the quantum state of the quantum device. The action is the means by which we steer the state of the environment -- for example, a low-level electromagnetic control pulse or any other operation that alters the state of the environment. Finally, the reward encapsulates the feedback from the environment: after an action is executed, the state of the environment is observed again, and a reward is assigned in order to quantify the quality of the last action. The essence of DRL is the ability to estimate and maximize the long-term cumulative reward learned as a deep-neural-network-based agent interacts with an environment over many trials.
FIG. 1. (b) At each DRL step the action, e.g., the amplitude and phase of the next segment(s), is chosen from a set of allowed actions while the previous segments' actions are held fixed. (c) A set of actions that forms a full control pulse is an episode. (d) After every step, an estimation of the quantum state is performed using simplified tomography, and at the conclusion of each episode an estimate of fidelity is made. Each constituent measurement contributing to this process is repeated 1024 (256) times for final (intermediate) states. In order to reduce the effect of readout errors, we employ a standard measurement-error mitigation scheme [45]. (e) In order to estimate gate quality, a reward is constructed by performing a weighted average of measured fidelities for sequences employing different numbers of control-pulse repetitions. (f) Example of DRL optimization convergence for a control pulse implementing a single-qubit gate.
DRL has recently been deployed for a variety of quantum control problems using numerical simulation [46], including both theoretical gate optimization [37,47,48] and other tasks [49-57]. In these simulation-based studies it was possible to make at least one of the following strong assumptions: the system suffers zero noise or has a deterministic error model; controls are perfect and instantaneous; and quantum states are completely observable across time. Moving beyond these studies to efficient experimental implementations on real quantum computers, we focus on designing DRL algorithms compatible with realistic execution and measurement constraints and the complexity of H̃_tot.
This involves overcoming three main challenges that we discuss below: (i) creating an efficient representation of the effective Hamiltonian in a manner compatible with experimental implementation, (ii) creating a suitable measurement routine compatible with the limited observability of the system state and its unitary evolution in a quantum computer, and (iii) designing appropriate agents which can efficiently learn system dynamics based on these constrained controls and limited access to measurements. An overview of the complete DRL learning process we employ to perform experimental gateset design is featured in Fig. 1, and complemented by pseudocode in Alg. 1.
First, we seek an effective PWC control Hamiltonian, which physically corresponds to control pulses with a PWC envelope. The N_seg-segment control Hamiltonian may be written as H_ctrl(t) = Σ_{k=1}^{N_seg} 1_k(t) H_k, where 1_k(t) = 1 if (k − 1)Δt < t < kΔt and is zero otherwise, with Δt being the duration of each segment and H_k the Hamiltonian generated by the constant control amplitudes of segment k. The definition of this PWC control Hamiltonian -- which can have multiple constituent terms -- is found through the iterative reinforcement learning process shown schematically in Fig. 1b-c. This choice is convenient due to limitations in programming and manipulating real hardware, and any deviations from the idealized PWC waveform applied to the device, due to e.g. signal distortions, are directly captured in the measurements performed by the agent and the effective model it constructs. Next, in order to effectively learn the hardware model, we observe and store the state of the quantum computer at intermediate steps. We observe the state after step k, and allow the agent to choose the action for the next step. Due to quantum state collapse on projective measurement, at the beginning of a new step our protocol must first reinitialize the qubits and repeat the exact sequence of actions through step k before applying the new action at step k + 1. Thus we sequentially build up a complete execution of a candidate quantum logic gate at the conclusion of an episode over N_seg state-observation cycles (giving a total of N_seg N_ep state observations over N_ep episodes).
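The indicator-function decomposition above amounts to a step-function envelope that the agent parameterizes segment by segment. A small sketch, with a made-up 8-segment waveform standing in for an agent-emitted pulse:

```python
import numpy as np

def pwc_envelope(amplitudes, dt):
    """Return Omega(t) = sum_k 1_k(t) * amplitudes[k-1], where the
    indicator 1_k(t) is one on the k-th interval of width dt
    (the boundary convention at t = k*dt is ours)."""
    def omega(t):
        k = int(np.floor(t / dt))  # 0-based segment index
        return amplitudes[k] if 0 <= k < len(amplitudes) else 0.0
    return omega

# An 8-segment I-quadrature waveform in arbitrary units, of the kind the
# agent would emit (these amplitudes are made up for illustration).
amps = [0.1, 0.3, 0.5, 0.5, 0.4, 0.2, -0.1, -0.2]
dt = 12.4 / 8  # ns per segment for a 12.4 ns single-qubit gate
omega_i = pwc_envelope(amps, dt)
```

On hardware, such an envelope would be handed to Qiskit Pulse as a sampled waveform; any distortion between this idealized step function and the field the qubit actually sees is absorbed into the agent's learned model rather than corrected analytically.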
The measurement protocols we employ to observe the system state are based on the concepts of quantum state tomography (Fig. 1d). For optimization of the single-qubit R_X(π/2), we prepare and measure qubits in the three Pauli bases, and also measure population leakage beyond the computational subspace of the transmon qubit. For the two-qubit ZX(−π/2) entangling gate, we perform full tomography of the computational basis by collecting nine measurements in order to evaluate the expectation values of all Pauli strings. The measurement protocol is repeated to build projective-measurement statistics, and several different initial states are chosen in order to specify the gate uniquely. For all gates the state of the system is represented by a real vector of expectation values. We find that this approach provides a sufficiently reliable approximation of the state and system dynamics. In addition to state observation we must explicitly calculate the reward at the end of an episode. To do this, the fidelity is estimated as an element of the reward at the end of full episodes, i.e., after the full implementation of the candidate gate, and thus scales with the number of episodes, N_ep. We evaluate fidelity relative to the target operation by acting on each initial quantum state of the qubits with a variable number of gate repetitions. The final fidelity of the waveform is then estimated as a weighted mean over the different repetition numbers applied to the different initial states (Fig. 1e). The set of initial states and the number of gate repetitions are chosen such that the gate operation is uniquely tested and the cost/reward function captures the theoretical error in Eqn. 1. This repetition-based measurement scheme serves to amplify the gate error in order to overcome the so-called state preparation and measurement (SPAM) error endemic to real hardware, estimated at ≈ 4% in the hardware we employ.
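The weighted-mean reward over repetition numbers can be sketched as follows; the paper does not specify the weights, so the uniform default (and the illustrative fidelity values) are assumptions:

```python
import numpy as np

def gate_reward(fidelities_by_rep, weights=None):
    """Collapse state fidelities measured after r_i gate repetitions into
    a scalar reward via a weighted mean. fidelities_by_rep maps a
    repetition count to the fidelities measured over the chosen initial
    states. Longer repetition sequences amplify small gate errors above
    SPAM noise, so they could be weighted more heavily; the uniform
    default here is an assumption."""
    reps = sorted(fidelities_by_rep)
    if weights is None:
        weights = {r: 1.0 for r in reps}
    norm = sum(weights[r] for r in reps)
    return sum(weights[r] * np.mean(fidelities_by_rep[r]) for r in reps) / norm

# Illustrative fidelities after 1, 3 and 5 applications on two initial states.
reward = gate_reward({1: [0.99, 0.98], 3: [0.95, 0.96], 5: [0.92, 0.93]})
```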
Combining fidelity estimates produced for different numbers of gate repetitions also averages over pathological contextual errors that arise in experimental hardware as circuit lengths vary.
Finally, we design a DRL algorithm which maximizes the long-term reward using an agent compatible with the above constraints, based on a policy-gradient optimization algorithm. A policy π(a|s) is a function that takes the state of the system as input and returns a probability distribution over actions. This distribution is then used to decide the next action such that over many steps and episodes the reward is maximized. The policy function π(a|s) is represented by a deep neural network whose trainable parameters are updated during the learning process in order to efficiently approximate the optimal policy over this space. A rigorous comparison between the performance of different DRL algorithms for this problem in a simulated environment -- which led to our selection of the policy gradient -- was investigated in [58].

Algorithm 1 DRL gateset design
1: Initialize the policy parameters θ
2: for each training iteration do
3:   for each episode in the batch do
4:     for k = 1 to N_seg do
5:       Initialize the quantum state of the qubit(s) |ψ_{0,j}⟩
6:       Evolve the state by the first k segments
7:       Estimate the qubit(s) state s_k
8:       Choose next action according to π_θ(a_k|s_k)
9:       Store trajectory (s_{k−1}, a_{k−1}, s_k)
10:     end for
11:     for p in repetitions do
12:       Repeat the final pulse p times
13:       Measure, compute and store state fidelity
14:     end for
15:   end for
16:   Compute reward based on state fidelities
17:   Update the policy's parameters θ
18: end for
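The learning loop of Alg. 1 can be sketched with a minimal REINFORCE-style policy-gradient agent. Everything below is a toy stand-in for illustration: the linear-Gaussian policy replaces the deep network, the fake "state" replaces the tomography vector, and the reward simply scores how close the total pulse area comes to a π/2 target rather than querying hardware.

```python
import numpy as np

rng = np.random.default_rng(1)

class GaussianPolicy:
    """pi(a|s) = N(w.s + b, sigma^2): a linear-Gaussian stand-in for the
    deep network, updated with the REINFORCE gradient and a mean-reward
    baseline for variance reduction."""
    def __init__(self, state_dim, sigma=0.2, lr=5e-3):
        self.w = np.zeros(state_dim)
        self.b = 0.0
        self.sigma, self.lr = sigma, lr

    def act(self, s):
        mu = self.w @ s + self.b
        a = rng.normal(mu, self.sigma)
        glog = (a - mu) / self.sigma**2  # d/dmu of log pi(a|s)
        return a, glog

    def update(self, episodes):
        baseline = np.mean([r for _, r in episodes])
        for traj, r in episodes:
            for s, glog in traj:
                self.w += self.lr * (r - baseline) * glog * s
                self.b += self.lr * (r - baseline) * glog

def run_episode(policy, n_seg=4):
    """One episode: pick one segment amplitude per step; the 'state' fed
    back between steps is a crude stand-in for the tomography vector."""
    amps, traj = [], []
    for k in range(n_seg):
        s = np.array([sum(amps), k / n_seg])
        a, glog = policy.act(s)
        traj.append((s, glog))
        amps.append(a)
    reward = -abs(sum(amps) - np.pi / 2)  # toy reward: hit total area pi/2
    return traj, reward

policy = GaussianPolicy(state_dim=2)
for batch in range(150):  # batched episodes, as when queuing on hardware
    episodes = [run_episode(policy) for _ in range(16)]
    policy.update(episodes)
```

The batch-of-16 structure mirrors the two-qubit batching described in Sec. III; in the experiment the reward would come from the repetition-amplified fidelity estimates rather than this toy target.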

III. EXPERIMENTAL DEEP REINFORCEMENT LEARNING ON SUPERCONDUCTING QUANTUM COMPUTERS
The DRL algorithm is executed on experimental hardware via cloud-access to a superconducting quantum computer operated by IBM. Commands are executed using Qiskit Pulse to program various accessible analog control channels relevant to implementation of single and multiqubit gates. The DRL agent is separately hosted on a cloud server in order to allow an efficient learning procedure and is provided command of the quantum computer for fixed blocks of time.
The primary experimental constraint we face is limited hardware access, and our approach must function even with these restrictions. In order to reduce the overhead due to cloud access to hardware, we generally batch several episodes in the learning procedure, executing them sequentially before the resulting measurement data are provided to the agent. With our selected DRL-agent implementation, convergence typically occurs within as few as 10-20 experimental batches (batch sizes for single- and two-qubit gates are 25 and 16, respectively), corresponding to 0.5-1 wall-clock hours. This time is dominated by hardware-API and access-queue times; typical total experimental execution takes less than a minute, and agent calculations consume negligible time on a cloud server.
An example optimization convergence over episodes for a single-qubit gate (see details below) is shown in Fig. 1f, and notably requires more than an order of magnitude fewer episodes than previous numerical studies [37] to achieve a high-fidelity gate. The convergence need not exhibit a monotonic increase in fidelity, as the DRL agent is allowed to freely explore the space of available controls. Once the learning process converges, the resultant gates are consistently nontrivial, showing structure in the relevant control parameters but exploiting physics which is not obvious upon examination.
We now describe the specific optimizations we have performed and benchmark the performance of the resulting gates. Beginning with the single-qubit gate, we target optimization of R_x(π/2), a π/2-radian rotation around the x axis. This gate is implemented as a driven, microwave-mediated operation and serves as a fundamental building block for arbitrary single-qubit rotations in a U3 decomposition when combined with virtual Z-rotations. In the following, all pulses are presented in terms of arbitrary amplitudes for the I and Q components of each driving channel.
We target gates with reduced duration and enhanced error-robustness relative to a default 36 ns analytic DRAG [25] pulse calibrated through a daily routine that is inaccessible to us. We select a gate duration and task the DRL agent with discovering high-fidelity PWC waveforms with eight time segments, constituting a 16-dimensional optimization when accounting for freedom in both the I and Q channels. Due to the iterative and stepwise nature of DRL algorithms, the effective initial seed is random.
We select two target gate durations informed by hardware constraints to serve as illustrative examples. First, we choose a PWC pulse with a total duration of 28.4 ns, 20% shorter than the default, because the default gate performance is near the T_1 limit, leaving only approximately 20-30% maximum achievable enhancement in base gate fidelity. Next, we select a 12.4 ns gate, ≈ 3× shorter than the default, in order to probe the ability of the DRL agent to suppress leakage arising from fast pulses with spectral weight overlapping higher-order transitions. Further details on the reward/cost function in use are presented in the Appendix. The results of DRL gate optimization executed on the superconducting quantum computer ibmq_rome are shown in Fig. 2. We evaluate the performance of the gate implementation using Clifford randomized benchmarking (RB) [59], which provides an estimate of average error-per-gate (EPG). The 24 Clifford gates used in RB are generated using the R_x(π/2) gate together with virtual Z-rotations in a U3 decomposition, and sequences are constructed using a custom compiler allowing incorporation of arbitrary gate definitions into RB sequences (see Appendix A for details).
FIG. 2. (d-f) Similar plots to (a-c) as achieved via simulated annealing (SA) on the same IBM hardware. The derived benefit using SA is similar to DRL, showing a 200% improvement over the IBM default gate. These initial performance-calibration measurements were performed ≈ 48 h after initial gate design due to hardware-access constraints, and comparison is made to the most recently updated default gate.
In Fig. 2 we see that the 28.4 ns optimized pulse achieves an EPG of 3.7×10^−4, ≈ 25% lower than the default and consistent with expectations based on T_1 limits. Further, we observe reduced variance of individual sequence performance about the mean (indicated by colored shading), consistent with additional suppression of coherent errors [13,60,61]. We have also observed performance enhancements of ≈ 2.13× in RB using 28.4 ns gates defined via a black-box closed-loop optimization; at this stage we are not able to distinguish whether this difference is due to the underlying method of gate optimization or to the fact that a different machine was employed for these tests (see Appendix C).
The performance of the 12.4 ns optimized pulse is comparable to the default, despite being 3× faster, indicating that leakage errors can be suppressed via appropriate definition of the DRL reward function. It is not clear whether the lack of further improvement arises from an overly strict constraint on the actions afforded to the agent, or from a trade-off between increased leakage errors and reduced incoherent errors.
We examine the robustness of both DRL defined gates by comparing the achieved EPG from RB for the same gates applied on different days. The default gates are recalibrated daily and can show amplitude variations on the order of several percent; fixed waveforms are used in each experiment involving DRL defined gates without recalibration. For both gates we observe that we achieve consistent performance relative to the default gates over a period up to two weeks, with measured EPG closely tracking fluctuations in the measured hardware T 1 (Figs. 2g, h). In previous experiments we observed that default gates on comparable hardware could vary substantially in performance after ≈ 12 hours elapsed since last calibration [13]. These findings suggest that while temporal robustness was not explicitly included in the reward function employed, the agent may have discovered robustness as the underlying hardware varied during the training process.
For the two-qubit gate, we implement the ZX(−π/2) gate using an entangling cross-resonance pulse [32,62-68] on the control qubit in combination with multiple single-qubit gates applied to both the control and the target qubits in an echo-like sequence [35,69,70]. The default gate implementation applies a "square-Gaussian" cross-resonance pulse and a simultaneous cancellation tone applied to the resonant drive of the target qubit in order to compensate for direct classical crosstalk (Fig. 3b).
We employ the same base structure and task the DRL agent with finding 10-segment PWC waveforms for the cross-resonance interaction without an additional crosstalk-cancellation tone. This corresponds to a 20-dimensional optimization problem due to the variable amplitude and phase of the cross-resonance drive. In this instance we also compare the DRL procedure, which builds a gate from scratch, to a black-box gate optimization using an autonomous simulated annealing (SA) algorithm seeded with the initial calibrated default gate.
Optimized ZX(−π/2) gates were found using both DRL and SA on two different IBM devices. Optimizations and comparisons to the default were performed on different days (due to access limits), resulting in the variation between calibrated default gate definition and performance observed. Due to the collection of intermediate information, a typical DRL optimization step is approximately 2× longer than a SA optimization step, but we observe that both optimization methods converge in roughly the same number of iterations.
We first compare gate performance using a repetition scheme in which the same gate is applied multiple times and on different initial states. For a given number of repetitions we act with the gate on two orthogonal initial states five times each and average the state fidelity of these 10 different experiments. From the fidelity decay with repetition number we can simply extract a gate error from the approximate slope of these curves.
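The slope extraction just described amounts to a linear fit of fidelity against repetition number. A short sketch with synthetic, illustrative numbers (not measured values from the paper):

```python
import numpy as np

def error_per_gate(reps, fidelities):
    """Estimate error per gate from the decay of state fidelity with the
    number of gate repetitions, using the small-error linear model
    F(r) ~ F0 - r * EPG."""
    slope, _ = np.polyfit(reps, fidelities, 1)
    return -slope

# Synthetic decay: 0.2% fidelity loss per application on top of a SPAM
# offset (numbers are illustrative).
reps = np.array([1, 3, 5, 9, 13])
fids = 0.99 - 0.002 * reps
epg = error_per_gate(reps, fids)
```

The linear model is only valid while the total accumulated error stays small; for long repetition sequences the decay becomes exponential and an RB-style fit (Appendix A) is more appropriate.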
Results are summarised in Fig. 3, showing that with both methods the optimized pulses outperform the default pulses with up to 2.38× reductions in error-per-gate, achieving a gate fidelity > 99.5%. These results show that both agents are able to identify superior ZX(−π/2) gates without the need for a cancellation tone on channel d1 (Fig. 3c, f), and that using DRL we avoid the potential pitfalls of the learning procedure relative to a direct fidelity optimization. The benefits of using DRL for the design of entangling gates extend beyond direct improvements in instantaneous gate fidelity. In a manner similar to the single-qubit robustness studies, we have seen that DRL-optimized ZX(−π/2) gates outperform the default by 2× even up to 25 days after optimization, again with no recalibration or tuning (Fig. 4a). As another measure, we construct a CNOT gate from the optimized ZX(−π/2) gate and compare it to the default CNOT using interleaved randomized benchmarking (IRB) [71]. The absolute IRB gate fidelities achieved and the relative improvements of ≈ 25-70% vary with the machine in use and time (as T_1 can fluctuate substantially), but optimized gates consistently outperform the default across multiple metrics and over long delays since calibration. For the example of testing 25+ days post calibration shown in Fig. 4b, we achieve a DRL-optimized CNOT gate fidelity > 99%.
Finally, we demonstrate that DRL can be used to directly optimize the SWAP gate in situ. The SWAP gate involves sequential application of three CNOT gates, built in turn from ZX(−π/2) entangling operations and single-qubit unitaries. We maintain this overall decomposition but create a new reward function for the DRL algorithm as follows. We apply the full SWAP schedule with varying repetition values r_i on initial states |+0⟩ and |+1⟩, and average the different state fidelities, which we extract from full state tomography. Again we see improvements in the SWAP gate fidelity through both direct repetition and interleaved randomized benchmarking. Using DRL optimization on ibmq_bogota, we achieve up to a 1.45× improvement in the interleaved randomized benchmarking fidelity.

IV. CONCLUSIONS AND OUTLOOK
In this work, we have shown the benefits of using deep reinforcement learning for the autonomous experimental design of high fidelity and error-robust gatesets on superconducting quantum computers. We demonstrated that by manipulating a small set of accessible experimental controls, such as the envelope functions for microwave pulses, our method was able to provide low level implementations of novel quantum gates based only on measured system responses without requiring any prior knowledge of the particular device model or its underlying error processes. These gates were validated to outperform the best competitive alternatives in the challenging case of crafting multiqubit entangling gates.
We first constructed single-qubit R x (π/2) gates, which outperform the IBM default gate in RB with up to a 3× reduction in gate duration and robustness against common system drifts. We then constructed novel implementations of the entangling ZX(−π/2) gate which show up to ≈ 2.38× higher fidelity, achieving F ZX > 99.5%. With these two driven gates, we used randomized benchmarking techniques to validate a complete universal gateset with performance superior to hardware defaults even weeks past last calibration.
From these results we conclude that DRL is an effective tool for achieving error-robust gatesets which outperform default, human-defined operations by capturing unknown Hamiltonian terms through direct interaction with experimental hardware, and without the need for onerous Hamiltonian tomography methods. Moreover, we have validated that even in the face of restricted access to measurement data, DRL can effectively design useful novel controls. We expect that in circumstances allowing better access to measurement data, the richness of DRL may allow the construction of gate implementations which are out of reach for simpler cost-function minimization methods.
Looking forward, we believe these results validate DRL's utility for directly improving the performance of small-to-medium-scale algorithms, beyond individual gate operations. For instance, it may be beneficial to directly optimize frequently employed circuit elements outside of the underlying universal gateset [72,73]. Our early experimental exploration of the SWAP gate suggests that additional optimization benefits may be achieved through autonomous gate optimization in situ, in order to effectively capture additional transients and context dependent error sources that arise at the circuit level. We look forward to future work extending the applicability of DRL to deliver further algorithmic advantages across a variety of quantum computing applications.

ACKNOWLEDGMENTS
We acknowledge the IBM Quantum Startup Network for provision of hardware access supporting this work. The views expressed are those of the authors, and do not reflect the official policy or position of IBM or the IBM Quantum team. The authors also acknowledge N. Earnest-Noble for technical discussions and his support enabling our experiments. The authors are grateful to all other colleagues at Q-CTRL whose technical, product engineering, and design work has supported the results presented in this paper.

Appendix A: Methods
In this section we briefly summarize the parameters and procedures we used to produce the results in the main text.
Reward (Cost) Function for the Single-Qubit Gate - In order to evaluate the complete gate performance we repeat the candidate implementation of R_x(π/2) a variable number of times r_i. First, we perform full state tomography in the computational space to find the fidelity with respect to the ideal target state, F^(qubit)_ri. We then measure the population of the second level, P(|2⟩), and use it to re-scale the fidelity F_ri. The reward is then calculated as a weighted mean over the different repetition numbers (Fig. 1d-e).
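As an illustration, this reward computation can be sketched as follows. The leakage re-scaling F_ri(1 − P(|2⟩)) and the repetition-proportional weights are assumptions for this sketch, not the exact experimental formula:

```python
import numpy as np

def gate_reward(fidelities, leakages, repetitions):
    """Sketch of the repetition-based reward. As an assumption, each
    repetition number r_i weights its own fidelity, so longer repeated
    sequences (which amplify coherent errors) contribute more."""
    f = np.asarray(fidelities, dtype=float)   # tomographic state fidelities F_ri
    l = np.asarray(leakages, dtype=float)     # measured P(|2>) populations
    r = np.asarray(repetitions, dtype=float)  # repetition counts r_i
    f_rescaled = f * (1.0 - l)                # penalize leakage out of the qubit space
    return float(np.average(f_rescaled, weights=r))
```

A perfect pulse (unit fidelity, zero leakage) yields a reward of 1; any leakage or infidelity at any repetition count lowers it.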
Single-Qubit RB - For the single-qubit case we used a customized RB module which generates the 24 single-qubit Clifford gates using only virtual Z-rotations together with a given R_x(π/2) (optimized or default), and constructs arbitrary RB sequences out of these gates. The data in the main text consist of 18 sequence lengths up to a maximal sequence length of 2280 Clifford gates. For each sequence length we generated 20 random sequences and repeated each 1024 times in order to estimate the survival probability. The mean (over random sequences) of the survival probability F was then fitted against the sequence length m to the functional form F(m) = Aα^m + B. Since the error per gate is related to the α parameter, and since the A, B parameters capture device effects such as SPAM, we first fit all three parameters, both for the default and for the optimized pulse. We then re-fit for α with fixed values of A and B, set to the mean of the unconstrained values. The error per Clifford is then given by r = (1 − α)/2, and the error per gate is 6r/7, as the chosen set of 24 single-qubit Clifford gates contains 28 appearances of R_x(π/2), i.e., 7/6 R_x(π/2) per Clifford.
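The decay-fitting step can be sketched on synthetic data (the sequence lengths and noise level below are illustrative stand-ins for the measured means over random sequences):

```python
import numpy as np
from scipy.optimize import curve_fit

def rb_decay(m, A, alpha, B):
    # RB survival-probability model F(m) = A * alpha**m + B
    return A * alpha**m + B

# Synthetic survival data for a known decay parameter alpha = 0.999
rng = np.random.default_rng(1)
lengths = np.array([1, 5, 10, 50, 100, 300, 600, 1000, 1500, 2280])
measured = rb_decay(lengths, 0.5, 0.999, 0.5) + rng.normal(0, 1e-3, lengths.size)

# Unconstrained three-parameter fit (the first stage described above)
popt, _ = curve_fit(rb_decay, lengths, measured, p0=[0.5, 0.99, 0.5])
A_fit, alpha_fit, B_fit = popt

error_per_clifford = (1 - alpha_fit) / 2       # r = (1 - alpha)/2
error_per_gate = 6 * error_per_clifford / 7    # 7/6 Rx(pi/2) pulses per Clifford
```

The second-stage re-fit with A and B frozen proceeds identically, holding those two arguments fixed in the model.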
Two-Qubit RB/IRB - For evaluating two-qubit gates we employ the IBM module for both RB and IRB. The IRB procedure compares two survival-probability decay curves (each averaged over randomizations) in order to extract an effective EPG for the target CNOT in isolation [71]. The reference RB protocol uses only default gates, while IRB interleaves the target gate under test between default-implemented Clifford gates. The data in the main text consist of 10 sequence lengths, up to a maximal sequence length of 90 Clifford gates for the CNOT testing and 65 Clifford gates for the SWAP testing. As in the single-qubit case, for each sequence length we generated 20 random sequences and repeated each 1024 times in order to estimate the survival probability. A similar fitting technique was used for the IRB data in order to estimate the relevant α parameter.
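The extraction of the interleaved gate's error from the two decay parameters can be sketched using the standard estimate of [71], with d = 4 for two qubits:

```python
def irb_gate_error(alpha_ref, alpha_int, d=4):
    """Interleaved-RB estimate of the interleaved gate's error from the
    reference decay alpha_ref and the interleaved decay alpha_int:
    r = (d - 1)/d * (1 - alpha_int / alpha_ref), with d = 2**n_qubits."""
    return (d - 1) / d * (1 - alpha_int / alpha_ref)
```

For example, a reference decay of 0.98 and an interleaved decay of 0.95 imply a target-gate error of roughly 2.3%.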
Repetition-Based Experiments - In these experiments we fit the mean infidelity vs. the number of gate repetitions N, separate from the repetitions employed in evaluating the reward function. For each value of N we average over 5 experiments applying the gate under test N times on one initial state, and repeat for a different initial state. For the ZX gate testing the initial states were |00⟩ and |10⟩, and for the SWAP gate testing |+0⟩ and |+1⟩. After each run, full state tomography was performed and the infidelity with respect to the ideal target state was calculated. An effective measure of error-per-gate is extracted by applying a linear fit to the average infidelity as a function of N, which provides a measure of gate error in the low-error limit (as cross-validated using IRB).
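The linear-fit extraction amounts to a first-order polynomial fit; a minimal sketch (the example data below are illustrative):

```python
import numpy as np

def error_per_gate_from_repetitions(n_values, mean_infidelities):
    """Fit mean infidelity ~ slope * N + offset; in the low-error limit
    the slope is an effective error-per-gate."""
    slope, offset = np.polyfit(np.asarray(n_values, dtype=float),
                               np.asarray(mean_infidelities, dtype=float), 1)
    return slope, offset

# e.g. infidelity growing linearly at 2e-3 per application, with 1e-2 SPAM offset
slope, offset = error_per_gate_from_repetitions(
    [1, 2, 4, 8], [0.012, 0.014, 0.018, 0.026])
```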

Appendix B: Reinforcement Learning Algorithm
The DRL algorithm used for the gate optimizations in this paper is an on-policy algorithm from the policy-gradient family with a stochastic policy and a discrete action space. It is a variant of the well-known Monte-Carlo policy-gradient algorithm, REINFORCE [74]. A parameterized policy π_θ is iteratively updated to maximize the discounted episodic return, J(θ) = E_{τ∼π_θ}[R(τ)]. It does so by directly estimating the objective's gradient with respect to the policy parameters θ, and then performs a policy update using the Adam [75] optimization algorithm, chosen for its overall effectiveness on non-convex and slowly changing objective landscapes.
Algorithm 2 RL Training Loop
…
5: Perform an Adam update step on θ using ∇_θ J
6: if α > α_min then
7:   Perform learning-rate decay: α ← δα
8: end if

It can be shown that

∇_θ J(θ) = E_{τ∼π_θ}[ Σ_t ∇_θ log π_θ(a_t|s_t) R(τ) ],   (B1)

where R(τ) holds the total discounted return for an episode under trajectory τ. This expectation can be efficiently estimated by averaging over a batch of concurrent episodes. This overall learning process has the advantage of being straightforward, not requiring nor forming a model of the learning environment, and having sufficient computational efficiency to be effective for gate optimization. The agent's policy π_θ provides actions which change the agent's state within the environment. Each action consists of amplitude and phase values, with a sequence of N_seg actions constructing a full PWC control pulse. The policy is represented by a feedforward network with one tanh-activated hidden layer and a softmax output layer. For the single-qubit case a hidden layer of size 10 was used, and for the two-qubit case the size was 18. The softmax output layer provides a probability distribution over the discrete action space, which is then sampled to select the next concrete action to take in the environment.
The agent's policy is updated at the end of each episode, using a batch of trajectories collected throughout the episode. The training step is summarized in Alg. 2, and is considered on-policy, since the actions used for the update were generated by the agent using its current policy, as opposed to a previous policy or an ε-greedy version of π_θ. In our use case, a trajectory τ consists of the sequence of pulse segments applied, the tomographic state measurements, and the fidelity-based reward received after completion of the pulse construction at the conclusion of an episode.
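A minimal numpy sketch of this REINFORCE-style update, on a toy one-episode environment: a linear softmax policy over discrete actions, a batch-mean baseline for variance reduction, and plain gradient ascent standing in for Adam. The reward values, batch size, and learning rate are illustrative assumptions, not the experimental settings:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4
theta = np.zeros(n_actions)  # logits of the softmax policy pi_theta

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy episodic returns R(tau) per action; action 1 is best
action_rewards = np.array([0.1, 0.9, 0.2, 0.3])

learning_rate = 0.1
for episode in range(300):
    probs = softmax(theta)
    actions = rng.choice(n_actions, size=32, p=probs)  # batch of on-policy samples
    returns = action_rewards[actions]
    baseline = returns.mean()                          # variance-reduction baseline
    grad = np.zeros(n_actions)
    for a, R in zip(actions, returns):
        g_log_pi = -probs                 # gradient of log-softmax w.r.t. logits
        g_log_pi[a] += 1.0
        grad += (R - baseline) * g_log_pi
    theta += learning_rate * grad / actions.size       # ascend the policy gradient
```

After training, the policy concentrates its probability mass on the highest-return action, mirroring how the pulse-segment policy learns to favor high-fidelity actions.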
The policy updates work to maximize Eqn. B1. Practically, these updates are determined by computing the policy-network loss, which is the negative cross-entropy between the predicted probabilities for each possible action and the actions chosen throughout the episode, weighted by the episode's discounted rewards. This is then minimized using the Adam optimizer with default parameters. By minimizing this loss, or equivalently maximizing the log-likelihood, the network is encouraged to assign higher probabilities to actions which previously led to larger episodic returns.

Appendix C: Fast Simulated Annealing
In the main text we explored the performance of an automated Fast Simulated Annealing (SA) algorithm, a Cauchy machine, in optimizing a two-qubit gate on the IBM machine. Here we provide details of the SA optimization process and present additional results: an R_x(π/2) gate optimization on ibmq_rome and a ZX(−π/2) optimization on ibmq_bogota. The results of the optimization processes appear in Fig. 5; both show a ∼2× improvement in gate error compared to the default gates, consistent with previous data sets appearing in the main text, although absolute error rates for the ibmq_bogota device appeared consistently higher than on the other machines tested.
For the SA optimization process, no intermediate information is collected, and the evaluation of the full gate performance is performed with the same reward function we used in the RL optimization to evaluate full gate implementations, i.e., a weighted mean of the state fidelities. The starting point of the SA algorithm is the device default for the gate we wish to optimize. The general SA optimization process is summarized in Alg. 3.

Algorithm 3 SA Training Loop
…
7: Calculate candidate cost C = cost(A_temp, φ_temp)
8: if C < C_best then
9:   Accept candidate
10: …
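A minimal sketch of a fast (Cauchy-machine) simulated-annealing loop, assuming the textbook fast-annealing schedule T_k = T_0/(1 + k) with Cauchy-distributed candidate steps; the quadratic cost and starting point are illustrative stand-ins for the gate-reward evaluation:

```python
import numpy as np

def fast_simulated_annealing(cost, x0, t0=1.0, n_iter=2000, seed=0):
    """Fast SA sketch: Cauchy-distributed candidate steps, the fast
    cooling schedule T_k = T0 / (1 + k), and Metropolis acceptance."""
    rng = np.random.default_rng(seed)
    x_cur = np.asarray(x0, dtype=float)
    c_cur = cost(x_cur)
    x_best, c_best = x_cur.copy(), c_cur
    for k in range(1, n_iter + 1):
        t = t0 / (1 + k)                                   # fast (Cauchy) schedule
        x_new = x_cur + t * rng.standard_cauchy(x_cur.shape)
        c_new = cost(x_new)
        # Accept improvements always; accept worse candidates with
        # Boltzmann probability exp(-(C_new - C_cur) / T)
        if c_new < c_cur or rng.random() < np.exp(-(c_new - c_cur) / t):
            x_cur, c_cur = x_new, c_new
            if c_cur < c_best:
                x_best, c_best = x_cur.copy(), c_cur
    return x_best, c_best

# Toy usage: minimize a quadratic "infidelity" starting from a default point
x_opt, c_opt = fast_simulated_annealing(lambda x: float(np.sum(x**2)),
                                        x0=[0.5, -0.5])
```

The heavy tails of the Cauchy proposal allow occasional long jumps out of local minima even as the temperature, and hence the typical step size, shrinks.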