Continuous quantum gate sets and pulse class meta-optimization

Reducing the circuit depth of quantum circuits is a crucial bottleneck to enabling quantum technology. This depth is inversely proportional to the number of available quantum gates that have been synthesised. Moreover, quantum gate synthesis and control problems exhibit a vast range of external parameter dependencies, both physical and application-specific. In this article we address the possibility of learning families of optimal control pulses which depend adaptively on various parameters, in order to obtain a global optimal mapping from the space of potential parameter values to the control space, and hence continuous classes of gates. Our proposed method is tested on different experimentally relevant quantum gates and proves capable of producing high-fidelity pulses even in presence of multiple variable or uncertain parameters with wide ranges.


INTRODUCTION
The standard view of quantum computation [1] uses the classical-computing abstraction of a subdivision into a finite set of gates, measurements, and state reset tasks. This paradigm has a number of benefits: notably, it permits formal derivation of universal computation [2,3], that is, the composition of a quantum circuit for any desired unitary operation, as well as error-correcting codes [4,5], where specific error syndromes can be measured and corrected. On the other hand, this abstraction abandons the essential analog character of quantum devices, from which they have the most to gain or lose in terms of their expressibility or fragility, respectively. That is, the power of quantum processors depends strongly on the number of usable gates available to them.
Practically speaking, the usage of discrete gate sets falls short in at least four important respects. First, the analog character is a more complete (and therefore efficient) description of variability between different qubits, which is inevitable, for instance, in solid-state qubits [6]. The use of qubit-agnostic gate sets as the computational primitive means that each qubit must be individually optimized and calibrated to yield each such gate [6], where parametric description of the gates would naturally capture the variations. Second, these variations can occur for a given qubit as a function of time [7], e.g., due to time-dependent noise, and require complete recalibrations where, typically, drift involves only a single parameter. This is very taxing on the routines for error suppression, mitigation, and correction.
Third, the complexity arising from parameter variations in space and time is exacerbated by the subsequent circuit complexity in composing useful circuits. It is well known that, while discrete gate sets can be universal, the required number of discrete constituent gates can be * f.preti@fz-juelich.de † f.motzoi@fz-juelich.de very large even for a simple circuit [3,8,9]. Allowing for analog parameter tuning can dramatically increase the controllability of the system and thereby necessarily reduce the depth of the quantum circuit for arbitrary tasks (see, e.g., Appendix C). Since errors accrue with circuit depth (notably due to decoherence), this increase in the circuit success probability may be highly beneficial to both short-term and long-term (i.e., fault-tolerant) approaches. Fourth, a further optimization layer is currently ubiquitous in noisy intermediate-state quantum (NISQ) algorithms [10]. Such optimizable circuits include adiabatic [11,12], annealing [13,14], and variational [15,16] quantum algorithms. The common denominator in these approaches is the circuit being treated as a black box, with a set of analog parameters acting as knobs to tune as inputs for the respective algorithm. These analog inputs act as terms in the Hamiltonian and thus may generate various quantum continuous gate sets. These cannot realistically be compiled with digital gates, and, moreover, to ensure that the gate set is correctly specified requires a formal approach for their general construction.
Our contribution, in this work, is to present a unified framework to efficiently describe and optimize continuous quantum gate sets in these scenarios. This framework allows learning of the parametric gates that can be tuned to very high fidelity across a large number and wide range of parameter values. We refer to our method as single-optimization multiple-application (SOMA) quantum gate synthesis. This kind of learning can be understood as an instance of meta-optimization [23,24] or adaptive-trajectory learning [25][26][27][28][29][30][31] and can be related to more recent uses of neural networks in quantum physics [9,[32][33][34][35][36]. That is, we find a solution for an optimizer that itself provides solutions to specific problem instances or, with a different formulation, automatically discovers heuristics to construct solutions for a specific optimization problem. In particular, we see that our algorithms are able to synthesize heuristics for general Hamiltonians.
We break down the problem of generalized gate synthesis into the following components. We parametrize (a) Long circuit with discrete gate set q 1 • · · · S · · · · · · S √ S √ S Shorter circuit with continuous gate set λ (1) λ (2) λ (3) λ (D) g (1) ( λ) Figure 1. An explanatory diagram of the concepts discussed in Section I. (a) A general quantum circuit containing a long sequence of discrete unitaries (Clifford + T set), which do not exhibit any dependence on continuous parameters. (b) A general quantum circuit containing different unitaries with different continuous parameters s1, s2, ..., s5 for different qubits q1, q2, ..., q5 and angles θ1, θ2, ..., θ8 parametrizing each gate. R, U , and W represent an analog single-, two-and three-qubit gate, respectively. Here, the same unitary parametrization is capable of representing all the different gates needed in the circuit. We show some notable analytical solutions used to engineer specific gate operations, which are usually implemented due to their adaptive character and simplicity (see Eqs. 5,6, and 7). (c) STIRAP [17,18]. (d) The Mølmer-Sørensen gate [19,20]. (e) DRAG [21,22]. (f) The method that we propose, SOMA, does not make strict assumptions on how the pulse depends on the problem parameters, but, rather, discovers it more generally through training.
our quantum gate set using continuous indices that represent either physical system parameters or desired angles of a continuous Lie group. We then present two different machine-learning methods for obtaining continuous control parametrizations that generate the indexed gate set. The first is a supervised approach where traditional optimal-control theory is used to generate an operational data set from which a generalized gate-set recipe can be trained. The second is an unsupervised approach using back propagation from which the continuous gate set can be incrementally learned over the entire training population. Finally, we show that our approaches encompass the various situations discussed above, includ-ing general solutions for gates given generic physical architectures with wide parameter ranges, noise-adaptive optimal-control theory, and compilation of a Lie group instead of a single element.
The paper is organized as follows. In Sec. I, we introduce the notation for supervised and unsupervised training of parametrized pulses. In Sec. II, we discuss the results obtained by applying these methods to single-qubit and two-qubit gates in the presence of leakage, showing that they display similarities with known analyticalsolution families. Furthermore, we also investigate how our methods perform compared to other existing algorithms. Finally, in Sec. III we analyze the dependence on the parameter variability range, the training data set size and batch size, the system and network size and again compare our algorithms to other numerical robust approaches. We summarize our conclusions in Sec. IV.

A. Definitions
We define a continuous gate set in terms of some continuous sets or distributions of n system parameters s 1 , s 2 , . . . , s n and m gate specifications θ 1 , θ 2 , . . . , θ m . An element of such a gate set is a unitary transformation between two Hilbert spaces H 1 and H 2 : Obtaining a single element of this set is a well-known problem in quantum information. Depending on whether the unitary is synthesized from discrete or analog dynamics, its composition is referred to as circuit compilation [2,3,37,38] or optimal-control theory [39][40][41][42][43][44], respectively. In Fig. 1(a), we see an example of a generic circuit acting on different qubits. Each qubit in space (and time) will have different values of the common system parameters {s k }. In addition, the different unitaries in the gate set {U i }, each additionally characterized by rotation angles {θ i,j }, may vary both throughout the circuit and in iterated uses of the circuit (e.g., in NISQ algorithms). For compactness we now regroup the continuous indices into a common array of indices λ = (s 1 , s 2 , . . . , s n , θ 1 , θ 2 , . . . , θ m ).
Such a generalization of the gate-synthesis problem can be framed as solving for the inverse function of the general dynamics given in Eq. (1), that is, for where u(t) is a wave-form function in the standard case of optimal-control theory, but can also be thought of as a discrete sequence of hard pulses, as in NMR applications, or of unitary gates in circuit compilation. Importantly, the optimal u(t) changes for each parametrization λ of the unitary, which can be generated from e.g. the Schrödinger equation.
or other equations of motion defining the system. This task can be cast as an instance of meta-optimization [45] or trajectory learning for control [30,46,47]. We formalize the meta-optimization problem as list of problemparameter vectors λ 1 , ..., λ L ∈ V λ ⊂ R D with figures of merit F 1 (x) = F (x, λ 1 ), ..., F L (x) = F (x, λ L ) and initial guess x 0 , the physical parameters of which vary somewhat from each other, such that λ i ∼ π( λ|v) for 0 ≤ i ≤ L and v ∈ V λ , is drawn from a parameter distribution. Throughout the paper we assume, without loss of generality, π to be a multidimensional uniform distribution, such that π( λ|v) = U ( λ min , λ max ), with v = ( λ min , λ max ) defining the parameter space V λ . The objective is to find optimal parameters w * , which allow for simultaneous optimization of all the systems considered within the range of sampled parameters.

B. Analytical adaptive control
The most straightforward way to generate classes of solutions to quantum gates has been the development of analytical solutions for particular quantum systems.
They have in common the knowledge of the relevant state of the system at all times during the evolution. In particular, analytical knowledge of the eigenvalues [48] allows reverse engineering of the pulses to provide (near) exact solutions for the desired states. Fig. 1(c)-(f) shows several celebrated examples where such general classes of solutions have been found. We quickly review some of their main features. Given a trial pulse shape such as a Gaussian envelope p(t, t 1 , t 2 , θ) = Ae (t−t2/2+t1/2) 2 /σ 2 , where θ denotes the area under the curve, the following dynamical solutions have been found. Fig. 1(c) shows the stimulated Raman adiabatic passage (STIRAP) solution for transfering population between disconnected states |0 and |2 , while avoiding any (nonvanishing) temporary population in the intermediary connecting state |1 . The pulse shaping is given by u 1 (t) = Ω 1 e iθ p(t, T /3, T, π)e i∆t |0 1| where the order of the pulses u 1 and u 2 is famously counterintuitive [17,18]. Fig. 1(d) shows the Mølmer-Sørensen (MS) method for trapped-ion gates [19,20]. The required laser pulses are given by MS(σ x ⊗ σ x ⊗ · · · ⊗ σ x ) : (Ω 1 , Ω 2 , δ, θ 1 , θ 2 ) → u(t), (6) u 1 (t) = Ω 1 p(t, 0, T, θ 1 )e iθ2 e iδt |g, n e, n − 1| u 2 (t) = Ω 2 p(t, 0, T, θ 1 )e −iδt |g, n e, n + 1| , where the first state index denotes the state of the relevant qubit and second index denotes the phonon occupation number. Because of the symmetry of the gate, it can in theory be used on an arbitrarily large number of qubits, i.e., it is a collective gate [38,49]. Fig. 1(e) shows the Derivative Removal for Adiabatic Gate (DRAG) solution, for leakage suppression in multilevel systems [21,22]. This pulse profile is given by u 1 (t) = e iθ2 e iδt p(t, 0, T, θ 1 )(Ω 1 |0 1| + Ω 2 |1 2|) We see, with these families of solutions, the common trend that they allow for different known system parameters or for different rotation or phase angles. Naturally, this is just a representative set, but where the equations of motion are integrable such solutions are numerous in the literature.
However, there are a few evident concerns about finding such analytical solutions. Firstly, it is a labourintensive task with no guaranteed result, where even particular solutions do not preclude that a more systematic search would produce better results. Secondly, it is important to emphasize that these are solutions to idealized models, and in practice the more accurate physical models are not exactly solved by these ansätze. Finally, most physical systems have to date not been able to find general analytical solutions beyond qubits, qutrits and highly symmetric systems, both because of the larger state space and the larger parameter space. This limits their viability for quantum computing, which requires much larger Hilbert spaces.
where H 0 is a time-independent drift Hamiltonian and H m , with m = 1, ..., M , are different suitable control Hamiltonians with corresponding control fields u m (t). In simulations, U (t) is often computed through different types of Trotterization [52]. GRAPE provides us with an efficient gradient of the merit function with respect to the control pulse values. The merit function is usually given by the gate fidelity: where the normalization factor d corresponds to the dimension of the Hilbert space and G is a target unitary, which we would like to generate using the unitary dynamics. We assume a Trotter-like unitary evolution of the system of type: where U j (t j , t j−1 ) = e −iHdt , with dt = T /N evo = t j − t j−1 ∀j = 1, ..., N evo , where N evo defines the number of time steps used in the Trotterization. In particular, considering Eq. (8), the unitary step U j reads The gradient of the fidelity can be computed iteratively starting from the so-called propagated optimal state [41]: so that the gradient approximately results in: This approach, however, cannot directly account for variations of the underlying dynamics due to e.g., stochastic noise [53], field inhomogeneity [54], or Hamiltonian uncertainties [55] and the optimal pulses output by following the native gradient direction can prove significantly worse than expected if some of the underlying problem parameters vary. A possible way around this is to switch to a robust control approach, in which the cost function (9) accounts for parameter shifts. A simple way [41] is to use an average fidelity over the parameter space sampled with quasi-Monte Carlo, where L is the number of samples.

D. Robust control
Similar to the adaptive solutions using analytical methods, solutions for controls robust to uncertainty in parameters have been found both by analytical and numerical means [33,40,[53][54][55][56][57][58][59]. Since this involves only a single solution and not a family thereof, this has been the preferred method for parameter variability in quantum gate sets, as they are easier to design.
Robust solutions are generally defined slightly differently from the adaptive solutions, using the figure of merit Figure 2. The Single-Optimization Multiple-Application gate synthesis method represented in its two variants. (a) In SOMA SL we first optimize N different QOC problems with different problem parameters λ1, λ2, ..., λN using an optimal quantum control algorithm such as [41,42,50,51] to obtain optima x * 1 , x * 2 , ..., x * N , then use a function approximator to learn the mapping g : R D → R Q , λ −→ x * ( λ) between the problem parameters and the optimal pulses. (b) In SOMA BP, we sample L OQC problems λ1, λ2, ..., λL and train the function approximator to minimize the average infidelity of the ensemble of problems using back propagation, without generating optimal solutions for a single problem with a standard quantum control method.
This cost function is more tractable from an optimization point of view because we can simply account for an ensemble of individual cost functions by taking the average of the cost functions as the objective of the optimization.
Robust solutions have also been found using analytic methods. In particular, a common pulse sequence for gates with robustness to amplitude noise is the Broad- which is independent of offsets in Rabi frequency ∆Ω up to sixth order, Likewise, when applying gates with unknown frequency offsets, the Compensation for Off-Resonance with a Pulse SEquence (CORPSE) method [61] given by with β = sin θ 2 is robust to the exact value of ∆, for small enough ∆.
Robustness has found widespread use in quantum computation where fabrication uncertainty, use of ensemble systems, and noise have made control challenging. The use of robust control is especially important where small deviations occur over time scales roughly on par with gate durations. Nevertheless, if deviations are not small, if they are over a much longer (or much shorter) timescale, or if very high fidelity is sought after, then typically they have limited value. This is especially the case where variability occurs as result of design uncertainty or slow parameter drift, or when continuous gate sets are needed as for NISQ algorithms. To understand why maximum fidelities suffer as a result of improved robustness, notice that a longer pulse sequence (with multiple pulses) will necessarily incur more decoherence. Thus while pulses such as BB1 and CORPSE will reduce drift error, the overall fidelity will not be as high as a single pulse at a fraction of the duration could have yielded. We will also show this quantitatively in the next sections.

E. Supervised training method: SOMA SL
Rather than constructively producing such classes, or relying solely on optimal control theoretic methods, here we pursue the approach of machine learning a functional approximation to the general solutions. Function approximators are mathematical objects capable of reproducing arbitrary functions using families of functions [62]. They are normally identified with neural networks and find extensive application in machine learning, data analysis, etc. For the supervised approach, which is sometimes referred to as trajectory learning in the robotic control literature [47], we employ an optimizer to find the corresponding minima for a set of problems and a regressor g : R D → R Q which maps the problem parameter space to the space of solutions to the given problem. We refer to this approach as SOMA SL (SOMA with supervised learning). A sketch of the algorithm is provided in Fig. 2 (a). Starting from a seed problem with solution x * 0 , generated previously, for N different optimal quantum control problems parametrized by λ 1 , ..., , λ N , we generate N solutions x * 1 , ..., x * N . Then we train the neural network to find the best non-linear mapping between the original parameters and the solutions. Training is performed via a standard mean squared error (MSE) loss: wherex is the mean value of the generated data, σ x its standard deviation, and z * i the normalized data. An adaptive trajectory is a solution that depends on a specific parameter of the physical system, which the optimizer does not control, although it can make use of it. Usually, for many robotics applications, the system does not have an analytical model, thereby preventing directlearning strategies. However, for quantum dynamics, we show how we can use the model to more directly train the function approximator. We refer to this approach as SOMA with back propagation (SOMA BP). A sketch of the algorithm is provided in Fig. 2 Prime examples of adaptive trajectories are the analytical pulses in Sec. I B. For instance, by using a frequencydependent solution, DRAG eliminates the leakage inside a qutrit. Moreover, this solution parametrized by the function approximator, g, directly depends on the physical system values and can therefore be tuned if these are shifted.
The cost function for an ensemble of L quantumoptimal-control (QOC) problems defined by problem parameters λ 1 , ..., λ L is given by: where F , as before, is the figure of merit, e.g., the overlap fidelity of the operation with the target quantum gate [64].
By parametrizing the solution in terms of a neural network that depends on the gate parameters, gradientbased optimization algorithms can be used to train the network directly off the above cost function. In essence, the usual back propagation of neural networks matches naturally with gradient-descent optimal-control methods such as GRAPE [41]. Thus, while optimizing in the fidelity landscape of the controls, our algorithm is able to simultaneously train the network to adapt to the extraneous system and gate parameters. This method can also be used in combination with a more standard robust-GRAPE approach. In this case, those parameters the calibration and control of which proves difficult, can be excluded from the network input.
with random restart (a) Ensemble of qubits with different λ We consider a chip with several qubits, each one with its own Hamiltonian parameter λ. We assume parameter variations δ λ to be so large that robust pulses are generally ineffective. (b) In the second use case, we consider a single system, the parameters of which vary with time. Pulses are trained on the average cost function over ensembles of qubits. (c) The approximator is trained directly on the experimental setting by using a gradient estimator, in this case a policy gradient [63], which usually requires large numbers of samples to be drawn from the system. (d) The regressor is first trained first in open-loop simulation and then used as an ansatz for a closed-loop optimization, which leaves the neural-network parameter untouched but modifies the output amplitude parameter (or, alternatively and if possible, the input parameters) to maximize the fidelity for specific experimental configurations. The optimization routine can be freely chosen among gradientfree algorithms, such as Nelder-Mead in Ref. [43]. This method can be a viable option if the experimental setting can be simulated with sufficient precision.

G. Experimental adaptation
Direct experimental application of SOMA is possible both for model-based and model-free implementations. For these purposes, one must have access to a controlled distribution of λ values, corresponding to gate parameters, pulse parameters, and system parameters. For in situ optimization of the gates, one further requires access to an experimental cost function to ascertain (with low noise and bias) the merit of the pulse sequence. For model-free control learning, the system parameters should still be indexed in some way, e.g., by performing characterization, by using a proxy such as other known characteristics, or by externally tuning parameters (e.g., the qubit frequency via magnetic or Stark shifts). For model-based approaches, one can vary the parameters λ in software to map the solution space of the continuous gate sets. Thus, in both cases, provided that there is known variation in some parameters, then one can index them, e.g., discretely in space or (slowly varying) continuously in time, as shown in Fig. 3 (a) and (b) respectively.
Whichever parameters for λ one chooses for the experiment, the task then becomes to learn the neuralnetwork weights for the maximal performance on the relevant device and gate defined uniquely by λ. Two differ-ent approaches are shown in Fig. 3 (c) and (d). When an accurate model for the generators of the dynamics is known, then Fig. 3 (d) is a natural choice whereby offline (i.e., open-loop) optimization of the simulated gates is first brought up, and only once the solution class has been learned in situ (i.e., closed-loop) is control learning performed. In this secondary step, one can reoptimize either over the space of solutions (fine tuning optimal λ * ) or over the space of physical controls (fine tuning optimal x * ).
Closed-loop optimization directly on the experiment can be performed a number of ways, including numerical and parameter-shift approximations of the gradient, Nelder-Mead [65], or evolutionary algorithms [66]. In the latter case, for example, Monte-Carlo gradient sampling can be used to estimate the update direction [67]: where i ∼ N (0, I Q ), i = 1, ..,Ñ is a normally distributed stochastic variable sampledÑ times, x is the network output representing the pulse and σ is the standard deviation of the sampled pulses. This method is often referred to as natural evolution strategy [66]. Fig. 3(c) shows how the experimental gradient of the cost function can be then back propagated via the optimizer (similarly to the GRAPE implementation above) in order to update the network weights w * directly, when the different λ can be sampled simultaneously.

II. RESULTS
We test our methods and compare to previous optimalßcontrol theoretic approaches. For this purpose, we train our solution networks to learn how to perform continuous gate sets for both single-and two-qubit operations. As a figure of merit of the QOC problems, we choose the gate fidelity defined in Eq. (9). While the gradients of the fidelity with respect to the pulse parameters x can be computed using GRAPE [41,68] (see Section I C and Appendix A), in the context of these simulations the gradient can also be obtained through automatic differentiation [69,70]. To simulate the quantum system, we use a second-order Magnus propagator as derived in Refs. [71,72], and which is compatible with analytical or automatic differentiation [72]. For the optimization of all the parameters, we employ the algorithm L-BFGS-B [73].
By choosing the vector parameter θ appropriately, one can construct arbitrary single-qubit unitaries. Therefore, in any optimal quantum control problem, this triple can be considered as a vector of problem parameters, since they define the entire class of QOC problems, the goal of which is the optimization of arbitrary single-qubit gates.
In the following section we consider an ensemble of QOC problems defined by parameters of the unitary target gate, Hamiltonian parameters, parameters of the control fields, and the evolution time T . To simplify the problem, we consider a target gate of type by setting θ 1 = θ, θ 2 = π and θ 3 = 0 in Eq. (22). For θ = π 2 , the gate is the H gate, whereas for θ = π it produces the X gate. The second family of unitaries that we consider can be obtained by setting θ 1 = π and θ 2 = θ 3 = θ − π in Eq. (22). The resulting family of gates, generates, among other unitaries, the X gate for θ = π and the Y gate for θ = π 2 . We would like to point out that a combination of single-qubit unitary gates as in Eq. (24) and Eq. (23) is sufficient to generate arbitrary single-qubit unitaries [1].
For single-qubit simulations, we consider the Hamiltonian of a superconducting transmon qubit [75]. This system can be effectively reduced to a qutrit Hamiltonian [75,76], where the |0 and |1 levels provide the computational subspace and the |2 level represents the leakage. The drift Hamiltonian for our system [21] reads where ω d is the qubit frequency and α is the anharmonicity. Furthermore, we consider a control Hamiltonian of type: where ω is the driving frequency and φ represents a time-independent phase shift between the raisingσ + and the loweringσ − operators, which can be related to the rotating-wave approximation (RWA) [68,77] and which is, in this model, the only problem parameter influencing the control fields.
Computing the RWA with detuning δ allows us to rewrite the drift Hamiltonian as: where δ = ω d − ω,X(φ) = e iφσ + + e −iφσ − , and iŶ (φ) = e iφσ + − e −iφσ − . For the control fields, we employ a Fourier ansatz: with K Fourier modes. For the QOC simulations, we set the central values δ 0 = 0 GHz and α 0 = −0.34 GHz [78]. The parameter vector of the QOC-problem class is given by For all the single-qubit gate simulations, we consider a multidimensional rectangle centered at λ 0 =  (δ 0 , α 0 , φ 0 , θ 0 , T 0 ) T and with upper bounds defined by ± λ max . We consider four methods, which allow us to optimize multiple systems simultaneously: standard GRAPE; robust GRAPE, which uses the average GRAPE gradient over an ensemble of QOC-problems in Eq. (15); a supervised training method, which we refer to as SOMA SL (Algorithm 1), using both linear and nonlinear models and where GRAPE solutions are first generated and then approximated via Eq. (18); and the unsupervised method, which trains a neural-network pulse directly on the fidelity of an ensemble of QOC problems using back propagation on Eq. (20). The latter is referred to as SOMA BP (Algorithm 2). In the following section, we consider a QOC problem with N = 500 time steps of a Magnus-type time integrator, which approximates the unitary temporal evolution of the quantum system [72]. Our pulses are parametrized as in Eq. (28) by K = 4 Fourier components according to Eq. (31) for each one of the two control fieldsX and Y . The control fields are multiplied with a scaling factor equal to 1/A 0 to ensure that the pulse amplitudes do not exceed the typical experimental maximal value of 1 GHz.
The results are divided into four blocks of four plots each (Fig. 4, Fig. 5), representing models trained on two different sets of problem parameters, the ranges of which are given in Tab. I and in Tab. II. Here we limit the number of different parameters to three, although larger numbers are possible depending on the parameter range taken into account, the specifics of the physical system, the size of the neural network, and the number of systems sampled (see Appendix B). The first three plots in each row show the infidelity as a function of one varying parameter, while the other parameters are kept at their original values given by λ 0 . The fourth plot shows the average performance of the different methods as a function of the radial distance from the initial QOC problem λ 0 . This last plot ensures that the performance of the methods is stable when different parameters are changed simultaneously across the entire range of the parameter space. In Fig. 4, the problem parameters δ (the qubit-drive detuning), α (the nonlinearity of the qutrit system), and T (the gate evolution time) and the target gates R 1 (θ) defined by Eq. (23) -first row -and R 2 (θ) defined by Eq. (24) -second row -are considered. The blue continuous line shows the GRAPE solution for the λ 0 parameter, whose fidelity shows an exponential decay as a function of the distance from the center of the parameter space. The orange dashed line shows the performance of robust GRAPE and the brown dashed line the performance of SOMA SL with a linear model, both with infidelities around 10 −2 . The pink dashed line shows the performance of SOMA SL with a neural network and the green dashed line the performance of SOMA BP. We observe that for the second gate R 2 (θ), both methods are able to deliver an average infidelity below 10 −4 , whereas for the first one R 1 (θ), SOMA BP clearly outperforms SOMA SL.
In Fig. 5, the problem parameters are as follows: drive frequency detuning δ, the phase mismatch in the control fields φ, and the gate angle θ, together with the target gates R 1 (θ) defined by Eq. (23) -first row -and R 2 (θ) definied by Eq. (24) -second row. We note the same general trends as in Fig. 4. In particular, we observe that although both SOMA methods are closer in terms of performance for R 1 (θ) with a little advantage for SOMA BP, for gate R 2 (θ) SOMA BP clearly outperforms SOMA SL. We can again note that SOMA BP mostly outperforms all the other methods, providing pulses that are stable over a large range of parameters. For simulations with single-qubit gates, we use a two-layered network with 256 components per layer and eight neurons in output -the output space of the approximator has an output dimension Q = 2K.

B. CR gate with leakage
The cross-resonance (CR) gate is a two-qubit gate activated by microwave fields, which drive one of the qubits  Table I. The parameter range for the optimization of the single-qubit gates R1(θ) and R2(θ) as given in Fig. 4. For this simulation, we do not vary the angle parameters of the gate, but, rather, the physical parameters of the qutrit, such as δ and α and the evolution time T .  10   Table II. The parameter range for the optimization of the single-qubit gates R1(θ) and R2(θ) as given in Fig. 5. In this case, we not only vary the detuning but also the angles of the gates and the phase factor φ.
(target) at the frequency of the other (control). The gate is implemented in the context of quantum computing with superconducting qubits [79][80][81][82]. The gate gives rise to a ZX-type interaction [79], which can then be used to generate different two-qubit entangling unitaries. The CR gate can be embedded in a higher-dimensional system, e.g., where two qutrits are capacitively coupled together. In this case, the target gate is a two-qubit gate, but the unitary generated by the Hamiltonian evolution is a two-qutrit gate, thus the fidelity is only computed with respect to the computational subspace -the CR gate is constructed using the |0 and |1 levels. The Hamiltonian of a two-transmon system reads [80] where J is the coupling strength of the transmontransmon interaction, b j is the lowering operator on the jth transmon, Ω is the driving field, and a Duffingoscillator approximation is performed [76]. A further standard RWA allows us to simplify the problem, introducing at the same time the notation from Eq. (25), such that As control operators, we use the projectorsX j (φ) = e −iφb j + e iφb † j and iŶ j (φ) = e iφb † j − e −iφb j both on the control qutrit (j = 1) and the target qutrit (j = 2), as in Eq. (28), such that where u (1) j (t) are the control fields on the jth qutrit (for theX j and theŶ j operators respectively). Note that all the fields operate at frequencies nearresonant to the target qubit. The Hamiltonian parameters are centered at values ∆ 0 = 0.2 GHz, α 0 = −0.34 GHz, J 0 = 0.01 GHz. For each control field, we use a Fourier parametrization [43,81]: We further assume, as in the single qubit case -see Eq. (28), that the control fields are influenced by a phase factor φ, which then counts as a QOC-problem parameter in the two-qubit simulations.
For two-qubit gates with leakage we present two different simulations, one for the CNOT gate alone and one for a family of CR-like gates with the following parametrization:  90   Table III. The parameter range for the optimization of the two-qubit CNOT gate. The range of values ∆, α and φ, as well as the evolution time T , are determined by considering typical experimental settings of state-of-the-art superconducting quantum circuits [82][83][84]. For the value of φ, we assume a small phase error influencing theX andŶ control fields both on the target and the control qubits.  90   Table IV. The parameter space for the optimization of the two-qubit CR gate as given in Eq. 38. Compared to the simulation of the CNOT gate, given in Tab. III, the range of the target-control frequency detuning ∆ is reduced but the continuous parameter θ is also considered.
For the two-qutrit gates we use a two-layered neural network with 500 neurons per layer and 120 neurons in output. Here, the output of the approximator has a dimensionality Q = 4K, where K = 30 is the number of Fourier frequencies for each one of the four control fields -see Eq. (37). The time evolution is given by 5000 Magnus steps.
The results of the pulse class learning are shown in Fig. 6. The evolution time is chosen as T = 90 ns, an improvement of about a factor of 2 over typical experimental durations. The first row of plots shows the results for the CNOT gate, where variations of the parameters ∆ (the frequency difference between the control and target qubits), α (the nonlinearity of the qutrits), and φ (the phase term of the control fields) are considered. The second line shows the results for the CR(θ) gate as defined in Eq. (38). The maximal range for each parameter for each one of the two gates is described in Tab. III and Tab. IV respectively. We observe that, for both gates, the neural-network pulse trained with SOMA BP outperforms the other algorithms. For the CNOT gate, it can produce pulses that are robust over a very large detuning range (|∆ max − ∆ min | = 200 MHz) by training on just 100 different systems sampled from a uniform distribution. For the CR gate, we choose a range of 100 MHz, while also taking into account the dependence on the angle θ. It is possible that a very large amount of samples or long training times could improve the performance of SOMA SL further; however as the size of the Hilbert space increases, the optimization of large sample numbers on classical machines can become computationally too time consuming. The maximum range defines the parameter space over which the QOC problems are sampled. We observe how the neural network trained with the supervised algorithm outputs pulses with higher average fidelity than the other methods but it fails nonetheless when the detuning variation in the two different systems is increased up to the order of 100 MHz. In this case, the model trained with SOMA BP, however, still outputs high-fidelity pulses.

III. PERFORMANCE ANALYSIS
In this section, we compare the different metaoptimization methods as the size of the QOC problemparameter space increases. In particular, we vary the order of magnitude of the laser-frequency detuning δ, the qubit-frequency detuning ∆ for the two-qubit gate, and the nonlinearity α for both the single-qubit gate and the CR gate. For the sake of this analysis, we only vary the maximal range of one single parameter at a time, while the other problem parameters have a fixed maximal range of 10 −3 GHz. As hyperparameters for the different algorithms, we use the same values that have proved to be effective in the previous simulations. For SOMA SL, we use up to 10000 sample problems optimized with GRAPE, whereas for SOMA BP we optimize the average fidelity with 500 system samples for the single-qubit gate and 100 systems for the two-qubit gate.
The results are shown in Fig. 7. We observe that although supervised training using the minima produced by GRAPE shows lower infidelities for small variations of the detuning (up to 10 MHz), it fails when this value is increased to 100 MHz, whereas the neural network trained with back propagation of the fidelity still produces valid optima of the QOC problem. In particular, we observe a crossing between 10 and 100 MHz for both types of detuning (δ and ∆) and for the nonlinearity α in the two-qutrit case, where the performance of the supervised method (SOMA SL, green line) worsens dramatically, whereas SOMA BP (orange line) is able to keep the fidelity at high, experimentally valid values. In general, we observed that SOMA SL significantly outperforms SOMA BP for small parameter variations, where the precision of the non-linear regression is high enough to reproduce the pulse variations perfectly (see also Sec. 7). As a consequence, SOMA SL can be a useful tool to achieve adaptive robustness against small parameter variations, where it clearly surpasses SOMA BP, whereas the latter shines when the parameter variations and the number of Fourier components are comparatively larger. Furthermore, SOMA BP is capable of handling large variations of multiple parameters at the same time, as shown in Fig. 5. In all cases, at least one neuralnetwork approach outperforms robust GRAPE and the linear regression model significantly. We argue that sampling large amounts of problem parameters could actually lead to a better performance of SOMA SL for in-situ physical systems or numerical simulations where sampling proves fast and efficient. However, as we show in Fig. 8, it seems that for SOMA SL (green line) there exist systems where its improvement as a function of the sample size is limited, which makes SOMA BP (orange line) a better option, if its implementation is possible. Likewise robust GRAPE (blue line) and the linear model (pink line) gain limited benefit from increasing sample sizes. In particular for SOMA SL, we can often observe behaviors such as branching and outliers in the training data set, which probably contribute to a loss in the quality of the approximations. One may try to increase the precision of the model by using loss functions that are sensitive to outliers, such as the Huber loss [85], or to restrict the use of SOMA SL to systems where limited parameter drifting does not prevent the regressor from learning a high-quality representation of the solution space.
We also study whether varying the chosen samples during training (i.e., minibatching) can affect the performance of our algorithms; in particular for robust GRAPE and SOMA BP. Computing the gradient over a batch of samples gives rise to a batched version of the algorithms. More specifically, a batched version of robust GRAPE, called bGRAPE, has been studied in Ref. [35], where impressive robustness is achieved by combining batches of problem parameters with momentum-based stochastic gradient descent. This is, of course, a different algorithm than L-BFGS-B, i.e., the optimization algorithm employed in all simulations discussed in this paper, and it does not similarly guarantee near-quadratic convergence [73]. For SOMA SL, computation of the gradient based on the MSE loss over a batch of samples leads to standard neural-network training with stochastic gradient descent. For the systems we consider, we do not observe an improvement of bGRAPE over robust GRAPE (called sGRAPE in Ref. [35]) neither with L-BFGS-B nor with ADAM [86]. We believe that this is due to the use of Fourier components, which allow for more controllability of the quantum system [68], and the use of curvature information granted by both algorithms. Nonetheless, exploration of the effect of varying samples during training remains an interesting perspective worthy of further studies and commitment, both in the context of adaptive and robust control.
In the last part of the analysis, we discuss the scaling of the network approximations as we increase the output dimension, the number of qubits and the input dimension. Since we aim at controlling single-and two-qubit gates, the number of expected controls only scales linearly with the number of qubits; and since the weak coupling between qubits drops off roughly exponentially with distance, we also do not expect the search complexity in state space to increase dramatically. In simulations, we expect both algorithms to behave similarly to GRAPE (or a different gradient-based control algorithm, if this is implemented) as the dimension of the Hilbert space increases. This is due to the underlying time evolution, which in both cases is given by the Trotterization or, as in our case, the Magnus expansion. The approximation of the quantum propagator affects the gradient-based optimization for single QOC problems, which generates the target data for SOMA SL, but also acts as an activation function for the neural network [35] in SOMA BP. As a consequence, we do expect the state space and the equations of motion to scale exponentially with the number of qubits, which is a general problem for all control and compilation tasks. Just as in the general case, we expect a combination of informed state-space parametrizations (such as tensor networks and sparse algebras) and of quantum-aided optimization (as in Sec. I) largely to address this important problem. Moreover, by considering increasing numbers of qubits, the number of pulse parameters and control fields increases consequently. However control fields acting on different qubits usually commute, which can dramatically decrease correlations between the pulse components. Therefore, one does not need, in general, to have a single neural network outputting all pulse parameters at once, but, rather, several different neural networks, one for each group of control fields, which commute between each other. For SOMA SL, we need to generate a large data set of QOC-problem solutions by means of a standard quantum control algorithm, which may be slow for many-qubit systems. However, this task can be easily parallelized, since the different optimizations are independent of each other. In this case, the main obstacle is not represented by the nonlinear regression over the data, but, rather, by the quality of the data generated by the quantum control algorithm for each solution in the parameter space. Since the problem is high dimensional, the solution space will probably exhibit structures such as branching and outliers that are difficult to include in the nonlinear regression. A possible option here is to employ algorithms for data reduction and clustering, in order to obtain a high-quality representation of the solution space. As for SOMA BP, one may distinguish between simulation and experimental implementation. In the former case, the evaluation of the infidelity and its gradient as a function of the time evolution represents the main bottleneck -see Tab. V in Appendix B -together with vanishing gradients, which are also a well-known problem for other NISQ use cases and represent one of the main obstacles to any experimental application. Vanishing gradients could also be tackled with alternatives to back propagation (see, e.g., Ref. [87]). Finally, for large output spaces and very deep networks, GPU training and stochastic gradient-descent algorithms may provide useful speedup, as it is the standard in deep learning.

IV. CONCLUSION
In this work, we show how to engineer solutions of problems in quantum optimal control that depend on problem parameters located outside the optimization routine. This includes physical system parameters, other external parameters such as the pulse time or bandwidth, and gate parameters such as rotation angles.
We therefore propose two methods to learn large classes of quantum gate-synthesis problems, SOMA SL and SOMA BP. We show through experimentally relevant examples that these methods prove able to learn adaptive solutions to generalized QOC problems. The output gates have fidelities that remain very high over the entire continuous parametrization of the gate sets, for typically large ranges as would be encountered experimentally. These continuous gate sets provide the opportunity to be used as computational primitives in compilation tasks, in NISQ variational algorithms, and for entire arrays of qubits rather than individually optimized ones.
where the k index runs over the number of basis functions components, the i index over the time-slice and the j index over the different control operators. In a similar way, this can be applied to a neural-network parametrization g : R D → R Q , λ −→ g( λ) of the GRAPE pulse, mapping a given number of meta-parameters to the pulse space, the original formula can be rewritten to output the gradients of the fidelity with respect to the neural-network parameters (w ml , b l ): with the same indexing as Eq. (A1). In this case, the neural-network output values, which give the coefficients of the time-dependent basis functions, i.e., the terms x ik = g( λ) ik depend on the QOC problem parameters λ.
Appendix B: Implementation details For every physical system considered -single qubit gate with leakage and two-qubit gate with leakage, each one with different target gates and system parameters -we first define an interval for each parameter. Values are sampled from a uniform distribution over the hypervolume defined by the intervals of parameters. The system together with the interval of parameters defines a family of QOC problems, which we analyze with one of 3 methods: Robust control with GRAPE, which simply seeks for the best pulse for a set of different systems, SOMA SL, which first solves a sample of systems and then performs a regression over the sampled values -for the sake of completeness we consider here both a linear and a non-linear approach -and SOMA BP, where the network is trained directly on the average infidelity of an ensemble of systems with back propagation. In the case of Robust GRAPE, we sample L systems and run the optimization with random restart, i.e., we run the optimization N guesses = 5 times with different conditions and then choose the pulse with the smallest average infidelity. For SOMA SL, we sample up to N = 10000 points within the parameter space. For each one of these, we optimize the corresponding QOC problem with GRAPE -this can be performed with any proper optimal quantum control method -and then train the model to map the corresponding parameter to the optimal pulses. As for SOMA BP, we sample L systems and run the neural-network training with random restart. The number of total samples required by the regression is usually larger than the one needed by the other two methods, the corresponding single-system optimization is nonetheless much faster and can be run in parallel. Therefore we set N = 10000 for each simulation, whereas for the sake of comparison and due to the similarity of the other two methods always use the same number of samples (either L = 500 for the single qubit case or L = 100 for the twoqubit case). Both SOMA methods employ two-layered neural networks with 256 neurons per layer for the single qubit system and 500 neurons per layer for the two-qubit system. The linear model performs multi-linear regression on the same data used by SOMA SL.
Comparison of SOMA methods Algorithm parameter SOMA SL SOMA BP QOC cost function evaluations n fev · N n fev · L · N guesses Computation time t GRAPE · N + t REG t BPROP · L Number of samples N L Table V. Comparison of SOMA SL and SOMA BP in terms of their performance features. n f ev and tGRAPE are here the maximal number of function evaluations and computation time required by L-BFGS-B with GRAPE to solve a single quantum control problem. tREG is the time required for the neural-network regression. tBPROP is the time required by the neural network back propagation for one sample.
For the random restart of SOMA methods, we run the optimization N guesses = 5 times with different initial conditions and then test the quality of the predictions on a test set. Afterwards we choose the model with the lowest average test infidelity. For both algorithms, we use the L-BFGS-B algorithm. Since the training of SOMA SL, which uses a MSE loss, is much faster than SOMA BP, we do not limit its maximal number of iterations. For SOMA BP, however, this is limited to 6000 for the single qubit gates and to 7000 for the two-qubit gates. Here, the input parameters for both SOMA SL and SOMA BP should be re-scaled to lie e.g., within the range [0,1]. For SOMA BP, this can be avoided in certain cases, as long as the size of the pulses remains large enough. For details about the performance of the algorithms, see Tab. V.
For each system, we evaluate the performance of the neural-network pulses on the entire parameter space. In order to visualize the performance of the algorithms considered in a one-dimensional plot, we consider the nor- By then considering D-dimensional spheres of radius r ∈ [0, 1] D , we sample N s = 1000 systems on the surfaces of such spheres and compute the average infidelity over these systems. By doing so, we can evaluate the performance over the parameter space, thereby ensuring that our methods are effective for every combination of problem parameter values. The result is then averaged over N s samples, i.e., we plot the mean and its standard deviation. The latter is pictured as a shady region. We also want to consider how the standard deviation of the average infidelity behaves when the different algorithms are tested against a batch of quantum systems sampled according to the parameter space. In particular, the standard deviation of the infidelity for N s test systems is bounded by the average performance of the algorithms: where F test is the average fidelity of the trained pulses on the N s test systems. For SOMA BP, assuming that overfitting is negligible, we have σ(w) 2 ≤ 1 − (1 − L(w)) 2 . which means that the fidelity of the algorithms considered is guaranteed not to drop significantly below the average performance showed in Fig. 4, Fig. 5 and Fig. II B. Figure 9. Infidelities reached by applying stochastic descent search (a) and brute-force (exhaustive) search (b) on the CR the gates defined by the angles in Tab. VI in order to decompose them in discrete circuits. As we can observe, both algorithms provide us with the same results. However, they cannot find a circuit representation with F > 0.98 for the first three gates, thus indicating that circuits decomposing these gates with a higher fidelity are longer, more error-prone, and harder to discover.
For other angles, it becomes increasingly difficult to find good circuits, with only partial approximations being possible at reasonably short depth. In order to test the quality of the approximation, we search for optimal discrete circuits representing a quantum circuit according to a given parametrization. We use exhaustive search [93] of all possible circuits (up to circuit depth 10) and stochastic descent, a special type of structured random search -see RL approach in [32] -(up to circuit depth 20) to search for optimal decompositions of discrete gates and try to reproduce the chosen circuit with increasing number of unitaries. The results are given in Tab. VI and Fig. 9 for the CR gate with angles π √ 2 , π √ 3 , π √ 5 , π √ 7 . We see that although the fidelity of the discrete gates increases with the size of the quantum circuit, in three cases it cannot reach the value of F=0.99 for circuits of depth smaller than 20. In the case of θ = π √ 7 , a valid decomposition with fidelity F=0.9999 is found. The search is performed exhaustively for l circ < 10 and then using stochastic descent for l circ > 10. Moreover, the decomposition of CR( π √ 7 ) contains 2 CNOT gates, which again implies a slowdown of the gate execution time. Other examples for gates with superior analog performance can be seen, e.g., Ref. [8], with superior performance especially expected for variational circuits.