Gradient Ascent Pulse Engineering with Feedback

Efficient approaches to quantum control and feedback are essential for quantum technologies, from sensing to quantum computation. Open-loop control tasks have been successfully solved using optimization techniques, including methods like gradient-ascent pulse engineering (GRAPE), relying on a differentiable model of the quantum dynamics. For feedback tasks, such methods are not directly applicable, since the aim is to discover strategies conditioned on measurement outcomes. In this work, we introduce feedback-GRAPE, which borrows some concepts from model-free reinforcement learning to incorporate the response to strong stochastic (discrete or continuous) measurements, while still performing direct gradient ascent through the quantum dynamics. We illustrate its power considering various scenarios based on cavity QED setups. Our method yields interpretable feedback strategies for state preparation and stabilization in the presence of noise. Our approach could be employed for discovering strategies in a wide range of feedback tasks, from calibration of multi-qubit devices to linear-optics quantum computation strategies, quantum-enhanced sensing with adaptive measurements, and quantum error correction.


I. INTRODUCTION
The application of optimal-control techniques to quantum systems [1,2] forms a cornerstone of modern quantum technologies, ranging from the tailoring of laser pulses acting on molecules to the synthesis of unitaries in multi-qubit systems as part of the "compilation" of quantum algorithms for specific hardware platforms. Since the equations of quantum dynamics are explicitly known and even differentiable, one can exploit this knowledge and specifically make use of powerful gradient-based techniques. The most prominent approach is "gradient-ascent pulse engineering" (GRAPE) [3,4], with its efficient evaluation of gradients, together with its variants. GRAPE is the state-of-the-art method for quantum optimal control and is extremely widely employed. In fact, it has been used to find optimal control sequences for spin systems [3,5,6], coupled qubits [7,8], an implementation of the Jaynes-Cummings model [9], and qubit-cavity lattices [10], among many other examples. It has also been used to optimize open dynamics [11,12], has been turned into an adaptive approach to cope with parameter uncertainties [13][14][15][16], and has been extended to second-order optimization techniques [17]. Other efficient gradient-based optimal-control approaches have also been presented recently (e.g. [18]).
However, there is one crucial extension that is not easily addressed by such gradient-based techniques: feedback. Conditioning the control sequence on the stochastic outcomes of quantum measurements is an important component of many more challenging tasks [19]. It makes it possible to remove entropy from the system and is therefore essential in applications like state preparation and stabilization in the presence of noise [20][21][22][23][24], adaptive measurements [25], or quantum error correction with its syndrome extraction (e.g. [26][27][28][29]). Unfortunately, discovering feedback strategies is a formidable challenge. These strategies live in a space that is combinatorially larger than that of open-loop control strategies, since every sequence of measurement outcomes may require a different response. Beyond that general difficulty, it is unclear a priori how to take gradients through strong stochastic quantum measurement events.

* riccardo.porotti@mpl.mpg.de
In principle, there is a set of techniques from machine learning that can discover feedback strategies without taking gradients through quantum dynamics: so-called model-free reinforcement learning (RL) [30] approaches (for a brief review, see Appendix A). During the last few years, a number of groups have demonstrated numerically the promise of model-free RL for quantum physics.
This included both open-loop control tasks (e.g. [31][32][33][34], also in an experiment [35]), but in particular the more challenging quantum real-time feedback tasks that rely on adaptive responses to measurement outcomes [36][37][38][39], recently showcased in first experiments [40,41]. In model-free RL, the quantum device is treated as a black box, which can be an advantage in applications to experimental setups with unknown parameters [35,38,40,41]. On the other hand, much of the training time is therefore spent in learning implicitly a model of the dynamics while simultaneously attempting to find good feedback strategies. This can make learning inefficient, leading to longer training times and/or suboptimal strategies.
It would therefore seem desirable to find a way to incorporate feedback based on arbitrary quantum measurements into a direct gradient-based optimal control technique like GRAPE, making efficient use of our knowledge of the differentiable quantum dynamics.
In this work, we present such a technique, which we refer to as 'feedback-GRAPE'. In the language of RL, it would be classified as a model-based technique [42]. On the one hand, it keeps the ability of GRAPE to exploit gradients through the quantum dynamics, making the learning more efficient. On the other hand, similar to model-free RL, it provides a flexible approach to incorporate feedback, even in the presence of strong stochastic measurements, and to do this efficiently using Monte-Carlo simulations.
Overall, the technique we introduce here, feedback-GRAPE, is conceptually simple: GRAPE-type gradient ascent for the continuous control parts, possibly implemented using automatic differentiation for convenience [43][44][45][46][47][48], and in any case exploiting modern gradient optimizers, supplemented with stochastic sampling of measurement outcomes.
We show that introducing Monte-Carlo sampling in the framework of gradient-based feedback optimization requires the addition of an important correction term to the overall reward function, for discrete measurement outcomes, or a "reparametrization" of the measurement probability density, for continuous measurement outcomes. This innovation allows us to optimize any differentiable reward over long sequences of measurements, both discrete and continuous.
In this way, feedback-GRAPE is able to go significantly beyond existing gradient-based optimization methods for feedback. These are typically limited to greedy optimization over one or, at most, a few measurements [21]. Otherwise, for non-greedy optimization, they are limited to optimal-control problems that can be mapped onto so-called classical linear-quadratic-Gaussian control problems, or to a special linear ansatz for the feedback protocol, as in so-called Markovian quantum feedback methods [19,49]. The special limiting case of weak Gaussian-distributed measurements, which does not yet require the mathematical treatment that we will introduce, has recently been considered by [50], which can thus be regarded as an important first step towards the general method we are going to discuss here. Another important aspect of our method is that, in contrast to so-called Bayesian quantum feedback approaches [19,22,51], it does not require simulating the system dynamics during deployment in an experiment, as the controller is only provided with the measurement outcomes. This feature is important both for real-time control at fast time scales and for the scalability to more complex systems.
We illustrate the power of feedback-GRAPE in a series of different tasks, considering different experimental setups relevant for modern quantum computing employing cavity modes [52]. Although our method is general, we have focused on feedback sequences with a modular structure, i.e. where building blocks like unitaries and measurements are combined in discrete time steps. These are useful scenarios, since they can make it easier to interpret the resulting strategies.
In the following, we will first present the general method, then analyze the numerical examples, discuss aspects of the optimization landscape and scalability, and finally present further extensions.

FIG. 1. (c) Schematic decision-tree representation of a feedback strategy for discrete measurement outcomes. Intervals of differentiable evolution with optimizable control functions $F_\theta^j$ depend on the sequence of outcomes $m_1, \ldots, m_j$. In general, the evolution can be dissipative.

II. FEEDBACK-GRAPE METHOD
We consider a general dissipative quantum system with feedback (for an overview of the scheme, see Fig. 1). Suppose measurements are performed at times $t_1, t_2, \ldots, t_N$, and the evolution is controlled, in a manner to be optimized, based on the corresponding measurement outcomes $m_j$. Specifically, the control parameter $F_\theta^j$ (which might be a vector) applied during the time interval $[t_j, t_{j+1}]$ is a function of all previous measurement results $\mathbf{m}_j = (m_1, \ldots, m_{j-1}, m_j)$. Below, we refer to the set of controls $\{F_\theta^j(\mathbf{m}_j)\}$ for all possible measurement outcomes as a feedback strategy or simply strategy, see Fig. 1(c). Here, we anticipated that the feedback-control functions $F_\theta^j(\mathbf{m}_j)$ are parametrized, depending on trainable parameters $\theta$ that will be optimized via gradient ascent ($\theta$ is typically a high-dimensional vector). We assume $F_\theta^j$ to be differentiable with respect to $\theta$. Ultimately, the value of $F_\theta^j$ will be provided by a neural network or, alternatively, a lookup table; we comment on these different approaches further below, but the present considerations are independent of this aspect. In practice, the control vector $F_\theta^j$ might enter a Hamiltonian or directly a parametrized unitary gate. On a minor note, in some scenarios, during the first time interval $[0, t_1]$, one might apply a control $F^0$ that does not depend on any previous measurement outcomes but can still be optimized.
With this notation in place, the time evolution of the system's density matrix, for a particular measurement sequence $\mathbf{m} = (m_1, m_2, \ldots)$, can be written in the general form

$$\hat\rho(T|\mathbf{m}) = \Phi^{(\mathbf{m})}[\hat\rho(0)], \qquad (1)$$

where $\Phi^{(\mathbf{m})}$ is the map that depends on the control parameters and implements the quantum-dissipative time evolution throughout the whole time interval $[0, T]$, conditioned on the given fixed sequence $\mathbf{m}$ of measurement outcomes. Note that our definition implies that $\Phi^{(\mathbf{m})}$ itself is not a completely positive (CP) map, because it contains the renormalization of the quantum state required after each measurement (it implements a "quantum instrument"), which introduces a nonlinear dependence on the initial state. To obtain the unconditional average quantum state, the average $\langle \ldots \rangle_{\mathbf{m}}$ of this expression may be taken over all possible measurement sequences, weighted with their respective probabilities. Eq. (1) is formally valid even if the overall evolution is non-Markovian. It can be simplified in the important Markovian case. Then, evolution proceeds step-wise. Let us denote by $\Phi^j$ the CP map for the continuous evolution during the time interval $[t_j^+, t_{j+1}^-]$, where $t^-$ is shorthand for a time point just prior to the measurement at $t$, and correspondingly $t^+$ is right after the measurement. Then we have $\hat\rho(t_{j+1}^-) = \Phi^j(F_\theta^j(\mathbf{m}_j))[\hat\rho(t_j^+)]$. In the special case of unitary dynamics, the evolution simplifies further to $\hat\rho(t_{j+1}^-) = \hat U^j(F_\theta^j(\mathbf{m}_j))\,\hat\rho(t_j^+)\,\hat U^j(F_\theta^j(\mathbf{m}_j))^\dagger$. Here $\hat\rho(t_{j+1}^-)$ is understood to be the quantum state at time $t_{j+1}$ for a fixed sequence $m_1, \ldots, m_j$ of previous measurement outcomes, just prior to the next positive-operator-valued measure (POVM) measurement implemented at $t_{j+1}$. This measurement is described by some POVM element that can be written in the form $\hat M(m')^\dagger \hat M(m')$, with the POVM normalization condition $\sum_{m'} \hat M(m')^\dagger \hat M(m') = \mathbb{1}$ and $\hat M \equiv \hat M^{j+1}$ depending on the physics of the measurement.
It will yield a particular outcome $m_{j+1} \equiv m'$ with probability $P(m') = \mathrm{tr}[\hat M(m')^\dagger \hat M(m')\,\hat\rho(t_{j+1}^-)]$ and an updated state $\hat\rho(t_{j+1}^+) = \hat M(m')\,\hat\rho(t_{j+1}^-)\,\hat M(m')^\dagger / P(m')$. Our goal is to maximize some overall cumulative reward $R$, which is called "return" in the nomenclature of reinforcement learning. For example, in a state-preparation task this might be the final fidelity with respect to some target state $\hat\sigma$. For a given sequence $\mathbf{m}$ of outcomes, we would define $R(\mathbf{m}) = \left(\mathrm{tr}\sqrt{\sqrt{\hat\sigma}\,\hat\rho(T|\mathbf{m})\,\sqrt{\hat\sigma}}\right)^2$. This would eventually be averaged over all possible measurement-outcome sequences to yield $\bar R = \langle R(\mathbf{m}) \rangle_{\mathbf{m}}$. The return $R$ could also involve penalties suppressing larger control amplitudes, etc. These additional contributions depend on the specific sequence $\mathbf{m}$ as well, via the controls $F_\theta^j(\mathbf{m}_j)$.
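The measurement update just described (sample an outcome $m'$ with probability $P(m') = \mathrm{tr}[\hat M^\dagger \hat M \hat\rho]$, then renormalize the state) can be sketched in a few lines of Python. This is our own minimal illustration using a projective qubit measurement as the POVM, not code from the setups discussed later; the function name is ours.

```python
import numpy as np

def povm_measure(rho, Ms, rng):
    """Sample an outcome m with P(m) = tr[M_m^dag M_m rho] and return
    (m, P(m), post-measurement state M_m rho M_m^dag / P(m))."""
    probs = np.array([np.trace(M.conj().T @ M @ rho).real for M in Ms])
    m = rng.choice(len(Ms), p=probs)
    post = Ms[m] @ rho @ Ms[m].conj().T / probs[m]
    return m, probs[m], post

# Projective qubit measurement in the z basis (simplest POVM example).
M0 = np.array([[1.0, 0.0], [0.0, 0.0]])   # |0><0|
M1 = np.array([[0.0, 0.0], [0.0, 1.0]])   # |1><1|
rho = np.array([[0.5, 0.5], [0.5, 0.5]])  # |+><+|, so P(0) = P(1) = 1/2
m, p, rho_post = povm_measure(rho, [M0, M1], np.random.default_rng(0))
```

Note that `probs` depends on the state `rho` and hence, in a full feedback sequence, on all earlier controls; this is exactly the dependence that the gradient treatment below has to account for.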
It might now seem straightforward to employ automatic differentiation for optimizing $\bar R$ via gradient ascent, updating $\delta\theta = \eta\,\partial\bar R/\partial\theta$, with some learning rate $\eta$ and with all the trainable parameters combined in a vector $\theta$.
The crucial observation to be made at this stage is that the introduction of stochastic measurement results into this scheme requires some extra care. The following considerations constitute the main conceptual steps needed to enable the discovery of feedback-based quantum control strategies based on gradient ascent.
We have to distinguish between discrete and continuous measurement outcomes, which require substantially different treatment.
For the particularly interesting discrete case (e.g. strong projective qubit measurements), the essential insight is that the probabilities $P$ for obtaining the different measurement outcomes themselves depend on all the controls $F_\theta^j$ applied during previous time intervals, simply because the quantum state itself carries this dependence. This has to be taken care of during the evaluation of gradients with respect to $\theta$. Illustrating this in the case of a single measurement $m$ at time $t_1 \in [0, T]$, we have

$$\bar R = \sum_m P(m|\hat\rho(t_1^-))\, R\!\left(\Phi^1(F_\theta^1(m))[\hat\rho(t_1^+)]\right). \qquad (2)$$

Here $P(m|\hat\rho)$ is the probability for measurement outcome $m$ given state $\hat\rho$. As we take the gradient with respect to the trainable parameters $\theta$, we observe that the derivative acts not only on the return $R$ based on the time-evolved state $\Phi^1(F_\theta^1(m))[\hat\rho(t_1^+)]$ (the second factor inside the sum) but also on the probability $P(m)$ itself, due to its dependence on the initial control, $\hat\rho(t_1^-) = \Phi^0(F_\theta^0)[\hat\rho(0)]$. Generalizing this observation, we cannot simply implement gradients of the measurement-averaged return $\bar R = \langle R(\mathbf{m}) \rangle_{\mathbf{m}}$ by averaging the gradient of the sequence-specific return, $\langle \partial R(\mathbf{m})/\partial\theta \rangle_{\mathbf{m}}$. Rather, observe $\langle R(\mathbf{m}) \rangle_{\mathbf{m}} = \sum_{\mathbf{m}} P(\mathbf{m}) R(\mathbf{m})$. Thus, when evaluating $\partial \langle R(\mathbf{m}) \rangle_{\mathbf{m}}/\partial\theta$, we will get two contributions: $\partial[R(\mathbf{m})P(\mathbf{m})]/\partial\theta = P(\mathbf{m})\,\partial R(\mathbf{m})/\partial\theta + R(\mathbf{m})\,\partial P(\mathbf{m})/\partial\theta$. To enable stochastic sampling of the second term, we rewrite it using $\partial P(\mathbf{m})/\partial\theta = P(\mathbf{m})\,\partial \ln P(\mathbf{m})/\partial\theta$. This then leads to:

$$\frac{\partial \bar R}{\partial\theta} = \left\langle \frac{\partial R(\mathbf{m})}{\partial\theta} + R(\mathbf{m})\,\frac{\partial \ln P_\theta(\mathbf{m})}{\partial\theta} \right\rangle_{\mathbf{m}}. \qquad (3)$$

Here we displayed explicitly the parameter dependence of $P_\theta(\mathbf{m})$, which represents the probability of the full sequence of outcomes $\mathbf{m} = (m_1, m_2, \ldots)$, given the parameters $\theta$ that determined the shape of the control functions $F_\theta^j$.
The mathematics for the extra term appearing here, with the gradient of the log-likelihood, is well known from policy-gradient-based approaches in model-free reinforcement learning. However, there this term appears for a different reason. It arises due to the deliberate choice of implementing stochastic controls, in order to avoid any need to take gradients through the possibly unknown dynamics of the system to be controlled. For more details see Appendix A. In our case, by contrast, we do take gradients through the known dynamics, and the controls themselves are deterministic when conditioned on a fixed sequence of measurements. The randomness enters via the stochastic measurement outcomes (these are observations of the "environment" in RL language).

FIG. 2. (a) The measurement samples a stochastic outcome $m_t$, adopting a different method depending on whether the outcome is continuous or discrete. In both cases, the probability distribution depends in a differentiable way on the trainable parameters $\theta$, via the preceding unitary controls that have generated the present quantum state $\hat\rho_\theta$. Depending on the measurement outcome, a learnable control $F$ is applied that may be implemented either via a neural network or a lookup table. (b) Full sequence. This consists of repeated application of the blocks depicted in (a), plus subsequent implementation of unitary controls depending on $F$, potentially with decay and decoherence included in the model of the system's evolution. For discrete measurement outcomes, the logarithmic term (in brackets) has to be accumulated and is used to evaluate a log-likelihood correction term when optimizing the overall return $R$ using gradient ascent, cf. Eq. (3). An algorithmic flow-chart representation of the learning pipeline for the case of discrete measurements is provided in Appendix B.
Due to the sequential nature of the control procedure, the log-likelihood term can be rewritten as a sum of contributions, $\ln P_\theta(\mathbf{m}) = \sum_j \ln P_\theta(m_j|\mathbf{m}_{j-1})$. Thus, during an individual time-evolution trajectory, this term may easily be accumulated step by step, since the conditional probabilities are known (these are just the POVM measurement probabilities). The gradients of Eq. (3) can then be taken for such an individual trajectory or a batch, substituting stochastic sampling for an exact average over $\mathbf{m}$. The whole approach, with its calculational pipeline, is illustrated schematically in Fig. 2. Additionally, a more detailed algorithmic flow-chart representation is given in Appendix B.
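To make the gradient estimator of Eq. (3) concrete, here is a minimal numpy sketch on a toy example of our own devising (not one of the setups of Sec. III): a single qubit prepared in $(\cos\theta, \sin\theta)$ by a trainable angle $\theta$, measured once in the $z$ basis, with reward $R(m) = m$. The exact gradient is $d(\sin^2\theta)/d\theta = \sin 2\theta$. Naively averaging $\partial R/\partial\theta$ alone would give zero here; the reward-weighted log-likelihood term recovers the correct value.

```python
import numpy as np

def grad_estimate(theta, n_samples, rng):
    """Monte-Carlo estimate of d<R>/dtheta per Eq. (3):
    g = dR/dtheta + R(m) * d ln P_theta(m)/dtheta.
    Here R(m) = m has no explicit theta dependence, so dR/dtheta = 0."""
    p1 = np.sin(theta) ** 2                         # P(m = 1) = sin^2(theta)
    m = (rng.random(n_samples) < p1).astype(float)  # sampled outcomes
    # d ln P/dtheta for each sampled outcome (both branches are finite here)
    dlogp = np.where(m == 1.0,
                     2 * np.cos(theta) / np.sin(theta),    # d ln sin^2 / dtheta
                     -2 * np.sin(theta) / np.cos(theta))   # d ln cos^2 / dtheta
    return np.mean(m * dlogp)                       # reward-weighted score term

theta = 0.7
exact = np.sin(2 * theta)                           # analytic d<R>/dtheta
estimate = grad_estimate(theta, 200_000, np.random.default_rng(1))
```

In a full feedback-GRAPE run, `dlogp` would be accumulated over the whole trajectory as $\sum_j \partial \ln P_\theta(m_j|\mathbf{m}_{j-1})/\partial\theta$ by an automatic-differentiation framework rather than written out by hand.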
The Monte-Carlo evaluation of the average reward $\langle R(\mathbf{m}) \rangle_{\mathbf{m}} = \sum_{\mathbf{m}} P(\mathbf{m}) R(\mathbf{m})$ is a crucial ingredient for tackling long sequences of stochastic measurements, as it allows us to focus only on the most likely measurement outcomes among the exponentially large set of such outcomes. Only for very short sequences would it be feasible to instead explicitly evaluate the sum over all possible measurement outcomes, producing less noisy gradients.
The evaluation of the gradients of the return with respect to the trainable parameters $\theta$ can proceed in two different ways: using either automatic differentiation (see below) or analytical approaches that yield explicit expressions for the gradients, which can then be evaluated numerically. In the latter case, one can either set up evolution equations for the parameter gradient of the quantum state, $\partial_\theta \hat\rho$, or, in the suitable scenario, directly apply a modified version of the original GRAPE technique to efficiently evaluate the gradients. We describe both of these procedures in detail in Appendix C. In the language of current machine learning concepts, taking the gradient through the continuous-evolution intervals is, generally speaking, an example of the concept of neural ordinary differential equations, a rather recent development [53].
Alternatively, and sometimes more conveniently, the whole evolution pipeline described above can straightforwardly be implemented in an automatic differentiation framework, such as TensorFlow [54], PyTorch, JAX, or others. Gradients of the resulting overall return and of the log-probability can then be obtained using that framework without extra effort. The sequence of discrete measurement outcomes of a given trajectory is considered fixed when taking the gradient in this manner. The automatic-differentiation approach is particularly helpful and efficient in cases where the whole time evolution can be split into many building blocks (parametrized gates, i.e. unitaries, acting during fixed time intervals), as is common practice for many quantum control tasks in present quantum computing platforms. Whenever this latter situation is encountered, it also aids interpretability, as we will see in the numerical examples.
As we remarked at the beginning, the central quantities of our approach are the measurement-dependent controls $F_\theta^j(\mathbf{m}_j)$. To store them, one can simply adopt a lookup table, at least for the case of discrete measurements discussed up to now and when the total number of measurements during the full time evolution is not too large. The table for $F_\theta^j$ needs $M^j$ entries if there are $M$ possible outcomes for each measurement, corresponding to the exponentially many possible sequences. In that case, the entries of this table would directly represent the trainable parameters $\theta$. Alternatively, the controls $F_\theta^j(\mathbf{m}_j)$ can be implemented via a neural network that takes measurement results as input and maps those to the current control vector. Since the number of available measurement results is different for each time step $j$, one may choose to set up a different network for each $j$. However, training efficiency and generalization ability can be improved by constructing a single recurrent network, i.e. a network with memory of the kind employed in sequence-processing tasks [55]. It takes the temporal sequence of measurements as input, one step at a time, producing a control vector at each such time step. This approach can possibly generalize to infinitely long feedback-control sequences, for example during state-stabilization tasks. In the course of our numerical experiments, to be detailed later, we observed both scenarios where the neural network outperformed the lookup table and the reverse.
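The lookup-table parametrization can be sketched as follows, assuming binary measurement outcomes ($M = 2$); the helper functions and the binary-number indexing scheme are our own illustration.

```python
import numpy as np

def make_tables(n_steps, n_controls, rng):
    """One table per step j, with 2**j rows for binary outcome histories
    (step 0 holds the single unconditioned control F^0). The table entries
    are exactly the trainable parameters theta."""
    return [rng.normal(size=(2 ** j, n_controls)) for j in range(n_steps)]

def control(tables, j, outcomes):
    """Look up F_theta^j for the outcome history (m_1, ..., m_j), m_i in {0, 1}."""
    idx = 0
    for m in outcomes[:j]:
        idx = 2 * idx + m          # interpret the history as a binary number
    return tables[j][idx]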
We note in passing that neither the look-up table nor the recurrent-NN approach requires a Bayesian estimate of the state during deployment in an experiment: they operate purely on the measurement outcomes. For the sake of comparison, however, we also considered a Bayesian quantum feedback approach [51] in which a fully connected NN is provided with the quantum state before each time step. This approach suffers from scalability problems, because it can be deployed in an experiment only in combination with real-time simulations of a stochastic master equation [51] to update the quantum state based on the measurement outcomes. We will see later that our numerical results indicate that knowledge of the full quantum state does not appear to provide any substantial learning advantage compared to our non-Bayesian approaches.
Continuous measurement outcomes can be treated in exactly the same way as discrete ones. However, for that scenario there also exists an alternative, which obviates the need for the log-likelihood correction term: we can adopt a general version of what is known as the 'reparametrization trick' in stochastic neural networks (e.g. in variational autoencoders). The idea is that we can generate a stochastic variable $z$ according to some fixed probability density and then transform it into the required measurement probability density $p(m|\hat\rho)$, which does depend on control parameters (via the state $\hat\rho$, as explained above) and must be subjected to gradients. This parameter-dependent transformation can be implemented in a differentiable way, as we now show. We first obtain the cumulative distribution function $f(m) = \int_{-\infty}^m p(m'|\hat\rho)\,dm'$, by discretizing $p$ as a vector on a lattice and using a cumulative sum as an Euler approximation of the integral (this operation exists in frameworks like TensorFlow). We then draw a random uniformly distributed $z \in [0, 1]$ and invert $f(m)$. The last step also needs to be performed in a differentiable way. One option is to set

$$m = \sum_n H(z - z_n)\, H(z_{n+1} - z)\, \bar m_n. \qquad (4)$$

Here $z_n = f(m_n)$ defines the lattice version of $f$, $H$ is the Heaviside step function, the sum ranges over the lattice points, and $\bar m_n$ solves the piecewise-linearized approximation of $m = f^{-1}(z)$ associated with the interval $n$: $\bar m_n = (m_{n+1} - m_n)(z - z_n)/(z_{n+1} - z_n) + m_n$. The set of measure zero where the gradient is undefined can be ignored, as is common practice when using activation functions like rectified linear units in neural networks.
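A forward-pass sketch of this inverse-CDF construction, in plain numpy, might look as follows (our own illustration; in an automatic-differentiation framework such as TensorFlow or JAX, the same cumulative-sum and interpolation operations carry gradients with respect to the density values `p`, while the interval index is treated as a constant, in the spirit of Eq. (4)).

```python
import numpy as np

def sample_inverse_cdf(m_grid, p, z):
    """Draw an outcome m from the density p (tabulated on the lattice m_grid)
    by inverting the discretized CDF at a uniform sample z in [0, 1], using
    piecewise-linear interpolation within the selected interval."""
    dm = m_grid[1] - m_grid[0]
    zn = np.cumsum(p) * dm               # Euler approximation of z_n = f(m_n)
    zn = zn / zn[-1]                     # normalize so that f(m_max) = 1
    n = int(np.searchsorted(zn, z, side="right"))
    n = max(1, min(n, len(m_grid) - 1))  # clamp to a valid interval
    # piecewise-linear inversion within [z_{n-1}, z_n]
    return (m_grid[n] - m_grid[n - 1]) * (z - zn[n - 1]) / (zn[n] - zn[n - 1]) \
        + m_grid[n - 1]

# Example: Gaussian density with mean 1; z = 1/2 should return the median.
grid = np.linspace(-5.0, 7.0, 1201)
p = np.exp(-(grid - 1.0) ** 2 / 2)
m_median = sample_inverse_cdf(grid, p, 0.5)
```
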
In this way, one can implement, within the automatic-differentiation framework, for example, measurements of discrete variables with continuous outcomes. A typical case would be a qubit measurement with $m = \sigma + \xi$, where $\sigma = \pm 1$ is the qubit state and $\xi$ some measurement noise of density $q(\xi)$. Formally, $p(m|\hat\rho) = \sum_\sigma q(m-\sigma)\,\rho_{\sigma\sigma}$, and $\hat M(m) = \sum_\sigma \sqrt{q(m-\sigma)}\,|\sigma\rangle\langle\sigma|$. One can also perform measurements on continuous variables, e.g. a weak measurement of position, $p(m|\hat\rho) = \int dx\, q(m-x)\,\rho(x,x)$, with $\hat M(m) = \int dx\, \sqrt{q(m-x)}\,|x\rangle\langle x|$. The dependence of the probability density $p$ in each case on the parameters determining the control functions at earlier times will be correctly taken into account, and one can now use the straightforward formula $\partial \bar R/\partial\theta = \langle \partial R(\mathbf{m})/\partial\theta \rangle_{\mathbf{m}}$ for stochastic sampling of the gradient. Note that the discrete-outcome case (above) and the continuous-outcome case can also easily be combined in our approach.
Our reparametrization trick allows us to switch from an arbitrary state-dependent probability density $p(m|\hat\rho)$ to an easy-to-sample fixed probability density, obviating the need for the log-likelihood term that was required in the case of discrete outcomes discussed earlier.
There is a special limiting case in which an even simpler linear reparametrization achieves the same goal. We are referring to the case in which $p(m|\hat\rho)$ is a Gaussian that depends on the state of the system only via its mean value $\bar m$, $p(m|\hat\rho) = \mathcal N(\bar m(\hat\rho), 1)$. In this case, one can switch to a fixed (easy-to-sample) Gaussian probability density with the reparametrization $z = m - \bar m(\hat\rho) \sim \mathcal N(0, 1)$. This reparametrization has the advantage that it can be trivially inverted. This kind of description automatically arises in the well-known quantum-trajectories approach applied to homodyne detection [49] of light emerging from a cavity. In that setting, the observable $m$ of interest is the observed homodyne detection current, appropriately rescaled and integrated over a time window $\Delta t$ much smaller than the typical system decay time. Its distribution depends on the state of the system only via its mean value $\bar m = \mathrm{tr}\,\hat O\hat\rho$, where $\hat O$ is the system observable that couples to the bath, with $\hat O \propto \sqrt{\Delta t}$. In the framework of this quantum-trajectory approach, one can thus find feedback strategies conditioned on the homodyne current by differentiating through the system dynamics without introducing our more flexible reparametrization of $p(m|\hat\rho)$ or our additional log-likelihood term. This approach has been pursued recently by Schäfer et al. [50].
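The linear Gaussian reparametrization is a one-liner; in the sketch below (our own illustration, with $\hat O = \hat\sigma_z$ as a stand-in observable), the noise $z$ is parameter-independent, so in an autodiff framework gradients with respect to the controls flow only through the mean $\mathrm{tr}\,\hat O\hat\rho$, and no log-likelihood term is needed.

```python
import numpy as np

def homodyne_sample(rho, O, rng):
    """Reparametrized Gaussian sample m = tr(O rho) + z with z ~ N(0, 1).
    The randomness z is decoupled from the trainable parameters."""
    m_bar = np.trace(O @ rho).real
    return m_bar + rng.standard_normal()

# Example: qubit in |0><0|, observable sigma_z, so the mean outcome is +1.
O = np.diag([1.0, -1.0])
rho = np.diag([1.0, 0.0])
```
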
So far, the controls have been continuous and represented via functions (differentiable with respect to parameters) depending on previous measurement results. However, sometimes one might want to also take discrete actions, e.g. deciding whether some measurement should be performed at all, or whether some fixed qubit gate should be applied. This can be incorporated without any substantial changes to the approach discussed here, borrowing from policy-gradient model-free reinforcement learning, by introducing stochastic actions $a$, in contrast to the deterministic continuous actions discussed so far: use a network or a lookup table to calculate the probability $P_\theta(a_j|\mathbf{m}_j, \mathbf{a}_{j-1})$ of taking a discrete action $a_j$ at step $j$, given the previous measurement record $\mathbf{m}_j$ and actions $\mathbf{a}_{j-1} = (a_1, a_2, \ldots, a_{j-1})$, and then sample from all actions accordingly. Then, the same form of the gradient of the reward applies as in Eq. (3), but now with both the reward $R$ and the probability $P_\theta$ depending not only on the measurement history $\mathbf{m}$ but also on the history of all the actions taken throughout the trajectory, $\mathbf{a} = (a_1, a_2, \ldots)$. As before, the average in Eq. (3) can be replaced by Monte-Carlo sampling, while the log-likelihood term can be accumulated according to $\ln P_\theta(\mathbf{m}, \mathbf{a}) = \sum_j [\ln P_\theta(m_j|\mathbf{m}_{j-1}, \mathbf{a}_{j-1}) + \ln P_\theta(a_j|\mathbf{m}_j, \mathbf{a}_{j-1})]$.
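A stochastic discrete action of this kind can be sampled from a softmax policy as follows (a hypothetical parametrization of our own; the logits would come from the network or lookup table mentioned above). The returned log-probability is exactly the per-step contribution $\ln P_\theta(a_j|\mathbf{m}_j, \mathbf{a}_{j-1})$ to be accumulated into the log-likelihood term.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over action logits."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sample_action(logits, rng):
    """Sample a discrete action a with probability P_theta(a) = softmax(logits)_a
    and return (a, ln P_theta(a)) for the log-likelihood accumulator."""
    prob = softmax(logits)
    a = rng.choice(len(prob), p=prob)
    return a, np.log(prob[a])
```
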

III. NUMERICAL EXAMPLES
We now turn to an illustration of the feedback-GRAPE method by solving several challenging quantum feedback control tasks. We will consider five separate tasks of increasing difficulty: starting with noiseless state preparation (an open-loop control task) as a baseline benchmark for GRAPE-type control in this scenario, then moving to purification (a task that already benefits from feedback, i.e. adaptive measurements), to feedback-based state preparation in the presence of noisy control parameters or out of a thermal state, and finally to feedback-based state stabilization. Along the way, we will explore a handful of different experimental scenarios.
A. State preparation with Jaynes-Cummings controls

As a preliminary step, we consider the preparation of a target state starting from a pure state. In addition, we assume that any coupling to an external environment is negligible and that the parametrized controls can be implemented perfectly. In this setting, the preparation of a quantum state does not require any feedback, and thus we will not yet be able to test our feedback extension of GRAPE. Instead, the purpose of this section is to provide a compelling motivation for our approach, showing that, even before feedback is introduced, (GRAPE-type) model-based optimal-control approaches outperform, sometimes dramatically, their model-free counterparts.
As a first example, we consider the state preparation of a cavity resonantly coupled to an externally driven qubit, cf. Fig. 3a. This scenario is modelled by the well-known Jaynes-Cummings Hamiltonian [56]. It is the first and simplest light-matter coupling scenario that emerged in quantum optics [56,57] but is nowadays of practical relevance for modern quantum-computing platforms [52]. In those, it is employed both for qubit readout and for qubit-enabled nonlinear manipulation of cavity states. Here, we will consider a particular sequence of parametrized unitary gates originally introduced by Law and Eberly [58]. The sequence consists of a series of two interleaved gates, cf. Fig. 3b. In the first gate, the qubit is driven externally to implement an arbitrary rotation about an equatorial axis, implementing the unitary gate $\hat U_q(\alpha) = \exp\!\left[-i\left(\alpha\hat\sigma_+^c + \alpha^*\hat\sigma_-^c\right)/2\right]$. Here we introduced the qubit raising (lowering) operator $\hat\sigma_+^c$ ($\hat\sigma_-^c$); $|\alpha|$ is the rotation angle, while $\arg(\alpha/|\alpha|)$ is the azimuthal angle of the rotation axis. In the second gate, the qubit and the cavity mode with ladder operator $\hat a$ can be coupled for a variable duration, exchanging excitations, $\hat U_{qc}(\beta) = \exp\!\left[-i\left(\beta\hat a\hat\sigma_+^c + \beta^*\hat a^\dagger\hat\sigma_-^c\right)/2\right]$. Here, $|\beta|$ is proportional to the interaction time, see Appendix D. Depending on the target state, the control parameters $\alpha$ and $\beta$ can be chosen to be complex or real. In the latter case, the control vector $F^j$ defined in Section II is simply $F^j = (\alpha_j, \beta_j)$.
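The two gates can be written down explicitly in a truncated Fock space; the following numpy/scipy sketch is our own construction following the formulas above (the truncation `Nc` and the basis conventions, qubit tensored with cavity and $\hat\sigma_+|g\rangle = |e\rangle$, are our choices). With these conventions, $\hat U_{qc}(\pi)$ fully transfers a qubit excitation into the cavity, $|e,0\rangle \to |g,1\rangle$ up to a phase, which is the elementary move of the Law-Eberly construction.

```python
import numpy as np
from scipy.linalg import expm

Nc = 5                                            # cavity Fock-space truncation
sp = np.array([[0, 0], [1, 0]], dtype=complex)    # qubit raising operator sigma_+
a = np.diag(np.sqrt(np.arange(1, Nc)), k=1).astype(complex)  # cavity ladder operator
Ic = np.eye(Nc)

def U_q(alpha):
    """Qubit rotation U_q(alpha) = exp[-i(alpha s+ + alpha* s-)/2], on qubit x cavity."""
    h = alpha * sp + np.conj(alpha) * sp.conj().T
    return np.kron(expm(-1j * h / 2), Ic)

def U_qc(beta):
    """Qubit-cavity exchange U_qc(beta) = exp[-i(beta a s+ + beta* a^dag s-)/2]."""
    h = beta * np.kron(sp, a) + np.conj(beta) * np.kron(sp.conj().T, a.conj().T)
    return expm(-1j * h / 2)
```

Interleaving $N$ pairs $\hat U_{qc}(\beta_j)\hat U_q(\alpha_j)$ then reproduces the Law-Eberly control sequence, with $(\alpha_j, \beta_j)$ as the trainable parameters.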
In their groundbreaking work [58], Law and Eberly showed that any arbitrary superposition of Fock states with maximal excitation number N can be prepared out of the ground state in a sequence of N such interleaved gates, providing also an algorithm to find the correct angles and interaction durations (see Appendix D). This solution has been used to remarkable effect in experiments with superconducting qubits [59]. Here, we use it as a benchmark to test different RL approaches.
With the goal of recovering the strategies predicted by Law and Eberly, we set the return $R$ equal to the state fidelity at the final time step, prescribing a fixed number of time steps equal to the maximum excitation number in the target state; e.g. for the state $\propto |1\rangle + |3\rangle$ we set $N = 3$.
Somewhat surprisingly, state-of-the-art model-free reinforcement learning is not able to cope well with this challenge. We employed proximal-policy optimization (PPO) [60], a powerful and widely used modern general-purpose advantage actor-critic approach, to optimize the continuous controls. It performs well only for the very simple task of preparing the Fock state $|1\rangle$, while getting stuck at bad final overlaps for higher Fock states, cf. Fig. 3c. This statement holds even after training for many episodes and varying the hyperparameters, and even for other modern general-purpose model-free RL algorithms that we tried, see Appendix E.
In contrast, direct gradient ascent through the unitary evolution, using the control parameters as learning parameters, $\theta = \{F^j = (\alpha_j, \beta_j)\}$, allows us to find optimal state-preparation strategies performing as well as the known Law-Eberly algorithm for a broad range of states. As examples, we have prepared Fock states with excitation numbers up to $n = 10$, and superpositions of two Fock states, cf. Fig. 3d. For all these states, convergence of the training protocol has been obtained in a single run, and the infidelity can be decreased down to the numerical precision of the algorithm.
In addition, we have also considered a much more challenging four-component kitten state built from four coherent states, $|\psi^{\rm Kit4}_\alpha\rangle \propto \sum_{j=0}^{3} |i^j\alpha\rangle$. Here, we consider $\alpha = 3$, corresponding to the average photon number $\bar n \approx |\alpha|^2 = 9$. We attempt to prepare this state using a long sequence of 20 time steps. In this case, we see that the infidelity tends to decrease step-wise during the training, cf. Fig. 3d. In Appendix F, we show that the detailed evolution of the fidelity during training depends strongly on the initialization. On the other hand, the height of the steps is an intrinsic feature of the target state. Interestingly, each step can be associated with a particular intermediate state that can be reached using a large number of different strategies. Each such strategy corresponds to a saddle point of the optimization landscape. The training becomes particularly slow close to these saddle points because the curvature of the optimization landscape in the direction of increasing fidelity is zero, giving rise to a narrow valley. This can cause the training to stall at a suboptimal solution. Nevertheless, if the preparation sequences are made longer (larger N), good solutions can be found in any training run. For more details, see Appendix F.
Physically, the presence of a large number of narrow valleys and plateaus in the optimization landscape is due to the fact that in the Law-Eberly protocol the excitations can only be added one by one, first exciting the qubit and then transferring the excitation to the oscillator. These features are present even for simple tasks such as the preparation of a plain Fock state with excitation number n in N = n time steps (as in Fig. 3c,d). In this case, it is enough for a single control parameter ($\beta_j$ or $\alpha_j$ for any j) to be zero for the fidelity and its gradient to vanish. If two or more parameters are small, the Hessian also vanishes, leading to a plateau. In the model-based approach, this leads to slow convergence for large n, cf. Fig. 3d. In contrast, the lack of direct access to the gradient in a model-free approach prevents convergence even for small values of n, cf. Fig. 3d.
In order to further substantiate that model-based GRAPE-type optimal-control approaches are more efficient than their model-free counterparts, we analyze the state preparation of a so-called Gottesman-Kitaev-Preskill state [61] using the same set of universal controls recently adopted by Sivak et al. [38] to demonstrate their model-free quantum-control approach. Direct comparison of our results, reported in Appendix G, with those of Ref. [38] shows the following: in this setting with more powerful controls, in which model-free RL already performs well, a GRAPE-type approach performs even better. In particular, it allows one to explore much larger parameter spaces and, thus, to obtain better-quality solutions using only a small fraction (about one percent) of the simulated training trajectories.
Regardless of these detailed observations, these examples indicate that model-based gradient-ascent approaches can outperform generic model-free methods for optimizing quantum control in settings relevant for quantum technologies. Given the large performance difference already in the open-loop control scenario, we focused entirely on the feedback-GRAPE approach in the subsequent exploration of the more advanced challenges that do include feedback.

B. State purification with qubit-mediated measurement
Next, we move to a first example of a situation that requires feedback. We will now imagine that the cavity is initially in a mixed state. The goal is to purify the cavity's state, i.e. the reward is determined by the purity $\mathrm{tr}\,\hat\rho^2_{\rm cav}$ of the cavity state at the final time. Purification can be achieved by applying repeated quantum measurements, which remove entropy from the quantum system.
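The entropy-removal mechanism can be seen in a two-line example: a projective measurement maps a mixed state onto pure conditional states, so the conditional purity $\mathrm{tr}\,\rho^2$ jumps to one (the 0.6/0.4 mixture below is an arbitrary illustration, not a state from the paper):

```python
import numpy as np

rho = np.diag([0.6, 0.4]).astype(complex)           # mixed state, purity < 1
purity_before = np.trace(rho @ rho).real            # 0.6^2 + 0.4^2 = 0.52

# projective measurement in the diagonal basis
projectors = [np.diag([1.0, 0.0]), np.diag([0.0, 1.0])]
post = []
for P in projectors:
    p = np.trace(P @ rho).real                      # outcome probability
    post.append((p, P @ rho @ P / p))               # conditioned post-measurement state
purity_after = [np.trace(r @ r).real for _, r in post]   # both equal to 1
```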
In the following, we consider an adaptive measurement scheme originally proposed in [62] and demonstrated in a series of experiments on Rydberg atoms interacting with microwave cavities [21,[63][64][65]. In this scheme, the cavity is coupled to an ancilla qubit, which can then be read out to update our knowledge of the cavity's quantum state, cf sketch in Fig. 4a.
The measurement comprises several steps, which we will list individually before summarizing their combined effect on the cavity state. In a first step, the ancilla qubit with Pauli operators $\hat\sigma^a_{i=x,y,z}$ is prepared in the $+x$ eigenstate. Subsequently, it is coupled dispersively to the cavity for a variable amount of time. In experiments, the dispersive coupling is linear in the photon number to a very good approximation in the low-photon regime, and is described by a unitary of the form $\hat U(\gamma) = \exp(-i\,\gamma\,\hat a^\dagger\hat a\,\hat\sigma^a_z/2)$, with the parameter $\gamma$ depending linearly on the interaction time. This means the qubit precesses by an angle that depends linearly on the number of photons inside the cavity. In the final step, the ancilla qubit is projected along some selected axis $\hat\sigma^a_x\cos\delta + \hat\sigma^a_y\sin\delta$, yielding a discrete result $m \in \{-1, +1\}$. The combined effect of these operations is to perform a POVM on the cavity, with outcome probability $P(m) = \mathrm{tr}[\hat M(m)^\dagger\hat M(m)\hat\rho]$ and updated state $\hat M(m)\hat\rho\hat M(m)^\dagger/P(m)$. Here $\hat\rho$ is the state of the qubit-cavity system, excluding the measurement qubit, which has been eliminated in this description. The measurement operator is $\hat M(m{=}+1) = \cos[(\gamma\,\hat a^\dagger\hat a + \delta)/2]$, and likewise for $m = -1$, with $\cos$ replaced by $\sin$. This formula indicates that after the measurement the probabilities of the different cavity Fock states $|n\rangle$ are multiplied by a sinusoidal "mask", whose period is determined by $1/\gamma$ and whose phase shift is set by both $\delta$ and the measurement outcome $m$. This helps to pinpoint the state of the cavity, especially when multiple such measurements are carried out with suitably chosen periodicities [62] and phase shifts [65]. Fig. 4c shows the results of applying the feedback-GRAPE method to this problem (labeled 'Adaptive'). We employ a recurrent neural network to produce the controls $F_j = (\gamma_j, \delta_j)$ when provided with the measurement-outcome sequence (more details on numerical parameters can be found in Appendices D, E).
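A sketch of this POVM in code, assuming the diagonal measurement operators $\hat M(+1) = \cos[(\gamma\hat n + \delta)/2]$ and $\hat M(-1) = \sin[(\gamma\hat n + \delta)/2]$ suggested by the description above (the exact sign and phase conventions may differ from the experiment); the thermal-like input state is an arbitrary test case:

```python
import numpy as np

D = 12                                              # Fock truncation (illustrative)
n = np.arange(D)

def povm(gamma, delta):
    """Measurement operators, diagonal in the Fock basis (assumed convention)."""
    Mp = np.diag(np.cos((gamma * n + delta) / 2))   # outcome m = +1
    Mm = np.diag(np.sin((gamma * n + delta) / 2))   # outcome m = -1
    return Mp, Mm

# thermal-like cavity state as a test input
p = 0.5 ** n
rho = np.diag(p / p.sum()).astype(complex)

Mp, Mm = povm(gamma=np.pi / 2, delta=0.3)
probs = [np.trace(M.conj().T @ M @ rho).real for M in (Mp, Mm)]
rho_post = Mp @ rho @ Mp.conj().T / probs[0]        # state update for outcome +1
completeness = Mp.conj().T @ Mp + Mm.conj().T @ Mm  # cos^2 + sin^2 = identity
```

The diagonal of `rho_post` shows the sinusoidal "mask" imprinted on the Fock-state probabilities.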
As we see, the impurity quickly decreases with the number of allowed measurements, and it does so significantly better than in a non-adaptive scheme, where the sequence of measurement controls δ j and γ j is still optimized, but where these controls are not allowed to depend on previous measurement outcomes. To visualize and analyze the numerically obtained strategy, we introduce in Fig. 4d a decision tree. This is extracted via an automated numerical procedure, by running many trajectories and noting in each case the controls suggested by the adaptive strategy. The controls are a deterministic function of previous measurement outcomes.
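The extraction procedure can be sketched as follows: run many trajectories and, for every prefix of measurement outcomes, record the controls emitted by the (deterministic) policy. The policy below is a hypothetical toy stand-in for the trained recurrent network:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_policy(history):
    # controls (gamma, delta) as a deterministic function of past outcomes;
    # a hypothetical stand-in for the trained recurrent network
    return (np.pi / 2 ** (len(history) + 1), 0.5 * sum(history))

tree = {}
for _ in range(200):                                # many simulated trajectories
    history = ()
    for step in range(3):                           # three measurements per trajectory
        tree[history] = toy_policy(history)         # record controls at this tree node
        m = int(rng.choice([-1, 1]))                # stochastic measurement outcome
        history = history + (m,)

# binary outcomes, 3 steps: 1 + 2 + 4 = 7 distinct decision-tree nodes
n_nodes = len(tree)
```

Because the controls depend only on the outcome history, the dictionary `tree` is exactly the decision tree visualized in Fig. 4d.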
Such a decision tree contains all information about the adaptive strategy learned by the NN and can possibly allow the user to give it a physical interpretation and to extrapolate analytical solutions for larger numbers of control steps. This might require leveraging any available physical understanding of the control operations, e.g. identifying physically significant values of the control parameters. Using our understanding of the model's physics, we can choose to (analytically) interpret the controls, e.g. by trying to represent them as fractional multiples of π. This kind of analysis is optional and independent of our method, but it nicely demonstrates what can usefully be done in settings with discrete measurements, generating additional insights after running the general-purpose algorithm. For example, here we were able to take inspiration from the decision tree for four measurements and a specific value of the temperature to extrapolate the optimal purification strategy for any temperature and any number of measurements, see Appendix J.

C. State preparation from a thermal state with Jaynes-Cummings controls
We now turn to a task that involves both feedback and control simultaneously. Specifically, we consider state preparation out of a thermal state, for target states that are selected as arbitrary superpositions of the first few Fock states. For this purpose, we consider the setup shown in Fig. 5a, comprising both an ancilla and a control qubit to combine the parameterized measurements introduced in Section III B with the Jaynes-Cummings control gates introduced in Section III A. The resulting sequence of parameterized controls is shown in Fig. 5b.
Results for the state $(|1\rangle + |2\rangle + |3\rangle)/\sqrt{3}$ and a four-component kitten state $|\psi^{\rm Kit4}_\alpha\rangle$ with $\alpha = \sqrt{2}$ are shown in Fig. 5c,d. Feedback-GRAPE converges in about 1000 gradient-ascent steps (each operating on a batch of 10 sampled trajectories). We ran the method several times, starting from different initial random configurations of the trainable parameters θ, demonstrating that convergence is robust, despite the usual absence of a guarantee for such a non-convex optimization problem. For more details see Appendix I.
It is interesting to analyze the convergence behaviour in some more detail. [Fig. 5 caption: Each trajectory corresponds to N time steps. Each time step consists of a parameterized measurement, followed by two unitary gates. c) Gradient-ascent progress for two target states (the curves are smoothed with a moving average), and d) final infidelity vs. total number of time steps. In (d), each point is the best out of 30 training runs. The statistics of the final infidelity for random training initialization is analyzed in Appendix I. e) Evolution of the reduced qubit and cavity states (probability as color) for one trajectory of the converged strategy (target |1⟩ + |2⟩ + |3⟩); time points of measurements (with results) and controls are indicated as in (b). f) Corresponding decision tree, for the most probable sequences of measurement outcomes. The red boxes show (α/π|β/π), and "no meas" means the parameters γ, δ are such that no measurement takes place. g,h) Gradient-ascent progress (for the same state and number of steps as in e,f) with and without memory and for different values of the learning rate (see legend).] As one noteworthy observation, despite the overall very good performance, we sometimes
find that the algorithm may get stuck at suboptimal solutions if we increase the total number of time steps available for the feedback sequence (Fig. 5d). Ideally, an increased number of steps should always lead to an improvement (in the present scenario), but apparently the larger space of control variables then becomes challenging. This can be mitigated to some extent by running the gradient ascent repeatedly from random starting conditions. We discuss other possible solutions to this general problem in Section IV.
One motivation for using a neural network instead of a lookup table is that the number of parameters needed for a tree-type table grows exponentially, while a neural network could in principle make do with a much smaller number of parameters. Also, it may be expected that the strategy of a network generalizes to situations with a larger number of time steps than the one it was trained on. Despite these obvious advantages of neural networks, we found (to our surprise) that lookup tables often converge to better fidelities than recurrent NNs in the present example scenario with feedback, cf. the red and green lines in Fig. 5c,d. Another important observation is that both lookup-table and recurrent-NN methods fare at least as well as, if not better than, a quantum-state-aware NN controller (pink line in Fig. 5c,d). As mentioned above, these methods are preferable as they do not require any real-time simulations during deployment in an experiment. The reasons for these observations are still unclear and merit future investigation.
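The exponential-versus-constant parameter count is easy to make quantitative; the GRU-style estimate below is a rough, hypothetical sizing, not the network actually used here:

```python
# Parameter counts: tree-type lookup table versus a recurrent network.
def lookup_table_params(N, n_outcomes=2, controls_per_step=2):
    # one entry per outcome history of length 0 .. N-1
    return controls_per_step * sum(n_outcomes ** j for j in range(N))

def rnn_params(hidden=32, n_in=1, controls_per_step=2):
    # single GRU-style layer plus a linear readout (rough, hypothetical count)
    return 3 * (hidden * (hidden + n_in) + hidden) + hidden * controls_per_step

table_10 = lookup_table_params(10)   # 2 * (2^10 - 1) = 2046, grows exponentially in N
rnn_10 = rnn_params()                # fixed size, independent of N
```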
What is the nature of the feedback strategies that the algorithm discovers? Naively, we might expect the following strategy: an optimized adaptive purification phase, of the kind discussed above, leading to some Fock state |n⟩, followed by state preparation that is derived from the Law-Eberly protocol (e.g. going back down to the ground state and then building up the arbitrary target state from there). However, the actual strategies discovered by feedback-GRAPE are significantly more efficient. They interleave adaptive measurements and controls already in the first stage of the process. This can be seen in Fig. 5e,f, where the goal was to prepare the equal superposition (|1⟩ + |2⟩ + |3⟩)/√3. Again, it is possible to obtain more information about the full strategy (as opposed to a single trajectory) by extracting a decision tree (Fig. 5f). There, we observe that measurements are sometimes deliberately performed in such a way that certain Fock states are completely ruled out (their probability is set to 0), which requires certain choices of measurement control parameters. Simultaneously, qubit-cavity interaction cycles are employed to reduce the excitation number of the cavity.

[Fig. 6 caption: a) Sketch of the feedback control sequence, including also physical decay of the cavity. We assume that multiple control substeps can be applied after each measurement. Each substep comprises a qubit driving gate followed by a qubit-cavity interaction gate. The decay with rate κ is incorporated by interleaving the parameterized controls with dissipative evolution of fixed durations t_M and t_c before a measurement and control substep, respectively. See Appendix H for more details. b) Trajectories after optimization. We show the evolution of the oscillator Wigner function during one step of the feedback control sequence stabilizing a four-legged kitten state with average photon number n̄ = 9. Indicated are also the probabilities P of each measurement outcome as well as the fidelity F after the decay and control steps. c) Performance of the strategy found by feedback-GRAPE. We show the final fidelity for various target states and numbers of steps N. After each decay and measurement, a single control sequence (i.e. only one choice of α and β) is applied. The four columns represent different numbers of decay steps experienced by the state (here: N = 1, 2, 3, and 4). The bars with lower values show the bare decay of the fidelity (here for κt_M = 0.05 and t_c = 0), when no feedback strategy is employed. d) For sufficiently low dissipation (small t_c), the fidelity can be increased by applying more control substeps. We show the fidelity as a function of the number of substeps for N = 1, κt_M = 0.1, and three different decay durations t_c: κt_c = 0, 1 × 10⁻⁴, 2 × 10⁻⁴. e) Stabilization of a Fock state (here |5⟩) for an arbitrarily long time, employing the generalization ability of a recurrent neural network (RNN).]

D. State stabilization in a noisy environment with Jaynes-Cummings controls
Quantum state stabilization in a noisy environment represents another challenging task that can be solved using feedback-GRAPE.
In this scenario, the interaction with the environment induces decay and decoherence of the quantum state. Both effects can be suppressed by probing the system with an appropriate stream of quantum measurements interleaved with unitary gates, leading to the long-term stabilization of the quantum state.
We use the same Jaynes-Cummings feedback control scheme as in the previous section, allowing for multiple control substeps. Each substep comprises a qubit drive and a qubit-oscillator interaction gate. In addition, we model physical decay of the cavity with decay rate κ, interleaving the substeps of the feedback control sequence with intervals of dissipative evolution of fixed durations $t_M$ and $t_c$, cf. Fig. 6(a). These could be interpreted as waiting times before applying instantaneous measurement and control gates, respectively, but more realistically they can effectively incorporate the decay and decoherence that has occurred during finite-duration operations. For this approximation to work, the decay needs to be weak, which is the case for our scenario.
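One way to implement such a dissipative interval numerically is the Kraus (operator-sum) form of single-mode photon loss with survival probability $\eta = e^{-\kappa t}$; this standard amplitude-damping construction is our illustrative choice, not necessarily the integrator used in the paper:

```python
import numpy as np
from math import comb

D = 10                                              # Fock truncation (illustrative)

def loss_kraus(eta):
    """Kraus operators K_l (l photons lost) of the photon-loss channel."""
    ops = []
    for l in range(D):
        K = np.zeros((D, D))
        for n_ in range(l, D):
            # binomial amplitude for losing l of n_ photons, each surviving with prob eta
            K[n_ - l, n_] = np.sqrt(comb(n_, l) * eta ** (n_ - l) * (1 - eta) ** l)
        ops.append(K)
    return ops

def apply_decay(rho, kappa_t):
    eta = np.exp(-kappa_t)
    return sum(K @ rho @ K.conj().T for K in loss_kraus(eta))

rho = np.zeros((D, D), dtype=complex)
rho[5, 5] = 1.0                                     # Fock state |5>
rho_out = apply_decay(rho, kappa_t=0.05)
trace_out = np.trace(rho_out).real                  # channel is trace preserving
n_mean = np.trace(np.diag(np.arange(D)) @ rho_out).real   # decays as 5 * exp(-kappa_t)
```

Interleaving `apply_decay` with the parameterized measurement and control gates reproduces the structure of Fig. 6(a).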
As an illustrative example, we show the two possible trajectories for a circuit comprising a single step of the feedback control sequence, optimized to stabilize a four-legged kitten state $|\psi^{\rm Kit4}_\alpha\rangle$ with average photon number $\bar n \approx 9$, cf. Fig. 6(b). The learning algorithm selects the measurement parameters δ = 0 and γ = π/2. This seems a natural choice because in this case $\hat M(m{=}1)|\psi^{\rm Kit4}_\alpha\rangle = |\psi^{\rm Kit4}_\alpha\rangle$, i.e. the measurement leaves the target kitten state invariant. In addition, a measurement outcome m = −1 postselects an orthogonal state. The latter outcome occurs, for example, if a single photon leaks out of the cavity; it has 30% probability after the dissipative evolution of duration $\kappa t_M = 0.05$. The fidelity recovers to 91% after a single Jaynes-Cummings control sequence is applied. This is a good result considering the limited expressivity of our control scheme. The fidelity can be moderately increased by allowing more steps, provided that the decay during the control protocol is not too large, cf. Fig. 6(d).
The fidelity for several quantum states, including Fock states and superpositions thereof, for a varying number of time steps (up to 4), is shown in Fig. 6(c). As a comparison, the bare decay in the absence of any control is also displayed. These results demonstrate the ability of feedback-GRAPE to discover strategies that mitigate the effect of dissipation for a variety of quantum states.
As we explained above, lookup tables often perform surprisingly well. We now briefly demonstrate, in the context of state stabilization, one example where the power of a neural network is clearly helpful (Fig. 6e). We first train an RNN on sequences of 20 steps, with the goal of stabilizing a given Fock state for an arbitrarily long time. For this example, the cumulative reward of a trajectory is not only the final fidelity, but the sum of fidelities at all time steps. After training, we test it on a 40-times-longer simulation and see that the strategy learned by the RNN generalizes well even for longer sequences. We note that the strategy can recover even when some "unlucky" measurement outcomes significantly perturb the quantum state.

E. State stabilization with SNAP gates and displacement gates
Using feedback-GRAPE in the Jaynes-Cummings scenario has allowed us to discover strategies extending the lifetime of a range of quantum states. However, for more complex quantum states like kitten states, the infidelity becomes significant after just a few dissipative evolution steps in spite of the feedback, cf. Fig. 6(c). This raises the question whether the limited quality of the stabilization is to be attributed to a failure of our feedback-GRAPE learning algorithm to properly explore the control-parameter landscape or rather to the limited expressivity of the controls. To address this question, we test our method on the state-stabilization task using a more expressive control scheme. Specifically, we use a universal control scheme originally proposed in [66]. It consists of a sequence of interleaved Selective Number-dependent Arbitrary Phase (SNAP) gates, $\hat S(\{\varphi_n\}) = \sum_n e^{i\varphi_n} |n\rangle\langle n|$, and displacement gates, $\hat D(\alpha) = \exp(\alpha\hat a^\dagger - \alpha^*\hat a)$. This is the same control scheme adopted by Sivak et al. [38] to demonstrate their model-free optimal-control approach for state preparation and, as mentioned above, we have also used it to demonstrate the preparation of a so-called Gottesman-Kitaev-Preskill state [61] with open-loop controls, see Appendix G. We now go one step further and employ these powerful controls inside a feedback-based state-stabilization scheme, optimized via feedback-GRAPE.
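Both gates are straightforward to build on a truncated oscillator; the sketch below constructs them with numpy/scipy (truncation and parameter values are illustrative):

```python
import numpy as np
from scipy.linalg import expm

D = 40                                              # Fock truncation (illustrative)
a = np.diag(np.sqrt(np.arange(1, D)), 1)            # annihilation operator

def snap(phis):
    """SNAP gate: imprint phase phi_n on Fock state |n>; remaining phases zero."""
    phases = np.concatenate([phis, np.zeros(D - len(phis))])
    return np.diag(np.exp(1j * phases))

def displace(alpha):
    """Displacement gate D(alpha) = exp(alpha a^dag - alpha* a)."""
    return expm(alpha * a.conj().T - np.conj(alpha) * a)

psi = np.zeros(D, dtype=complex)
psi[0] = 1.0                                        # vacuum
psi = snap([0.0, np.pi, 0.0]) @ displace(1.5) @ psi # coherent state, then SNAP phases
n_mean = (np.abs(psi) ** 2 @ np.arange(D)).real     # SNAP changes phases only: ~ |alpha|^2
```

Since the SNAP gate is diagonal, it leaves the photon-number distribution untouched; all its expressive power lies in the Fock-state phases it programs between displacements.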
As a test example, we consider the feedback-based stabilization of a kitten state built from two coherent states, $|\psi^{\rm Kit2}_\alpha\rangle \propto |\alpha\rangle + |-\alpha\rangle$ with $\alpha = 2$, corresponding to an average photon number of $\bar n \approx |\alpha|^2 = 4$. This state has even parity, $\hat P|\psi^{\rm Kit2}_\alpha\rangle = |\psi^{\rm Kit2}_\alpha\rangle$. After an excitation leaks out of the cavity, it decays into an odd cat state with the same $\alpha$, cf. Fig. 7(b). Thus, we can detect these decay processes using repeated parity measurements. After such a process, an optimized control step can transform the odd kitten state back into the target kitten state with high fidelity, Fig. 7(b). These considerations motivate us to use the feedback control sequence shown in Fig. 7(a). As control parameters, we use the real and imaginary parts of the phase-space displacements $\alpha_j$, together with the phases $\varphi^j_n$ for the first $N_{\rm SNAP}$ Fock states (the remaining phases are set to zero).
We have trained several NNs using as a reward the time-averaged fidelity, $R = \sum_{j=1}^{N} F_j/N$, with $F_j$ calculated after applying the block of unitary gates. We have considered two different durations $N_t$ of the quantum trajectories seen during training. We have also considered two different scenarios for our description of the dissipative evolution. In the first scenario, the dissipative evolution is concentrated before the measurement, $t_c = 0$ and $\kappa t_M = 0.01$. In the second scenario, the dissipative evolution is subdivided into two intervals before and after the measurement, $\kappa t_M = \kappa t_c = 0.005$. Finally, we have tested the NNs' performance in stabilizing the kitten state for $N_i = 200$ time steps. This is much larger than the number of time steps seen during training, $N_t = 2$ or $N_t = 10$, and than the typical decoherence time $t_d$ of our kitten state, $\kappa t_d = \bar n^{-1} \approx 1/4$, corresponding to ≈ 25 time steps, cf. Fig. 7(c,d).
Our results obtained using the best-performing NNs for each of the four scenarios discussed above are summarized in Fig. 7(c,d). Panel (d) shows the infidelity as a function of time, averaged over a representative set of trajectories. The infidelity is plotted as a solid line in the time interval seen during training (and dashed thereafter). As a comparison, the infidelity without any stabilization (gray line) and the infidelity after a single interval of dissipative evolution of duration $(t_c + t_M)/2$ (the lower edge of the blue-shaded region) are also shown. The latter represents a theoretical lower bound on the infidelity in the scenario with $t_c = t_M$ (as the effects of dissipation after the last measurement cannot be corrected). Panel (c) shows a histogram of the distribution of the time-averaged infidelity $1 - R$ for single quantum trajectories.
From these results, we can generally conclude that the feedback strategies discovered using feedback-GRAPE allow one to maintain a low infidelity for measurement sequences much longer than those seen during training, demonstrating a remarkable power of generalization. For the scenario with dissipation injected after the measurement (blue lines), the fidelity remains just above the theoretical lower bound for much of the time evolution. In both dissipation scenarios, the NNs trained on longer measurement sequences perform better. This tendency is most evident in the dissipative scenario with $t_c = 0$. In this case, the NN trained on sequences of just two measurements performs very well on a similar time horizon, but its ability to generalize the feedback strategy to a longer time evolution is poorer.
In order to better understand this behavior, we plot the time evolution of the infidelity for two typical trajectories (one for each NN), showing also the infidelity after each substep (decay, measurement, or control), cf. Fig. 7(e). From these results, one can see that the strategies learned by the two NNs are qualitatively different: The NN trained on shorter measurement sequences pursues a greedy strategy that decreases the infidelity after each block of unitary gates (top). In contrast, the NN trained on a longer measurement sequence learns a non-greedy strategy which increases the infidelity after even-parity measurements (bottom). Another obvious difference between the two strategies is that the latter more often triggers odd-parity measurement outcomes, which are imprinted in the infidelity as peaks of unit height, cf. Fig. 7(e). This is a good sign because it indicates that the parity measurements are able to extract more of the entropy injected during the dissipative part of the dynamics or, equivalently, to better suppress decoherence. This is also reflected in the clear interference fringes in the Wigner function of the state after 200 time steps, 8 times longer than the typical decoherence time, cf. inset of Fig. 7(e).
This example highlights the importance of training on long measurement sequences to develop more robust stabilization strategies.
Overall, we conclude that expressive controls like the well-established SNAP gate allow feedback-GRAPE to discover excellent feedback strategies in an efficient manner, and that strategies discovered for shorter training sequences generalize nicely to much longer sequences. To the best of our knowledge, the level of state stabilization performance demonstrated here has not been achieved with any other method, despite this being an area of active research in the context of quantum devices and quantum error correction.

F. State preparation in the presence of model uncertainty
Until now, we have assumed that a model for the stochastic dynamics of our quantum system is known without any uncertainty. In practice, however, the model parameters are known only with finite precision; they might deviate from theoretical predictions because of disorder and be difficult to measure precisely. Moreover, they might even be subject to slow drifts because of environment-induced changes in the quantum device. For all these reasons, an important direction of research in optimal control focuses on improving model-based methods to perform better in the presence of model uncertainties [13][14][15][16]. Our goal in the present section is to explore how feedback-GRAPE can contribute to this challenge. At the same time, this will also allow us to study the optimization landscape in a manageable example.
There are two fundamentally different approaches to dealing with model uncertainties in a coherent control setting. In the first, "data-driven" approach, experimental data are used during training. In the second, "fluctuation-model-based" approach, the model parameters are sampled from a probability distribution during training. The resulting parameter fluctuations reflect an imperfect knowledge of the model parameters, and the strategy is optimized to be resilient against these fluctuations.
Several extensions of GRAPE have been proposed to incorporate uncertainty in the model parameters following either the "data-driven" approach [13,14,16] or the "fluctuation-model-based" approach [15]. However, we emphasize that none of these extensions include feedback.
Our feedback-GRAPE method can also be extended to account for parameter uncertainties, using either a "data-driven" or a "fluctuation-model-based" approach. For the "data-driven" approach, one could follow a similar approach as d-GRAPE [14] and c-GRAPE [16], modifying the analytical formula for the learning gradient (see Appendix C) to also incorporate operators estimated in quantum-tomography experiments. Alternatively, feedback-GRAPE could be used in combination with model-free RL, with both methods sharing the same controller in the form of a recurrent NN. In this setting, feedback-GRAPE would be used for the initial training of the controller, allowing one to explore higher-dimensional control-parameter spaces. Afterwards, the controller would be trained on experimental data using a model-free approach to obtain a more accurate feedback strategy.

[Fig. 8 caption fragment: Also shown is the infidelity for the intuitive feedback strategy ḡτ_j ≈ π. h) Same as (e), here for the optimal strategy without feedback (red), and the optimal (blue) and intuitive (gray dashed) feedback strategies with N = 8.]
In the following, we instead demonstrate the "fluctuation-model-based" approach in more detail. This approach is not only more straightforward to implement; it is also best suited to a scenario with feedback: Since the measurement statistics depend on the model parameters, the measurement outcomes carry information about the underlying model parameters. At the same time, the strategies are conditioned on the measurement outcomes and can thus be adapted to the most likely underlying model parameters. This approach is so powerful that it is sometimes worth adopting even in a model-free RL setting, to forgo costly training on experimental data. A prominent example is the control of tokamak plasmas, where zero-shot transfer from simulations to hardware with imprecisely known parameters has been demonstrated [67].
We consider a simple toy model in which an inhomogeneous ensemble of qubits, initially in the ground state, is subject to a series of N pulses interleaved with projective measurements in the computational basis, cf. Fig. 8a. The durations of the pulses $\{\tau_j\}$ can be controlled, but the coupling $g$ of a qubit to the driving field is a random variable distributed according to a Gaussian distribution with average $\bar g$ and standard deviation $\sigma$, cf. Fig. 8b. As a consequence, the Bloch-sphere rotation angles $\alpha_j = g\tau_j$ will also be Gaussian random variables, now with standard deviations $\tau_j\sigma$, cf. Fig. 8a.
Our goal is to maximize the number of qubits correctly flipped to their excited state or, equivalently, the fidelity averaged over the measurement outcomes and the coupling $g$, $\bar F_N = \langle\langle F_N\rangle_{m\sim P(m|g)}\rangle_{g\sim P(g)}$. In this case, we are interested in the optimal solution for a fixed number of pulses $N$.
Before discussing this problem in general, let us consider the limiting case of only one time step, $N = 1$. In this case, the pulse duration $\tau_0$ is the only control parameter and there is no feedback from any previous measurement. The average fidelity as a function of this parameter is shown in Fig. 8c. In the limiting case without fluctuations, a single π-pulse of duration $\tau_0 = \pi/g$, or any of its odd integer multiples, achieves zero infidelity, cf. dashed line in Fig. 8c. Once parameter fluctuations are introduced, one might still expect $\tau_0 = \pi/\bar g$ and any of its odd integer multiples to be optimal. In this way, the spins with coupling $g = \bar g$, corresponding to the peak of the parameter distribution $P(g)$, would be flipped with unit probability. However, we observe that in reality shorter pulses are favored because they give rise to a narrower distribution $P(\alpha_0)$ of the rotation angles $\alpha_0 = g\tau_0$. This physics leads to a single optimal pulse of duration slightly shorter than $\tau = \pi/\bar g$ and a series of suboptimal pulse durations corresponding to local maxima of the average fidelity, see Fig. 8c.
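The Gaussian average in this N = 1 case can be done in closed form: with flip probability $\sin^2(g\tau/2)$ and $g \sim \mathcal N(\bar g, \sigma^2)$, one gets $F(\tau) = [1 - \cos(\bar g\tau)\,e^{-\sigma^2\tau^2/2}]/2$, which is maximized slightly before $\tau = \pi/\bar g$, as the following sketch confirms numerically ($\bar g = 1$, $\sigma = 0.1$ are illustrative values):

```python
import numpy as np

g_bar, sigma = 1.0, 0.1
tau = np.linspace(0.01, 4 * np.pi, 20000)

# Gaussian average of sin^2(g tau / 2) over g ~ N(g_bar, sigma^2):
# E[cos(g tau)] = cos(g_bar tau) exp(-sigma^2 tau^2 / 2)
F = 0.5 * (1 - np.cos(g_bar * tau) * np.exp(-sigma ** 2 * tau ** 2 / 2))

tau_opt = tau[np.argmax(F)]          # global optimum, just below pi/g_bar
F_first = F.max()                    # fidelity at the optimal (shortest) pulse
F_second = F[tau > 2 * np.pi].max()  # next local maximum (near 3*pi/g_bar) is worse
```

The exponential factor penalizes long pulses, which is exactly the narrowing of $P(\alpha_0)$ described above: longer odd multiples of the π-pulse become progressively worse local maxima.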
Next, we consider the simplest scenario with feedback, which corresponds to $N = 2$ time steps, i.e. feedback on a single measurement. In this case, the control parameters are the first pulse duration $\tau_0$ and the two conditional durations of the second pulse, $\tau_1(m_0 = 1)$ and $\tau_1(m_0 = -1)$. If we use our lookup-table approach to directly optimize these control parameters, the underlying optimization landscape is a 3D function. In Fig. 8d, we show three 2D cuts of the optimization landscape along with the evolution of the training parameters for 26 feedback-GRAPE training runs, starting from random initial conditions. If we view as equivalent those strategies that are connected by transformations of the control parameters which leave the fidelity $\langle F_N(g)\rangle_m$ invariant, we can associate most runs with just four final feedback strategies, see Appendix K for more details. Their coupling-dependent infidelity $1 - \langle F_N(g)\rangle_m$ is displayed in Fig. 8e. A plurality of the training runs (10 out of 26) converges to the optimal strategy, which consists of a pulse of duration slightly shorter than $\pi/\bar g$ followed or preceded by a slightly longer pulse. This leads to a small infidelity over the full width of the coupling distribution $P(g)$, cf. blue continuous line in Fig. 8e. However, for a significant number of runs (9 out of 26), the duration $\tau_0$ of the first pulse converged asymptotically to zero during training, cf. rightmost cut in Fig. 8d. In other words, the first pulse is switched off, effectively reducing the number of time steps to $N = 1$. The resulting infidelity is small over a much narrower band, cf. top of Fig. 8e. The training can lead to this type of solution because isolated attractors of the $N = 1$ optimization landscape are promoted into 1D manifolds of attractors in the $N = 2$ optimization landscape, cf. the middle and right-hand cuts in Fig. 8d with Fig. 8c.
The insight on the learning dynamics gained for the case with N = 2 time steps can be transferred to the general case of an arbitrary number of time steps N . For a typical run, one or more pulses are switched off, effectively reducing the number of time steps, cf red lines in Fig. 8f. This leads to many runs ending up in suboptimal solutions, cf the red diamonds in Fig. 8g. We note in passing that we have observed similar learning dynamics for the state preparation of complex superpositions of Fock states using the Jaynes-Cummings controls. Also in that case, a large number of local extrema for an optimization task with N time steps could be constructed by adding idle time steps to the optimal solution for a smaller number of time steps, see Appendix F. We expect the same type of local extrema to appear in many optimal control tasks that take as ansatz a quantum circuit comprising a finite sequence of parametrized building blocks, irrespective of whether measurements are present or not. The challenge posed by the local minima can be addressed by taking a physically motivated initialization of the control parameters, e.g. τ j (m j−1 ) ≈ π/ḡ if m j−1 = (1, . . . , 1) and τ j (m j−1 ) = 0 otherwise. This is close to a π-pulse for g ≈ ḡ if no previous measurement with outcome m = −1 indicated that the spin has flipped. With this approach, we consistently reach the optimal solution for N ≤ 6, cf blue squares in Fig. 8g. For even larger N , we reach a regime in which, for a typical batch of trajectories used to calculate the gradient, all spins have flipped. This leads to a very noisy gradient, making it difficult to distinguish between the many available feedback strategies with low infidelity. We eliminate this problem by taking the ansatz τ j (m j−1 ) = 0 if m j−1 ≠ (1, . . . , 1) for the feedback protocol. In this way, the feedback task is reduced to the optimization of N control parameters τ j ≡ τ j (m j−1 ) with m j−1 = (1, . . . , 1).
At the same time, the number of possible measurement outcomes is also reduced to the same value. This reduction of the decision tree to only N branches makes it efficient to evaluate the sum in Eq. (2) without resorting to measurement sampling, reducing the gradient fluctuations. In addition, it eliminates the basin of attraction of the suboptimal strategies with a reduced effective number of pulses. Overall, it leads to very robust learning even for large values of N , cf green lines in Fig. 8fg.
Finally, we comment on the robustness to parameter uncertainty of the feedback strategies obtained using feedback-GRAPE. The average infidelity 1 − ⟨⟨F N ⟩ m ⟩ g obtained using the feedback strategy with N = 8 time steps is of the order of ∼ 10 −5 , cf Fig. 8g. This means that, in spite of the broad distribution of couplings g, only one qubit out of every 10 5 would remain in the ground state. This compares with roughly one out of every 50 for the intuitive feedback strategy and one out of every 10 4 for the optimal strategy without feedback. We note that the infidelity 1 − F N (g) m is suppressed over a broad range of couplings g, remaining below the threshold 10 −3 in a broad band of width ≈ 1.5ḡ, cf Fig. 8h. From this, we can conclude that the robustness of our strategy to parameter uncertainty will extend beyond the particular coupling distribution used during training. This example thus convincingly demonstrates the ability of feedback-GRAPE to deal with parameter uncertainties, both by finding strategies that properly take into account the size of the fluctuations and, on top of that, by exploiting the extra information obtained via measurements.

IV. SCALABILITY AND OPTIMIZATION LANDSCAPE IN FEEDBACK-GRAPE
As we have seen in the numerous examples presented so far, feedback-GRAPE performs very well for quantum feedback tasks in physically relevant scenarios, including the preparation and stabilization of rather complex states. Even though it occasionally got stuck in local optima, in our examples this could often be remedied simply by re-running the optimization from different random starting points. Nevertheless, in the present section we want to address the non-convex optimization landscape and the scaling towards larger quantum systems, such as those consisting of many qubits, in a more general fashion. These challenges are by no means unique to feedback-GRAPE, and we will consequently draw substantially on observations that have been made in the literature, starting from the original GRAPE and extending to recent results on variational quantum circuits.
Already in the original GRAPE article [3] it was recognized that generally speaking GRAPE is a non-convex optimization problem, and it was suggested that adding stochasticity to the gradient update step could help to jump out of local minima. More refined approaches would perform simulated annealing, i.e. slowly reducing the noise strength over time. We note that, in contrast to GRAPE itself, some form of stochasticity is automatically generated in our case by the random outcomes of measurements. The noise strength can effectively be reduced over time using a learning rate schedule.
Two common approaches for working around local minima are also demonstrated in our own numerical experiments, reported in the present article. The first approach consists in avoiding the local minima by starting from a smart initialization of the training parameters. This could be physically motivated, as in Section III F, or, as is sometimes done in RL, obtained using some form of pre-training based on supervised learning [68] of an existing approximate strategy. The second approach consists in modifying the ansatz for the feedback protocol, which also modifies the optimization landscape, possibly eliminating or reducing the problematic local minima. A better ansatz could be found using physical insight, as in Section III F, or using some form of derivative-free optimization, e.g. model-free RL as in [69].
Another option for addressing the issue of local optima is to employ the natural gradient. In [70] it was shown, for variational quantum circuits in systems of up to 40 qubits, that this technique, though computationally more expensive, avoids local minima most successfully when compared to better-known techniques, i.e. direct adaptive gradient descent or quasi-Newton methods like the Broyden-Fletcher-Goldfarb-Shanno algorithm. The natural gradient can be applied directly to state-preparation problems by computing the Fubini-Study metric of the final state based on its dependence on the control parameters. Thus, it could be employed to help convergence in feedback-GRAPE when the technique is applied to larger qubit numbers.
The examples of feedback-controlled quantum dynamics we have focussed on in this article can be viewed as a combination of parameterized quantum circuits with classical measurements and feedback. When dealing with the question of scalability towards larger qubit numbers, the recent literature on variational quantum circuits (without feedback) suggests that there may arise another challenge that goes beyond the generic problem of getting stuck in local optima for non-convex optimization tasks. As first recognized in [71] and subsequently discussed at length in the literature, one may be stuck in parts of the parameter landscape with essentially zero gradients, i.e. gradients that are exponentially small in the number of qubits; this is the infamous problem of "barren plateaus". Fortunately, the importance of this problem for the quantum computing community has led to a succession of suggested solutions, all of which could be transferred to an application like feedback-GRAPE in case the issue arises when applying it eventually to systems with larger numbers of qubits. The proposed solutions comprise (i) smart parameter initialization [72,73], e.g. choosing parameter values that initially lead to unitary blocks equal to the identity, (ii) a smart choice of circuit ansatz that avoids overparametrization (overly expressive ansatz structures) [74], and (iii) constructing a cost function from local observables instead of a global cost function like the fidelity [75,76]. Last and most relevant to our work, it has recently been shown that incorporating into the quantum circuit the same type of stochastic local measurements that are also used in feedback-GRAPE could by itself induce a phase transition to a regime without barren plateaus [77]. While in that work the measurements only introduce decoherence, it would be worthwhile to explore whether they also help to avoid such plateaus in true feedback scenarios.
Beyond these aspects of the optimization landscape, the overall performance of feedback-GRAPE, and hence its scalability, is also governed by the computational cost of single trajectories. Just like any other model-based method, our method can deal only with systems whose time evolution can be efficiently simulated on a classical computer. The computational cost of calculating the gradient through the system dynamics grows linearly with the number of time steps, just like the cost of the direct time evolution itself. Furthermore, when addressing the scalability for multi-qubit systems, it is true that the cost of a simulation will increase with the size of the Hilbert space and thus rise exponentially with the number of qubits. However, this scaling is no worse than in the original GRAPE or for any other model-based RL method, including those methods in which a model-free algorithm interacts with a simulated environment. From experience with numerical simulations of multi-qubit systems, we deem feedback-GRAPE to remain feasible up to about 10 qubits when simulating master equations, and maybe 20 qubits when resorting to quantum-jump approaches, evolving pure states. This already covers a lot of unexplored territory for quantum feedback.
Another important aspect is the number of trajectories needed for convergence towards the optimal strategy. This is extremely scenario-dependent and therefore hard to predict in general. However, empirically we have seen that typically thousands of time-evolution trajectories need to be evaluated to converge to an optimum. It is here that the model-based approach of feedback-GRAPE has a big advantage over model-free approaches, since the latter have to employ far more trajectories just to implicitly learn the expected behaviour of the quantum system. Indeed, among our examples we have briefly discussed the superior performance of feedback-GRAPE versus a model-free RL approach in the case of SNAP-gate-based cavity state preparation.
We mention in passing another advantage of feedback-GRAPE: It does not require a real-time Bayesian estimate of the quantum state during deployment in an experiment. In this sense, it is more scalable than other existing model-based quantum feedback approaches based on so-called Bayesian quantum feedback.
Finally, another aspect that may affect scaling is generalizability. Our method, for certain feedback tasks, allows a generalization of the feedback strategy. This is exemplified by our stabilization of a kitten state for a long sequence of 200 measurements, 20 times longer than the sequences seen during training. This generalization power substantially reduces the computational cost expended during training.

V. EXTENSIONS
Before concluding our discussion, we outline possible extensions of the general feedback-GRAPE technique introduced above.

A. Reducing sampling noise by using a value function
The average of the return over different measurement outcome sequences is obtained by sampling, which introduces noise into the estimate of the gradients. We can help suppress this noise by adopting value-function approaches, a general technique known from reinforcement learning [30].
To start, we need to discuss the structure of the rewards more carefully. Above, we introduced the overall return (cumulative reward) as the quantity to be optimized. We can also assign the rewards more specifically to individual time steps. For example, during state stabilization we can evaluate the fidelity at each time step and sum it over time to obtain the return. Likewise, it is customary in some optimal control settings to punish large control amplitudes at any given time step. In all these cases, the return is a sum R = Σ_{j=1}^{N} r j of individual rewards.
More precisely, in the original approach, we had simply set R = r 1 (m 1 |θ) + r 2 (m 2 , m 1 |θ) + . . .. Here r j (m j , m j−1 , . . . |θ) is the instantaneous reward obtained after time step j (which consisted of some control, some measurement yielding m j , and possibly a further control step before assigning the reward). For any time step j, this then yields two contributions to the overall gradient-ascent update. For example, at j = 2 we obtain, in a given trajectory with randomly sampled m 1 , m 2 , . . ., the following contributions:

∂_θ r 2 (m 2 , m 1 |θ)
+ [r 1 (m 1 |θ) + r 2 (m 2 , m 1 |θ) + r 3 (m 3 , m 2 , m 1 |θ) + . . .] ∂_θ ln P (m 2 |m 1 , θ).   (5)

Adding up these contributions for all j and averaging over trajectories yields precisely Eq. (3). This is a Monte-Carlo sampling approach. One concern in any such approach is the sampling noise, i.e. in our case the fluctuations of the quantity shown above between different trajectories. We can now take inspiration from the domain of model-free reinforcement learning and the general theory of reinforcement learning [30], where approaches have been invented to reduce the variance of the gradient update. Recall that in our case the variance stems from the stochasticity of measurements, whereas in model-free RL it stems from the stochasticity of the policy's action choices encountered in policy-gradient and actor-critic approaches, plus any stochasticity of the environment dynamics. Even though the following steps closely follow the corresponding tricks known in the model-free RL community, we display them explicitly here for our modified scenario. This should help avoid any confusion and make the presentation self-contained.
First, when evaluating the gradient above, we need only include the sum of future rewards, since only those can be influenced by the present measurement result. In the example of Eq. (5), this means the term r 1 (m 1 |θ) on the second line may be dropped, as it is independent of m 2 , i.e. the new measurement result. Mathematically, this follows because when we eventually perform the average over trajectories, we have to multiply Eq. (5) by P (m 2 , m 1 |θ) = P (m 2 |m 1 , θ)P (m 1 |θ). Collecting terms, the m 2 -dependence of the r 1 contribution ends up in a sum Σ_{m 2 } ∂_θ P (m 2 |m 1 , θ). This sum turns out to be zero, for any value of θ, due to the normalization of the conditional probability. This insight holds for any j, where it is used to drop all r k (k < j) when they multiply ∂_θ ln P (m j |m j−1 , . . . , θ).
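This normalization argument can be verified numerically on a toy example. The sigmoid-parametrized two-outcome distribution below is our own illustrative choice, not part of the feedback-GRAPE setup:

```python
import numpy as np

theta, eps = 0.3, 1e-6             # toy parameter value and finite-difference step

def P(m, th):
    # toy parametrized outcome distribution: P(m=+1|theta) = sigmoid(theta)
    p = 1.0 / (1.0 + np.exp(-th))
    return p if m == 1 else 1.0 - p

# normalization: the sum over outcomes of d/dtheta P(m|theta) vanishes
dsum = sum((P(m, theta + eps) - P(m, theta - eps)) / (2 * eps) for m in (+1, -1))

# hence E_m[c * d/dtheta ln P(m|theta)] = c * sum_m d/dtheta P(m|theta) = 0,
# so any outcome-independent prefactor of the log-derivative averages to zero
score_mean = sum(P(m, theta) *
                 (np.log(P(m, theta + eps)) - np.log(P(m, theta - eps))) / (2 * eps)
                 for m in (+1, -1))
```

Both quantities vanish up to finite-difference rounding, which is exactly why past rewards can be dropped from the gradient estimator.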
Second, to further suppress stochastic fluctuations, one can learn a value function V , which is a function of the current state and represents the expected future cumulative reward, averaged over all possible future measurement outcomes. Thus V (m j , m j−1 , . . . |θ) is defined as

V (m j , m j−1 , . . . |θ) = E[r j+1 + r j+2 + . . . |m j , m j−1 , . . .],

where the label E stands for the expectation value over future rewards, conditioned on the preceding measurement results.
Typically, V would be expressed as a neural network, though a look-up table can also be used in the case of a (modest) number of discrete measurements. The input to the value network would be some representation of the current "state" s. This state could be identified directly with the sequence of previous measurement results, as indicated in our notation above, s j = (m j , m j−1 , . . .) (which uniquely determines the current state). Alternatively, this state could also be represented by some version of the current quantum state (e.g. the density matrix), if that proves easier to handle for the network. The value network would be trained to output the expected (averaged) future cumulative reward, counted from this state onwards. The value training would proceed in the fashion known from general reinforcement learning, i.e. using the Bellman update equation [30] V new (s j ) = V (s j ) + α(r j + γV (s j+1 ) − V (s j )), with α < 1 some update factor and γ ≤ 1 some discount factor to reduce the weight of long-term rewards (γ → 1 in the ideal case discussed up to now). When using a neural network, V new would be the new target value for the value network during a supervised-learning update. Once an approximation to the value function has been learned in this manner, we can proceed as in advantage actor-critic approaches to model-free RL. This means that in the gradient-ascent procedure of the feedback-GRAPE approach, one would replace the (future) return by the advantage A j = r j + γV (s j+1 ) − V (s j ), which expresses the improvement over the currently expected future return. In effect, this reduces the variance of the gradient estimates by subtracting a convenient baseline, without changing the average gradient update.
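A minimal tabular sketch of this Bellman update can make the procedure concrete; the two-step measurement chain, the outcome probability p = 0.8, and the rewards r j = m j below are all toy assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
p_plus, gamma = 0.8, 1.0           # outcome probability and discount (toy values)
V, visits = {}, {}                 # look-up-table value function

def td_update(s, r, s_next):
    # Bellman update V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)),
    # with a 1/visits learning rate (sample-average behaviour)
    for d in (V, visits):
        d.setdefault(s, 0.0)
        d.setdefault(s_next, 0.0)
    visits[s] += 1
    alpha = 1.0 / visits[s]
    advantage = r + gamma * V[s_next] - V[s]   # the baseline-corrected signal
    V[s] += alpha * advantage

for _ in range(20000):
    m1 = 1 if rng.random() < p_plus else -1
    m2 = 1 if rng.random() < p_plus else -1
    # states are tuples of past outcomes; instantaneous rewards are r_j = m_j
    td_update((), m1, (m1,))
    td_update((m1,), m2, (m1, m2))  # the terminal state keeps V = 0
```

The table converges to the expected future returns (here V(()) ≈ 1.2 and V((m 1 )) ≈ 0.6 for either outcome); the advantage computed inside the update is the variance-reduced quantity that would replace the raw return in the gradient ascent.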
Concretely, Eq. (5), the gradient contribution from time step j = 2, would be replaced by the following:

∂_θ r 2 (m 2 , m 1 |θ)
+ [r 2 (m 2 , m 1 |θ) + γV (m 2 , m 1 |θ) − V (m 1 |θ)] ∂_θ ln P (m 2 |m 1 , θ).

The first line is unchanged, but in the second line r 1 was dropped, as explained before. Moreover, the sum r 3 + r 4 + . . . has been replaced by γV (m 2 , m 1 |θ), which is the expectation of the future return (such that the averaging over m 3 , m 4 , . . . has already been carried out, reducing sampling noise). Finally, V (m 1 |θ) was subtracted, to further reduce the variance by canceling the expected value, given m 1 . This is possible for the same reason that we could drop r 1 (m 1 ), as explained above. The extension to arbitrary j is straightforward.
In summary, such an enhanced feedback-GRAPE method would run trajectories with deterministic continuous controls and stochastic discrete quantum measurements just as before. However, it would learn a value function to represent expected future returns, and it would use that value function to modify the gradient ascent procedure and reduce fluctuations.

B. Multi-target quantum feedback control
Whenever we are employing neural networks to represent the feedback-based controls, a straightforward but powerful extension of feedback-GRAPE suggests itself. We may feed a representation of a variable target state Ψ (or, in general, the target task, however it is defined) into the network: F j (θ j , m j , . . . ; Ψ). The whole feedback-control strategy is then trained on many different randomly chosen tasks (e.g. many possible target states).
Such approaches have recently been successful for other control challenges, e.g. they are being investigated in robotic navigation and the general field of multi-target reinforcement learning [78,79]. Multi-target schemes have also recently been suggested to improve variational quantum circuits [80]. The benefit is data efficiency: the network learns to generalize from the training tasks to other similar tasks, which requires less overall effort than retraining a freshly initialized network for each task.

VI. CONCLUSIONS AND OUTLOOK
In this work, we have presented a general scheme for the direct gradient-based discovery of quantum feedback strategies. This scheme, which we have labeled feedback-GRAPE, works for arbitrarily strong (discrete or continuous) nonlinear stochastic measurements, which so far had been possible only using the less data-efficient approaches of model-free reinforcement learning.
We observed very good performance, significantly beyond the state of the art, when testing the method on a challenging set of feedback tasks in an important, practically relevant quantum-optical scenario. Overall, our method opens a new route towards solving challenging feedback-based control tasks, including tasks in quantum communication and quantum error correction on multi-qubit or qubit-cavity systems. Besides presenting and analyzing the basic approach, we have also discussed extensions such as advantage functions (for reducing sampling noise) and training on multiple targets (to increase data efficiency and exploit transfer learning).

ACKNOWLEDGMENTS
The research is part of the Munich Quantum Valley, which is supported by the Bavarian state government with funds from the Hightech Agenda Bayern Plus.

Appendix A: Brief recap of reinforcement learning
For the reader with a physics or optimal control background, we very briefly recall some key concepts in reinforcement learning (RL). However, these remarks serve only to provide additional context and are not necessary to understand the feedback-GRAPE technique introduced in the main text.
The term RL covers a set of techniques for discovering optimal control strategies, typically involving feedback [30]. The setting can always be phrased as an agent, i.e. a controller, interacting with an environment, where the latter may represent e.g. a device to be controlled. The goal is always to discover a good strategy for the agent, by optimizing some reward, e.g. a fidelity in the quantum setting. A first distinction is between model-based approaches, which require and exploit a simulation of the environment, and model-free approaches, where the environment is treated as a black box and the agent only learns implicitly about the environment's behaviour through repeated training runs. Feedback-GRAPE belongs to the class of model-based approaches, while most general RL algorithms used in machine-learning applications nowadays are model-free. A sub-category, sometimes leading to confusion, comprises those cases where model-free algorithms are used to train an agent in silico, i.e. on simulated environments.
While we explain feedback-GRAPE in depth in the main text, here we briefly comment on one of the two main classes of model-free algorithms, namely policy-gradient approaches, since we briefly compare and contrast some aspects of those against feedback-GRAPE in the main text. In such approaches, one represents the policy as a conditional probability to choose an action a given an observed state s of the environment: π θ (a|s) in the standard notation of the field. Here θ is a set of parameters that will be updated during training. Training proceeds by performing gradient ascent on the cumulative reward R, in the form

∂_θ E[R] = E[R Σ_t ∂_θ ln π θ (a t |s t )].

Here t is the time step, s t and a t are the sequence of states and actions in a particular trajectory, R is the cumulative reward for that trajectory, and E denotes the expectation value over many trajectories. We note in the main text that the logarithmic derivative appearing here relates to the probability of the agent's actions, whereas a superficially similar logarithmic derivative appearing in feedback-GRAPE relates to quantum measurement probabilities, i.e. a property of the environment and not the agent.
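To make the policy-gradient update of this appendix concrete, here is a minimal REINFORCE sketch on a two-armed bandit; the environment (reward 1 for arm 0, reward 0 for arm 1) and all hyperparameters are toy assumptions of ours, not part of the paper's examples:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = np.zeros(2)          # policy parameters
lr = 0.1                     # learning rate (toy value)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(2000):
    pi = softmax(theta)
    a = rng.choice(2, p=pi)                 # sample action from pi_theta
    R = 1.0 if a == 0 else 0.0              # cumulative reward of the "trajectory"
    grad_logpi = -pi.copy()                 # d/dtheta ln pi_theta(a) = onehot(a) - pi
    grad_logpi[a] += 1.0
    theta += lr * R * grad_logpi            # gradient ascent on E[R]

p_best = softmax(theta)[0]   # probability of the rewarded arm (close to 1)
```

Here the log-derivative is that of the agent's own action distribution, in contrast to the measurement-probability log-derivative appearing in feedback-GRAPE.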
Appendix B: Algorithmic flow chart representation of feedback-GRAPE

In Figure 9, we represent the working flow of feedback-GRAPE for the special case of discrete measurement outcomes as an algorithmic flow chart. This representation provides more details than the conceptual representations in the main text.
Appendix C: Evaluation of the parameter gradients of the time-evolving quantum state

In the numerical results in the main text, we have employed automatic differentiation to evaluate gradients with respect to the trainable parameter vector θ. This approach is very convenient using modern machine-learning tools. However, it is alternatively possible to directly work out analytical formulas for such gradients, based on our knowledge of the evolution equations. In a particular scenario, where the entries of the vector of trainable parameters θ directly correspond to the controls at different time points, this produces a suitable extension of the approach advocated in the original GRAPE paper [3].
In the following formulas, we will assume for simplicity unitary evolution outside the measurements, but the extension to (Markovian) dissipative dynamics is comparatively straightforward (using a Liouvillian superoperator instead of the Hamiltonian).
We will first describe a general approach which works for any arbitrary choice of the parametrization θ. Further below, we will then specialize to a scenario where the original GRAPE idea for efficient gradient evaluation can be applied.
The general task is to obtain the gradient of the quantum state with respect to the trainable parameters θ that enter the controls (and, likewise, the gradient of the final probability P (m) of a measurement sequence m).
In modern machine learning language, tracking the evolution of parameter gradients in the manner described in the following is connected to the recent development of neural ordinary differential equations [53], where efficiency is obtained by not using automatic differentiation as a black box but rather by evaluating analytically the form of the equations of motion for the gradients (and then solving those equations numerically with any efficient solver available). We can obtain the parameter gradient of the quantum state by solving the following evolution equation during measurement-free time intervals:

i∂_t (∂_θ ρ̂) = [∂_θ Ĥ, ρ̂] + [Ĥ, ∂_θ ρ̂],   (C1)

where ρ̂ is the solution to the original equation of motion, i∂_t ρ̂ = [Ĥ, ρ̂], and the initial condition at time 0 would be ∂_θ ρ̂ = 0 (we have set ℏ ≡ 1 for brevity). The interesting step now happens at a measurement, where ρ̂(t⁺) = M̂ ρ̂(t⁻)M̂†/P with P = tr[M̂ ρ̂(t⁻)M̂†], such that

∂_θ ρ̂(t⁺) = M̂ [∂_θ ρ̂(t⁻)] M̂†/P − ρ̂(t⁺) ∂_θ P/P, with ∂_θ P = tr{M̂ [∂_θ ρ̂(t⁻)] M̂†}.   (C2)

Here the required ∂_θ ρ̂(t⁻) is the outcome of solving the previous continuous evolution equation up until time t. After this update, the continuous evolution of ∂_θ ρ̂(t) will proceed. We note, however, that the controls (embedded inside Ĥ in the present setup) will now depend on the measurement outcome m that was selected. Likewise, for later time intervals, they will depend on the whole previous sequence, as described in the main text. At the end, we also need the gradient of the extra term, the log-likelihood of the whole measurement sequence, ln P (m 1 , m 2 , . . .). One way to obtain this is to evolve an unnormalized version of the quantum state, ρ̃, whose trace will give P . It follows the same evolution as the quantum state itself, but without the normalization factors that are the probabilities for the individual measurement outcomes. The θ-gradient of this unnormalized state again follows an evolution equation of the form of Eq. (C1), just with ρ̃ substituted for ρ̂, during the unitary evolution intervals.
However, at a measurement-induced update, we obtain the simpler rule

∂_θ ρ̃(t⁺) = M̂ [∂_θ ρ̃(t⁻)] M̂†.

What we have described so far makes fewer assumptions than GRAPE, because the vector of trainable parameters θ can enter the controls in an arbitrary manner. In GRAPE [3], an additional assumption was used to simplify the gradients further and gain efficiency: the components of the vector of trainable parameters θ were supposed to directly correspond to the control values applied at different time steps. That is, schematically speaking, we would have θ 1 , θ 2 , . . . associated with the controls at time steps j = 1, 2, . . .. This then leads to a further simplification in the evaluation of the gradients. Importantly, if the number of parameters scales with the number of time steps N , then this approach has a runtime growing only linearly in N , while the general approach outlined above would need N 2 operations.
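The general approach of co-evolving ρ̂ and ∂_θ ρ̂ can be sketched for a single qubit. The Hamiltonian Ĥ = θσ x and all numerical values below are toy choices of ours; a finite difference of the analytically known evolution serves as a consistency check:

```python
import numpy as np

# co-evolve rho and its parameter gradient for H = theta * sigma_x (hbar = 1)
sx = np.array([[0, 1], [1, 0]], complex)
theta, T, dt = 0.7, 1.0, 1e-4
comm = lambda A, B: A @ B - B @ A

rho = np.array([[1, 0], [0, 0]], complex)   # initial state |0><0|
drho = np.zeros((2, 2), complex)            # initial condition d(rho)/d(theta) = 0
H, dH = theta * sx, sx
for _ in range(int(T / dt)):
    # i d/dt rho = [H, rho];  i d/dt (drho) = [dH, rho] + [H, drho]
    rho_dot = -1j * comm(H, rho)
    drho_dot = -1j * (comm(dH, rho) + comm(H, drho))
    rho, drho = rho + dt * rho_dot, drho + dt * drho_dot

def rho_exact(th):
    # exact evolution: U = exp(-i * th * T * sigma_x)
    U = np.cos(th * T) * np.eye(2) - 1j * np.sin(th * T) * sx
    return U @ np.array([[1, 0], [0, 0]], complex) @ U.conj().T

fd = (rho_exact(theta + 1e-6) - rho_exact(theta - 1e-6)) / 2e-6
err = np.abs(drho - fd).max()   # small (Euler discretization error)
```

In a measurement-and-feedback setting, the same pair of equations would simply be interrupted by the update rules at each measurement.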
Let us briefly recall the GRAPE approach to gradient evaluation [3], before extending it. In the simplest possible version, with unitary evolution, let us consider the fidelity tr[σ̂(T )Û (T, 0)ρ̂(0)Û (0, T )]. The derivative with respect to parameters θ entering the Hamiltonian will produce a contribution for each time t ∈ (0, T ) in the evolution. Specifically, the contribution from time t will be an expression of the type tr(σ̂Û (T, t)[−i ∂Ĥ/∂θ, ρ̂(t)]Û (t, T )). Using the cyclic property of the trace, this can be reordered to obtain tr(Û (t, T )σ̂(T )Û (T, t)[−i ∂Ĥ/∂θ, ρ̂(t)]). This can now be re-interpreted as the overlap between a backward-evolved target state σ̂(t) = Û (t, T )σ̂(T )Û (T, t) and the perturbation of the forward-evolved state at time t: tr(σ̂(t)[−i ∂Ĥ/∂θ, ρ̂(t)]). In machine learning language, the GRAPE procedure of obtaining gradients in this way can essentially be viewed as an analytically derived version of backpropagation for this specific case of a quantum-physical evolution. It is very efficient, since the effort scales only linearly with the number of time steps, even if there is a different, independently optimizable parameter θ(t) for each time step.
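As an illustration of this forward/backward scheme, the sketch below uses piecewise-constant single-qubit controls H j = θ j σ x (a simplifying toy choice for which ∂Ĥ/∂θ j commutes with Ĥ, so the step-derivative formula is exact) and compares the backpropagated gradients against finite differences:

```python
import numpy as np

sx = np.array([[0, 1], [1, 0]], complex)
N, dt = 5, 0.4
theta = np.linspace(0.3, 1.2, N)               # piecewise-constant controls (toy values)
rho0 = np.array([[1, 0], [0, 0]], complex)     # initial state |0><0|
target = np.array([[0, 0], [0, 1]], complex)   # target |1><1|

def step_U(th):
    # exact exponential exp(-i * th * dt * sigma_x)
    phi = th * dt
    return np.cos(phi) * np.eye(2) - 1j * np.sin(phi) * sx

# forward pass: store the state after every time step
rhos = [rho0]
for j in range(N):
    U = step_U(theta[j])
    rhos.append(U @ rhos[-1] @ U.conj().T)

# backward pass: conjugate the target back from the final time; the gradient
# of F = tr(target * rho_N) is tr(sigma_tilde * (-i dt) [dH/dtheta_j, rho_j])
sig = target.copy()
grads = np.zeros(N)
for j in reversed(range(N)):
    c = sx @ rhos[j + 1] - rhos[j + 1] @ sx
    grads[j] = np.real(-1j * dt * np.trace(sig @ c))
    U = step_U(theta[j])
    sig = U.conj().T @ sig @ U

# finite-difference check of all N gradients
def fid(th):
    rho = rho0
    for x in th:
        U = step_U(x)
        rho = U @ rho @ U.conj().T
    return np.real(np.trace(target @ rho))

eps = 1e-6
fd = np.array([(fid(theta + e) - fid(theta - e)) / (2 * eps)
               for e in eps * np.eye(N)])
```

One forward and one backward sweep yield all N gradients, illustrating the linear scaling with the number of time steps.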
The question is how this procedure needs to be modified in the presence of measurements. Let us imagine we have a particular trajectory with a given fixed sequence of measurement outcomes. We find that we can perform the temporal backpropagation (starting from the final time T ) in the same manner as reviewed above, until a point in time t̂ where a measurement has happened (unless of course we talk about a time point t later than the last measurement). At that point t̂, we need to replace σ̂(t̂) = Û (t̂, T )σ̂Û (T, t̂) by the following expression:

σ̂(t̂) = (1/P ) { M̂† Û (t̂, T )σ̂Û (T, t̂) M̂ − tr[Û (t̂, T )σ̂Û (T, t̂) ρ̂(t̂⁺)] M̂† M̂ }.   (C3)

Here we have defined, for brevity, the measurement operator M̂ ≡ M̂ m̄ at time point t̂, with measurement outcome m̄, and the associated probability P ≡ P m̄ = tr(M̂ m̄ ρ̂(t̂)M̂† m̄ ), where ρ̂(t̂) is already conditioned on previous measurement outcomes, for times less than t̂, and has been obtained by the forward evolution starting from time 0 (with measurements and re-normalization of the state after each measurement); ρ̂(t̂⁺) denotes the normalized state just after the measurement.
After this procedure has been implemented for the measurement at t̂, we would proceed with the backward evolution of σ̂ until the point t where the derivative is to be evaluated. There, we employ the same formula as in the usual GRAPE approach, i.e. we would evaluate tr(σ̂†(t)[−i ∂Ĥ/∂θ, ρ̂(t)]). If there are multiple measurements between t and T , the backward evolution proceeds by alternating unitary evolution with applications of the formula in Eq. (C3).
If we want to treat the unnormalized quantum state in the same manner, e.g. for obtaining the log-likelihood term, we will only need the trace of that unnormalized state ρ̃ at the end of the time evolution (see our discussion above). Formally, this is as if we were to calculate the fidelity against a state σ̂(T ) = 1, given by the identity matrix. We can now evolve this state backwards in the manner discussed above, but in addition, Eq. (C3) simplifies: one needs to drop the second term and also formally set P = 1 in the first term.
Finally, we briefly remark how the procedure will change if we are dealing with continuous measurement outcomes (strong continuous measurements, as briefly discussed in the main text, using the 'reparametrization trick'). In that case, we do not need the log-likelihood term. However, we now do need to differentiate the measurement outcome m = f_ρ⁻¹(z), which depends on some random variable z (of a fixed distribution, not dependent on θ) and on the quantum state ρ̂ (which does depend on θ). As a consequence, Eq. (C3) needs to be modified. We have to add the following terms to the right-hand side:

(1/P ) { (∂_θ M̂†) Û (t̂, T )σ̂Û (T, t̂) M̂ + M̂† Û (t̂, T )σ̂Û (T, t̂) (∂_θ M̂) }.

Here ∂_θ in both parts of this expression is supposed to act only on the M̂† and M̂ terms. This derivative is to be applied in the way ∂_θ M̂ (m) = (∂_m M̂ (m))(∂_θ m), where the derivative of m with respect to θ must be evaluated using the dependence of the inverse cumulative distribution function on the θ-dependent quantum state at that time point.

Appendix D: Law-Eberly algorithm
As a benchmark with an analytical solution (but still without feedback), we consider the task of preparing an arbitrary pure cavity state in a cavity-qubit system. This can be achieved by exploiting the well-known Law-Eberly protocol [58]. This algorithm relies on the essential assumption that we start from the ground state. We briefly review it below. The Hamiltonian that describes the system is a Jaynes-Cummings model with controllable couplings:

Ĥ(t) = A(t) σ̂₊ + B(t) â σ̂₊ + H.c.,

where the first term corresponds to the qubit drive and the second to the cavity-qubit interaction. The two complex controls A(t) and B(t) can assume continuous values.
Law and Eberly use the particular ansatz that A(t) = 0 whenever B(t) ≠ 0, and vice versa. In this scenario, the dynamics can be viewed as being subdivided into a discrete number of steps N, with each step consisting of one qubit-excitation gate, $\hat U_q(\alpha_j) = \exp[-i(\alpha_j\hat\sigma_+ + \alpha_j^*\hat\sigma_-)]$, followed by one cavity-qubit interaction gate, $\hat U_{qc}(\beta_j) = \exp[-i(\beta_j\hat a\hat\sigma_+ + \beta_j^*\hat a^\dagger\hat\sigma_-)]$. [For given A(t) and B(t), α_j and β_j can easily be obtained by integrating over the relevant time interval.] Since excitations can be added only one at a time via the qubit drive, one can further refine the ansatz by assuming that the number of steps N equals the maximum excitation number of the target state, $|\psi_{\rm target}\rangle = \sum_{n=0}^{N} c_n |n, g\rangle$. Summarizing, the goal is to find the parameters {α_j} and {β_j} that solve
$$\hat U_{qc}(\beta_N)\hat U_q(\alpha_N)\cdots\hat U_{qc}(\beta_1)\hat U_q(\alpha_1)\,|0,g\rangle = |\psi_{\rm target}\rangle. \tag{D5}$$
The Law-Eberly idea is to start from the target state and progressively remove excitations from the cavity until it becomes empty. In other words, one focuses on the time-reversed evolution of Eq. (D5) and recursively (for decreasing j, starting from j = N) finds the β_j and α_j by imposing the conditions ⟨j, g|ψ_j⟩ = 0 and ⟨j − 1, e|ψ_j⟩ = 0, with $|\psi_j\rangle = \hat U_q^\dagger(\alpha_j)\hat U_{qc}^\dagger(\beta_j)|\psi_{j+1}\rangle$ being the state after N + 1 − j time steps of the time-reversed evolution. These conditions are enforced by the complex nonlinear equations
$$\langle j, g|\psi_{j+1}\rangle\,\cos\!\big(|\beta_j|\sqrt{j}\big) + i\,\frac{\beta_j^*}{|\beta_j|}\,\langle j-1, e|\psi_{j+1}\rangle\,\sin\!\big(|\beta_j|\sqrt{j}\big) = 0 \tag{D6}$$
(together with an analogous equation determining α_j), with $|\psi_{N+1}\rangle \equiv |\psi_{\rm target}\rangle$ for the first iterative step (corresponding to j = N). It should be noted that the solution of these equations is not unique. This is why Fig. 10 shows two different strategies for the same task, although both of them fulfill the Law-Eberly ansatz.
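The time-reversed recursion described above can be implemented numerically. The following sketch (with our own basis conventions, a random target state, and phase choices derived from the two-level block structure of the gates) solves for β_j and α_j step by step and then verifies the forward evolution.

```python
import numpy as np
from scipy.linalg import expm

N = 4                                   # number of steps = max excitation
dim = 2 * (N + 1)
def idx(n, s): return 2 * n + s         # s = 0: qubit in g, s = 1: qubit in e

def H_q(alpha):                         # qubit-drive generator
    H = np.zeros((dim, dim), complex)
    for n in range(N + 1):
        H[idx(n, 1), idx(n, 0)] = alpha
    return H + H.conj().T

def H_qc(beta):                         # cavity-qubit interaction generator
    H = np.zeros((dim, dim), complex)
    for n in range(1, N + 1):
        H[idx(n - 1, 1), idx(n, 0)] = beta * np.sqrt(n)
    return H + H.conj().T

rng = np.random.default_rng(0)
c = rng.normal(size=N + 1) + 1j * rng.normal(size=N + 1)
c /= np.linalg.norm(c)                  # random target amplitudes c_n
target = np.zeros(dim, complex)
target[[idx(n, 0) for n in range(N + 1)]] = c

# Time-reversed sweep: zero <j,g| with beta_j, then <j-1,e| with alpha_j.
psi, alphas, betas = target.copy(), [], []
for j in range(N, 0, -1):
    cg, ce = psi[idx(j, 0)], psi[idx(j - 1, 1)]
    th = np.arctan2(abs(cg), abs(ce))
    beta = (th / np.sqrt(j)) * np.exp(1j * (np.angle(ce) - np.angle(cg) - np.pi / 2))
    psi = expm(1j * H_qc(beta)) @ psi   # inverse interaction gate
    ag, ae = psi[idx(j - 1, 0)], psi[idx(j - 1, 1)]
    alpha = np.arctan2(abs(ae), abs(ag)) * np.exp(1j * (np.angle(ae) - np.angle(ag) + np.pi / 2))
    psi = expm(1j * H_q(alpha)) @ psi   # inverse qubit gate
    alphas.append(alpha); betas.append(beta)

# Forward check: |0,g> evolves into the target under U_qc(beta_j) U_q(alpha_j).
phi = np.zeros(dim, complex); phi[idx(0, 0)] = 1.0
for alpha, beta in zip(alphas[::-1], betas[::-1]):
    phi = expm(-1j * H_qc(beta)) @ expm(-1j * H_q(alpha)) @ phi
fidelity = abs(np.vdot(target, phi)) ** 2
```

After the reversed sweep, `psi` is the ground state (up to a global phase), and the reconstructed forward sequence reproduces the target with fidelity close to one.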
Appendix E: Model-free reinforcement learning for the Jaynes-Cummings scenario

It turns out that state-of-the-art model-free RL has surprising difficulties in addressing a physical scenario as important and conceptually simple as the Jaynes-Cummings model. In this subsection we provide some more details.
We will only consider the (simpler) no-feedback case, meaning only the controls α j and β j (see main text) are available. Since model-free RL already has severe problems in this case, we did not explore further the more challenging cases.
In our numerical experiments, we relied on the RL library Stable Baselines [81], which implements many of the most well-known, optimized, state-of-the-art RL algorithms. The RL environment (not to be confused with a "physical" environment) has been implemented in the following way:

• Action a_j: the two continuous controls, α_j and β_j.
• State s_j (i.e. input to the agent): in principle, the no-feedback task requires no state input. However, we chose to make it easier for the agent by supplying the full current quantum state of the system at time t_j. Since the state is pure and the system is closed, we simplify the observation by using only the state vector |ψ_j⟩ (instead of the density matrix). Since it is complex-valued, we split it into its real and imaginary parts, so that we have a vector of length 2N, where N is the size of the Hilbert space.
• Reward r j : the fidelity at step t j (in various versions, see below).
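The environment described by the bullet points above can be sketched as follows. We use a plain reset/step interface rather than a specific RL library; the target state (the highest Fock state |cutoff, g⟩) and all names are our own illustrative choices.

```python
import numpy as np
from scipy.linalg import expm

def jc_generators(cutoff):
    a = np.diag(np.sqrt(np.arange(1, cutoff + 1)), k=1)   # annihilation operator
    sp = np.array([[0, 0], [1, 0]], dtype=complex)        # sigma_+ = |e><g|
    return np.kron(np.eye(cutoff + 1), sp), np.kron(a, sp)

class JCEnv:
    def __init__(self, n_steps=4, cutoff=4):
        self.N = n_steps
        self.dim = 2 * (cutoff + 1)
        self.Kq, self.Kqc = jc_generators(cutoff)
        self.target = np.zeros(self.dim, complex)
        self.target[2 * cutoff] = 1.0                     # target |n = cutoff, g>
        self.reset()

    def reset(self):
        self.j = 0
        self.psi = np.zeros(self.dim, complex)
        self.psi[0] = 1.0                                 # start in |0, g>
        return self._obs()

    def _obs(self):
        # real-valued observation of length 2 * dim (Re and Im parts)
        return np.concatenate([self.psi.real, self.psi.imag])

    def step(self, action):
        alpha, beta = action                              # the two continuous controls
        Hq, Hqc = alpha * self.Kq, beta * self.Kqc
        U = expm(-1j * (Hqc + Hqc.conj().T)) @ expm(-1j * (Hq + Hq.conj().T))
        self.psi = U @ self.psi
        self.j += 1
        done = (self.j == self.N)
        # sparse final reward: the fidelity, given only at the last step
        reward = abs(np.vdot(self.target, self.psi)) ** 2 if done else 0.0
        return self._obs(), reward, done
```

A fixed-action episode, `env.step((0.3, 0.7))` repeated until `done`, returns an observation of length 2·dim at each step and a single fidelity-valued reward at the end.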
We have used a variety of different approaches to solve the task of pure-state preparation. These included: a sparse final reward (i.e., r_j = 0 for all j < N) or a reward based on the fidelity at each time step; either discretized or continuous actions; and several different optimization algorithms (PPO [60], A2C [82], HER [83], TRPO [84], DDPG [85]). The results shown in Fig. 3c) are the best we could produce among all these approaches. They were obtained with PPO, continuous actions, and sparse rewards, using the hyperparameters in Table I.

The following analysis gives useful insight into the optimization landscape for the open-loop preparation of complex superpositions of Fock states using Jaynes-Cummings controls.
Our numerical results for α = 3, corresponding to the average photon number n̄ = |α|² = 9, are summarized in Figure 11. We note at the outset that the excitation number of the target state is not bounded in this case. On the other hand, the Law-Eberly protocol allows one to reach only the first N Fock states in N preparation steps. Thus, the optimal strategy projects the target state onto the Hilbert space spanned by the first N Fock states and can be obtained using the Law-Eberly algorithm. This procedure also yields the smallest possible infidelity. We choose N = 20, corresponding to the minimal infidelity 1 − F ≈ 6 × 10⁻⁵.
The fidelity as a function of the number of training iterations (or, equivalently, trajectories used during training) for 10 different training runs is shown in panel a. As is to be expected, given that the control parameters slide down a rugged optimization landscape, the shape of the learning curve depends strongly on the initialization. Nevertheless, it displays robust features in the form of a series of plateaus whose heights do not depend on the initialization. It turns out that the states $\hat\rho(t_N)$ prepared following strategies obtained in different runs but corresponding to the same fidelity plateau are also approximately equal. In panels b and c, we show the Wigner function and the excitation-number distribution ⟨n|ρ̂(t_N)|n⟩ obtained with a representative strategy for each plateau in panel a. We note that the state is approximately the projection of the target state onto a Hilbert space containing the first N_th Fock states, with threshold excitation numbers N_th = 8, 12, 16, cf. panel c. Importantly, each of these states is prepared using different strategies in the different training runs, cf. panel d. Indeed, it is easy to construct several different strategies that prepare exactly these states by setting 2(N − N_th) controls to zero and choosing the remaining 2N_th controls to solve the Law-Eberly equations (D6) for N_th preparation steps. It can also be shown that such suboptimal strategies correspond to saddle points of the optimization landscape and that the curvature of the optimization landscape in the direction of increasing fidelity is zero, giving rise to a narrow valley. In this way, we can construct many suboptimal strategies for N preparation steps from an optimal strategy with N_th ≤ N steps. Each of these suboptimal strategies corresponds to a narrow valley in the optimization landscape. These valleys can cause the training to stall in a suboptimal solution.
In fact, we were not able to recover an optimal solution in any of the 10 training runs, each comprising 1000 training iterations. Nonetheless, we were able to obtain solutions of very good quality. The best solution we obtained (in one out of 10 runs) prepares the oscillator in the target state projected onto a Hilbert space with cutoff N_th = 16 (third row in panels b and c). This corresponds to the last valley before reaching an optimal solution. We have also run a set of 10 simulations comprising the same number of learning iterations but with a larger number of preparation time steps (N = 28). In this set of runs (not shown), we reached the fidelity plateau corresponding to N_th = 16 more consistently and could even reach the plateau for N_th = 20.

FIG. 12. The phases ϕ_n for 0 ≤ n < N_SNAP are predicted by a recurrent NN that is given the time j as an input (for n ≥ N_SNAP, ϕ_n = 0). Shown are also the Wigner functions of the initial state and of the target grid state. b) Mean value of the two (finite-energy) stabilizers Ŝ_{x,Δ} and Ŝ_{p,Δ} as a function of the number of trajectories sampled during training (or, equivalently, the number of training iterations) for four values of N_SNAP (three runs each; the best run is displayed as a solid line). The target grid state is in the manifold with ⟨Ŝ_{x,Δ}⟩ = ⟨Ŝ_{p,Δ}⟩ = 1. Plotted is the running average over 100 trajectories. Also shown is the Wigner function after training for N_SNAP = 50 and N_SNAP = 130. For small N_SNAP the tail of the Wigner function is distorted, while for larger N_SNAP it is indistinguishable by bare eye from the target Wigner function. This indicates that the quality of the strategy is limited by the expressivity of the parametrized control sequence. Compared to model-free RL [38], the model-based approach used here allows one to explore a higher-dimensional parameter manifold (larger values of N_SNAP) and, thus, to obtain better-quality results for large grid states.
Parameters: N = 9 and, for the grid state, Δ = 0.15, corresponding to var(n) ≈ n̄ ≈ 1/(2Δ²) ≈ 22 (the Hilbert space contains 130 Fock states).
In conclusion, our analysis indicates that the optimization landscape for the preparation of complex superpositions of Fock states using open-loop Jaynes-Cummings controls features a very large number of narrow valleys. In this setting, it is crucial to have direct access to the gradient of the landscape in order to slowly but steadily slide down the optimization landscape. Solutions of very good quality can then be consistently obtained.

For the grid-state results of Fig. 12, we used the same type of controller (a recurrent neural network that takes the time step as input) as in Ref. [38]. This corresponds to a 1188-dimensional control space, much larger than the one that could be handled in the model-free approach. In spite of this larger control-parameter space, training required only one hundredth of the trajectories. Most importantly, we obtained a better-quality solution, as reflected by the best stabilizer value ⟨Ŝ_{x,Δ} + Ŝ_{p,Δ}⟩/2 ≈ 0.995 out of three training runs with Δ = 0.15 and N_SNAP = 130.
Appendix H: Details on Feedback-GRAPE algorithm and on physical simulations

As explained in the main text, in the feedback-GRAPE approach presented in this manuscript we can produce the control values (conditioned on previous measurement results) either with the help of a neural network or with the help of a lookup table (containing trainable control values). In this section we present more details on both of these approaches, as implemented for the specific numerical examples shown in the main text.
In our illustrative physical scenario (the feedback-controlled Jaynes-Cummings model), there are four control parameters: α_j, β_j, γ_j and δ_j. In the most general case, where arbitrary superpositions should be generated, α_j and β_j need to be complex. In the scenarios whose results are displayed in the main text, this was not needed due to the nature of the target states. However, we have checked independently that the whole approach works just as well for complex control parameters.
Neural Network - We first discuss the case when the controls are computed by means of a neural network. This network can receive the measurement results obtained so far, m_1, m_2, …, m_j. Alternatively, we can supply it with the quantum state as input, updated according to the measurement outcomes. Both choices supply the full information content needed to apply the next control.
For the "state as input" approach, we defined a fully connected neural network that takes the density matrix of the system as input. Since the density matrix is complex-valued, we chose to split it into its real and imaginary parts and to stack them, in such a way that for an N_H × N_H density matrix, the input tensor has shape 2N_H × N_H. The fully connected NN has been employed both for the no-feedback case (pure-state preparation), where in principle no such input would be needed (but it can still be helpful for convergence), and also for the more interesting feedback cases.
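The stacking described above can be sketched in a few lines; the function name and the maximally mixed example state are our own.

```python
import numpy as np

def rho_to_input(rho):
    # stack the real and imaginary parts of an N_H x N_H density matrix
    # into one real-valued tensor of shape (2 * N_H, N_H)
    return np.concatenate([rho.real, rho.imag], axis=0)

N_H = 4
rho = np.eye(N_H, dtype=complex) / N_H      # example: maximally mixed state
x = rho_to_input(rho)                       # shape (2 * N_H, N_H)
```

The original complex matrix is recovered as `x[:N_H] + 1j * x[N_H:]`, so no information is lost in the encoding.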
If, on the other hand, we want to supply the measurement results directly, we employ a recurrent neural network (RNN). For our scenario, its input at each time step is a binary measurement outcome m_j ∈ {−1, +1}. When an RNN is used, due to the probabilistic outcomes of the trajectories during a simulation, it is useful to feed batches of multiple randomly sampled trajectories as input to the network.
As already mentioned, both types of neural networks output real-valued controls α j , β j , γ j and δ j to be applied in the next time step. When complex-valued controls are required, two additional neurons can be added to the output of the neural networks, and they correspond to the imaginary parts of α j and β j . In the main text, we did not use complex controls, because these were not needed for the tasks considered there.
Our neural networks are implemented using Keras, and their hyperparameters are shown for completeness in Tables II and III for the networks of Figures 7 and 12, respectively. The hyperparameter N_SNAP is 15 for Figure 7 and varies from 30 to 130 in Fig. 12, cf. the inset. For Figure 12 the input is the time step j, 0 ≤ j < N = 9, while for Figure 7 it is the measurement outcome m_j.
FIG. 13. Sketch of the three alternative types of trainable controls that can be employed in feedback-GRAPE. The first is a fully connected neural network that receives the density matrix (quantum state) of the system as input and outputs the controls. The second is an RNN with GRU cells as recurrent neurons. The third is a lookup table with $\sum_{n=0}^{N} 2^n$ entries (when feedback is required and the measurement outcomes are binary, as shown here), where each entry contains the controls that need to be applied after observing a particular measurement sequence.
Lookup Table - Another way to represent the entire feedback-based control strategy is to use a lookup table, which essentially is just a list of optimizable parameters. In the case of feedback, we have to build a lookup table that encodes the structure of a decision tree. For binary measurement outcomes (as used here), this table has $\sum_{n=0}^{N} 2^n$ entries, each of which is the vector of all control parameters, i.e. in our scenario (α_j, β_j, γ_j, δ_j). Each column of this table represents the 2^j possible control-parameter vectors at time step j ∈ {0, …, N}. At j = 0, we have only one set of numbers, which stands for the (only) possible control vector to apply (not dependent on any previous measurement; in our case reduced to only the entries controlling the first measurement). At step j = 1, we have two sets of numbers, and we apply the set of controls corresponding to the observed measurement outcome, and so on and so forth. By doing so, we can apply controls conditioned on the "memory" of all previous measurements, at the cost of keeping an exponentially growing number of entries in the computer's memory. Many of those will likely never be explored at all if their probabilities are too small.
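The tree-structured lookup table can be sketched as follows: column j holds the 2^j control vectors for time step j, and the measurement history is encoded as an integer index. The integer encoding and all variable names are our own choices.

```python
import numpy as np

N = 4                # number of time steps
n_controls = 4       # (alpha_j, beta_j, gamma_j, delta_j)
rng = np.random.default_rng(0)

# column j contains 2^j control vectors, initialized uniformly in (0, pi)
table = [rng.uniform(0, np.pi, size=(2 ** j, n_controls)) for j in range(N + 1)]
total_entries = sum(2 ** j for j in range(N + 1))        # = 2^(N+1) - 1

def controls(table, history):
    """Return the control vector to apply after observing `history`,
    a tuple of measurement outcomes in {-1, +1}."""
    j = len(history)
    # map each outcome to a bit, d_i = (1 - m_i) / 2, and read off an integer
    index = sum(((1 - m) // 2) * 2 ** i for i, m in enumerate(history))
    return table[j][index]

c0 = controls(table, ())          # the single entry at j = 0
c2 = controls(table, (+1, -1))    # one of the four entries at j = 2
```

Only the entries along actually-visited measurement branches are ever read, which is why low-probability branches may remain unexplored.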
In our numerical experiments, we went as far as lookup tables containing about 2²¹ ≈ 2 × 10⁶ entries, which could still easily be handled. The initial condition for the whole table was to set each parameter value to a random number uniformly distributed within (0, π).
For Fig. 8, we have used a large batch of size 10000 (1000) in the approaches with (without) Monte Carlo sampling. For the approach with random initialization, the lookup-table entries $\bar g\tau_j(\mathbf m_{j-1})$ were uniformly distributed within (−2π, 2π). In the approach with smart initialization, the initial entries are weakly randomized: for $\mathbf m_{j-1} = (1, \ldots, 1)$, we have chosen $\bar g\tau_j(\mathbf m_{j-1}) = \pi + z(\mathbf m_{j-1})$, with $z(\mathbf m_{j-1})$ uniformly distributed in the interval (0, 1). For the remaining measurement outcomes we set $\tau_j(\mathbf m_{j-1}) = 0$.
In several results mentioned in the main text, we use a lookup table "without memory". This means that there is just one control-parameter vector for each step j, instead of a tree-type structure with an exponentially growing number of parameters. Thus, we still optimize the controls but ignore the results of previous measurements. This is used both for the "non-adaptive" scheme in the purification task of Fig. 4c) and in Fig. 5g).
A sketch of all of the three feedback-based strategies discussed here and in the main text (neural network with state as input, recurrent neural network with measurement sequence as input, and a tree-type lookup table) is shown in Fig. 13.
In any case, in whichever way we choose to parametrize our controls, we have a finite number of parameters that need to be learned. The optimizer employed for every example is Adam [86], and its hyperparameters are shown in Table V.

For the task of Fig. 4c) of the main text, the shaded lines show 10 different strategies found by repeated runs of the algorithm, from different random starting points; the thick lines represent the best strategy found.
We discretize this continuous time evolution using the fourth-order Runge-Kutta method. We chose the Hilbert space to have a finite dimension N_H, with a cutoff in the Fock-state excitation number. An appropriate choice of the cutoff depends both on the initial and on the target state and ranges from 10 to 130 in our simulations.
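The RK4 discretization of the Schrödinger equation d|ψ⟩/dt = −iH|ψ⟩ can be sketched as follows; the random Hermitian Hamiltonian, step size, and dimension are illustrative choices, and the result is compared against the exact matrix-exponential propagator.

```python
import numpy as np
from scipy.linalg import expm

def rk4_step(H, psi, dt):
    # one fourth-order Runge-Kutta step for d(psi)/dt = -i H psi
    f = lambda p: -1j * (H @ p)
    k1 = f(psi)
    k2 = f(psi + 0.5 * dt * k1)
    k3 = f(psi + 0.5 * dt * k2)
    k4 = f(psi + dt * k3)
    return psi + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6)) + 1j * rng.normal(size=(6, 6))
H = (A + A.conj().T) / 2                      # random Hermitian Hamiltonian
psi = np.zeros(6, complex); psi[0] = 1.0

dt, n = 0.01, 100
phi = psi.copy()
for _ in range(n):
    phi = rk4_step(H, phi, dt)
exact = expm(-1j * H * dt * n) @ psi          # exact reference propagator
err = np.linalg.norm(phi - exact)             # O(dt^4) local accuracy
```

Note that RK4 preserves the norm only approximately (to the order of the scheme), which is one reason for choosing the step size small compared to the inverse Hamiltonian norm.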
In our work, we chose several different tasks within a Jaynes-Cummings model to illustrate the performance of our approach. Despite being only an illustrative physical example in this context, the model is of sufficient interest as a paradigm for actual feedback control of quantum-optical systems. In this section, we describe some of the insights we were able to extract by closer inspection of the numerical results obtained by feedback-GRAPE, in situations with feedback.
In the main text, we show the decision tree for the purification of a thermal state with initial occupation number ⟨â†â⟩ = 2 in four measurements. Here, we want to show how the insight gained by analyzing the decision tree for this special case allows one to derive an analytical solution for an optimal purification strategy valid for arbitrary temperature and number of measurements.
We start by reviewing the physics of the building-block measurement, cf. Eq. (4). This type of measurement was originally proposed in [62] and has been extensively used in quantum-optics experiments with flying Rydberg atoms, e.g. to monitor the occupation number of a cavity in the presence of very small thermal fluctuations [65] or to prepare a Fock state starting from an initial coherent state [64]. After each measurement, the Fock-state probability distribution P_j(n) is updated by multiplying it with a sinusoidal mask,
$$P_{j+1}(n) \propto P_j(n)\,\cos^2\!\left(\gamma_j n + \frac{\delta_j}{2}\right)$$
for the outcome m_j = 1 (with cos replaced by sin for m_j = −1). To better understand the effects of the measurement, it is important to keep in mind two key insights: (i) If the measurement strength is well approximated by a rational multiple of π, γ_i = πp_i/q_i, where p_i and q_i are coprime integers, the denominator q_i represents the period of the mask. Thus, the relative occupations P(n)/P(n′) of any pair of Fock states with the same excitation number modulo q_i, (n − n′) mod q_i = 0, do not change after the measurement. (ii) If the phase δ_i satisfies either condition
$$\pi\frac{p_i}{q_i} n_i + \frac{\delta_i}{2} = 0 \bmod \pi, \qquad\text{or}\qquad = \pi/2 \bmod \pi, \tag{J1}$$
for an integer n_i, one measurement outcome (m_i = −1 or m_i = 1, respectively) rules out the infinite set of Fock states with excitation numbers n satisfying n mod q_i = n_i. We note that if q_i is an even number, any δ_i that satisfies the first condition for n_i ≡ n_{i,−1} also satisfies the second condition for n_i = n_{i,1} ≡ (n_{i,−1} + q_i/2) mod q_i. In this scenario, each of the two possible measurement outcomes rules out a (different) infinite set of Fock states, n_{i,±1} mod q_i for m_i = ±1. We note further that there are infinitely many values of δ_i satisfying one of the two conditions in Eq. (J1) for the same n_i. All of these values of δ_i are rational multiples of π.
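Insight (i) can be checked numerically. The sketch below (with the mask written, under our own reading of the sign conventions, as cos²(γn + δ/2) for m = +1) applies one measurement update with γ = πp/q to a thermal distribution and verifies that the ratio P(n)/P(n′) is unchanged whenever (n − n′) mod q = 0.

```python
import numpy as np

nbar, nmax = 2.0, 40
n = np.arange(nmax)
P = (nbar / (1 + nbar)) ** n / (1 + nbar)        # thermal distribution
P /= P.sum()

p_i, q_i, delta = 3, 4, 0.7                      # gamma = 3*pi/4, arbitrary phase
gamma = np.pi * p_i / q_i
mask = np.cos(gamma * n + delta / 2) ** 2        # update for outcome m = +1
P_new = P * mask
P_new /= P_new.sum()

# relative occupation within one congruence class mod q_i, before and after
r_before = P[6] / P[2]                           # 6 - 2 = 4 = q_i
r_after = P_new[6] / P_new[2]
```

Because the mask has period q_i in n, both occupations in the same congruence class are multiplied by the same factor, so the ratio is exactly preserved.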
Motivated by insights (i) and (ii), we have written an algorithm that identifies values of γ_i and δ_i that are close to rational multiples of π with small denominators (we allow a deviation of 1% of π) and displays these rational values (in units of π) in the decision tree, as shown in Fig. 4d. By inspecting this decision tree, one can immediately observe that the NN tends to use a measurement strength γ_j corresponding to the period q_j = 2^j for the j-th measurement. In order to understand this pattern, we inspect the phases δ_j selected by the NN. For the first measurement, the measurement strength is γ_1 = π/2 and the phase is δ_1 = 0. This corresponds to n_{1,−1} = 0 and n_{1,1} = 1. In other words, the Fock state 0 (1), along with all other even (odd) states, is ruled out by the measurement outcome m_1 = −1 (m_1 = 1). Thus, the net effect is that, irrespective of the measurement outcome, the probability of every second Fock state is set to zero. Such a measurement extracts exactly 1 bit of information in the large-temperature limit. For the second measurement, the NN doubles the period of the sinusoidal mask, q_2 = 4 (independent of the outcome of the first measurement). By inspecting the phases δ_2 chosen adaptively by the NN, we find that they always rule out one of the two most likely states after the measurement. For example, in the upper branch (corresponding to m_1 = 1) all odd states have been decimated and, thus, the two most likely states are the 0 and 2 Fock states. From the tree we see that δ_2 = π/2 in this branch. This indeed satisfies the two conditions in Eq. (J1) with n_i = n_{2,−1} = 2 and n_i = n_{2,1} = 0, respectively. In other words, the Fock states with n mod 4 = 0 (n mod 4 = 2) are ruled out by the measurement outcome m_2 = 1 (m_2 = −1).
Since all odd Fock states had already been ruled out after the first measurement, the overall effect of the first two measurements is to postselect every fourth Fock state, n mod 4 = 0 (n mod 4 = 2) for m_1 = 1 and m_2 = −1 (m_1 = m_2 = 1). Likewise, the choice of the phase δ_2 = −π/4 in the lower branch postselects every fourth Fock state, now n mod 4 = 1 and n mod 4 = 3 for m_2 = 1 and m_2 = −1, respectively. This strategy can easily be generalized to any arbitrarily large number of measurements J: the period is doubled after every measurement, q_j = 2^j, independent of the measurement outcomes, and appropriate adaptive phases δ_j are selected so as to always rule out one of the two most likely states. Such a strategy postselects the Fock states with n mod 2^J = n_i, where n_i depends on the measurement history. More precisely, there is a bijective mapping between 0 ≤ n_i ≤ 2^J − 1 and the 2^J possible measurement outcomes. Indeed, a close inspection of the strengths γ_i and phases δ_i selected by the NN shows that the NN adopts this strategy for all four measurements in most (but not all) branches. A notable exception is the third measurement in the lowest branch (corresponding to m_1 = m_2 = −1). This choice results in an ineffective measurement that does not allow one to exclude either of the two most likely states. Interestingly, in this case the NN selects for the fourth measurement the measurement settings that were expected (according to the strategy identified above) already for the third measurement. We believe that this suboptimal strategy corresponds to a local minimum for the gradient ascent. We note that the strategy whose tree is displayed in Fig. 4(d) was obtained by selecting the best gradient-ascent training run out of 10 runs with different random initializations.
A tree without any such suboptimal measurements could be obtained by performing more gradient-ascent runs or, more efficiently, by increasing the temperature of the initial mixed state (which penalizes suboptimal purification strategies more strongly).
The same optimal strategy discussed above can be implemented for infinitely many different choices of γ_j and δ_j. In particular, different bijective mappings between the measurement outcomes and the most likely state n_j after j measurements can be implemented. To find a simple analytical solution for one implementation of the optimal strategy, we choose p_j = 1 and, thus, γ_j = π/2^j. In addition, we choose n_j as the number whose binary representation is d_{j−1} … d_2 d_1, with d_i = (1 − m_i)/2; e.g. for m_1 = m_2 = −1, corresponding to d_1 = d_2 = 1, we have n_3 = 1 + 2 = 3. This mapping is implemented if the phase δ_j always rules out the Fock state with the largest probability (or, equivalently, the lowest excitation number among the states that have not yet been decimated by previous measurements) for the measurement outcome m_j = −1. With these constraints we find a simple analytical solution for the phases, δ_j = πn_j/2^j.
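The overall strategy can be simulated directly. In the sketch below we use the strengths γ_j = π/2^j and, rather than the closed-form phases, choose δ_j at each step so that the outcome m_j = −1 decimates the currently most likely Fock state; the mask convention cos²(γ_j n + δ_j/2) for m_j = +1 (sin² for m_j = −1) and the sign conventions are our own, and only the congruence structure of the surviving states is checked.

```python
import numpy as np

rng = np.random.default_rng(3)
nbar, nmax, J = 2.0, 64, 3
n = np.arange(nmax)
P = (nbar / (1 + nbar)) ** n / (1 + nbar)        # initial thermal state
P /= P.sum()

for j in range(1, J + 1):
    gamma = np.pi / 2 ** j                       # p_j = 1, q_j = 2^j
    n_star = int(np.argmax(P))                   # most likely surviving state
    delta = -2 * gamma * n_star                  # sin^2 mask vanishes at n_star
    mask_plus = np.cos(gamma * n + delta / 2) ** 2
    # sample the binary outcome with its Born probability, then update
    m = 1 if rng.random() < (P * mask_plus).sum() else -1
    P = P * (mask_plus if m == 1 else 1 - mask_plus)
    P /= P.sum()

# after J binary measurements, a single congruence class mod 2^J survives
survivors = n[P > 1e-9]
```

At step j the mask splits the surviving congruence class mod 2^{j−1} cleanly into its two subclasses mod 2^j (one is multiplied by 1, the other by 0), which is the doubling-of-the-period structure identified in the decision tree.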
Appendix K: Symmetry of the optimization landscape for the spin state preparation with uncertain parameters

In this Appendix we analyze the symmetries of the optimization landscape for the learning of feedback strategies to prepare an ensemble of qubits in the excited state, investigated in Section III F of the main text.
For N = 2, the 3D optimization landscape displayed as three cuts in Fig. 8(d) is the average ⟨F_m²⟩_{g∼P(g)} of the coupling-dependent fidelity F_m. We note that this function has three mirror planes, because it is invariant under a sign change of τ_0, τ_1(m_0 = 1), or τ_1(m_0 = −1). In addition, on the plane τ_1(m_0 = −1) = 0, corresponding to the leftmost cut in Fig. 8(d), it can be rewritten in a form from which it becomes clear that one can exchange τ_0 and τ_1(m_0 = 1) without changing the fidelity. These symmetries are present for any value of g and, thus, also for any weighted average over g, in particular for the optimization landscape ⟨F_m²⟩_{g∼P(g)}. Since the optimal solutions lie on the plane τ_1(m_0 = −1) = 0, there are 8 symmetry-related optimal solutions corresponding to the same coupling-dependent fidelity F_m. This result can be generalized to the case of N measurements. On the hyperplane with τ_j(m_{j−1}) = 0 for all m_{j−1} ≠ (1, …, 1), the coupling-dependent fidelity is a function of the variables τ_j = τ_j(m_{j−1}) with m_{j−1} = (1, …, 1). It is easy to show that this function is symmetric under permutation of its N variables τ_j. This leads to 4 × N! optimal solutions with the same coupling-dependent fidelity F_m.