Measurement Based Feedback Quantum Control With Deep Reinforcement Learning for Double-well Non-linear Potential

Closed loop quantum control uses measurement to control the dynamics of a quantum system to achieve either a desired target state or target dynamics. In the case when the quantum Hamiltonian is quadratic in ${x}$ and ${p}$, there are known optimal control techniques to drive the dynamics towards particular states e.g. the ground state. However, for nonlinear Hamiltonians such control techniques often fail. We apply Deep Reinforcement Learning (DRL), where an artificial neural agent explores and learns to control the quantum evolution of a highly non-linear system (double well), driving the system towards the ground state with high fidelity. We consider a DRL strategy which is particularly motivated by experiment where the quantum system is continuously but weakly measured. This measurement is then fed back to the neural agent and used for training. We show that the DRL can effectively learn counter-intuitive strategies to cool the system to a nearly-pure `cat' state which has a high overlap fidelity with the true ground state.

As research on quantum communication and computation has progressed rapidly towards the holy grail of quantum computing, quantum state engineering has begun to take on a high profile [1][2][3]. Of particular importance are feedback control techniques, in which a physical system subjected to noise is continuously monitored in real time while the measurement information is used to impart specific driving controls that modulate the system dynamics [4]. Unlike for classical systems, measurement control of a quantum mechanical system is challenging for a number of reasons. Firstly, the act of continuously observing a quantum system introduces non-linearity within the conditioned dynamics. Secondly, continuous measurement of a quantum system generally alters it, generating measurement-induced noisy dynamics, commonly known as quantum back-action. Finally, applying feedback which depends on the noisy measurement current adds further noise to the dynamics. Consequently, a variety of feedback control schemes that work well for classical systems may fail for their quantum counterparts [4][5][6].
In recent years, machine learning (ML) has rapidly gained interest, leading to numerous technological advancements in machine vision, voice recognition, natural language processing, automatic handwriting recognition, gaming, engineering, and robotics, to name a few [7]. ML models broadly fall into three categories: supervised learning, unsupervised learning and reinforcement learning (RL) [7][8][9]. In supervised/unsupervised methods, the ML model is trained on sufficiently large labeled/unlabeled datasets, which it uses to discover the predictive hidden features of the system of interest. RL, on the other hand, approaches the problem differently and is not pre-trained with any external data explicitly, but learns in real time based on rewards. Indeed, RL is regarded as the most effective way to benefit from the creativity of machines: it collects experience by performing random experiments on the system (known as the environment in the RL literature), learning by trial and error. RL, especially in combination with deep neural networks (abbreviated DRL), is poised to revolutionize the field of artificial intelligence (AI), particularly with the emergence of autonomous systems which process, in real time, stimuli from real-world environments [10].
There have already been several important applications of ML in different areas of physics, such as statistical mechanics, many-body systems, fluid dynamics, and quantum mechanics [11][12][13][14][15]. Most of these applications are supervised in nature; in the quantum domain, for example, they have been applied to solving many-body systems [16], the determination of high-fidelity gates and the optimization of quantum memories by dynamical decoupling [17], quantum error correction [18][19][20], quantum state tomography [21][22][23][24][25], and the classification and reconstruction of optical quantum states [26].
Recently, a few applications of DRL in quantum mechanics have also appeared, including applications in quantum control [27][28][29][30], quantum state preparation and engineering [31][32][33][34][35], state transfer [36][37][38], and quantum error correction [39,40]. While the number of works utilising DRL is increasing, very few consider using continuous measurement outcomes explicitly for training the DRL agent [27,32,39]. As experiments often employ such continuous quantum measurement techniques for feedback control, we will consider this type of measurement as a key ingredient of our analysis below.
While traditionally known optimal control techniques work very well for linear, unitary and deterministic quantum systems, there is no known generalised method for non-linear and stochastic systems. RL, on the contrary, is agnostic to the underlying physical model, but attempts to control the dynamics of the system by finding patterns in the data it produces. In this Letter, we model the quantum evolution of a particle in a double well (DW) subject to continuous measurement at a rate Γ of the operator x², whose even parity avoids measurement localisation of the particle's wavefunction to either well [41]. The DRL agent controls the quantum dynamics via a modulation of the Hamiltonian H(t) = H + F(t), with F(t) = A(t)(xp + px), where x and p are (dimensionless) canonical operators: a squeezing operator whose strength A(t) is modulated by the DRL agent. The DRL agent is trained via the continuous measurement current while, in real time, it acts back on the system via F(t). We show that the DRL agent can be trained to cool the particle close to the ground state.

FIG. 1. (a) The working of a DRL actor-critic model. The agent consists of two networks, actor and critic: the actor decides the action to be applied to the environment based on the suggestion made by the critic network, which computes the value function from the reward and state obtained from the DRL environment. The environment includes the physics of the problem (the quantum dynamics/state of a particle moving in a DW and the modulation function used to alter the quantum dynamics) and the reward estimation function based on the observables (measurement results). (b) The Wigner function distribution for the ground state of the DW, and the corresponding probability distributions in the position (c) and Fock-number (d) bases. The ground state of the DW is an even parity state.
Interestingly, the cooling efficiency depends on the choice of Γ: there is an optimal value of Γ that achieves the best cooling, which we identify numerically.
RL translates the problem at hand into a game-like situation in which an artificial agent (also called the controller) finds a solution based on a trial-and-error approach [8,9], with no hints or suggestions on how to solve the problem itself. For this purpose, the agent is given a policy (in the case of DRL, the neural network itself), which is optimized based on scalar values (rewards) it receives from the environment (which includes the physics of the problem and the reward estimation function based on the observables) for each decision (action) it makes. By harnessing the power of search, coupled with many trials, the RL agent gains experience from thousands of instances executed sequentially or in parallel on a sufficiently powerful computing infrastructure. After sufficient training, the agent can become skilled enough to display sophisticated tactics and superhuman abilities, as was phenomenally demonstrated by Google's AlphaGo [42,43]. To give a perspective on the applicability of RL in physics and the kinds of tasks it can solve, we provide a short demonstration of a problem in elementary mechanics, included as a media file in the supporting information.
It is possible to implement the DRL agent according to two distinct frameworks, policy-based or value-based [9]. In policy-based frameworks, the policy parameters (the weights and biases of the neural network) are optimized directly based on the rewards received, informing the agent's future actions on the environment (see Fig. 1(a)). Value-based methods, on the other hand, optimize the expected future return of a given value function and deduce the policy from it [8]. It is possible to achieve the best of both worlds by combining the two approaches, known as Actor-Critic (AC) methods [9]. Here the actor is the policy being optimised and the critic is the value function being learned. The actor network can be modeled using various policy-based approaches such as Vanilla Policy Gradient [9], Trust Region Policy Optimization (TRPO) [44], or the more recent Proximal Policy Optimization (PPO) [45]. In our work, we used PPO in combination with Advantage Actor-Critic (A2C) [46] as the DRL agent. The PPO scheme optimizes a clipped surrogate objective (loss) function given by

L^{CLIP}(\theta) = \hat{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right],

where r_t(\theta) = \pi_\theta(a_t|s_t)/\pi_{\theta_{\rm old}}(a_t|s_t) is the probability ratio between the current and old stochastic policies, so r_t(\theta_{\rm old}) = 1. Furthermore, \hat{E}_t is the empirical average over a finite batch of samples and \hat{A}_t is an estimator of the advantage function at timestep t, obtained from the critic and calculated as the difference between the Q-value for action a_t at state s_t and its average value V: \hat{A}_t(s_t, a_t) = Q(s_t, a_t) - V(s_t). Clipping the ratio within a bound specified by \epsilon = 0.2 ensures that the policy is not updated too much. The A2C framework allows synchronous training of multiple parallel worker environments simultaneously, which enables faster training. A more detailed theory can be found in the Supplementary Materials. A depiction of the DRL model employed in this study is shown in Fig. 1(a).
In this article, we will work with dimensionless position and momentum denoted by (x, p). The relationship to the physical position and momentum variables is x = Q/Q₀, p = P/P₀, where Q₀, P₀ are suitable scales for position and momentum. As the canonical commutation relation is [Q, P] = iℏ, we have [x, p] = ik̄, where the dimensionless Planck's constant is defined by k̄ = ℏ/(Q₀P₀). The DW potential we consider is formed along the x axis by the Hamiltonian of a particle,

H = \frac{p^2}{2} + h\left(\frac{(x-a)^2}{b^2} - 1\right)^2,

where b gives the location of the wells' minima, h is the height of the barrier between the wells, and a is the offset along x.
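As a numerical sketch (our own illustration, not the authors' code): the dimensionless operators can be built on a truncated Fock basis with NumPy, and the quartic double well above can be assembled from them. The truncation dimension and the use of ladder operators are implementation assumptions.

```python
import numpy as np

dim, kbar = 40, 1.0
ac = np.diag(np.sqrt(np.arange(1, dim)), k=1)       # annihilation operator
x = np.sqrt(kbar / 2) * (ac + ac.conj().T)          # dimensionless position
p = 1j * np.sqrt(kbar / 2) * (ac.conj().T - ac)     # dimensionless momentum

# a quartic double well with minima at x = a +/- b and barrier height h
a_off, b, h = 0.0, 3.0, 5.0
I = np.eye(dim)
W = (x - a_off * I) @ (x - a_off * I) / b**2 - I
H = p @ p / 2 + h * (W @ W)

# away from the truncation edge, [x, p] = i * kbar holds exactly
comm = x @ p - p @ x
```

The last row/column of the commutator is corrupted by truncation, which is why only the interior block reproduces i k̄ exactly.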
The ground state of this potential is a type of Schrödinger cat state due to the even parity symmetry of H in both x and p. The ground state can be depicted by the Wigner function W(x, p), shown in Fig. 1(b). The probability distribution along the x-axis, i.e. ρ(x) = ∫ W(x, p) dp, is shown in Fig. 1(c). Furthermore, the ground state has even parity symmetry while the first excited state has odd parity [47]. Hereafter, we set the parameters a = 0, b = 3.0 and h = 5, which makes the potential symmetric about the origin x = 0. It is worth noting that the DW potential can now be engineered in laboratories, for example in superconducting circuits, Bose-Einstein condensates and magneto-optical setups [48].
To provide data to the agent we consider that the quantum system is subject to a continuous measurement process, and these measurement results are provided to the agent in real time. This continuous measurement also induces back-action on the quantum system, adding noise both to the conditioned quantum dynamics and to the observed measurement data. The dynamical evolution of the density operator ρ_c(t), conditioned on the stochastic measurement record up to time t, is described by a quantum stochastic master equation [4],

d\rho_c = -\frac{i}{\bar{k}}[H(t), \rho_c]\,dt + \mathcal{D}[A]\rho_c\,dt + \sqrt{\eta}\,\mathcal{H}[A]\rho_c\,dW(t),

where A is a Hermitian observable under continuous measurement (known as the measurement operator), η is the measurement efficiency, and H[A] and D[A] are super-operators given by

\mathcal{H}[A]\rho = A\rho + \rho A^\dagger - \mathrm{Tr}[(A + A^\dagger)\rho]\,\rho,

\mathcal{D}[A]\rho = A\rho A^\dagger - \tfrac{1}{2}\{A^\dagger A, \rho\},

where {·, ·} denotes the anti-commutator. Furthermore, dW(t) in Eq. 3 represents a Wiener increment, with mean zero and variance equal to dt. The measurement current, I(t), is described by a classical stochastic process that satisfies an Ito stochastic differential equation (see supplementary information),

I(t)\,dt = \gamma g\left( \langle x^2 \rangle_c\,dt + \frac{dW(t)}{\sqrt{4\Gamma}} \right),

where g is a gain with units inverse to those of the measurement operator A, so that I(t) has units of frequency. In the context of the present work we have A = √Γ x², where Γ is the measurement rate and quantifies the quality of the measurement (see supplementary information). As we have fixed units so that x and p are dimensionless, we can set γg = 1.
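For concreteness, one Euler-Maruyama integration step of such a conditional stochastic master equation can be sketched in plain NumPy. This is our own minimal illustration, not the solver used in the paper; the example operators, step size and efficiency are assumptions.

```python
import numpy as np

def sme_step(rho, H, A, dt, kbar=1.0, eta=1.0, rng=np.random.default_rng(0)):
    """One Euler-Maruyama step of the conditional stochastic master equation:
    drho = -(i/kbar)[H, rho] dt + D[A]rho dt + sqrt(eta) H[A]rho dW."""
    dW = rng.normal(0.0, np.sqrt(dt))                    # Wiener increment
    Ad = A.conj().T
    # D[A]rho = A rho A^dag - (1/2){A^dag A, rho}
    D = A @ rho @ Ad - 0.5 * (Ad @ A @ rho + rho @ Ad @ A)
    # H[A]rho = A rho + rho A^dag - Tr[(A + A^dag) rho] rho
    Hs = A @ rho + rho @ Ad - np.trace((A + Ad) @ rho) * rho
    rho = rho + (-1j / kbar) * (H @ rho - rho @ H) * dt + D * dt \
              + np.sqrt(eta) * Hs * dW
    rho = 0.5 * (rho + rho.conj().T)                     # restore Hermiticity
    return rho / np.trace(rho).real, dW

# illustrative use: a Fock ground state measured via A = sqrt(Gamma) x^2
d = 10
ac = np.diag(np.sqrt(np.arange(1, d)), k=1)              # annihilation
x = (ac + ac.conj().T) / np.sqrt(2)
A = np.sqrt(0.1) * (x @ x)                               # Gamma = 0.1
H = ac.conj().T @ ac                                     # stand-in Hamiltonian
rho = np.zeros((d, d), dtype=complex)
rho[0, 0] = 1.0
for _ in range(100):
    rho, _ = sme_step(rho, H, A, dt=0.01)
```

Since A is Hermitian, the step preserves the trace up to round-off; in production one would use a higher-order or norm-preserving stochastic scheme.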
To cool the system to the ground state (which is a 'cat' state) via continuous measurement, it is important to choose the stochastic operator A in Eq. 3 as √Γx² instead of √Γx, as the latter would collapse the state to one of the two minima of the DW [41]. At each interaction, the agent adds a squeezing term F(t) ≡ A(t)(xp + px) to the Hamiltonian (Eq. 2), attempting to adjust the value of A(t) in the continuous range A(t) ∈ [−5, 5] to maximize the reward. The choice of such feedback is motivated by the physics of the problem, which we explain in detail in the supporting information, backed by an analysis using Bayesian control driven by the conditional mean of the measurement record, following the method of Stockton et al. [49]. It is possible to implement xp-type Hamiltonian terms via motion in a magnetic field [50]. In each episode of the training process the DRL agent interacts with the environment 1000 times, in intervals of δt = 0.01, each time applying an action to the environment. Further details of the implementation and other technicalities of the DRL can be found in the supporting information.
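The action of the control term can be verified numerically: F = A(xp + px) generates the scaling x → e^{2At}x in the Heisenberg picture, so a negative amplitude squeezes the position variance. A minimal sketch (our own illustration; the values of A and t are arbitrary):

```python
import numpy as np

dim, kbar = 60, 1.0
ac = np.diag(np.sqrt(np.arange(1, dim)), k=1)
x = np.sqrt(kbar / 2) * (ac + ac.conj().T)
p = 1j * np.sqrt(kbar / 2) * (ac.conj().T - ac)

A_fb, t = -0.5, 0.2                      # feedback amplitude and duration
F = A_fb * (x @ p + p @ x)               # Hermitian squeezing generator

# exponentiate the Hermitian F through its eigendecomposition
w, V = np.linalg.eigh(F)
U = V @ np.diag(np.exp(-1j * w * t / kbar)) @ V.conj().T

psi = U @ np.eye(dim, dtype=complex)[:, 0]          # act on the vacuum
var_x = (psi.conj() @ x @ x @ psi).real - (psi.conj() @ x @ psi).real ** 2
# Heisenberg picture: x -> exp(2 A_fb t) x, so Var(x) = exp(4 A_fb t) kbar/2
```

With A_fb < 0 the variance drops below the vacuum value k̄/2, which is the squeezing the agent exploits.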
The amplitude of the measurement noise depends on two parameters: (a) the measurement strength Γ and (b) the measurement time δt. Since the Wiener noise in Eq. 6 is Gaussian with variance δt, the noise term in the measurement current I(t) scales at least as 1/√(4Γδt). One might therefore expect the DRL to learn more efficiently for larger values of Γ; however, this is not the case, which makes it critical to choose an optimal value of Γ along with the measurement time δt. For the choice δt = 0.01, we observe that optimal learning occurs near Γ ∼ 0.1 and worsens for other values of Γ. Similar effects can be observed in Markovian measurement-based direct feedback, which we discuss in the supplementary information. Larger values of Γ increase the noise, while smaller values of Γ yield a very low signal-to-noise ratio in the continuous measurement process. The agent tends to learn most efficiently when the dynamics fluctuate in a limited domain around the DW minima. The effectiveness of the agent's learning is shown in Fig. 2(a), in terms of the mean and maximum fidelities from 10 successive deterministic episodes of agents trained on the measurement current.

An important ingredient in DRL is a suitable choice of the reward function. Many previous studies have used fidelity or energy as the reward function; however, such functions are not practically available in experiments. Instead we propose a measurement-based reward function R(t) = −| I(t)/(γg) − 3/2 |, where I(t) is the measurement current (Eq. 6). This function attains its maximum reward of 0 when ⟨x²⟩_c = 3/2, at the well minima positions. The learning process of the agent is shown in Fig. 2(b), along with the fidelity of the instantaneous state of the system with respect to the DW ground state in Fig. 2(c). Although such a fidelity cannot be evaluated experimentally in real time, we present it as a check of the learning process.
In a given episode the trained agent is able to adapt the feedback in such a way that the particle oscillates near the minima of the DW, as shown in Fig. 2(d) (for γg = 1). The corresponding variation of the fidelity over the episode is shown in Fig. 2(e). It is worth noting that with a different choice of reward function it might be possible to obtain better and more stable learning; see the supporting information for details.
At the beginning of each episode, the environment must be reset to an initial state (initial density matrix ρ(0)), which is needed to start the stochastic master equation solver. We have found that the choice of ρ(0) plays a crucial role in determining the total reward that can typically be achieved by the agent. When ρ(0) is a thermal or coherent state, the DRL converges to an average fidelity of about 60%. However, if we use a small cat state or the ground state of the DW itself, the agent is able to achieve a mean trained fidelity of over 80% with noisy measurement data. The parity of the initial state is crucial, as the stochastic process of continuous measurement and feedback we have chosen is parity conserving. The target ground state of the DW has even parity, so choosing an initial state with a component of odd parity will lower the ultimate fidelity achievable. We achieve similarly high performance if we start with a thermal state projected onto even parity, as done for Fig. 2. An explicit comparison of the performance of the trained agent with an untrained one is demonstrated in the supporting media.
We benchmark our results against the state-based Bayesian feedback protocol (where the feedback is based on an estimate of the state) proposed by Doherty et al. [49,51]. In the context of the present work, the protocol can be simplified to provide feedback of the form F(t) = A(t)(xp + px), with A(t) a function of ⟨x²⟩_c(t), the conditional mean of the observable x². We find that this Bayesian control achieves a mean fidelity of ∼85%. However, ⟨x²⟩_c(t) is not a quantity directly accessible in real experiments. When the Bayesian feedback is instead driven by the noisy measurement current I(t) (which is available in experiments), we find that it demonstrates almost no control over the dynamics. Numerical simulations with an ensemble of 1000 copies of the system, evolving under feedback based on the mean of the measurement currents during each time step, yield an average fidelity of ∼42%. This is considerably worse than the performance of the DRL agent. A more detailed discussion can be found in the supporting information.
We have found that the DRL shows robust control when the measurement efficiency η > 50%, as shown in Fig. S5(a) of the supporting information, implying that the DRL agent is able to find patterns in the underlying dynamics even when the stochasticity is significantly increased. Similarly, the DRL shows no significant drop of fidelity under additional dephasing of the form ∼ √γ a†a, but the fidelity does drop for damping ∼ √γ a, where γ is the decoherence rate (see supporting information).
On the computational side, the challenge for DRL control is the significant computational expense, e.g., 3-4 days of simulation time even with fast CPUs, the bottleneck being the slow stochastic solver routines. On the RL side, we find that RL algorithms do not scale linearly with the number of parallel processes, so improved parallelization of such RL computations could be worthwhile. We expect improved performance if the agent learns through a dynamic combination of supervised learning (in a supervised setting an ML agent can learn more effectively from fewer data points, but is not reward based) and reinforcement learning (which is reward based and hence useful for feedback control), as done for image recognition [52][53][54]. It is also possible to reshape the data into images of actions and measurement records, which would enable the use of convolutional neural networks (CNNs) on GPUs/TPUs with multi-core support and the use of image compression techniques, e.g., the deep compressed sensing technique recently proposed by researchers from DeepMind [55]. From the physics side, the use of proper filters (as is standard in experiments) on the noisy signals prior to inputting them into the DRL is expected to be crucial. In addition, better reward estimation, such as combining constraints on the current, the fidelity (using tomography), and the energy, is expected to be useful for further improvement. A further innovation would be to use RL in combination with various optimal and Bayesian control protocols, as recently explored in applications outside of quantum mechanics [20,56,57].

In conclusion, we have demonstrated the usefulness of deep reinforcement learning in tailoring the non-trivial feedback parameters of a non-linear system to engineer evolution towards the ground state.
We found that the artificial agent can discover novel strategies, based solely on measurement records, to engineer high-fidelity 'cat states' in the quantum double well.

SUPPLEMENTARY INFORMATION

Essentially, there are three kinds of machine learning: supervised learning, unsupervised learning and reinforcement learning [7]. In supervised learning, the ML model is provided with sufficient input and output data (labeled data), which are used to train the model to discover the hidden patterns/features in the datasets. For example, the dataset may consist of RGB images of different fruits, e.g., apples, oranges, and bananas, each labeled by name. The ML model reads the pixels of those images and learns to correlate them with the labels by training on them. After training on enough data, the ML model is expected to acquire the necessary wisdom to predict from a new image (on which it was not trained) whether or not it shows an apple. In unsupervised learning, by contrast, no labels are provided with the datasets. After training, the model is able to classify the fruits based on hidden features in the pixel data of the images, and eventually acquires the expertise to classify new test images of fruits as apples, oranges or bananas.
FIG. S1. The workflow of reinforcement learning, with the environment a classical upside-down harmonic oscillator. At a given time the agent exerts an action (a discrete force) that modifies the dynamics of the environment (a direct change in the equations of motion). The new value of the oscillator's position x a short time later is the observation (stimulus) received by the agent, along with a reward, chosen as r(t) = 0.11/(|x(t)| + 0.01); this reward is maximised at x = 0 and gradually decreases away from the origin. The reward value and associated observation are the information upon which the agent modifies its parameters and decides a new action. The loop is repeated for enough episodes that the agent's net reward per episode saturates at its maximum.
Reinforcement learning (RL) [9] is the third kind of ML; its basic workflow is shown in Fig. S1. It comprises two main characters: the agent and the environment. The environment is the agent's world, in which it lives and interacts. The agent is the brain of the RL and comprises function approximators, such as artificial neural networks. When the agent uses deep neural networks as function approximators (neural networks are non-linear function approximators), the RL is known as deep reinforcement learning (DRL). The RL agent periodically interacts with the RL environment in real time, applying changes to the dynamics of the system by changing some dependent variables. The changes it applies to the environment are called actions. In return, the agent gets back full or partial information about the changed dynamics of the environment (known as the observation, or the state of the system) along with a scalar value known as the reward signal. The reward function returns a scalar value that estimates how likely the applied action is to bring about desirable changes in the dynamics. The idea behind choosing a reward function is that rewarding or punishing an agent for its behaviour (the actions applied) makes it more likely to repeat or forgo that behaviour in the future. To this end, the agent is set the goal of maximizing the cumulative reward (known as the return) by changing its actions within a certain number of interactions in real time, known as an episode (a play, in games). At the beginning of such an RL experiment, the agent applies actions randomly, and later updates the parameters of its function approximators (the weights and biases of the neural networks, in the case of DRL), based on which the next actions are chosen. In practice, the agent-environment loop has to be computed for several hundreds or thousands of episodes, depending on the complexity of the problem of interest.
For instance, for stochastic environments with noisy data, the learning process of the RL agent is slower than for its deterministic counterpart.
To understand this better, we demonstrate the working of DRL on a fundamental classical mechanics problem, shown in Fig. S1. The environment is an upside-down harmonic oscillator with a completely frictionless surface, such that a ball cannot be held fixed on top of it without external forces. The task of the agent is to apply engineered forces back and forth to keep the ball on top, namely in the blue region shown in the figure. The reward function is chosen such that the agent gets the maximum reward at x = 0 and gradually smaller ones away from the origin. This must be programmed together with the solver of the equations of motion in the RL environment. The agent, composed of a few layers of neural networks (hence, deep reinforcement learning, DRL), can be trained for a few thousand steps before it acquires superhuman or human-like skill in adapting the external forces to keep the ball stable within the blue region. A video demonstration can be found in the supporting media or at the GitHub link [58].
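A minimal sketch of such an environment follows (our own illustration: the Euler time step and the hand-crafted stabilising force are assumptions, the latter standing in for the control law the agent must discover by itself).

```python
class InvertedOscillatorEnv:
    """Ball on a frictionless upside-down harmonic potential: x'' = w^2 x + F."""

    def __init__(self, w=1.0, dt=0.01):
        self.w, self.dt = w, dt

    def reset(self, x0=0.05, v0=0.0):
        self.x, self.v = x0, v0
        return self.x

    def step(self, force):
        # semi-implicit Euler update of the unstable dynamics
        self.v += (self.w ** 2 * self.x + force) * self.dt
        self.x += self.v * self.dt
        reward = 0.11 / (abs(self.x) + 0.01)   # maximal (= 11) at x = 0
        return self.x, reward

env = InvertedOscillatorEnv()
env.reset()
# with zero force, x grows exponentially; this simple linear feedback
# keeps the ball balanced near the top of the potential
for _ in range(2000):
    _, r = env.step(force=-2.0 * env.w ** 2 * env.x - env.v)
```

The feedback converts the unstable equation x'' = w²x + F into a damped oscillator x'' = −w²x − v, so the position decays towards the origin and the reward approaches its maximum.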
The rule by which the agent decides the action to be applied to the environment is known as its policy. In the case of DRL, the rule is roughly represented by the neural network function approximators, as it is implicitly determined by some non-intuitive choice of the parameters (weights and biases) of the neural networks. A policy can be deterministic, usually denoted a_t = µ_θ(s_t), or stochastic, denoted a_t ∼ π_θ(·|s_t), where a_t is the action chosen by the agent as a function of the state s_t at time t. The dot (·) in the stochastic policy indicates that the action is drawn from a probability distribution conditioned on s_t, unlike the deterministic policy a_t = µ_θ(s_t), for which µ_θ deterministically maps the state s_t to an action at time t. A deterministic policy is usually chosen only for evaluating the performance of an agent after it has been trained.
The reward function R is one of the most critical parts of RL and is crucial for the agent's learning. In its most general form, it is a function of the current state, the action applied and the next state of the environment, r_t = R(s_t, a_t, s_{t+1}), where r_t is the scalar reward signal at time t. The cumulative reward is defined as

R(\tau) = \sum_{t=0}^{T} r_t,

where \tau = (s_0, a_0, s_1, a_1, \ldots) represents the sequence of states and actions over an episode, known as a trajectory, and the horizon T gives the extent to which the summation is done. In this case, the return is known as the finite-horizon undiscounted return. Often the choice is to use the infinite-horizon limit with a discount factor \gamma \in (0, 1), known as the infinite-horizon discounted return,

R(\tau) = \sum_{t=0}^{\infty} \gamma^{t} r_t,

which essentially signifies the fact that reward now is better than reward later; the choice of discounted rewards is also mathematically convenient. The task of the agent is to maximize the expected return J(\pi) = \hat{E}_\tau[R(\tau)], where \hat{E}_\tau is the empirical average over a batch of sampled trajectories, and the policy is optimized towards the optimal policy \pi^* = \arg\max_\pi J(\pi). The expected return for a given state or state-action pair is known as the value function. The expected return for starting in a state s_0 = s and following the policy afterwards gives the on-policy value function,

V^{\pi}(s) = E_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s \right].

On the other hand, the expected return for starting in a state s_0 = s, taking an action a_0 = a, and following the policy afterwards gives the on-policy action-value function,

Q^{\pi}(s, a) = E_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s, a_0 = a \right].

The value functions obey the Bellman equations, which can be solved self-consistently; for the action-value function,

Q^{\pi}(s, a) = E_{s'}\left[ r(s, a) + \gamma\, E_{a' \sim \pi}\left[ Q^{\pi}(s', a') \right] \right].

Sometimes yet another function, combining both types of value function, is defined:

A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s).

This is called the advantage function and gives a measure of the advantage of taking a specific action a in state s over selecting the action at random according to \pi(\cdot|s).
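In practice the discounted return is accumulated backwards over a recorded trajectory, and the sampled return can stand in for Q(s_t, a_t) when estimating the advantage. A minimal sketch (our own illustration):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """R_t = r_t + gamma * R_{t+1}: the discounted return at every timestep,
    computed in a single backwards pass over one finite trajectory."""
    R, out = 0.0, []
    for r in reversed(rewards):
        R = r + gamma * R
        out.append(R)
    return np.array(out[::-1])

def advantage(returns, values):
    """A_t = Q(s_t, a_t) - V(s_t), with the sampled return as the Q estimate."""
    return returns - values

rets = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
# rets -> [1.5, 1.0, 2.0]
```

Working backwards avoids re-summing the tail of the trajectory at every timestep.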
In recent years, several different RL algorithms have been proposed, among which the leading contenders are deep Q-learning [8], vanilla policy gradient methods [9], and trust region natural policy gradient methods [44,45]. In the modern scenario, a successful RL method is expected to scale to large models and parallel implementations, to be data efficient and robust, and to succeed on a variety of problems without much hyper-parameter tuning. While Q-learning methods are known for their efficiency in numerous discrete-action environments, such as various Atari games [42,43], they fail on many simple problems, and may suffer from the lack of guarantees for an accurate value function, the intractability resulting from uncertain (stochastic) state information, and the complexity arising from continuous states and actions [45]. Vanilla policy gradient methods [9] are known for better convergence properties, are effective in high-dimensional or continuous action spaces, and can learn stochastic policies. However, they often converge to a local rather than a global optimum, and have poor data efficiency and robustness, often leading to high variance in policy updates. Trust region policy optimization (TRPO) [44] is relatively complicated, and is not compatible with architectures that include noise (such as dropout) or parameter sharing between the policy and value function, or with auxiliary tasks. Proximal policy optimization (PPO) [45] is the latest inclusion in the list; it has been shown to yield similar or better performance and robustness than TRPO, with much easier implementation and parallelization and faster optimization of the policy parameters with stochastic gradient methods, and it is naturally compatible with noisy environments.
Policy gradient methods work by optimizing the objective (loss) function

L^{PG}(\theta) = \hat{E}_t\left[ \log \pi_\theta(a_t|s_t)\, \hat{A}_t \right],

where \pi_\theta is a stochastic policy, \hat{A}_t is an estimator of the advantage function at timestep t, and \hat{E}_t is the empirical average over a finite batch of samples. In TRPO methods, the objective function (the "surrogate" objective) contains the probability ratio r_t(\theta) = \pi_\theta(a_t|s_t)/\pi_{\theta_{\rm old}}(a_t|s_t) of the new (\pi_\theta) and old (\pi_{\theta_{\rm old}}) policies instead of \log \pi_\theta,

L^{TRPO}(\theta) = \hat{E}_t\left[ r_t(\theta)\, \hat{A}_t \right],

under the constraint that the average KL-divergence between the new and old policies is less than a small cutoff \delta:

\hat{E}_t\left[ \mathrm{KL}\left( \pi_{\theta_{\rm old}}(\cdot|s_t),\, \pi_\theta(\cdot|s_t) \right) \right] \le \delta.

Without the constraint, maximizing the objective would lead to an excessively large policy update [44]. PPO replaces the constraint by clipping the policy update, with a hyper-parameter \epsilon \sim 0.2, and optimizing the objective function with a stochastic gradient method [45]:

L^{CLIP}(\theta) = \hat{E}_t\left[ \min\left( r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t \right) \right].

Q-learning, on the contrary, optimizes a value function Q_\theta(s, a) using the Bellman equation, as given in Eq. S11. Lately, algorithms that combine policy gradient methods with Q-learning have been proposed and found to sidestep many of the drawbacks of the traditional approaches. These are known as actor-critic methods, in which there are two networks, an actor and a critic. While the actor, based on a policy network, takes an action on the environment, the critic network receives the reward from the environment and makes useful suggestions to the actor by updating the parameters of the action-value function. In this paper, we have used a DRL agent that combines PPO with the advantage actor-critic (A2C) approach [46], where the advantage function is calculated as the difference between the Q-value for action a_t at state s_t and the average value of the state, \hat{A}_t(s_t, a_t) = Q(s_t, a_t) - V(s_t), as discussed above. The A2C framework allows synchronous training of multiple parallel worker environments simultaneously, which enables faster training. The basic workflow of A2C is shown in Fig. S2.
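The clipped surrogate is straightforward to write down explicitly; a NumPy sketch follows (our own illustration, with the sign flipped so that the objective can be minimised by gradient descent):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    """-E[min(r * A, clip(r, 1-eps, 1+eps) * A)] with r = pi_new / pi_old,
    computed from log-probabilities for numerical stability."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return -np.mean(np.minimum(unclipped, clipped))

# a 4x more probable action with positive advantage: the ratio is clipped
# at 1 + eps, so the objective stops growing however large the ratio gets
loss = ppo_clip_loss(np.log([1.0]), np.log([0.25]), np.array([1.0]))
```

The min with the clipped term makes the objective a pessimistic bound, removing the incentive for updates that move the ratio outside [1−ε, 1+ε].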

WEAK CONTINUOUS MEASUREMENT MODEL
We have used a weak continuous measurement of $\hat{x}^2$ described using a master equation for the system density operator. In this section we derive this equation using a continuous limit for a sequence of weak measurements at random times.
A single weak measurement of a Hermitian operator $\hat{A}$ is described using an operation on the state of the system that gives the post-measurement state conditioned on a measurement result $z \in \mathbb{R}$,
$$\rho_z = \frac{\hat{\Upsilon}(z)\,\rho\,\hat{\Upsilon}^\dagger(z)}{P(z)}, \qquad \hat{\Upsilon}(z) = (2\pi\sigma)^{-1/4}\exp\left[-\frac{(z-g\hat{A})^2}{4\sigma}\right],$$
where $P(z) = \mathrm{Tr}[\hat{\Upsilon}^\dagger(z)\hat{\Upsilon}(z)\rho]$ is the probability density of the measurement result, $[\hat{A},\hat{\Upsilon}(z)]=0$, and $g$ is the gain in inverse units to $\hat{A}$. The measurement result, $z$, is a classical random variable conditioned on the state of the measured system. The mean and variance of $z$ are given by
$$\langle z \rangle = g\langle \hat{A} \rangle, \qquad \langle (\Delta z)^2 \rangle = \sigma + g^2\langle (\Delta\hat{A})^2 \rangle,$$
where $\Delta\hat{A} = \hat{A} - \langle\hat{A}\rangle$ with the quantum averages defined by $\langle\hat{A}\rangle = \mathrm{Tr}[\hat{A}\rho]$. It is clear that $\sigma$ can be interpreted as the measurement error over and above any intrinsic quantum uncertainty. The unconditional state of the system post measurement is then given by
$$\rho' = \int dz\, \hat{\Upsilon}(z)\,\rho\,\hat{\Upsilon}^\dagger(z).$$
We now move to a weak continuous measurement by assuming each of these single weak measurements occurs at random, Poisson-distributed times. At each time, the device that performs the measurement is discarded after the measurement result is recorded, and a new measurement device is supplied, ready for the subsequent measurement. This builds in the Markov condition for the irreversible measurement process. The master equation describing the unconditional dynamics of the entire system is given by [59]
$$\dot{\rho} = -\frac{i}{\hbar}[\hat{H},\rho] + \gamma\left(\int dz\, \hat{\Upsilon}(z)\,\rho\,\hat{\Upsilon}^\dagger(z) - \rho\right),$$
where $\gamma$ is the rate of measurements. We now take the limit in which the rate of measurements satisfies $\gamma\,\delta t \gg 1$, where $\delta t$ is the time scale of the free dynamics of the system, and the uncertainty of the measurements satisfies $\sigma \gg g^2$, such that the ratio $\gamma/\sigma$ is constant. This is the continuous weak measurement limit. The unconditional master equation then becomes
$$\dot{\rho} = -\frac{i}{\hbar}[\hat{H},\rho] + \Gamma\,\mathcal{D}[\hat{A}]\rho, \qquad \mathcal{D}[\hat{A}]\rho = \hat{A}\rho\hat{A} - \tfrac{1}{2}\{\hat{A}^2,\rho\},$$
with $\Gamma = \gamma g^2/\sigma$. For Hermitian $\hat{A}$ the dissipator is the double commutator $\mathcal{D}[\hat{A}]\rho = -\tfrac{1}{2}[\hat{A},[\hat{A},\rho]]$; the effect of this term is to destroy coherence in the diagonal basis of $\hat{A}$ and to add diffusive noise to incommensurate variables $\hat{B}$ for which $[\hat{A},\hat{B}] \neq 0$. A good measurement requires that $\Gamma$ is large.
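The defining properties of the Gaussian measurement operator above (POVM completeness, and the stated mean and variance of $z$) can be checked numerically. The sketch below works in the eigenbasis of a toy three-level observable; the spectrum and state populations are arbitrary illustrative choices, not related to the double-well system:

```python
import numpy as np

# Spectrum of a toy Hermitian observable A. Since [A, Y(z)] = 0,
# everything can be evaluated in the eigenbasis of A.
A_diag = np.array([-1.0, 0.0, 2.0])
g, sigma = 1.0, 1.0

def kraus(z):
    """Gaussian measurement operator Y(z), diagonal in the A basis."""
    return (2 * np.pi * sigma) ** -0.25 * np.exp(-(z - g * A_diag) ** 2 / (4 * sigma))

zs = np.linspace(-30.0, 30.0, 20001)
dz = zs[1] - zs[0]
K = kraus(zs[:, None])                     # shape (num z values, dim)

# POVM completeness: the integral of Y(z)^dag Y(z) dz is the identity.
completeness = (K ** 2).sum(axis=0) * dz

# Moments of z for a toy state with equal populations in the A basis.
pop = np.ones(3) / 3.0                     # diagonal of rho
P = (K ** 2) @ pop                         # P(z) = Tr[Y^dag(z) Y(z) rho]
mean_z = (zs * P).sum() * dz               # -> g <A>
var_z = ((zs - mean_z) ** 2 * P).sum() * dz  # -> sigma + g^2 <(dA)^2>
```

Here `mean_z` recovers $g\langle\hat{A}\rangle = 1/3$ and `var_z` recovers $\sigma + g^2\langle(\Delta\hat{A})^2\rangle = 23/9$, confirming that $\sigma$ is excess noise on top of the quantum uncertainty.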

We define an observed classical stochastic process in the continuum limit as follows. Denote a complete record of the measurement results, indexed by the random measurement times, as $z(t)$. Now define the observed process $y(t)$ by the stochastic differential equation
$$dy(t) = z(t)\,dN(t),$$
where $dN(t)^2 = dN(t)$ and $\mathrm{E}[dN(t)] = \gamma\,dt$, while $z(t)$ is Gaussian distributed as above. In the continuum limit this point process can be approximated by a diffusion process,
$$dy(t) = \gamma g \langle \hat{A} \rangle_c\,dt + \sqrt{\gamma\sigma}\,dW(t),$$
where the subscript $c$ indicates that this average must be computed from the state conditioned on the entire history of $y(t)$ up to time $t$, and $dW(t)$ is the Wiener process. Note that over an integration time $T$ the signal-to-noise ratio scales as $\sqrt{\Gamma T}$, so for a good measurement, in which $\Gamma$ is large, the signal-to-noise ratio also becomes large.
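A minimal simulation of the diffusive record illustrates this signal-to-noise scaling: the time-averaged current estimates $\langle\hat{A}\rangle_c$ with a standard error of $1/\sqrt{\Gamma T}$. The sketch assumes the normalization written above and, for simplicity only, a constant conditional mean:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_record(A_mean, Gamma, T=20.0, dt=1e-3, g=1.0, sigma=1.0):
    """Euler simulation of the diffusive record
        dy = gamma * g * <A>_c dt + sqrt(gamma * sigma) dW,
    with gamma = Gamma * sigma / g**2 so that Gamma = gamma g^2 / sigma.
    Returns the estimate of <A>_c from the time-averaged current; its
    standard error is 1/sqrt(Gamma * T)."""
    gamma = Gamma * sigma / g ** 2
    n = int(round(T / dt))
    dW = rng.normal(0.0, np.sqrt(dt), n)   # Wiener increments
    dy = gamma * g * A_mean * dt + np.sqrt(gamma * sigma) * dW
    return dy.sum() / (gamma * g * T)      # unbiased estimator of <A>_c
```

Increasing $\Gamma$ by a factor of 100 shrinks the estimation error tenfold, which is the sense in which a large $\Gamma$ gives a "good" measurement.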
In order to evaluate the conditional averages in the observed process we need a continuous limit of the conditional state at each measurement. The result is the stochastic master equation [4],
$$d\rho_c = -\frac{i}{\hbar}[\hat{H},\rho_c]\,dt + \Gamma\,\mathcal{D}[\hat{A}]\rho_c\,dt + \sqrt{\Gamma}\,\mathcal{H}[\hat{A}]\rho_c\,dW(t),$$
where
$$\mathcal{D}[\hat{A}]\rho = \hat{A}\rho\hat{A} - \tfrac{1}{2}\{\hat{A}^2,\rho\}$$
and
$$\mathcal{H}[\hat{A}]\rho = \hat{A}\rho + \rho\hat{A} - 2\langle\hat{A}\rangle_c\,\rho.$$
This equation is nonlinear, which is not surprising: for a conditional process the future must depend on the entire past history of measurement results.
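A single Euler-Maruyama step of this stochastic master equation can be sketched as follows. Note that the trace is preserved exactly, since for Hermitian $\hat{A}$ both the $\mathcal{D}$ and $\mathcal{H}$ superoperators are traceless on a unit-trace state (a toy qutrit example, not the double-well system of the paper):

```python
import numpy as np

def D(A, rho):
    """Dissipator D[A]rho for Hermitian A."""
    return A @ rho @ A - 0.5 * (A @ A @ rho + rho @ A @ A)

def Hsup(A, rho):
    """Measurement superoperator H[A]rho = A rho + rho A - 2 <A>_c rho."""
    return A @ rho + rho @ A - 2.0 * np.trace(A @ rho) * rho

def sme_step(rho, H, A, Gamma, dt, dW):
    """One Euler-Maruyama step of the conditional master equation."""
    drho = (-1j * (H @ rho - rho @ H) + Gamma * D(A, rho)) * dt \
           + np.sqrt(Gamma) * Hsup(A, rho) * dW
    return rho + drho

# Toy check on a random qutrit pure state.
rng = np.random.default_rng(0)
psi = rng.normal(size=3) + 1j * rng.normal(size=3)
psi /= np.linalg.norm(psi)
rho = np.outer(psi, psi.conj())
H = np.diag([0.0, 1.0, 2.0]).astype(complex)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=complex)
rho1 = sme_step(rho, H, A, Gamma=0.5, dt=1e-3, dW=0.02)
```

In practice a solver such as QuTiP's `smesolve` integrates this equation with higher-order stochastic schemes; the step above only illustrates the structure of the drift and noise terms.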

MARKOVIAN QUANTUM FEEDBACK
Let us consider the case when we remove the DRL agent altogether and set the feedback Hamiltonian to be proportional to the measurement current, $dy(t)/dt$ (Eq. S25). The effect of feedback on $\rho_c$ can be introduced through a super-operator $\mathcal{K}$ such that $\mathcal{K}\rho_c = -i[F, \rho_c]$ for some Hermitian feedback action operator $F$. With Eq. (S26), we obtain the modified closed-loop stochastic master equation under feedback $F$. Averaging over the noise, we get the Wiseman-Milburn unconditional master equation with Markovian feedback,
$$\dot{\rho} = -\frac{i}{\hbar}[\hat{H},\rho] - i[F, \hat{A}\rho + \rho\hat{A}] + \Gamma\,\mathcal{D}[\hat{A}]\rho + \frac{1}{\Gamma}\,\mathcal{D}[F]\rho,$$
where $\rho = \mathrm{E}(\rho_f)$, the average over $\rho_f$. Looking at the unconditional master equation (S31), one can identify the feedback term, $-i[F, \hat{A}\rho + \rho\hat{A}]$, which modifies the base Hamiltonian evolution generated by $\hat{H}$, and the term $\Gamma\mathcal{D}[\hat{A}]\rho$, which is the decoherence introduced by the continuous measurement of $\hat{A}$. However, we now observe a new source of decoherence, $\mathcal{D}[F]\rho$, introduced by the feedback itself [6]. The competition between these two sources of decoherence determines an optimal value of $\Gamma$.
For small values of $\Gamma$, the measurement current $dy/dt$ is very noisy and the feedback decoherence is high, while the continuous-measurement decoherence is low. For large values of $\Gamma$, the feedback decoherence is low while the measurement decoherence is high. This competition carries over to the case studied in the main manuscript, where the feedback action is determined by a DRL agent. We thus observe an optimal measurement rate $\Gamma$ for the learning of the DRL agent, as shown in Figs. 2(a) and S7(a).

THE DRL IN THE STUDY
Using the model given in Eq. (2) of the paper, we created a DRL environment (shown in Fig. 1(a)) using the open-source platform OpenAI-Gym [60], in which QuTiP's [61] stochastic master equation (SME) solver is used to compute the dynamics, see Eq. (6). The environment primarily includes two routines: the quantum dynamics of the particle moving in the DW together with a modulation function to alter those dynamics, and a reward-estimation function based on the observables (measurement results), evaluated step-wise for every interaction made by the agent. Our PPO agent [45] is constructed following the implementation of stable-baselines3 [62] in the A2C setting [46], where both the actor and the critic are modelled with fully connected feed-forward networks of dimensions 512 × 256 × 128, with the first layer shared between the two. For modelling the neural networks we use the open-source ML platform PyTorch [63]. The input to the network is the mean of the measurement currents $I(t)$ from the last four time-steps (which speeds up learning compared with using the instantaneous current), along with the action returned by the network. In each episode of the training process the DRL interacts with the environment 1000 times, in intervals of $\delta t = 0.01$, each time applying an action to the environment. The agent gathers trajectories out to the horizon limit (in our case 4000 steps, i.e., data from 4 episodes) for each environment running in parallel (in our case 8 environments in total), then performs stochastic gradient descent updates of mini-batch size 100 on all the gathered trajectories for the specified number of epochs (in our case 10). We have found that a larger network along with a small learning rate ($10^{-5}$) helps the agent keep the learning process stable during longer simulations.
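The observation fed to the agent, namely the mean of the last four measurement currents together with the previously applied action, can be assembled with a small helper like the following (a schematic reconstruction; the class and method names are ours, not from the paper's code):

```python
from collections import deque
import numpy as np

class ObservationBuffer:
    """Assembles the agent's observation: the mean of the last `window`
    measurement currents I(t), plus the action last applied."""

    def __init__(self, window=4):
        self.currents = deque(maxlen=window)   # keeps only recent currents

    def push(self, current):
        """Record the measurement current from the latest time-step."""
        self.currents.append(float(current))

    def observation(self, last_action):
        """Return [mean of recent currents, last action] for the policy."""
        mean_current = float(np.mean(self.currents)) if self.currents else 0.0
        return np.array([mean_current, float(last_action)])
```

Averaging over a short window smooths the Wiener noise in the record, which is one plausible reason it trains faster than feeding the instantaneous current.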
These simulations require a high truncation of the Hilbert-space dimension (we took $N_{\rm trunc} = 60$), since the process of measurement often drives the state to high Fock numbers (see the supporting media for a demonstration, or the GitHub link [64]). This is especially important for the DRL in the initial stages of learning, when it is trying to gather information about the system by performing random actions. In our numerical experiments we chose $N_{\rm trunc} = 60$, $\Gamma \leq 0.5$, and used 8 parallel vector environments to speed up the processing. After training for a few hundred episodes in these parallel environments, the DRL learns to apply its actions in such a way that the occupation is limited to low Fock states, avoiding errors associated with the truncation limit. In addition, we observe that the efficiency of learning depends on the type of agent. Although traditional DQN methods based on discrete actions work well with simple conditional mean data, we find that they perform very poorly in the presence of noise. Actor-critic methods, especially the newer PPO agent (an evolution of A2C and TRPO), are better suited to a stochastic environment (stochastic data) than a purely value-based algorithm like DQN. In the main manuscript we have chosen the agent-controlled action to be $F(t) \propto (xp + px)$. This action augments the default DW quantum dynamics. It is natural to enquire whether there are other, more effective actions that would improve the learning of the DRL process towards the goal of cooling the DW to its quantum ground state. Interestingly, we find that this form of action appears to be optimally effective, and choosing other forms of action results in weaker DRL efficiency. The action operator is $F(t) = A(t)f$, where $f = xp + px$ is the Hermitian squeezing generator. The reason why this action is so effective in learning and controlling the quantum dynamics of the particle in the double-well potential can be understood from a purely classical perspective.
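The need for a large $N_{\rm trunc}$ can be illustrated by diagonalizing a double well directly in a truncated Fock basis. The sketch below uses an assumed quartic form $V(x) = (x^2 - a^2)^2/4$ as a stand-in for Eq. (2) of the paper (the exact potential and parameters there may differ), and checks that the ground-state energy is converged well before $N_{\rm trunc} = 60$:

```python
import numpy as np

def dw_ground_energy(n_trunc, a2=1.5):
    """Ground-state energy of a quartic double well
        H = p^2/2 + (x^2 - a2)^2 / 4
    diagonalized in a Fock (number) basis truncated at n_trunc.
    The quartic form and a2 = 3/2 are illustrative assumptions."""
    n = np.arange(n_trunc - 1)
    a = np.zeros((n_trunc, n_trunc))
    a[n, n + 1] = np.sqrt(n + 1.0)              # annihilation operator
    x = (a + a.T) / np.sqrt(2.0)                # x = (a + a^dag)/sqrt(2)
    p2 = -((a.T - a) @ (a.T - a)) / 2.0         # p^2, real in this basis
    V = x @ x - a2 * np.eye(n_trunc)
    H = p2 / 2.0 + (V @ V) / 4.0                # real symmetric matrix
    return np.linalg.eigvalsh(H)[0]
```

For the converged ground state only a few tens of basis states matter; the larger truncation is needed because random exploratory actions transiently populate much higher Fock states.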

ROLE OF FEEDBACK HAMILTONIAN
The feedback action $F = A(t_k)f$ generates a classical Hamiltonian flow $(\dot{x},\dot{p}) = (\partial f/\partial p, -\partial f/\partial x)$, which takes the phase-space point $(x_k, p_k)$ at time $t_k$ just before feedback and moves the particle along curves of constant $f$. Fig. S3 shows the phase-space streamlines for three different possible feedback action Hamiltonians. Feedback of the form $f = x^2$ is likely to be very ineffective, as indicated by the streamlines in Fig. S3(a): the flow is always orthogonal to the $x$-axis. Similarly, the streamlines for $f = p^2 - x^2$ push a state starting near the $p$ axis further away from the $x$ axis. On the other hand, for $f = xp + px$ the streamlines push the particle towards the $x$-axis, which is appropriate for approaching the double-well minima.
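The three streamline patterns follow from the classical Hamiltonian vector field $(\dot{x},\dot{p}) = (\partial f/\partial p, -\partial f/\partial x)$ of each generator, which can be tabulated directly (signs correspond to a fixed positive feedback amplitude; the agent can reverse them through $A(t)$):

```python
def flow(f_name, x, p):
    """Classical Hamiltonian vector field (dx/dt, dp/dt) = (df/dp, -df/dx)
    for each candidate feedback generator f."""
    if f_name == "x^2":
        return (0.0 * x, -2.0 * x)    # vertical flow: never moves x
    if f_name == "p^2-x^2":
        return (2.0 * p, 2.0 * x)     # hyperbolic flow about the origin
    if f_name == "xp+px":             # f = 2 x p
        return (2.0 * x, -2.0 * p)    # contracts p: squeezes onto the x-axis
    raise ValueError(f"unknown generator: {f_name}")
```

For `xp+px` the $p$ component decays exponentially along the flow, flattening the state onto the $x$-axis where the double-well minima lie, which matches the streamline picture of Fig. S3.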
Numerically, we can identify the best of the three controls by using each choice of feedback with Bayesian control on conditional data, discussed in detail below.

EFFECT OF INPUT STATE
It is found that the choice of initial state for the time evolution via the stochastic master equation has a profound effect on how well the DRL agent can achieve its goal of bringing the system to the ground state. The maximum (blue) and mean (orange) fidelities of the trained agents for different initial states are shown in Fig. S4. When the initial state $\rho(0)$ is a thermal or coherent state, the DRL converges to an average fidelity of about 60%. However, if we use a small cat state or the ground state of the DW itself, the agent is able to achieve a mean trained fidelity of over 80% with noisy measurement data. We achieve similarly high performance if we start with a thermal state projected onto even parity.

EFFECT OF MEASUREMENT EFFICIENCY AND DECOHERENCE
We benchmark the mean fidelity achieved as a function of the measurement efficiency $\eta$ in Fig. S5(a). We find that the DRL exhibits robust control for $\eta \geq 50\%$. In Fig. S5(b), the learning of the DRL agent is compared with the case without environment damping (blue), in the presence of different collapse operators: (i) $\sqrt{\gamma}\,a$ and (ii) $\sqrt{\gamma}\,a^\dagger a$. Further improvement of the results could be achieved by implementing the suggestions mentioned at the end of the main manuscript.

GENERALIZABILITY OF THE TRAINED MODEL
Since the DRL is trained explicitly on the measurement data, which varies with the double-well parameters (for example, the position of the well minima), the measurement strength ($\Gamma$) and the measurement time interval ($\delta t$), the generalizability of a trained model is poor across any significant change of those parameters (see Fig. S6). This loss of generality arises because the agent is trained explicitly on the measurement records, which change as we change the parameters of the double well; the agent therefore needs to be re-trained after any significant change in those parameters. However, this is not the case for the initial state $\rho(0)$. Although the choice of the type of $\rho(0)$ has a significant effect on the mean fidelity achieved by the agent, we found that a model trained with a given $\rho(0)$ (say, a coherent state) yields exactly the same behavior for others (say, an even-thermal state) and vice versa (see Fig. S4). This means that choosing a better initial prediction (or a basis closer to the target state) leads to better net fidelity, without the need to retrain the model for different initial states.

FIDELITY AS THE REWARD FUNCTION
It is possible to choose fidelity as the reward function instead of the measurement current discussed in the main text of the paper, as shown in Fig. S7. However, fidelity is not directly accessible in real time in a laboratory experiment, and the following analysis is included for the sake of clarity and helpful comparison. When the DRL agent uses fidelity as the reward function, we see that the learning behaviour shown in Fig. S7(a), as a function of the measurement rate $\Gamma$, essentially mirrors the behaviour seen when the measurement current is used as the reward function (shown in Fig. 2(a) of the main text). In Fig. S7(b) we show the time evolution of the episodic mean reward $\bar{F}(t)$ during training of the agent for $\Gamma = 0.1$ [light (dark) blue includes (averages over) noise]. In Fig. S7(c) the fidelity variation over a single random episode of the training agent is shown.
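For reference, the fidelity reward itself is just the overlap of the conditioned density matrix with the pure target state, which is trivial to evaluate in simulation (though, as noted, inaccessible in a real-time experiment):

```python
import numpy as np

def fidelity_reward(rho, psi_target):
    """Overlap fidelity F = <psi|rho|psi> with a pure target state,
    usable as a simulation-only reward signal."""
    psi = np.asarray(psi_target, dtype=complex)
    psi = psi / np.linalg.norm(psi)           # normalize the target
    return float(np.real(psi.conj() @ rho @ psi))
```

For a pure $\rho = |\psi\rangle\langle\psi|$ matching the target this returns 1; for a state orthogonal to the target it returns 0.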

BAYESIAN VS. DRL CONTROL
Here we benchmark our results against the state-based Bayesian feedback protocol (where the feedback is based on an estimate of the state) proposed by Doherty et al. [49,51]. In the context of the present work, the protocol simplifies to feedback of the form $F(t) = -\left(\langle x^2 \rangle_c(t) - \tfrac{3}{2}\right)(xp + px)$, where $\langle x^2 \rangle_c(t)$ denotes the conditional mean of the observable $x^2$. However, $\langle x^2 \rangle_c(t)$ is not a quantity directly accessible in real experiments, and the above feedback condition should then be driven by the measurement current $I(t)$. We find that when the Bayesian feedback is driven by $I(t)$, it shows very little control over the dynamics. We performed a numerical simulation with 1000 copies of the system (an ensemble), all evolving under feedback based on the mean of the measurement currents during each time step. The performance of the two approaches is compared in Table S1.

TABLE S1. Comparison of the Bayesian and DRL control towards finding the ground state of the double-well potential. We list the mean fidelities achieved when the controllers are driven either by the conditional mean $\langle x^2 \rangle_c(t)$ [which is not readily available during a live experiment], or by the measurement current $I(t)$, which includes measurement noise. In addition we include cases with decoherence of two forms: amplitude damping or dephasing, each at a rate $\gamma = 0.1$. The Bayesian control for the measurement current is done over an ensemble of 1000 trajectories. We color the higher-fidelity result blue for clarity.

Feedback basis | Bayesian control (% fidelity) | DRL control (% fidelity)

We see from Table S1 that the Bayesian control performs well only when it is provided with conditional mean data. When driven by measurement currents, however, it shows no control at all. Even when used with an ensemble of 1000 identical systems, it does not achieve control beyond 50% fidelity. We found from our analysis that it is very important to provide the Bayesian controller with a very accurate estimate of the mean currents, one that does not deviate significantly from the conditional means. A single bad data point often throws the system into a non-recoverable state, resulting in much lower fidelities. The DRL agent, by contrast, is able to recover from such bad data points.
It is also seen that when the Bayesian and the DRL controllers are both driven by conditional mean data, they yield comparable performance. It is interesting to observe the significant drop in achievable fidelity in the presence of environment damping $\sqrt{\gamma}\,a$, but not in the presence of environment dephasing $\sqrt{\gamma}\,a^\dagger a$. The drop in the performance of the DRL agent in the presence of damping can thus be attributed to the degradation of the target state of the DW potential under $\sqrt{\gamma}\,a$ decoherence.
Finally, the preference for the $xp + px$ feedback over $x^2$ and $p^2 - x^2$, discussed above, can be demonstrated using Bayesian feedback with conditional mean data. We found that feedback of the form $x^2$ could never control the dynamics, while $p^2 - x^2$ achieves a mean fidelity of only $\sim 60\%$. The mean fidelity of 85% shown in Table S1 could be obtained only with the $xp + px$ feedback.
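The scalar part of the Doherty-style feedback law, $-(\langle x^2\rangle_c - 3/2)$, is straightforward to evaluate from the conditional state. The sketch below does so in a truncated Fock basis (the basis size and the vacuum test state are illustrative choices):

```python
import numpy as np

def bayesian_feedback_amplitude(rho, x, target=1.5):
    """Scalar prefactor of the state-based feedback
        F(t) = -(<x^2>_c - 3/2) (xp + px),
    computed from the conditional density matrix rho. `x` is the
    position operator in the chosen (truncated Fock) basis."""
    x2 = np.real(np.trace(x @ x @ rho))   # conditional mean <x^2>_c
    return -(x2 - target)

# Position operator in a small Fock basis, x = (a + a^dag)/sqrt(2).
N = 8
n = np.arange(N - 1)
a = np.zeros((N, N))
a[n, n + 1] = np.sqrt(n + 1.0)
x = (a + a.T) / np.sqrt(2.0)

rho_vac = np.zeros((N, N))
rho_vac[0, 0] = 1.0                       # vacuum state: <x^2> = 1/2
amp = bayesian_feedback_amplitude(rho_vac, x)
```

For the vacuum, $\langle x^2\rangle_c = 1/2 < 3/2$, so the returned amplitude is positive: the controller applies a kick that pushes the state outward toward the well minima.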