No-Collapse Accurate Quantum Feedback Control via Conditional State Tomography

The effectiveness of measurement-based feedback control (MBFC) protocols is hampered by measurement noise, which limits the ability to accurately infer the underlying dynamics of a quantum system from noisy continuous measurement records and thus to determine an accurate control strategy. To circumvent this limitation, this work explores a real-time stochastic state estimation approach that enables noise-free monitoring of the conditional dynamics, including the full density matrix of the quantum system, using noisy measurement records within a single quantum trajectory -- a method we name 'conditional state tomography'. This, in turn, enables the development of precise MBFC strategies that effectively control quantum systems by mitigating the constraints imposed by measurement noise, with potential applications in a variety of feedback quantum control scenarios. The approach is particularly useful for reinforcement-learning (RL)-based control, where the RL agent can be trained with arbitrary conditional averages of observables, and/or the full density matrix, as input (observation) to quickly and accurately learn control strategies.

Future advancements in quantum technologies will hinge on the ability to effectively manipulate quantum systems by controlling their states through reliable protocols and feedback strategies [1-4]. Broadly speaking, pure control strategies entail open-loop, pulse-based controls for quantum circuits, and such problems have been successfully tackled using conventional optimal control tools like gradient-ascent pulse engineering (GRAPE) [5-8]. These methods are fundamentally based on a differentiable model of the quantum dynamics and cannot be extended to feedback-based controls [5,9]. For controls employing continuous measurement, non-trivial strategies must be identified based on the conditional dynamics. These measurement-based feedback control (MBFC) techniques are considered pivotal for achieving real-time quantum control in laboratory experiments [10-18]. Reinforcement learning (RL) has recently proven to be a powerful new ansatz for such control tasks; in the quantum domain, it was first demonstrated for quantum error correction [19] and for the optimization of quantum phase transitions in 2018 [20]. Following these initial studies, RL has been applied to a variety of non-intuitive problems, including quantum control [8,21-24], state transfer [25,26], quantum state preparation and engineering [27-30], and quantum error correction [31]. Very recently, RL control of real laboratory experiments on quantum systems has become a reality [32,33].
At a fundamental level, MBFC approaches based on continuous measurements suffer from two primary limitations. First, such approaches often fail to control the dynamics beyond a specific limit set by the signal-to-noise ratio of the measured quantity, which is degraded by intrinsic and unavoidable measurement-induced noise. The noise level grows as 1/√(κδt), where κ denotes the measurement rate and δt is the measurement time interval; since δt sets the variance of the Wiener noise increments and δt ≪ 1, the actual signal can be well hidden in a sea of random noise [24]. This makes it practically impossible for MBFC to find suitable control strategies that achieve the desired dynamics. Second, the continuous measurement process naturally leads to the so-called measurement backaction, which makes MBFC schemes highly non-intuitive and nontrivial in general [19,24,27,34,35].
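A quick numerical illustration of this scaling (the record convention I(t) = ⟨Â⟩c + dξ/(2√(ηκ)dt) and the constant stand-in signal are illustrative assumptions, not the paper's):

```python
import numpy as np

# Per-sample noise in a simulated measurement record scales as 1/sqrt(kappa*dt),
# so the raw record hides the signal; averaging over a long window T shrinks
# the residual error to ~ 1/sqrt(kappa*T). All values below are illustrative.
rng = np.random.default_rng(0)
kappa, eta, dt, T = 1.0, 1.0, 1e-3, 10.0
n = int(T / dt)
signal = 0.5                                  # stand-in for a constant <A>_c
record = signal + rng.standard_normal(n) / (2 * np.sqrt(eta * kappa * dt))

per_sample_noise = record.std()               # ~ 1/(2*sqrt(kappa*dt)) ~ 15.8
averaged_error = abs(record.mean() - signal)  # typically ~ 1/(2*sqrt(kappa*T))
```

The per-sample noise dwarfs the signal by more than an order of magnitude, which is the regime in which direct feedback on the raw record fails.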
In this Letter, we pursue this direction and propose an efficient MBFC protocol that can precisely control the dynamics of a quantum system of interest based on noisy, continuous, real-time measurement data. This is made possible by developing a measurement-based stochastic estimator that extracts the real-time state of the measured system noiselessly and without collapse, thereby allowing the system dynamics to be controlled in any desired way. Unlike the usual method of state estimation with continuous measurement, which requires thousands of trajectories from as many copies of the quantum system, this method estimates the conditional state of the system from a single trajectory, a method we term 'conditional state tomography'. We demonstrate the efficiency of the scheme by applying it to control the dynamics of linear and nonlinear quantum systems with state-based, or conditional, feedback. We also show the usefulness of the scheme for cases where control laws can be derived from conditional moments (assuming perfect extraction of the measured signal from the noisy data, which is typically not possible in realistic experiments), illustrated with the preparation of symmetric and antisymmetric entangled states of two qubits. Moreover, our scheme is adaptable to real-time feedback with RL controllers, allowing optimal and efficient training and control.
The protocol is shown schematically in Fig. 1. It consists of two operational stages, the estimation stage and the control stage. In the estimation stage, the quantum system to be controlled (shown on the left), with an unknown initial state (density matrix ρ(0)), is measured using a (weak) continuous measurement approach. The noisy current streams from the measurement are then used to drive a stochastic estimator (shown on the right), a computational model of the measured quantum system with the same Hamiltonian but with an arbitrary random initial quantum state ρe(0).

Figure 1. The schematic of the proposed protocol. (Top) The estimation stage: a physical quantum system (left) described by Hamiltonian Ĥ0 is continuously monitored to probe the observable Â, and the noisy measurement outcomes are fed to the estimator (right), a simulator based on the mathematical model of the real physical system running on a classical processor, e.g., an FPGA (Field-Programmable Gate Array). The state of the physical system (estimator) at time t is described by ρ(t) (ρe(t)); the two become equal for t ≥ tc. (Bottom) The control stage: a controller applies accurate feedback F(t) to both the physical system and the estimator as a function of the estimated noiseless conditional signal ⟨Â(t)⟩c obtained through the estimator.

The estimator can track the dynamics of the measured quantum system in real time after a transient, as its conditional state converges to that of the physical quantum system. In the control stage, a controller mediates between the real system and the estimator, applying feedback to both based on the conditional dynamics of the latter while continuing to drive them with the real-time measurement data of the physical quantum system.
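This two-stage idea can be sketched numerically for a qubit measured in σz, assuming standard diffusive-measurement conventions; the rates, the record convention, and the innovation form 2√(ηκ)(dQ − ⟨Â⟩e dt) are our assumptions here, not necessarily those of the protocol:

```python
import numpy as np

# Sketch of the estimation stage for a qubit (illustrative conventions): the
# 'physical' system generates a noisy record dQ, and an estimator started in a
# different state is driven by the innovation
# dxi_e = 2*sqrt(eta*kappa)*(dQ - <A>_e dt); their overlap should grow.
rng = np.random.default_rng(2)
sx = np.array([[0, 1], [1, 0]], complex)
sz = np.array([[1, 0], [0, -1]], complex)

def sme_step(rho, H0, A, kappa, eta, dt, noise):
    """One Euler-Maruyama step of the diffusive SME, then renormalize."""
    D = A @ rho @ A - 0.5 * (A @ A @ rho + rho @ A @ A)        # backaction
    Hs = A @ rho + rho @ A - 2 * np.trace(A @ rho).real * rho  # diffusion
    rho = rho + (-1j * (H0 @ rho - rho @ H0) + kappa * D) * dt \
              + np.sqrt(eta * kappa) * Hs * noise
    rho = 0.5 * (rho + rho.conj().T)
    return rho / np.trace(rho).real

kappa, eta, dt = 1.0, 1.0, 1e-4
H0 = 0.5 * np.pi * sx                       # Rabi drive, so [H0, sz] != 0
rho = np.array([[1, 0], [0, 0]], complex)   # physical system: |0><0|
rho_e = 0.5 * np.eye(2, dtype=complex)      # estimator: maximally mixed
overlap0 = np.trace(rho @ rho_e).real       # 0.5 initially
for _ in range(30000):                      # total time t = 3
    dxi = np.sqrt(dt) * rng.standard_normal()
    dQ = np.trace(sz @ rho).real * dt + dxi / (2 * np.sqrt(eta * kappa))
    rho = sme_step(rho, H0, sz, kappa, eta, dt, dxi)
    dxi_e = 2 * np.sqrt(eta * kappa) * (dQ - np.trace(sz @ rho_e).real * dt)
    rho_e = sme_step(rho_e, H0, sz, kappa, eta, dt, dxi_e)
overlap = np.trace(rho @ rho_e).real        # approaches 1 at convergence
```

With lower η or κ the same sketch still converges, only more slowly, mirroring the dependence of the convergence time on these parameters discussed in the text.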
We first describe the theory behind the measurement-based stochastic estimator and the feedback control method. Suppose the laboratory quantum system (top left in Fig. 1), with Hamiltonian Ĥ0, is continuously measured with a weak probe of the measurement operator Â (suitably scaled to be dimensionless). Such a continuous measurement process leads to conditional stochastic dynamics of the system density matrix ρc(t), described by the so-called quantum stochastic master equation (SME),

dρc(t) = −i[Ĥ0, ρc(t)] dt + κ D[Â]ρc(t) dt + √(ηκ) H[Â]ρc(t) dξ(t).     (1)

Here, κ is the measurement rate (the rate at which information is extracted by the detector), η is the measurement efficiency of the detector, and dξ(t) represents an instantaneous random Wiener noise increment (white noise with zero mean and variance dt, where dt is the time interval between successive measurements). D[Â] and H[Â] are the superoperators describing, respectively, the backaction and diffusion terms in the SME [1,3]; see the Supplemental Material for more details. Probing the system with a weakly coupled meter, which in effect has a broad probability distribution over the quantum state, leads to noisy measurement records given by

dQ(t) = ⟨Â(t)⟩c dt + dξ(t)/(2√(ηκ)).     (2)

The first term on the right-hand side denotes the conditional mean of the measurement operator (the signal), and the second term represents the contribution of the measurement noise, which depends on η and κ.
The estimator is a model quantum system with the same Hamiltonian Ĥ0, as depicted in Fig. 1 (top right), initialized in an arbitrary quantum state ρe(0) and driven by the noisy measurement current of the real laboratory quantum system, dQ(t) (Eq. 2). The dynamics of the estimator is described by the modified SME [1,35,36],

dρec(t) = −i[Ĥ0, ρec(t)] dt + κ D[Â]ρec(t) dt + √(ηκ) H[Â]ρec(t) dξe(t),  with  dξe(t) = 2√(ηκ)[dQ(t) − ⟨Â(t)⟩ec dt],     (3)

where ρec(t) denotes the conditional density matrix of the estimator, independent of the real system, and ⟨Â(t)⟩ec = Tr[ρec Â] is the conditional mean calculated for the estimator at time t. In essence, the estimator dynamics is driven by the noisy real-time measurement currents from the meter and by the conditional means of the estimator itself. It can be shown that the overlap between the states ρ(t) and ρe(t) evolving under Eqs. S15 and (3) increases monotonically until it reaches unity, δTr[ρρe](t) ∼ Tr[√ρ (Â+⟨Â⟩) ρe (Â+⟨Â⟩) √ρ] δt. Thus, provided the estimator receives a sufficient amount of measurement data, the convergence of its state to that of the physical quantum system, i.e., ρe(t) ∼ ρ(t), can always be guaranteed, except when [Ĥ0, Â] = 0. In the latter case, the observable Â is a constant of motion, and its continuous measurement provides no information about the state of the system, causing ρe(t) to remain in one of the eigenstates of Ĥ0. Convergence can be ensured regardless of the values of η and κ, although the convergence time tf is longer when η and κ are lower (see the Supplemental Material for a qubit example illustrating the protocol). Once this estimation stage is complete, the second stage of the MBFC scheme, namely the control stage, is initiated, as shown in Fig.
1 (bottom).

We first apply the scheme to the dynamic feedback cooling of a linear quantum harmonic oscillator and demonstrate how accurate state-based feedback control becomes possible. The Hamiltonian of the linear quantum harmonic oscillator is Ĥ0 = p̂²/2m + mω²x̂²/2, where x̂ and p̂ are the position and momentum operators, respectively, m is the mass of the oscillator, and ω denotes the frequency of oscillation. Consider measuring the position operator, i.e., Â = x̂. In Fig. 2(a), the instantaneous fidelity F(t) between the states of the real system and the estimator is shown during the estimation stage of the control protocol. As reflected in the monotonically improving fidelity, the estimator starts mimicking the dynamics of the measured quantum system; this is also shown in the inset of Fig. 2(a), where the evolution of the conditional means of x̂ for the measured system and the estimator are compared. After the estimation stage is completed, which typically takes less than κ⁻¹, the control stage is initialised. We now use a state-based control strategy given by Ĥ(t) = Ĥ0 − ⟨x(t)⟩c p̂, where ⟨x(t)⟩c denotes the conditional mean of x̂ at time t. Such feedback represents a damping control scheme: the controller applies feedback based on the conditional mean of the position operator, effectively reducing the momentum as ⟨x(t)⟩c → 0. The feedback is applied to both the measured system and the estimator based on the noise-free conditional mean of the position extracted by the estimator. The results are shown in Fig. 2(b): the proposed control protocol leads to fast and accurate dynamic cooling of the quantum harmonic oscillator. The inset of Fig. 2(b) shows that the control protocol keeps the quantum state at the dynamical minimum for an arbitrarily long time, which is crucial.

Figure 2. Control of a linear quantum harmonic oscillator using the protocol. (a) In the estimation stage, the fidelity F(t) between the physical system and the estimator steadily converges. The inset displays the conditional means of the observable x. (b) Subsequently, a state-based controller is applied, swiftly guiding the particle's motion around the center ⟨x(t)⟩c = 0. The instantaneous fidelity Fg(t) (black) quantifies the closeness between the physical-system/estimator state and the target state, i.e., the ground state of the oscillator. The conditional mean population ⟨â†â(t)⟩c in the oscillator is shown in red.
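A rough numerical sketch of this cooling loop (our own truncation, rates, and integrator, not the paper's simulation parameters), applying Ĥ(t) = Ĥ0 − ⟨x⟩c p̂ to a position-measured oscillator:

```python
import numpy as np

# Sketch of state-based damping feedback H(t) = H0 - <x>_c * p for an
# oscillator under continuous position measurement (hbar = m = omega = 1;
# truncation, rates, and step size are illustrative assumptions).
rng = np.random.default_rng(4)
N = 20                                        # Fock-space truncation
a = np.diag(np.sqrt(np.arange(1, N)), 1)      # annihilation operator
x = (a + a.T) / np.sqrt(2)
p = 1j * (a.T - a) / np.sqrt(2)
H0 = a.T @ a + 0.5 * np.eye(N)
num = a.T @ a

# initial coherent state |alpha = 2>, mean occupation ~ 4
alpha, c = 2.0, np.zeros(N)
c[0] = 1.0
for n in range(1, N):
    c[n] = c[n - 1] * alpha / np.sqrt(n)
c /= np.linalg.norm(c)
rho = np.outer(c, c).astype(complex)
n0 = np.trace(num @ rho).real

kappa, eta, dt = 0.25, 1.0, 4e-4
for _ in range(25000):                        # total time t = 10
    mx = np.trace(x @ rho).real               # noiseless conditional mean <x>_c
    H = H0 - mx * p                           # damping feedback Hamiltonian
    dxi = np.sqrt(dt) * rng.standard_normal()
    D = x @ rho @ x - 0.5 * (x @ x @ rho + rho @ x @ x)
    Hs = x @ rho + rho @ x - 2 * mx * rho
    rho = rho + (-1j * (H @ rho - rho @ H) + kappa * D) * dt \
              + np.sqrt(eta * kappa) * Hs * dxi
    rho = 0.5 * (rho + rho.conj().T)
    rho /= np.trace(rho).real
n_final = np.trace(num @ rho).real            # cooled well below n0
```

In the mean, this feedback gives d⟨x⟩/dt = ⟨p⟩/m − ⟨x⟩ and d⟨p⟩/dt = −mω²⟨x⟩, i.e., a damped oscillator for the conditional means, which is why the mean motion decays.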
Next, we consider a nonlinear quartic potential with unperturbed Hamiltonian Ĥ0 = p̂²/2m + λx̂⁴, where we choose m = 1/π and λ = π/25 in appropriate units. We apply a learning-based controller, viz. RL [37-39], to devise suitable feedback strategies in this case. Notably, with the designed stochastic estimator it is now possible to use the full density matrix, as well as the means and moments of operators, to construct any accurate feedback scheme. The scheme therefore allows using accurate conditional means of observables as the input st (observation) to the RL agent; for example, here we use st = {⟨x⟩, ⟨p⟩}. Another advantage of estimator-based control is that state fidelities, which are usually inaccessible in real experimental measurements, become available. Given access to the fidelity F(t) of the estimator, it can serve as a simple and efficient reward function to be maximized by the RL agent during training. The agent is first trained with a given initial state; owing to the generalizability of the trained model, it can then be used to control the system starting from other (random) initial states. The learning curve, i.e., the mean fidelity F̄(N) over each training episode N, is shown in black in Fig. 3. Using conditional means for training the RL agent makes learning quicker and more accurate. The evaluated episodic fidelity variation F(t), shown in red on the second scale of the bi-axial plot in Fig. 3, demonstrates accurate feedback control by the trained RL model.
Moreover, it is often possible to derive control laws for systems undergoing continuous measurement based on the conditional means of observables (without the noise component). Although such control laws have little value in realistic situations, owing to the unavailability of an accurate noiseless signal, we now show that in this context, too, the proposed scheme is useful.

Figure 4. Demonstration of the proposed MBFC protocol for the preparation of symmetric, ρs, and antisymmetric, ρa, entangled states of two qubits, as an example where control laws can be derived from conditional moments within the stochastic dynamics. The control laws u1 and u2 are selected depending on the conditional value of ρ(t)ρµ, where µ ∈ {s, a} (symmetric and antisymmetric), in the three regimes shown in (a); the arrows indicate the direction in which ρ(t) enters the middle region. γ is the damping parameter; the measurement rate is κ = 0.1 and the efficiency η = 0.5 in this simulation. After the estimation stage (not shown), these control laws are applied to the conditional mean data (density matrices, used to compute the instantaneous fidelity), which leads to convergence to the target states (ρa: black; ρs: red), shown in (b). In the absence of such laws, RL can be used; its performance is shown in the inset of (b) with the same color scheme.

To illustrate this, we
consider the preparation of symmetric (ρs) and antisymmetric (ρa) entangled states of two qubits, given by ρ(s/a) = ½ (|ψ↑↓⟩ ± |ψ↓↑⟩)(⟨ψ↑↓| ± ⟨ψ↓↑|). Here, |ψ↑↓⟩ = |↑⟩⊗|↓⟩ and |ψ↓↑⟩ = |↓⟩⊗|↑⟩ are tensor-product states of the individual qubits in the excited (↑) and ground (↓) states. The quantum filtering equation under feedback with control variables u1(t) and u2(t) contains the Wiener noise increment dWt at time t, and σig, g ∈ {x, y, z}, i ∈ {1, 2}, denote the Pauli operators acting on qubit i. The control laws dictate non-intuitive choices of the control parameters u1(t) and u2(t), provided the real-time conditional fidelity between the current state and the target states ρs and ρa can be accurately extracted via conditional tomography of the quantum states, which is often a difficult, if not impossible, task. These laws are discussed in the Supplemental Material and conveniently represented in Fig. 4(a). Using these control laws within the MBFC scheme makes it possible to evaluate the controls u1(t) and u2(t) in real time, which leads to a guaranteed preparation of the states ρa and ρs, shown in black and red lines, respectively, in Fig. 4(b). It is also possible to use RL for control, as in the quartic-oscillator case above, in which case the full density matrix can be used for training along with the conditional means; the performance is shown in the inset of the figure. Compared to the control laws, the RL controller helps the system reach its target state on a shorter time scale.
Finally, we mention possible shortcomings of the proposed scheme. First, the protocol leans towards a model-based approach, aiming to maximize controlled-output accuracy based on a highly precise physical model, so one should be wary of potential model bias. To remove model bias, one can integrate model-learning techniques such as Hamiltonian learning beforehand [41]. It is also possible to use machine-learning techniques such as Bayesian estimation [42] and RL [43,44] to estimate model parameters. Second, real-time feedback control problems are likely to involve delays between the measurement and feedback operations. The estimator, being a simulator on a classical processor, needs finite time for simulation, which can add to this delay, especially for large systems. While the estimation stage of the protocol can be streamlined by completing it in a single pass, providing all previous measurement results at once to the estimator, for the control stage it is advantageous to provide the estimator and the controller with frequent measurement results, to discover finer controllability tailored to the system's complexity. In such cases, RL-based methods can be especially effective [32,33].
In conclusion, even when employing sophisticated filtering techniques such as the Linear Quadratic Regulator (LQR), Linear Quadratic Gaussian (LQG) control, and Kalman filters in standard MBFC experiments, extracting the exact signal from noisy measurement results remains a formidable task [18]. Consequently, conventional feedback strategies fall short of achieving accurate control. The proposed protocol circumvents this by performing accurate conditional state tomography, thereby enabling precise quantum feedback control within the realm of continuous measurement. Furthermore, the protocol integrates seamlessly with RL-based control methods, enabling efficient training and control.

Supplemental Material

S1. CONTINUOUS MEASUREMENT THEORY
The quantum continuous measurement approach is useful for following the changes in the state of a quantum system as information about it is gradually acquired. It is particularly important when a system must be monitored in continuous time and active feedback applied in real time. The dynamics of a quantum system undergoing continuous measurement is described by the so-called stochastic master equation (SME) [1,3].
One approach to obtaining the SME is to relate it to classical continuous measurement, as outlined in [3]. In classical continuous measurement theory, the likelihood of obtaining a measurement result y, given the true value xtrue of a system measured in terms of the continuous variable x, is represented by a Gaussian distribution,

P(y|xtrue) = (1/√(2πσ²)) exp[−(y − xtrue)²/(2σ²)],

where σ² is the noise variance and ξ represents unbiased Gaussian noise. The measurement outcomes are given by y = xtrue + ξ.
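The narrowing of the classical estimate under repeated measurements can be checked numerically (the particular numbers here are illustrative):

```python
import numpy as np

# Repeated noisy Gaussian measurements y = x_true + xi narrow the estimate of
# x at the rate sigma/sqrt(N) (flat prior, maximum-likelihood estimate).
rng = np.random.default_rng(3)
x_true, sigma, N = 1.3, 2.0, 10000
y = x_true + sigma * rng.standard_normal(N)

estimate = y.mean()                  # ML estimate of x_true
posterior_std = sigma / np.sqrt(N)   # = 0.02, the residual uncertainty
```

This is the classical analogue of the continuous-measurement limit discussed next, where the sum over discrete outcomes becomes an integral over noisy increments.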
The sum of the measurement results can be expressed as an integral over increments dy(t). Here, dW represents Wiener noise, obeying dW² = dt and Σn (dWn)² = T. Differential equations involving dW are handled using stochastic calculus. The stochastic differential equation (SDE) for the probability distribution P(x), conditioned on the measurement record, is the Kushner-Stratonovich equation, which characterizes classical continuous measurement theory. Now, to formulate the quantum continuous measurement theory, we choose a quantum observable X for which the measurement result dy is given by

dy = ⟨X⟩ dt + dW/√(8k),

where we have redefined k = 1/(8β²). The second term on the right-hand side is called the measurement noise, or shot noise. The likelihood function P(dy|x) has a Gaussian form, from which the set of measurement operators A(dy) can be derived as Gaussian-weighted sums of projectors onto the eigenstates of X,

A(dy) ∝ exp[−2k dt (X − dy/dt)²],

where we assume that the measurement operator X has a continuous spectrum of eigenvalues x such that X|x⟩ = x|x⟩ and ⟨x|x′⟩ = δ(x − x′). The probability distribution of the measurement results for an initial state |ψ⟩ = ∫ψ(x)|x⟩dx can then be determined by replacing |ψ(x)|² → δ(x − ⟨X⟩), using the fact that the Gaussian in the integral is much wider than ψ(x); the distribution of dy is therefore Gaussian and centered at ⟨dy⟩ = ⟨X⟩dt. With this, we can derive the SDE for |ψ⟩, where |ψ(t + dt)⟩ represents the normalized wave function after a measurement. The normalization leads to

d|ψ⟩ = [ −iH dt − k(X − ⟨X⟩)² dt + √(2k)(X − ⟨X⟩) dW ] |ψ⟩,

which is known as the stochastic Schrödinger equation (SSE). Unlike the Schrödinger equation, the SSE is essentially nonlinear, since ⟨X⟩ appears inside the term containing dW. From this, the stochastic master equation (SME) for the evolution of the density matrix ρ is easily derived by expanding ρ(t + dt) = (|ψ⟩ + d|ψ⟩)(⟨ψ| + d⟨ψ|) to first order in dt,

dρ = −i[H, ρ] dt − k[X, [X, ρ]] dt + √(2k)(Xρ + ρX − 2⟨X⟩ρ) dW,

where the first term on the right-hand side represents the coherent evolution due to the system Hamiltonian H. This is the primary equation, the quantum analogue of the classical Kushner-Stratonovich equation. Defining the superoperators

D[A]ρ = AρA − (A²ρ + ρA²)/2,   H[A]ρ = Aρ + ρA − 2⟨A⟩ρ,

and replacing k → κ/2, the SME can be written as

dρc = −i[H, ρc] dt + κ D[A]ρc dt + √κ H[A]ρc dW,

where ρc denotes the conditional density matrix of the system described by the Hamiltonian H, and A is the dimensionless measurement operator. We refer to the continuous measurement records as

dQ(t) = ⟨A(t)⟩c dt + dW(t)/(2√κ).

S2. MEASUREMENT-BASED FEEDBACK CONTROL
In a typical measurement-based feedback control (MBFC) scheme, a dynamical system is continuously monitored by a probe that outputs noisy readings in real time; these are processed by a real-time feedback algorithm that modifies the input to the system so as to achieve a desired dynamics or target state over time. This is shown schematically in Fig. S1. The feedback algorithm, usually called a controller, is implemented as a computer or a configurable logic block that interprets the noisy signals and, based on them, applies feedback controls according to predefined, hard-coded rules. The monitored system may be classical or quantum. In the latter case, the signal also reflects the backaction of the measurement process, in addition to various noises originating from the decoherence effects of the environment. We are interested here in quantum-mechanical control by active feedback. Such active feedback (also known as closed-loop feedback control) is based on continuous monitoring of the quantum system with weak measurements, as described above. When the feedback controller applies feedback F(t) to the dynamical quantum system at time t, the SME becomes

dρc = −i[H + F(t), ρc] dt + κ D[A]ρc dt + √κ H[A]ρc dW.

Here, the feedback F(t) can in general be a function of all previously measured values dQ, in which case the feedback is essentially non-Markovian. In the case of Markovian feedback, F(t) is determined only by the instantaneous measurement result dQ(t). This is the essence of the so-called Wiseman-Milburn measurement-based feedback [1]. In Bayesian feedback control, F(t) is instead a function of the conditional density matrix ρc; for example, F(t) can be determined by the conditional mean of an operator X, which is not readily available for most complex quantum-mechanical systems [1,3]. The distinctive feature of the present work is showing how noiseless conditional dynamics can be accessed using noisy continuous measurement records, which leads to accurate MBFC.

S3. REINFORCEMENT LEARNING CONTROLLER
Reinforcement learning (RL) [37] is a type of machine learning (ML) [45] used to make a series of decisions in complex and potentially uncertain settings. It is trained through a game-like process in which a player learns a game from scratch in order to win, accumulating as many points as possible in as short a time as possible. The ML model in RL is commonly referred to as the RL-agent. The world outside the RL-agent is called the RL-environment, which in principle includes everything, but must include the part of the world, defined by the particular problem, over which the RL-agent is to have control. The RL-agent gathers knowledge about the RL-environment by collecting incentives called rewards while trying to change its dynamics (its game) by applying stimuli called actions. The RL-agent's goal is to collect as many rewards as possible over a given period of time, called an episode. The RL-agent thus learns by trial and error, much like humans and animals, and because of this similarity RL is considered a primary technique for general machine intelligence. RL is used in numerous engineering applications, including robot navigation, game-playing artificial intelligence, real-time decision making, skill acquisition, and learning tasks.
The typical workflow of RL is very similar to the MBFC loop shown in Fig. S1 above. In RL, the feedback algorithm (controller) is the RL-agent and the system is the RL-environment. The outputs of the RL-environment that the RL-agent uses to train itself are called states or observations. Unlike the controller in MBFC, the RL-agent does not need to be pre-programmed; it learns the rules (policy) of the controls itself from the observations it receives in real time from the RL-environment. To do so, the RL-agent must receive another signal, the reward, which, as mentioned above, is maximized over each episode to obtain the optimal strategy. With these modifications, the workflow of MBFC with RL control takes the form shown in Fig. S2. The various RL terminologies, along with brief definitions and notes, are given in Table S1.
RL in conjunction with artificial neural networks (ANNs) is referred to as deep reinforcement learning (DRL), which has revolutionized the field as a cutting-edge technology in recent years. Although the primary task of RL is to learn to control the environment through actions, there are numerous algorithms for accomplishing this. Choosing the right algorithm depends on the nature of the problem to be solved and on the complexity of the actions and observations. In most cases the choice is not trivial, and there is an optimal combination of algorithm and so-called hyperparameters for a given problem. The procedure can therefore be overwhelming for novices, and a step-by-step understanding is desirable. When applied to problems in quantum physics, the complexity arises not only from the choice of algorithm but also from the inherent uncertainty associated with quantum dynamics. This requires a basic understanding of the different types of RL algorithms.
The RL problem can be formulated mathematically as a Markov decision process (MDP) that represents the dynamics of the environment as the agent takes actions in a given state. To this end, the MDP is equipped with a transition function (or transition model) that predicts the next state of the environment as a function of the current state and the action taken. The MDP is also equipped with a reward function that returns a reward as a function of the current state of the environment and, possibly, the action and the predicted next state. The dynamics of the MDP are thus described by the transition and reward functions, collectively referred to as the "model." In most practical contexts, however, this model is not available, i.e., the transition and/or reward functions are unknown, and one must resort to model-free methods.
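As a toy illustration of what the transition and reward functions (the "model") look like, consider a hypothetical two-state MDP (entirely our own construction, for illustration only):

```python
# Toy MDP "model": states {0, 1}, actions {0: stay, 1: move}; the goal is to
# occupy state 1. The two functions together constitute the model of the MDP.
def transition(state, action):
    """Deterministic transition function: action 1 flips the state."""
    return state if action == 0 else 1 - state

def reward(state, action, next_state):
    """Reward function: +1 whenever the next state is the goal state 1."""
    return 1.0 if next_state == 1 else 0.0
```

Model-free methods such as temporal-difference learning or PPO never evaluate these functions directly; they only sample them through interaction with the environment.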
Figure S1. In RL terminology, the controller on the right is called the agent; the system to be controlled by the agent is called the environment; the outputs are called the states; and the feedback signals are called the actions. Learning of the control rules (policy) proceeds via maximization of the net reward the agent collects over a period called an episode. The reward should be a metric of the changes occurring in the environment as a result of an applied action, and should be a quantity accessible under real experimental conditions.

We have used Proximal Policy Optimization (PPO) [38], a reinforcement learning algorithm for optimizing the policy of an agent in an environment, designed to be both simple to implement and effective in practice. PPO is a variant of the popular actor-critic approach, which separates the policy (the actor) from the value function (the critic). The PPO algorithm can be broken down into the following steps:
1. Collect a batch of samples by interacting with the environment using the current policy.These samples consist of a sequence of state-action-reward tuples (s t , a t , r t ).
2. Estimate the value function V(st) for each state in the batch using a neural network. The value function can be estimated using the Bellman equation, which gives the optimal value function V*(s) for each state,

V*(s) = max_a E[ r(s, a) + γ V*(s′) ],

where s′ is the next state, γ is the discount factor, and the expectation is taken over the distribution of next states given the current state and action.
The value function can be estimated by iteratively updating the estimates using the Bellman equation; this process is known as dynamic programming [37]. One popular alternative is the temporal-difference learning algorithm, an online, model-free method for estimating the value function.
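A minimal TD(0) sketch on a hypothetical two-state chain (state 0 leads to state 1 with reward 0, state 1 to a terminal state with reward 1; the chain and constants are our own toy example) shows the iterative update converging to the discounted values:

```python
import numpy as np

# TD(0) value estimation: V(s) += alpha * (r + gamma * V(s') - V(s)).
# On this deterministic chain the fixed point is V(1) = 1 and V(0) = gamma.
gamma, alpha = 0.9, 0.1
V = np.zeros(2)
for _ in range(2000):                             # replay the same episode
    V[0] += alpha * (0.0 + gamma * V[1] - V[0])   # s=0 -> s=1, reward 0
    V[1] += alpha * (1.0 + gamma * 0.0 - V[1])    # s=1 -> terminal, reward 1
```

Each update nudges the current estimate toward the bootstrapped target r + γV(s′), which is exactly the quantity appearing in the Bellman equation.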
3. Estimate the advantage function A(st, at) for each state-action pair in the batch. The advantage function is an estimate of the difference between the expected return and the value function,

A(st, at) = Q(st, at) − V(st).

4. Use the samples to update the policy network, a neural network that maps states to a probability distribution over actions. PPO modifies the actor objective function by using a 'clip' function to ensure that the updated policy does not move too far from the previous one. The PPO objective function is

L(θ) = E_t[ min( rt(θ) A(st, at), clip(rt(θ), 1 − ϵ, 1 + ϵ) A(st, at) ) ],

where rt(θ) = πθ(at|st)/πθold(at|st) is the ratio of the new policy to the old policy, πθ(at|st) is the probability of taking action at in state st under the current policy, πθold(at|st) is the same probability under the previous policy, and clip(rt(θ), 1 − ϵ, 1 + ϵ) clips the ratio to the range [1 − ϵ, 1 + ϵ].

The PPO algorithm is an example of a trust-region policy optimization algorithm [39], which ensures that the updated policy is not too far from the previous one. This makes the optimization process more stable and prevents the agent from over-fitting to the current policy. The use of the clip function also helps reduce the variance of the gradient estimates, which can lead to more stable and efficient training.
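The clipped surrogate can be computed directly on dummy numbers (illustrative values, not from any experiment) to see how the clip caps the incentive for large policy changes:

```python
import numpy as np

# PPO clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A), averaged
# over the batch. Ratios far outside [1 - eps, 1 + eps] earn no extra credit.
def ppo_objective(ratio, advantage, eps=0.2):
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped).mean()

ratio = np.array([0.5, 1.0, 2.0])   # pi_new / pi_old for three dummy samples
adv = np.ones(3)                    # positive advantages
obj = ppo_objective(ratio, adv)     # mean of [0.5, 1.0, 1.2] = 0.9
```

Note how the sample with ratio 2.0 contributes only 1.2 (the clipped value): with a positive advantage, pushing the ratio beyond 1 + ϵ yields no additional objective, which is what keeps the update inside the trust region.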
One of the key advantages of PPO is that it is relatively simple to implement compared to other state-of-the-art algorithms. PPO does not require the use of complex off-policy methods or value-function approximations. Another advantage of PPO is

Table S1. A list of RL terminologies and their definitions, along with brief notes for each.

RL Terminology Definition & brief notes

Agent
It is the controller and essentially the brain of the RL workflow, gradually learning the hidden features of the system being controlled (the environment) through a trial-and-error process. At each point in time, it receives an observation s from the environment, which the agent maps to specific feedbacks (actions) a to be applied to the environment.

Environment
It is the world in which the agent lives and with which it interacts, from which it learns and on which it decides what actions to apply. Anything that the agent cannot arbitrarily change can be considered part of the environment. It is essentially the representation of the problem to be solved and can be based on the real world (e.g., a robot walking in a field), simulations (e.g., games such as Atari), or hybrids (e.g., a self-driving car, where training can be done on simulated models before operating in real environments).

State and Observation, s
State represents the complete information about the environment output at each time step of the RL workflow. Strictly, an observation is a complete or partial description of the state; however, in the RL literature, the terms state and observation are often used as synonyms. In practice, the state can be anything that might be useful for the agent to decide the next action wisely, and it is usually represented by a real-valued vector, matrix, or higher-order tensor. The set of all valid observations in a given environment is called the observation space, which can be discrete or continuous in nature.

Actions, a
These are the feedbacks that the agent applies to the environment based on the current observation s. The actions change the environment, and these changes are expected to happen in a desired way. The set of all valid actions in a given environment forms a space known as the action space, which can be discrete or continuous in nature.

Reward, r
It is a scalar number, positive or negative, received by the agent from the environment along with the observations; it is a direct measure of whether the action performed by the agent was useful in achieving the goal. The reward at each time t is determined by the current observation, the action, and the subsequent observation through the reward function: rt = R(st, at, st+1), often simplified as rt = R(st, at).

Episodes and trajectories, τ
A trajectory is a sequence of observations and actions τ = (s0, a0, s1, a1, . . . , aT −1, sT ) over a time T, called an episode. The agent's task is to collect as much total reward as possible during each episode.

Discounted return
The sum of rewards received by the agent due to its actions within a trajectory τ is called the return, R(τ ) = Σ T t=0 rt, or, more precisely, the finite-horizon undiscounted return, in which all rewards, now and later, are treated equally. In practice, one uses the so-called infinite-horizon discounted return, R(τ ) = Σ ∞ t=0 γ t rt, where γ ∈ (0, 1) is called the discount factor. This gives preference to rewards received now over those to be received later, and it makes the return mathematically simpler to treat in equations.
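The discounted return above can be accumulated backwards in a few lines (an illustrative sketch; the reward list is arbitrary):

```python
def discounted_return(rewards, gamma=0.99):
    """R(tau) = sum_t gamma^t r_t, accumulated backwards: R <- r_t + gamma R."""
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R
```

With γ = 1 this reduces to the undiscounted return Σ rt; the backwards accumulation avoids computing the powers γ t explicitly.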

Policy, π
It is the rule or control strategy learned by the agent for applying actions to the environment to achieve a desired goal. It is essentially represented as a deterministic or stochastic mapping from states to actions. For a stochastic policy, the action at time t is drawn from a conditional probability distribution given the state st: at ∼ π(·|st).

Function approximator
It represents the encoding of a function learned from training examples. Standard approximators are decision trees, neural networks, and nearest-neighbor methods. The task of the RL is to optimize the so-called policy parameters θ of the function approximator (the weights and biases of the neural network) so as to optimize the policy π θ (·|s).

V-value function, V π (s)
It is denoted as the value of a state s and represents the expected return if we start in state s and act according to policy π forever thereafter. It is computed as V π (s) = E τ∼π [R(τ )|s0 = s], where E τ∼π represents the average discounted return, over multiple trajectories, starting at s0 = s and following the policy in each trajectory. The optimal value function V* results from following the optimal policy, for which it is given by V*(s) = max π V π (s). The optimal action can then be calculated as a*(s) = argmax a Q*(s, a).

Q-value function, Q π (s, a)
It is denoted as the value of a state-action pair (s, a) and represents the expected return if we start with the state-action pair (s, a) and act according to policy π forever thereafter. It is computed as Q π (s, a) = E τ∼π [R(τ )|s0 = s, a0 = a], where E τ∼π represents the average discounted return, over multiple trajectories, of starting at s0 = s, taking the action a0 = a, and following the policy in each trajectory. The optimal Q-value function results from following the optimal policy, for which it is given by Q*(s, a) = max π Q π (s, a). The main connections between the V-value and Q-value functions are V π (s) = E a∼π [Q π (s, a)] and V*(s) = max a Q*(s, a).

Advantage function, A π (s, a)
It is defined as A π (s, a) = Q π (s, a) − V π (s) and represents how much better it is to take a given action a in state s versus an action chosen randomly from π(·|s). It is important for formulating RL methods for policy optimization.
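The relations among the V-value, Q-value, and advantage functions can be checked on a toy tabular example (the numbers below are made up purely for illustration):

```python
import numpy as np

# One state, three actions; Q-values and policy are illustrative.
Q = np.array([1.0, 3.0, 2.0])       # Q^pi(s, a) for the three actions
pi = np.array([0.2, 0.5, 0.3])      # stochastic policy pi(a|s)

V = float(np.dot(pi, Q))            # V^pi(s)   = E_{a~pi}[Q^pi(s, a)]
A = Q - V                           # A^pi(s,a) = Q^pi(s, a) - V^pi(s)
V_star = float(Q.max())             # V*(s)     = max_a Q*(s, a)
a_star = int(Q.argmax())            # a*(s)     = argmax_a Q*(s, a)
```

By construction, the policy-weighted average of the advantages vanishes, and the advantage is positive only for actions better than the policy's average behavior.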
that it is a sample-efficient algorithm. It allows the agent to learn from a relatively small number of samples, which makes it well suited to applications where data collection is expensive or time-consuming. PPO also performs well on high-dimensional and continuous action spaces, and it has proven to be a useful algorithm for problems involving quantum systems.
For the results presented in this work, we have demonstrated the use of RL for control in the nonlinear quartic potential and for the preparation of entangled states. In this protocol, the RL-agent can be trained as a state-aware model, using the estimator's quantum state or any mean value of interest as input. For the case of the quartic oscillator, we consider the conditional moments of a few observables as the input to the RL-agent, where x (p) is the position (momentum) quadrature of the oscillator. These, along with the real-time fidelity, are provided by the estimator in real time; the latter is used as the reward signal that is maximized through training the RL-agent by adjusting the control parameters. The actions λ(t) of the RL (the controls) are real scalar values such that the feedback added to the Hamiltonian is H f = λ(t)p.
For the case of preparation of an entangled state based on a control law, the task of the RL is to estimate the controls u 1 and u 2 , as discussed in S5. For this case, the RL uses the conditional density matrix obtained from the estimator as input (observation) to correctly learn the controls that generate the target states, using the fidelity as the reward signal.
For the RL implementation, we used PPO with continuous controls with the following hyperparameters. In the following, we use the intuitive example of a qubit to demonstrate the MBFC protocol, with a Hamiltonian written in terms of the Pauli operators σi , i = (x, y, z), where ε is the bare energy splitting and ∆ is the tunneling rate between the two states of the qubit system. We start the physical qubit in a state with excited-state occupancy that undergoes continuous measurement of the operator Â = σz , and examine whether the stochastic estimator, started from a random state with an initial fidelity of ∼ 0.6, can reach a perfect estimate of the state in time. As shown in Fig. S3, the conditional state of the estimator gradually converges with time and perfectly reproduces the real system state. Note that this convergence can always be guaranteed regardless of whether the efficiency η is ideal or not. In the case of η ̸= 1, the time t f required to reach convergence becomes slightly longer; this is shown in Fig. S4(a) as a function of η. On the other hand, for detectors with a larger measurement rate κ, t f becomes smaller, as shown in Fig. S4(b) for η = 1 as a function of κ. This behavior of the estimator is understandable: intuitively, the estimator can learn the state faster if it has more accurate information (larger η) and less noisy measurement data (larger κ).
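The estimation phase of this qubit example can be sketched as follows. This is a hedged illustration, not the paper's exact implementation: it uses a simple Euler–Maruyama integration of a diffusive stochastic master equation in one common Itô convention, an assumed Hamiltonian of the form H = (ε/2)σz + ∆σx, and the parameter values of Fig. S3. The physical system generates the measurement record dy; the estimator, started from a different state, is driven by the same record through its innovation term:

```python
import numpy as np

sx = np.array([[0, 1], [1, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)

eps_, delta, kappa, eta = 0.1, 1.0, 1.0, 1.0   # parameter values of Fig. S3
H = 0.5 * eps_ * sz + delta * sx               # assumed form of the qubit Hamiltonian
A = sz                                         # continuously measured operator

def D(L, rho):
    """Lindblad dissipator D[L]rho."""
    return L @ rho @ L.conj().T - 0.5 * (L.conj().T @ L @ rho + rho @ L.conj().T @ L)

def Hsup(L, rho):
    """Measurement superoperator H[L]rho = L rho + rho L^dag - <L + L^dag> rho."""
    X = L @ rho + rho @ L.conj().T
    return X - np.trace(X).real * rho

def sme_step(rho, dW, dt):
    """One Euler-Maruyama step of the diffusive SME, then renormalize the trace."""
    drho = (-1j * (H @ rho - rho @ H) + kappa * D(A, rho)) * dt \
        + np.sqrt(eta * kappa) * Hsup(A, rho) * dW
    rho = rho + drho
    return rho / np.trace(rho).real

rng = np.random.default_rng(1)
dt, steps = 1e-3, 30000
rho_sys = np.array([[1, 0], [0, 0]], dtype=complex)              # physical qubit
rho_est = np.array([[0.5, 0.25], [0.25, 0.5]], dtype=complex)    # estimator's guess

for _ in range(steps):
    dW = rng.normal(0.0, np.sqrt(dt))
    # record produced by the physical system: dy = sqrt(eta*kappa)<A + A^dag> dt + dW
    dy = np.sqrt(eta * kappa) * np.trace((A + A.conj().T) @ rho_sys).real * dt + dW
    rho_sys = sme_step(rho_sys, dW, dt)
    # the estimator sees only dy; its innovation plays the role of dW
    dW_est = dy - np.sqrt(eta * kappa) * np.trace((A + A.conj().T) @ rho_est).real * dt
    rho_est = sme_step(rho_est, dW_est, dt)
```

Despite starting from a different state, the estimator is pulled toward the physical system's conditional state by the innovation term, mirroring the convergence shown in Fig. S3.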

S5. CONTROL LAWS FOR SYMMETRIC AND ANTISYMMETRIC ENTANGLED STATE PREPARATION
We consider the example of two qubits which, starting from random states, can be prepared in the symmetric and antisymmetric entangled states (ψ ↑↓ ± ψ ↓↑ )/√2, where ψ ↑↓ = (↑) ⊗ (↓) and ψ ↓↑ = (↓) ⊗ (↑) are tensor products of the individual qubit excited and ground states. We consider the stochastic feedback controls given below [40].
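For concreteness, the two target states can be written out explicitly in the two-qubit computational basis (a small numerical sketch; the basis convention below is ours):

```python
import numpy as np

up = np.array([1.0, 0.0])       # single-qubit excited state (up)
down = np.array([0.0, 1.0])     # single-qubit ground state (down)

psi_ud = np.kron(up, down)      # psi_{up,down} = (up) x (down)
psi_du = np.kron(down, up)      # psi_{down,up} = (down) x (up)

psi_sym = (psi_ud + psi_du) / np.sqrt(2)    # symmetric entangled state
psi_asym = (psi_ud - psi_du) / np.sqrt(2)   # antisymmetric entangled state
```

Both states are maximally entangled: tracing out either qubit leaves the maximally mixed single-qubit state (purity 1/2).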
The quantum filtering equation under feedback with control variables u 1 (t) and u 2 (t) is given by,

When the system parameters are known accurately, the protocol works perfectly. It is indeed most suitable for situations with precise and controlled experimental conditions, such as in controlled settings in superconducting circuit quantum electrodynamics systems. We illustrate the effect of not having an accurate estimate of the system Hamiltonian on the estimation phase of the protocol in Fig. S8, for the case of quantum harmonic oscillator cooling. However, as discussed in the main text, since the protocol is expected to benefit from the use of RL-based controllers, the estimation can be improved in realistic experimental situations by combining it with parameter estimation and Hamiltonian learning. There are several methods for such estimation based on measured data, derived from both machine-learning and non-machine-learning approaches, which could be used before the protocol to estimate the parameters of interest, as summarized in a recent review article on this topic [47].

Figure 2
Figure 2. Control of a linear quantum harmonic oscillator using the protocol. (a) In the estimation phase, the fidelity F(t) between the physical system and the estimator steadily converges. The inset displays the conditional mean of the observable x. (b) Subsequently, a state-based controller is applied, swiftly guiding the particle's motion around the center, ⟨x(t)⟩c = 0. The instantaneous fidelity Fg(t) (depicted in black) quantifies the closeness between the physical-system/estimator state and the target state, i.e., the ground state of the oscillator. The conditional mean population ⟨â † â(t)⟩c in the oscillator is shown in red.

Figure 3 .
Figure 3. The protocol is applied to control a particle's motion in a nonlinear quartic potential, cooling it to its dynamical ground state using RL-based control. The training process is shown as a black line: the average fidelity over each episode N with respect to the target state (the ground state), F(N ), is maximized through training. Note that the sudden drop at N ∼ 100 is due to the exploration of the RL-agent. The performance of the trained agent is shown as a red line.

Figure S1 .
Figure S1. The typical workflow of measurement-based feedback control (MBFC). A system (left) is measured continuously, and the noisy measurement results are used by a feedback algorithm to change the system's inputs in a desired way and control the system's dynamics.

Figure S2 .
Figure S2. Workflow of RL, in comparison to the MBFC workflow shown in Fig. S1. In RL terminology, the controller on the right is called the agent, the system to be controlled by the agent is called the environment, the outputs are called the states, and the feedback signals are called the actions. The rules of control (the policy) are learned by maximizing the net reward the agent receives over a period of time called an episode. The reward should be a metric of the changes that occur in the environment due to the applied action and should be a quantity accessible in real experimental conditions.

Figure S3 .
Figure S3. We demonstrate the convergence of the estimator state to the real quantum-system state for the toy model of a qubit. The insets show the initial and final states of the real system and the estimator for this particular example. The initial fidelity between the real and estimator states is F(0) ∼ 0.6, which gradually improves until it reaches F(t f ) ≈ 1. This represents the estimation phase of the MBFC protocol shown in Fig. 1(a) in the main text. The parameters considered are ε = 0.1, ∆ = 1.0, κ = 1.0, η = 1.0.

Figure S4 .
Figure S4. The convergence time t f as a function of (a) the measurement efficiency η and (b) the measurement rate κ, showing that t f depends on the amount of information obtained from the noisy continuous measurements.

Figure S5 .
Figure S5. Circuit-QED example. The time evolution of the fidelity between the real and the estimated state is shown. In inset (a), the x-quadrature conditional means ⟨x(t)⟩c of the real system and the estimator are compared. The corresponding change in ⟨σz(t)⟩c of the qubit for the physical system and the estimator is compared in inset (b). The control in this case would be to flip the qubit as ⟨σz⟩c → −1.

Figure S6 .
Figure S6. A use case of the proposed method: non-classical bosonic code preparation using continuous measurement and non-trivial control protocols obtained with RL. The RL-agent is trained to improve the fidelity with respect to a target state (here a binomial 1-3 code is shown) starting from the ground state. The RL can only discover strategies when it receives the instantaneous conditional density matrix as input (observation). The learning curve of the RL is shown. The insets show the Wigner functions of the initial state and of the final state prepared by the RL-agent. The final state has fidelity > 0.99.