Accelerated Magnonic Motional Cooling with Deep Reinforcement Learning

Achieving fast cooling of motional modes is a prerequisite for leveraging bosonic quanta for high-speed quantum information processing. In this work, we address the problem of reducing the cooling time below the limit set by conventional sideband cooling techniques, and propose a scheme based on deep reinforcement learning (DRL) to achieve this. In particular, we show how the scheme can be used effectively to accelerate the dynamic motional cooling of a macroscopic magnonic sphere, and how it extends uniformly to more complex systems, for example a tripartite opto-magno-mechanical system, to cool the motional mode below the time bound of coherent cooling. While conventional sideband cooling methods do not work beyond the well-known rotating wave approximation (RWA) regimes, our proposed DRL scheme can be applied uniformly within and beyond the RWA, and thus offers a new and complete toolkit for rapid control and generation of macroscopic quantum states for applications in quantum technologies.

Introduction.− Fast cooling of bosonic mechanical modes of macroscopic systems is a primary objective of the ongoing efforts in quantum technology [1][2][3][4][5], and is a prerequisite for diverse prospective applications, such as the realization of macroscopic superposition states [6], gravitational tests of decoherence [7,8], ultra-precise measurements and sensing [9,10], and bosonic quantum computing [11]. Macroscopic yttrium-iron-garnet (YIG, Y3Fe5O12) magnets have recently attracted strong interest for such applications given their versatility in coupling to other modes across a wide frequency spectrum, e.g., to optical, microwave, and acoustic modes, as well as to superconducting qubits [12][13][14][15][16][17]. In addition, highly polished YIG spheres feature high magnonic Q-factors and exhibit large frequency tunability owing to the magnetic-field dependence of the excited magnon modes. Considering, in particular, the cooling of motional modes of such objects, the usual method of sideband cooling, based on weak magnomechanical interaction, operates on a time scale longer than the mechanical period of oscillation and depends on the relaxation dynamics of the subsystems. Cooling the bosonic modes of such systems on a timescale shorter than the mechanical oscillation period is highly advantageous for quantum computation and bosonic error correction [11,18]. By going over to the strong coupling regime of the magnon-phonon interaction, the speed of motional cooling can be greatly enhanced, giving rise to accelerated cooling. However, in such strong coupling regimes, energy-nonconserving dynamics prevails because of the simultaneous presence of counter-rotating interactions, which makes it impossible to use sideband motional cooling techniques in this regime.
In this work, we explore the usefulness of a machine learning based approach to address the aspect of reducing the time limit for cooling below that constrained by the conventional cooling techniques.
Recently, various machine learning (ML) approaches, aided with artificial neural networks as function approximators, have found widespread technological applications [19]. Among the various ML approaches, reinforcement learning (RL) [20] is considered to exhibit the closest resemblance to a human-like learning approach, in that the RL-agent tries to gather experience on its own by interacting with its environment in a trial-and-error fashion. In RL terminology, the environment describes the virtual/real world surrounding the agent, with all the physics hard-coded into it, along with a reward function based on which the agent can distinguish its good moves from the bad ones. RL, when operated in combination with artificial neural networks, is known as deep reinforcement learning (DRL). DRL has become crucial in many industrial and engineering applications, primarily after recent seminal works by Google DeepMind researchers [21,22]. Following these developments, there have been several fascinating applications of DRL in various fundamental domains of science, including some in quantum physics in the areas of quantum error correction [23][24][25], quantum control [26], and state engineering [27][28][29][30][31][32][33][34][35][36].
In this work, we propose a DRL-based dynamical coupling scheme for accelerated motional cooling of a macroscopic object, that works for a generalized parameter setting of the coupling strength between the subsystems. In particular, we use the protocol to cool the acoustic phonon modes of a YIG sphere with a magno-mechanical interaction, and show that it works efficiently in the strong coupling regime, where other methods such as sideband cooling fail. Also, we show how going over to the strong coupling regime is particularly advantageous, as it lowers the cooling time well below the phonon oscillation period and two orders of magnitude below the sideband cooling time limit. We demonstrate the usefulness and generalizability of our DRL cooling protocol by extending its application to a tripartite system of a trapped YIG magnet with its magnonic modes coupled to the center-of-mass (COM) mode in the trap and an optical cavity mode; and show that despite the system being in the ultrastrong coupling regime, our DRL scheme can reveal nontrivial coupling modulations to cool the motional mode, which is usually not possible with coherent counter-intuitive protocols.
Bipartite magno-mechanical cooling.− We consider a setup as shown in Fig. 1(b), where a highly polished YIG sphere is placed in a microwave cavity.

FIG. 1. The schematic workflow of the DRL protocol is shown (a), where the RL-environment is either the bipartite magno-mechanical system (b) or the tripartite opto-magno-mechanical system (c). See text for further details on DRL and the physical models.

With a large external homogeneous magnetic field, B_0, applied in the ẑ-direction, the YIG sphere is magnetized to its saturation magnetization, M_s = 5.87 × 10^5 A/m, which gives rise to collective spin-wave magnon modes [16,37]. The frequency of the Kittel mode, which is the uniform magnon mode in the YIG sphere, is given by ω_m = γB_0, where γ/2π = 28 GHz/T is the gyromagnetic ratio. Due to magnetostriction, YIG spheres also exhibit high-Q acoustic modes which are coupled to the magnon modes [12,15], and by driving the magnon modes with MW fields, the magno-mechanical coupling can be tuned and controlled [15,17]. In the limit of adiabatic elimination for a low-Q cavity, the bipartite magno-mechanical Hamiltonian is given by (see Supplemental Material for details)

H̃ = ∆_m m†m + ω_b b†b + (G̃m + G̃*m†)(b + b†),

where m(m†) and b(b†) are the magnonic and acoustic mode annihilation (creation) operators, ω_b is the resonance frequency of the acoustic mode, and ∆_m = ω_m − ω_d (drive frequency ω_d) is the detuning. In such a bipartite system, while the beam-splitter interaction, G̃mb† + G̃*m†b, valid for weak magno-mechanical coupling, favours mechanical cooling at the red sideband ∆_m = ω_b, the full coupling interaction, which accounts for the strong/ultrastrong coupling regimes, is not favourable in the usual sideband cooling approach. Sideband cooling works through anti-Stokes scattering of the excitations from the thermally populated mode, b, to the mode at zero entropy, m.
However, such cooling requires constant driving, as it is a steady-state process whose duration is of the order of the relaxation times of the subsystems (see Supplemental Material). If one can access the strong coupling regime and manage to tame the counter-rotating interactions therein, there is a possibility of cooling faster than this limit. In the following, we design an algorithm based on DRL to model a dynamic variation of the coupling, G̃, to obtain faster cooling of the acoustic mode, operating within and beyond the weak coupling regime.
The schematic workflow of the DRL scheme applied to the physical model (RL-environment) is shown in Fig. 1(a). The RL-agent consists of a neural network model that is optimized to select actions that lead to desirable changes and to a net maximum reward. The RL-agent is modelled using the recently proposed Soft Actor-Critic (SAC) [38] algorithm, which is based on the maximization of the entropy, H(π_θ(·|s_t)), of the policy π_θ as well as the long-term discounted cumulative return r(s_t, a_t), i.e.,

max_π E[Σ_t γ^t (r(s_t, a_t) + αH(π_θ(·|s_t)))],

where θ denotes the optimizable weights of the neural networks, γ is the discount factor, and α is the regularization coefficient that determines the stochasticity of the policy. The policy π_θ sets the rules for the particular actions to be applied to the RL-environment (see Supplemental Material for further details).
The dynamics of the system is described by the quantum master equation (QME) for the density matrix ρ with the Hamiltonian H̃,

ρ̇ = −i[H̃, ρ] + Σ_j κ_j(n̄_cj + 1)D[c_j]ρ + κ_j n̄_cj D[c_j†]ρ,

with dissipations and thermal fluctuations given by the Lindblad superoperators D[o]ρ = oρo† − {o†o, ρ}/2, where the c_j are the mode operators, the κ_j's are the damping rates of the modes, and the thermal occupation of each bosonic mode is given by n̄_cj = [exp(ℏω_j/k_B T) − 1]^−1, where T is the bath temperature and k_B is the Boltzmann constant. Solving the full QME to obtain the mean occupancy is a computationally intensive task. DRL typically requires several thousand episodes of training, and solving the full QME within each episode is too resource-intensive for complex systems such as the ones we consider in this work. We employ an alternative approach and compute the mean occupancies in each mode using a set of linear differential equations for the second-order moments obtained from the QME, given by

∂_t⟨ô_i ô_j⟩ = Tr(ρ̇ ô_i ô_j) = Σ_{m,n} ζ_mn ⟨ô_m ô_n⟩,

where ô_i, ô_j, ô_m, ô_n are one of the operators (c_j†, c_j), and the ζ_mn are the corresponding coefficients. Instantaneous solutions of these equations and the controls, G̃, are used as the observations for the RL-agent in Fig. 1(a), and the reward function is chosen as r_t = −ñ_b, where ñ_b = ⟨b†b⟩/n_b^T is the cooling quotient of the phonon number with respect to the thermal occupancy, n_b^T, at temperature T, and ⟨b†b⟩ represents the mean phonon population. Further details of the DRL controller can be found in the Supplemental Material.
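As an illustration of the second-order-moment approach, the sketch below (not the authors' code) integrates the closed moment set {⟨b†b⟩, ⟨m†m⟩, ⟨m†b⟩} for the bipartite system under the RWA beam-splitter interaction at the red sideband. The damping rates follow the values quoted in the text; the constant real coupling G, the energy-decay damping convention, and the initial thermal occupancy are illustrative assumptions.

```python
# Minimal sketch: second-order moment equations for the bipartite
# magno-mechanical system with H_int = G (m b† + m† b) (RWA, red
# sideband, interaction picture). State x = (⟨b†b⟩, ⟨m†m⟩, ⟨m†b⟩);
# all rates are in units of omega_b. Illustrative assumptions: the
# coupling G is held constant (a DRL agent would modulate it).

def moment_rhs(nb, nm, c, G, kb, km, nbar):
    """Right-hand side of the linear moment equations."""
    dnb = -2.0 * G * c.imag - kb * (nb - nbar)       # phonon occupancy
    dnm = 2.0 * G * c.imag - km * nm                 # magnon occupancy
    dc = 1j * G * (nb - nm) - 0.5 * (km + kb) * c    # correlation <m† b>
    return dnb, dnm, dc

def integrate(G=0.2, kb=1e-5, km=0.1, nbar=1000.0, t_end=400.0, dt=0.01):
    """Fixed-step RK4 integration of the moment equations."""
    nb, nm, c = nbar, 0.0, 0.0j     # phonon starts thermal, magnon empty
    for _ in range(int(t_end / dt)):
        k1 = moment_rhs(nb, nm, c, G, kb, km, nbar)
        k2 = moment_rhs(nb + 0.5*dt*k1[0], nm + 0.5*dt*k1[1],
                        c + 0.5*dt*k1[2], G, kb, km, nbar)
        k3 = moment_rhs(nb + 0.5*dt*k2[0], nm + 0.5*dt*k2[1],
                        c + 0.5*dt*k2[2], G, kb, km, nbar)
        k4 = moment_rhs(nb + dt*k3[0], nm + dt*k3[1],
                        c + dt*k3[2], G, kb, km, nbar)
        nb += dt/6 * (k1[0] + 2*k2[0] + 2*k3[0] + k4[0])
        nm += dt/6 * (k1[1] + 2*k2[1] + 2*k3[1] + k4[1])
        c += dt/6 * (k1[2] + 2*k2[2] + 2*k3[2] + k4[2])
    return nb, nm

nb_final, nm_final = integrate()
cooling_quotient = nb_final / 1000.0   # n_b / n_T, well below 1 after cooling
```

In the DRL setting, the constant G would be replaced by the (complex) action chosen anew at every time step δt, and the instantaneous moments would be fed back to the agent as observations.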
In Fig. 2(a), we show the cooling quotient, ñ_b, for the DRL-optimized controls as a function of G̃_max (the maximum of the coupling parameter). We consider ∆_m = ω_b, and damping rates κ_b/ω_b = 10^−5 and κ_m/ω_b = 0.1. It is found that as the coupling is increased towards the ultrastrong coupling regime (G̃ ≳ ω_b), the cooling becomes much faster. The DRL-optimized complex pulse sequence is shown as a polar plot for G̃_max/(√2 ω_b) = 5 in Fig. 2(b). We denote the minimum time for cooling achieved by our method as τ_DRL. Fig. 2(c) compares this time, as a function of G̃_max, with the sideband cooling time limit, τ_SB, which represents the shortest cooling time possible with such methods, which work only when the rotating wave approximation (RWA) is applicable (see Supplemental Material). With the DRL-based coupling modulations, we achieve cooling times approximately two orders of magnitude below the sideband cooling limit, which are lowered further in the ultrastrong coupling regime.
Tripartite opto-magno-mechanical cooling.− Now, we show that the proposed scheme can be extended effectively to more complex systems, for example a higher-order tripartite opto-magno-mechanical system, where we intend to cool the motional mode through non-trivial three-mode interactions. For this, we consider a system comprising a levitated YIG sphere in a harmonic trap [39][40][41], along with a driven whispering-gallery-mode (WGM) optical microresonator placed along the x̂-direction with a magnetostrictive rod (MR) attached to it, as depicted in Fig. 1(c); this system serves as the RL-environment in Fig. 1(a) of the DRL model. In a large external homogeneous magnetic field, B_0, applied in the ẑ-direction, the YIG sphere is magnetized to the saturation magnetization, M_s, and the homogeneously magnetized fundamental magnon mode (Kittel mode) produces a change in the axial length of the MR, which modulates the WGM optical mode frequency [2,42,43,44]. This gives rise to a coupling between the WGM optical mode (a) and the magnon mode (m) of the form Ω_S a†a(m + m†), where Ω_S = ∆ω is the optical frequency shift. The magnon mode can also be coupled to the COM motion of the YIG sphere by applying a spatially inhomogeneous, time-dependent external magnetic field, H_g(y, t) [45,46], which satisfies the weak-driving, |H_g(y, t)| ≪ H_0, and small-curl, |(∇ × H_g(y, t)) · r| ≪ |H_g(y, t)|, conditions (see the Supplemental Material for more information). Considering a time-varying gradient magnetic field of the form H_g(y, t) = (b_g(t)/µ_0) y ŷ (b_g in units of T/m), the interaction Hamiltonian of the COM motion in the ŷ-direction (frequency ω_b) and the magnon mode takes the form Ω_P(t)(m + m†)(b + b†), where the coupling rate Ω_P(t) is set by the gradient amplitude b_g(t), the saturation magnetization M_s, and the mass density of YIG, ρ = 5170 kg/m^3. In the rotating frame of the cavity drive and the displacement picture of the average field in each mode [47], the complete Hamiltonian is described by

H̃ = ∆_a a†a + ω_m m†m + ω_b b†b + Ω̃_S(a + a†)(m + m†) + Ω_P(m + m†)(b + b†),

where ∆_a is the cavity detuning and Ω̃_S is the driving-enhanced optomagnonic coupling rate.
One can modulate Ω̃_S via the external drive, whereas the phonon-magnon coupling Ω_P can be modulated using the time-varying magnetic field gradient. In order to cool the COM motion, we intend to transfer the phonon population from the COM mode to the optical mode without populating the magnon mode. The damping rates of the cavity, magnon, and COM modes are κ_i, and the corresponding thermal populations at temperature T are n_i^T, with i ∈ {a, m, b}. Since the optical cavity mode oscillates at a high frequency, its thermal bath has essentially zero occupancy even at room temperature, whereas the phonon and magnon baths are thermally occupied.
Similar to the bipartite system discussed above, we next use the second-order moment equations to solve the dynamics of the system, and the DRL scheme is used to optimize the controls, Ω̃_P/S, by maximizing the net reward signal per episode, with reward r_t = −ñ_b − λñ_m, where λ is a constant chosen such that the magnon mode does not get populated. Given that the COM mode frequencies of the YIG sphere are of the order of ω_b/2π ∼ 10-100 kHz, and the magnon frequency is ω_m/2π ∼ 10 GHz, the natural choice for ω_b/2π = 100 kHz is ω_m = 10^5 ω_b. With such a large frequency difference with the intermediate magnon mode at ω_m = 10^5 ω_b, this constitutes a largely detuned system (ω_m ≫ √(Ω̃_P² + Ω̃_S²)). In such a system, while the magnon mode is usually decoupled, the ideal time limit to obtain a swap between the mechanical quanta and the optical mode with ideal Raman pulses is given by τ_lim = πω_m/(2Ω̃_PΩ̃_S) (see the Supplemental Material). However, this limit holds only as long as the RWA is valid. The effectiveness of this kind of cooling is also highly reduced in the presence of damping, and going beyond the RWA to decrease the cooling time limit is not possible because of the counter-rotating dynamics. We apply the DRL strategy in a regime where the Ω̃_i's are sufficiently high to access cooling limits not obtainable by these conventional means, while keeping the counter-rotating dynamics under control. While the method of coupled second-order moments reduces the computational resources drastically, simulating the dynamics with the realistic parameter ω_m = 10^5 ω_b and high coupling strengths, for example Ω̃_S^max = Ω̃_P^max = 100 ω_b, turns out to be computationally highly intensive, owing to the very stiff set of differential equations. Hence, we adopt a two-step training procedure for the problem.
In this protocol, we first use an auxiliary system with ω_m = 10^3 ω_b and Ω̃_S^max = Ω̃_P^max = 10 ω_b, for which the set of equations can be solved without much computational effort. The trained auxiliary model is then used as a supervisor/teacher for the actual system, which we call primary, with ω_m = 10^5 ω_b and Ω̃_S^max = Ω̃_P^max = 100 ω_b. Training the primary system for a few hundred episodes with periodic evaluation of the RL-agent yields the best trained model. In the RL literature, such a scheme is known as imitation learning, and it illustrates the generalizability of RL-trained models. In Fig. 3(a), the cooling quotients of the photon, magnon and phonon modes, ñ_i = n_i/n_T, are shown, where the n_i's are the mean occupancies and n_T is the phonon thermal bath population. The cooling time limit for this system is given by τ_lim ω_b/(2π) ∼ 5 (see the Supplemental Material). The plot shows the population dynamics in the three modes for the DRL-based coupling parameters with a maximum allowed value, Ω̃_i^max/ω_b = 100. One can see that while the magnon mode is kept at a constant population, there is a scattering between the mechanical and optical modes that gives rise to five orders of magnitude of cooling in the mechanical mode. This draws an analogy to the stimulated Raman adiabatic passage (STIRAP) techniques for three-mode systems. However, it is well-known that such adiabatic techniques require longer times and work ideally only as long as the counter-rotating terms are absent (see the Supplemental Material). On the contrary, the coupling parameters found by our DRL protocol are nontrivial, as shown in Fig. 3(b), and the overall time required for cooling is reduced below the adiabatic limit even for high values of coupling. In the bottom panel of Fig. 3(a) we show the cooling time required by our DRL method for even larger values, where higher coupling parameters yield even lower time limits.
Conclusion.− To conclude, in this work we address the problem of reducing the time required for cooling bosonic motional modes below the time limit accessible by well-known scattering-based cooling methods. While such conventional methods of cooling do not work in the strong and ultrastrong coupling regimes, where the RWA is not valid, we design a DRL-based algorithm that works within and beyond such regimes, and show that by accessing the ultrastrong coupling limit with DRL-designed pulses the cooling limit can be broken, resulting in accelerated cooling. We further show how the protocol can be adapted for cooling in a tripartite system with an opto-magno-mechanical interaction, which represents a more complex task for the DRL control owing to the huge dimensionality of the Hilbert space; we find nontrivial three-mode interactions leading to accelerated cooling that breaks the coherent cooling time limits. Thus, this study outlines a comprehensive toolbox for the application of DRL for fast and efficient quantum control in a magnonic system within and beyond RWA restrictions, which can be adapted to other quantum systems of interest. For future and ongoing efforts in quantum technology, this is expected to play a pivotal role, especially in conjunction with various laboratory experiments.

Supporting Information REINFORCEMENT LEARNING AND SOFT ACTOR-CRITIC ALGORITHM
Reinforcement learning (RL) has been one of the most exciting fields of machine learning (ML) following a few revolutionary works carried out during 2013-2016 by DeepMind and Google [21,22]. Although these applications were initially demonstrated on board games, researchers around the world are now looking into their potential uses in different technological applications. In RL, a software agent (we will call it the RL-agent) makes observations of and applies actions on an environment (we call it the RL-environment), which causes a change in the dynamics of the environment; in return, the RL-agent receives scalar values as rewards. The objective of the RL-agent is to maximize the total reward obtained in a particular time range (we call it an episode). This training approach is distinct from the typical approaches in which an ML model is trained on predefined data with labels (supervised learning) or without labels (unsupervised learning).
FIG. S1. The workflow of deep reinforcement learning (DRL) is shown. The RL consists of two main characters -the RL-environment and the RL-agent. The RL-agent, in the case of DRL, comprises artificial neural networks which operate as function approximators. The RL-agent applies certain actions on the RL-environment and receives certain quantities as observations along with a scalar reward signal, based on which the neural network is updated to adopt a better policy for improved rewards.
The workflow of a typical RL setup is shown in Fig. S1. Here, the block named the RL-agent is depicted as comprising a set of neural network layers which act as function approximators. This type of RL is known as deep reinforcement learning (or DRL for short). On the other hand, the block on the right depicts the RL-environment, which is the world wherein the RL-agent lives and whose dynamics the RL-agent can affect via its actions. In RL terminology, the environment signifies the real/virtual space in which the rules of the physics of the problem/system are encoded, along with a reward-estimation function returning real scalar values that determine whether an applied action has led to desirable changes or not. In practice, the RL-agent makes a certain number of interactions, at first applying random actions on the RL-environment, based on which the weights of the nonlinear neural-network function estimators are updated. The procedure is repeated several hundred or thousand times, depending on the complexity of the problem. Such a learning process is analogous to human learning, in which the agent learns by conducting trial-and-error experiments on the RL-environment. Depending on the type of RL algorithm used, the neural-network-based agent can explore possibilities not usually explored, thus outperforming an analogous supervised-learning agent.
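The interaction loop described above can be sketched generically. The toy environment and random policy below are illustrative stand-ins, not the magnonic RL-environment used in this work; they only show the agent-environment contract (observe, act, receive reward).

```python
import random

class ToyEnv:
    """Illustrative RL-environment: a 1D state the agent can push around.

    The reward is -|state|, so the agent is rewarded for steering the
    state towards zero. This is a stand-in, not the physical model.
    """
    def reset(self):
        self.state = 10.0
        self.steps = 0
        return self.state

    def step(self, action):
        self.state += action          # the action changes the dynamics
        self.steps += 1
        reward = -abs(self.state)     # scalar reward signal
        done = self.steps >= 50       # fixed episode length
        return self.state, reward, done

def random_policy(obs):
    # A trained agent would map observations to actions here;
    # before training, actions are essentially random exploration.
    return random.uniform(-1.0, 1.0)

env = ToyEnv()
obs = env.reset()
episode_return = 0.0
done = False
while not done:                       # one episode of interaction
    action = random_policy(obs)
    obs, reward, done = env.step(action)
    episode_return += reward
```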
The algorithm that the RL-agent uses to determine its actions is called its policy, which in the case of DRL is represented by the neural network itself, taking the observations as input and outputting the actions. A policy can be deterministic, in which case the action of the RL-agent is determined by the policy parameters θ for a given state s_t at time t: a_t = µ_θ(s_t), or stochastic, in which case the actions are sampled from a probability distribution conditioned on s_t: a_t ∼ π_θ(·|s_t). The task of the RL-agent is to optimize the policy parameters θ to maximize the net discounted reward R(τ) = Σ_{t=0}^∞ γ^t r_t over a trajectory τ = (s_0, a_0, s_1, a_1, ...), for a discount factor γ ∈ (0, 1). To achieve this, the expected return over the discounted rewards, J(π) = E[R(τ)], is optimized to obtain the optimal policy π* = argmax_π J(π).
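The discounted return R(τ) = Σ_t γ^t r_t can be illustrated with a one-line numeric example (the reward sequence here is arbitrary):

```python
def discounted_return(rewards, gamma):
    """R = sum_t gamma^t * r_t for a finite reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Three unit rewards with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
R = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```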
As a fundamental concept in RL, the value function gives a prediction of the expected, cumulative, discounted, future reward, and provides a measure of how well a given state s or state-action pair (s, a) behaves in generating a higher net return. The state value V^π(s) = E[R_t|s_t = s] is defined as the expected return for following policy π from state s. On the other hand, the action value Q^π(s, a) = E[R_t|s_t = s, a_t = a] is the expected return for selecting action a in state s and then following policy π. The value functions obey the so-called Bellman equations [20], which can be solved self-consistently. For example, the optimal action-value function obeys

Q(s, a) = E[r(s, a) + γ max_{a'} Q(s', a')]. (S4)

There are several approaches to optimize the policy, which can be broadly categorized into three types, namely (a) policy-gradient-based, (b) value-based, and (c) actor-critic-based methods. In value-based approaches, such as Q-learning methods, the policy is optimized to obtain the maximum value functions by solving the Bellman equations as discussed above. On the other hand, in policy-gradient methods the policy parameters are optimized using gradient-based algorithms on the expected return,

∇_θ J(π_θ) = E_τ[Σ_t ∇_θ log π_θ(a_t|s_t) R(τ)],

where E_τ represents the expectation value over the trajectory τ. This basic approach can be improved by using a baseline function, b(s_t), to reduce the variance of the gradient estimation, and forms the basis of the more advanced state-of-the-art DRL actor-critic algorithms. In their most generalized form, policy-gradient methods work by optimizing the following objective (loss) function,

L(θ) = E_t[log π_θ(a_t|s_t) Â_t],

where π_θ is a stochastic policy and Â_t = Q(s_t, a_t) − V(s_t) is an estimator of the advantage function at timestep t, since R_t is an estimate of Q(s_t, a_t).

FIG. S2. The workflow of DRL in the soft actor-critic (SAC) setting is shown. In the actor-critic framework, the RL-agent has two independent neural networks for policy evaluation and policy improvement to guide the choice of actions, known respectively as the critic and actor networks. The critic, in the SAC framework, consists of two neural networks, Q1 and Q2, for improved policy evaluation through Q-function updates. Based on the suggestion of the critic (in terms of the estimation of the discounted future rewards and the entropy-regularized critic loss for SAC), the actor network is updated, and the next action is chosen.
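The advantage estimator can be sketched with Monte-Carlo returns-to-go and a state-value baseline; the reward sequence and baseline values below are illustrative, not taken from the experiments.

```python
def returns_to_go(rewards, gamma):
    """R_t = sum_{k>=t} gamma^(k-t) * r_k, accumulated backwards."""
    out, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

rewards = [0.0, 0.0, 1.0]                 # sparse reward at the end
R = returns_to_go(rewards, gamma=0.9)     # [0.81, 0.9, 1.0]
baseline = [0.5, 0.5, 0.5]                # illustrative V(s_t) estimates
advantages = [r - b for r, b in zip(R, baseline)]   # A_t = R_t - V(s_t)
```

Subtracting the baseline shifts each return without changing the gradient's expectation, which is exactly the variance-reduction role of b(s_t) described above.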
An actor-critic algorithm learns both a policy and a state-value function, and the value function is used for bootstrapping, i.e., updating a state from subsequent estimates, to reduce variance and accelerate learning [20], while the critic updates action-value function parameters, and the actor updates policy parameters, in the direction suggested by the critic.

The Soft Actor-Critic (SAC) algorithm is a recently proposed actor-critic algorithm [38] in the field of reinforcement learning. The main difference of SAC compared to other actor-critic methods is that its policy is optimized in an entropy-regularized manner and is inherently stochastic. The policy is trained to maximize a tradeoff between expected return and entropy, with an increase in entropy leading to more exploration and, in addition, preventing the policy from converging prematurely. The following discussion closely follows the theory presented in [48].
In entropy-regularized RL, the agent gets a bonus reward at each time step proportional to the entropy of the policy, where H(P) = E_{x∼P}[−log P(x)] denotes the entropy computed from the probability distribution P, and α > 0 is the trade-off coefficient. The corresponding value functions in this setting, V^π and Q^π, are changed to

V^π(s) = E_{τ∼π}[Σ_{t=0}^∞ γ^t (r(s_t, a_t) + αH(π(·|s_t))) | s_0 = s],

Q^π(s, a) = E_{τ∼π}[Σ_{t=0}^∞ γ^t r(s_t, a_t) + α Σ_{t=1}^∞ γ^t H(π(·|s_t)) | s_0 = s, a_0 = a].

With these definitions, V^π and Q^π are connected by

V^π(s) = E_{a∼π}[Q^π(s, a)] + αH(π(·|s)).

The Bellman equation for Q^π becomes

Q^π(s, a) = E[r(s, a) + γ(Q^π(s', ã') − α log π(ã'|s'))], ã' ∼ π(·|s'),

where the expectation over the next states s' and the rewards r comes from the replay buffer, while the next actions ã' are sampled fresh from the policy. SAC concurrently learns a policy π_θ and two Q-functions, Q_φ1 and Q_φ2, sets up a mean-squared Bellman error loss for each Q-function,

L(φ_j) = E_{(s,a,r,s',d)∼D}[(Q_φj(s, a) − y(r, s', d))²], j = 1, 2,

and takes the minimum Q-value between the two Q approximators, where the target is given by

y(r, s', d) = r + γ(1 − d)(min_{j=1,2} Q_φtarg,j(s', ã') − α log π_θ(ã'|s')), ã' ∼ π_θ(·|s'). (S13)

For the studies reported in this article, we design the RL-environments using the framework template of OpenAI Gym [49]. The RL-agents are modelled using the SAC policy, with two critic models, following the implementation available in the open-source software package Stable-Baselines3 [50]. For modelling the neural networks and optimizing the network parameters we have used the PyTorch [51] ML module.
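The clipped double-Q target of Eq. (S13) reduces to a few lines of arithmetic; the numeric sketch below uses scalar stand-ins for the network outputs (in practice Q1, Q2, and log π come from the critic and actor networks).

```python
import math

def sac_target(r, gamma, d, q1_next, q2_next, logp_next, alpha):
    """y(r, s', d) = r + gamma (1 - d) (min(Q1, Q2) - alpha * log pi).

    d is the episode-termination flag; taking the minimum of the two
    target critics curbs Q-value overestimation, and -alpha * log pi
    adds the entropy bonus of the SAC objective.
    """
    return r + gamma * (1.0 - d) * (min(q1_next, q2_next) - alpha * logp_next)

# Illustrative numbers: the entropy bonus raises the target above the
# plain clipped double-Q value of 1 + 0.99 * 4.0.
y = sac_target(r=1.0, gamma=0.99, d=0.0,
               q1_next=5.0, q2_next=4.0,
               logp_next=math.log(0.5), alpha=0.2)
```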

THE BIPARTITE MAGNO-MECHANICAL MODEL AND SIDEBAND COOLING TIME LIMIT
The effective two-mode magno-mechanical model shown in Fig. 1(b) in the main text consists of a YIG sphere placed in a microwave (MW) cavity, where a uniform magnetic field, B_0, is applied in the z-direction, inducing spin-wave magnons precessing in the xy-plane. The magnetic field component of the cavity mode couples to the magnons through magnetic dipole coupling [16,37]. The YIG sphere also acts as an excellent mechanical resonator because of the magnetostrictive effect, which gives rise to vibrational acoustic phonon modes [12,15]. We consider the magnon modes to be driven by a MW field with frequency ω_p and amplitude ε_p [15,17]. The Hamiltonian of the system in a frame rotating at the magnon drive frequency is given by

H = δ_a a†a + δ_m m†m + ω_b b†b + J(a†m + am†) + G m†m(b + b†) + iε_p(m† − m),

where a(a†), m(m†) and b(b†) are the annihilation (creation) operators for the cavity, magnon and phonon modes respectively, δ_i = ω_i − ω_p are the detunings, J is the magnon-photon coupling strength, and G is the magno-mechanical coupling rate. The Rabi frequency ε_p is given by √(5N) γB_0/4, where γ/2π = 28 GHz/T is the gyromagnetic ratio, N = ρ_n V is the total number of spins, ρ_n = 4.22 × 10^27 m^−3 is the spin density of YIG, and V is the volume of the sphere.
The quantum Langevin equations for the average dynamics of the fluctuation operators are given by

⟨ȧ⟩ = −(iδ_a + κ_a)⟨a⟩ − iJ⟨m⟩,
⟨ṁ⟩ = −(iδ_m + κ_m)⟨m⟩ − iJ⟨a⟩ − iG̃*(⟨b⟩ + ⟨b†⟩),
⟨ḃ⟩ = −(iω_b + κ_b)⟨b⟩ − i(G̃⟨m⟩ + G̃*⟨m†⟩),

where G̃ is the driving-enhanced magno-mechanical coupling. When the cavity has a low Q-factor, with a damping rate κ_a ≫ {J, G̃, κ_m, κ_b}, the cavity field can be adiabatically eliminated, which gives rise to an effective magno-mechanical Hamiltonian of the form

H̃ = ∆_m m†m + ω_b b†b + (G̃m + G̃*m†)(b + b†),

where the effective parameters are given by ∆_m = δ_m − |J|²δ_a/(δ_a² + κ_a²) and κ̃_m = κ_m + |J|²κ_a/(δ_a² + κ_a²). Hence, when the sphere is placed at a node of the cavity magnetic field, or the cavity damping is very high, the system effectively reduces to a two-mode coupled system with a magno-mechanical interaction.
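The effective detuning and damping obtained from the adiabatic elimination can be checked numerically; the parameter values below are illustrative, chosen only to exhibit the low-Q limit.

```python
def effective_params(delta_m, kappa_m, delta_a, kappa_a, J):
    """Effective magnon detuning and damping after adiabatically
    eliminating a cavity mode (detuning delta_a, damping kappa_a)
    coupled to the magnon with strength J."""
    denom = delta_a**2 + kappa_a**2
    delta_eff = delta_m - (abs(J)**2) * delta_a / denom   # Lamb-like shift
    kappa_eff = kappa_m + (abs(J)**2) * kappa_a / denom   # extra damping
    return delta_eff, kappa_eff

# Low-Q cavity (kappa_a >> J): the magnon inherits only a small
# frequency shift and a small additional damping, as expected.
d_eff, k_eff = effective_params(delta_m=1.0, kappa_m=0.1,
                                delta_a=1.0, kappa_a=100.0, J=1.0)
```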
The magnon and mechanical modes have largely different frequencies, with ω_m ∼ 2π × 10 GHz and ω_b ∼ 2π × 10 MHz; therefore the magnon mode behaves as a 'cold' sink at zero temperature for the 'hot' mechanical mode. At the red sideband, ∆_m = ω_b, the resonant interaction in the weak coupling regime between the mechanical mode and the magnon anti-Stokes sideband gives rise to a transfer of thermal quanta from the mechanical mode to the magnon mode, which is the working principle of the well-known sideband cooling technique. In Fig. S3 we show the sideband cooling dynamics for several values of G̃ in the regime G̃ < ω_b. The time limit for such irreversible cooling is denoted by τ_SB. From the figure, one can estimate the steady-state cooling time limit and the optimum cooling quotient, ñ_b. Considering an optimum cooling quotient of ñ_b ≈ 10^−4, the best steady-state cooling with such a sideband mechanism is found to occur at a time of the order of τ_SB ω_b/(2π) ≈ 40. As can be seen from the figure, such sideband cooling is not possible for higher values of the magnomechanical coupling, G̃, because of the effect of the counter-rotating terms beyond the RWA. In our DRL scheme, we show that cooling can be obtained even in the strong coupling regime with a proper choice of coupling modulations, and, more interestingly, it leads to a lower time limit for cooling.

FIG. S4. (Top) The learning curve of the RL-agent is shown for the bipartite magno-mechanical model shown in Fig. 1(b) in the main body of the article. In the upper panel of the figure, the mean net reward R̄(N) is shown over each episode N within the training period of the RL-agent. The corresponding cooling quotient over each episode N is shown in the lower panel of the figure. To understand the learning process of the RL-agent, we divide the learning curve into four distinct regions, namely A, B, C, and D.
The typical performance of the RL-agent within these regions is evaluated and plotted respectively in panels A, B, C, and D below the learning curves. In each plot, the variation of the cooling quotient, ñ_b, is shown for an episode in which the controls, shown as polar plots in the inset, are determined by the selected RL-agent having acquired knowledge up to that point of the learning curve. In A and B, the RL-agent tries more or less random combinations of the controls and learns the big lessons needed to achieve a large net return. In A, the RL-agent learns to avoid increasing the cooling quotient, while in B it learns to choose controls such that ñ_b can be held constant at ñ_b ∼ 1. The RL-agent must explore for a relatively long time to gain control and move beyond this, which does not happen until about N = 8000. Most of the learning to achieve cooling takes place in region C, where the RL-agent learns the finer details of the problem by fine-tuning its controls to achieve an overall larger return, which improves further in the learning period marked by D.

For the two-mode magnomechanical system, the control parameter is the coupling strength G̃(t), which can be complex (see Eq. 1 of the main text). In the DRL model, the real and imaginary parts of G̃ are the two actions learned by the DRL-agent through trial and error. In a given iteration of the RL workflow at time t, the RL-agent chooses two actions, based on which the system (RL-environment) evolves the dynamics to t + δt, where δt = 0.1 × 2π/ω_b, during which G̃(t) remains fixed. The observations that the RL-agent receives from the RL-environment comprise the instantaneous values of the moments obtained by solving the second-order coupled differential equations. The actions of the RL-agent are also appended to the list of observations, which we found to be advantageous for learning. The SAC controller is built with the following hyperparameter settings:
Network hidden layer sizes: 512 × 256 × 256 × 128
Activation function: Rectified Linear Unit (ReLU)
Learning rate: 10⁻⁴
Buffer size: 1,000,000
Batch size: 512
Soft update coefficient, τ: 0.005
Discount factor, γ: 0.99
Entropy regularization coefficient, α: adaptive, starting from 0.1
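The hyperparameters above can be collected in one place, and the soft update coefficient τ has a concrete meaning worth spelling out: SAC blends the online network weights into the target network by the Polyak rule θ' ← τθ + (1 − τ)θ'. The `SACConfig` name and `polyak_update` helper below are illustrative, not the authors' code:

```python
from dataclasses import dataclass

# Hyperparameters from the text, gathered in one dataclass. The class and
# function names here are illustrative sketches, not the authors' implementation.
@dataclass
class SACConfig:
    hidden_layers: tuple = (512, 256, 256, 128)
    activation: str = "relu"
    learning_rate: float = 1e-4
    buffer_size: int = 1_000_000
    batch_size: int = 512
    tau: float = 0.005   # soft (Polyak) target-update coefficient
    gamma: float = 0.99  # discount factor
    alpha: float = 0.1   # initial entropy coefficient (tuned adaptively)

def polyak_update(target, online, tau):
    """Soft target-network update used by SAC: theta' <- tau*theta + (1-tau)*theta'."""
    return [tau * w + (1.0 - tau) * wt for w, wt in zip(online, target)]

cfg = SACConfig()
new_target = polyak_update(target=[1.0, 0.0], online=[0.0, 1.0], tau=cfg.tau)
```

With τ = 0.005 the target network tracks the online network slowly, which stabilizes the bootstrapped value targets.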
The goal is to optimize the neural-network parameters such that the agent learns the non-trivial combinations of the controls G̃ that reduce the phonon occupancy of the YIG sphere, thereby cooling it significantly. For that, the agent is given the reward function of Eq. (S17), where n_T is the thermal phonon number, b†b is the number operator of the mechanical mode, and ⟨A⟩ denotes the expectation value of an operator A. The task of the RL-agent is to maximize the net reward, R = Σ_{i=1}^{N_t} r_i, over an episode. Note that this, in principle, represents a highly challenging task: for each control (action), the number of possible combinations in a given episode is N_C^{N_t}, where N_t is the total number of time steps allowed within an episode and N_C is the number of discretization levels of the control strength bounded by G̃_max. Since in our case the actions are continuous, the problem is even harder.
The RL-agent is trained for a set of values of G̃_max; the corresponding learning curve is shown in Fig. S4. In these plots, we show how the RL-agent gradually learns the non-intuitive control sequences by optimizing the neural-network weights to attain larger net returns as determined by Eq. (S17). We also show the performance of the RL-agent at different stages to demonstrate the learning process, as explained in more detail in the caption of Fig. S4.

TRIPARTITE OPTO-MAGNO-MECHANICAL MODEL AND THE COOLING TIME LIMIT
For the tripartite model, we consider the cooling of a levitated spherical YIG ferrimagnet with high spin density, trapped in a harmonic potential, wherein spin-wave magnon modes are excited by an external bias magnetic field. In a large homogeneous external magnetic field B_0 applied in the ẑ-direction, the YIG sphere is magnetized to its saturation magnetization, M_s = 5.87 × 10⁵ A m⁻¹, while the small magnetization fluctuation m(r, t) around the fully magnetized state is given by M(r, t) = M_s ẑ + m(r, t). For the homogeneously magnetized fundamental magnon mode (Kittel mode), precessing around the ẑ-axis with eigenfrequency ω_m = |γ|B_0, the magnetization fluctuation is m(r, t) = M_K(x̂ + iŷ)m + h.c. [52], which induces a corresponding magnetic field b(r, t); here M_K is the zero-point magnetization, µ_0 is the vacuum permeability, m is the bosonic magnon operator, and R and V are the radius and volume of the sphere. As schematically shown in the main text, a driven WGM optical microsphere with a magnetostrictive rod (MR) attached to it is placed near the YIG magnet. The magnon-mode field modulates the axial length of the MR, which in turn modulates the WGM optical-mode frequency; this is captured by a coupling of the form Ω_S a†a(m + m†) between the WGM optical mode (a) and the magnon mode (m), where Ω_S = ∆ω is the optical frequency shift, i.e., the single-photon magnon-cavity coupling.
The magnon mode can also be coupled to the COM motion of the YIG sphere by applying a spatially inhomogeneous, time-dependent external magnetic field, H_g(y, t) [45,46], which satisfies the weak-driving, |H_g(y, t)| ≪ H_0, and small-curl, |∇ × H_g(y, t) · r| ≪ |H_g(y, t)|, conditions. The COM motion of the magnet, with frequencies ω_x, ω_y, and ω_z in the x̂, ŷ, and ẑ directions, is described by the Hamiltonian H_b = Σ_{j=x,y,z} ω_j b_j† b_j, where the bosonic ladder operators describe annihilation and creation of a motional quantum along the direction j. The quantum Hamiltonian describing the interaction between the COM motion and the spin-wave magnetization due to the gradient magnetic field is H_mb(t) = −(µ_0/2) ∫ dV [M_s ẑ + m(r)] · H_g(y, t).

FIG. S5. Effectiveness of the STIRAP protocol for the tripartite opto-magno-mechanical system, compared for different Ω̃^max_{P/S} with numerically optimized counterintuitive Gaussian pulses typically used in such protocols. The performance with (dotted lines) and without (solid lines) damping is shown. For (a) and (b), the protocol cools the phonon mode (ñ_b) by more than three orders of magnitude without damping. For Ω̃^max_{P/S} = 10ω_b, shown in (c), and Ω̃^max_{P/S} = 12ω_b, in (d), the STIRAP protocol cannot lead to any effective cooling because of the counter-rotating terms of the non-RWA model. This is seen even more clearly for Ω̃^max_{P/S} = 15ω_b, shown in (e), where the populations increase, heating the system. In (f), the minimum achievable phonon occupation, ñ_b^min, is compared for the above choices of Ω̃^max_{P/S} with (dotted) and without (solid) damping. In (g), the time required to cool the phonon occupation below 10⁻³, or to attain the minimum occupation, τ_s^min, is compared with (dotted) and without (solid) damping.
Considering a time-varying gradient magnetic field of the form H_g(y, t) = (b_g(t)/µ_0) y ŷ (b_g in units of T/m), the interaction Hamiltonian between the COM motion in the ŷ-direction (frequency ω_y ≡ ω_b, with b_y ≡ b) and the magnon mode is H_mb(t) = Ω̃_P(t)(b + b†)(m + m†), with Ω̃_P(t) = (b_g(t)/4)√(|γ|M_s/(ρω_b)), where ρ = 5170 kg/m³ is the mass density of YIG. Taking all these interactions into account, in the rotating frame of the cavity drive and in the displacement picture of the average field in each mode [53], one obtains the Hamiltonian H̃, where ∆_a is the cavity detuning and Ω̃_S is the driven optomagnonic coupling. In this tripartite system, the optical and COM modes are not directly coupled; however, both are coupled to the magnon mode.
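The coupling topology just described (optical and COM modes each coupled to the magnon mode, but not to each other) can be sketched on truncated Fock spaces. This is a dependency-free NumPy sketch rather than the authors' implementation, and it assumes the linearized, displaced-frame form in which both couplings enter as (x-type) quadrature products:

```python
import numpy as np

# Sketch of the linearized tripartite Hamiltonian on truncated Fock spaces.
# Assumption: both couplings take the displaced-frame quadrature form
# Omega*(x_mode)*(x_magnon); the mode ordering (a, m, b) is a free choice.
def destroy(n):
    """Bosonic annihilation operator in an n-level Fock basis."""
    return np.diag(np.sqrt(np.arange(1, n)), k=1)

def tripartite_H(delta_a, omega_m, omega_b, Omega_S, Omega_P, n=4):
    I = np.eye(n)
    a = np.kron(np.kron(destroy(n), I), I)   # optical mode
    m = np.kron(np.kron(I, destroy(n)), I)   # magnon mode
    b = np.kron(np.kron(I, I), destroy(n))   # COM (phonon) mode
    # a and b couple only through m; there is no direct a-b term.
    return (delta_a * a.conj().T @ a + omega_m * m.conj().T @ m
            + omega_b * b.conj().T @ b
            + Omega_S * (a + a.conj().T) @ (m + m.conj().T)
            + Omega_P * (b + b.conj().T) @ (m + m.conj().T))

H = tripartite_H(delta_a=0.0, omega_m=1000.0, omega_b=1.0,
                 Omega_S=10.0, Omega_P=10.0)
```

A Hermiticity check on H is a quick sanity test that the couplings were assembled consistently; for production work a library such as QuTiP would be the natural choice.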
In the large-detuning condition, ω_m ≫ √(Ω̃_P² + Ω̃_S²), and under the RWA, the model can be reduced to an effective bipartite system with Hamiltonian H̃_eff, where ∆_eff = (Ω̃_P² − Ω̃_S²)/(2ω_m) and Ω_eff = Ω̃_S Ω̃_P/ω_m. This results in a transfer-time limit of τ̃_lim = πω_m/(2Ω̃_S Ω̃_P). In the three-mode system given by the full tripartite model under the RWA, a widely applicable scheme for irreversible population transfer between uncoupled modes is to apply STIRAP-like ideal pulses [54]. However, adiabatic population transfer between the optical and COM modes of this system can occur only when the total operation time of the pulse sequences satisfies T ≫ τ̃_lim. In Fig. S5, we show the cooling dynamics with ideal STIRAP-like Gaussian pulses, obtained with a counterintuitive sequence of the coupling parameters for transferring quanta from the COM mode to the optical mode. The cooling time is of the order of the limit τ̃_lim given above, which shows that these pulses are not adiabatic but rather give the shortest time for excitation transfer. The performance in terms of the cooling quotient can be seen from the figure. Since this time limit is the shortest possible cooling time, one has to compromise on the effective cooling achieved. For coupling parameters in the range Ω̃_i^max/ω_b = {6, 8, 10}, the RWA is valid, whereas for values of Ω̃_i^max beyond 12ω_b the counter-rotating terms have a significant effect. Therefore, even though increasing the coupling parameters shows signs of lowering the transfer time, effective cooling cannot be obtained in these cases. On the contrary, our proposed DRL scheme finds finely tuned pulses that go beyond such limits to obtain effective and much faster cooling.

FIG. S6. (Top) The learning curve of the RL-agent for the three-mode opto-magno-mechanical model shown in Fig. 1(c) of the main text. The upper panel shows the mean net reward R̄(N) (scaled by the thermal number) over each episode N within the training period; the lower panel shows the corresponding cooling quotient of the mean phonon number averaged over each episode, n̄_b(N). A zoomed view of the region beyond N = 1k is shown in the inset. To understand the learning process of the RL-agent better, we divide the learning curve into four distinct regions, namely A, B, C, and D. The typical performance of the RL-agent within these regions is evaluated and plotted in panels A, B, C, and D, respectively. In each plot, the variations of ñ_b (green), ñ_m (orange), and ñ_a (blue) are shown for an episode in which the controls Ω̃_{P/S}, shown in the lower panels (similar to Fig. 3(b) in the main text), are determined by the RL-agent with the knowledge acquired up to that point of the learning curve. In region A, the RL-agent explores the large parameter space more or less randomly, searching for controls that reduce the mean phonon number below its thermal value, n_T. By the end of stage A, the RL-agent has largely learned that applying constant control pulses cannot transfer population out of the phonon mode. In the middle part of B, the RL-agent learns that considerably higher net returns are possible if large-amplitude control pulses are followed by smaller ones. The RL-agent then learns to apply the smaller controls at precisely the right times, leading to better cooling through population transfer in regions C and D.
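The effective-model quantities and the counterintuitive pulse ordering can be written out directly from the expressions above. The pulse centers and widths below are illustrative choices, not the numerically optimized ones used in Fig. S5:

```python
import numpy as np

# Effective bipartite parameters and transfer-time limit, exactly as given in
# the text: Delta_eff = (OP^2 - OS^2)/(2 w_m), Omega_eff = OS*OP/w_m,
# tau_lim = pi*w_m/(2*OS*OP).
def effective_params(Omega_P, Omega_S, omega_m):
    delta_eff = (Omega_P**2 - Omega_S**2) / (2.0 * omega_m)
    Omega_eff = Omega_S * Omega_P / omega_m
    tau_lim = np.pi * omega_m / (2.0 * Omega_S * Omega_P)
    return delta_eff, Omega_eff, tau_lim

def stirap_pulses(t, Omega_max, t_S=4.0, t_P=6.0, width=1.5):
    """Counterintuitive Gaussian pair: the Stokes pulse peaks BEFORE the pump.
    Centers and widths are illustrative, not the optimized values of Fig. S5."""
    Omega_S = Omega_max * np.exp(-((t - t_S) / width) ** 2)
    Omega_P = Omega_max * np.exp(-((t - t_P) / width) ** 2)
    return Omega_P, Omega_S

d, O_eff, tau = effective_params(Omega_P=10.0, Omega_S=10.0, omega_m=1000.0)
t = np.linspace(0.0, 10.0, 1001)
OP, OS = stirap_pulses(t, Omega_max=10.0)
assert t[np.argmax(OS)] < t[np.argmax(OP)]  # Stokes precedes pump
```

For Ω̃_P = Ω̃_S = 10ω_b and ω_m = 1000ω_b this gives ∆_eff = 0, Ω_eff = 0.1ω_b, and τ̃_lim = 5π/ω_b, illustrating why larger couplings are attractive for shortening the transfer time.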

THE DRL CONTROLLER FOR THE TRIPARTITE OPTO-MAGNO-MECHANICAL SYSTEM
The tripartite system is modelled similarly to the two-mode model described earlier, except that an additional Gaussian-noise layer is added to the final layer of the network to facilitate more exploration and uncover the non-trivial controls. Here the controls are Ω̃_{P/S}, taken to be real and positive. As before, the observations are the instantaneous solutions of the coupled differential equations for the second-order moments of the quantum master equation of the tripartite system, along with the controls Ω̃_{P/S}. In a given episode, the RL-agent makes a total of 150 interactions with the RL-environment in time steps of dt = 0.1, each time applying its actions while maintaining a generous exploration-exploitation trade-off. The reward function for this particular case is given in Eq. (S20), where n_T is the thermal phonon number, b†b (m†m) is the number operator of the phonon (magnon) mode, and λ = 10 is chosen to ensure that population is shuffled out of the phonon mode and into the optical cavity mode, cooling the phonon mode without transferring population to the magnon mode. Given the large computational overhead of solving the coupled second-order moment equations for the tripartite opto-magno-mechanical system with the state-of-the-art experimental parameter choice ω_m = 10⁵ ω_b, the problem is solved in two steps. In the first step, we train an auxiliary system with ω_m = 10³ ω_b and Ω̃_{P/S} = 10 ω_b, for which the set of equations can be solved without much computational effort. The trained auxiliary model is then used as a supervisor/teacher for the actual (primary) system, with ω_m = 10⁵ ω_b and Ω̃_{P/S} = 100 ω_b. Training the actual system for a few hundred episodes with periodic evaluation of the RL-agent yields the best trained model. In the RL literature, this can be considered a form of imitation learning.
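One episode of this loop (150 interactions, dt = 0.1) can be sketched in a few lines. Both the environment step and the per-step reward below are hypothetical stand-ins: Eq. (S20) is not reproduced in this excerpt, and the reward shown merely captures its described behavior of penalizing both phonon and magnon occupancy (with weight λ = 10) so that population is pushed into the optical mode:

```python
# Sketch of one training episode: 150 agent-environment interactions with
# dt = 0.1. toy_env_step and reward are hypothetical stand-ins; the real
# environment integrates the second-order moment equations and the real
# reward is Eq. (S20).
N_STEPS, DT, LAM, N_T = 150, 0.1, 10.0, 100.0

def toy_env_step(state, action):
    """Placeholder dynamics: a fraction of phonon quanta moves to the optical
    mode (purely illustrative, not the moment equations)."""
    n_b, n_m, n_a = state
    moved = action * n_b
    return (n_b - moved, n_m, n_a + moved)

def reward(state):
    n_b, n_m, _ = state
    return -(n_b + LAM * n_m) / N_T  # hypothetical stand-in for Eq. (S20)

state, R = (N_T, 0.0, 0.0), 0.0
for _ in range(N_STEPS):
    state = toy_env_step(state, action=0.05)  # constant "policy", for the sketch
    R += reward(state)
```

Because the toy dynamics routes population straight past the magnon mode, the λ-penalty term never fires here; in the real system it is what steers the agent away from magnon-heating control sequences.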
The (auxiliary) RL-agent is trained for the cases Ω̃^max_{P/S} = 10, 12, and 15 (in units of ω_b). The learning curve for the first case is shown in Fig. S6. In these plots, we show how the RL-agent gradually learns the non-intuitive control sequences by optimizing the neural-network weights to attain larger net returns as determined by Eq. (S20). We also show the performance of the RL-agent at different stages of the learning process, as explained in more detail in the caption of Fig. S6.