Deep Reinforcement Learning for Feedback Control in a Collective Flashing Ratchet

A collective flashing ratchet transports Brownian particles using a spatially periodic, asymmetric, and time-dependent on-off switchable potential. The net current of the particles in this system can be substantially increased by feedback control based on the particle positions. Several feedback policies for maximizing the current have been proposed, but optimal policies have not been found for a moderate number of particles. Here, we use deep reinforcement learning (RL) to find optimal policies, with results showing that policies built with a suitable neural network architecture outperform the previous policies. Moreover, even in a time-delayed feedback situation where the on-off switching of the potential is delayed, we demonstrate that the policies obtained by deep RL yield higher currents than the previous strategies.

Introduction -A flashing ratchet is a nonequilibrium model that induces a net current of Brownian particles in a spatially periodic asymmetric potential that can be temporally switched on and off [1][2][3][4]. If one can access the position information of the particles, the current can be greatly improved by feedback control that switches the potential on and off based on this information [5]. Feedback strategies for maximizing the current in flashing ratchets have been extensively studied [4][5][6][7][8][9][10][11][12][13] due to the model's applicability in various disciplines [14]; for instance, flashing ratchets have been used for explaining transport phenomena in biological processes such as ion pumping [15], molecular transport [16], and transport by motor proteins [17][18][19][20]. However, the proposed feedback strategies [4][5][6][7][8][9][10][11] are not optimal policies for a moderate number of particles and require prior information about the system as well.
Thanks to the recent advances in deep learning [21], physicists in diverse fields have been applying it to complex problems that are analytically intractable, e.g. glassy systems [22], quantum matter [23], and others [24]. In particular, reinforcement learning (RL) [25] has shown unprecedented success in previously unsolvable problems through combination with deep neural networks [26][27][28][29]. This framework, so-called deep RL, has become a highly efficient tool for quantum feedback control, showing similar or better performance than previous handcrafted policies [30][31][32][33][34]. In this Letter, we employ deep RL to obtain optimal policies in the collective flashing ratchet model, and validate our approach by application to a time-delayed feedback situation that occurs in actual experiments [12].
Collective flashing ratchet -We consider the collective flashing ratchet model [5], which consists of an ensemble of N non-interacting Brownian particles in contact with a heat bath at temperature T that drift in a spatially periodic asymmetric potential U. The dynamics of the N particles is governed by the following overdamped Langevin equation:

η ẋ_i(t) = α(s_t) F(x_i(t)) + ξ_i(t),  (1)

where x_i(t) is the position of particle i, η is the friction coefficient, and ξ_i is a Gaussian noise with zero mean and correlation

E[ξ_i(t) ξ_j(t′)] = 2ηk_B T δ_ij δ(t − t′),  (2)

where E denotes the ensemble average. Here, α is a deterministic control policy that depends on the set of positions s_t = {x_i(t)}_{i=1}^N with an output of 0 (off) or 1 (on). The force is given by F(x) = −∂U(x)/∂x, where U has spatial period L and barrier height U_0. In all simulations, we set L = 1, k_B T = 1, diffusion coefficient D = k_B T/η = 1, U_0 = 5k_B T, and time step size Δt = 10⁻³ L²/D. The current of the particles in the steady state under policy α is denoted as

E_α[ẋ] = lim_{t→∞} (1/N) Σ_{i=1}^N E[ẋ_i(t)].  (3)

Various policies for maximizing the current (3) have been proposed: the periodic switching policy [4], maximizing the instantaneous current (greedy policy) [5], the threshold policy [6][7][8], and Bellman's criterion [13].
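The overdamped dynamics (1) can be simulated with a simple Euler-Maruyama scheme. The sketch below is illustrative only: the smooth asymmetric potential U(x) = U_0[sin(2πx/L) + 0.25 sin(4πx/L)] is an assumed example shape (the excerpt does not give U explicitly), while the parameter values match those stated in the text.

```python
import numpy as np

# Units and parameters as stated in the text.
L, kBT, D = 1.0, 1.0, 1.0
eta = kBT / D                    # friction coefficient, D = kBT/eta
U0 = 5.0 * kBT                   # potential barrier scale
dt = 1e-3 * L**2 / D             # time step size

def force(x):
    """F(x) = -dU/dx for the assumed potential
    U(x) = U0 [sin(2*pi*x/L) + 0.25 sin(4*pi*x/L)] (illustrative shape)."""
    k = 2.0 * np.pi / L
    return -U0 * k * (np.cos(k * x) + 0.5 * np.cos(2.0 * k * x))

def step(x, alpha, rng):
    """One Euler-Maruyama step of Eq. (1); alpha is 0 (off) or 1 (on)."""
    noise = rng.normal(0.0, np.sqrt(2.0 * D * dt), size=x.shape)
    return x + alpha * force(x) / eta * dt + noise

rng = np.random.default_rng(0)
x = rng.uniform(0.0, L, size=8)  # N = 8 particles, random initial positions
for _ in range(1000):
    x = step(x, alpha=1, rng=rng)  # potential held on, for illustration
```

A feedback policy plugs in at the `alpha` argument, evaluated from the current (or delayed) positions at each step.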
The periodic switching policy [4] is α(t) = 1 for t ∈ [0, T_on), α(t) = 0 for t ∈ [T_on, T_on + T_off), and periodic, α(t + T_on + T_off) = α(t), with optimal periods T_on ≈ 0.03L²/D and T_off ≈ 0.04L²/D. For any N, this policy gives the current E_α[ẋ] ≈ 0.862D/L because it depends not on the positions but only on time.
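As a minimal sketch, the periodic switching rule with the optimal periods quoted above can be written as:

```python
# Periodic switching policy: on for T_on, off for T_off, repeated.
# Times are in units of L^2/D, with the optimal periods quoted in the text.
T_ON, T_OFF = 0.03, 0.04

def alpha_periodic(t):
    """Return 1 (potential on) during [0, T_on) of each period, else 0 (off)."""
    phase = t % (T_ON + T_OFF)
    return 1 if phase < T_ON else 0
```

Because this rule ignores the positions entirely, its current is independent of N.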
The greedy policy [5] is defined as α(s_t) = Θ(f(s_t)), where f(s_t) = (1/N) Σ_{i=1}^N F(x_i(t)) is the mean force and Θ is the Heaviside function given by Θ(z) = 1 if z > 0 and 0 otherwise. While the greedy policy is optimal for N = 1, it is outperformed by the periodic switching policy for large N.
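The greedy rule is a one-liner: switch on exactly when the mean force over the ensemble is positive. A sketch (the `force` callable stands for F(x), supplied by the model):

```python
import numpy as np

def greedy_policy(x, force):
    """Greedy policy [5]: alpha = Theta(f(s_t)), i.e., switch the potential on
    iff the mean force f(s_t) = mean_i F(x_i) is positive."""
    f = np.mean(force(x))
    return 1 if f > 0.0 else 0
```

This maximizes the instantaneous current but, as discussed below, not the long-term current for N > 1.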
The threshold policy [6][7][8] switches the potential on when the mean force f(s_t) rises above a threshold u_on ≥ 0, switches it off when f(s_t) falls below a threshold u_off ≤ 0, and otherwise keeps its previous state. With optimal thresholds, the threshold policy performs similarly to the greedy policy for N < 10²-10³ and better than the greedy policy for larger N. It is also optimal for N = ∞, where it is equivalent to the periodic switching policy.
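One possible reading of this hysteresis rule as code (a sketch; the exact crossing convention in [6-8] may differ in detail):

```python
def threshold_policy(f, alpha_prev, u_on, u_off):
    """Threshold policy sketch: switch on when the mean force f rises above
    u_on (>= 0), switch off when it falls below u_off (<= 0), and otherwise
    keep the previous on-off state (hysteresis)."""
    if alpha_prev == 0 and f > u_on:
        return 1
    if alpha_prev == 1 and f < u_off:
        return 0
    return alpha_prev
```

With u_on = u_off = 0 this reduces to the greedy policy; nonzero thresholds delay switching, which is what makes the rule better for large ensembles.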
Neither the greedy nor the threshold policy is optimal for finite N > 1. Roca et al. [13] proposed a general framework for finding the optimal policy via Bellman's principle and found it for N = 2 using numerical integration. However, this numerical method requires prior information about the model and is computationally infeasible for large N due to the curse of dimensionality.
Methods -We employ the actor-critic algorithm, which is one of the policy gradient methods in RL [25], together with deep neural networks to find the optimal policies in the collective flashing ratchet for any N .
To formulate this problem in RL language, we define the reward as the total mean displacement of the particles:

r_t = (1/N) Σ_{i=1}^N [x_i(t + Δt) − x_i(t)].  (4)

The total discounted reward from time t, called the return, is

G_t = Σ_{k=0}^∞ γᵏ r_{t+kΔt},  (5)

where γ ∈ [0, 1) is the discounting factor; we set γ = 0.999. We build a policy network π_θ, called the actor, where θ denotes the trainable neural network parameters, that takes the system state s_t as an input. The outputs π_θ(s_t) = (p_on, p_off) are the probabilities for switching the potential on or off [see Fig. 1(b)]. We sample the on-off action from π_θ(s_t) at every time step in the training process.
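The discounted return (5) is computed in practice by a backward pass over a finite trajectory of rewards, a standard RL bookkeeping step:

```python
def discounted_return(rewards, gamma=0.999):
    """Return G_t = sum_k gamma^k r_{t+k dt} for every t along a finite
    trajectory, accumulated backwards from the final reward."""
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return returns[::-1]
```

With γ = 0.999 and Δt = 10⁻³ L²/D, the effective horizon Δt/(1 − γ) is one diffusion time L²/D, which is why G_t can be read as a mean displacement over that window (see below).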
The goal in RL is obtaining the optimal policy π* that maximizes the expected total future reward, i.e. π* = arg max_π E_π[G_t]. If the equation of motion is known, E_π[G_t] can be numerically calculated using Bellman's equation [13]. However, in this work, we assume that we can only access the system state s_t and reward r_t.
In such a case, called model-free RL, we need an estimator V_φ for the value function

V^π(s_t) = E_π[G_t | s_t],  (6)

which is the expected return given state s_t under a policy π. The estimator V_φ, called the value network or critic, where φ denotes the trainable parameters, is also built with a neural network. There are various optimization methods for the actor-critic algorithm [35]. Among them, we employ proximal policy optimization [36], which is widely used in RL because of its scalability, data efficiency, and robustness to hyperparameters (see Supplemental Material [37] for training details). After the training process is complete, we test the policy deterministically, i.e. α(s_t) = Θ(p_on − p_off), switching the potential on whenever p_on > p_off.
Neural network architecture -First, we employ the multilayer perceptron (MLP) architecture for the policy network π_θ and value network V_φ [see Fig. 1(b)]. The configuration details of the neural network architectures are given in the Supplemental Material [37]. Using the periodicity of the potential U(x), we transform the state s_t into the input feature ψ_t = (ϕ_1, ..., ϕ_N) with ϕ_i = (cos(2πx_i(t)/L), sin(2πx_i(t)/L)). Therefore, the number of input nodes of the MLP is 2N, and the number of output nodes of π_θ is two. The value network V_φ has the same architecture, except that it has a single output node rather than two. We note that with the discounting factor γ = 0.999, the return G_t can effectively be considered as the total mean displacement between t and t + Δt/(1 − γ). Because the time step size is Δt = 10⁻³ L²/D, this time window equals L²/D, so V_φ(ψ_t) can be interpreted as the expected current given ψ_t.
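The periodic feature map described above can be sketched as follows; it makes positions that differ by a multiple of the period L indistinguishable to the network, as they should be:

```python
import numpy as np

def input_feature(x, L=1.0):
    """Map particle positions to the periodic input feature psi_t:
    phi_i = (cos(2*pi*x_i/L), sin(2*pi*x_i/L)), giving 2N inputs in total."""
    k = 2.0 * np.pi / L
    return np.stack([np.cos(k * x), np.sin(k * x)], axis=-1)  # shape (N, 2)
```

Each row ϕ_i encodes one particle; flattening the (N, 2) array yields the 2N MLP input nodes.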
For the N = 1 case, Fig. 1(a) shows that the trained π_θ agrees with the greedy policy (bottom panel), while V_φ is slightly shifted to the right of the potential U (top panel). This is because, at the potential maximum (x_max), the particle can slide to the right or left with a 50/50 chance, and therefore the expected current peaks slightly to the right of x_max.
For the N = 2 case, as shown in Fig. 2(b), the greedy policy switches on (off) the potential when the particles are inside (outside) the white contour. On the other hand, the decision boundary of the trained MLP policy π_θ (red contour) agrees with the policy discovered by Roca et al. [13] and shows better performance than the greedy policy by considering the future expected current. For instance, in the orange dashed area, the instantaneous net current will be negative because the mean force f(x_1, x_2) is negative when the potential is on. But considering each particle with a long-term view, particle 1 and particle 2 are located on the downhill slope of the potential (x_max < x < x_min) and near the minimum (x_min), respectively; while particle 2 will soon reach x_min and become trapped in the potential well, particle 1 can keep moving down along the potential [13].
However, the decision boundary (red contour) and V_φ (color gradient) are not symmetric about the line x_1 = x_2 [see Fig. 2(b)] because the MLP outputs are not permutation invariant with respect to the order of the elements in the input feature ψ_t. To address this issue, we employ a permutation-invariant architecture, called DeepSets [38], for the policy and value networks. In this architecture [see Fig. 2(a)], each element ϕ_i of the input feature ψ_t is independently fed into a single MLP (beige), and the outputs of this MLP are averaged over the elements and then fed into another MLP. With DeepSets, the decision boundary and V_φ show perfect symmetry about the x_1 = x_2 line [see Fig. 2(c)].

We now apply these methods for N = 2², 2³, ..., 2¹³, and compare the training results with the greedy (blue circles) and periodic switching (black dotted line) policies in Fig. 3. The trained MLP policies (orange triangles) outperform the greedy policy for N < 10 but perform poorly for N > 10 due to the lack of permutation invariance. In contrast, the trained DeepSets policies (green triangles) outperform the other policies for any N > 1 while converging to the periodic switching policy as N increases (see Fig. S1, Supplemental Material [37]). We have also verified that deep RL works well for the sawtooth potential (see Fig. S2, Supplemental Material [37]).

Time-delayed feedback -In an actual experiment, there is an inevitable time delay between the measurement and the feedback due to the computation time of the feedback algorithm [9][10][11][12]. To verify that deep RL is applicable to such a realistic situation, we consider a feedback time delay τ in Eq. (1), i.e. α(s_t) is replaced by α(s_{t−τ}). In this case, the maximal net displacement (MND) policy [11], defined by α(s_t) = Θ(Σ_{i=1}^N d(x_i(t))), where the displacement function d(x) gives the displacement that the particle at x would make toward the nearest potential minimum if the potential were switched on, with the basin boundary shifted by an offset x_0, can perform better than the greedy policy for τ > 0 with optimal x_0 < 0 [12].
This can be considered as a τ-delayed greedy policy because it predicts that a particle measured at x_0 + x_min will arrive at x_min after the delay τ. We train the neural networks for N = 1, 2, 2², ..., 2⁵ with time delay τ in the range 0.00-0.05L²/D, and compare them with the greedy policy and the MND policy with optimal x_0. For the time-delayed N = 1 case [see Fig. 4(a)], the results show that the trained MLP policies (green triangles) agree with the MND policy (orange triangles) and perform better than the greedy policy (blue circles). For N = 2, the trained DeepSets policies (green triangles) outperform the greedy policy and are slightly better than the MND policy [see Fig. 4(b)].
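The permutation-invariant DeepSets structure used for these policies can be illustrated with a toy forward pass in plain numpy. The weights here are random stand-ins, not the trained networks; the point is only that per-particle encoding followed by mean pooling makes the output independent of particle ordering:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy DeepSets forward pass: an inner map applied to each particle feature,
# mean-pooled over particles, then an outer map producing two logits.
# Random weights -- this sketches the architecture, not the trained policy.
W_inner = rng.normal(size=(2, 16))   # per-particle encoder (linear + tanh)
W_outer = rng.normal(size=(16, 2))   # outer network -> (p_on, p_off) logits

def deepsets_logits(features):
    """features: array of shape (N, 2), one row phi_i per particle."""
    h = np.tanh(features @ W_inner)  # encode each particle independently
    pooled = h.mean(axis=0)          # mean pooling => permutation invariant
    return pooled @ W_outer

feats = rng.normal(size=(5, 2))      # N = 5 particles
out1 = deepsets_logits(feats)
out2 = deepsets_logits(feats[::-1])  # same particles, reversed order
```

Because the pooled representation is an average, `out1` and `out2` agree up to floating-point error, which is exactly the symmetry the MLP lacked in Fig. 2(b).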
While the actor-critic algorithm assumes that the feedback-controlled system is a Markov decision process (MDP), the delayed-feedback process is not an MDP because the next state s_{t+Δt} depends not only on the previous state s_t but also on the history of the on-off actions. This problem can be reformulated as an MDP by augmenting the input feature ψ_t with the on-off history [39]. Here, the d-step augmented state at time t is defined as ψ̃_t = (ψ_t, α_{t−dΔt}, ..., α_{t−Δt}) with d = τ/Δt, i.e. the position feature together with the actions taken during the delay interval. In order to handle the augmented state efficiently, we build the policy network with a recurrent neural network (RNN). We employ an embedding layer to transform the discrete variable α into a continuous variable and use a gated recurrent unit (GRU) [40] for the RNN. As shown in Fig. 4(c), we concatenate the output vectors from DeepSets (orange nodes) and the RNN (blue nodes), where DeepSets and the RNN encode the position information ψ_t and the potential on-off history, respectively. We then feed the concatenated vector to an MLP. See the Supplemental Material [37] for the configuration details. As can be seen in Figs. 4(a) and 4(b), the trained RNN policies (red stars) show slightly better performance than the other policies for N = 1 and noticeably better performance than the others for N = 2. Figure 5 shows that the RNN policies also outperform the greedy, MND, and DeepSets policies for the N = 4, 8, 16, 32 cases.
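The d-step state augmentation above can be sketched as follows; the rolling action buffer (here a `deque`) holds the on-off decisions issued during the delay window, which are concatenated with the position feature to restore the Markov property:

```python
import numpy as np
from collections import deque

def make_augmented_state(psi, action_history, d):
    """Concatenate the flattened position feature psi_t with the last
    d = tau/dt on-off actions, forming the d-step augmented state [39]."""
    assert len(action_history) == d
    return np.concatenate([np.ravel(psi), np.asarray(action_history, float)])

d = 5                                # e.g. tau = 5 * dt
history = deque([0] * d, maxlen=d)   # actions taken during the delay window
psi = np.zeros((4, 2))               # e.g. N = 4 particles, 2 features each
s_aug = make_augmented_state(psi, list(history), d)  # length 4*2 + 5 = 13
```

After each control step, the newest action is appended to `history` (evicting the oldest), so the buffer always spans exactly the delay interval.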
Conclusions and outlook -We have tackled the problem of finding an improved policy for maximizing the current in the collective flashing ratchet model through deep RL. Unlike the previous model-based method [13], the model-free RL approach used in this study does not require information on the parameters of the system (e.g. potential, diffusion coefficient, and others). The deep RL approach makes it possible to find state-of-the-art feedback strategies with suitable neural network architectures, trained only through interaction with the environment. Moreover, we have demonstrated that deep RL outperforms the previous strategies in a time-delayed feedback situation; therefore, we expect that this approach can be effectively applied in experiments.
Although feedback control in the collective flashing ratchet can induce an effective coupling between non-interacting particles, molecular motors like kinesin, for example, explicitly interact with each other via hard-core repulsion. According to previous studies on interacting molecular motors [17][18][19], their cooperative behavior can enhance transport ability several times or more compared to individual motors. Further research applying deep RL to interacting molecular motors will be intriguing.
In real-world scenarios, there may be measurement or feedback errors due to instrument noise [41][42][43]. Such cases are not only important in physics, e.g. information thermodynamics [44], but also in RL for real-world applications [45]. Therefore, it will also be an interesting future work to study RL from a thermodynamics perspective; we expect that the collective flashing ratchet model can be utilized as a useful environment to benchmark RL algorithms in such situations.
The results of all runs and the code, implemented in PyTorch [46], are publicly available.

III. POLICY AND VALUE NETWORKS OVER TIME

Figure S1 shows that, with increasing N, the deterministic control α(t) of the trained DeepSets policy as a function of time t converges to that of the periodic switching policy.

IV. SAWTOOTH POTENTIAL
We also test the deep RL method with the sawtooth potential (S9) with L = 1 and U 0 = 5. The training results are shown in Fig. S2.