Combining Reinforcement Learning and Tensor Networks, with an Application to Dynamical Large Deviations

We present a framework to integrate tensor network (TN) methods with reinforcement learning (RL) for solving dynamical optimisation tasks. We consider the RL actor-critic method, a model-free approach for solving RL problems, and introduce TNs as the approximators for its policy and value functions. Our "actor-critic with tensor networks" (ACTeN) method is especially well suited to problems with large and factorisable state and action spaces. As an illustration of the applicability of ACTeN we solve the exponentially hard task of sampling rare trajectories in two paradigmatic stochastic models, the East model of glasses and the asymmetric simple exclusion process (ASEP), the latter being particularly challenging to other methods due to the absence of detailed balance. With substantial potential for further integration with the vast array of existing RL methods, the approach introduced here is promising both for applications in physics and to multi-agent RL problems more generally.

Here, we introduce the actor-critic with tensor networks (ACTeN) method, a general framework for integrating TNs into RL via actor-critic (AC) techniques. By combining decision-making "actors" with "critics" that judge an actor's quality, AC methods are used in many state-of-the-art RL applications. Using TNs as the basis for modelling actors and critics within AC and RL represents a powerful combination to tackle problems with both large state and action spaces.
To demonstrate the effectiveness of our approach, we consider the problem of computing the large deviation (LD) statistics of dynamical observables in classical stochastic systems [29-35], and of optimally sampling the associated rare long-time trajectories [36-48]. Such problems are of wide interest in statistical mechanics and can be phrased straightforwardly as optimization problems that may be solved with RL [49-52] and similar techniques [53-61]. For concreteness we consider two models: (i) the East model, a kinetically constrained model used to study slow glassy dynamics, and (ii) the asymmetric simple exclusion process (ASEP), a paradigmatic model of non-equilibrium physics, in which particles hop around a lattice while blocking each other's movement. In particular, and in contrast to the East model, the ASEP (with periodic boundaries) does not obey detailed balance [62], and thus evades straightforward use of TNs to compute spectral properties of the relevant dynamical generators. We demonstrate that ACTeN can be applied to both problems irrespective of the equilibrium/non-equilibrium distinction, by computing their dynamical LDs for sizes well beyond those achievable with exact methods. Given the vast array of options for improving the RL algorithm that we use, our results indicate that the overall framework outlined here is highly promising for applications more generally.
Background: Reinforcement Learning and Actor-Critic. A discrete-time Markov decision process (MDP) [21] consists at each time t ∈ [0, T] of stochastic variables X_t = (S_t, a_t, R_t), named state, action and reward. We assume these are drawn from t-independent finite sets, S, A and R, where the action set may depend on the current state, A(S) for S ∈ S. The action and state variables are associated with the policy, π(a|S), and environment, P(S′|S, a), distributions. These are sampled in a sequence of steps to generate a trajectory of the MDP, ω = (X_0, X_1, ..., X_T), see Fig. 1(a). We assume that the reward is a deterministic function of the state and action variables.
In the typical scenario of policy optimisation, the policy is controllable and known, while the environment is fixed and potentially unknown. We focus on MDPs that are "continuing", and admit a steady-state distribution independent of X_0. We can then define the average reward per time-step when following a given policy as r(π) = lim_{t→∞} E_π[R_t], where E_π[·] is the stationary-state expectation over states and over transitions from those states according to policy π. The task of policy optimisation is to find the policy π* that maximises r(π).
Reinforcement learning (RL) refers to the group of methods that aim to discover optimal policies by using the experience gained from sampling trajectories of an MDP. In policy gradient methods, the policy is approximated by a function π_w(a|S) with parameters w, and optimized using the gradient of r(π_w) with respect to w. Building on this, actor-critic (AC) methods then assess π_w(a|S) by computing the value, v_π(S), of the states that result when following the policy. This value is defined as the difference between rewards in the future of a given state and the average reward,

v_π(S) = E_π[ Σ_{t=1}^∞ (R_t − r(π)) | S_0 = S ].

The gradient of r(π_w) can be written exactly in terms of these values as

∇_w r(π_w) = E_π[ δ_π ∇_w ln π_w(a_t|S_t) ],    (1)

where we have introduced the temporal difference (TD) error [21,49],

δ_π = R_{t+1} − r(π_w) + v_π(S_{t+1}) − v_π(S_t),

which quantifies whether a resultant state is better than the current one. AC methods use this information to alter the probability of taking that action in the future.
In reality, calculating the true value of every state under the current policy is impractical, and thus an auxiliary approximation for the value function, v_ψ(S) with parameters ψ, is introduced, the so-called "critic". To optimize the critic, we note that the value of a state is related to the values of states reachable from it, as encoded in the differential Bellman equation [21],

v_π(S) = Σ_a π(a|S) Σ_{S′} P(S′|S, a) [ R(S, a, S′) − r(π) + v_π(S′) ].    (2)

Minimizing the error in the Bellman equation when substituting the critic for the true values can be done by updating the weights as

ψ → ψ + α δ ∇_ψ v_ψ(S_t),

in terms of the approximate TD error, δ = R_{t+1} − r(π_w) + v_ψ(S_{t+1}) − v_ψ(S_t), and a learning rate α. Intuitively, the estimated expected reward before a transition occurs, v_ψ(S_t), is compared to the expected reward afterwards plus the true reward for that time-step, v_ψ(S_{t+1}) + R_{t+1} − r(π_w), and v_ψ(S) is adjusted to bring these closer. The policy is then updated by following the gradient in Eq. (1) with the exact TD error δ_π replaced by the critic's approximate TD error δ.
Analytic Example: Two-Site East Model. To illustrate these ideas, consider two spins, s_{1,2} = 0, 1, evolving with a constrained set of transitions as in the East model studied below, such that a spin can flip only when its left neighbour (in this case simply the other spin, due to periodic boundaries) is 1. The states are S = {00, 01, 10, 11}.
The dynamics of this model can be implemented as an MDP using a policy that (stochastically) selects which spins to flip. Denoting no-flip/flip by 0/1 and requiring at most one spin flip per time-step, the action sets for this model are then: A(10) = {00, 01}, A(01) = {00, 10}, and A(11) = {10, 01}, with the state 00 being disconnected from the rest. An example policy that selects from all possible actions equally would assign a probability of 1/2 to each of these, e.g. π(a = 00|S = 10) = π(a = 01|S = 10) = 1/2, and similarly for the other states. The transition to a new state is then enacted by the environment. We can choose this to be deterministic, and simply apply the spin-flip operations selected by the actor to the current state, s_i → (1 − a_i)s_i + a_i(1 − s_i). Finally, an example reward can be defined via the function R(S, a, S′) = −λ(1 − δ_{S′,S}), which awards −λ every time a spin flip occurs (a ≠ 0), encouraging activity/inactivity for negative/positive λ. For this reward, the optimal policy for negative λ will maximize activity, flipping a spin at every step, which requires going from 10 and 01 to 11 with probability 1, i.e. π*(01|10) = 1, etc.
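This two-site MDP can be encoded in a few lines; the following is a minimal sketch with our own naming, following the action sets, deterministic environment and reward defined above:

```python
import numpy as np

# Minimal encoding of the two-site East model MDP described above.
# Actions are bit-tuples: a 1 at position i means "flip spin i".
ACTIONS = {           # A(S): allowed actions per (connected) state
    (1, 0): [(0, 0), (0, 1)],
    (0, 1): [(0, 0), (1, 0)],
    (1, 1): [(1, 0), (0, 1)],
}

def env_step(state, action):
    """Deterministic environment: s_i -> (1 - a_i) s_i + a_i (1 - s_i)."""
    return tuple((1 - a) * s + a * (1 - s) for s, a in zip(state, action))

def reward(state, next_state, lam):
    """R(S, a, S') = -lam (1 - delta_{S',S}): pays -lam per spin flip."""
    return -lam if next_state != state else 0.0

def uniform_policy(state, rng):
    """The 'select all possible actions equally' policy, pi(a|S) = 1/|A(S)|."""
    acts = ACTIONS[state]
    return acts[rng.integers(len(acts))]

rng = np.random.default_rng(0)
s = (1, 0)
a = uniform_policy(s, rng)
s_next = env_step(s, a)
```

From state 10 the uniform policy either stays (action 00) or moves to 11 (action 01), each with probability 1/2, matching the example above.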
Method: Actor-Critic with Tensor Networks (ACTeN). We now focus on applying AC to solve problems with large state and action spaces. For example, we may wish to find the optimal dynamics of a system of L binary components, resulting in 2^L states, with individual agents and their actions associated to each component. To ensure optimal choices for a given task, actions may need to be correlated not only with other agents' states, but also with the actions other agents are about to take. In such problems, simple approaches such as tabular RL fail due to exponentially large memory requirements and sampling costs, and a common alternative is to use neural networks (NNs) when defining π_w(a|S) and v_ψ(S). TNs offer another approach, with polynomial memory and computational costs, and show state-of-the-art performance in many settings; see e.g. the review [6]. Motivated by this, we define a general framework (which we call ACTeN) that exploits TNs to efficiently represent π_w(a|S) and v_ψ(S). A TN is a set of tensors contracted according to a chosen pattern, see Fig. 1(b). To approximate the value function we use a translation invariant matrix product state (MPS), built from a single real-valued tensor A^s_{ij} of shape (d, χ, χ), whose contraction is the traced matrix product

φ(S) = Tr[ A^{s_1} A^{s_2} ⋯ A^{s_L} ],

i.e., for each site we select the corresponding matrix A^{s_i}, multiplying the L matrices together with a trace to produce a real scalar. To take advantage of translation invariance and apply approximations from smaller L to larger L systems, we then define the value function in terms of φ(S) after the additional application of a square and log, which prevents the exponential growth or decay of values as L is changed for fixed A^s_{ij}. Hence,

v_ψ(S) = ln φ(S)².    (4)

To define π_w(a|S) we use a matrix product operator (MPO). This TN is built from a single real-valued tensor W^{a s}_{ij}; the contraction is given by the traced matrix product,

φ(a, S) = Tr[ W^{a_1 s_1} W^{a_2 s_2} ⋯ W^{a_L s_L} ].

To use this to define a policy, we need to ensure positivity and normalization, as well as preventing the policy from producing invalid actions. To achieve this we define

π_w(a|S) = C(a, S) φ(a, S)² / N(S),

where N(S) = Σ_{a|C(a,S)=1} φ(a, S)² is the (state-dependent) normalisation factor and C(a, S) returns one if an action a is possible in state S and zero otherwise [63].
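A minimal sketch of the two TN approximators (random tensors and our own variable names; d is the local dimension, χ the bond dimension) might look like:

```python
import numpy as np

rng = np.random.default_rng(1)
d, chi, L = 2, 4, 6
A = rng.normal(scale=0.5, size=(d, chi, chi))     # MPS tensor (critic)
W = rng.normal(scale=0.5, size=(2, d, chi, chi))  # MPO tensor (actor)

def phi_mps(state, A):
    """phi(S) = Tr[A^{s_1} ... A^{s_L}]: one matrix per site, traced product."""
    M = np.eye(A.shape[1])
    for s in state:
        M = M @ A[s]
    return np.trace(M)

def value(state, A):
    """v(S) = ln phi(S)^2, taming exponential growth/decay of phi with L."""
    return np.log(phi_mps(state, A) ** 2)

def phi_mpo(action, state, W):
    """phi(a, S) = Tr[W^{a_1 s_1} ... W^{a_L s_L}] for the MPO actor."""
    M = np.eye(W.shape[2])
    for a, s in zip(action, state):
        M = M @ W[a, s]
    return np.trace(M)

def policy(state, W, allowed):
    """pi(a|S) = C(a,S) phi(a,S)^2 / N(S), normalised over allowed actions."""
    weights = np.array([phi_mpo(a, state, W) ** 2 for a in allowed])
    return weights / weights.sum()

state = (1, 0, 1, 1, 0, 0)
allowed = [tuple(1 if i == k else 0 for i in range(L)) for k in range(L)]
probs = policy(state, W, allowed)
```

Here `allowed` enumerates single-flip actions only for illustration; the constraint function C(a, S) used in practice is discussed in the appendices.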
Application: Dynamical Large Deviations. To test ACTeN we consider the problem of computing the large deviations (LDs) of trajectory observables [29,31,33] in the East model and the ASEP in 1D with PBs. Both models are many-body binary spin systems with large state-spaces for large L, S = {s_i}_{i=1}^L, with s_i = 0, 1 (see the Appendix for details on the models). The dynamics of these systems are subject to local constraints that lead to rich behaviours in their trajectories, ω = {S_t}_{t=0}^T. This can be observed in the time-integrals of time-local quantities, O(ω) = Σ_{t=1}^T o(S_t, S_{t−1}), the moments of which are contained in derivatives of the moment generating function (MGF), Z_T(λ) = Σ_ω e^{−λO(ω)} P(ω), where P(ω) = P(S_0) Π_{t=1}^T P(S_t|S_{t−1}) is the trajectory probability under the dynamics.
FIG. 3. Dynamical large deviations in the ASEP using ACTeN. (a) In the ASEP particles can only move to an unoccupied neighbouring site, with probability p to the left and q = 1 − p to the right. (b) SCGF for the time-integrated particle current as a function of biasing field. We show results from ACTeN for p = 0.1 (squares) and p = 1/2 (diamonds). The lack of detailed balance for PBC and p ≠ 1/2 prevents straightforward application of DMRG, but for small sizes (here L = 14) we can compare to exact diagonalisation (blue curve for p = 0.1, green for p = 1/2). (c) SCGF for p = 0.1 from ACTeN for size L = 50, which is beyond the scope of ED. Compared to L = 14 (blue curve from ED), we see that ACTeN captures the flattening of the SCGF for larger sizes, indicative of a LD phase transition, cf. Ref. [32]. The inset shows the smooth convergence of our ACTeN numerics with L for two values of λ. (d) Since ACTeN provides direct access to the optimal dynamics, observables such as the time-integrated current can be evaluated directly (black squares for L = 50). We show for comparison the numerical differentiation of the ACTeN SCGF (red circles) and of the ED SCGF at L = 14 (blue line).

In the long-time limit, the MGF obeys a large deviation principle [29,31,33] with the scaled cumulant generating function (SCGF), θ(λ) = lim_{T→∞} (1/T) ln Z_T(λ), playing the role of a free energy for trajectories. In principle the SCGF can be obtained by sampling methods. However, this is exponentially hard (in time and space) using the original dynamics. An alternative is to find a more efficient sampling dynamics, which may then be combined with importance sampling to obtain unbiased statistics. This can be formulated as a RL problem as follows: can we find a parameterized dynamics P_w(S_t|S_{t−1}) such that P_w(ω) = e^{−λO(ω)} P(ω)/Z_T(λ), i.e. one that reproduces a trajectory ensemble biased towards rare trajectories of the original dynamics? This dynamics is connected to an underlying policy π_w(a|S) by a deterministic environment
which returns states after receiving an associated action, i.e. P_w[S′ = f(a, S)|S] = π_w(a|S), where for each S, f(a, S) returns a unique S′ for each a. For example, in the East model, if we take the action a = {a_i}_{i=1}^L then sites with a_i = 1 are flipped and those with a_i = 0 are not; the new state S′ then has components s′_i = (1 − a_i)s_i + a_i(1 − s_i). Optimizing the KL divergence between the two trajectory ensembles gives a regularized form of RL with a reward depending on the policy [49],

R_t = −λ o(S_t, S_{t−1}) − ln[ P_w(S_t|S_{t−1}) / P(S_t|S_{t−1}) ],    (7)

with its expected value becoming the SCGF at optimality [49]. Intuitively, choosing actions (e.g. flips) to maximize the first term increases the likelihood of rare events with extreme values of the observable, while maximizing the second term minimizes the difference between the parameterized and original dynamics, thus making the event more probable. Maximizing this reward is a balancing act between these two aims, resulting in dynamics biased towards rare events in a way representative of their occurrence in the original dynamics. In the appendices, we illustrate ACTeN by solving the 2-site East model explicitly and showing how this solution can be exactly represented by the TN ansatz.
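The KL-regularised reward above can be estimated directly along sampled trajectories; a small helper of our own naming, whose average converges to the SCGF once the policy is optimal, could read:

```python
import math

# Average of the KL-regularised reward along a trajectory (illustrative):
# R_t = -lam * o_t - ln[P_w(S_t|S_{t-1}) / P(S_t|S_{t-1})].
# At optimality this average estimates the SCGF theta(lam).
def average_reward(lam, obs, probs_w, probs_orig):
    total = 0.0
    for o_t, p_w, p in zip(obs, probs_w, probs_orig):
        total += -lam * o_t - math.log(p_w / p)
    return total / len(obs)

# Sanity check: with the original dynamics (p_w = p) and lam = 0,
# every term vanishes.
r = average_reward(0.0, [1, 0, 1], [0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
```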
(i) East model and dynamical activity: Figure 2 shows the SCGF of the dynamical activity [total number of spin flips in a trajectory, defined by o(S_t, S_{t−1}) = 1 − δ_{S_t,S_{t−1}}], calculated using ACTeN (symbols). Since the East model obeys detailed balance, the SCGF is the log of the largest eigenvalue of a Hermitian operator and can be estimated via density matrix renormalisation group (DMRG) methods, cf. Refs. [64-70] (here we use ITensors.jl [71]). Figure 2 shows that the DMRG results (blue curve) coincide with ACTeN (black squares) for size L = 50, which is well beyond what is accessible to exact diagonalisation (ED). Note that DMRG with PBs tends to be much less numerically stable than for open boundaries. Nonetheless, ACTeN can reach L ≳ 50 without the need for any special stabilisation techniques.
(ii) ASEP and particle current: Figure 3 presents the LDs of the time-integrated particle current, defined by o(S_t, S_{t−1}) = ±1 for a particle hop in each of the two directions and 0 otherwise. Figure 3(b) shows the SCGF obtained via ACTeN (black squares/diamonds). Unlike the East model, for asymmetric hops (p ≠ 1/2) Hermitian DMRG cannot be applied directly to the ASEP, so for comparison we show results from exact diagonalisation for both p = 0.1 (blue line) and p = 1/2 at L = 14. Beyond L = 14 ED becomes prohibitive, while ACTeN remains feasible. Figures 3(c,d) show the expected phase transition behaviour [32] and convergence with L up to L = 50. The optimal dynamics itself, i.e. the learnt policy, can be used to generate trajectories representative of λ ≠ 0, see Fig. 1(c), and to directly sample rare values of the integrated current, see Fig. 3(d).
Outlook. ACTeN compares very favourably with state-of-the-art methods for computing rare events, without some of their limitations, such as restrictions on boundary conditions or detailed balance. Given the corpus of research in both TNs and RL, our approach has considerable potential for further improvement and exploration. Possible directions include: numerical improvements to precision via hyper-parameter searches; stabilisation strategies for large systems; integration with trajectory methods such as transition path sampling or cloning; integration with advanced RL methods such as those offered by the DeepMind ecosystem [72]; generalisation to continuous-time dynamics; and applications to other multi-agent RL problems, such as PistonBall [73], via integration with additional processing layers, particularly those for image recognition.
Acknowledgement. We would like to thank Christopher J. Turner for useful discussions. We acknowledge funding from The Leverhulme Trust grant no. RPG-2018-181, EPSRC Grant No. EP/V031201/1, and University of Nottingham grant no. FiF1/3. We are grateful for access to the University of Nottingham's Augusta HPC service. DCR was supported by funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (Grant agreement No. 853368). We thank the creators and community of the Python programming language [74], and acknowledge use of the packages JAX [75], NumPy [76], pandas [77], Matplotlib [78] and h5py [79]. We also thank the creators and community of the Julia programming language [80], and acknowledge use of the packages ITensors.jl [71] and HDF5.jl [81].

Appendix on kinetically constrained models:
(i) East Model. Flipping of a spin is allowed only if the spin to its left is in state 1 [31,82]. Dynamics then amounts to two steps: first, select a random site i with probability 1/L; second, if spin s_{i−1} = 1 then flip s_i. Given N spins in state 1 and periodic boundary conditions, the transition probability is P(s′|s) = 1/L for each possible new state s′ ≠ s, and P(s|s) = 1 − N/L for no flip occurring. (ii) Asymmetric simple exclusion process. The constraint is particle exclusion: a particle at site i (s_i = 1) can move left or right only if the destination is unoccupied [83]. The movement of a particle to, say, the right thus corresponds to 10 → 01, i.e. a flip of both spin variables. The dynamics again amounts to two steps: first, select a particle with probability 1/N, with N = Σ_i s_i the particle number; second, choose whether this particle hops right or left with probabilities p or 1 − p, respectively, with the hop occurring only if the new site is unoccupied. The transition probabilities are then: P(s′|s) = p/N for a right hop; P(s′|s) = (1 − p)/N for a left hop; and, given the number of neighbouring particle pairs N_nn = Σ_{i=1}^L s_i s_{i+1}, with s_{L+1} := s_1, the probability of no change is P(s|s) = N_nn/N, since a chosen hop is blocked exactly when the destination is occupied. In the main text, we consider the case of half-filling, N = L/2.
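The two update rules can be sketched as single-step samplers (a minimal illustration of our own; a state is a tuple of 0/1 entries with periodic indexing):

```python
import numpy as np

def east_step(s, rng):
    """East model: pick site i w.p. 1/L, flip s_i iff s_{i-1} = 1 (periodic)."""
    s = list(s)
    i = rng.integers(len(s))
    if s[i - 1] == 1:          # s[-1] wraps around: periodic boundaries
        s[i] = 1 - s[i]
    return tuple(s)

def asep_step(s, p, rng):
    """ASEP: pick a particle w.p. 1/N; attempt a hop right w.p. p or left
    w.p. 1 - p; the hop happens only if the destination site is empty."""
    s = list(s)
    L = len(s)
    particles = [i for i in range(L) if s[i] == 1]
    i = particles[rng.integers(len(particles))]
    j = (i + 1) % L if rng.random() < p else (i - 1) % L
    if s[j] == 0:
        s[i], s[j] = 0, 1
    return tuple(s)

rng = np.random.default_rng(2)
s = east_step((1, 0, 0, 1), rng)
t = asep_step((1, 0, 1, 0), 0.1, rng)
```

Note that the ASEP step conserves particle number, while an East step changes the number of up spins by at most one.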

Appendix on analytical solution of the two-site East model with tensor network ansatz:
An exact solution for the optimal policy in the two-site East model can be found by analytically constructing the so-called "Doob dynamics" [34,84] as follows. First we define a "tilted" evolution operator [29,31,33] with elements

P_λ(S′|S) = e^{−λ o(S′,S)} P(S′|S),

such that Z_T(λ) = ⟨−|P_λ^T|P_ss⟩, where ⟨−| is the flat state and |P_ss⟩ is the stationary state vector for P. In the large-T limit the SCGF is the log of the largest eigenvalue of P_λ [29,31,33]. The corresponding left eigenvector l_λ is related to the dynamics that maximises the expected value of Eq. (7), given by the so-called Doob (or optimal) dynamics,

P^D_λ(S′|S) = l_λ(S′) P_λ(S′|S) / [e^{θ(λ)} l_λ(S)].

Applied to the two-site East model, defining the function a(λ) = 4/[1 + √(1 + 8e^{−2λ})], we find θ(λ) = −ln a(λ), with optimal dynamics

P^D_λ(11|10) = P^D_λ(11|01) = (1/2) a(λ)² e^{−2λ},   P^D_λ(10|11) = P^D_λ(01|11) = 1/2.

We may then find the corresponding value function by solving the differential Bellman equation (2). To do this, note first that by symmetry the values of states 10 and 01 are identical, and second that Eq. (2) is invariant under an overall shift of the value function by a constant. Therefore, we may choose the value of states 10 and 01 to be 0, and thus find V_λ(11) = −λ − θ(λ).
These results are what is expected intuitively. Trajectories biased towards enhanced activity (λ < 0) have a(λ) < 1, making P^D_λ(11|10) = P^D_λ(11|01) > 1/2, i.e. the system is more likely to transition to the state which is guaranteed to flip at the next step rather than remain in 01 or 10. Furthermore, V_λ(11) > 0, i.e. the state guaranteed to flip is more valuable. In contrast, trajectories biased towards reduced activity (λ > 0) show the opposite behaviour.
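These closed-form results can be checked numerically by diagonalising the tilted matrix directly; the sketch below is our own construction, with states ordered (10, 01, 11) and the disconnected state 00 omitted:

```python
import numpy as np

def doob_two_site(lam):
    """Largest eigenvalue of the tilted matrix gives theta(lam); the left
    eigenvector Doob-transforms P_lam into the optimal dynamics."""
    e = np.exp(-lam)
    # column = "from" state, row = "to" state, ordering (10, 01, 11)
    P = np.array([[0.5,     0.0,     0.5 * e],
                  [0.0,     0.5,     0.5 * e],
                  [0.5 * e, 0.5 * e, 0.0    ]])
    evals, evecs = np.linalg.eig(P.T)   # right eigvecs of P^T = left of P
    k = np.argmax(evals.real)
    mu, l = evals.real[k], np.abs(evecs[:, k].real)
    # Doob transform: P_D(i|j) = l_i P(i|j) / (mu * l_j)
    PD = (l[:, None] * P) / (mu * l[None, :])
    return np.log(mu), PD

theta0, PD0 = doob_two_site(0.0)   # no bias: theta = 0, Doob = original
theta, PD = doob_two_site(-3.0)    # strong activity bias
```

For a strong activity bias the Doob dynamics sends 10 (and 01) to 11 with probability close to one, consistent with the discussion above.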
For the value function, taking an exponential and square root element-wise of the value function to invert Eq. (4) leads to the vector

(e^{V_λ(10)/2}, e^{V_λ(01)/2}, e^{V_λ(11)/2}) = (1, 1, e^{−(λ+θ(λ))/2}).

Since this is already a sum of symmetric products, it is easy to rewrite it as a translation invariant MPS with χ = 2, i.e. φ(S) = Tr[A^{s_1} A^{s_2}] for a suitable pair of 2 × 2 matrices A^0, A^1.
We now provide more details on the training procedure used to obtain the policies of the main text. First, we outline the update step used to improve the policy and value-function approximations. Second, we outline size annealing, where we apply transfer learning by using systems of increasing size. Finally, we discuss policy evaluation and selection, whereby the best policy is chosen from a set of candidates.
(i) Basic Outline of Training. We start by initializing the parameters w_0, ψ_0 and r̄_0, where r̄_t is an estimate of the average reward per time-step, r(π_w), after t training steps, along with the environment and initial state s_0. Choosing the three learning rates α_π, α_v and α_r, for each step t ∈ [0, T] we:

1. Sample an action a_t ∼ π_w(·|s_t) [where x ∼ Y(·) stands for x sampled from Y], and from it get its log probability and eligibility, ln π_w(a_t|s_t), ∇_w ln π_w(a_t|s_t).

2. Get the next state and reward given the current state and action, (s_{t+1}, r_{t+1}) ∼ P(·, ·|a_t, s_t).

3. Get the temporal difference error with the current value function, δ_{t+1} = v_ψ(s_{t+1}) + r_{t+1} − r̄_t − v_ψ(s_t).

4. Update the average-reward estimate and the parameters of the value function, r̄_{t+1} = r̄_t + α_r δ_{t+1} and ψ_{t+1} = ψ_t + α_v δ_{t+1} ∇_ψ v_ψ(s_t).

5. Update the parameters of the policy, w_{t+1} = w_t + α_π δ_{t+1} ∇_w ln π_w(a_t|s_t).
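The update steps above can be condensed into a short loop. The sketch below uses a toy two-state MDP of our own invention (reward 1 for action 1) with a tabular softmax actor and tabular critic standing in for the tensor-network versions, just to show the update structure:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def train(steps=20000, a_pi=0.1, a_v=0.1, a_r=0.01, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros((2, 2))     # actor parameters: one logit per (state, action)
    psi = np.zeros(2)        # critic parameters: one value per state
    r_bar, s = 0.0, 0        # average-reward estimate and initial state
    for _ in range(steps):
        p = softmax(w[s])
        a = rng.choice(2, p=p)                    # 1. sample a_t ~ pi_w(.|s_t)
        s_next, r = a, float(a == 1)              # 2. environment step, reward
        delta = psi[s_next] + r - r_bar - psi[s]  # 3. TD error
        r_bar += a_r * delta                      # 4. average-reward update
        psi[s] += a_v * delta                     # 4. critic update
        grad = -p                                 # grad of ln pi_w(a|s) ...
        grad[a] += 1.0                            # ... w.r.t. the logits w[s]
        w[s] += a_pi * delta * grad               # 5. actor update
        s = s_next
    return w, psi, r_bar

w, psi, r_bar = train()   # the learnt policy should always pick action 1
```

With these updates the average-reward estimate r̄ approaches the optimal value 1 and both states learn to prefer the rewarded action.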
(ii) Annealing and Transfer Learning. In the context of machine learning, "annealing" (sequentially solving an optimisation problem, reusing solutions to improve an initial guess) can be considered a form of transfer learning. In our case, we anneal the size of the system: the optimal policies for two system sizes L and L′ will be similar as long as L′ ≳ L, and in the settings considered the optimal dynamics should converge as L → ∞.
We first approximate the optimal policy for a small system, L = 4, starting from random initial weights. The weights after optimisation at this size are then used as the initial weights for L = 6. This is repeated in steps of ∆L = 2, up to the maximum desired L. This process ensures that effectively much longer training times are used for larger systems, and produces smooth convergence curves in L, which can be used both as diagnostic tools and for extrapolation [cf. Fig. 3(c) inset].
(iii) Policy Evaluation and Selection. To determine the quality of a policy, we use it to generate trajectories without any change to the policy weights [cf. Fig. 1(c) of the main text]. The set of rewards along these trajectories can then be averaged to estimate r(π_w) for the policy, allowing different policies to be compared.
To ensure that we obtain the best policies possible, we then employ policy selection in two ways. Firstly, throughout the training of a given policy we store its weights periodically. After some number of periods, these weight snapshots are evaluated and the best one is selected, ensuring that the policy can only improve with more training. Secondly, we run parallel policy optimisations and evaluations, starting from different random initial weights, with the best one selected.
The specific processes of policy evaluation and selection used to produce the results in the main text are illustrated for the ASEP in Fig. 4 (details in caption).
Here, N(s) = Σ_{a|C(a,s)=1} φ(a, s)² = Σ_a C(a, s) φ(a, s)², and C(a, s) is the constraint function, which returns one if an action a is possible given a state s, and zero otherwise.
The constraint function, C(a, s), allows us to include explicit constraints on the actions selected by our function approximations.This is particularly powerful when modelling the dynamics of spin systems whose constraints are both known and such that only a few states can be reached from any other.In that case we can construct C(a, s) so that our policy reflects these constraints exactly, whereas in other scenarios this must be approximated or learnt.
Single Spin-Flip Constraint: In the dynamics studied in the main text, we consider two varieties of constraint. The first, which pertains to both models, requires that at most a single spin is allowed to flip at a given time. For convenience, we define the set of actions ã_k for k ∈ [1, L + 1] that represent a flip at site k when k ≤ L, and no flip at any site when k = L + 1 (in which case the state is unchanged by the action). Taking L = 4 for illustration, in terms of the variables in the main text where a = (a_1, a_2, a_3, a_4) with each a_k indicating a flip at site k, we have ã_1 = (1, 0, 0, 0), ã_2 = (0, 1, 0, 0), ã_3 = (0, 0, 1, 0), ã_4 = (0, 0, 0, 1), and ã_5 = (0, 0, 0, 0). This constraint can be included in π_w(a|S) straightforwardly via the choice

C(a, s) = Σ_{k=1}^{L+1} δ_{a,ã_k} C(ã_k, s),    (S8)

i.e. C(a, s) vanishes for any action that would flip more than one spin. With the choice of constraint function (S8), sampling π_w(a|S) will select actions only from the L + 1 possibilities ã_k, with other actions having strictly zero probability of occurring. As such, sampling can equivalently be performed by selecting an action from the probabilities {π_w(ã_k|s)}_{k=1}^{L+1} alone. This simplifies the problem of sampling π_w(a|S), because the normalisation factor N(s) [cf. (S6)], which in general is hard to calculate for conditional probabilities, can be calculated explicitly by enumerating C(ã_k, s) φ(ã_k, s)² for all k = 1, 2, ..., L + 1. Thus, at most L + 1 computations are required to sample an action from a policy with this constraint. While, in principle, these could be computed in parallel, for the problems here we present an alternative method (as applied in the main text) where the policy is instead sampled via a "sweep", similar to that performed for more standard tensor network algorithms.
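Sampling under the single-flip constraint by explicit enumeration can be sketched as follows (a random stand-in tensor W and an East-model-style constraint, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
d, chi, L = 2, 3, 6
W = rng.normal(scale=0.5, size=(2, d, chi, chi))  # (a_i, s_i, chi, chi)

def tilde_action(k, L):
    """a~_k: flip at site k (k < L), or no flip at all (k = L)."""
    a = [0] * L
    if k < L:
        a[k] = 1
    return tuple(a)

def phi(a, s, W):
    """phi(a, s) = Tr[W^{a_1 s_1} ... W^{a_L s_L}]."""
    M = np.eye(W.shape[2])
    for ai, si in zip(a, s):
        M = M @ W[ai, si]
    return np.trace(M)

def sample_action(s, W, constraint, rng):
    """Enumerate the L + 1 allowed weights, normalise, and sample."""
    L = len(s)
    acts = [tilde_action(k, L) for k in range(L + 1)]
    weights = np.array([constraint(a, s) * phi(a, s, W) ** 2 for a in acts])
    probs = weights / weights.sum()       # N(s) by explicit enumeration
    return acts[rng.choice(L + 1, p=probs)], probs

def east_constraint(a, s):
    """East-style: a~_k allowed iff s_{k-1} = 1; no-flip unless all spins up."""
    if sum(a) == 0:
        return 0 if all(x == 1 for x in s) else 1
    k = a.index(1)
    return s[k - 1]       # s[-1] wraps: periodic boundaries

s = (1, 0, 1, 1, 0, 0)
a, probs = sample_action(s, W, east_constraint, rng)
```

Disallowed flips receive exactly zero probability, and only L + 1 contractions are needed per sample, as described above.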
Local Kinetic Constraint: The second aspect of the constraints is the local kinetic constraint. Here, whether a spin at site k = 1, 2, ..., L can flip depends only on the states of the neighbouring sites at k − 1, k and k + 1. For example, in the case where the possibility of ã_k depends on a three-site neighbourhood, we can further write

C(ã_k, s) = C(ã_k, s_{k−1}, s_k, s_{k+1}),   k = 1, 2, ..., L.

Note that here we have separated out the "no-flip" action, ã_{L+1}, as this must typically be treated separately in a given problem.
While both the East model and the ASEP are subject to local kinetic constraints, the specific form of the local constraint function, C(ã_k, s_{k−1}, s_k, s_{k+1}), will depend on the model at hand. As such, the function approximations for the policies will differ slightly and, therefore, so will the implementation of the forward passes.

Forward Pass for π_w(a|S) in the East Model
We now describe the implementation of the forward pass for π_w(a|S) in the East model; see [85] for an explicit example implementation. In the East model, a spin can only flip if the spin to its left is up. As such, this local kinetic constraint can be captured by the local constraint function

C(ã_k, s) = δ_{s_{k−1},1},   k = 1, 2, ..., L.    (S10)

For the case of no flips, ã_{L+1}, we take this to be always possible unless every spin is up, i.e.,

C(ã_{L+1}, s) = 1 − δ_{s,(1,1,1,...,1)}.    (S11)

Due to the constraint Eq. (S8), we need consider only the possible actions ã_k. For these actions, the matrix product operator used in the function approximation for the policy (S6) takes the form

φ(ã_k, s) = Tr[ W^{(ã_k)_1 s_1} W^{(ã_k)_2 s_2} ⋯ W^{(ã_k)_L s_L} ],

where (ã_k)_i denotes the i-th component of ã_k, so that every factor carries the no-flip action index except (at most) the one at site k. Expressed in this manner, it is clear that the forward pass for φ(ã_k, s) can be implemented with a scan function. The corresponding procedure for the ASEP is very similar, although slightly more involved, since there the L + 1 potential actions ã_k change two sites rather than one. For further details, a full example implementation is given in [85].
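One way to realise such a sweep (our own sketch, with a random stand-in tensor W) is via prefix and suffix products of the no-flip matrices, so that all L + 1 values φ(ã_k, s) come out of a single pass instead of L + 1 independent contractions:

```python
import numpy as np

rng = np.random.default_rng(4)
d, chi = 2, 3
W = rng.normal(scale=0.5, size=(2, d, chi, chi))  # (a_i, s_i, chi, chi)

def all_phis(s, W):
    """Return [phi(a~_1, s), ..., phi(a~_L, s), phi(a~_{L+1}, s)] using
    prefix/suffix products of the no-flip matrices W[0, s_i]: a single flip
    just replaces one matrix in the chain."""
    L = len(s)
    I = np.eye(W.shape[2])
    prefix = [I]                      # prefix[k] = W[0,s_1] ... W[0,s_k]
    for si in s:
        prefix.append(prefix[-1] @ W[0, si])
    suffix = [I]                      # built backwards, then reversed
    for si in reversed(s):
        suffix.append(W[0, si] @ suffix[-1])
    suffix.reverse()                  # suffix[k] = W[0,s_{k+1}] ... W[0,s_L]
    phis = [np.trace(prefix[k] @ W[1, s[k]] @ suffix[k + 1])
            for k in range(L)]        # flip at site k
    phis.append(np.trace(prefix[L]))  # no-flip action a~_{L+1}
    return np.array(phis)

s = (1, 0, 1, 1, 0, 0)
phis = all_phis(s, W)
```

This costs O(L) matrix products in total, matching the cost of a single left-to-right sweep.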

FIG. 1. Actor-Critic with tensor networks (ACTeN). (a) Sketch of a Markov decision process. (b) In actor-critic RL, the state is passed to an "actor", which chooses the action, and to a "critic", which values the state given the reward. This value is used to improve the actor's policy. In ACTeN, the function approximators for actor and critic are tensor networks. (c) Top: typical trajectory of the ASEP at half-filling and L = 50 sites with one particle highlighted (blue), shown for 3000 steps. Bottom: trajectory with a current large deviation, sampled from the ACTeN solution for biasing (counting) field λ = −3. See the text for details.

A TN is a set of tensors connected via contraction over shared indices, see Fig. 1(b). This results in a single tensor that can be viewed as defining a multivariate function, φ(x) = T_x, T_x = C[T], where C indicates the chosen contractions and x all remaining uncontracted indices. For a given problem, the selection of an appropriate TN depends on factors such as dimensionality and geometry. Here we consider applying ACTeN to study one-dimensional (1D) systems with periodic boundaries (PBs) and L components, such that a state is S = (s_1, ..., s_L) with s_i taking d values. To represent the value function v_ψ(S) we use a translation invariant matrix product state (MPS), which mirrors the chain geometry of the system. This TN is built from a single real-valued tensor A^s_{ij} of shape (d, χ, χ).

FIG. 2. Dynamical large deviations in the East model using ACTeN. Scaled cumulant generating function for the dynamical activity of the East model as a function of biasing field λ from ACTeN (symbols), for L = 50 and PBC. Our RL results coincide with those obtained from the current state-of-the-art method using DMRG, cf. Ref. [64] (which is possible since the East model obeys detailed balance). Inset: kinetic constraint of the East model; a spin, s_i, can flip, s_i → 1 − s_i, only if the spin to its left is up, s_{i−1} = 1.

FIG. 4. Training Procedure and Learning Curves (ASEP). (a) For each bias [we show λ = −1 (top row), λ = 1 (middle row), λ = 2 (bottom row)] TN-based policies and value functions are produced via actor-critic optimization. These are initiated at random for L = 4 with χ = 16 and trained for 10^6 steps. Every 5000 training steps the average reward of the policy is evaluated over 10^4 steps (black squares) and the weights of the policy (which we call a "snapshot" for that time) are stored. The evaluated values can be compared to the training estimate of r(π) (red circles), which tends to overestimate r(π) initially. The policy snapshot with the highest evaluated r (blue dashed line) is used to initiate the policy for higher values of L. This is repeated every ∆L = 2 up to L = 50, with L = 14 shown here. (b) For each bias, several policies (here six) are independently trained via the same procedure from different random initial conditions. This produces a distribution of evaluated average rewards, here represented by the median (black squares) and inter-quartile range (red-shaded region). The policy with the maximum average reward at each L is selected as the optimal dynamics (blue triangles). (c) Same as (a) for L = 50. The learning curves appear noisier than in (a), but note that the vertical scale is much smaller. The learning rate is kept fixed throughout. (d) The distribution of r across parallel agents for L = 50 is again much tighter than for L = 14.
