Abstract
The stochastic dynamics of reinforcement learning is studied using a master equation formalism. We consider two different problems: learning in a two-agent game, and the multiarmed bandit problem with policy gradient as the learning method. The master equation is constructed by introducing a probability distribution over continuous policy parameters or, in a more advanced case, over both continuous policy parameters and discrete state variables. We use a version of the moment closure approximation to solve for the stochastic dynamics of the models. Our method gives accurate estimates for the mean and the (co)variance of policy variables. For the case of the two-agent game, we find that the variance terms are finite at steady state and derive a system of algebraic equations for computing them directly.
- Received 7 August 2022
- Revised 3 January 2023
- Accepted 16 February 2023
DOI: https://doi.org/10.1103/PhysRevE.107.034112
©2023 American Physical Society
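The abstract's second setting, policy-gradient learning on a multiarmed bandit, can be illustrated with a minimal simulation. The sketch below is not the paper's model: the Bernoulli reward probabilities, learning rate, and softmax parametrization are all assumed for illustration. It runs many independent learning trajectories and reports the empirical mean and variance of a policy parameter across runs, which are the kinds of moments a moment-closure treatment of the master equation would estimate directly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-armed bandit: Bernoulli rewards with assumed means.
p_reward = np.array([0.7, 0.4])
alpha = 0.1      # learning rate (assumed)
n_steps = 500
n_runs = 2000    # independent stochastic trajectories

# Softmax (Gibbs) policy with one preference parameter per arm, per run.
theta = np.zeros((n_runs, 2))

for _ in range(n_steps):
    logits = theta - theta.max(axis=1, keepdims=True)
    pi = np.exp(logits)
    pi /= pi.sum(axis=1, keepdims=True)
    # Sample one arm per trajectory from its current policy.
    arm = (rng.random(n_runs) > pi[:, 0]).astype(int)
    reward = (rng.random(n_runs) < p_reward[arm]).astype(float)
    # REINFORCE update: grad log pi(arm) = one_hot(arm) - pi
    one_hot = np.eye(2)[arm]
    theta += alpha * reward[:, None] * (one_hot - pi)

# Empirical moments of the preference difference across trajectories --
# the mean and variance that a moment-closure analysis approximates.
d = theta[:, 0] - theta[:, 1]
print(d.mean(), d.var())
```

Averaging over many trajectories plays the role of the ensemble described by the master equation's probability distribution over policy parameters; here the better arm ends with a positive mean preference difference and a nonzero variance.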