Unlearnable Games and “Satisficing” Decisions: A Simple Model for a Complex World

,

Classical economics is based on the idea that rational agents make optimal decisions, i.e. optimize their expected utility over future states of the world, weighted by their objective probabilities.Such an idealisation of human behaviour has been criticized by many (see e.g.[1,[3][4][5][6][7]).In particular, assuming that all agents are rational, allowing one to use game theoretic arguments to build such optimal strategies -often the result of complicated mathematical calculations -is implausible, to say the least.
A way to possibly save the rational expectation paradigm is to posit that agents are able to learn best responses from past experience.Yes, agents are only "boundedly" rational, but they learn and in the long run, they act "as if" they were rational [8].This is clearly expressed by Evans and Honkapohja in their review paper on the subject [9].They note that [i]n standard macroeconomic models rational expectations can emerge in the long run, provided the agents' environment remains stationary for a sufficiently long period.
While seemingly reasonable, this proposition is by no means guaranteed to be legitimate.Indeed, the hypothesis that the environment should be stationary over "sufficiently long periods" can be restated in terms of the speed of convergence of the learning process, that should be short enough compared to the correlation time scale of the environment.However, in many circumstances and in particular in complex games, the convergence of the learning process to a collectively optimal state can be exceedingly long, or may in fact never take place.For example, reasonable learning rules can trap the system in some sub-optimal regions of the (high dimensional) solution space, see e.g.[10][11][12][13].In other words, the learning process itself can be non-ergodic, even if the environment is described by an ergodic, stationary process.Another possibility is that agents' strategies, even probabilistic, evolve chaotically forever, as was found by T. Galla & D. Farmer [14] in the context of competitive multi-choice two-player games, or by one of us (JPB) with R. Farmer in a simple binary choice, multiplayer game [15].In such cases, the probabilities governing the different possible choices are not fixed but must themselves be described by probabilities.This is in fact a generic feature of "complex systems".As proposed by G. Parisi [16,17], the description of such systems requires the introduction of probabilities of probabilities, as their statistical behaviour themselves (and not only individual trajectories) are highly sensitive to the small changes in parameters, initial conditions, or time.The inability to describe such systems with knowable probabilities was coined "radical complexity" in [18].
The sensitivity of optimal solutions to the parameters of the problem, or to the algorithm used to find them, has a very real consequence: one can no longer assume that all agents, even fully rational, will make the same decision, since any small perturbation may lead to a completely different solution, although similar in performance.In other words, the "common knowledge" assumption is not warranted.[19] This has already been underlined for example in the context of portfolio optimisation in [20,21], or in the context of networked economies [13,22], but is expected to be of much more general scope, as anticipated by Keynes long ago and emphasized by many heterodox economists in the more recent past [1,4,6,7].
Here we want to dwell on this issue in the context of a multi-player binary game -the "SK-game" -, understood as an idealisation of the economic world where agents strongly interact in such a way that their payoffs depend non trivially on the action of others.In our setting, some relationships are mutually beneficial, while others are competitive.Agents have to learn how to coordinate to optimize their expected gains, which they do in a standard reinforcement way by observing the payoff of their past actions and adapting their strategies accordingly.
We find that our stylized model gives rise to a very wide range of dynamical behaviour: sub-optimal fixed points (a.k.a.Nash equilibria), but also limit cycles when learning is too fast, and "chaos", meaning that individual mixed strategies never converge.Chaos can be deterministic, when learning is noiseless, or stochastic when random noise is introduced in the learning process.In such chaotic situations, the environment of each agent changes over time, not because of exogenous shocks but because of the endogenous learning dynamics of other agents.As also argued in [13], a purely static network of interactions can generate an apparent never-ending evolution of the world in which agents live, although it may appear stationary for very long periods of time.In such a case, one speaks of "aging", which we address in a dedicated section below.
Even when learning is efficient and fixed points are reached, the average pay-off of agents -although better than random -is noticeably lower than the maximum possible value, that could be obtained if a perfectly in-formed social planner with colossal computing power was dictating their strategies to each agent.
Hence, our model provides an explicit example of Herbert Simon's satisficing principle [1]: with reasonable, boundedly rational methods, agents facing complex problems can only reach some satisfying and sufficing but sub-optimal solutions.In fact, in the absence of the omniscient and omnipotent social planner alluded to above, agents can hardly do better -the world in which they live in is de facto unlearnable.This is the main message of our paper.
A further interesting twist of the model is that the fixed point reached by the learning process depends sensitively on the initial condition and/or the detailed structure of the interaction network.As a consequence, which agents will do better than average and which will do worse is totally unpredictable, even when the economic interactions between agents are fixed and known.Correspondingly, small changes in the interaction network (as a result ofsay -small exogenous shocks, or as some agents die and other are born [15]) can lead to very large changes in individual strategies that agents would have to re-learn from scratch.
This feature actually corresponds to another possible definition of a "complex" system [17]: small perturbations can lead to disproportionally large reconfigurations of the system -think of sand (or snow) avalanches as an archetype of this type of behaviour: a single grain of sand might leave the whole slope intact or, occasionally, trigger a landslide [23] (see also, e.g., [4,6,22,[24][25][26][27][28][29] for early and more recent discussions of complexity ideas in the context of economics).

B. Related Ideas
Our work follows the footsteps of several important contributions on the subject of learning in economics, beyond the papers already quoted above, such as [1,9], and in particular the complex two-player game of Galla & Farmer [14].Another relevant work is the study of Grandmont [3] on the stability of economic equilibrium under different learning rules.Although his general conclusions are somewhat similar to ours, there are important differences, due to the fact that complexity, in our model, arises from the strong interaction between agents.The main differences are the following: • While our agents can in some cases (i.e. when interactions are reciprocal) converge to a locally stable equilibrium, this equilibrium is rather inefficient compared to the best theoretical one, which is unattainable without the help of a central planner with super-powers.Furthermore, instead of a single equilibrium in the Grandmont case, the equilibrium reached by our learning agents is one among an exponential number of possible equilibria, that are each sensitively dependent on the parameters of the model.In other words, which equilibrium is reached by the agents in totally unpredictable.
• As soon as some small amount of noise is present, Grandmont style instabilities set in and the system starts exploring the set of possible equilibria, albeit in a slow, non-ergodic way.In other words, the stability of the visited equilibria increases with the age of the system.
• In the case where interactions are sufficiently nonreciprocal, there are no fixed points to the learning dynamics except the trivial one where agents play randomly at each turn (like "rock-paper-scissors").However, this is not what agents agree to do.They keep holding strong beliefs at each time step, such beliefs evolving chaotically in time -but now in an ergodic way.[30] Another somewhat related idea can be found in the recent paper of Hirano & Stiglitz [31], where there can be a plethora of equilibrium trajectories, neither converging to a steady state or even to a limit cycle, what they call "wobbly" macro-dynamics.We may finally mention the work of Dosi et al. [32], which showcases that sophisticated learning rules can fail at improving the individual and collective outcomes when interacting agents are heterogeneous.Although the context is quite different, such a finding appears to be consistent with the saturation of the average reward at sub-optimal satisficing levels that we observe when the memory loss rate of our learning agents vanishes.

C. Outline of the paper
The layout of the paper is as follows.In Sec.II, we introduce our spin-glass inspired game.We then review the model's main features and present what we believe are the most important conceptual results in the socioeconomic context in Sec.III.In Sec.IV, we enter the technical analysis of the "SK-game" by discussing the existence and abundance of fixed point solutions.A similar analysis is conducted for short limit cycles when there are no fluctuations in the system in Sec.V. Having determined when these solutions may or may not exist, we take interest in the N → ∞ dynamics in Sec.VI by writing the Dynamical Mean-Field Theory of the problem.Combining the static and dynamic pictures, we focus on the fully rational (or "zero temperature") limit and in particular on the role of learning on the collective dynamics in Sec.VII.Section VIII finally considers the effect of fluctuations stemming from the bounded rationality of agents in their decision making process.In Sec.IX, we summarize our findings and discuss future perspectives as well as the relevance of our model for other complex systems such as biological neural networks.

II. A SIMPLE MODEL FOR A COMPLEX WORLD A. Set-up of the Model
As a minimal, stylized model for decision making in a complex environment of interacting agents, we restrict ourselves to binary decisions, as in many papers on models with social interactions, see e.g.[15,[33][34][35][36].At every timestep t, each agent i plays S i (t) = ±1, with i = 1, . . ., N , which can be thought of, for example, as the decision of an investor to buy or to sell the stock market, or the decision of a firm to increase or to decrease production, etc.The incentive to play and is the agent's estimate of the payoff associated to S i (t) = +1 compared to that of S i (t) = −1.The actual decision of agent i is probabilistic and drawn using the classic "logit" rule [37], i.e. sampled from a Boltzmann distribution over the choices, (1) or, equivalently, the expected choice m i (or "intention") of agent i at time t is given by Parameter β, assumed to be independent of i henceforth, is analogous to the inverse temperature in statistical physics and represents the agent's rationality or intensity of choice.[38] The limit β → ∞ corresponds to perfectly rational agents, in the sense that they systematically pick the choice corresponding to their preference (given by the sign of Q i (t)), while setting β = 0 gives erratic agents that randomly pick either decision with probability 1/2 regardless of the value of Q i (t).This will in fact turn out the be the case, in our model, in a whole region of parameter space: when β is smaller than some critical value β c , all intentions m i do converge to zero, leading to a random string of decisions.The evolution of the preference Q i (t) is where the learning takes place.We resort in so-called "Q-learning" [39], i.e. reinforcement learning with a memory loss parameter α.Given the (yet unspecified) reward ±R i (t) associated to making the choice ±1 at time t, the evolution of incentives (and, in turn, beliefs) is given by This map amounts to calculating an Exponentially Weighted Moving Average (EWMA) on the history of rewards R i (t).Taking α = 0, the agent's preferences are fixed at their initial values, and we thus restrict ourselves to α > 0. When α → 0, Q i (t) is approximately given by the average reward over the last α −1 time steps.Note here that this averaging of past rewards is not exactly the same as the accumulation rule (where the reward would not be multiplied by α in Eq. ( 3)) appearing in some forms of "Experience Weighted Attraction" that are popular in the socio-economic context [40].In fact one can always rescale the prefactor α into β, so our choice just means that Q does not artificially explode as α → 0, which would impose that β effectively diverges in that limit.Now, the missing ingredient is the specification of the rewards, that encodes heterogeneity and non-reciprocity of interactions.Inspired by the theory of spin-glasses, in particular by the Sherrington-Kirkpatrick (SK) model [41,42], we set Here, the matrix elements J ij specify the mutually beneficial or competitive nature of the interactions between i and j. (Note that J ij measures the impact of the decision of j on the reward of i.) In the context of firm networks, a client-supplier relation would correspond to J ij > 0, whereas two firms i, j competing for the same clients would correspond to J ij < 0. In the so-called "Dean problem", J ij > 0 means that agents i and j get along well whereas J ij < 0 means that they are in conflict [43].The sign of S i determines in which of the two available rooms agent i should sit, in order to minimize the number of possible conflicts.A predator-prey situation is when J ij × J ji < 0, meaning that if i makes a gain, j makes a loss and vice versa.
Note that reward R i (t) depends on the actual (realized) decision of other players, and not their expected decisions or intentions.In other words, agents resort in online learning, which differs from offline learning where other players' decisions S i (t) are averaged over large batch sizes during which their inclinations would be assumed constant and replaced by their expectation m i (t).
Based on the learning dynamics, agents thus make a decision based on an imperfectly learned approximation of what other players are likely to do.Indeed, using Eq. ( 4) to express Q i (t) as the time-weighted sum (EWMA) of past rewards, plugging Eq. ( 3), and finally replacing in Eq. ( 2), one finds that the evolution of agent i's intention writes where mα i (t) can be interpreted as the estimate of agent j's expected decision at time t+1 based on its past actions up to time t, Expressed in this form, it is clear that there is characteristic timescale τ α ∼ 1/α over which past choices contribute to the moving average mα i (t).Note that offline learning would correspond to a different evolution equation, namely although the two coincide in the α → 0 limit.At this stage, it may be useful to compare and contrast the present model with previous work.On the one hand, the learning procedure closely resembles the original proposition by Sato and Crutchfield [44], and its treatment by Galla [14,45,46] and others [47][48][49][50], however these authors considered games comprising only two players with many strategies.The subsequent case explored by Galla and Farmer considering a larger number of players [51] therefore lies closer to our setting, but still consider many strategies, while the similar binary decision models proposed by Semeshenko et al. [52,53] are restricted to perfectly rational agents and homogeneous interactions J ij = J 0 > 0 ∀i, j.Note that all these works also consider accumulated rewards and offline learning, in contrast with our averaged rewards and online learning (for a comparison between the two and other learning dynamics, see [50]).On the other hand, replicator models with random non-symmetric interactions between a large number of species [54,55] share many features with the system at hand, but the prescribed dynamics are inherently linked to evolutionary principles such as extinction that are not present in our model.Finally, other Isinginspired games such as that introduced in [56] are conceptually similar, in particular in their extension with myopic strategy revision (meaning updating based on future expectations and not directly passed realizations as done here) [57].So far, however, these models have been studied without heterogeneities and therefore do not present the "radical complexity" related to the presence of a very large number of possible solutions discussed hereafter.

B. The Interaction Matrix
In order to rely on known results about the SK model, we will assume in the following that the all agents randomly interact with one another, meaning that all elements of the matrix J are non-zero.Sparse matrices, corresponding to low-connectivity interaction matrices, would probably be more realistic in an economic context.However, we expect that many of the conclusions reached below will qualitatively hold in such cases as well.
We choose interactions J ij between i and j to be random Gaussian variables of order N −1/2 , with J ij in general different from J ji , accounting for possible nonreciprocity of interactions.More precisely, we introduce the parameter ε and write the interaction matrix as with J S a symmetric matrix and J A an anti-symmetric matrix.The entries of both these matrices are independent and sampled from a Gaussian distribution of mean 0 and variance σ 2 /N (the case of a non-zero average value of J S will be briefly discussed below).This defines what we will call the "SK-game" henceforth.
In the following we set σ = 1 without loss of generality.The resulting variance of J ij is thus given by The specific cases ε = {0, 1, 2} hence correspond to fully symmetric (J ij = J ji ), a-symmetric (i.e.J ij and J ji independent) and anti-symmetric (J ij = −J ji ) interactions respectively.We can thus also characterize the correlation between J ij and J ji through parameter η, where overlines indicate an average over the disorder.It may actually be insightful to allow for a non-zero average value to the interaction parameters, and define the reward R i (t) as (11) Note that if only the J 0 term is present, the game becomes simple, in the sense that either J 0 > 0 and βJ 0 > 1 and all agents converge to the same strategy m i ≡ m with m a non zero solution of m = tanh(βJ 0 m), or they converge to the random "rock-paper-scissor" strategy m i ≡ 0.
Finally, one may think that agents have some idiosyncratic preferences, or different costs associated to the two possible decisions S i = ±1.This would amount to adding to the reward R i (t) a time independent term H i , where H i favors S i = +1 if positive and S i = −1 if negative.In the present paper we will restrict to H i ≡ 0, ∀i, but one expects from the literature on spin-glasses that the main results discussed below would still hold for small enough H i s.Beyond some threshold value, on the other hand, agents end up aligning to their a priori preference, i.e. m i H i > 0.

III. BASIC INTUITION AND MAIN RESULTS
In this section, we review the most salient properties of the "SK-game" introduced above, and why these results may be of interest in the context of socio-economic modeling.This section is intended to be give a broad, nontechnical overview, whereas the following sections will delve in more details, using the language and methods of statistical physics.Readers more interested in conceptual results and less in technical details are invited to read this section and skip the subsequent ones, and jump directly to the final discussion Sec.IX.
In the SK-game, the payoff of each agent is a random function of the decisions of all other agents.Hence, learning the optimal strategy (in terms of the probability for FP and (Q) refer to "fixed point" and "quasi" respectively (mi(t) = 0 ∀t)."Random" refers to the case where agents pick ±1 with equal probability, whereas "chaos" means that at a given instant of time players have well defined intentions mi but these evolve chaotically with time.
agent i to play +1 or −1) is bound to be extremely difficult.Defining the average reward per agent as and noting that , the largest possible average reward R ∞ for N → ∞ can be exactly computed using the celebrated Parisi solution of the classical SK model and reads [42]: However, in practice this optimal value can only be reached using algorithms that need a time exponential in N .[58] Hence, it is expected that simple learning algorithms will inevitably fail to find the true optimal solution.Nevertheless, we also know from the spinglass folklore [42] (see below for more precise statements) that many configurations of {S i }'s correspond to quasioptima, or, in the language of H. Simon, satisficing solutions [1] (see also [21]).It is in a sense the proliferation of such sub-optimal solutions that prevent simple algorithms to find the optimum optimorum.Furthermore, if learning indeed converges (which is not the case when ε is too large, i.e. when interactions are not reciprocal enough), the obtained fixed point heavily depends on the initial condition and/or the specific interaction matrix J.

A. Phase Diagram in the Noiseless Limit
Let us first consider the case where agents always choose the action that would have had the best average reward in the past Q i (t) (assuming that other agents still played what they played).This corresponds to the noiseless learning limit β → ∞.In this case, the iteration map Evolution of the average reward with (a) the memory loss rate α, for ε = {0, 0.6, 0.85, 1.05, 1.5, 2} from dark purple to light green, β → ∞; (b) the asymmetry ε for α = 0.01 and α = 0.1, β → ∞, the vertical dotted line indicates the value εc above which the system becomes chaotic [61]; (c) the noise level 1/β for α = 0.01, ε = {0, 0.6} (dark purple and blue respectively), note the non-monotonic behaviour; (d) system size N for α = 0.01, β → ∞, ε = {0, 0.6} (dark purple and blue respectively), continuous lines showing fits R N = R∞ − AN −2/3 , excluding N = 32.The dotted line indicates the best fit for the SK ground state, for which A ≈ 1.5 [62].In Eq. ( 5) becomes and the model is fully specified by two parameters: α (controlling the memory time scale of the agents) and ε (controlling the reciprocity of interactions).For N not too large, the evolution of Eq. ( 14) leads to either fixed points, or oscillations, or else chaos.The schematic phase diagram in the plane (α, ε) is shown in Fig. 1.One clearly sees a region for (α, ε) small where learning reaches a fixed point, in which the average reward is close, but significantly below the theoretical optimum R ∞ given by Eq. ( 13), see Fig. 2. [59] Note that learning definitely helps: for ε = 0, most fixed points are characterized by a typical reward R ≈ 1.01 [60], significantly worse than the value ≈ 1.40 reached by our learning agents, extrapolated to N → ∞.
As ε and α are varied one observes the following features: • When ε is not too large (interactions sufficiently reciprocal) and α increases (shorter and shorter memory) learning progressively ceases to converge and oscillations start appearing: impatient learning generates cycles.This leads to a sharp decrease of the average reward (see Fig. 2(a)), as agents overreact to new information and are no longer able to coordinate on a mutually beneficial equilibrium.A similar effect was observed in dynamical models of supply chains, where over-reaction leads to oscillating prices and production -the so-called bullwhip effect [63] (see also [22,64]).
• Conversely, when α is small (long memory) and ε increases, the probability to reach a fixed point progressively decreases, and when a fixed point is reached, the average reward is reduced -see Fig. 2(b).Beyond some threshold value, the dynamics becomes completely chaotic, leading to further loss of reward.Note that "chaos" here means that although agents have well defined intentions m i (t) ≠ 0 at any given instant of time, these intentions evolve chaotically forever.[65] • Surprisingly, oscillations reappear when ε becomes larger than unity, i.e. when interactions are mostly anti-symmetric, "predator-prey" like.Perhaps reminiscent of the famous Lotka-Volterra model, agents' decisions and payoffs become periodic, with a period that scales anomalously as α −1/2 , i.e. much shorter than the natural memory time scale α −1 .
Although not the Nash equilibrium m i ≡ 0, this oscillating state allows the average reward to be positive, even when at each instant of time, some agents have negative rewards.
• Only in the extreme competition limit ε = 2 (corresponding to a zero sum game, Eq. ( 13)) and for small α, are agents able to learn that the unique Nash equilibrium is to play random strategies m i ≡ 0 (see Fig. 1, bottom right region).[66] • Finally, in the extreme (and unrealistic) case α = 1, where agents choose their strategy only based on the last reward, the system evolves, as ε increases from zero, from high frequency oscillations with period L = 2 to "weak chaos", to "strong chaos" when ε ≈ 1, and finally back to oscillations of period L = 4 when ε → 2.
In order to characterize more precisely such temporal behaviours, it is useful to introduce the two-point autocorrelation function of the expected decisions or intentions: where the angular brackets now refer to an average over initial conditions.[67] In cases where the dynamics are assumed to be time-translation invariant, we will write C(τ ) which corresponds to the above quantity averaged over time after the system has reached a steady-state.
The autocorrelation function corresponding to the different cases described above are plotted in Fig. 3.Note that the signature of oscillations of period L is that C(nL) ≡ 1 for all integer n.However, note that when ε < ε c , not all spins flip at each time step.The fact that C(2n + 1) = 0 means that half of the spins in fact remain fixed in time, while the other half oscillate in sync between +S i and −S i .[68] In the chaotic phases, C(τ ) tends to zero for large τ , with either underdamped or overdamped oscillations.Hence in these cases, the configuration {S i } evolves indefinitely with time, and hardly ever revisit the same states.
The previous results hold in the long time limit for finite N , i.e. when we take formally the limit t → ∞ before N → ∞.There is however another regime for which the N → ∞ limit is taken first, which might be more appropriate in the context of firm networks (say) when the number of firms is very large.
Interestingly, specific analytical tools are available to treat this regime -the so-called Dynamic Mean-Field Theory (DMFT) that we will use below.
In this infinite N regime, new types of behaviour appear that one can call quasi-fixed points or quasi-cycles.In the case of quasi-fixed points, learning does not strictly speaking converge, but actions S i (t) fluctuate around time-independent averages.In other words, the twopoint correlation C(τ ) is not equal to one for all τ (which would be the case for a fixed point) but reaches a positive plateau value for large τ : Only when ε = 0 does one find C ∞ = 1.The same holds for quasi-cycles of length L if one considers the correlation function computed for τ = nL, with n an integer: The schematic phase phase diagram drawn in Fig. 1 then continues to hold in the large N finite t limit, provided one interprets "fixed points/cycles" as "quasi-fixed points/cycles" in the sense defined above.More details on the subtle role of α, N and t will be provided in the next, technical sections.

C. Noisy Learning & "Aging"
In the presence of noise, the "convictions" |m i | of agents naturally decrease, and in fact become zero (i.e.decisions are totally random) beyond a critical noise level that depends on the asymmetry parameter ε: more asymmetry leads to more fragile convictions (see Fig. 16 below for a more precise description).
When the noise is weak but non-zero, strict fixed points do not exist anymore, but are replaced (for ε small enough) by quasi-fixed points -in the sense that the intentions m i fluctuate around some plateau value for very long times, before evolving to another configuration completely uncorrelated with the previous one.This process goes on forever, albeit at a rate that slows down with time: plateaus become more and more permanent.This is called "aging" in the context of glassy systems.
In a socio-economic context, it means that a form of quasi-equilibrium is temporarily reached by the learning process, but such a quasi-equilibrium will be completely disrupted after some time, even in the absence of any exogenous shocks.This is very similar to the quasinonergodic scenario recently proposed in Ref. [15], although in our case the evolution time is not constant but increases with the "age" of the system, i.e. the amount of time the game has been running.
Perhaps counter-intuitively, however, the role of noise is on average beneficial when 1/β is not too large.Indeed, as shown in Fig. 2, the average reward first increases as a small amount of noise is introduced, before reaching a maximum beyond which "irrationality" becomes detrimental.The intuition is that without noise, the system gets trapped by fixed points with large basins of attraction, but lower average rewards.A small amount of noise allows agents to reassess their intentions and collectively reach more favorable quasi-fixed points, much as with simulated annealing or stochastic gradient descent.
When learning leads to a chaotic evolution, i.e. when J ij and J ji are close to uncorrelated (ε ∼ 1), noise in the learning process does not radically change the evolution of the system: deterministic chaos just becomes noisy chaos.However, there is still a distinction between a low-noise phase where at each instant of time, agents have non-zero expected decisions m i (that will evolve over time) from a high-noise phase where agents always make random choices between ±1 with probability 1/2 (see Fig. 1).
Finally, in the case where learning leads to cycles, any amount of noise irremediably disrupts the synchronisation process, and cycles are replaced by pseudo-cycles, with either underdamped or overdamped characteristics.In the limit ε → 2, fluctuations drive the system to a paramagnetic state where q = C(0) = 0 (see Fig. 16 below), meaning the agents remain undecided.

D. Individual Rewards
As we have noted above, the average reward is close, but significantly below the theoretical optimum R ∞ given by Eq. ( 13).However, some agents are better off than others, in the sense that the individual reward e i at the fixed point (when fixed points exist), is different from agent to agent.Noting that where the second equality holds because at the fixed point one must have , it is clear that in the fully reciprocal case ε = 0, all rewards e i are positive.The distribution ρ(e) of these rewards over agents is expected to be self-averaging for large N , i.e. independent of the specific realisation of the J ij and of the initial condition.Such distribution is shown in Fig. as for the standard SK model, although the value of κ is distinctly different from the one obtained for the true optimal states of the SK model, for which κ SK ≈ 0.6 [69,70].Such a discrepancy is expected, since the fixed points are obtained as the long time limit of the learning process -in particular, since κ > κ SK , the number of poorly rewarded agents is too high compared to what it would be in the truly optimal state.Note that once α is sufficiently small for the system to reach a fixed point, its precise value does not seem to have an impact on the distribution of rewards and κ.
Another important remark is that the distribution of rewards ρ(e) does not develop a "gap" for small e, i.e. a region where ρ(e) is exactly zero.In other words, although all agents have positive rewards, some of them are very small.This is associated with the so-called "marginal stability" of the equilibrium state [71], to wit, its fragility with respect to small perturbations, as discussed in more details in the next subsection.
For very large e, the distribution ρ(e) decreases like a Gaussian (Fig. 4 inset), corresponding to a Central Limit Theorem behaviour in that regime, as for the SK model.[72] Fig. 5 shows how the rewards of individual agents evolve from an initially random configuration, before settling to constant (but heterogeneous) values at the fixed point.
As competitive effects get stronger (i.e. as ε increases) and the system ceases to reach a fixed point, the distribution ρ(e) develops a tail for negative values of e, meaning that some agents make negative gains, see Fig. 6.In the extreme "predator-prey" limit ε = 2, the distribution ρ(e) becomes perfectly symmetric around e = 0, as expected -see Fig. 6, rightmost plot.However, note that there is no persistence in time of the winners: the individual reward autocorrelation function eventually decays to zero, possibly with oscillations in the competitive region.

E. Unpredictable Equilibria
Now, the interesting point about our model is that the final rewards are highly dependent on the initial conditions and/or the realisation of the J ij 's.In other words, successful agents in one realisation of the game become the losers for another realisation obtained with different initial conditions.A way to quantify this is to measure the cross-sectional correlation of final rewards for two different realisations, i.e.
where a, b corresponds to two different initial conditions and ⟨e⟩ corresponds to the cross-sectional average reward.
As shown in Fig. 7, C × N goes to zero at large N , indicating that the final outcome of the game, in terms of the winners and the losers, cannot be predicted.The dependence on N appears to be non-trivial, with different exponents governing the decay of the mean overlap C × N (decaying as N −0.85 ) and its standard deviation (decaying as N −2/3 ).
A similar effect would be observed if instead of changing the initial condition one would randomly change the interaction matrix J by a tiny amount ϵ.The statement here is that for any small ϵ, C × N goes to zero for sufficiently large N .This is called "disorder chaos" in the context of spin-glasses [73]; by analogy with known results for the SK model, we conjecture that C × N is a decreasing function of N ϵ ζ , where ζ is believed to be equal to 3 in the SK case [74].This means that when N ≫ ϵ −ζ the rewards between two systems with nearly the same interaction structure, starting with the same initial conditions, will be close to independent.
Such a sensitive dependence of the whole equilibrium state of the system (in our case the full knowledge of the intentions m i of all agents) prevents any kind of "common knowledge" assumption about what other agents will decide to do in a specific environment.No reasonable learning process can lead to a predictable outcome; even the presence of a benevolent social planner assigning their optimal strategy to all agents would not be able to do so without a perfect knowledge of all interactions between agents and without exponentially powerful (in N ) computing abilities.Such a "radically complex" situation leads to "radical uncertainty" in the sense that the behaviour of agents, even rational, cannot be predicted.Learning agents can only achieve satisficing solutions, that are furthermore hypersensitive to details.As we have seen in Sec.III C, any amount of noise in the learning process will make the whole system "jump" from one satisficing solution to another in the course of time.

F. Increasing Cooperativity
A way to help agents coordinate is to use rewards given by Eq. ( 11) with J 0 > 0, representing a non-zero average (a) cooperative contribution to rewards.This term obviously helps agents finding mutually beneficial strategies.(Note that with our normalisation, the J 0 term is in fact N −1/2 times smaller than the random interaction terms J ij .) The impact of such a term is well understood in the context of the SK model for ε = 0 [75], which nicely translates into the current dynamical framework.For β = ∞, one finds that whenever J 0 ≤ 1, the average intention M (t) remains zero for large N and one expects that the learning process is not affected by such a "nudge".When J 0 > 1, on the other hand, the situation changes as all agents start coordinate on one of the two possible choices.As shown in Fig. 8, the average intention becomes non-zero, although a finite fraction of agents still play opposite to the majority because of their own idiosyncratic rewards.
For J 0 ≫ 1, radical complexity disappears and learning quickly converges to the obvious optimal strategy where all agents make the same move S i = +1 or S i = −1, ∀i.In this case, R N = J 0 as M (t) eventually reaches unity, see Fig. 8.For ε > 0, the same occurs albeit for different values of J 0 .
In the case J 0 < 0 with |J 0 | ≫ 1, the only solution of Eq. ( 7) (valid for α → 0) is m i = 0 for all i, i.e. agents cannot coordinate and play random strategies.

G. Habit Formation
Up to this point, all results have assumed that there is no self-interaction, J ii = 0. Nonetheless, it is interesting to consider the possibility of having an O(1) positive diagonal term in the interaction matrix.In the socio-economic context, such a contribution is relevant as it represents self-reinforcement of past choices, which is also called "habit formation" where agents stick to past choices, a popular idea in behavioural science, see e.g.[10,12] and refs.therein.
The introduction of a diagonal contribution has important consequences for the problem.Assuming the self-interaction is identical for all agents, J ii = J d > 0, it will rather intuitively favor the emergence of fixed points since agents are tempted to stick to past choices.It is for instance known that in the case of fully random interactions ε = 1, fixed points will start to appear when J d is sufficiently large [76].Interestingly, these fixed points can be very difficult to reach dynamically with standard Hopfield dynamics (α = 1).
Adding such diagonal contribution to our learning dy-namics, we have observed that the fraction of trajectories converging to seemingly dynamically inaccessible configurations significantly increases, specially when α ≪ 1.
While further work is required to precisely assess the effectiveness of learning with self-reinforcement (particularly as finite size effects appear to play a significant role), such a result is consistent with the overall influence of learning reported here.

H. Core Messages
In line with the conclusions of Galla & Farmer [14], our multi-agent binary decision model provides an explicit counter-example to the idea that learning could save the rational expectation framework (cf.Sec.I and Ref. [9]).
Learning, in general, does not converge to any fixed point, even when the environment (in our case the interaction matrix J) is completely static: non-stationarity is self-induced by the complexity of the game that agents are trying to learn, as also recently argued in [13].
When learning does indeed converge (which requires a minima a high level of reciprocity between agents) the collective state reached by the system is far from the optimal state, which only a benevolent, omniscient social planner with formidable powers can achieve.In other words, even more sophisticated learning rules would not really improve the outcome: the SK-game is unlearnable and -as argued by H. Simon [1] -agents must resort to suboptimal, satisficing solutions.
Furthermore, any small random perturbation (noise in the learning process, or slow evolution in the environment) eventually destabilises any fixed point reached by the learning process, and completely reshuffles the collective state of the system: in the long run, agents initially favoring the +1 decision end up favoring −1, and betteroff agents end up being the underdogs, and vice-versa (much as in the simpler model of Ref. [15]).
Finally, even in the most favourable case of a fully reciprocal game with slow learning, the average reward is in fact improved when some level of noise (or irrationality) is introduced in the learning rule, before degrading again for large noise.

IV. FIXED POINT ANALYSIS AND COMPLEXITY
We have seen that our model displays a wide variety of complex collective dynamics.Only in some cases does learning converge to non-trivial fixed points where strategies are probabilistic but with time independent probability p ± to play ±1, such that p ± = (1±m ⋆ i )/2 for agent i.Such a steady-state would be analogous to an economic equilibrium (although it is essential to dissociate this notion from that of a thermodynamic equilibrium, which may only exist in the case of fully reciprocal interactions, ε = 0).We will mostly focus, in the following, on the long memory case α ≪ 1 which is most relevant for thinking about learning in a (semi-)realistic context.In this case, one can show that the exponential moving average on the realized values S i (t) converges to one on the expected values m i (t).Indeed, as detailed in Appendix A, (18) This means that up to fluctuations of order √ α, we can describe the dynamics of the system through a deterministic iteration on m i (t), in fact corresponding to offline learning.(We will see below that the neglected fluctuations are of order √ α/β.)Further making the ansatz that the mean-field dynamics will eventually reach a fixed point m i (t) = m ⋆ i ∀i given sufficient time, Eq. ( 5) then yields This equation is known in the spin-glass literature as the Naive Mean-Field Equations (NMFE) when ε = 0, and defines a so-called static Quantal Response Equilibrium, similar to its fully mean-field equivalent (J ij = J/N ) studied in [56].
To the reader familiar with the physics of disordered systems, this equation is immediately reminiscent of the celebrated Thouless-Anderson-Palmer (TAP) equation [77] describing the mean magnetization in the Sherrington-Kirkpatrick (SK) spin-glass [41].Physically, the NMFE equation is satisfied when minimizing the free energy of a system of N sites comprising M → ∞ binary spins, with sites interacting through an SK-like Hamiltonian [78].Despite being seemingly simpler than its previously mentioned TAP counterpart, which includes an additional Onsager "reaction term", the NMFE shares many of its properties.Relevant to our problem, both the NMFE and the TAP equations have a paramagnetic phase (m ⋆ i = 0 ∀i) for β < β c , while above this critical value there is a spin-glass phase where q ⋆ = N −1 ∑ i (m ⋆ i ) 2 > 0 and solutions are exponentially abundant in N [78][79][80].The NMFE has a critical temperature 1/β c = 2 as opposed to 1/β c = 1 in the TAP case, while the two equations become strictly equivalent in the β → ∞ limit.
Using known properties from the spin-glass literature, we can therefore already establish that if the system reaches a fixed point when interactions are fully reciprocal (ε = 0) and memory is long ranged, it will be either a trivial fixed point where agents continue making random decisions for ever (m ⋆ i = 0 ∀i), or, when learning is not too noisy (β > β c ) the number of fixed points is ∼ exp[Σ(β)N ], where Σ(β) is called the "complexity".In this second case, the fixed point actually reached by learning depends sensitively on the initial conditions and the interaction matrix J.
How is this standard picture altered when interactions are no longer reciprocal?In such cases, the system cannot be described using the equilibrium statistical mechanics machinery.

A. Critical Noise Level
In order to extend the notion of critical noise β c to ε > 0, one can naively look at the linear stability of the paramagnetic solution m ⋆ i = 0 ∀i to Eq. (19).Just as in the TAP case [42], expanding the hyperbolic tangent to the second order and projecting the vector of m i on an eigenvector of J, the stability condition can be expressed with the largest eigenvalue of the interaction matrix.Adapting known results from random matrix theory to our specific problem formulation, the spectrum of J can be expressed as an interpolation between a Wigner semi-circle on the real axis (ε = 0), the Ginibre ensemble (ε = 1) and a Wigner semi-circle on the imaginary axis (ε = 2) [81].The resulting critical "temperature" is then given by recovering the known result 1/β c = 2 for the case ε = 0. (We recall that we have set the interaction variance σ 2 to unity throughout the paper.If needed σ can be reinstalled by the rescaling β → βσ.)

B. The Elusive Complexity
To determine if there are still an exponential number of fixed point to reach below the candidate critical noise level, i.e. if there is a spin-glass phase, for β > β c (ε) when ε > 0, we should study the complexity, defined for a single realization of the disorder as where N J is the number of fixed points in the system for a given interaction matrix.There are then two ways to compute an average of this quantity over the disorder: the "quenched" complexity, where the mean of the logarithm of the number of solutions is considered, and its "annealed" counterpart, where the logarithm is taken on the mean number of solutions.The former is usually considered to be more representative, as unlikely samples leading to an abnormally large number of solutions can be observed to dominate the latter (see e.g.[21,82] for recent examples), but requires a more involved calculation with the use of the so-called "replica trick" [42].In the TAP case, quenched and annealed complexities coincide for solutions above a certain free-energy threshold [60] (where most solutions lie but importantly not the ground state).
As a matter of fact, even in the annealed case, the computation of the TAP complexity has proved to be a formidable task, and has sparked a large amount of controversy, as the original solution computed by Bray and Moore (BM) [60] has been put into question before being (partially) salvaged by the metastability of TAP states in the thermodynamic limit [83].For a relatively up to date summary of the situation, we refer the reader to G. Parisi's contribution in [84].
While the BM approach can be adapted to the NMFE [79,85], several aspects of the calculation remain unclear, particularly as the absence of a sub-dominant reaction term means that the argument of the metastability of states in the N → ∞ limit is no longer valid a priori, although numerical results support the marginally stable nature of NMFE fixed points in the thermodynamic limit [79].We leave its extension to ε > 0 to a later dedicated work.
Nevertheless, the previously introduced critical β c and the existing computation of the number of fixed points as a function of ε in the β → ∞ limit [61,86] can be used to conjecture the boundaries of the region in (β, ε) space where the complexity Σ is non-vanishing.Indeed, in the zero temperature case, it has been shown [61] that the annealed complexity can be expressed as a function of the asymmetry parameter η defined in Eq. (10) as with Φ the Gaussian cumulative density and x is the solution to The main insight provided by this result is that the complexity vanishes at η = 0, corresponding to ε = 1, where the paramagnetic fixed point is supposed to be unstable as As the complexity is a decreasing function of temperature, this therefore means that ε = 1 is an upper limit for the existence of fixed points when β is finite.This conjecture is also consistent with the breakdown of fixed point solutions to the dynamical mean field theory below the critical noise level that will be discussed in Sec.VIII A, as well as the saddle point equations obtained when adapting the BM calculation to ε > 0.
Combining these two somewhat heuristic delimitation for the existence of a large number of non-trivial fixed points, we obtain the critical lines shown in Fig. 9(a).Overlaying these borders with the annealed complexity measured numerically, we find a very good agreement.In particular, the vanishing of the complexity at ε = 1 in appears to be consistent for T < T c (ε = 1) = 1/ √ 2, as shown in Fig. 9(b).The agreement with the β → ∞ analytical result, represented by the continuous line, also appears to validate our counting method at low temperatures.Note that one can in fact show that for ε > 1 and N = ∞, the only fixed point (or Nash equilibrium) is the "rock-paper-scissors" equilibrium m ⋆ i = 0, ∀i.

V. COUNTING LIMIT CYCLES
In the previous section, we have established the region of parameter space where exponentially numerous fixed points exist, which might possibly be reached by learning in the slow limit α ≪ 1.However, limit cycles of various lengths turn out to also be exponentially numerous when ε < 1, so we need to discuss them as well before understanding the long term fate of the learning process within our stylized complex world.
A. Cycles without Memory (α = 1) In the memory-less limit, the dynamics becomes that of the extensively studied Hopfield model [86,88,89] where the binary variable represents the activation of a neuron evolving as with parallel updates.Counting limit cycles of length L is even more difficult than counting fixed points (which formally correspond to L = 1).Some progress have been reported by Hwang et al. [61] in the memory-less case α = 1.The notion of fixed point complexity Σ (defined in Eq. ( 21)) can be extended to limit cycle complexity Σ L for limit cycles of length L, with Σ L=1 ≡ Σ.The results of Hwang et al. [61] can be summarized as follows: • When ε < 1, limit cycles with L = 2 have the largest complexity, which is exactly twice of the fixed point complexity: Σ 2 = 2Σ 1 (as was in fact previously shown by Gutfreund et al. [86]).
• Close to ε = 1, the cut-off length L c , beyond which limit cycles become exponentially rare, grow exponentially with N : L c ∼ e aN , where a weakly depends on ε.
From this analysis, one may surmise that: a. When a limit cycle is reached by the dynamics, it is overwhelmingly likely to be of length L = 2 for ε < 1 and of length L = 4 for 1 < ε ≤ 2.
b.Even if exponentially less numerous, exponentially long cycles will dominate when e aN > e N Σ2 , which occurs when These predictions are well obeyed by our numerical data, see Figs.When α < 1 and β = ∞, the update of S i (t) is given by Eq. ( 14), have the same fixed points independently of α, but of course different limit cycles, which may in fact cease to exist when α is small.In this section, we attempt to enumerate the number of cycles of length L in the spirit of the calculation of Hwang et al. [61] for α < 1.As detailed in Appendix B, we write the number of these cycles as a sum over all possible trajectories of a product of δ functions ensuring the α < 1 dynamics of Q i are satisfied between two consecutive time-steps, while a product of Heaviside step functions enforces S i (t) = sign(Q i (t)).Introducing the integral representation of the δ function, averaging over the disorder and taking appropriate changes of variable to decouple the N dimensions, the (annealed) complexity of cycles of length L writes where R(t, s) and K(t, s) are symmetric matrices while V (t, s) is not a priori, and I L is explicitly given in the Appendi, and where we can notably identify −i R(t, s) = C(|t − s|).As a sanity check, one can verify that the L = 1 case, corresponding to the fixed point complexity, is indeed independent of α and is given by the same expression as Eq. ( 22), see Appendix B 1. In a similar vein, one can recover Σ 2 (α = 1) = 2Σ 1 (η).
Numerically solving the saddle point equations for decreasing values of α, it appears that Σ 2 (α, η) → Σ(η) when α → 0 + , see Fig. 10.While numerical difficulties prevent us from exploring very small values of α, it seems clear that the saddle point corresponding to the L = 2 cycles eventually coalesces with the fixed point saddle (which is known to be a sub-dominant saddle point when α = 1, see [61] and the discussion above).In any case, and perhaps surprisingly, there does not appear to be a critical value of α below which fixed points become more abundant than cycles.We therefore expect a progressive crossover and not a sharp transition.

VI. DYNAMICAL MEAN-FIELD THEORY
We have thus seen that both fixed points and limit cycles are exponentially numerous.However, the question remains as to what happens dynamically, as the existence of a large number of fixed points or limit cycles by no means guarantees that these will be reached at long times.
In fact, the number of agents N is expected to play a major role in determining the long term fate of the system.In particular, there are strong indications that the time τ r needed to reach a fixed point or a limit cycle grows itself exponentially with N , at least when α = 1 [89].More precisely, where s is an exponent (possibly dependent on ε) and B(ε) an effective barrier such that B(ε = 0) = 0. Hence one expects that as N grows, fixed points/limit cycles will in fact never be reached, even if they are numerous.This is in fact what happens numerically, see Fig. 11.
What is then to be expected in the limit N → ∞?In order to study the complicated learning dynamics that takes place, we will resort to Dynamical Mean-Field Theory (DMFT).In a nutshell, DMFT allows deterministic or stochastic dynamics in discrete or continuous time of a large number N of interacting degrees of freedom to be rewritten as a one-dimensional stochastic process with self-consistent conditions.While difficult to solve both analytically and numerically due to their self-consistent nature, DMFT equations have proved very effective at describing a very wide range of complex systems -see [90] for a recent review.Note however that such an approach is only valid when N → ∞; as will be clear later, strong finite size effects can appear and change the conclusions obtained using DMFT.
In our case, we write the DMFT for the evolution of the incentives Q i (t), which directly yield m i (t) = tanh(βQ i (t)).In order to do so, we rewrite our online learning process, which depends on the realized S i (t), as an expression solely in terms of m i (t) with additional fluctuations, with ξ i (t) = S i (t) − m i (t) and hence ⟨ξ i (t)⟩ = 0 and ⟨ξ i (t)ξ i (s)⟩ = (1 − (m i (t)) 2 )δ t,s .Now, assuming the central limit theorem holds, the random variables η i become Gaussian for large N with ⟨η i (t)⟩ = 0, ⟨η i (t)η j (s)⟩ = υ(ε)(1 − q(t))δ t,s δ i,j , (28) where q(t) = C(t, t) as defined in Eq. (15).As required, in the noiseless limit β → ∞ limit, one has q(t) = 1 ∀t and the random variables η i are identically zero.
Starting from the N equations where h i (t) is an arbitrary external field that will eventually be set to 0, the DMFT can be derived using path integral techniques or the cavity method, the latter being detailed in Appendix C. Remaining in discrete time to explore the entire range of values of α, one finds, in the N → ∞ limit, with ⟨ϕ(t)⟩ = 0, and The memory kernel G and correlation function C are then to be determined self-consistently, where the averages ⟨. ..⟩ are over the realisations of the random variable ϕ.These discrete time dynamics can first be integrated numerically with an iterative scheme to update both the memory kernel and correlation function until convergence, see [91] or [92].As detailed in the original work of Eissfeller and Opper [93,94], one can also make use of Novikov's theorem to compute the response function with correlations, avoiding the unpleasant task of taking finite differences on noisy trajectories, at the cost of the inversion of the correlation matrix.Note that this inversion will however mean that very long trajectories become difficult to integrate.While we shall see that this numerical resolution can provide precious intuition to understand the role of finite N in the dynamics, a continuous description will be much more convenient to obtain analytical insights.In the α ≪ 1, t ≫ 1 regime, we can rescale the time as t → t/α.Interestingly, doing so requires expanding Q(t + 1) to the second order if one is to keep an explicit dependence on α.The resulting continuous dynamics reads with and the memory kernel and correlation function are similarly defined self-consistently with, we recall, q(t) = C(t, t) = ⟨m 2 (t)⟩.Very importantly, note the rescaling in time introduces a prefactor α in the variance of the ϕ, which stems from the noise in the learning process.Since 1 − q(t) ∼ β −2 for large β, this extra term is of order α/β 2 , as anticipated above.
In the next sections, the DMFT equations will be used to shed light on the dynamical behaviour of the model in the limit N → ∞.

VII. NOISELESS LEARNING
In this section, we use both the DMFT equations and the results on the complexity of fixed points and limit cycles to classify the different dynamical behaviours of the learning process in the noiseless case β → ∞, where the realized and expected decisions are equal, m i (t) = S i (t) = sign(Q i (t)).
A. The Memory-less Limit α = 1 In this case, corresponding to Eq. ( 24), both approaches (DMFT and complexity of limit cycles) seem to agree on the overall picture: as ε increases from 0 to 1, the system transitions from L = 2 cycles to overdamped oscillations and chaos -see Fig. 3.However, upon scrutiny, one realizes that the perfect agreement between DMFT and direct numerical simulations of the dynamics for finite N is only valid in a region where ε is small and N large enough -see Fig. 11.In particular, when 0.5 ≲ ε ≲ 0.8, L = 2 cycles do persist when N is smaller than ∼ 200.For larger N , the lag 2 autocorrelation function C(τ = 2) is noticeably smaller than unity, and well predicted by DMFT as soon as N ≳ 1000.
What happens for ε ≲ 0.5 when N → ∞?The numerical solution of the DMFT equations suggest the following scenario: when ε < ε RM ≈ 0.473, the long time value m ∞ of the correlation with the initial conditions C(0, 2n) at even time steps is strictly positive, hence the subscript "RM" for Remnant Magnetisation.[95] It is only exactly equal to one for ε = 0 (permanent oscillations) and decreases to reach zero when ε → ε RM [94].Below ε RM , we can conclude that the system is not ergodic, which will have important implications on the finite temperature dynamics.For asymmetries greater that ε RM on the other hand, the decorrelation becomes exponential and we enter a bona fide chaotic, ergodic regime.
Although memory-less learning is clearly unrealistic, these results are rather instructive.The system is indeed unable to display aggregate coordination when interactions are mutually independent (chaotic region around ε = 1).Placing ourselves in the socio-economic setting, it seems evident that the number of players vastly exceeds the number of iterations, and the results above indicate that the chaotic region is in fact quite large.Perhaps more importantly, in the absence of learning, agents will see their decisions vary at a high frequency without ever reaching any static steady-state, not only when the game is close to zero sum (ε → 2), but even when it is fully reciprocal (ε → 0).Clearly, this last point underlines the importance of introducing memory to recover realistic learning dynamics.

B. Memory Helps Convergence to Fixed Points
For N not too large, we observe numerically that the fraction of "frozen" agents for which S i (t + 1) = S i (t) quickly tends to 1 as α decreases from 1, as shown in Fig. 12.This is somewhat consistent with intuition, as the learning dynamics average rewards over a period τ α ∼ 1/α, meaning that high frequency cycles observed for α = 1 are expected to be "washed out" when α is sufficiently small.Since fixed points exist in large numbers, it appears natural that they are eventually reached given their abundance at zero temperature.However, as we have shown in the previous section, L = 2 limit cycles are still much more numerous than fixed points for α ≳ 0.5.The fact that C(τ = 1) approaches unity as α is reduced much faster than in Fig. 10(b) suggests that the basin of attraction of fixed points quickly expands, at the expense of L = 2 limit cycles.[96] Our numerical results therefore indicate that for any finite size system which has enough time to reach a steady state, the effect of α is effectively to help the system find fixed points -see Fig. 13.Focusing for example on the points corresponding to N = 128 and α = 0.1, we indeed observe that the 95% quantile includes C(2/α) = 1 even for ε = 1, i.e. fixed points can be reached even in the chaotic regime with modest simulation times, which would be an overwhelmingly improbable scenario in the memory-less case, as illustrated by Figs. 13 (b) and (c).
As the number of agents N increases, we enter the DMFT regime shown as plain lines in Fig. 13.One finds that decreasing the value of α slows down the decorrelation of the system.However, for small α, the evolution becomes a function of ατ only, as suggested by Eq. ( 33) when α → 0: the dynamical slowdown is dominated by the long memory of learning itself.
Fig. 13 shows that when ε ≳ 0.5, sufficiently large systems (described by DMFT) decorrelate with time for all α, and we expect C(τ → ∞) → 0: learning leads to chaos in such cases.
When ε ≲ 0.5, on the other hand, we found that there is ergodicity breaking, in the sense that C(τ → ∞) > 0, as we found above for cycles when α = 1.More precisely, a numerical analysis of the DMFT equations suggests that when α → 0 and ε small, 1−C(τ → ∞) is extremely small but non zero.For example, when ε = 0.4 we found 1 − C(τ → ∞) ≈ 0.005.This is compatible with the numerical results of [94].
In other words, there seems to exist a critical value ε RM (α) separating the ergodic, chaotic phase for ε > ε RM (α) from the non-ergodic, quasi fixed-point behaviour for ε < ε RM (α).However, our numerical results are not precise enough to ascertain the dependence of ε RM on α, which seems to hover around the value 0.473 found for α = 1.More work on this specific point would be needed to understand such a weak dependence on the memory length.
The precise dynamical behavior of the autocorrelation function C(τ ) can be ascertained in the continuous limit α → 0 when ε = 1.Indeed, the influence of the memory kernel vanishes in this case where interactions are exactly non-symmetric, leaving us with where we emphasize that the time variable has been rescaled as t → αt.From there, the classical solution method proposed by Crisanti & Sompolinsky [97,98] can be straightforwordly adapted with a small modification due to our parametrization of the interaction matrix that scales the variance of the entries by a factor 1/2 for ε = 1, see Appendix D. The two-point autocorrelation function is found to be given by where ∆(τ ) = ⟨Q(t + τ )Q(t)⟩ follows the second-order ordinary differential equation with ∆(0) = 1 − 2 π [99].Very quickly, this means that the autocorrelation decays exponentially, C(τ ) ∝ e Both the full solution, obtained by integrating the ODE numerically, as well as this exponential decay, are shown in Fig. 14, displaying a very satisfactory match with numerical simulations.

C. Anomalous Stretching of Cycles
Having a better understanding of how the memory allows the system to find fixed points when they exist, a central question is what will happen if the memory loss rate is reduced in the region of parameter space where there are only limit cycles.As previously stated, averaging over a period τ α ∼ 1/α clearly suggests that the occurrence of short cycles (starting at L = 4 for α = 1) should gradually vanish.
Naively, one might expect a simple rescaling in time t → t/α, yielding cycles -when they exist -of period inversely proportional to α itself.Looking at the numerical results from both the finite size game and the DMFT integrated numerically in Fig. 15 (a), it quickly appears that such a simple rescaling in time does not provide the correct description.Indeed, the period of cycles is observed to be proportional to 1/ √ α, i.e. much shorter than 1/α -see Figs. 15 (b) and (c).
One important aspect to note is that there is some decorrelation, as the second peak of C(τ ) does not quite reach unity (in Fig. 15 (b)), meaning that we may see quasi-cycles and not exact limit cycles, complicating the analytical description of the phenomenon.Just as true fixed points in the N → ∞ limit only exist only ε = 0, it appears that only the case ε = 2 does display true limit cycles.
Another subtle point to consider is similar to the ε < 1 cases discussed, we expect the time taken to reach these cycles will depend on the system size and the relative distance to the chaotic region.This is confirmed by the DMFT solved for fixed trajectory times for ε = 1.5 (light crosses), which progressively departs from the ω 0 ∼ √ α regime around α = 0.1.
To understand how such non-trivial stretching occurs, we go back to the continuous DMFT equation, While the presence of the second order derivative Q(t) appears natural to recover limit cycles, it should be noted that this term, being pre-factored by α, is superficially subdominant relative to the dissipation represented by Q(t).While we have seen that there is some decorrelation, the fact that robust oscillations are present therefore suggests that the complicated self-consistent forcing terms almost exactly compensate dissipation over a period, allowing the system to periodically revisit quasiidentical configurations.In fact, the shape of these oscillations is far from sinusoidal, but rather of see-saw type, see Fig. 15 (b).This suggests that in the limit α → 0, Q(t) diverges each time Ċ(τ ) changes sign, such that α 2 Q(t) cannot be neglected and therefore sets the relevant time scale to α −1/2 .We have however not been able to perform a more precise singular perturbation analysis of this phenomenon.Going beyond this rather loose argument, and precisely characterizing such see-saw patterns appears very challenging and is left for future work.A possible approach would be to first take the ε = 2 case where true cycles should exist, and to assume the correlation function is an exact triangular wave of frequency ω as suggested by Fig. 15 (c).As a result, m(s) = sign(Q(s)) is an exact square wave, and the convolution with G can be written as a product in Fourier space.Enforcing the dissipation over a period to be zero, one could then perhaps find a closed equation for Q and ω if appropriate ansätze for the response and forcing functions are taken.

VIII. NOISY LEARNING
While we have shown that the β → ∞ deterministic limit can be relatively well understood with the analytical tools at our disposal, one of the key features of our model is the uncertainty in the decision occurring for boundedly rational agents.Besides, it is also in this situation that the online learning dynamics differ significantly from the more widely studied offline learning where the entire model can be understood in terms of deterministic mixed strategies parameterized by the coefficients m i (t) (compare Eqs. ( 5) and ( 7)).
When α is close to unity and β becomes small, the fluctuations are too large for coordination to occur.Taking for instance α = 1, it is indeed clear that the iteration will have extremely large fluctuation in the argument on the right hand side.As a result, we expect to lose the sharp transition as a function of T that can be observed for the NMFE (see Fig. 16).The order parameter q instead continuously tends to 0 with T , regardless of the asymmetry ε.This regime is shown in Fig. 16 (a), representing the heat map of q for α = 0.5.Clearly, the linear stability analysis of the paramagnetic fixed point presented in Sec.IV cannot hold when the thermal fluctuations are not averaged on large periods of time.To find a richer phenomenology, we will therefore focus on the α ≪ 1 regime where more complex dynamics can be observed.

A. Fixed Points for Intentions
In Sec.IV, we studied the fixed points of the NMFE that the game may reach if the fluctuations from imperfect learning can be neglected, i.e. if α → 0. Now, we have seen that the DMFT equations proved effective in the zero "temperature" limit, and importantly established that true fixed points or two-cycles are only reached in finite time in the fully reciprocal case ε = 0 when the system size diverges (see Fig. 11).The same equations can also be used to revisit this finite β regime.
Going back to Eq. ( 33) and neglecting the term in √ α from the correlation function as we did in the static setup, fixed point solutions require where z is now a static white noise of unit variance, q is simply the now constant autocorrelation and χ is the integrated response function that we assume to be timetranslation invariant, The averages on the effective process can now be taken on z to self-consistently solve for q and χ (see e.g.[55] for a more detailed description).The resulting set of equations are then to be solved simultaneously with where m(z) is the solution to Although our model is entirely built on a dynamical evolution equation, and not on a notion of thermal equilibrium, this set of self-consistent equations coincides with the replica-symmetric solution of the NMFE model found by Bray, Sompolinsky and Yu [78] for ε = 0. Since replica symmetry is broken in the whole low temperature phase of the NMFE model, we expect that these static solutions of the DMFT cannot correctly describe the long time limit of the dynamics, as we now show.The numerical solutions for the DMFT fixed point equations are shown in Fig. 17 (a) and compared to numerical results of the game for small α and for ε = 0.1 (similar results, not shown, are obtained for other values of ε ≲ 0.8).[100] We find that the long time behaviour of q for the direct simulation of the SK-game (circles and squares) and for long time dynamical solution of the DMFT equations match very well, but differ from the value of q inferred from the set of self-consistent equations established above.This is expected since with such solution the order parameter q approaches unity exponentially fast as T → 0, whereas the fact that the probability of small local fields (i.e.rewards in the game analogy) vanish linearly (see Fig. 4) suggest that q = 1 − κT 2 , as for the full RSB solution of [78] but with presumably a different value of κ.
To ascertain the range over which this non-trivial mean-field solution should be valid, we can study the stability of the DMFT fixed point close to the critical temperature 1/β c , following the procedure first detailed in [54].Considering a random perturbation to the fixed point ϵξ(t), with ξ(t) a Gaussian white noise and ϵ ≪ 1, we study the perturbed solution with Q 0 the fixed point given in Eq. (40), where the noise is no longer static but similarly given by ϕ(t) = √ qυ(ε)z+ ϵϕ 1 (t).Replacing in the DMFT continuous dynamics for α → 0 + and collecting terms of order ϵ, we find that the perturbation evolves as where we have used sech 2 (βQ 0 ) = 1 − m 2 (z) from Eq. ( 44), giving in Fourier space where we have again assumed that the memory kernel is time-translation invariant.Now, in the limit βQ 1 ≪ 1, i.e. close to the critical temperature, one can we linearize the hyperbolic tangent tanh(βQ 1 (t)) and write a closed equation for the spectral density of Q 1 at order As a result, we have the criterion for the onset of instability for ω = 0: where we noticed Ĝ(ω = 0) = χ, given, close to β c , by Taking ε = 0, we recover the criterion found by Bray et al. [78] for the critical temperature, giving T c = 1/β c = 2 in their case.
For non-zero ε, we can also find the critical temperature by replacing Eq. ( 50) in (49), to recover yet again the critical value given by Eq. (20).The invalidity of the fixed point solution is illustrated in Fig. 17 (b), where the spectral density evaluated at ω = 0 can be observed to become negative for T < T c (ε).While the linearization used to obtain the relation is expected to become invalid, we understand this negativity as a strong sign that the solution is unstable and thus likely invalid throughout the low "temperature" phase, consistently with the discrepancy observed between the static prediction and the direct simulations and the dynamic solution of the DMFT show in Fig. 17 (a).In this finite β regime, we therefore effectively only have what we previously referred to as quasi-fixed points at best (including when ε = 0), even when neglecting the vanishing fluctuations caused by the online learning dynamics as α → 0.

B. Aging
The inadequacy of the static solution of the DMFT equations to describe the long term dynamics of the system is a well known symptom associated with the "aging" phenomenon [101], i.e. the fact that equilibrium is never reached and all correlation functions depend on the "age" of the system [102]:  42)-( 44)) for ε = 0.1.(b) Spectral density of a small perturbation Q1 to the fixed point solution of the DMFT close to the critical temperature.As the quantity is necessarily positive for a valid solution, the grey region corresponds to instability.
where t w is the waiting time, or age of the system, t w = 0 corresponding to a random initial condition.
Aging typically arises in complex systems in the low temperature (low noise) limit.Pictorially, the energy landscape of such systems (like spin-glasses) are highly non convex and "rugged", with a very large number of local minima or quasi-stable saddle points in which the dynamics gets stuck for extended periods of time [102,103].The consequence is then that the time required to exit a local minimum is a function of the age of the system, i.e. the time taken for the system to reach this configuration.Intuitively, the deeper in the energy landscape the solution is, the longer it will take for a sufficiently large random fluctuation to occur and allow the system to resume its exploration of the landscape.Such aging phenomena are known to occur in a wide range of complex systems with reciprocal interactions, such as glassy systems, populations dynamics [104] or neural networks that are described by very similar mean-field dynamics [105].Aging dynamics was also recently found in a "habit formation" model, see [12].
Not surprisingly in view of its similarity with usual spin-glasses, the SK-game displays aging for reciprocal interactions (ε = 0) and sufficiently low temperatures β > β c , see Fig. 18 (a), for which Eq. ( 51) accurately describes the data with the initial relaxation component C relax (t) well fitted by a power law t −x and corresponding to the aging behaviour found in a wide range of glassy models [102,[106][107][108][109][110][111].Interestingly, this is not the behavior observed in the "physical" SK model [112,113], which is typically found to display "subaging", although the precise scaling of the aging correlation function remains unclear.It is however important to emphasize that the learning dynamics of the SK-game is markedly different from the physical dynamics of the original SK model, so there is a priori no reason to expect their complicated out-of-equilibrium relaxation to be directly comparable.
What happens in our model when ε > 0? It is known from previous work (in somewhat different contexts), that aging is interrupted at long times in the presence of non reciprocal interactions, but survives for finite times provided the asymmetry is not to large [114][115][116].If the asymmetry strength is further increased, we expect the amount of mixing in the system to eventually be large enough for the dynamics to no longer get stuck [117].From our numerical simulations, shown in Fig. 18 (b)-(d), it appears that aging (as described by Eq. ( 51)) still holds when ε < ε ⋆ , but that the dynamics becomes time translation invariant when ε > ε ⋆ .It is tempting to conjecture that aging disappears exactly when the dynamics becomes ergodic, i.e. when the correlation with a random initial condition decays to zero.This suggests that ε ⋆ = ε RM , which is roughly in line with our numerical data.Note however that the transition between aging dynamics and time translation invariant correlations seems to occur somewhat progressively, hence it may well be that we in fact observe interrupted aging beyond a time that decreases not only with ε but also with temperature 1/β.More analytical work is needed to clarify the situation.
A particularity of the aging dynamics of the SK-game is related to its online learning dynamics.As clearly visible in both the DMFT equation and the derivation of the Naive Mean-Field equation detailed in Appendix A, there will inevitably be some decorrelation in time of the expected decisions if α(1 − q) becomes significant.We therefore naturally expect the region where time translation invariance breaks down to be dependent on all three parameters α, β and ε.
The interpretation of aging in the socio-economic context is quite interesting and has been discussed in Sec.III C. In a nutshell, it means that as time goes on, agents get stuck in locally satisficing strategies for longer and longer, but after a time proportional to the total time the game has already been played, the system eventually evolves and individual strategies m i reach an altogether different configuration.This process goes on forever, but becomes slower and slower with time: the notion of quasiequilibrium therefore makes sense at long times, for small enough noise and small enough non-reciprocity.

C. Chaos and (quasi-)Limit cycles
When the non-reciprocity of interactions is sufficiently small and quasi-fixed points exist, we have established that boundedly rational systems displays complicated ag- ing dynamics when learning noise, parameterised by the value of β, is present.The immediate question is now how such noise influences the complex dynamics, chaos and (quasi) limit cycles that we have found in the β → ∞, α ≪ 1 regime (see Sec. VII).
To qualitatively illustrate the effect of non-zero noise, we have run simulations for α = 0.1 and different values of ε and β.Fig. 19 displays individual trajectories of the intentions m i (t), as well as the distribution of the values of individual m i over all agents, realizations and time-steps.We also show the auto-correlation function C(τ ) (assumed to be time-translation invariant over the short time scales considered), which is also compared to the DMFT solved numerically.Note that for the smallest value ε = 0.1, the trajectories illustrate the previously discussed quasi-fixed points emerging from the online dynamics.Clearly, while the correlation function remains close to constant, individual intentions are not exactly frozen (notice the wiggles in the top row of Fig. 19, specially for β = 2), explaining how the system as a whole eventually decorrelates and displays aging, as discussed in the previous section.
In the chaotic regime around ε = 1, it is clear that decreasing β (increasing noise) spreads the distribution of the m i , which is less and less concentrated around ±1 as an immediate consequence of the smoothed out hyperbolic tangent.As a result, the equal-time autocorrelation C(0) = q naturally decreases when β decreases.It is furthermore interesting to note that its value significantly decreases as the asymmetry parameter ε increases.We expect a decrease in α to have a similar role, as suggested by the phase diagrams presented in Fig. 16 (b) and (c).
Dynamically, the decay of the autocorrelation in this chaotic regime appears to be more or less independent of the strength of the noise, which can be seen by comparing the insets of the third row of Fig. 19.While it is known that external noise kills deterministic chaos in neural net-works with uncorrelated couplings [118], what is interesting in our case is that both the non-linearity of the hyperbolic tangent (governed by β), and the strength of the effective noise (which scales as 1 − q, see Eq. ( 34)) are varied simultaneously and, in a sense, self-consistently.Determining the way the decorrelation rate evolves with both β and ε is therefore quite non trivial and would be a very interesting endeavor.
Where we previously had limit cycles, for ε = 1.5 for instance (last row of Fig. 19), it appears that oscillations survive for large values of β.Note however that in this case the value q = C(0) appears to very quickly vanish when β decreases.This is consistent with the linear stability analysis of the paramagnetic solution that should be valid for vanishingly small α.As visible in Fig. 16 (c) we indeed expect that the region in which the system displays any form of aggregate coordination becomes increasingly narrow as ε gets closer to its maximum value of 2. Precisely for ε = 2, the system likely becomes fully disordered (q = 0) for any finite values of β when α → 0. For small but finite values of α as those presented here, this is not quite the case however, and large asymmetries ε > 2 − ε c do give rise to clear oscillations, both in individual trajectories and the correlation function (for instance Fig. 19, bottom row shows for ε = 1.5 that some oscillations can be somewhat sustained).

D. Role of the Noise: Recap
In the presence of noise, the "conviction" of agents naturally goes down, in the sense that individual m i 's become smaller in absolute value, i.e. less polarized around ±1.For small α (long memory time), there exists a well defined transition line in the plane ε (asymmetry), 1/β (amplitude of the noise) above which agents start playing randomly at each round (i.e.m i = 0), but below which some instantaneous propensity to overplay +1 or −1 appears.The time evolution of this propensity depends on the value of ε: for small asymmetries, the fixed points that are reached in the absence of noise become quasifixed points around which the system settles for longer and longer periods of time, after eventually moving on to a completely different configuration (aging).For ε ∼ 1 (uncorrelated influence from i to j and from j to i), deterministic chaos when noise is absent becomes noisy chaos, with not much changes.In the highly competitive region ε → 2, periodic cycles progressively become over-damped, as expected since noise does not allow synchronisation to survive at long times.In terms of individual rewards, not surprisingly, noise tends to be detrimental, except in the reciprocal region (ε small) where weak noise actually helps agents coordinating around mutually beneficial actions.
While the precise characterization of this very rich ecology of dynamical behaviors is left for future work, we emphasize that the decorrelation induced by the fact that incentives Q i are themselves random variables is a key difference between the online learning presented here and their offline counterpart that are often considered [14,46].While the phenomenology between these two types of learning is somewhat similar for α ≪ 1, β ≫ 1, there are clearly key differences whenever the ratio √ α/β ceases to be negligible.

A. Blindsided by Complexity
Let us summarize our main conceptual assertions.As a schematic model of the complexity economic agents are confronted with, we introduced the "SK-game", a discrete time binary choice model with N interacting agents and three parameters: α (memory loss rate), β (inverse amplitude of noise in the learning process or intensity of choice) and ε (non reciprocity of interactions).
We have shown that even in a completely static environment where the pay-off matrix does not evolve, agents are unable to learn collectively optimal strategies.This is either because the learning process gets trapped by a suboptimal fixed point (or remains around one for very long times), or because learning never converges and leads to a never ending (chaotic or quasi-periodic) evolution of agents intentions.
Hence, contrarily to the hope that learning might save the "rational expectation" framework [9], which still holds the upper hand in macroeconomics textbooks, we argue that complex situations are generically unlearnable.
Agents, therefore, must do with satisficing solutions, as argued long ago by H. Simon [1], an idea embodied by our model in a concrete and tangible way.
Only a centralized, omniscient agent may be able to ascribe an optimal strategy to all agents -which incidentally raises the question of trust: would agents even agree to follow the central planner advice?Would they even believe in her ability to solve complex problems, knowing that their solution sensitively depends on all parameters of the model?If a finite fraction of all agents fail to comply, the resulting average reward will drop precipitously below the optimal value and not be much better than the result obtained through individual learning.
As general ideas of interest in a socio-economic context, we have established that 1. long memory of past rewards is beneficial to learning whereas over-reaction to recent past is detrimental; 2. increased competition generically destabilizes fixed points and leads first to chaos and, in the high competition limit, to quasi-cycles; 3. some amount of noise in the learning process, quite paradoxically, allows the system to reach better collective decisions, in the sense that the average reward is increased; 4. non-ergodic behaviour spontaneously appear (in the form of "aging") in a large swath of parameter space, when α, 1/β and ε are small.
On the positive side, we have shown that learning is far from useless: instead of getting stuck among one of the most numerous fixed points with low average reward, the learning process does allow the system to coordinate around satisficing solutions with rather high (but not optimal) average reward.Numerically, the average reward at the end of the learning process is, for ε = 0, ≈ 8% below the optimal value, when the majority of fixed points lead to a much worse average reward ≈ 33% below the optimal value [60].

B. Technical Results & Conjectures
From a statistical mechanics perspective, our model is next of kin to, but different from several well studied models; a synthesis of our original results can be found in Figs. 1, 2, 9, 16.
For example, when α = 1, β → ∞, the dynamics is equivalent to a Hopfield model of learning with nonsymmetric Gaussian synaptic couplings, for which many results are known, in particular on the number of fixed points and L-cycles.Introducing some memory with α < 1, we found that previously dynamically unattainable fixed points become typical solutions, replacing the short limit cycles in which the α = 1 parallel dynamics get stuck.We also showed how the number of L-cycles can be calculated for all values of α < 1.
The chaotic region that is known to exist when interactions are mostly non-symmetric (ε ≈ 1) also appears to be reduced by memory.When couplings are mostly non-reciprocal ε ≲ 2, periodic oscillations survive but we found that decreasing α non-trivially increases the cycle length, as α −1/2 .When β is finite, the fixed point solutions to the dynamics correspond to the so-called Naive Mean-field Equation of spin glasses, another model that has been studied in detail [78].One knows in particular that such solutions become exponentially abundant for small enough noise 1/β and for ε = 0, a result that we have extended to all ε < 1.When ε > 1, on the other hand, the only fixed point (or Nash equilibrium) is m i = 0, ∀i, i.e. completely random decisions at each time step.
The Dynamical Mean-Field Theory (DMFT) is a tool of choice for investigating the dynamics of the model when N → ∞.DMFT is however frustratingly difficult to exploit analytically in the general case, so we are left with numerical solutions of our DMFT equations that accurately match direct numerical simulations of the model when N is large, but fails to capture some specific features arising when N is small.From our numerical results, we conjectured that quasi-fixed points (and correspondingly, aging dynamics) persist for small noise and when ε ≤ ε ⋆ , where the value of ε ⋆ is difficult to ascertain but could be as high as ε RM = 0.47, perhaps related to the remnant magnetisation transition found in [94].
For β = ∞, the long-time, zero temperature autocorrelation C(∞) appears to drop extremely slowly with ε, but we have not been able to get an analytical result.DMFT equations also clearly lead to the anomalous α −1/2 stretching of the cycles mentioned above, but an analytic solution again eluded us.
One of the reasons analytical progress with DMFT is difficult is presumably related to the phenomenon of "Replica Symmetry Breaking" and its avatar in the present context of dynamical learning.Indeed, any attempt to expand around a static solution of the DMFT equations leads to inconsistencies in the interesting situation β > β c (ε) when decisions are not purely random, see Fig. 17(b).In fact, as shown in Fig. 17(a), the value of the order parameter q predicted by such static solutions is substantially off the value found from the long time, numerical solution of the DMFT equations, which itself coincides with direct simulations of the SK-game.En passant, we noticed that the value of q seems to be given by a universal function of β c (ε)/β, independently of the value of ε.Again, we have not been able to understand why this should be the case.
Finally, we have numerically established several interesting results concerning average and individual rewards, that would deserve further investigations.For example, the average reward seems to converge towards its asymptotic value as N −2/3 , exactly as for the SK model, although, as already noted above, this asymptotic value is ≈ 8% below the optimal SK value.Is is possible to characterize more precisely the ensemble of configurations reached after learning in the long memory limit α → 0? Can one, in particular, understand analytically the distribution of individual rewards shown in Fig. 4 and the corresponding asymptotic value of the average reward, as well as its non monotonic behaviour as a function of the noise parameter β?These are, in our opinion, quite interesting theoretical questions left for future investigation.

C. Extensions and Final Remarks
Many extensions of the very simple framework presented here can be imagined for the model to be more representative of real socio-economic systems.For example, by analogy with spin-glasses, going beyond the fully connected interactions and towards a more realistic network structure should not change the overall phenomenology although some subtle differences may show up.While analytical predictions become even more challenging, recent works on dynamical mean-field theories with finite connectivity, so far developed for Lotka-Volterra type systems, could perhaps be adapted to the learning dynamics of our model.
Allowing the interaction network to evolve with time would of course also make the model more realistic, as in [13]; in this case one would have to distinguish the case where the learning time is much longer or much shorter than the reshuffling time of the network.
Other interesting additions to Ising games could also include the introduction of self-excitation [119] or of alternative decision rules that might be less statistical mechanics-friendly [36].Extension to multinary decisions, beyond the binary case considered here, as well as higher-order interactions (i.e. a "p-spin game"), would obviously be interesting as well, specially as higher order interactions are known to change the phenomenology of the SK model (see e.g.[102]).In particular, we expect that in the p-spin case with p ≥ 3 a much larger gap would develop between the optimal average reward and the one reached by learning.
Finally, whereas temperature in physics is the same for all spins, there is no reason to believe that the amount of noise β or the memory span α should be the same for all agents.Introducing such heterogeneities might be worth exploring, as some agents with longer memory may fare systematically better than others, like in Minority Games, see [34].
Beyond the socio-economic context that was our initial motivation for its design, we believe that the simplicity and generality of our model makes it a suitable candidate to describe a much wider range of complex systems.In the context of biological neural networks, the parameter α indeed allows one to interpolate between simple discrete-time Hopfield network [88], and continuous-time models where Q i is an activation variable for the firing rates m i [120][121][122][123][124][125].Although in our case the influence of β introduces some perhaps unwanted stochasticity, these fluctuations can in principle be suppressed (at least partially) with sufficiently small α.The memory loss parameter could also represent an interesting way to tune the effective slowing down of the dynamics caused by symmetry and described in [105].Here, the description of real neural networks would likely require much more sparse interactions, but also perhaps the introduction of some dedicated dynamics for the interactions themselves, see e.g.[126] for recent ideas.
Closer to our original motivation, more applied socioeconomic problems might benefit from the introduction of this type of reinforcement learning.In macroeconomics for instance, some form of "habit formation" could perhaps be relevant to extend existing descriptions of input/output networks [13,64], where client/supplier relationships are probably strongly affected by history (on this point, see Kirman's classic study on Marseilles' fish market [127], see also [29]).Finally, while it is an aspect of our model we have not investigated here, previous works have reported that similar dynamics yield interesting volatility clusters and heavy tails, corresponding to sudden changes of quasi-fixed points.Such effects might be relevant to describe financial time series [14,51].
In the context of financial markets, our model could also be used to challenge the idea that efficiency is reached by evolution.Indeed, a theory that has been proposed to explain empirical observations going against the Efficient Market Hypothesis is the so-called "Adaptive Markets Hypothesis", stating that inefficiencies stem from the (transient) evolution of a market towards true efficiency [128,129].However, reinterpreting the Q i in our model as some form of fitness measure, it appears unlikely that simple evolutionary dynamics (akin to the learning dynamics) could overcome the type of radical complexity discussed here.As a matter of fact, such an evolutionary twist to the "SK-game" could also be used to conjecture that simple Darwinian dynamics is unlikely to lead to a global optimum in a complex biological setting [130].
Last but not least, we believe that the learning dynamics presented here may be useful from a purely algorithmic point of view in the study of spin-glasses and socalled TAP states.Indeed, in the β → ∞ limit, we have seen that our iteration relatively frequently finds fixed points in regions where their abundance is known to be sub-exponential (close to ε = 1 in particular), and this even for relatively large values of α.Interestingly, similar exponentially weighted moving averages have been employed in past numerical studies of TAP states for symmetric interactions [131,132], but on the magnetizations m i themselves and not on the local fields Q i like is the case above.
Using an offline version of our learning procedure could then be of use to effectively converge to fixed points of the TAP equations or Naive Mean-Field Equations, and study their properties.Perhaps even more interestingly, the online dynamics and the resulting fluctuations of the m i themselves could prove to be extremely valuable to probe hardly accessible regions of the solution space.In some sense, the fluctuations related to finite values of α and β could allow to define "meta-TAP" states, in the sense of closely related TAP states mutually accessible thanks to such extra fluctuations, in the same spirit as standard Langevin dynamics in an energy landscape.
Finally, as mentioned in Sec.III G, in the zero temperature limit and for ε = 1, it has recently been reported that for a certain range of self-interaction strengths J d , there appears an exponential number of accessible solutions to the TAP equations that are seemingly not reachable with standard Hopfield dynamics [76].Preliminary numerical experiments seem to suggest that our learning dynamics find such fixed points, as suggested by their effectiveness for ε = 1 without any form of self-interaction.Beyond existing interest around neural networks, such a self-reinforcement, "habit formation" term could also be, as stated above, interesting to study from the socioeconomic perspective [10,12].Now, it is immediately apparent that I L is a function of the product S1S2 = ±1 rather than S1 and S2.As a result, the complexity is given by Σ2(α, η) = saddle with Q * 1 = [Q1 Γ2] ⊺ and similarly for Q * 2 , and We now take the particular case α = 1.In this case, we expect two-cycles with a zero overlap in between consecutive steps, meaning i R = 0.As a result, the matrix C is the identity matrix, and thus Ψ2(Γ1, Γ2; C) = Φ (η( V21 + S V2 ) Φ (η( V12 + S V1) ) . (B48) In order for i R = 0 to satisfy Eq. (B41), we then require i K = 0 and V1 = V2 = 0 such that ∑ S=±1 Se i K Ψ2(Γ1, Γ2; C) = 0. Given recovering the known solution of [61,86].

1 FIG. 1 .
FIG. 1. Qualitative phase diagram in the (ε, α) plane for the SK-game in the noiseless (left) and weak noise (right) regimes.FP and (Q) refer to "fixed point" and "quasi" respectively (mi(t) = 0 ∀t)."Random" refers to the case where agents pick ±1 with equal probability, whereas "chaos" means that at a given instant of time players have well defined intentions mi but these evolve chaotically with time.

FIG. 4 .
FIG. 4. Distribution of individual rewards at β → ∞ for N = 512 and 128 initial conditions and realizations of the disorder in the fully symmetric case ε = 0 for α = {0.5, 0.1, 0.01} (dark to light coloring).The dashed line is the Sommers-Dupont analytical solution to the SK model [69].Inset: associated survival function in a lin-log scale and focusing on the right tail, dotted line corresponding to a Gaussian fit.

FIG. 7 .
FIG. 7. Overlap between solutions for different initial conditions and identical draws of interactions, for α = 0.01, β → ∞, ε = 0. (a) Distribution of overlaps shifted by the mean and rescaled with the system size N to the power 2/3.(b) Average overlap as a function of system size in log-log coordinates, with the best regression line N −0.85 .Error-bars show the 95% confidence interval over 16 different draws of the disorder.

2 FIG. 9 .
FIG. 9. (a) Annealed complexity of the Naive Mean-Field Equation in (T, ε) space, where we recall T = β −1 , measured numerically for N = 40.The dashed line represents the critical temperature Tc(ε) for which the paramagnetic fixed point ceases to be stable, while the continuous line indicates ε = 1 inferred from the β → ∞ result.(b) Annealed complexity as a function of ε for varying temperatures T < 1/ √ 2, i.e. in the bottom region of (a) where the complexity vanishes at ε = 1.The continuous line represents the β → ∞ analytical solution, recovering the result of Tanaka and Edwards [87] Σ ≈ 0.1992 for ε = 0.For ε > 1 and T = 0, the only possible fixed point is m ⋆ i = 0, ∀i.

3 , 11 .
Note however the strong finite N effects that show up in the latter figure, which we discuss in the next sections.B.Cycles with Memory (α < 1)

FIG. 10 .
FIG. 10.(a) L = 2 cycle complexity from the numerical resolution of the saddle point equations for ε = {0, 0.2, 0.4, 0.6} from dark purple to light green, dashed lines indicating the fixed point complexity associated to each parameter.(b) Order parameter at the L = 2 saddle, showing the nontrivial coalescence of the cycle and fixed point saddle point as α is decreased.The numerical resolution appears to breakdown when we get close to α = 0.

FIG. 11 .
FIG.11.Steady-state two-point correlation between configurations shifted by τ = 2 time-steps in the α = 1, β → ∞ limit from finite N numerical simulations averaged over 128 samples of disorder and initial conditions, error bars showing 95% confidence intervals.The N = {64, 128} simulations are run for t = 10 8 time-steps to illustrate taking the t → ∞ limit before N → ∞, whereas N = {256, 512, 1024} have been simulated for t = 5 × 10 6 time-steps to recover the N → ∞ before t → ∞ regime.The continuous line represents the N → ∞ DMFT solution integrated numerically, while the vertical dotted line corresponds to the critical value εc found by Hwang et al.[61].

FIG. 13 .
FIG. 13.Influence of finite size N , non-reciprocity ε and memory span α on the learning dynamics.(a) Steady-state two-point correlation function shifted by τ = 2/α in the β → ∞ limit for different memory loss rates α, from light green to black (color map on the right axis).Symbols correspond to direct simulations and plain lines to the solutions of the DMFT equations, while the dashed line is the solution to the DMFT equations for α → 0. The N = 128 simulations are initialized with t0 = 10 8 time-steps, whereas N = {256, 512} have been simulated for t0 = 10 6 iterations before taking measurements.Results are averaged over 32 samples, with error-bars showing 95% confidence intervals.(b) and (c) Sample trajectories for all N = 128 sites for ε = 1 and α = {1.0,0.1} respectively, clearly reaching chaotic and fixed-point steady states.

FIG. 14 .
FIG.14.Evolution of the time shifted autocorrelation function in the non-symmetric case ε = 1, β → ∞, for different system sizes and memory loss parameters averaged over 20 realizations.Left: lin-lin scale, errobars showing 95% confidence intervals, continuous lines representing the numerically integrated full DMFT equations (Eq.(36)).Right: lin-log scale and rescaling of the time shift by α such that points collapse onto a single curve (errobars not shown), black continuous line representing the analytical solution found by solving the Sompolinsky & Crisanti ODE Eq.(38), dashed line representing a pure exponential decay with characteristic time τ1.

1 αFIG. 15 .
FIG. 15.Evolution of the oscillation frequency ω0 with the memory loss rate α for β → ∞, ε > 2 − εc averaged over 96 samples of disorder and initial conditions, errorbars showing 95% confidence interval.(a) Log-log plot highlighting the near square root dependency ω0 ∼ √ α the dashed line correponding to an exponent 1/2.(b) Two-point autocorrelation for ε = 1.8, N = 256 as a function of the rescaled time lag for different values of α, confirming the suitability of the scaling in particular at short time scales.(c) Power spectrum of the autocorrelation for the same parameters, displaying the maximum at ω0 and secondary peaks at odd multiples of this fundamental frequency as expected from the triangular aspect of the autocorrelation.

FIG. 17 .
FIG. 17.(a) Order parameter q = C(t, t) averaged in time in the (quasi) stationary regime vs. rescaled temperature for ε = 0.1.Circular and diamond markers correspond direct, finite size simulations at N = 256 for α = 0.01 and α = 0.1 respectively, whereas crosses represent the (dynamical) numerical solution to the complete set of N → ∞ DMFT equations for α = 0.1.Continuous lines show the solution to the static DMFT fixed point equations (Eq.(42)-(44)) for ε = 0.1.(b) Spectral density of a small perturbation Q1 to the fixed point solution of the DMFT close to the critical temperature.As the quantity is necessarily positive for a valid solution, the grey region corresponds to instability.

FIG. 19 .
FIG. 19.Finite β trajectories obtained from numerical simulations at α = 0.1, N = 256, t0 = 10 8 averaged over 96 samples of disorder and initial conditions for (a) β = 4, (b) β = 2 and from top to bottom ε = {0.1,0.85, 1.05, 1.5}.Each line displays (from left to right) the evolution of 16 randomly selected agents, the associated histogram of mi over both agents, time and realizations and the autocorrelation function assumed to be time translation invariant on short time scales.Third row: insets representing the evolution of the normalized autocorrelation C(τ )/C(0) with a logarithmic vertical scale.