Quantum adaptive agents with efficient long-term memories

Central to the success of adaptive systems is their ability to interpret signals from their environment and respond accordingly -- they act as agents interacting with their surroundings. Such agents typically perform better when able to execute increasingly complex strategies. This comes with a cost: the more information the agent must recall from its past experiences, the more memory it will need. Here we investigate the power of agents capable of quantum information processing. We uncover the most general form a quantum agent need adopt to maximise memory compression advantages, and provide a systematic means of encoding their memory states. We show these encodings can exhibit extremely favourable scaling advantages relative to memory-minimal classical agents, particularly when information must be retained about events increasingly far into the past.


I. INTRODUCTION
The world is awash with complex, interacting systems. Predators chasing prey, investors trading stocks, grandmasters playing chess: all share in common that they process information from their environment and act in response, with an eye to achieving some desired outcome. They can be described as adaptive agents [1][2][3][4][5], systems that receive input stimuli and respond with output actions. This framework can be applied to a plethora of problems, including financial markets [6,7], biofilm formation [8], and HIV spread [9].
To be effective, an agent must typically adapt its future behaviour based on past experiences. A rudimentary chatbot, for example, would base its response purely on the last phrase it heard -- often resulting in wildly out-of-context output. A more sophisticated design would instead extract context from the conversational history -- both what it has heard, and what it has said. Tracking this contextual data requires a memory, and a policy for deciding on what action to take based on the current stimulus and this memory. For agents performing elaborate tasks, effective strategies often require copious information about past data [10]; tools that reduce the amount of information agents must retain can thus provide a valuable competitive advantage.
To what extent can agents benefit from quantum technologies? Proof-of-principle quantum agents have demonstrated memory compression beyond classical bounds [11], yet do not make use of the full gamut of possible quantum effects. Here we identify the features of vastly improved quantum adaptive agents that use

FIG. 1. Agents and their quantum realisations. (a)
We consider agents that alternately receive input stimuli and perform output actions. To execute complex behaviour, an agent requires a memory to keep track of relevant information about past events (both stimuli and actions), and a strategy for deciding on future actions based on this information together with the current stimulus. (b) A quantum circuit implementing a quantum agent that encompasses all memory-minimal agents (see Theorem 1). At each timestep it interacts with an input stimulus encoded in |x_t⟩, and some blank tape. After the interaction, measurement of the output tape delivers the appropriate action y_t. In general an agent must also dispose of additional redundant information, requiring junk tape that is discarded into the environment. This process can be repeated to execute the desired strategic behaviour ad infinitum.
less memory -- and provide a systematic procedure for their design -- using insights from quantum stochastic modelling [12][13][14][15][16][17][18][19][20]. The resulting agents can display extreme scaling advantages over provably minimal classical counterparts [21]. We derive sufficient conditions under which such scaling advantages can occur, and illustrate them with a family of scenarios where the agent's decisions rely on events in the distant past. Complementing techniques for quantum agents to speed up the learning of effective strategies [22][23][24], our work illustrates that they will also be able to execute those strategies with lower memory overhead. Together, they represent key components of quantum-enhanced artificial intelligences.

II. FRAMEWORK
Agents and strategies. We describe adaptive agents as automatons that interact with their environment at discrete timesteps t ∈ ℤ. At each timestep the agent receives an input stimulus x_t ∈ X and responds with output action y_t ∈ Y, manifest as random variables X_t and Y_t respectively (throughout, upper case indicates random variables and lower case the corresponding variates). Taking t = 0 as the present, we denote the past sequences of stimuli and actions as ←x := …x_{−2}x_{−1} and ←y := …y_{−2}y_{−1} respectively. As shorthand we denote the pair z := (x, y), and similarly ←z := (←x, ←y) for the entire history. The agent's choice of action is governed by a strategy, describing the probability that the agent selects action y in response to stimulus x given preceding stimuli and actions ←z [25]. Each strategy P is thus defined by the distribution P(Y | ←Z, X); we assume strategies to be time-invariant [11,21].
To execute a desired strategy P, an agent must be able to produce actions in a manner statistically faithful to the distribution for any sequence of received stimuli. This necessitates that the agent possesses a memory system M that stores relevant information from the past. A brute-force approach would be to record all past stimuli and actions, allowing direct sampling from P(Y | ←z, x). However, storing the entire history quickly becomes prohibitively expensive.
A more refined approach is to use an encoding function f that maps possible histories {←z} to corresponding memory states from the set {σ_m}, labelled by m ∈ M. Given a history ←z, upon receiving any of the possible stimuli x the agent must be able to use its memory to:
1. produce output y with probability P(y | ←z, x); and
2. update the state of M to one consistent with the new history ←z z (i.e., f(←z z)).
This process is illustrated schematically in Fig. 1(a). It requires the agent to have a policy Λ -- a systematic procedure that governs the internal dynamics of the agent. Repeated application of Λ then allows the agent to execute the strategy over multiple timesteps. Provided such a Λ exists for an encoding function f, the pair can be used to specify an adaptive agent. That is, the tuple (X, Y, {σ_m}, f, Λ) formally defines an adaptive agent; see Technical Appendix A for further details.
Since the encoding function is a deterministic mapping from histories to memory states, we can succinctly describe the update of the memory by an update rule m′ = λ(z, m), where σ_m is the memory state corresponding to any given history ←z, and σ_{m′} that of ←z z. We can also replace the distribution P(Y | ←Z, X) by P(Y | M, X), where the substitution of histories by memory state labels is done in accordance with the encoding function (i.e., f(←z) = σ_m implies the substitution ←z → m).
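This bookkeeping -- memory labels, an update rule λ(z, m), and a conditional action distribution P(y | m, x) -- can be sketched in code. The following minimal Python example is not from the paper; the class name, dictionaries, and the toy two-state strategy (echo vs. invert, toggled by the stimulus) are all hypothetical.

```python
import random

class ClassicalAgent:
    """Illustrative sketch of a classical adaptive agent."""

    def __init__(self, policy, update, initial_memory):
        self.policy = policy      # dict: (m, x) -> {y: probability}
        self.update = update      # dict: ((x, y), m) -> m'
        self.m = initial_memory   # current memory-state label

    def step(self, x):
        """Receive stimulus x, emit action y, and update the memory."""
        dist = self.policy[(self.m, x)]
        actions, probs = zip(*dist.items())
        y = random.choices(actions, weights=probs)[0]
        self.m = self.update[((x, y), self.m)]
        return y

# Toy two-state strategy: memory state 0 echoes the stimulus, state 1
# inverts it, and receiving x = 1 toggles the memory state.
policy = {(m, x): ({x: 1.0} if m == 0 else {1 - x: 1.0})
          for m in (0, 1) for x in (0, 1)}
update = {((x, y), m): (1 - m if x == 1 else m)
          for m in (0, 1) for x in (0, 1) for y in (0, 1)}

agent = ClassicalAgent(policy, update, initial_memory=0)
print([agent.step(x) for x in [0, 1, 0, 1, 0]])  # -> [0, 1, 1, 0, 0]
```

Because the toy policy is deterministic, the output sequence above is fixed; a genuinely stochastic strategy would simply put non-degenerate weights in `policy`.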
Memory costs. Different choices of f lead to different memory states, and consequently, agents with different memory requirements. Here we are concerned with memory-minimal agents -- those able to extract and store the minimal amount of historical information possible whilst still being able to execute a given strategy for any future stimuli. Correspondingly, we take the amount of information stored in the agent's memory system M as our metric of performance:

C_{f,R} := S_vN(ρ),  with  ρ := Σ_m P(m) σ_m,   (1)

where S_vN is the von Neumann entropy [26] (reducing to the Shannon entropy for classical memory states), and P(m) is the distribution over memory states, here assumed to be their steady-state distribution [11,21]. The second subscript R recognises that this distribution typically depends on how the stimuli the agent receives are selected -- be they drawn from a stochastic process, or more generally, chosen by another agent responding to the actions of the agent. The procedure by which the input stimuli are selected is referred to as the input strategy R, as formally defined in Technical Appendix A. It is often useful to also consider a 'worst-case' information cost -- the amount of memory an agent must have available to be able to respond appropriately to any input strategy.

Memory-minimal classical agents.
Using tools from complexity science [27][28][29], the provably memory-minimal classical adaptive agents can be systematically determined [21]. If the strategy dictates that two histories ←z and ←z′ have statistically identical action responses for all possible future stimulus sequences, there is no need to distinguish between them in memory. Conversely, if they have different action responses, they must be represented by different memory states. This rationale, while seemingly simple, directly motivates an encoding function that can be shown to be memory-minimal in the design of classical adaptive agents.
This encoding function f_ε is thus defined by

f_ε(←z) = f_ε(←z′) ⟺ P(→Y | ←z, →X) = P(→Y | ←z′, →X).   (2)

The corresponding memory states {σ_s}, labelled by s ∈ S -- referred to as the causal states of the strategy [21,27] -- partition histories into equivalence classes based on their responses to future stimuli. The respective information cost Eq. (1) of this encoding function is given by C_{µ,R} = −Σ_{s∈S} P(s) log₂ P(s), where P(s) = Σ_{←z∈s} P(←z) for the given input strategy R.
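The classical cost C_{µ,R} is simply the Shannon entropy of the steady-state distribution over causal states. A quick numerical sketch (the probabilities are made up for illustration):

```python
from math import log2

# Classical memory cost: C_{mu,R} = -sum_s P(s) log2 P(s), the Shannon
# entropy of the steady-state causal-state distribution. The three-state
# distribution below is purely illustrative.
def classical_cost(p):
    return -sum(ps * log2(ps) for ps in p if ps > 0)

print(round(classical_cost([0.5, 0.25, 0.25]), 3))  # -> 1.5
```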
The agent as a whole is called the ε-transducer of the strategy [21], and crucially, is classically memory-minimal for any non-pathological input strategy. In recognition of this, its memory requirements are regarded as fundamental properties of the strategy; in particular, the worst-case information cost is designated the structural complexity of the strategy [21]. These ideas have seen application in contexts such as agent-based learning [30,31], energy harvesting [32], and the understanding of quantum contextuality [33].

III. QUANTUM ADAPTIVE AGENTS
A quantum adaptive agent is able to store and process quantum information in its memory system M, such that the encoding function f maps histories into quantum states {ρ_m}, and the policy Λ is a quantum channel. As per Eq. (1), the information cost of a quantum encoding function q is given by C_{q,R} = −Tr(ρ log₂ ρ), where ρ = Σ_m P(m) ρ_m. A specific design for a quantum agent has already demonstrated the potential for a quantum memory advantage over memory-minimal classical agents [11]. Yet there is great flexibility in how a quantum agent can be designed beyond these prior proof-of-principle constructions; we now proceed to explore how quantum agents can maximise their advantage.
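Since C_{q,R} is the von Neumann entropy of a mixture of (generally non-orthogonal) pure states, it can be computed from the memory-state overlaps alone: the nonzero eigenvalues of ρ coincide with those of the weighted Gram matrix G[i, j] = √(p_i p_j)⟨σ_i|σ_j⟩. A small numerical sketch (states and probabilities are illustrative, not taken from the paper):

```python
import numpy as np

# Quantum memory cost C_{q,R} = S_vN(rho) for rho = sum_m P(m)|s_m><s_m|,
# computed via the weighted Gram matrix of the pure memory states.
def quantum_cost(p, states):
    amps = np.sqrt(p)[:, None] * states   # row m: sqrt(p_m) times |s_m>
    gram = amps @ amps.conj().T           # weighted Gram matrix
    ev = np.linalg.eigvalsh(gram)
    ev = ev[ev > 1e-12]
    return float(-(ev * np.log2(ev)).sum())

p = np.array([0.5, 0.5])
orthogonal = np.eye(2)                    # classical-like, distinguishable
overlapping = np.array([[1.0, 0.0],
                        [np.cos(np.pi / 8), np.sin(np.pi / 8)]])
print(quantum_cost(p, orthogonal))        # 1 bit: no advantage
print(quantum_cost(p, overlapping))       # < 1 bit: non-orthogonality pays
```

The second cost is strictly below one bit, illustrating how non-orthogonal memory states compress below the classical entropy of the same state labels.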
A central result of this work (proven in Technical Appendix B) is the following set of constraints that a quantum agent can satisfy without penalty to its ability to achieve peak memory compression advantage:
• The agent receives input stimuli {x} encoded in the computational basis states {|x⟩}.
• The input stimulus is not consumed by the evolution of the agent; Λ preserves the input tape.
• The agent delivers output actions {y} via projective measurements in the computational basis states {|y⟩} of its output tape.
• The memory states are pure and in one-to-one correspondence with the strategy's causal states S.
That is, generalising beyond these features cannot provide further memory advantage. With stimuli and actions encoded as classical states, all quantum dynamics occur within the agent's internal dynamics -- the quantum memory advantage is not contingent upon access to a quantum environment. Further, these constraints imply a specific form of memory-minimal quantum agents.

Theorem 1: A provably memory-minimal quantum agent executing any strategy P -- for any input strategy R -- can always be realised using the circuit of Fig. 1(b). That is, the policy Λ is realised in two stages. The first stage is a unitary operator U acting on the joint system of (i) the agent's memory M, (ii) an input tape containing stimuli x encoded as |x⟩, (iii) an output tape initialised in |0⟩, and (iv) a 'junk' tape also initialised in |0⟩. Then, the output action y is realised by a computational basis measurement of the output tape, and the junk tape is discarded. Moreover, the memory states {|σ_s⟩} are all pure and in one-to-one correspondence with the causal states S of the strategy; the encoding function satisfies Eq. (2). The unitary evolution can be expressed as

U|σ_s⟩|x⟩|0⟩|0⟩ = Σ_y √P(y|x, s) |σ_{λ(z,s)}⟩|x⟩|y⟩|ψ(z, s)⟩,   (3)

where |ψ(z, s)⟩ represents the final state of the junk tape before it is discarded.
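To make the circuit form concrete, one can construct such a U explicitly for a toy strategy. The sketch below assumes the simplest possible setting -- a single causal state and trivial junk, so only the input and output tapes remain -- and is purely illustrative, not a construction from the paper:

```python
import numpy as np

# Toy instance of the Eq.-(3)-style circuit: joint basis |x, y> at index
# 2*x + y. We fix the columns U|x>|0> = sum_y sqrt(P(y|x)) |x>|y> and
# complete the remaining columns to a full unitary. P(y|x) is made up.
P = np.array([[0.75, 0.25],   # P(y | x = 0)
              [0.50, 0.50]])  # P(y | x = 1)

U = np.zeros((4, 4))
for x in (0, 1):
    for y in (0, 1):
        U[2 * x + y, 2 * x] = np.sqrt(P[x, y])

# Complete the unused columns (inputs arriving with a non-blank output
# tape) by orthonormal completion via a QR factorisation.
Q = np.linalg.qr(np.column_stack([U[:, [0, 2]], np.eye(4)]))[0]
U[:, 1], U[:, 3] = Q[:, 2], Q[:, 3]

print(np.allclose(U.T @ U, np.eye(4)))               # U is unitary
print(np.abs(U[:2, 0]) ** 2, np.abs(U[2:, 2]) ** 2)  # recover P(y|x)
```

Measuring the output-tape register of U|x⟩|0⟩ in the computational basis reproduces the strategy's statistics P(y|x) exactly, as the theorem's circuit form requires.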
This implies that the only effective degrees of freedom in designing an agent's memory encoding lie in the choice of junk states {|ψ(z, s)⟩}, as U and the memory states {|σ_s⟩} are then defined implicitly through Eq. (3). However, not every choice of junk states is physically realisable, due to the constraint that U is unitary. Consider the overlap of two memory states s and s′, given by c_{ss′} := ⟨σ_s|σ_s′⟩. Using the condition U†U = I and defining d^z_{ss′} := ⟨ψ(z, s)|ψ(z, s′)⟩, from Eq. (3) we obtain

c_{ss′} = Σ_y √(P(y|x, s)P(y|x, s′)) c_{λ(z,s)λ(z,s′)} d^z_{ss′}.   (4)

Though expressed for a given stimulus x, consistency requires that this equation yield identical {c_{ss′}} for all x. While this constraint can always trivially be satisfied by setting |ψ(z, s)⟩ = |s⟩ for all z, this forces the quantum memory states to be mutually orthogonal, recovering the classical ε-transducer and removing all quantum memory advantage. The crux of the quantum advantage thus lies in finding junk states that admit non-orthogonal memory states, and optimising their assignment to maximise it. It is tempting to look for simple junk states that are just complex scalars, removing the need for the junk tape altogether (as the corresponding phase can be absorbed by the output tape). However, this is generally impossible.

Theorem 2: Junk states {|ψ(z, s)⟩} cannot always be assigned as complex scalars. There exist strategies that can only be executed by quantum agents with access to a multi-dimensional junk tape that is discarded into the environment at each timestep.

To see why, suppose the junk states were mere phases, |ψ(z, s)⟩ = e^{iϕ_{zs}}; Eq. (4) then reduces to

c_{ss′} = Σ_y √(P(y|x, s)P(y|x, s′)) c_{λ(z,s)λ(z,s′)} e^{i(ϕ_{zs′}−ϕ_{zs})}.   (5)
The left-hand side of this equation represents the overlaps of the memory states, and so for consistency the right-hand side must be equal for all possible stimuli x. To prove the theorem, we need only establish that there is at least one strategy for which no set of phases {ϕ_{zs}} satisfies this condition. Consider the strategy illustrated in Fig. 2. For this strategy, Eq. (4) demands c_AB = 0, as there is no overlap in future statistics for stimulus 1. We must then have d^{(0,0)}_{AB} = 0 -- clearly impossible if |ψ(0, 0, A)⟩ and |ψ(0, 0, B)⟩ differ only by a phase factor, since any such overlap has unit modulus. Thus, Theorem 2 is proven.

This is not an isolated example. In Technical Appendix C we derive a sufficient condition on the strategy indicating that Eq. (5) cannot be satisfied for any set of phases {ϕ_{zs}}, and hence that non-scalar junk is required. Informally, this condition holds when the strategy has two states that must behave very similarly on one string of possible future stimuli, and very differently on another. The above example represents an extreme case of this.

The requirement of non-trivial junk has operational significance, as it mandates that the agent discard information into the environment at each timestep, corresponding to a source of thermal dissipation. The next theorem suggests this dissipation manifests from the data processing inequality.

Theorem 3: The magnitude of the overlap between any pair of quantum memory states cannot exceed the overlap of their future statistics for any input strategy R:

|c_{ss′}| ≤ Σ_{→y} √(P(→y | s, →x) P(→y | s′, →x))  for all future stimulus sequences →x.

Physically, this can be understood as requiring that the future statistics do not provide a means of distinguishing between quantum memory states beyond what is information-theoretically possible, imposing a constraint on their maximum fidelity [34]. In Technical Appendix D we show how this bound can be calculated. However, the bound cannot always be saturated; a counterexample is provided in Technical Appendix E.
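Assuming the bound takes the Bhattacharyya-overlap form of the future statistics (our reading of Theorem 3; the distributions below are illustrative), it is straightforward to evaluate numerically:

```python
import numpy as np

# Bhattacharyya overlap of two conditional future-action distributions:
# per our reading of Theorem 3, the memory-state overlap |c| is capped by
# this quantity for every future stimulus sequence. Distributions made up.
def future_overlap(p, q):
    return float(np.sum(np.sqrt(np.asarray(p) * np.asarray(q))))

p_futures = [0.7, 0.2, 0.1]   # P(future actions | s, fixed future stimuli)
q_futures = [0.1, 0.3, 0.6]   # P(future actions | s', same stimuli)
print(round(future_overlap(p_futures, q_futures), 3))
```

Identical distributions give overlap 1 (memory states may coincide), disjoint ones give 0 (memory states must be orthogonal), matching the Fig. 2 example above.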

IV. SYSTEMATIC QUANTUM AGENT DESIGN
We now provide a systematic method for assigning junk states such that the corresponding quantum agents achieve superior memory efficiency relative to memory-minimal classical [21] and prior quantum counterparts alike [11]. The design involves an effective representation of each memory state in tensor-product form |σ_s⟩ = ⊗_x |σ^x_s⟩, where the {|σ^x_s⟩} behave as memory states specialised to each input (see Technical Appendix F). These have associated overlaps c^x_{ss′} := ⟨σ^x_s|σ^x_{s′}⟩, such that c_{ss′} = Π_x c^x_{ss′}. In this representation we identify the junk states as |ψ(z, s)⟩ = ⊗_{x′≠x} |σ^{x′}_s⟩, and correspondingly, their overlaps (for pairs with identical z) as d^z_{ss′} = Π_{x′≠x} c^{x′}_{ss′}. In Technical Appendix F we prove that for any strategy, a unitary of the form Eq. (3) can always be found that is based on these states. Given a strategy's ε-transducer, Algorithm 1 then provides a systematic means of designing quantum agents with this encoding.
Algorithm 1: Quantum agent memory encoding.
1: Set up the self-consistency conditions on the overlaps, defined ∀s, s′ ∈ S, x ∈ X, and solve to obtain {c^x_{ss′}}.
2: Use a reverse Gram-Schmidt procedure [16,35] to construct memory states realising these overlaps.

In this encoding, any given pair of memory states has non-zero overlap iff there is no string of input stimuli for which they are certain to produce distinguishable strings of output actions; provided at least one such pair exists, the quantum agent exhibits a memory advantage over provably minimal classical counterparts [11,12,26]. Note that despite their factorised representation presenting as an |S||X|-dimensional space, the reverse Gram-Schmidt procedure ensures that the memory states can be supported by a memory system of at most |S| dimensions.
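The 'reverse Gram-Schmidt' step amounts to constructing explicit state vectors from a target Gram matrix of overlaps; a Cholesky factorisation is one way to sketch this (the overlap values below are illustrative, not derived from any particular strategy):

```python
import numpy as np

# Reverse Gram-Schmidt sketch: given target overlaps c[s, s'] (a positive
# semi-definite Gram matrix with unit diagonal), the rows of a Cholesky
# factor L are state vectors satisfying <row_s|row_s'> = c[s, s'].
def states_from_overlaps(c):
    return np.linalg.cholesky(c)   # c must be positive definite here

c = np.array([[1.0, 0.8, 0.3],
              [0.8, 1.0, 0.5],
              [0.3, 0.5, 1.0]])
states = states_from_overlaps(c)
print(np.allclose(states @ states.T, c))  # target overlaps reproduced
```

Note the three states obtained this way live in a three-dimensional space, consistent with the remark that at most |S| dimensions suffice regardless of the factorised |S||X|-dimensional representation.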

V. SCALING ADVANTAGE
The memory advantage of quantum agents can grow without bound. Consider a setting where an agent's optimal strategy depends on tracking some continuous parameter τ of its environment. This can occur when naturally continuous parameters are involved, such as spatial position or time. Alternatively, for strategies with a dependence on events long ago in the past, the set of pasts {←z} can be mapped to a continuous parameter over the interval [0, 1), by taking ←z to specify a |Z|-ary fraction. In either case, small differences in τ often require only slightly different responses to future stimuli. However, if an agent must store τ precisely, it requires an unbounded amount of memory.
To circumvent this, the conventional classical method is coarse-graining, in which an approximation of the optimal strategy P is executed based on storing τ only to some finite precision. That is, the range of τ is divided into a set of discrete bins, and all values of τ within a given bin are mapped to the same memory state. An n-bit-precision coarse-graining divides τ into 2^n such bins, each of width δτ^{(n)}; the corresponding coarse-graining of the strategy is denoted P^{(n)}. For a classical agent, the memory cost then diverges linearly with n [36], forcing a trade-off between precision and memory cost.
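The linear divergence is easy to see in the simplest case of a uniformly distributed τ, where every bin is equally likely and the cost is exactly n bits. A toy calculation:

```python
from math import log2

# Toy illustration: store a uniformly distributed tau in [0, 1) to n-bit
# precision. With 2**n equally likely bins, the classical memory cost is
# the Shannon entropy of the bin distribution, which grows linearly in n.
def coarse_grained_cost(n):
    num_bins = 2 ** n
    p = 1.0 / num_bins                  # uniform steady state over bins
    return -num_bins * p * log2(p)      # = n bits

print([coarse_grained_cost(n) for n in (1, 4, 8)])  # -> [1.0, 4.0, 8.0]
```

Non-uniform distributions over τ change the constant but not the qualitative linear growth as the bins are refined.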
On the other hand, quantum agents may be able to avoid such divergences. Consider a family of quantum agents that implement coarse-grainings P^{(n)} of a strategy P at each level of precision n ∈ ℕ. Consider also the following pair of convergence conditions, defined formally in Technical Appendix G:
• Distributional convergence: The steady-state probabilities (or densities) of the memory states converge exponentially with increasing precision.
• Memory-overlap convergence: The overlaps of each pair of memory states converge exponentially with increasing precision.
These convergence conditions encapsulate the intuition that if the strategy varies smoothly with a continuous parameter, then so too may the properties of the memory states of a quantum agent executing the strategy. When these conditions are met, a quantum agent can execute the strategy P to arbitrary precision with bounded memory cost, giving rise to a scaling advantage over classical agents. The formal statement of this result is given in Theorem 4, which may be found in Technical Appendix G together with its proof.

We illustrate such scaling advantages for agents tasked with executing strategies that require co-ordinated stimulus-action responses over an increasing number of timesteps, using an example family of resettable stochastic clocks. In this setting, the agent is tasked with behaving as a clock with stochastic tick events that may be reset by an external stimulus. This stimulus can take two values: x = 0 for 'evolve normally' and x = 1 for 'reset', while the possible actions are y = 0 for 'no tick' and y = 1 for 'tick'. When x = 0, the agent behaves as a stochastic clock [37,38], modelled by a renewal process [39] in which the agent emits a tick at stochastic intervals t governed by a distribution φ(t). Upon receiving x = 1, however, the agent must immediately reset its time-counter, such that the clock behaves as though it has just ticked. The agent must replicate this behaviour to some desired temporal resolution, such that time is broken into finite timesteps δt -- as illustrated diagrammatically in Fig. 3(a), with further details in Technical Appendix H. For a given φ(t) this prescribes a family of coarse-grained strategies parameterised by δt.
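A minimal simulation of such a resettable clock can be sketched as follows. The dynamics are assumed for illustration: for φ uniform over N = τ/δt timesteps, the hazard of ticking after k silent steps is 1/(N − k); the function name and interface are hypothetical.

```python
import random

# Toy resettable stochastic clock: tick intervals uniform over N timesteps,
# so the hazard of ticking after k silent steps is 1/(N - k). Stimulus
# x = 1 resets the elapsed-time counter, x = 0 lets the clock run.
def run_clock(stimuli, N, seed=0):
    rng = random.Random(seed)
    k, ticks = 0, []
    for x in stimuli:
        if x == 1:
            k = 0                      # reset: behave as if just ticked
        tick = rng.random() < 1.0 / (N - k)
        ticks.append(1 if tick else 0)
        k = 0 if tick else k + 1       # steps elapsed since last tick
    return ticks

stimuli = [0] * 20 + [1] + [0] * 20    # one reset halfway through
print(run_clock(stimuli, N=8))
```

The counter k is exactly what a classical agent must store, and its range grows as δt shrinks; the quantum memory states for nearby values of k can overlap strongly, which is the source of the bounded quantum cost discussed next.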
In Technical Appendix H we show that our quantum agents satisfy the convergence conditions for a large class of φ(t) representing typical resettable stochastic clocks, and thus may execute them to arbitrary precision with bounded cost. Meanwhile, the memory-minimal classical models must store an ever-increasing amount of information as δt is refined. That is, our quantum agents converge to a finite memory cost, while that of the classical agents diverges. Fig. 3(b) highlights this by comparing the scaling of our quantum agents (labelled C_{q∞}) with the memory-minimal classical (C_µ) and best prior quantum counterparts (C_{q1}) for the particular case where φ(t) is uniformly distributed over the interval [0, τ], and resets are triggered at a constant rate 1/2τ [40].

VI. DISCUSSION
We have introduced a general framework for adaptive agents that capitalise on access to a quantum memory to reduce the information they must track about past stimuli and actions. Key to this, we isolated the features of an agent that are relevant to memory advantages, and showed that they are in direct correspondence with the information it discards into its environment. Coupled with this, we provided a systematic algorithm for encoding the memory states of a quantum agent for any strategy, achieving a memory advantage relative to minimal classical and prior state-of-the-art quantum counterparts. Moreover, this advantage can grow without bound. These advantages can be utilised by agents both for executing fixed strategies and for running candidate strategies during their development [41][42][43][44], as well as by researchers modelling the behaviour of agents. Our systematic quantum agent design may also be used to enhance mechanical agents, for example, by endowing smart technologies with quantum processors. Our framework is agnostic to the specific engineering details of its implementation, and so can be realised with any quantum architecture that can receive (classical) input, and process and store quantum information according to the required policy evolution of Eq. (3). Proof-of-principle demonstrations are feasible with current setups, for example by adapting prior implementations of quantum models of passive stochastic processes in photonic setups [45,46] to undergo different evolutions at each timestep conditional on the input.
Our results use entropic benchmarks for the memory, thus naturally assuming an ensemble setting. They describe quantum memory advantages with operational relevance for multiple agents implementing a strategy in parallel with shared memory [15]. This aligns well with scenarios where one wishes to sample over the conditional distributions for various strategies, for example in Markov Chain Monte Carlo-type methods [47]. A compelling extension is to single-shot settings, where one may instead consider the max entropy -the dimension of the state-space inhabited by the memory states. Single-shot advantages have been found for quantum models of passive stochastic processes [18-20, 45, 48, 49], and for specific cases of input-output behaviour modelling repeated measurement of a quantum system [50]. Since our general treatment ultimately relates to what can affect memory state overlaps, many of our results will continue to hold in single-shot settings -in particular our form of the memory-minimal quantum agent -and thus can direct the search for systematic encodings based on other such benchmarks. Based on links established between quantum compression advantages and thermal efficiency in stochastic modelling [51,52], one may expect that our quantum agents are also able to execute their strategies with less thermal dissipation than classical counterparts.
A further enticing extension would be to the case where only near-faithful execution of the strategy is required -- that is, where some error is tolerated [49,53,54]. Our quantum agents bear a similarity to models of quantum walks with memory [55][56][57] and other instances of memory compression through quantum processing [58,59], such as quantum auto-encoders [60][61][62][63]. Moreover, our general form for quantum adaptive agents Eq. (3) produces superpositions of all possible future trajectories for the input [16], potentially allowing for interference experiments that probe the overlap in the distributions of different strategies [46], or of different input sequences. One can also consider superpositions of input sequences, akin to algorithms in quantum-enhanced reinforcement learning [23,24,64], where our agents may augment existing quantum speed-ups with extreme memory advantages.

TECHNICAL APPENDIX A: Framework (Extended)
Here we provide further details of the framework used to describe adaptive agents, containing additional material relevant to the remaining appendices. We begin by formally defining an adaptive agent, as introduced in the main text.
Definition 1: (Adaptive agent) An adaptive agent is specified by a tuple (X, Y, {σ_m}, f, Λ), where:
• X is the set of stimuli the agent can recognise;
• Y is the set of actions the agent can perform;
• {σ_m} is the set of memory states the agent can store in its memory system M, labelled by an index m ∈ M;
• f : ←Z → {σ_m} is the encoding function that determines the memory state to which the agent assigns each history ←z; and
• Λ is the policy, describing how the agent selects action y in response to stimulus x given its current memory state, and how the memory state is updated.
An encoding f is said to be a valid encoding of a strategy P if there exists a policy Λ by which the agent is able to execute actions in a manner statistically faithful to the strategy for every possible history and sequence of future stimuli. That is, f is valid iff there exists a Λ for which the agent's action statistics satisfy P_{f,Λ}(→Y | f(←z), →x) = P(→Y | ←z, →x) for all ←z and →x. An agent with such a policy and encoding function is then said to faithfully execute strategy P. Hereon, we consider such faithful agents.

The physics of the memory states determines the physics of the agent; that is, a classical agent can only store classical states in its memory and use classical dynamics for its policy, while for a quantum agent M can support quantum states, and Λ takes the form of a quantum channel. A strategy P can be described as a conditional distribution P(Y | ←Z, X). Mathematically, this corresponds to a stochastic input-output process [11,21,65,66], where the stimuli are the inputs, the actions are the outputs, and the process maps stimuli and past actions to future actions. Consequently, our results encompass as a special case quantum models of passive stochastic processes -- stochastic processes that evolve autonomously without environmental input -- by taking the input alphabet to consist of only a single symbol (i.e., the strategy does not condition on any observed stimuli).
There are certain conditions implicitly placed on these input-output processes due to the limits of what an agent can predict about the future. That is, an agent cannot leverage information about future events that cannot be deduced from what it has already seen. The two conditions are referred to as the agent being non-anticipatory and causal [21]. The former requires that the strategy for choosing the current action must not depend on future input stimuli whenever these future stimuli are generated independently of past actions, i.e., P(Y_0 | ←Z, X_0, →X_1) = P(Y_0 | ←Z, X_0) [11,21,67]. The latter requires that the memory of an agent can depend only on the past, and not the future -- i.e., that f is a deterministic map from histories to memory states [18]. We also assume that the strategy is stationary (time-invariant), such that the weightings P(Y | ←Z, X) are independent of the timestep t. Pasts and futures are taken to consist of semi-infinite strings of stimuli and actions. That is, at t = 0 we take ←x := lim_{l→∞} x_{−l:0} and →x := lim_{l→∞} x_{0:l}, where x_{k:l} := x_k x_{k+1} … x_{l−1} denotes a contiguous string in the interval k ≤ t < l.
In the main text we note that the input stimuli are, in full generality, drawn from an input strategy, where the stimuli manifest as actions of the agent's environment, potentially conditioned on the previous actions of the agent.

Definition 2: (Input strategies) An input strategy R is an input-output stochastic process specified by a conditional distribution R(X_t | ←Z_t) used to generate the input stimuli of an adaptive agent. That is, it maps histories {←z} to the next stimulus received by the agent.
The subscripts indicate that the input strategy can have a temporal dependence (i.e., that it need not be stationary), while the conditioning on the entire history allows the stimuli to depend on the actions of the agent. In the case where the stimuli are generated independently of the agent's actions, R reduces to a passive stochastic process. Note that in previous works the worst-case memory cost was considered only with respect to such input stochastic processes [11,21], rather than the more general input strategies described here.

B: Proof of Theorem 1
We begin with the most general form a quantum adaptive agent can take, progressively examining each aspect to ascertain whether it is essential to its function, and whether it offers potential compression advantages, in order to constrain the most general form of a functional agent.
In full generality, at each timestep, we have an evolution (i.e., a quantum channel) that acts on the current memory state ρ_m and the input stimulus x, encoded into a state ρ_x. These are mapped by the policy to an output action y, extractable from a state ρ_Y(x, m) with probability P(y|x, m), and an updated memory state ρ_{m′} according to m′ = λ(x, y, m). For a complete accounting, we allow for the inclusion of a 'blank' ancilla tape |0⟩ with the input, and a 'junk' state |ψ(x, y, m)⟩ with the output -- both may without loss of generality be considered in their purified form [26].

Lemma 1: There is no further quantum advantage from allowing memory states to be non-pure. Moreover, there is no further advantage for the memory states to be anything other than in one-to-one correspondence with the causal states of the ε-transducer.
These results follow by generalising the so-called causal state correspondence [34] and mixed state exclusion [18] found for quantum models of passive stochastic processes to the case of strategies. These establish that the memory states of the minimal quantum agents are in one-to-one correspondence with the causal states of the strategy, and can be instantiated as pure states. Our proofs of the generalisations largely follow those of the originals, with the modification to input-conditioned probability distributions.

Proposition 1: (Causal state correspondence) For any strategy P with causal encoding function f_ε, there exists a memory-minimal causal, non-anticipatory quantum agent implementing the strategy with memory encoding function f satisfying f(←z) = f(←z′) ⟺ f_ε(←z) = f_ε(←z′) for all past histories ←z and ←z′.
We first prove the reverse direction through its contrapositive. Suppose we had two histories $\overleftarrow{z}$ and $\overleftarrow{z}'$ belonging to different causal states, but mapped to the same memory state by $f$. The former condition implies that there is some future input for which the two histories give rise to different conditional future statistics. Since the two memory states are identical, there is no quantum operation that could distinguish between them, and hence no operation that could produce different future statistics from them; thus no quantum agent can generate the correct conditional future statistics for both histories. Therefore, we require that histories belonging to different causal states are mapped to different memory states.

The forward direction follows from concavity of entropy [26]. Consider the set of histories $\{\overleftarrow{z}\}$ belonging to causal state $s$. We define the contribution to the steady-state of the memory coming from histories not in this set as $\bar{\rho}_s$. From the concavity of entropy, it follows that assigning all histories in $s$ to a single memory state can only lower (or leave unchanged) the entropy of the steady state. Let $\overleftarrow{z}^*$ be the particular history that minimises the resulting entropy. We thus have that for any valid quantum agent, an encoding which assigns all histories belonging to $s$ to $f(\overleftarrow{z}^*)$ will have lower or equal entropy. Moreover, the modified encoding is also a valid encoding: as the future statistics the agent must produce from $f(\overleftarrow{z}')$ for any other history $\overleftarrow{z}' \in s$ are the same as those that must be produced from $f(\overleftarrow{z}^*)$, an encoding with $f(\overleftarrow{z}') = f(\overleftarrow{z}^*)\ \forall\,\overleftarrow{z}' \in s$ will produce the correct future statistics. This procedure can be repeated for histories belonging to all other $s' \neq s$, and we hence find that for any quantum agent there exists another quantum agent implementing the same strategy with lower or equal entropy, using an encoding function that assigns all histories in the same causal state to the same memory state.

Proposition 2: (Mixed state exclusion) For any quantum agent implementing a strategy $P$ using a valid encoding with memory states $\{\rho_m\}$, there exists a valid encoding of lower or equal entropy with pure memory states $\{|\sigma_m\rangle\}$.
We start by invoking the causal state correspondence, such that our goal is to show that for any valid quantum encoding for a strategy $P$ with memory states $\{\rho_s\}$, there exists a valid quantum encoding of lower or equal entropy with pure memory states $\{|\sigma_s\rangle\}$. Suppose a particular memory state $\rho_s$ is non-pure, such that we can decompose it as $\rho_s = \sum_j p_j |a_j\rangle\langle a_j|$ for some set of pure states $\{|a_j\rangle\}$. Recall that causality demands the memory contain no information about the future that cannot be determined from its past; given the past stimuli and actions, there must be no correlations between the memory states and the futures they produce. This means that each of the states $\{|a_j\rangle\}$ in our decomposition of $\rho_s$ must individually give rise to the same statistical futures as $\rho_s$, and thus a valid quantum encoding can be formed by replacing $\rho_s$ with any of the $|a_j\rangle$. We again collect all contributions to the steady-state from terms belonging to causal states other than $s$ as $\bar{\rho}_s$, such that $\rho = \sum_j p_j (P(s)|a_j\rangle\langle a_j| + \bar{\rho}_s)$. From concavity of entropy, at least one of these replacements cannot increase the entropy of the steady state. Let $|a_j\rangle$ be the particular state that minimises the resulting entropy, and designate it as $|\sigma_s\rangle$. We can thus obtain a valid quantum encoding of lower or equal entropy after replacing $\rho_s$ with $|\sigma_s\rangle$. We can repeat the procedure for the memory states corresponding to other causal states, thus obtaining a valid encoding of lower or equal entropy where all memory states are pure.

Note that the above proposition is not specific to the von Neumann entropy, and holds for any entropy satisfying concavity. Since Lemma 1 allows us to restrict our attention to pure memory states, and the Gram matrix representation [68] of an ensemble of pure quantum states allows us to express the entropy as a function of pairwise overlaps of the states, we can hereon consider features of the agent that can affect the overlap of memory states to be synonymous with those that can (potentially) reduce the memory cost.
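Both arguments hinge on concavity of the von Neumann entropy, $S(\sum_j p_j \rho_j) \geq \sum_j p_j S(\rho_j)$. As a quick numerical sanity check (a standalone sketch, not tied to any particular agent), consider mixing two pure qubit states:

```python
import numpy as np

def von_neumann_entropy(rho):
    """Entropy in bits of a density matrix (tiny eigenvalues clipped for log safety)."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]
    return float(-np.sum(evals * np.log2(evals)))

def ket_bra(v):
    """Projector |v><v| onto a (normalised) state vector."""
    v = np.asarray(v, dtype=complex)
    v = v / np.linalg.norm(v)
    return np.outer(v, v.conj())

# Two pure qubit states with non-trivial overlap
rho_a = ket_bra([1, 0])
rho_b = ket_bra([1, 1])

p = 0.5
mixed = p * rho_a + (1 - p) * rho_b

# Concavity: mixing can only raise the entropy
assert von_neumann_entropy(mixed) >= p * von_neumann_entropy(rho_a) + (1 - p) * von_neumann_entropy(rho_b)
# Pure states have zero entropy, while the non-trivial mixture does not
assert von_neumann_entropy(rho_a) < 1e-9
assert von_neumann_entropy(mixed) > 0
```

Replacing a mixed memory state by its lowest-entropy pure component, as in the proof, exploits exactly this gap.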
Lemma 2: There is no memory advantage to encoding the input stimulus $x$ as anything other than the computational basis state $|x\rangle$. Moreover, the input state need not be consumed by the evolution.
Consider that for each input state $\rho_x$ there is a computational basis state $|x\rangle$ appended to it which remains unchanged by the evolution. Since this appended state can be factored out, it does not influence the overlaps of the memory states, and hence does not affect the amount of information stored. However, we can perform operations conditioned on the appended states, which, since they are orthogonal, allows us to imprint the $\rho_x$ directly onto part of the blank ancilla space and proceed as before. Specifically, we can realise this as a unitary operation $U_X|x\rangle|0\rangle|0\rangle$, where the third subspace is discarded into the junk and $\rho_x$ is the resulting state of the second subspace after tracing out the other two. We see that it is sufficient to consider orthogonal input states $\{|x\rangle\}$, which can be used to mimic the effect of any set of input states; in effect, this accounts for the pre-processing used to create $\rho_x$ from the input stimulus as part of the evolution. As the appended input space is not affected by the evolution, it can later be used to retrieve the input stimulus.

Lemma 3: There is no memory advantage for the extraction of $y$ from $\rho_Y(x, m)$ to be anything other than a projective measurement in the computational basis.
The output action must be extracted from $\rho_Y(x, m)$ through measurement. Neumark's dilation theorem allows us to express any quantum measurement as a projective measurement on a purified state in a larger space [69][70][71]; we can consider any model of the extraction that does not strictly use projective measurements to effectively be relegating this extended space into the junk. This dilation does not change the evolution of the memory state, and hence there is no penalty to working in the projective measurement picture. As Lemma 2 allows us to take the input states $|x\rangle$ to be orthogonal, we can consider the output subspace to always be conditionally rotated at the end of the evolution such that the appropriate measurement basis is the computational basis, independent of the input stimulus.
With these lemmas, we can express the evolution at each timestep by a global unitary operator $U$ [Eq. (3)]. The amplitudes follow from the requirement that outcome $y$ must be obtained with probability $P(y|x, s)$ [16,20], and without loss of generality can be taken to be real by offloading any phase factors into the junk subspace.

C: Sufficiency condition for necessity of junk
Here we provide a sufficient (but not necessary) condition on a strategy under which no physically-realisable quantum agent can implement the strategy without the use of discarded junk states.
For a given strategy, consider a pair of states $s$ and $s'$ and strings of stimuli $x_{0:L}$ and actions $y_{0:L}$ for which $\lambda(z_{0:L}, s) = \lambda(z_{0:L}, s')$, where the output of the update function on a string of stimuli/actions is understood to be the sequential application of the update for each timestep (i.e., $\lambda(z_0 z_1, s) := \lambda(z_1, \lambda(z_0, s))$). Let us for shorthand denote $p := P(y_{0:L}|x_{0:L}, s)$ and $p' := P(y_{0:L}|x_{0:L}, s')$. Iterating through Eq. (5), we obtain that this provides a contribution of magnitude $\sqrt{pp'}$ to the overlap of the two memory states if there is no junk. The magnitude of the remaining terms (corresponding to other action strings) is then bounded by $\sqrt{(1-p)(1-p')}$. If $p + p' = 1 + \alpha$ for some non-negative $\alpha$, we then have that without junk the overlap of the corresponding memory states must be strictly greater than $\alpha/2$. This can be verified by direct substitution into the condition $\sqrt{pp'} - \sqrt{(1-p)(1-p')} > \alpha/2$ after rearrangement and squaring, and using that $p + p' > 1$ implies $pp' > (1-p)(1-p')$. We note that this condition need only be met for a single pair of stimuli strings on a single pair of states in order for the agent to require junk.
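The final inequality can be checked numerically. The sketch below (self-contained, with randomly drawn probabilities) verifies the underlying identity $\sqrt{pp'} - \sqrt{(1-p)(1-p')} = \alpha/\big(\sqrt{pp'} + \sqrt{(1-p)(1-p')}\big)$ whenever $p + p' = 1 + \alpha$, from which the bound $> \alpha/2$ follows, since the denominator is at most 1:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    p, pp = rng.uniform(0.0, 1.0, size=2)
    alpha = p + pp - 1.0
    if alpha <= 0:
        continue  # the condition only applies when p + p' > 1
    gap = np.sqrt(p * pp) - np.sqrt((1 - p) * (1 - pp))
    denom = np.sqrt(p * pp) + np.sqrt((1 - p) * (1 - pp))
    # identity: sqrt(pp') - sqrt((1-p)(1-p')) = alpha / (sqrt(pp') + sqrt((1-p)(1-p')))
    assert abs(gap - alpha / denom) < 1e-10
    # the denominator is at most 1 (AM-GM on each term), so the gap exceeds alpha >= alpha/2
    assert gap >= alpha - 1e-10
```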

D: Bounding quantum memory state overlaps
Eq. (6) in the main text places an upper bound on the overlap between any pair of quantum memory states, based on the distinguishability of their future statistics. Here, we provide two methods by which this bound can be calculated: the first method is approximate, with a computational cost that grows quadratically with the number of causal states and linearly with the depth of the approximation; the second is exact, but bears an exponential scaling in cost.
Suppose we are told that the memory has been initialised in one of two memory states $\{|\sigma_s\rangle, |\sigma_{s'}\rangle\}$, and we are asked to determine which one with a fixed number of input stimuli $L$. Obviously, if $L = 0$, we are unable to distinguish between the possible states. With $L = 1$, we wish to choose the stimulus $x$ that minimises the fidelity of the next output action, i.e., $\mathrm{argmin}_x \sum_y \sqrt{P(y|s,x)P(y|s',x)}$. For $L = 2$, we are able to choose the second stimulus based on the action output in response to the first, and the first stimulus should be chosen bearing this in mind.
Denoting $F^{(1)}_{ss'} := \min_x \sum_y \sqrt{P(y|s,x)P(y|s',x)}$, we see that the best strategy for choosing the first stimulus $x$ is $\mathrm{argmin}_x \sum_y \sqrt{P(y|s,x)P(y|s',x)}\,F^{(1)}_{\lambda(z,s)\lambda(z,s')}$. An iterative strategy can be developed, leading to our first method: define $F^{(L)}_{ss'} := \min_x \sum_y \sqrt{P(y|s,x)P(y|s',x)}\,F^{(L-1)}_{\lambda(z,s)\lambda(z,s')}$, with $F^{(0)}_{ss'} := 1$, and iterate to the desired depth $L$.

The second method makes use of the fact that for each pair of memory states there is an optimal choice of next input stimulus, conditional on the number of subsequent input stimuli we are able to make. Observing that in the limit of the above iterative procedure the overlaps should satisfy the fixed-point relations $F_{ss'} = \sum_y \sqrt{P(y|s,x^*)P(y|s',x^*)}\,F_{\lambda(z,s)\lambda(z,s')}$ at the optimal stimulus $x^*$ for each pair, we can postulate the optimal stimulus for each pair, and solve the associated linear equations. Minimising this over all possible postulates for the optimal stimuli, we obtain the actual bound. However, there are $|\mathcal{X}|^{|\mathcal{S}|(|\mathcal{S}|-1)/2}$ possible assignments of stimuli, and hence the computational cost of this method scales exponentially with the number of causal states.
We can also consider a hybrid of the two methods, to obtain an improved estimate over the first: begin by carrying out the first method to some desired depth L, then using the corresponding arguments that minimise the expressions as the postulate, evaluate the recursion relations from the second method.
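A minimal sketch of the first (iterative, depth-limited) method is below. The strategy used is a small hypothetical example (probabilities and the Markovian update rule are invented for illustration), not one drawn from the text:

```python
import itertools
import numpy as np

# Toy strategy: P[s][x] is a list indexed by y of probabilities P(y|s, x);
# lam[(x, y, s)] gives the updated memory state. All values are illustrative only.
states, stimuli, actions = [0, 1], [0, 1], [0, 1]
P = {0: {0: [0.9, 0.1], 1: [0.5, 0.5]},
     1: {0: [0.6, 0.4], 1: [0.5, 0.5]}}
lam = {(x, y, s): y for x in stimuli for y in actions for s in states}  # Markovian update

def fidelity_bound(depth):
    """Iterate F^(L)_{ss'} = min_x sum_y sqrt(P(y|s,x) P(y|s',x)) F^(L-1), F^(0) = 1."""
    F = {(s, t): 1.0 for s in states for t in states}
    for _ in range(depth):
        F = {(s, t): min(
                sum(np.sqrt(P[s][x][y] * P[t][x][y]) * F[(lam[(x, y, s)], lam[(x, y, t)])]
                    for y in actions)
                for x in stimuli)
             for s, t in itertools.product(states, repeat=2)}
    return F

F5 = fidelity_bound(5)
F10 = fidelity_bound(10)
assert abs(F5[(0, 0)] - 1.0) < 1e-12        # identical states are never distinguished
assert F10[(0, 1)] <= F5[(0, 1)] + 1e-12    # the bound can only tighten with depth
assert 0.0 <= F10[(0, 1)] <= 1.0
```

Because the recursion is monotone (by induction, $F^{(L)} \leq F^{(L-1)}$ elementwise), running it deeper never loosens the estimate; the exact bound is its $L \to \infty$ limit.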

E: Counterexample to fidelity bound tightness
As noted in the main text, counterexamples to the tightness of the fidelity upper bound on memory state overlap Eq. (6) exist. Here we provide such a counterexample.
Consider an agent with three memory states $\{s_a, s_b, s_c\}$, three actions $\{a, b, c\}$ and two stimuli $\{0, 1\}$. The dynamic is Markovian, such that after action $y$ the memory transitions to state $s_y$. Let the corresponding strategy be defined by the probabilities illustrated in the accompanying figure, with the remaining unspecified probabilities all zero. Each of the states possesses non-equal output responses to the stimuli, and so they form the causal states of the strategy. From stimulus 0 we obtain a set of upper bounds on the memory state overlaps, while stimulus 1 yields a further set of bounds; if the fidelity bound is to be saturated, all of these must hold simultaneously. For stimulus 1, the evolution must take each memory state to the corresponding action together with a junk state $|\psi(1, a, s)\rangle$. To attain the prescribed values of $|c_{ab}|$ and $|c_{bc}|$ we must have $|\psi(1, a, s_a)\rangle = e^{i\varphi_1}|\psi(1, a, s_b)\rangle = e^{i\varphi_2}|\psi(1, a, s_c)\rangle$, i.e., equal up to phase factors. However, the condition on $|c_{ac}|$ would then require a relation between these phases that clearly cannot be satisfied. Thus, the fidelity bound cannot be saturated. Interestingly, this manifests only for non-trivial strategies; for passive stochastic processes it is always possible to construct a quantum model of the process with trivial (i.e., one-dimensional) junk states that saturates the fidelity bound [16,20].

F: Construction of memory states and evolution (Algorithm 1)
In the factorised representation, the memory states take the form $|\sigma_s\rangle = \bigotimes_x |\sigma^x_s\rangle$, with substate overlaps $c^x_{ss'} := \langle\sigma^x_s|\sigma^x_{s'}\rangle$ and $c_{ss'} = \prod_x c^x_{ss'}$. Consider input stimulus-specific unitaries $\{U_x\}$ that act in the following manner [16,20] on the corresponding input-specialised memory substates: $U_x|\sigma^x_s\rangle|0\rangle = \sum_y \sqrt{P(y|x,s)}\,|\sigma_{\lambda(z,s)}\rangle|y\rangle$, where we have combined the first and second subspaces on the left-hand side together on the right; this implicitly defines the memory substates. We also define a selection operation $U_{\mathrm{select}}$ that ensures that the correct memory substate is acted on with the correct $U_x$, conditioned on the input state. Specifically, we define this operation to permute the memory substates conditioned on stimulus $x$ such that the $x$th memory substate is in the first position, and to exchange the remaining memory substates with the junk.
We then act with $U_x$ conditioned on the input state. Defining $U = (\sum_x U_x \otimes |x\rangle\langle x| \otimes I)U_{\mathrm{select}}$, we obtain the total evolution consistent with Eq. (3). In this representation, the junk states are given by $|\psi(z, s)\rangle = \bigotimes_{x' \neq x} |\sigma^{x'}_s\rangle$, i.e., the unused memory substates corresponding to the other input stimuli. Using that $U^{\dagger}U = I$, we obtain a set of consistency relations between the memory state overlaps. These can then be reduced to be purely in terms of the substate overlaps, recovering Eq. (7):

$c^x_{ss'} = \sum_y \sqrt{P(y|x,s)P(y|x,s')}\,\prod_{x'} c^{x'}_{\lambda(z,s)\lambda(z,s')}. \quad (20)$

As described in the algorithm, the overlaps can then be found by solving this set of multivariate polynomial equations. A solution always exists for any process that asymptotically synchronises (i.e., $\lim_{L\to\infty} H(S_0|Z_{0:L}) = 0$): since a sufficiently long string of past stimulus-action pairs allows the causal state to be determined with certainty, by iterating through the recursion relations we obtain the solution

$c_{ss'} = \lim_{L\to\infty} \prod_{x_0}\sum_{y_0}\sqrt{P(y_0|x_0,s)P(y_0|x_0,s')} \times \prod_{x_1}\sum_{y_1}\sqrt{P(y_1|x_1,\lambda(z_0,s))P(y_1|x_1,\lambda(z_0,s'))} \times \cdots \times \prod_{x_L}\sum_{y_L}\sqrt{P(y_L|x_L,\lambda(z_{0:L},s))P(y_L|x_L,\lambda(z_{0:L},s'))}.$

The final step is to use forward and reverse Gram-Schmidt procedures [16,35] to construct the memory states, junk states, and evolution operator. Notably, while the factorised memory state representation is specified in terms of an $|\mathcal{S}|^{|\mathcal{X}|}$-dimensional space, because there are only $|\mathcal{S}|$ memory states the reverse Gram-Schmidt procedure ensures that the constructed memory states inhabit only an $|\mathcal{S}|$-dimensional space. Similarly, because overlaps of junk states corresponding to different $z$ are irrelevant to the construction, the seemingly $|\mathcal{S}|^{|\mathcal{X}|-1}$-dimensional junk states are actually encodable into an $|\mathcal{S}|$-dimensional space. The evolution operator $U$ then acts on this $|\mathcal{S}|^2|\mathcal{X}||\mathcal{Y}|$-dimensional joint memory-input-output-junk space.
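The overlap equations can be solved numerically by simple fixed-point iteration. The sketch below uses a small hypothetical strategy (transition structure and probabilities invented for illustration), assuming a factorised form $c_{ss'} = \prod_x \sum_y \sqrt{P(y|x,s)P(y|x,s')}\,c_{\lambda(z,s)\lambda(z,s')}$ as our reading of the recursion; the exact form follows the text's Eq. (7):

```python
import itertools
import numpy as np

# Hypothetical toy strategy, invented for illustration.
states, stimuli, actions = [0, 1], [0, 1], [0, 1]
P = {0: {0: [0.9, 0.1], 1: [0.5, 0.5]},
     1: {0: [0.6, 0.4], 1: [0.5, 0.5]}}    # P[s][x][y]
lam = lambda x, y, s: y                     # Markovian update: next state is the action

def solve_overlaps(tol=1e-12, max_iter=10_000):
    """Fixed-point iteration for c_{ss'} = prod_x sum_y sqrt(P P') c_{lam lam'}."""
    c = {(s, t): 1.0 for s in states for t in states}
    for _ in range(max_iter):
        c_new = {}
        for s, t in itertools.product(states, repeat=2):
            prod = 1.0
            for x in stimuli:
                prod *= sum(np.sqrt(P[s][x][y] * P[t][x][y]) * c[(lam(x, y, s), lam(x, y, t))]
                            for y in actions)
            c_new[(s, t)] = prod
        if max(abs(c_new[k] - c[k]) for k in c) < tol:
            return c_new
        c = c_new
    return c

c = solve_overlaps()
assert abs(c[(0, 0)] - 1.0) < 1e-9        # each memory state has unit norm
assert abs(c[(0, 1)] - c[(1, 0)]) < 1e-9  # the overlap matrix is symmetric
assert 0.0 < c[(0, 1)] < 1.0              # distinct causal states: strict, nonzero overlap
```

For asymptotically synchronising processes the iteration converges, mirroring the unrolled limiting solution above.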

G: Proof of scaling advantage
To rigorously evaluate the memory costs of agents implementing coarse-grained strategies, we must first introduce some formal definitions. We provide definitions implicitly in terms of a single continuous parameter; the corresponding definitions for the case of coarse-graining multiple continuous parameters straightforwardly follow by nested application of the single-parameter definitions. We assume the continuous parameter to be of finite domain, and without loss of generality we can take this domain to be $[0, 1)$. We also explicitly consider binary coarse-grainings; the definitions and results readily generalise to arbitrary $d$-ary coarse-grainings.

Definition 3: (Binary coarse-graining) An $n$-bit precision coarse-graining of a continuous parameter $\tau$ divides its domain into $2^n$ bins of equal width $\delta\tau^{(n)} = 2^{-n}$. An $n$-bit precision coarse-graining $P^{(n)}$ of a strategy $P$ with respect to a continuous parameter $\tau$ groups together all values of $\tau$ within each bin into a single memory state.
A continuous parameter over the domain $[0, 1)$ can be (asymptotically) represented as a binary fraction, i.e., $\tau = \sum_{k=1}^{\infty} \tau_k 2^{-k}$, where $\tau_k \in \{0, 1\}$. Correspondingly, an $n$-bit precision coarse-graining of $\tau$, denoted by $\tau^{(n)}$, stores only the first $n$ bits of this expansion, i.e., $\tau^{(n)} = \sum_{k=1}^{n} \tau_k 2^{-k}$. This also provides a convenient representation for indexing the discretised bin, where the same truncated binary expansion prescribes a unique integer $2^n\tau^{(n)} = \sum_{k=1}^{n} \tau_k 2^{n-k}$. Analogous to how the index of a causal state denotes both the label of a memory state and an equivalence class of pasts, we use $\tau^{(n)}$ to denote both the label of the bin and the interval it spans, with the distinction clear from context. Thus, the notation $\tau \in \tau^{(n)}$ indicates $\tau \in [\tau^{(n)}, \tau^{(n)} + \delta\tau^{(n)})$. For $n > n'$, we also use the notation $\tau^{(n)} \in \tau^{(n')}$ to indicate the set of all possible $n$-bit precision coarse-grainings of a $\tau \in \tau^{(n')}$.
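As a quick illustration of this notation (the numerical values are arbitrary), an $n$-bit coarse-graining is simply truncation of the binary expansion:

```python
def coarse_grain(tau, n):
    """n-bit coarse-graining of tau in [0, 1): returns (integer bin label, bin left edge)."""
    label = int(tau * 2**n)      # the unique integer 2^n * tau^(n)
    left = label / 2**n          # tau^(n), the truncated binary fraction
    return label, left

label, left = coarse_grain(0.71, 3)    # bins of width 1/8
assert label == 5                       # 0.71 lies in the bin [5/8, 6/8)
assert left == 0.625
# refining the grid nests the bins: the 4-bit bin sits inside the 3-bit bin
label4, left4 = coarse_grain(0.71, 4)
assert left <= left4 < left + 2**-3
```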
In this manner we can construct a family of coarsegrainings of a strategy at each level of precision {P (n) }, where n ∈ N. It is implicitly assumed that such a family should converge upon the behaviour of the exact strategy in the infinite precision limit.
When we say that an agent executes a strategy with $n$-bit precision, we mean that it has a valid encoding of an $n$-bit precision coarse-graining of the strategy. Like these coarse-grainings, we can similarly define families of agents that implement families of coarse-grainings. We denote the $n$-bit precision coarse-grained (quantum) memory states, corresponding to the states stored by the agent executing the $n$-bit precision coarse-grained strategy, as $|\sigma^{(n)}_{\tau^{(n)}}\rangle$ for all $\tau \in \tau^{(n)}$. Correspondingly, we denote the overlaps of these states as $c^{(n)}_{\tau^{(n)}\tau'^{(n)}}$.

For notational convenience in the following definitions and proof we will use the notation $P^{(n')}(\tau^{(n)})$ for $n > n'$, which should be interpreted as $P^{(n')}(\tau^{(n')})$ for all $\tau^{(n)} \in \tau^{(n')}$. That is, when the argument to a coarse-grained probability is of higher precision than the distribution, the argument should be further coarse-grained to match the precision of the probability. An analogous interpretation should be made for the coarse-grained memory states and their overlaps, i.e., for $n > n'$, $c^{(n')}_{\tau^{(n)}\tau'^{(n)}}$ should be interpreted as $c^{(n')}_{\tau^{(n')}\tau'^{(n')}}$.

With this preamble, the memory state convergence conditions can now be formally stated.

Definition 4: (Distributional convergence) A family of coarse-grained strategies $\{P^{(n)}\}$ is said to exhibit distributional convergence if for all possible input strategies $R$ there exists an $n_0$ and constant $K$ such that for all $n > n_0$ the steady-states satisfy $|P^{(n)}(\tau^{(n)})/\delta\tau^{(n)} - P^{(n-1)}(\tau^{(n)})/\delta\tau^{(n-1)}| < K\delta\tau^{(n)}\ \forall\tau^{(n)}$.
A weaker version of this definition can be formulated, where the distributional convergence can be only with respect to a particular input strategy. If only this weaker form is satisfied, then Theorem 4 can be restated in an input strategy-dependent manner. We also note that distributional convergence implies that P (n) (τ (n) ) ∼ δτ (n) .
Armed with these definitions, we are now in a position to formally state and prove the result given in the main text regarding bounded memory costs for quantum agents executing coarse-grained strategies.

Theorem 4: Consider a strategy $P$ that has a valid encoding using memory states labelled by a finite number of continuous parameters of finite domain and a finite set of discrete parameters. A quantum adaptive agent can execute a coarse-graining of the strategy to arbitrary precision with bounded memory cost if distributional and memory-overlap convergence are satisfied.
We first prove this for the case where the memory states are labelled by a single continuous parameter, after which we will extend to the general case. Lemma 4: Consider a strategy P that has a valid encoding using memory states labelled by a single continuous parameter of finite domain. A quantum adaptive agent can execute a coarse-graining of the strategy to arbitrary precision with bounded memory cost if distributional and memory-overlap convergence are satisfied.
Consider such a quantum encoding at $n$-bit precision, where $n$ is sufficiently large that we are above the $n_0$ required for the convergence conditions. The steady-state of the quantum agent's memory is given by $\rho^{(n)} = \sum_{\tau^{(n)}} P^{(n)}(\tau^{(n)})\,|\sigma^{(n)}_{\tau^{(n)}}\rangle\langle\sigma^{(n)}_{\tau^{(n)}}|$. The Gram matrix [68] of $\rho^{(n)}$ has elements $G^{(n)}_{\tau^{(n)}\tau'^{(n)}} = \sqrt{P^{(n)}(\tau^{(n)})P^{(n)}(\tau'^{(n)})}\,c^{(n)}_{\tau^{(n)}\tau'^{(n)}}$, and has the same spectrum (and hence von Neumann entropy) as $\rho^{(n)}$. We also define a dilated Gram matrix $\bar{G}^{(n-1)}$ on the $2^n$-dimensional space, whose elements spread those of $G^{(n-1)}$ uniformly across the corresponding refined bins. From the properties of the tensor product, it follows that (the non-zero elements of) the spectra of $G^{(n-1)}$ and $\bar{G}^{(n-1)}$ are identical, and thus they have the same von Neumann entropy.
The Schatten $p$-norms of a matrix $A$ are defined as $\|A\|_p := \mathrm{Tr}(|A|^p)^{1/p}$ for $p \in [1, \infty)$ [72]. They satisfy Hölder's inequality, whereby $\|AB\|_1 \leq \|A\|_p\|B\|_q$ for $1/p + 1/q = 1$. Two special cases of relevance here are $p = 1$, also referred to as the trace norm, and $p = 2$, which is equivalent to the Frobenius norm $\|A\|_F := \sqrt{\sum_{jk}|A_{jk}|^2} = \|A\|_2$. Noting that $\Delta^{(n)} := G^{(n)} - \bar{G}^{(n-1)}$ has $2^n \times 2^n$ elements, each of magnitude $O(2^{-2n})$ by the convergence conditions, we have that $\|\Delta^{(n)}\|_2 \sim 2^{-n}$. Then, by applying Hölder's inequality with $p = q = 2$, $A = \Delta^{(n)}$, and $B$ the identity matrix over the space occupied by $\Delta^{(n)}$, we have that $\|\Delta^{(n)}\|_1 \sim 2^{-n/2}$. The Fannes-Audenaert inequality [72] relates the difference in von Neumann entropies of two operators to the trace norm of their difference: for two operators $\rho_A$ and $\rho_B$ of dimension $d$, with $T := \frac{1}{2}\|\rho_A - \rho_B\|_1$, it states that $|S(\rho_A) - S(\rho_B)| \leq T\log_2(d-1) + H_{\mathrm{bin}}(T)$, where $H_{\mathrm{bin}}$ is the binary entropy. Setting $\rho_A = G^{(n)}$ and $\rho_B = \bar{G}^{(n-1)}$, together with the above we arrive at an entropy difference that scales at most as $O(n2^{-n/2})$. Thus, beyond a sufficiently high precision, the increase in the quantum agent's memory cost for each extra degree of precision is exponentially decreasing. Correspondingly, the memory cost will converge when the precision is increased an arbitrary number of times, leading to a bounded memory cost at any level of precision.

When we have an additional set of discrete parameters $m \in \mathcal{M}$ labelling the memory states, such that the pair $(\tau, m)$ uniquely specifies the memory state, we effectively have a finite number of sectors for the memory state space, with each sector corresponding to a different $m$. The state convergence conditions readily generalise to this regime by imposing the conditions on each sector individually. Then, by applying the above arguments in the proof of Lemma 4 to each sector, we see that the total contribution to the memory cost from the memory states in each sector is bounded. Since there are a finite number of sectors, the total memory cost is thus bounded.
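The convergence mechanism can be illustrated numerically. The sketch below is a toy model, not the text's construction: it assumes a uniform steady state over the $2^n$ bins and smoothly decaying overlaps $c_{\tau\tau'} = e^{-|\tau-\tau'|}$, then computes the entropy of the resulting Gram matrix at successive precisions:

```python
import numpy as np

def gram_entropy(n):
    """Entropy (bits) of the Gram matrix of 2^n coarse-grained memory states.

    Toy model (an assumption for illustration): uniform steady state over bins,
    overlaps decaying smoothly as exp(-|tau - tau'|).
    """
    N = 2**n
    tau = (np.arange(N) + 0.5) / N
    overlaps = np.exp(-np.abs(tau[:, None] - tau[None, :]))
    G = overlaps / N                  # G_jk = sqrt(p_j p_k) c_jk with p_j = 1/N; trace 1
    evals = np.linalg.eigvalsh(G)
    evals = evals[evals > 1e-14]
    return float(-np.sum(evals * np.log2(evals)))

ents = [gram_entropy(n) for n in range(2, 9)]
increments = np.diff(ents)
# the marginal memory cost of an extra bit of precision shrinks...
assert increments[-1] < increments[0]
# ...and the total cost stays far below the classical log2(2^8) = 8 bits
assert 0 < ents[-1] < 8
```

The marginal entropy cost of each extra bit of precision shrinks rapidly, in line with the bounded-memory conclusion of Lemma 4, while a memory-minimal classical agent would pay a full extra bit per level.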
When there are multiple continuous parameters, the conditions on convergence must apply to all such parameters. Beginning from a sufficiently fine discretisation of all continuous parameters, we can apply the arguments above to each continuous parameter in turn, to deduce that the memory cost remains bounded at arbitrary precision in all continuous parameters. This completes the proof of Theorem 4.
Finally, we remark that while we have assumed finite, discrete stimulus and action alphabets in the above, the definitions and proofs readily extend to the case where these also are continuous parameters of finite domain.

H: Details for resettable stochastic clocks
A renewal process [39] is described by a series of identical events, where the time interval between consecutive events is drawn randomly from a distribution $\phi(t)$; here we focus on the case where this is discretised into timesteps of size $\delta t$. A resettable renewal process can accept input stimuli that trigger a 'reset' of the system to its post-event state, in effect triggering a phantom event and restarting the timer to the next event. We can describe the input stimulus by a two-symbol alphabet: 0 (continue) and 1 (reset). Similarly, the output action alphabet can be described by two symbols: 0 for non-events and 1 for events. $\Phi(t) := \int_t^{\infty} \phi(t')\,dt'$ (and the discrete analogue thereof) represents the so-called survival probability of the process. Such resettable renewal processes correspond to the strategy of resettable stochastic clocks.
It is clear that since the agent will always behave the same on stimulus 1, the grouping of pasts into causal states depends only on their response to stimulus 0. This recovers the vanilla renewal process case, and we obtain the same causal states as in such settings [15,19,36,73]: outside of specific forms of $\phi(t)$ (which we shall ignore here, noting that the following analysis can straightforwardly be generalised to encompass them), the causal states $s_n$ of a renewal process describe the number of timesteps $n$ since the last event (in our case, this also includes the phantom events from resets).
The steady-state distribution of the causal states can be readily calculated for any resettable renewal process where the input stimuli are themselves driven by an input renewal process that resets upon events from either process. Label the event distribution and survival probability of the input process as $\phi_I(t)$ and $\Phi_I(t)$ respectively, and similarly $\phi_O(t)$ and $\Phi_O(t)$ for the strategy renewal process. For a pure renewal process without resettability, the steady-state distribution is given by $\mu\Phi(t)$, where the normalisation $\mu := (\int_0^{\infty} \Phi(t)\,dt)^{-1}$ (replace integrals with sums for the discrete-time case) is called the mean firing rate, and represents the average number of events per unit time/timestep [15,36,73].
Since both processes are reset by events on the strategy process, we can view the pure output action process without reference to the input as a renewal process in its own right, with an effective event distribution that is a function of both stimulus and action event distributions. The effective survival probability is the product of the survival probabilities, as the pure output process will only survive up to a given time provided that neither the underlying renewal process nor the input renewal process has fired. Thus, the steady-state probabilities will be proportional to $\Phi_I(t)\Phi_O(t)$, normalised by their sum/integral, which yields the effective mean firing rate. With these probabilities, the (input-dependent) minimal classical memory cost can be straightforwardly calculated.
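The steady-state construction can be sketched numerically. The distributions below are hypothetical (geometric waiting times, chosen purely for illustration); the survival probabilities and effective steady state follow the recipe above:

```python
import numpy as np

# Illustrative discrete-time waiting distributions (assumptions, not from the text):
# both the clock and the input fire with geometric waiting times.
T = 200                                # truncation horizon for the discrete sums
t = np.arange(T)
phi_O = 0.10 * (1 - 0.10) ** t         # clock event each timestep w.p. 0.1
phi_I = 0.05 * (1 - 0.05) ** t         # reset stimulus each timestep w.p. 0.05

def survival(phi):
    """Discrete survival probability Phi(t) = sum_{t' >= t} phi(t')."""
    return phi[::-1].cumsum()[::-1]

Phi_O, Phi_I = survival(phi_O), survival(phi_I)
steady = Phi_I * Phi_O                 # steady state proportional to Phi_I(t) Phi_O(t)
steady /= steady.sum()                 # normalisation = effective mean firing rate

classical_bits = -np.sum(steady * np.log2(steady))
assert abs(steady.sum() - 1.0) < 1e-9
assert np.all(np.diff(steady) < 0)     # geometric survival: monotonically decaying
assert classical_bits > 0
```

`classical_bits` is the Shannon entropy of the causal-state occupation, i.e., the input-dependent minimal classical memory cost referred to above.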
To determine the corresponding memory measure for our quantum agent we must also calculate the memory state overlaps. Using Eq. (7), and noting that all causal states behave identically on input 1, we obtain a set of iterative equations whose solution yields the overlaps; these overlaps saturate the fidelity bound Eq. (6). Together with the steady-state probabilities, we can then calculate the input-dependent memory cost of our agent.

We also compare with the prior proof-of-principle quantum agent [11]. To determine the overlaps of its memory states $\{|S_s\rangle := \bigotimes_x |S^x_s\rangle\}$ we recast this agent in terms of our general form Eq. (3): $U_{q1}|S_s\rangle|x\rangle|0\rangle|0\rangle = \sum_y \sqrt{P(y|x,s)}\,|S_{\lambda(z,s)}\rangle|x\rangle|y\rangle \bigotimes_{x' \neq x} |S^{x'}_s\rangle|\lambda(z,s)\rangle$.
We note that this formulation in terms of unitary evolution differs from the original presentation, though it yields identical memory states. Expressed in this way, it is clear where the deficiency of this agent relative to ours lies: in announcing the next causal state in the junk. The corresponding overlaps between the memory states for resettable stochastic clocks are given by Eq. (28). With these overlaps, we can calculate the memory requirement of the agent. It can be seen that the overlaps rely not only on overlap in the output action statistics, but also on the immediately subsequent causal state into which the system transitions being the same. In contrast, our agent relies only on the overlap of output action statistics (over arbitrarily long horizons), which, due to asymptotic synchronisation to a causal state over sufficiently long pasts, mandates transition into the same causal state only arbitrarily far into the future. It is for this reason that we label the quantum agents with subscripts 1 and $\infty$, and it becomes clear why our new agent drastically outperforms the prior agent for processes with long historical dependence.
Indeed, the example presented in the main text is not an isolated case of the scaling advantage for a particular resettable stochastic clock. Theorem 4 provides us with a sufficiency condition against which we can verify that a typical resettable stochastic clock with a smooth distribution $\phi(t)$ requires only a bounded amount of memory to execute when driven by a smooth renewal process.

Corollary 1: Consider a resettable stochastic clock with distribution $\phi_O(t)$ that is either of finite domain or takes the form of a Poisson process at long times. Suppose that it is driven by a renewal process with distribution $\phi_I(t)$ that resets upon clock ticks. If $\Phi_O(t)$ and $\Phi_I(t)$ are infinitely differentiable, then a quantum agent encoded using Algorithm 1 can execute the strategy to arbitrary precision with only a bounded memory cost.
The steady-state probabilities are given by $\mu_{IO}\Phi_I(t)\Phi_O(t)$, where $\mu_{IO}$ is the effective mean firing rate. For sufficiently large $n$, it follows that $\mu_{IO} \sim \delta t^{(n)}$. It can then readily be verified that distributional convergence is satisfied.
Hence, the convergence conditions of Theorem 4 are satisfied, and it therefore follows that the memory cost remains bounded, irrespective of the precision.
More generally, when $\phi_O(t)$ takes the form of a Poisson process at long times (say, for $t > \tau_0$), we can partition the memory states into two sectors, according to whether $t$ is below or above $\tau_0$. For the former, we can apply the above arguments to show that in this sector the convergence conditions are satisfied. Meanwhile, we can apply known properties of the causal states of renewal processes [36] to see that all memory states in the latter sector belong to the same causal state, and hence Algorithm 1 maps them all to identical states. Thus, the convergence conditions are also satisfied in this sector, and hence Theorem 4 can again be applied.