Quantum speedup for active learning agents

Can quantum mechanics help us in building intelligent robots and agents? One of the defining characteristics of intelligent behavior is the capacity to learn from experience. However, a major bottleneck for agents to learn in any real-life situation is the size and complexity of the corresponding task environment. Owing to, e.g., a large space of possible strategies, learning is typically slow. Even for a moderate task environment, it may simply take too long to rationally respond to a given situation. If the environment is impatient, allowing only a certain time for a response, an agent may then be unable to cope with the situation and to learn at all. Here we show that quantum physics can help and provide a significant speed-up for active learning as a genuine problem of artificial intelligence. We introduce a large class of quantum learning agents for which we show a quadratic boost in their active learning efficiency over their classical analogues. This result will be particularly relevant for applications involving complex task environments.


I. INTRODUCTION
The discovery that the laws of quantum physics can be employed for new and dramatically enhanced ways of information processing has revolutionized research in physics and theoretical computer science [1][2][3][4][5][6], and established quantum information science as a new interdisciplinary field of research.
However, advances employing quantum mechanics in the field of artificial intelligence (AI) have so far been modest. The theory of quantum information has only been employed, with varying degrees of success, for specific algorithmic AI tasks such as data clustering, pattern matching, binary classification, and the like [7][8][9][10][11][12], while the problem of general artificial intelligence, that is, the problem of designing autonomous intelligent agents enhanced through the laws of quantum mechanics, has up to now not been addressed.
Part of the reason for this is that the existing results in the field of quantum computing cannot be straightforwardly exploited for such tasks, because a computer is not the same as an agent. The framework of embodied cognitive sciences [13][14][15][16][17][18] promotes a behavior-based approach to intelligence which puts a strong emphasis on the physical aspects of an agent. It regards an agent as a physically embodied entity, with respect to both its internal constitution and its interactions with its environment, and studies the role and implications of this embodiment for learning and intelligence [15].
In this paper, we engage the embodied-agent perspective and treat as fundamental the observation that an agent is an embodied entity, one that is ultimately to be described by the laws of physics. Physics not only provides constraints on any agent architecture; it may also open entirely new avenues and tell us what is possible in principle, in a positive sense, particularly when we take the laws of quantum mechanics into account.

* The first two authors have contributed equally to this work.
We introduce a class of quantum agents all of which are active learning agents: they operate in an unknown environment which they can actively explore, and they have access to a certain type of quantum memory that helps them to process previous experience. The quantum memory we are describing will involve quantum Markov chains over graphs that encode percept-actionreward correlations in their previous experience. These quantum models can be seen as quantum analogues of certain classical learning models based on classical Markov chains. A particular instance of the latter is the model of projective simulation (PS) introduced in [19]. While we discuss most of our work in the framework of the PS model, the results will be more general.
The notion of active learning is well motivated both from a biological [20,21] and a contemporary educational perspective [22], where animal or human agents actively explore their environment. In the case of static environments, active agents can, e.g., control the frequency with which they intervene with the environment and thus expose themselves to new perceptual input. Conversely, active agents should also be able to handle dynamic environments, which change and can impose variable interaction tempos.
Modern examples of active scenarios are the so-called real-time computer games, as opposed to the static turn-based games. As a main result of our paper we show that, in such an active learning context, the learning efficiency of a large class of agents, whose program is based on Markov chains, can be quadratically enhanced by our quantization procedure.

FIG. 1. An agent is always situated in an environment. It is equipped with sensors, through which it receives percepts from the environment, and with actuators, through which it can act on the environment. Based on perceptual input, the agent will, after some internal processing, engage its actuators and output an action. Adapted and modified from [23].

II. LEARNING AGENTS AND QUANTUM PHYSICS
Embodied agents are best thought of as robots or biological systems, including animals and humans, but they also include so-called internet robots [13,14,16,17]. They are entities situated in an environment with which they communicate. Viewed externally, an agent can receive a sensory input (percept) from the environment and, based on this, produce an action, see Fig. 1.
However, intelligent behavior demands more: the agent must learn from its experiences. A learning agent also receives feedback from the environment in the form of rewards (which can be seen as special types of percepts), given when the produced action, for a given percept, was right. The agent commits its experiences to its memory and uses the acquired knowledge in the subsequent steps of interaction. This is formalized in the framework of reinforcement learning agents [23,24].
A reinforcement learning agent is, at every instance, in some internal state which comprises its memory and reflects the agent's experience, that is, elapsed sequences of percepts, actions and rewards. Each percept-action-reward sequence constitutes an external time-step (or cycle) of the activity of the agent. Formally, we can distinguish two characterizing elements of such agents. The decision function D defines which action will be output given the current internal state and the received percept; this function defines the policy of the agent (as given, e.g., in [24]) at each time-step. The learning strategy of the agent is characterized by an update function U, which defines how the internal state changes, based on the percept-action pair and the ensuing reward. A successful agent will, over time, increase its reward frequency.
The agent's learning process is reminiscent of computational oracle query models, in which an unknown oracle (the environment) is queried (via an action) by the agent, in an iterative quest for the best responses. It is tantalizing to consider employing the powerful quantum searching machinery [3,25,26], which has been proven to outperform classical algorithms in computational settings, in an attempt to improve the agent. Here, one may think of simply applying quantum analogs of the internal decision and update functions, sequentially performing quantum searches for the optimal action, and thereby achieving better-than-classical performance.
However, contrary to computer algorithms, an embodied agent, such as a robot, operates in a physical environment which is, for most existing applications, classical 3 . This prohibits querying in quantum superposition, a central ingredient to all quantum search algorithms. Thus, naïve approaches to quantizing learning agents are doomed to fail 4 .
Nonetheless, while the physical nature of the agent and the environment prohibits speed-up through quantum queries of the environment, the physical processes within the agent, which lead to the performed actions, can be significantly improved by employing full quantum mechanics 5 . In particular, the time the agent requires for these processes (the internal time) can be sped up.
In active learning settings, this alone will constitute an overall improvement of performance, for instance when the environment imposes its own required interaction pace, or when the environment simply changes on time-scales not overwhelmingly larger than the agent's internal 'thinking' time. In the latter case the quantum agents may even develop behavioral patterns different from their classical counterparts, which we will comment on later.

3 Learning agents that operate in a quantum environment are conceivable and not without interest. They may have interesting applications in future quantum physics laboratories, but we will not consider such a scenario in this paper.
4 Even if we were to allow superpositions of actions, the amount of control the agent must have over the degrees of freedom of the environment, in order to apply quantum query algorithms, may be prohibitive. This constitutes one of the fundamental operative distinctions between quantum algorithms, where full control is assumed, and quantum agents, where such control is limited.
5 In embodied agents, these physical processes will realize some internal model of the environment, which the agent itself has to develop as it interacts with the environment. For example, in the context of artificial neural networks such internal models are known as self-organizing maps and, more specifically, sensorimotor maps [27,28].
For formal definitions of reinforcement learning agents, we refer the reader to the Appendix, section VI A and proceed by further formalizing the behavioral characteristics of agents.

A. Behavioral equivalence and active time performance comparison
In the algorithmic tradition of machine learning, the learning pace is measured by external time (steps) alone, and the typical figure of merit is the percentage of rewarded actions of the agent, as a function of external time. From an embodied agent perspective, this corresponds to a special passive setting where a static environment always waits for the responses of the agent. This constraint imposes a restriction on the universality of statements which can be made about the performance of an agent at all. In particular, in that setting it is well known that no two agents can be meaningfully compared without reference to a specific learning task (or a class of tasks), a collection of results dubbed 'no free lunch theorems' and 'almost no free lunch theorems' [29,30]. These results prove that when one agent outperforms another in a certain environment, there exists a different environment where the ranking according to performance is reversed. This, for instance, implies that every choice of environment settings for which results on agent performance are presented must first be well justified. More critically for our agenda, in which we wish to make no assumptions on the environment, in passive settings no comparative statements relating the performances of agents would be possible at all.
Here, we consider active scenarios where internal time does matter, but nonetheless the passive setting plays a part. It provides a baseline for defining a passive behavioral equivalence of agents, which will be instrumental in our analysis of active scenarios.
Let us denote the elapsed sequence of (percept, action, reward) triplets which has occurred up to time-step k (the history of the agent) by H_k, for two agents A and A′ which can perceive and produce the same sets of percepts (S) and actions (A), respectively. Then we will say that A and A′ are passively (ε-)equal if, at every external time step k, the probabilities P_A and P_A′ of agents A and A′, respectively, outputting some action a ∈ A, given every percept s ∈ S, and given all possible identical histories H_k, are (ε-)equal in terms of the variational distance on distributions:

(1/2) Σ_{a∈A} | P_A(a|s, H_k) − P_A′(a|s, H_k) | ≤ ε,   (1)

which we abbreviate with

A ≈_ε A′.   (2)

The relation above induces passive behavioral equivalence classes for fixed sets of possible percepts and actions. In passive settings, by definition, two passively equal agents perform equally well, and comparison of agents within a class is pointless. However, in the active scenario, with the classes in place, we can meaningfully compare agents within the same class, with no assumptions on the environment 7. Indeed, in an active learning setting, two passively equal agents A and A′ may have vastly different success chances. To see this, suppose that the environment changes its policies on a timescale that is long compared to the internal timescale of agent A, but short relative to the internally slower agent A′. The best policy of both agents is to query the environment as frequently as possible, in order to learn the best possible actions. However, from the perspective of the slow agent, the environment will look fully inconsistent: once-rewarded actions are no longer the right choice, as that agent simply did not have the time to learn. In active scenarios, the internal speed of the agent is vital, which is a property we shall exploit further.
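For concreteness, the passive (ε-)equality criterion based on the variational distance can be sketched in a few lines of code; the agent policies below are illustrative stand-ins, not taken from the paper:

```python
import numpy as np

def variational_distance(p, q):
    """Variational distance (1/2) * sum_a |p(a) - q(a)| between two
    distributions over the same action set."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return 0.5 * np.abs(p - q).sum()

def passively_equal(policy_a, policy_b, eps):
    """Two agents are passively eps-equal if, for every (percept, history)
    pair, their action distributions are within eps in variational distance.
    Policies are dicts mapping (percept, history) -> distribution."""
    return all(variational_distance(policy_a[k], policy_b[k]) <= eps
               for k in policy_a)

# Stand-in policies over three actions for one (percept, history) pair.
pa = {("s1", ()): [0.5, 0.3, 0.2]}
pb = {("s1", ()): [0.45, 0.35, 0.2]}
assert passively_equal(pa, pb, eps=0.05)
assert not passively_equal(pa, pb, eps=0.01)
```

Here the distance between the two stand-in distributions is 0.05, so the agents are passively 0.05-equal but not 0.01-equal.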

B. How fast is my embodied agent
As we have mentioned earlier, the behavior of an agent, once the space of percepts it can perceive and the actions it can perform is fixed, is characterized by its decision function D and the update function U. It will be beneficial to see this in the light of the behavioral equivalence classes we introduced above. In particular, it is straightforward to see that if two agents realize identical update functions and similar decision functions, then they will be behaviorally similar as well 8. In fact, this characterization will give us a simple means of establishing behavioral equivalence between classical agents and their quantum counterparts. To establish that, given two agents in the same equivalence class, one is actively faster than the other, one must show that the internal times of evaluating the decision and the update functions favor one of the two. Given that two agents may be implemented in significantly different physical systems 9, properly gauging speed may be difficult 10. The standard approach to addressing this issue is to define primitive processes, like 'gates' or 'Turing machine steps' 11, and, as we describe next, in our case these will effectively be evolutions of Markov chains.

7 That is, no further assumptions beyond the trivial ones, namely that the percept and action sets are compatible with the environment and that the environment provides a rewarding scheme.
8 The converse does not hold, however. In principle, significantly different agents, with differing sets of internal states and differing update and decision functions, can nonetheless be passively behaviorally equivalent.
The AI model for the agents we consider in the following is the so-called Projective Simulation (PS) model [19]. Here, the agent assigns to percepts and actions internal representations, i.e. memories, of percepts and actions, and it can also represent sequences of such memories. We will refer to these objects as clips [19], and we envision the implementations of clips as excitations of physical systems. The elementary internal processes of the agent, which implement the transitions from clip to clip internally, are discrete-time stochastic diffusion processes. In particular, the internal states of the agent are represented by weighted directed graphs (digraphs) over the clips, and decision making involves diffusing, as dictated by the digraphs, a sufficient number of times. Mathematically, these diffusion processes are described by Markov chains (MCs), the transition probabilities by the transition matrix of the MC, and they are standardly formulated in terms of random walks over weighted directed graphs. We will inherit the notion of quantizing these diffusion processes (and hence the agents themselves) from the theory of quantum random walks on directed graphs [25,26,34]. Consequently, in our setting, the primitive processes will be the steps of classical and quantum random walks, for the classical and quantum agents, respectively. In the following section we expose the PS model for the reader's benefit, and then give the main results of this paper. For more details on the theory of discrete-time classical and quantum walks see the Appendix, section VI B.

9 This also implies the natural laws which govern the dynamics of the systems which the agent uses to realize the decision and update functions.
10 In fact, for some pairs of implementations, like the biological brain versus a classical computer running an algorithm, even showing behavioral equivalence may be difficult.
11 In general, the internal process of the agent may be inherently time-continuous. An example where similar problems were successfully mitigated is that of adiabatic quantum computation [32,33].

III. QUANTIZATION OF LEARNING AGENTS BASED ON PROJECTIVE SIMULATION
A. The PS agent model

The PS model is based on a specific memory system, which is called episodic and compositional memory (ECM). This memory provides the platform for simulating future action before real action is taken. The ECM can be described as a stochastic network of the previously mentioned clips, which constitute the elementary excitations of episodic memory. In the following, we formalize the PS model in terms which are slightly more general than given in [19]; this will simplify the construction of the desired quantum PS agents later. For the original construction, and the explicit description of the generalization, we refer the reader to the Appendix, section VI C.
In formal terms, the general PS model comprises the spaces of percepts S = {s_1, s_2, . . .} and actions A = {a_1, a_2, . . .}, which constitute the space of environmental stimuli the agent can perceive and the space of actions it can effect, respectively. The basic actions and percepts, along with sequences thereof, are represented within an agent as clips c, and the set of these comprises the clip space C = {c = (c^(1), c^(2), . . .) | c^(i) ∈ S ∪ A}.
In this work, we focus on PS agents in which clips are unit-length sequences, representing a memorized percept or an action, which we denote using the same symbols, so C = S ∪ A. Each agent is equipped with input and output couplers, which translate, through sensors and actuators (see Fig. 1), real percepts into the internal representations of percepts, and internal representations of actions into real actions, respectively. The internal states of the agent, i.e. the total memory, comprise weighted graphs G_s over a subset of the clip space, which are assigned to each percept s. Such a graph dictates the hopping probabilities from one clip to another, and thus defines a Markov process, i.e., an MC. The hopping probabilities themselves are encoded in the so-called h-matrix, defining un-normalized relative hopping probabilities; this matrix encodes G_s for every s. Each agent can also perceive rewards Λ = {0, 1} and, based on the received percept s, the realized action a, and the resulting reward, the h-matrix is updated via simple rules, which, in turn, incurs a change in the transition probabilities. For a more detailed description of the PS model, and an example of previously studied update rules, see the Appendix, section VI C.
In the process of deliberation (which realizes the agent's decision function D), the relevant Markov chain is diffused a particular number of times (depending on the particular PS model) until an action is output. The choice of the action -and thereby the policy of the agent -is dictated by the probability distribution over the actions, which is realized by the diffusion process. The latter depends on the agent's experience manifest in the matrix h. This distribution represents the agent's state of belief on what is the right action in the given situation. Conceptually, projective simulation thus consists of a random walk through episodic memory -a stochastic network of clips -which serves the agent to re-invoke past experience and to compose fictitious experience before real action is taken. Learning is then achieved by updating and reconfiguring episodic memory under environmental feedback.
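The deliberation mechanics described above can be illustrated with a minimal sketch: a hypothetical h-matrix is normalized into a transition matrix, and the resulting Markov chain is diffused from a percept clip. All numbers below are stand-ins, not values from the paper:

```python
import numpy as np

# Un-normalized h-matrix over 4 clips (illustrative numbers only);
# h[j, i] is the relative weight of hopping from clip i to clip j.
h = np.array([[1., 1., 1., 1.],
              [1., 2., 1., 1.],
              [3., 1., 1., 1.],
              [1., 1., 1., 1.]])

def transition_matrix(h):
    """Normalize the h-matrix column-wise: P[j, i] is the hopping
    probability from clip i to clip j, proportional to h[j, i]."""
    return h / h.sum(axis=0, keepdims=True)

def diffuse(P, pi, t=1):
    """t steps of the Markov chain: pi -> P^t pi."""
    return np.linalg.matrix_power(P, t) @ pi

P = transition_matrix(h)
pi0 = np.array([1., 0., 0., 0.])   # deliberation starts at a percept clip
pi1 = diffuse(P, pi0, t=3)
assert np.isclose(pi1.sum(), 1.0)  # remains a probability distribution
```

Sampling an action from the resulting distribution, after a model-dependent number of diffusion steps, corresponds to the decision function D; a reward update would then modify h and hence P.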
Aside from the basic structural and diffusion rules, the PS model allows for additional structures, some of which we briefly address here. 1) Emoticons: a form of short-term memory, i.e., flags which notify the agent whether the currently found action, given a percept, was previously rewarded or not. For our purposes, we shall use only a rudimentary mode of flags, which designate that the particular action (given a particular percept) has not already been unsuccessfully tried before. If it has, the agent can 'reflect on its decision' and re-evaluate its strategies by re-starting the diffusion process. We make this more formal presently. 2) Edge and clip glow: mechanisms which allow for establishing additional temporal correlations. 3) Clip composition: the PS model based on episodic memory allows for the creation of new clips under certain variational and compositional principles. These enable the agent to develop new behavioral patterns under certain conditions, and allow for a dynamic reconfiguration of the agent itself. For more information on these additional structures, we refer the reader to [19,35]. In this work we will focus on the bare PS model with the most rudimentary emoticon structure, but note that most of the additional structures are compatible with the quantization procedure we explain next.
Next, we specify a particular, broad class of PS agents, called reflecting agents, for which we give a generic quantization procedure that yields a quadratic speed-up.

B. Reflecting PS agents
As we have previously mentioned, the diffusion process within the agent is the constitutive part of the projection (simulation of future action) before real action is taken. From an operational perspective, the same process realizes some distribution over the action space, from which the actual action is sampled. This distribution represents the agent's state of belief about the optimal response, given its accumulated knowledge.
Reflecting PS agents draw their name from the reflection process [19], in which the diffusion processes are repeated multiple times: in this model the agents approximate the complete mixing of their diffusion processes, simulating infinite deliberation times. While this may suggest that reflecting agents are comparatively slow (in terms of internal time), we will show in section IV that they are often on par with 'more eager' models. Concretely, the reflecting PS agent is a PS agent whose internal states (the memory) are, for each percept, irreducible, regular MCs over subsets of the clip space, such that all actions are included in the subsets. The additional structure we will consider consists of percept-specific flags (corresponding to rudimentary emoticons in [19]), which are simply subsets of actions assigned to each percept, formally F = {f(s) | f(s) ⊆ A, s ∈ S}. These constitute the agent's short-term memory.
For example, one can consider a possible mechanism in which for each percept, all actions are initially flagged. If the agent outputs some action a, given a percept s, and this action is not rewarded, a is removed from f (s). The meaning of flags may, however, in principle be more general.
Recall, the policy of the agent is defined by the decision function D. For ideal reflecting agents, given a percept s, and the relevant part of the internal state, that is, the corresponding Markov chain P_s, D outputs an action a distributed according to π̃_s, given as follows: let π_s be the stationary distribution of P_s, and let f(s) be the subset of flagged actions; then

π̃_s(a) = π_s(a) / Σ_{a′∈f(s)} π_s(a′)  if a ∈ f(s), and π̃_s(a) = 0 otherwise,   (3)

that is, the re-normalized stationary distribution π_s modified to have support only over flagged actions. We will often refer to π̃ as the tailed distribution. The reflecting agents also have an update function, which defines the agent's learning strategies by updating the internal states (including the flag assignments), but the reflecting agent model imposes no restrictions on the update functions. The abstractly given decision function above has a natural interpretation within the conceptual framework of PS: the process of projective simulation represents a stochastic replay of episodic memory, in particular of percept-to-action sequences, the transition probabilities of which are dictated by previous experiences. The (ideal) reflecting agent is defined as the limit in which the agent allows this process to run forever, until an equilibrium distribution over the clips is found. This distribution may have support over the entire clip space and, since only actions can be output, it must be truncated to the action space alone. In our case, due to the additional structure of flags, we truncate it further, to only those actions which are not known to have been incorrect in the recent past. For a full formal specification of the ideal reflecting agent we refer the reader to the Appendix, section VI D.
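The tailed distribution can be computed directly from a given chain by finding the stationary distribution and renormalizing it over the flagged actions; the chain and flag set below are illustrative stand-ins:

```python
import numpy as np

def stationary(P):
    """Stationary distribution of a regular column-stochastic chain P,
    i.e. the eigenvector of eigenvalue 1 (P[j, i] = prob of i -> j)."""
    vals, vecs = np.linalg.eig(P)
    v = np.abs(np.real(vecs[:, np.argmin(np.abs(vals - 1.0))]))
    return v / v.sum()

def tailed(pi, flagged):
    """Renormalize pi to have support only on the flagged actions."""
    out = np.zeros_like(pi)
    out[flagged] = pi[flagged]
    return out / out.sum()

# Stand-in 3-clip chain; say clips 1 and 2 are actions, only clip 2 flagged.
P = np.array([[0.2, 0.3, 0.5],
              [0.4, 0.4, 0.2],
              [0.4, 0.3, 0.3]])
pi = stationary(P)
pi_tailed = tailed(pi, flagged=[2])
assert np.isclose(pi_tailed[2], 1.0)   # all weight on the flagged action
assert np.isclose(pi.sum(), 1.0)
```

With a single flagged action the tailed distribution is deterministic; with several, their stationary weights are rescaled to sum to one, exactly as in Eq. (3).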
Any embodied agent will, however, have to stop the deliberation in some finite time, which constitutes the effective internal time of the agent, vital in active scenarios. As we have explained, the elementary processes of the agents we consider are diffusion processes governed by MCs. The classical reflecting agent will then be defined by a process realizing the decision function, which we give next. In the following we will be using standard results from the theory of classical and quantum random walks [25,26,34], and refer the reader to the Appendix, section VI B, for more details. Let, for a given percept s, δ_s be the spectral gap of the MC specified by its transition matrix P_s (defined as δ_s = 1 − |λ_2|, where λ_2 is the second largest eigenvalue of P_s in absolute value), and let ε_s ≥ Σ_{i∈f(s)} π_s(i) be the estimate of the upper bound of the probability of sampling a flagged action from the stationary distribution of P_s 14. The agent's decision-making process (implementing D), given percept s, comprises the following steps, with t_1 ∈ Õ(1/δ_s) and t_2 ∈ Õ(1/ε_s) 15:

1. Initialize: set π = P_s^{t_1} π_0, for some fixed distribution π_0.

2. For t_2 time steps do:
(a) Check: sample y from π. If y is a flagged action, break and output y.
(b) Diffuse: re-mix the Markov chain by setting π = P_s^{t_1} y.

3. If no flagged action is encountered by the diffusion processes above, output an action uniformly at random.

14 In the following we will use the label P_s of the transition matrix as a synonym for the MC itself.
15 In this paper we do not consider logarithmically contributing terms in the complexity analysis; thus we use the Õ-level analysis of the limiting behavior (instead of the standard 'big O'). We note that some of the logarithmically contributing terms which appear can be further avoided using more complicated constructions, as has, for instance, been done in [26].
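The classical check/diffuse decision procedure above can be sketched as follows; the chain, the flag set, and the concrete values of t_1 and t_2 are illustrative stand-ins (with t_1 and t_2 scaling as Õ(1/δ_s) and Õ(1/ε_s), per the text):

```python
import numpy as np

rng = np.random.default_rng(7)

def classical_decision(P, flagged, t1, t2, pi0):
    """Check/diffuse loop of the classical reflecting agent (sketch).
    t1 plays the role of the mixing-time scale, t2 of the number of checks."""
    Pt1 = np.linalg.matrix_power(P, t1)
    pi = Pt1 @ pi0                            # 1. Initialize
    for _ in range(t2):                       # 2. Check/diffuse loop
        y = rng.choice(len(pi), p=pi)         #    (a) sample from pi
        if y in flagged:
            return y                          #    flagged action found
        e_y = np.zeros(len(pi)); e_y[y] = 1.0
        pi = Pt1 @ e_y                        #    (b) re-mix from clip y
    return int(rng.integers(len(pi)))         # 3. fall back to random action

# Stand-in 3-clip chain with one flagged action (clip 2).
P = np.array([[0.2, 0.3, 0.5],
              [0.4, 0.4, 0.2],
              [0.4, 0.3, 0.3]])
a = classical_decision(P, flagged={2}, t1=20, t2=50,
                       pi0=np.array([1., 0., 0.]))
assert a in {0, 1, 2}
```

Each loop iteration costs t_1 diffusion steps, so the total internal time scales as t_1 * t_2, which is the quantity the quantum construction improves quadratically.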
As we can see, the internal time of the classical agent, i.e. the number of primitive processes (diffusion steps), is fully dictated by the quantities δ_s and ε_s.
At this point, the theory of quantum walks provides us with analogs of discrete-time diffusion processes, with which we can describe the quantum analog of the agent above. The process we will describe here closely follows the quantum walk approaches to the algorithmic problems of element presence detection and marked item search, as given in [26], but to a fundamentally different end. Standardly, quantum search algorithms have been employed in computer science to reduce the number of oracular function evaluations, with the task of efficiently outputting an item satisfying some logical predicate [34,36]. In contrast, the goal of the protocol which the agent follows is more general. The task here is to output a (fully known) flagged action, but according to a specific probability distribution: a good approximation of the aforementioned tailed distribution, see Eq. (3). We note that using quantum walks for the purpose of sampling from desired distributions has been studied previously, predominantly in the context of quantum counterparts of Markov chain Monte Carlo methods [37][38][39][40]. There the quantum walks were mostly used for the important purpose of sampling from Boltzmann-Gibbs distributions of (classical and quantum) Hamiltonians. Aside from the construction we provide next, concrete applications of quantum walks specified by non-symmetric Markov chains have, to our knowledge, only been studied in [41,42], by two of the authors and other collaborators.
In the following, we will use the standard quantum discrete-time diffusion operators U_{P_s} and V_{P_s}, which act on two quantum registers sufficiently large to store the labels of the nodes of the Markov chain P_s. The construction we present follows the one given in [26]. The diffusion operators are defined as

U_{P_s} |i⟩|0⟩ = |i⟩|p_i⟩, with |p_i⟩ = Σ_j √(P_ji) |j⟩,

and

V_{P_s} |0⟩|j⟩ = |p*_j⟩|j⟩, with |p*_j⟩ = Σ_i √(P*_ij) |i⟩.

Here, P*_s is the time-reversed MC defined by π_i P_ji = π_j P*_ij, where π = (π_i)_i is the stationary distribution of P_s. Using four applications of the diffusion operators above, it has been shown that one can construct the standard quantum walk operator W(P_s), which is a composition of two reflections in the mentioned two-register state space. Let Π_1 be the projection operator on Span{|i⟩|p_i⟩}_i and Π_2 the projector on Span{|p*_j⟩|j⟩}_j. Then W(P_s) = (2Π_2 − I)(2Π_1 − I).
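For a small chain, the walk operator W(P_s) can be built explicitly and checked to be a composition of two reflections (hence unitary), with the coherent encoding of the stationary distribution as a fixed point; the chain below is an illustrative stand-in:

```python
import numpy as np

def stationary(P):
    """Stationary distribution of a regular column-stochastic chain P
    (P[j, i] = probability of hopping from clip i to clip j)."""
    vals, vecs = np.linalg.eig(P)
    pi = np.abs(np.real(vecs[:, np.argmin(np.abs(vals - 1.0))]))
    return pi / pi.sum()

def walk_operator(P):
    """Two-register walk operator W(P) = (2*Pi2 - I)(2*Pi1 - I), with
    Pi1 projecting on span{|i>|p_i>} and Pi2 on span{|p*_j>|j>}."""
    n = P.shape[0]
    pi = stationary(P)
    Pstar = pi[:, None] * P.T / pi[None, :]   # time reversal: pi_i P_ji = pi_j P*_ij
    I_n = np.eye(n)
    Pi1 = np.zeros((n * n, n * n))
    Pi2 = np.zeros((n * n, n * n))
    for i in range(n):
        v = np.kron(I_n[i], np.sqrt(P[:, i]))       # |i>|p_i>
        w = np.kron(np.sqrt(Pstar[:, i]), I_n[i])   # |p*_i>|i>
        Pi1 += np.outer(v, v)
        Pi2 += np.outer(w, w)
    I = np.eye(n * n)
    return (2 * Pi2 - I) @ (2 * Pi1 - I)

P = np.array([[0.2, 0.3, 0.5],
              [0.4, 0.4, 0.2],
              [0.4, 0.3, 0.3]])
W, pi = walk_operator(P), stationary(P)
# W is a product of two reflections, hence unitary (real orthogonal):
assert np.allclose(W @ W.T, np.eye(9))
# The state sum_i sqrt(pi_i) |i>|p_i> is a fixed point of W:
psi = sum(np.sqrt(pi[i]) * np.kron(np.eye(3)[i], np.sqrt(P[:, i])) for i in range(3))
assert np.allclose(W @ psi, psi)
```

This matrix-level construction is only for illustration; the point of the operator formulation in the text is that W(P_s) is implemented with four applications of the diffusion operators, never by forming the full matrix.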
Using the quantum walk operator and the well-known phase detection algorithm, the agent can realize the R(P_s)(q, k) subroutine, which approximates the reflection operator 2|π_s⟩⟨π_s| − I, where |π_s⟩ = Σ_i √(π_s(i)) |i⟩ is the coherent encoding of the stationary distribution π_s. The parameters q and k control the fidelity (and the time requirement) of this process, i.e. how well the reflection is approximated as a function of the number of applications of the quantum walk operator.
The fidelity can be shown to approach unity exponentially quickly in k if q ∈ O(1/√δ_s) [26]. For more details on the basics of classical and quantum walks we refer the reader to the Appendix, section VI B, where we also present an explicit treatment of the approximate reflection operator. Now we can define the quantum reflecting agent's deliberation process. Let

t ∈_R {1, . . . , t_max}, with t_max ∈ Õ(1/√ε_s),

where we adhere to the notation used for the classical reflecting agent, and ∈_R denotes that the relevant choice is performed uniformly at random.
Then, the steps of the quantum agent are as follows:

1. Initialize: prepare the state |π_init⟩.

2. Repeat t times: apply the phase flip V_f(s), followed by the approximate reflection R(P_s)(q, k).

3. Measure the first register, and if it is a flagged action, output it, else output a random action.
For the first step above, we assume that the agent can prepare the initial state |π_init⟩, which requires just one application of the diffusion operator U_{P_s}, provided the state |π_s⟩ = Σ_i √(π_s(i)) |i⟩ is available. In this framework, as in the standard frameworks of algorithms based on quantum walks with non-symmetric Markov chains, we assume this is the case. Later we will provide an example of reflecting classical and quantum agents where this is easily achieved. We also assume that the agent can perform the phase flip operator V_f(s), defined by V_f(s) |i⟩ = |i⟩ if i ∈ f(s), and V_f(s) |i⟩ = −|i⟩ otherwise, which is easy given that the agent has full access to f(s). Finally, we mention that the quantum agent may repeat the entire procedure a logarithmic number of times, but this overhead will not significantly contribute to our analysis. We elaborate further on this point in section VI E of the Appendix.
We can immediately see that the quantum agent is, by construction, quadratically faster than its classical counterpart in terms of the figures δ_s and ε_s, which alone determine the number of applications of the agents' respective elementary processes. Our main result is the fact that both the presented classical and quantum reflecting agents (defined over the same percept and action sets) belong to the same passive behavioral class within distance α, where the class is defined by the ideal reflecting agent (see the paragraph before Eq. (3)). To state this more formally, if A_ideal, A_c and A_q denote the ideal, the classical and the quantum reflecting agents, respectively, with the same sets of percepts and actions, then

A_c, A_q ∈ {A : A is passively α-equal to A_ideal},   (4)

where A is any agent with the same set of percepts and actions. Crucially, we show that α can be made, in both the classical and the quantum case, exponentially small within the same limiting behaviors, given by Õ(1/δ_s) and Õ(1/ε_s) for the classical agent, and Õ(1/√δ_s) and Õ(1/√ε_s) for the quantum agent. That is, the two proposed agents are passively approximately equivalent, and the quantum agent is quadratically faster than the classical agent. We summarize this result in the following theorem:

Theorem 1 (main). The proposed classical and quantum agents are α-passively equivalent. Moreover, they both approximate the passive behavior of the ideal reflecting agent within α. The parameter α can be made arbitrarily small with at most a logarithmic overhead in the number of the required elementary processes for both agents. As a consequence of the given constructions of the agents, the quantum reflecting agents are generically quadratically faster than their classical counterparts.
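The quadratic separation is easy to appreciate numerically. Taking illustrative values δ_s = ε_s = 10^-4 (these numbers are not from the paper) and ignoring logarithmic factors, the total elementary-step counts differ by four orders of magnitude:

```python
import math

delta_s, eps_s = 1e-4, 1e-4   # illustrative spectral gap and flag probability

# Classical reflecting agent: ~ (1/delta_s) * (1/eps_s) diffusion steps in total.
classical_steps = (1 / delta_s) * (1 / eps_s)
# Quantum reflecting agent: ~ (1/sqrt(delta_s)) * (1/sqrt(eps_s)) walk steps.
quantum_steps = (1 / math.sqrt(delta_s)) * (1 / math.sqrt(eps_s))

assert math.isclose(classical_steps, 1e8)
assert math.isclose(quantum_steps, 1e4)
```

In an active scenario with a fixed environmental response window, this is precisely the regime where the quantum agent can keep up with the environment while its classical counterpart cannot.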
For a proof of this theorem, we refer the reader to the Appendix, section VI E, but note that the proof builds on known results for speed-ups in quantum search algorithms proposed in [3, 25, 26]. Thus, what we have presented is a method for the generic quantization of reflecting PS agents, which maintains the behavior of the agents and provably yields a quadratic speed-up in the active setting.
In the following section we compare a well-studied class of PS agents (which we refer to as standard PS agents) with the proposed 'quantizable' reflecting PS agents. We show that the classical reflecting PS agent performs as well as the standard model. This immediately implies that the quantum reflecting agent also outperforms standard constructions quadratically, in terms of internal time.

IV. COMPARISON OF PS AGENT MODELS
While the definition of reflecting PS agents we introduced in the previous section fits into the broad framework of PS agents, it still differs from the standard PS agents which have been previously studied [19, 35]. The main difference between the standard and the reflecting model is that the standard agents evolve their Markov chain until the first instance an action clip is hit, whereas the reflecting agents allow the Markov chain to fully mix. In the standard PS model, the underlying Markov chain should never fully mix, as in that case the construction would guarantee that the action performed is independent of the received percept (if the Markov chain is irreducible). This implies that large clip networks of high connectivity should not be used in the standard PS, although, from a conceptual perspective, they could appear in agents which have undergone a rich and complex set of episodic experiences (for instance, through any of the types of clip composition we mentioned earlier). The reflecting agents could be expected to yield behavioral advantages in such scenarios. Conversely, since the reflecting agent must always fully mix its Markov chain, and the standard agent does not, it would seem that the standard model should typically outperform the reflecting model on simple clip networks (as the hitting time may be much smaller than the mixing time). This would suggest that the standard and reflecting PS models should not be compared on simple structures, as no quantum advantage could be demonstrated.
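The intuition that hitting can be much faster than mixing on simple chains is easy to check numerically. The following sketch (our own toy example, not from the PS literature) compares the expected hitting time of a target state with the mixing time for a lazy random walk on a path graph:

```python
import numpy as np

# Hedged toy illustration (our own example, not from the paper): on a simple
# chain, the expected hitting time of a target clip can be far smaller than
# the mixing time, which is the intuition for why the standard PS agent can
# outperform the reflecting one on simple clip networks.
n = 30
# Lazy symmetric random walk on a path graph, column-stochastic: P[j, i] = Prob(j | i).
P = np.zeros((n, n))
for i in range(n):
    P[i, i] += 0.5
    P[max(i - 1, 0), i] += 0.25
    P[min(i + 1, n - 1), i] += 0.25

# Expected hitting time of state 0 starting from state 1: solve (I - Q) h = 1,
# where Q is the walk restricted to the non-target states.
Q = np.delete(np.delete(P, 0, axis=0), 0, axis=1).T
h = np.linalg.solve(np.eye(n - 1) - Q, np.ones(n - 1))
hit_time = h[0]

# Mixing time: first t with the total-variation distance to pi below 0.01,
# starting from a point mass at the far end of the path.
pi = np.ones(n) / n                      # this symmetric walk mixes to uniform
dist, t = np.eye(n)[:, n - 1], 0
while 0.5 * np.abs(dist - pi).sum() > 0.01:
    dist, t = P @ dist, t + 1

assert hit_time < t                      # hitting beats mixing on this chain
```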
In contrast to this, here we show that even for the simplest (but non-trivial) standard PS agent construction [19,35], there exists a natural reflecting agent analog, where the performance of the classical reflecting agent matches the standard PS construction, and consequently, the quantum reflecting agent yields a quadratic speed-up over both.

A. Simple standard PS agent with flags
Recall that, in the standard PS model, the internal state of the agent is defined by an h-matrix, which defines the transition probabilities of the directed Markov chain over the clip space. In the simplest case, the clip space comprises only percepts and actions, and the agent is initialized so that each percept is connected to each action with unit weight (implying equiprobable transitions), and no other connections exist. Furthermore, the update function we assume is the standard update function as defined in [19] and repeated in Appendix VI C. In this particular model, the internal elementary time-step is one transition of the Markov chain, so the agent always decides in one step, and there is nothing to improve. However, if we introduce short-term memory, which significantly improves the performance of the model [19], then the fact that the agent may hit an un-flagged item implies it may have to walk again. If ε_s denotes the probability (for the percept s) that the agent hits a flagged item, then the expected number of elementary transitions, and checks, the agent will perform is O(1/ε_s) (see Fig. 2 for an illustration). For this model, we can introduce a direct reflecting agent analog.

B. Simple reflecting agent with flags
It is relatively straightforward to construct a reflecting PS agent which is behaviorally equivalent to the simple standard agent with flags given above. To each percept we assign a Markov chain over just the action space, such that the transition matrix is column-constant, and each column simply contains the transition probabilities for that particular percept in the standard model above; see Fig. 2 for an illustration. The stationary distribution of this Markov chain then matches the desired transition probabilities. The update function for this agent is exactly the same as in the standard model (that is, internally, the reflecting agent can also keep and update the same h-matrix which induces the Markov chains), and the flags are also treated in exactly the same way. Note that the transition matrix of such a Markov chain is rank-1, so it has at most one non-zero eigenvalue; since the trace of this matrix is 1 (and the trace is invariant under a change of basis), the largest eigenvalue is 1 and all others are zero. Thus the spectral gap is always δ = 1, which immediately implies that the Markov chain is fully mixed after just a single transition. Thus the classical reflecting agent will also perform O(1/ε_s) transitions and checks until an action is performed, given the percept s.
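The one-step mixing of such column-constant chains is easy to verify numerically; the following sketch (toy values of ours) checks both the single-step convergence and the rank-1 spectrum:

```python
import numpy as np

# Hedged sketch with toy values (ours): a column-constant transition matrix,
# whose every column is the action-probability vector p of a given percept,
# is rank-1 and therefore mixes in a single step (spectral gap delta = 1).
rng = np.random.default_rng(0)
p = rng.random(5); p /= p.sum()            # action probabilities for one percept
P = np.tile(p[:, None], (1, 5))            # column-constant Markov chain

pi0 = np.eye(5)[2]                          # arbitrary point-mass initial state
assert np.allclose(P @ pi0, p)              # fully mixed after one transition

eigs = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
assert np.isclose(eigs[0], 1) and np.allclose(eigs[1:], 0)   # rank-1 spectrum
```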
However, the quantum agent will only require O(1/√ε_s) calls to the quantum walk operator. Thus, even in this very simple setting, we see that the reflecting agents can be compared to the standard PS agent model, and they maintain a generic quadratic speed-up when quantized.
While the scenario considered in this section is somewhat restricted, it is not without importance, as for the standard PS agents we already have a body of results in the classical case [19, 35]. We emphasize, however, that the quadratic speed-up proven in Theorem 1 is not restricted to this scenario.

V. DISCUSSION
We have presented a class of quantum learning agents that use quantum memory for their internal processing of previous experience. These agents are situated in a classical task environment that rewards a certain behavior but is otherwise unknown to the agent, which corresponds to the situation of conventional learning agents.
The agent's internal 'program' is realized by physical processes that correspond to quantum walks. These are derived from classical random walks over directed weighted graphs, which represent the structure of its episodic memory. We have shown how, using quantum coherence and known results from the study of quantum walks, the agent can explore its episodic memory in superposition in a way which guarantees a provable quadratic speedup in its active learning time over its classical analogue.
Regarding potential realizations for such quantum learning agents, modern quantum physics laboratories offer a variety of systems. Quantum walks can naturally be implemented by linear optics setups with single photons propagating in arrays of polarizing beam splitters (PBS) [44]. Feedback from the environment can, for example, change the polarization of the photons in each path of such an interferometer which in turn updates the probability amplitudes for taking certain subsequent paths. Highly versatile setups can also be realized using internal states of trapped ions [45] or atoms as the quantum memory, and lasers to control the transitions between different internal states. Feedback from the environment may here change the laser intensity and/or phase. These are just two out of several systems that are studied in the field of quantum computing and simulation [46], all of which allow, in principle, a level of coherent control that seems to be necessary to implement quantum walks as described in this paper. An entirely different route towards realizing the proposed quantum (and classical) learning agents might employ condensed matter systems in which the proposed Markov chains could e.g. be realized through cooling or relaxation processes towards target distributions that then encode the state of belief of the agent. Here we envision rather non-trivial cooling/relaxation schemes in complex many-body systems, the study of which is a prominent topic in the field of quantum simulation.
In conclusion, it seems to us that the embodied approach to artificial intelligence acquires a further fundamental perspective by combining it with concepts from the field of quantum physics. The implications of embodiment are, in the first place, described by the laws of physics, which tell us not only about the constraints but also about the ultimate possibilities of physical agents. In this paper we have given one example of how the laws of quantum physics can be fruitfully employed in the design of future intelligent agents that will outperform their classical relatives in complex task environments.

Acknowledgments. This work was supported by the Austrian Science Fund (FWF) through the SFB FoQuS: F 4012.

VI. APPENDIX

A. Formal definitions of reinforcement learning agents
Here we formally define the model of reinforcement learning agents as employed in this work.

Definition 1. (Reinforcement learning agent). A reinforcement learning agent is specified by a tuple (S, A, Λ, C, D, U), where:

• S is the set of percepts, issued by the environment.
• A is the set of actions, available to the agent.
• Λ = {0, 1} is the set of rewards, offered by the environment.
• C = {C_1, . . . , C_p} is the set of possible internal states of the agent.
• D : S × C → A is the decision function, which outputs some action given a percept and the internal state.
• U : S × A × Λ × C → C is the update function, which updates the internal state based on the success or failure of the last percept-action sequence.
A few comments are in order. In this work, the sets of percepts, actions and internal states are defined to be finite, but, in general, this need not be the case. The set of rewards is binary, and this can again be generalized. In our settings, however, we assume a binary rewarding system. The update function may take additional information into account, based on additional outputs of the decision function, which are only processed internally, but this does not occur in the models we consider.
The decision function is not necessarily deterministic. In the non-deterministic case it can be formally defined as
$$D : S \times C \to \mathcal{D}(A),$$
that is, a function which takes values in the set $\mathcal{D}(A)$ of probability distributions over A. In this case we also assume that this distribution is sampled before the actual output to the environment is produced, and the realized sampled action is the input to the update function.

B. Classical and quantum walk basics
A random walk on a graph is described by a Markov chain (MC), specified by a transition matrix P with entries $P_{ji} = \mathrm{Prob}(j|i)$. For an irreducible MC there exists a stationary distribution π such that $P\pi = \pi$. For an irreducible, regular MC this distribution can be approximated by $P^t \pi_0$, i.e., by applying the MC t times to any initial distribution $\pi_0$, where $t \ge t_{mix}$. This time is known as the mixing time and is defined as follows.
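These notions can be sketched for a tiny chain of our own (column-stochastic convention P[j, i] = Prob(j|i)): the stationary distribution is the eigenvalue-1 eigenvector of P, and repeated application of P converges to it.

```python
import numpy as np

# Hedged toy sketch (our own chain): the stationary distribution pi of an
# irreducible, regular MC satisfies P pi = pi, and repeated application of
# the column-stochastic P (P[j, i] = Prob(j|i)) converges to it.
rng = np.random.default_rng(1)
P = rng.random((6, 6)); P /= P.sum(axis=0)     # dense, hence regular, MC

w, v = np.linalg.eig(P)
pi = np.real(v[:, np.argmax(np.real(w))])      # eigenvector for eigenvalue 1
pi /= pi.sum()                                 # normalize to a distribution
assert np.allclose(P @ pi, pi)                 # stationarity: P pi = pi

dist = np.eye(6)[0]                            # point-mass initial distribution
for _ in range(200):                           # t well above t_mix for this chain
    dist = P @ dist
assert np.abs(dist - pi).sum() < 1e-8          # P^t pi_0 approximates pi
```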

Definition 2. (Mixing Time).
The mixing time is
$$t_{mix}(\xi) = \min\left\{ t \;\middle|\; \forall t' \ge t,\ \forall \pi_0 : \|P^{t'}\pi_0 - \pi\| \le \xi \right\}.$$
The latter can be related to the spectral properties of the MC P via the following theorem [47]:

Theorem 2. Let $\delta = 1 - |\lambda_2|$ be the spectral gap of P, where $\lambda_2$ is the second largest eigenvalue of P in absolute value, and let $\pi_{min} = \min_i \pi(i)$. Then the mixing time satisfies the following inequalities:
$$\frac{|\lambda_2|}{\delta}\,\log\!\left(\frac{1}{2\xi}\right) \le t_{mix}(\xi) \le \frac{1}{\delta}\,\log\!\left(\frac{1}{\pi_{min}\,\xi}\right).$$
In the above we use ξ instead of the standard ε for consistency with the rest of the Appendix, as ε has a reserved meaning.
For the purpose of clarity, let us introduce some definitions and theorems, originally provided in [25, 26], that will be useful to fix the notation and to prove the main results for the speed-up of quantum agents.
The quantum analogs of applying the MC P are the quantum diffusion operators, the analogs of the classical diffusion, given by the transformations
$$|i\rangle|0\rangle \mapsto |i\rangle|p_i\rangle \quad \text{(the operator } U_P\text{)}, \qquad |0\rangle|j\rangle \mapsto |p^*_j\rangle|j\rangle,$$
where $|p_i\rangle = \sum_j \sqrt{P_{ji}}\,|j\rangle$, $|p^*_j\rangle = \sum_i \sqrt{P^*_{ij}}\,|i\rangle$, and $P^*$ is the time-reversed MC defined by $\pi_i P_{ji} = \pi_j P^*_{ij}$. We will consider the application of the MC P (for the classical agent) and of the quantum diffusion operators (for the quantum agent) as the (equally time-consuming) primitive processes, as is done in the theory of quantum random walks [26]. Next, we can define the quantum walk operator W(P) for the Markov chain P.

Definition 4. (Walk Operator or Quantum Markov Chain).
The walk operator, or quantum Markov chain, is given by
$$W(P) = \mathrm{ref}(\Pi_2)\,\mathrm{ref}(\Pi_1) = (2\Pi_2 - \mathbb{1})(2\Pi_1 - \mathbb{1}),$$
where $\Pi_1$ is the projection operator onto $\mathrm{Span}\{|i\rangle|p_i\rangle\}_i$ and $\Pi_2$ is the projection operator onto $\mathrm{Span}\{|p^*_j\rangle|j\rangle\}_j$. The quantum walk operator can easily be realized through four applications of the quantum diffusion operators; see e.g. [26] for details. Thus it, too, can be considered a primitive process of the agent.
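For a tiny chain, the construction can be sketched with dense matrices (our own illustration; variable names are ours). The snippet builds the states |i⟩|p_i⟩ and |p*_j⟩|j⟩, the two reflections, and checks that W(P) is unitary and fixes the coherent encoding |π⟩ = Σ_i √π(i) |i⟩|p_i⟩:

```python
import numpy as np

# Hedged sketch of the construction above for a tiny chain, with dense
# matrices (our own illustration; variable names are ours).
n = 3
rng = np.random.default_rng(2)
P = rng.random((n, n)); P /= P.sum(axis=0)                   # column-stochastic MC
w, v = np.linalg.eig(P)
pi = np.real(v[:, np.argmax(np.real(w))]); pi /= pi.sum()    # stationary distribution
Pstar = pi[:, None] * P.T / pi[None, :]                      # time-reversed MC: pi_i P_ji = pi_j P*_ij

e = np.eye(n)
A = np.stack([np.kron(e[i], np.sqrt(P[:, i])) for i in range(n)], axis=1)      # columns |i>|p_i>
B = np.stack([np.kron(np.sqrt(Pstar[:, j]), e[j]) for j in range(n)], axis=1)  # columns |p*_j>|j>

def ref(M):
    # 2 Pi - 1 for the projector Pi onto the column space of M (orthonormal columns)
    return 2 * M @ M.T - np.eye(M.shape[0])

W = ref(B) @ ref(A)                                  # walk operator W(P)
pi_state = A @ np.sqrt(pi)                           # coherent encoding |pi>
assert np.allclose(W @ W.T, np.eye(n * n))           # W(P) is unitary (real orthogonal)
assert np.allclose(W @ pi_state, pi_state)           # W(P)|pi> = |pi>
```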
Another operation which both the classical and the quantum agent perform is checking whether the clip found is flagged. The quantum check operator is defined as follows.

Definition 5. (Check). The quantum check operator is the reflection, denoted ref(f(s)), performing
$$\mathrm{ref}(f(s))\,|i\rangle = \begin{cases} |i\rangle, & \text{if } i \in f(s)\\ -|i\rangle, & \text{otherwise,}\end{cases}$$
where f(s) denotes the set of flagged actions corresponding to the percept s.
In order to prove our main theorems we will be using the ideas introduced in the context of quantum searching [26]. We now briefly recall some of the ideas introduced in [25, 26] in that context. In the quantum-walk-over-graphs approach to searching, one defines an initial state which encodes the stationary distribution of a MC,
$$|\pi\rangle = \sum_i \sqrt{\pi(i)}\,|i\rangle|p_i\rangle,$$
and performs a rotation onto the state containing the 'marked items',
$$|\tilde\pi\rangle = \frac{\Pi_{f(s)}\,|\pi\rangle}{\|\Pi_{f(s)}\,|\pi\rangle\|},$$
where $\Pi_{f(s)}$ is the projector on the space of marked items, i.e., $\Pi_{f(s)} = \sum_{i \in f(s)} |i\rangle\langle i| \otimes \mathbb{1}$. Let us point out that $\mathrm{Span}\{|\pi\rangle, |\tilde\pi\rangle\} \subseteq \mathrm{Span}\{|i\rangle|p_i\rangle\}_i + \mathrm{Span}\{|p^*_j\rangle|j\rangle\}_j$. In order to achieve this rotation, one makes use of two reflections. The first is the reflection over $|\tilde\pi^\perp\rangle$ (denoted $\mathrm{ref}(|\tilde\pi^\perp\rangle)$), the state orthogonal to $|\tilde\pi\rangle$ in $\mathrm{Span}\{|\pi\rangle, |\tilde\pi\rangle\}$. This operator can be realized using the primitive of checking. Indeed, we have the following claim (stated in [26]): restricted to $\mathrm{Span}\{|\pi\rangle, |\tilde\pi\rangle\}$, the check operator ref(f(s)) realizes $\mathrm{ref}(|\tilde\pi^\perp\rangle)$ up to a global phase.

Proof. Let $\alpha|\tilde\pi\rangle + \beta|\tilde\pi^\perp\rangle$ be a vector in $\mathrm{Span}\{|\pi\rangle, |\tilde\pi\rangle\}$. We have that
$$\mathrm{ref}(f(s))\,\left(\alpha|\tilde\pi\rangle + \beta|\tilde\pi^\perp\rangle\right) = \alpha|\tilde\pi\rangle - \beta|\tilde\pi^\perp\rangle.$$
The result easily follows by noting that
$$\Pi_{f(s)}\,|\tilde\pi\rangle = |\tilde\pi\rangle,$$
since $|\tilde\pi\rangle$ has support only on flagged actions, and that
$$\Pi_{f(s)}\,|\tilde\pi^\perp\rangle = 0,$$
since the flagged component of any vector in $\mathrm{Span}\{|\pi\rangle, |\tilde\pi\rangle\}$ is proportional to $|\tilde\pi\rangle$, so the support of $|\tilde\pi^\perp\rangle$ lies entirely outside f(s); by Definition 5, ref(f(s)) acts as the identity on the flagged subspace and as $-\mathbb{1}$ on its complement.

On the other hand, the reflection over |π⟩ is not straightforward. One can devise an approximate scheme to implement this reflection using the phase estimation algorithm. Indeed, one can build a unitary operator, using phase estimation applied to the quantum walk operator, which approximates the reflection over |π⟩. Before we state the theorem regarding this approximate reflection operator (constructively proven in [26]), we first give another result regarding the spectrum of the quantum walk operator, which will be relevant presently.

Theorem 3. (Szegedy [25]) Let P be an irreducible, reversible MC with stationary distribution π. Then the quantum walk operator W(P) is such that: 1. $W(P)\,|\pi\rangle = |\pi\rangle$.

2. On $\mathrm{Span}\{|i\rangle|p_i\rangle\}_i + \mathrm{Span}\{|p^*_j\rangle|j\rangle\}_j$, W(P) has no other eigenvalue 1; its remaining eigenvalues on this subspace are of the form $e^{\pm 2i\theta}$, with $\cos\theta = |\lambda|$ for the eigenvalues $\lambda \neq 1$ of P.
Let us note that the phase gap Δ, i.e., the minimum nonzero value of 2θ, is such that $\cos(\Delta/2) = |\lambda_2|$. One can then, with some algebra, conclude that $\Delta \ge 2\sqrt{\delta}$. Note that any unitary able to (approximately) detect whether the eigenvalue of W(P) on a state in $\mathrm{Span}\{|i\rangle|p_i\rangle\}_i + \mathrm{Span}\{|p^*_j\rangle|j\rangle\}_j$ is different from one (or, equivalently, whether its eigenphase is different from zero), and conditionally flip the sign of the state, will do. We will use such a unitary to approximate ref(|π⟩). Let us use this intuition to build such a unitary R(P), which takes as a parameter the precision s, and refer to it in the following as the approximate reflection operator R(P) [26]:

Theorem 4. (Approximate reflection [26]) Let P be an ergodic, irreducible Markov chain on a space of size |X| with (unique) stationary distribution π. Let W(P) be the corresponding quantum Markov chain with phase gap Δ. Then, if s is chosen in $O(\log_2(1/\Delta))$, for every integer k there exists a unitary R(P) acting on $2\log_2|X| + ks$ qubits such that: 1. R(P) makes at most $k\,2^{s+1}$ calls to the (controlled) W(P) and $W(P)^\dagger$. 2. $R(P)\,|\pi\rangle = |\pi\rangle$, and for every $|\psi\rangle$ in $\mathrm{Span}\{|i\rangle|p_i\rangle\}_i$ orthogonal to $|\pi\rangle$, $\|(R(P) + \mathbb{1})\,|\psi\rangle\| \le 2^{1-k}$.
By the approximate reflection theorem above, there exists a subroutine R(P)(q, k), where the parameter s from the statement of Theorem 4 is taken as $\log_2(q)$, and k explicitly controls the fidelity of the reflection. Note that, in the definition of the quantum reflecting agent from the main text, the parameter q was chosen in Õ(1/√δ), and since, by the theorem of Szegedy [25] discussed above, it holds that $1/\Delta \in O(1/\sqrt\delta)$, the fidelity of the approximate reflection approaches unity exponentially quickly in k. For the explicit construction of the approximate reflection operators, we refer the reader to [26].

C. The PS model
The PS model is a reinforcement learning agent model, and thus it formally fits within the specification provided in Def. 1. Here we recap the formalization of the standard PS model introduced in [19], but note that the philosophy of projective-simulation-based agents is not firmly confined to the formal setting we provide here, as it is more general. PS agents are defined on a more conceptual level, as agents whose internal states represent episodic and compositional memory and whose deliberation comprises association-driven hops between memory sequences, so-called clips. Nonetheless, the formal definitions we give here allow us to state our main claims precisely. Following this, we provide a formal treatment of a slight generalization of the standard PS model, which subsumes both the standard model and the reflecting agent model we refer to in the main text and formally treat later in this Appendix.
The PS model comprises the percept and action spaces as given in Def. 1. The central component of PS is the so-called episodic and compositional memory (ECM), and it comprises the internal states of the agent. The ECM is a directed weighted network (formally represented as a directed weighted graph) the vertices of which are called clips.
Each clip c represents fragments of episodic experiences, which are formally tuples
$$c = (c^{(1)}, c^{(2)}, \ldots, c^{(L)}), \qquad (16)$$
where each $c^{(k)}$ is an internal representation of a percept or an action, so $c^{(k)} \in \mu(S) \cup \mu(A)$, where μ is a mapping from real percepts and actions to their internal representations. We will assume that each ECM always contains all the unit-length clips denoting elementary percepts and actions. Within the ECM, each edge between two clips $c_i$ and $c_j$ is assigned a weight $h(c_i, c_j) \ge 1$, and the weights are collected in the h-matrix.
The elementary process of the PS agent is a Markov chain in which the excitations of the ECM hop from one clip to another, where the transition probabilities are defined by the h-matrix:
$$p(c_j | c_i) = \frac{h(c_i, c_j)}{\sum_k h(c_i, c_k)};$$
thus the h-matrix is just the non-normalized transition matrix.
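This normalization of the h-matrix into transition probabilities can be written out directly (toy weights of ours; rows indexed by the source clip):

```python
import numpy as np

# Hedged sketch (toy weights ours): normalizing the h-matrix row-wise gives
# the transition probabilities p(c_j | c_i) = h(c_i, c_j) / sum_k h(c_i, c_k),
# with rows indexed by the source clip c_i.
h = np.array([[1.0, 4.0, 1.0],
              [2.0, 1.0, 1.0]])
p = h / h.sum(axis=1, keepdims=True)
assert np.allclose(p.sum(axis=1), 1.0)     # each row is a distribution
assert np.isclose(p[0, 1], 4.0 / 6.0)      # the strongly weighted edge dominates
```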
In the standard PS model, the decision function is realized as follows: given a percept s, the corresponding clip in the ECM is excited and hopping according to the ECM network is commenced. In the simplest case, the hopping process is terminated once a unit-length action clip is encountered, and this action is coupled out and output by the actuator (see Fig. 1). The moment when an action is coupled out can be defined in a more involved way, as we explain presently.
Finally, the update rule in the standard model necessarily involves the re-definition of the weights in the h-matrix.
A prototypical update rule for a fully classical agent, defining an update from external time-step t to t+1, depends on whether an action has been rewarded. If the previous action has been rewarded, and the transition between clips $c_i, c_j$ actually occurred in the hopping process, then the update is
$$h^{(t+1)}(c_i, c_j) = h^{(t)}(c_i, c_j) - \gamma\left(h^{(t)}(c_i, c_j) - 1\right) + \lambda,$$
where λ is the reward (often λ = 1) and 0 ≤ γ ≤ 1 a dissipation (forgetfulness) parameter. If the action was not rewarded, or the clips $c_i, c_j$ played no part in the hopping process, then the weights are updated as
$$h^{(t+1)}(c_i, c_j) = h^{(t)}(c_i, c_j) - \gamma\left(h^{(t)}(c_i, c_j) - 1\right).$$
The update rule can also be defined such that the update only requires the initial and terminal clip of the hopping process, which is always the case in the simple PS model, where all the clips are just actions or percepts, and hopping always involves a transition from a percept to an action. This example was used in section IV; for that particular example, the update function can be exactly defined by the rules above. As mentioned, aside from the basic structural and diffusion rules, the PS model allows for additional structures, which we repeat here.
1) Emoticons: the agent's short-term memory, i.e., flags which notify the agent whether the currently found action, given a percept, was previously rewarded or not. For our purposes, we shall use only a very rudimentary mode of flags, which designate that the particular action (given a particular percept) has not already been unsuccessfully tried before. If it has, the agent can 'reflect on its decision' and re-evaluate its strategies by re-starting the diffusion process. This is an example of the more complicated out-coupling rules we mentioned previously.
2) Edge and clip glow: mechanisms which allow for the establishment of additional temporal correlations.
3) Clip composition: the PS model, based on episodic and compositional memory, allows the creation of new clips under certain variational and compositional principles.
These allow the agent to develop new behavioral patterns under certain conditions, and allow for a dynamic reconfiguration of the agent itself. For more details we refer the reader to [19,35].
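The prototypical h-matrix update discussed above can be sketched as follows (the function name and the boolean edge mask are our own devices for the illustration):

```python
import numpy as np

# Hedged sketch of the prototypical update rule (function name and the
# boolean edge mask are ours): rewarded, traversed edges gain lambda, and
# every edge dissipates toward its initial weight 1 at rate gamma.
def update_h(h, traversed, rewarded, lam=1.0, gamma=0.1):
    h_new = h - gamma * (h - 1.0)          # dissipation, applied to all edges
    if rewarded:
        h_new = h_new + lam * traversed    # reinforce the edges that fired
    return h_new

h = np.ones((2, 3))                        # initial h-matrix, all weights 1
mask = np.zeros((2, 3)); mask[0, 1] = 1.0  # the percept-0 to action-1 edge fired
h = update_h(h, mask, rewarded=True)
assert np.isclose(h[0, 1], 2.0)            # rewarded edge: 1 - 0 + 1 = 2
assert np.isclose(h[1, 2], 1.0)            # untouched edges stay at 1
```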
As illustrated, the PS model allows for great flexibility. A straightforward generalization would allow the ECM network to be percept-specific, which is the view we adopt in the definition of reflecting agents. However, here we demonstrate that the same notion can be formalized without introducing multiple networks (one for every percept).
In particular, the ECM network allows action and percept clips to occur multiple times. Thus the ECM network can be represented as |S| disjoint networks, each of which comprises all elementary action clips and only one elementary percept clip. This structure is clearly within the standard PS model, and it captures all the features of the reflecting PS agent model. A simple case of such a network, relative to the standard picture, is illustrated in Fig. 2, part b). Thus, the reflecting PS agent model is, structurally, a standard PS model as well.
However, we will additionally fix a particular process of deliberation for the reflecting PS agents, which we fully formally treat next.

D. Reflecting agents
Here we give the formal definition of the ideal reflecting agent. The classical and quantum realizations of the reflecting agent below differ in the actual method of approximating the decision function of the ideal reflecting agent.
The internal states of the ideal reflecting agent are pairs C = (g, f) ∈ G × F, where each g is a set of ergodic, regular Markov chains over the set of all actions and some subset of percepts, one specified for every percept: $g = \{P_s\}_{s \in S}$, where $P_s$ is an ergodic, regular Markov chain over $S' \cup A$, for some $S' \subseteq S$; and f is the set of configurations of flagged actions assigned to each percept: $f = \{f(s)\}_{s \in S}$, with $f(s) \subseteq A$. We will use the shorthand $g(s) = P_s$ in the following. G and F are just the sets of all such sets of Markov chains and of percept-specific flagged actions, respectively.
• The decision function D, given a percept s and an internal state C = (g, f), returns a distribution (given as a probability mass function) over the actions f(s), defined as follows. Let $\pi_s$ be the stationary distribution of the Markov chain g(s), and let $\pi_s(i)$, for every $i \in S \cup A$, denote the probability of sampling the element i from $\pi_s$. Then
$$D(s, C)(a) = \frac{\pi_s(a)}{\sum_{a' \in f(s)} \pi_s(a')} \quad \text{for } a \in f(s),$$
and $D(s, C)(a) = 0$ otherwise. If no flagged actions exist, the agent will respond with an action chosen uniformly at random.
In the definition above we do not specify any particular update function, as our results hold for all update functions. While the definition may seem convoluted, conceptually it is relatively simple and intuitive. For each percept, the agent has an internal network/graph with actions, and other clips (some of the percepts), as vertices; these represent the associative memory of the agent. Instead of simply performing a random walk over this network until an action clip is hit (as in the standard PS model), this agent 'reflects deeper' and lets the walk fully mix, that is, obtains the stationary distribution over its clip network. Following this, the agent samples from this distribution until a flagged item is hit. Recall that flagged actions correspond to actions that were either not yet tried or were rewarded by the environment in the previous steps in which the agent was presented with that same percept; they thus represent short-term memory which allows the agent to be efficient on time scales where the environment does not change its policies.
The output distribution realized by this process maintains the relative probabilities of flagged actions unperturbed. Formally, it can be seen as a normalized projection of the stationary distribution probability vector onto only flagged actions.
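This normalized projection is straightforward to express; a short numpy sketch (names ours):

```python
import numpy as np

# Hedged sketch (names ours): the tailed distribution is the stationary
# distribution projected onto the flagged actions and renormalized, which
# preserves the relative probabilities of the flagged actions.
def tailed(pi, flagged):
    out = np.zeros_like(pi)
    idx = list(flagged)
    out[idx] = pi[idx]
    return out / out.sum()

pi = np.array([0.1, 0.2, 0.3, 0.4])
assert np.allclose(tailed(pi, {1, 3}), [0.0, 1/3, 0.0, 2/3])
```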

E. Proofs of main theorems
The definitions of the internal processes of classical and quantum reflecting agents are given in section III B of the main body of this paper. Here, we prove separately that both the classical and the quantum reflecting agent arbitrarily well approximate the ideal reflecting agent in terms of the passive behavior, which we used to define passive behavioral equivalence classes.
From these two theorems, the main theorem given in section III B immediately follows. In the following, when π and π′ are distributions, $\|\pi - \pi'\|$ denotes the standard variational distance (Kolmogorov distance) on distributions, so
$$\|\pi - \pi'\| = \sum_i |\pi(i) - \pi'(i)|.$$

Theorem 5. (classical reflecting agents) Let $P_s$ be the transition matrix of the Markov chain associated to percept s, and let f(s) be the (non-empty) set of flagged action clips. Furthermore, let $\pi_s(x)$ be the probability mass function of the stationary distribution $\pi_s$ of $P_s$, and let $\tilde\pi_s$ be the renormalized distribution of $\pi_s$, where the support is retained only over the flagged actions:
$$\tilde\pi_s(i) = \begin{cases} \pi_s(i)\,/\sum_{j \in f(s)} \pi_s(j), & \text{if } i \in f(s)\\ 0, & \text{otherwise.} \end{cases} \qquad (24)$$
Let κ be the probability distribution over the clips output by the classical reflecting agent upon receiving s. Then the distance $\|\kappa - \tilde\pi_s\|$ can be efficiently made arbitrarily small, at a cost which is constant up to logarithmic factors.
Proof. Note that, since the Markov chain is regular, the distribution $P_s^{t_1}\pi_0$, for any initial distribution $\pi_0$ and $t_1 \in \tilde O(1/\delta)$, is arbitrarily close to the stationary distribution of $P_s$. More precisely, by Theorem 2 we have
$$\|P_s^{t_1}\pi_0 - \pi_s\| \le e^{-k_1} \quad \text{for every } t_1 \ge \frac{1}{\delta}\,(k_0 + k_1),$$
where $k_0 = \max_i \log(\pi(i)^{-1})$. While $\pi(i)^{-1}$ can in principle be very large, it contributes only logarithmically to the overhead, so it can be effectively bounded and omitted from the analysis. Thus, we can achieve an exponentially good approximation of the stationary distribution with $\tilde O(1/\delta)$ iterations. In the remainder we will denote $\pi' = P_s^{t_1}\pi_0$. The reflecting agent mixes its Markov chain (achieving $\pi'$) and then samples from this distribution, iteratively, until a flagged action is hit, but at most $t_2$ times. If no flagged action is hit, the agent outputs the uniform distribution over all actions (which we denote u).
Thus $\kappa = (1 - p_{fail})\,\tilde\pi' + p_{fail}\,u$, where $p_{fail}$ is the probability that the agent fails to hit a flagged action when sampling from $\pi'$ on all $t_2$ attempts, and $\tilde\pi'$ is the tailed distribution of $\pi'$ in the sense of Eq. (24) (substituting $\pi_s$ with $\pi'$).
Hence we have that
$$\|\kappa - \tilde\pi_s\| \le (1 - p_{fail})\,\|\tilde\pi' - \tilde\pi_s\| + p_{fail}\,\|u - \tilde\pi_s\| \le \|\tilde\pi' - \tilde\pi_s\| + 2\,p_{fail}.$$
Next, we shall bound $\|\tilde\pi' - \tilde\pi_s\|$. Note that
$$\|\tilde\pi' - \tilde\pi_s\| = \left\| \frac{1}{\epsilon_{\pi'}}\,\pi'_{sub} - \frac{1}{\epsilon}\,(\pi_s)_{sub} \right\|, \qquad (28)$$
where $\epsilon$ and $\epsilon_{\pi'}$ are the probabilities of sampling a flagged action from $\pi_s$ and $\pi'$, respectively, and $\pi'_{sub}$ and $(\pi_s)_{sub}$ are the sub-normalized distributions $\pi'_{sub} = \epsilon_{\pi'}\,\tilde\pi'$ and $(\pi_s)_{sub} = \epsilon\,\tilde\pi_s$. Note that it holds that $\|\pi'_{sub} - (\pi_s)_{sub}\| \le \|\pi' - \pi_s\|$. So we have that
$$\|\tilde\pi' - \tilde\pi_s\| \le \frac{1}{\epsilon}\,\|\pi'_{sub} - (\pi_s)_{sub}\| + \left|\frac{1}{\epsilon_{\pi'}} - \frac{1}{\epsilon}\right|\,\epsilon_{\pi'} \le \frac{\|\pi' - \pi_s\|}{\epsilon} + \frac{|\epsilon - \epsilon_{\pi'}|}{\epsilon}.$$
Next note that $e^{-k_1} \ge \|\pi' - \pi_s\| \ge |\epsilon_{\pi'} - \epsilon|$, so finally we have
$$\|\tilde\pi' - \tilde\pi_s\| \le \frac{2\,e^{-k_1}}{\epsilon} = 2\,e^{-k_1 + \log(1/\epsilon)}.$$
Since we are interested in the analysis at the Õ level, we will, in the end, omit the logarithmically contributing factor above. Next, we upper bound $p_{fail}$. Note that $p_{fail} = (1 - \epsilon_{\pi'})^{t_2}$. Since $\epsilon_{\pi'} = \epsilon + (\epsilon_{\pi'} - \epsilon)$ and $|\epsilon_{\pi'} - \epsilon| \le e^{-k_1}$, we have that
$$p_{fail} \le \left(1 - \epsilon + e^{-k_1}\right)^{t_2}.$$
We can already see that, for a choice of $k_1$ such that $k_1 \ge \log(2/\epsilon)$, we have $p_{fail} \le (1 - \epsilon/2)^{t_2}$, which clearly decays exponentially quickly in $t_2$. In particular, if $t_2$ is chosen in $O(1/\epsilon)$, say $t_2 = 2k_2/\epsilon$, we have
$$p_{fail} \le \left(1 - \epsilon/2\right)^{2k_2/\epsilon} \le e^{-k_2}.$$
Hence, with a choice of $k_1 \ge \log(2/\epsilon)$, we have that $\|\kappa - \tilde\pi_s\| \le 2\,e^{-k_2} + 2\,e^{-k_1 + \log(1/\epsilon)}$.
Thus, with choices of $t_1 \in \tilde O(1/\delta)$ and $t_2 \in \tilde O(1/\epsilon)$, we can make the output distribution of the classical reflecting agent efficiently arbitrarily close to the (ideal) tailed distribution, which proves this theorem.

In the proof above, as in the proof which follows, we have assumed that the flagged sets of actions are non-empty, as only in that case is the tailed distribution defined. Operatively, having no flagged actions would mean that the agent has tried all possible actions for a given percept, and none were rewarded. In this case, the agent should conclude (via a simple check of its memory) that the environment has changed, and redefine the flag status. In particular, it may reset the flags on all actions, but this depends on the details of the construction of the particular reflecting agent. Next, we prove an analogous theorem for the quantum agents.

Theorem 6. (quantum reflecting agents) Let $P_s$ be the transition matrix of the Markov chain associated to percept s, and let f(s) be the (non-empty) set of flagged action clips. Furthermore, let $\pi_s(i)$ be the probability mass function of the stationary distribution $\pi_s$ of $P_s$, and let $\tilde\pi_s$ be the renormalized distribution of $\pi_s$, where the support is retained only over the flagged actions:
$$\tilde\pi_s(i) = \begin{cases} \pi_s(i)\,/\sum_{j \in f(s)} \pi_s(j), & \text{if } i \in f(s)\\ 0, & \text{otherwise.} \end{cases} \qquad (35)$$
Let κ be the probability distribution over the clips output by the quantum reflecting agent upon receiving s. Then the distance $\|\kappa - \tilde\pi_s\|$ can be efficiently made arbitrarily small, at a cost which is constant up to logarithmically contributing terms.
Proof. The proof consists of two parts. First, we prove that, if the reflection operators (used in the diffusion step of the quantum agent) are ideal, then the claim follows. The remainder of the proof, which considers imperfect reflection operators which are actually used, follows from the proof of Theorem 7 in [48].
Assuming that the procedure the agent follows starts from the perfect stationary distribution, and that the reflections over $|\pi_s\rangle$ are perfect, then by the claim above the state of the system never leaves the span of $|\tilde\pi_s\rangle$ and $|\tilde\pi_s^\perp\rangle$, where $|\tilde\pi_s^\perp\rangle$ is the state orthogonal to $|\tilde\pi_s\rangle$ in $\mathrm{Span}\{|\pi_s\rangle, |\tilde\pi_s\rangle\}$.
Note that the agent outputs an action only conditional on the result being a flagged action. This is equivalent to first projecting the state of the system onto the subspace of flagged actions, using the projector $\Pi_{f(s)} = \sum_{x \in f(s)} |x\rangle\langle x|$, followed by renormalization and a measurement. For any state in $\mathrm{Span}\{|\pi_s\rangle, |\tilde\pi_s\rangle\}$, the state after this projection is a (sub-)normalized state proportional to $|\tilde\pi_s\rangle$; hence, after normalization, the measurement outcomes always follow the distribution $\tilde\pi_s$.
However, any state in $\mathrm{Span}\{|\pi_s\rangle, |\tilde\pi_s\rangle\}$ which is not exactly $|\tilde\pi_s\rangle$ still has non-zero support on the non-flagged clips. Nonetheless, it is the key feature of Grover-like search algorithms (like the search algorithm in [48], whose reflections make up the decision process of the quantum agent) that the reflection iterations produce a state which has a constant overlap with the target state (in our case $|\tilde\pi_s\rangle$, which has support only over flagged actions). This implies that the probability of failing to hit a flagged action is at most some constant β < 1. Now, if the deliberation process is iterated some constant number $k_3$ of times, it is straightforward to see that $\|\kappa - \tilde\pi_s\| \le \beta^{k_3}$, which decays exponentially quickly, as desired.
Next, we need to consider the errors induced by the approximate reflections.
The analysis in [48] (proof of Theorem 7) directly shows that the terminal quantum state $|\psi\rangle$ (after $t_2' \in [0, t_2]$ iterations) is close to the state $|\phi\rangle$ the algorithm would produce had the reflections been perfect; formally,
$$\| |\psi\rangle - |\phi\rangle \| \le 2^{1-c},$$
where c is a constant which only additively increases the internal parameter k of the approximate reflection subroutine.
Thus the error on the final state produced by the quantum agent, induced by the approximate reflection algorithm, can be made arbitrarily small within the allowed cost Õ(1/√δ), which also means that the distributions obtained by measurements of these states will not differ by more than $4 \times 2^{1-c}$ [49].
Returning to the inequality proven for the perfect reflections, we see that an error of magnitude $4 \times 2^{1-c}$ can only increase the probability of failing to output a flagged action from β to $\beta + 4 \times 2^{1-c}$. Thus, by tuning the parameters only linearly, we can make sure the agent produces an action with exponentially high probability in the parameter $k_3$ (the number of iterations of the deliberation process). If an action has been produced then, as we have shown above, it has been sampled from a distribution within $4 \times 2^{1-c}$ distance of the tailed distribution $\tilde\pi_s$, which again can be efficiently made exponentially small. Thus the theorem holds.
We note that if a repetition of the deliberation process is required, then the desired initial coherent encoding of the stationary distribution can be re-created by 'inverting' the quantum search algorithm in [26], i.e., by 'unsearching' for all non-action clips. This will, with high fidelity, again recreate a good approximation of the initial stationary distribution.
A corollary of the two theorems above is that the classical and quantum agents are passively approximately equivalent. But then, from the definitions of these embodied agents, we can see that the quantum agent exhibits a consistent quadratic speed-up in terms of the required number of applications of their elementary operations. That is, the quantum agent is quadratically faster, as claimed. This proves the main theorem stated in the main text.