Machine learning for long-distance quantum communication

Machine learning can help us in solving problems in the context of big data analysis and classification, as well as in playing complex games such as Go. But can it also be used to find novel protocols and algorithms for applications such as large-scale quantum communication? Here we show that machine learning can be used to identify central quantum protocols, including teleportation, entanglement purification and the quantum repeater. These schemes are of importance in long-distance quantum communication, and their discovery has shaped the field of quantum information processing. However, the usefulness of learning agents goes beyond the mere reproduction of known protocols; the same approach allows one to find improved solutions to long-distance communication problems, in particular when dealing with asymmetric situations where channel noise and segment distance are non-uniform. Our findings are based on the use of projective simulation, a model of a learning agent that combines reinforcement learning and decision making in a physically motivated framework. The learning agent is provided with a universal gate set, and the desired task is specified via a reward scheme. From a technical perspective, the learning agent has to deal with stochastic environments and reactions. We utilize an idea reminiscent of hierarchical skill acquisition, where solutions to sub-problems are learned and re-used in the overall scheme. This is of particular importance in the development of long-distance communication schemes, and opens the way for using machine learning in the design and implementation of quantum networks.


I. INTRODUCTION
Humans have invented technologies with transforming impact on society. One such example is the internet, which significantly influences our everyday life. The quantum internet [1,2] could become the next generation of such a world-spanning network, and promises applications that go beyond its classical counterpart. These include, e.g., distributed quantum computation, secure communication, and distributed quantum sensing. Quantum technologies are now on the brink of commercial use, and the quantum internet is conceived as one of the key applications in this context. Such quantum technologies are based on the invention of a number of central protocols and schemes, for instance quantum cryptography [3][4][5][6][7] and teleportation [8]. Additional schemes that solve fundamental problems such as the accumulation of channel noise and decoherence have been discovered and have also shaped future research. These include, e.g., entanglement purification [9][10][11] and the quantum repeater [12], which allow for scalable long-distance quantum communication. These schemes are considered key results whose discovery represents breakthroughs in the field of quantum information processing. But to what extent are human minds required to find such schemes?
Here we show that many of these central quantum protocols can in fact be found using machine learning by phrasing the problem in a reinforcement learning (RL) framework [13][14][15], the framework at the forefront of modern artificial intelligence [16][17][18]. By using projective simulation (PS) [19], a physically motivated framework for RL, we show that teleportation, entanglement swapping, and entanglement purification are found by a PS agent. We equip the agent with a universal gate set, and specify the desired task via a reward scheme. With certain specifications of the structure of the action and percept spaces, RL then leads to the re-discovery of the desired protocols. Based on these elementary schemes, we then show that such an artificial agent can also learn more complex tasks and discover long-distance communication protocols, the so-called quantum repeaters [12]. The usage of elementary protocols learned previously is of central importance in this case. We also equip the agent with the possibility to call sub-agents, thereby allowing for a design of a hierarchical scheme [20,21] that offers the flexibility to deal with various environmental situations. The proper combination of optimized block actions discovered by the sub-agents is the central element at this learning stage, which allows the agent to find a scalable, efficient scheme for long-distance communication. We are aware that we make use of existing knowledge in the specific design of the challenges. Rediscovering existing protocols under such guidance is naturally very different from the original achievement (by humans) of conceiving of and proposing them in the first place, an essential part of which includes the identification of relevant concepts and resources. However, the agent does not only re-discover known protocols and schemes, but can go beyond known solutions. In particular, we find that in asymmetric situations, where channel noise and decoherence are non-uniform, the schemes found by the agent outperform human-designed schemes that are based on known solutions for symmetric cases.
From a technical perspective, the agent is situated in stochastic environments [13,14,22], as measurements with random outcomes are central elements of some of the schemes considered. This requires the agent to learn proper reactions to all measurement outcomes, e.g., the required correction operations in a teleportation protocol depending on outcomes of (Bell) measurements. Additional elements are abort operations, as not all measurement outcomes lead to a situation where the resulting state can be further used. This happens for instance in entanglement purification, where the process needs to be restarted in some cases as the resulting state is no longer entangled. The overall scheme is thus probabilistic. These are new challenges that have not been treated in projective simulation before, but the PS agent can in fact deal with such challenges. Another interesting element is the usage of block actions that have been learned previously. This is a mechanism similar to hierarchical skill learning in robotics [20,21], and to clip composition in PS [19,23,24], where previously learned tasks are used to solve more complex challenges and problems. Here we use this concept for long-distance communication schemes. The initial situation is a quantum channel that has been subdivided by multiple repeater stations that share entangled pairs with their neighboring stations. Previously learned protocols, namely entanglement swapping and entanglement purification, are used as new primitives. Additionally, the agent is allowed to employ sub-agents that operate in the same way but deal with a problem at a smaller scale, i.e., they find optimized block actions for shorter distances that the main agent can employ at the larger scale. This allows the agent to deal with big systems, and re-discover the quantum repeater with its favorable scaling. The ability to delegate is of special importance in asymmetric situations, as such block actions need to be learned separately for different initial states of the environment: in our case the fidelity of the elementary pairs might vary drastically, either because they correspond to segments with different channel noise or because they are of different length. In this case, the agent outperforms human-designed protocols that are tailored to symmetric situations.
The paper is organized as follows. In Sec. II we provide background information on reinforcement learning and projective simulation, and discuss our approach to applying these techniques to problems in quantum communication. In Sec. III, we show that the PS agent can find solutions to elementary quantum protocols, thereby re-discovering teleportation, entanglement swapping, entanglement purification and the elementary repeater cycle. In Sec. IV we present results for the scaling repeater in a symmetric and asymmetric setting, and we summarize and conclude in Sec. V.

II. PROJECTIVE SIMULATION FOR QUANTUM COMMUNICATION TASKS
In this paper the process of designing quantum communication protocols is viewed as learning by trial and error. This process is visualized in Fig. 1 as an interaction between an RL agent and its environment: by trial and error the agent manipulates quantum states, thereby constructing communication protocols. At each interaction step the RL agent perceives the current state of the protocol (environment) and chooses one of the available operations (actions). This action modifies the previous version of the protocol and the interaction step ends. In addition to the state of the protocol, the agent gets feedback at each interaction step. This feedback is specified by a reward function, which depends on the specific quantum communication task a)-d) in Fig. 1. The reward is interpreted by the RL agent and its memory is updated.
The described RL approach is used for two reasons. First, there is a similarity between a target quantum communication protocol and a typical RL target. A target quantum communication protocol is a sequence of elementary operations leading to a desired quantum state, whereas a target of an RL agent is a sequence of actions that maximizes the achievable reward. In both cases the solution is therefore a sequence, which makes it natural to assign each elementary quantum operation a corresponding action, and to assign each desired state a reward. Second, the way the described targets are achieved is similar in RL and quantum communication protocols. In both cases an initial search (exploration) over a large number of operation (or action) sequences is needed. This search space can be viewed as a network, where states of a quantum communication environment are vertices, and basic quantum operations are edges. The structure of a complex network formed in the described way is similar to the one observed in quantum experiments [24], which makes the search problem equivalent to navigation in mazes, a reference problem in RL [14,25,26,27].
It should also be said that the role of the RL agent goes beyond mere parameter estimation for the following reasons. First, using simple search methods (e.g., a brute-force or a guided search) would fail for the considered problem sizes: e.g., in the teleportation task discussed in Sec. III A, the number of possible states of the communication environment is at least $7^{14} > 0.6 \times 10^{12}$ [28]. Second, the RL agent learns in the space of its memory parameters, which is not the case for optimization techniques (e.g., genetic algorithms, simulated annealing, or gradient descent algorithms) that would search directly in the parameter space of communication protocols. Optimizing directly in the space of protocols, which consist of both actions and stochastic environment responses, can only be efficient if the space is sufficiently small [14]. An additional complication is introduced by the fact that reward signals are often sparse in quantum communication tasks, hence the reward gradient is almost always zero, giving optimization algorithms no direction for parameter change. Third, using an optimization technique for constructing an optimal action sequence, ignoring stochastic environment responses, is usually not possible in quantum communication tasks: because the required actions depend on random measurement outcomes, there is no single action sequence that achieves an optimal protocol, i.e., there is no single optimal point in the parameter space that an optimization technique could find. Nevertheless, there is at least one point in the RL agent's memory parameter space that achieves an optimal protocol, as the RL agent can choose an action depending on the current state of the environment rather than a whole action sequence.

As a learning agent that operates within the RL framework shown in Fig. 1 we use the PS agent [19,29]. PS is a physically motivated approach to learning and decision making, which is based on deliberation in the episodic and compositional memory (ECM). The ECM is organized as an adjustable network of memory units, which provides flexibility in constructing different concepts in learning, e.g., meta-learning [30] and generalization [31]. The deliberation within the ECM is based on a random walk process that is not computationally demanding and that, in addition, can be sped up via a quantum walk process [32,33], leading to a quadratic speedup in deliberation time [34]; this makes the PS model conceptually attractive. Physical implementations of the quantum-enhanced PS agent were proposed using trapped ions [35] or superconducting circuits [36]. The quantum-enhanced deliberation was recently implemented, as a proof of principle, in a small-scale quantum information processor based on trapped ions [37].
The use of PS in the design of quantum communication protocols has further advantages compared to other approaches, such as standard tabular RL models or deep RL networks. First, the PS agent was shown to perform well on problems that, from an RL perspective, are conceptually similar to designing communication networks. In the problems that can be mapped to a navigation problem [38], such as the design of quantum experiments [24] and the optimization of quantum error correction codes [39], PS outperformed methods that were practically used for those problems (and were not based on machine learning). In standard navigation problems, such as the grid world and the mountain car problem, the PS agent shows a performance qualitatively similar to the standard tabular RL models of SARSA and Q-learning [38]. Second, as was shown in Ref. [38], the computational effort is one to two orders of magnitude lower compared to tabular approaches. The reason for this is a low model complexity: in static task environments the simple PS agent has only one relevant model parameter. This makes it easy to set up the agent for a new complex environment, such as the quantum communication network, where model parameter optimization is costly because of the runtime of the simulations. Third, by construction, the PS decision making can be explained by analyzing graph properties of its ECM. Because of this intrinsic interpretability of the PS model, we are able to properly analyze the outcomes of the learning process.
Next, we show how the PS agent learns quantum communication protocols. The code of the PS agent used in this context is a derivative of a publicly available Python code [40].
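To make the agent-environment interaction concrete, the following is a minimal sketch of a two-layer PS agent in Python. It is not the code of Ref. [40]: the class name `TwoLayerPSAgent`, the parameter names `gamma` (forgetting) and `eta` (glow damping), the update order, and the toy task are assumptions chosen for illustration. The sketch uses the basic PS rules, in which an action is sampled with probability proportional to the h-value of the corresponding percept-action edge, and rewarded edges are strengthened in proportion to their glow.

```python
import random


class TwoLayerPSAgent:
    """Illustrative two-layer PS agent: percepts connect directly to actions."""

    def __init__(self, n_actions, gamma=0.0, eta=0.1, seed=None):
        self.n_actions = n_actions
        self.gamma = gamma  # forgetting rate of the h-values (assumed name)
        self.eta = eta      # glow damping per interaction step (assumed name)
        self.h = {}         # percept -> list of h-values, one per action
        self.glow = {}      # (percept, action) -> glow value
        self.rng = random.Random(seed)

    def act(self, percept):
        # Unknown percepts start with uniform h-values of 1.
        row = self.h.setdefault(percept, [1.0] * self.n_actions)
        action = self.rng.choices(range(self.n_actions), weights=row)[0]
        # Damp the glow of earlier edges, then mark the edge just used.
        for key in self.glow:
            self.glow[key] *= 1.0 - self.eta
        self.glow[(percept, action)] = 1.0
        return action

    def learn(self, reward):
        # h <- h - gamma * (h - 1) + glow * reward
        for percept, row in self.h.items():
            for a in range(self.n_actions):
                g = self.glow.get((percept, a), 0.0)
                row[a] += -self.gamma * (row[a] - 1.0) + reward * g

    def reset_glow(self):
        self.glow.clear()


def run_toy_training(n_trials=200, seed=0):
    """Toy task (hypothetical): reward only the action sequence (1, 0)."""
    target = (1, 0)
    agent = TwoLayerPSAgent(n_actions=2, seed=seed)
    for _ in range(n_trials):
        agent.reset_glow()
        history = ()  # the percept is the tuple of previous actions
        for _ in range(len(target)):
            history += (agent.act(history),)
            agent.learn(1.0 if history == target else 0.0)
    return agent
```

In this toy task the percept is the history of actions in the current trial, and the sparse reward is given only when the full target sequence has been produced; the glow mechanism is what allows the first action of a rewarded sequence to be reinforced as well.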

III. LEARNING ELEMENTARY PROTOCOLS
We let the agent interact with various environments where the initial states and goals correspond to well-known quantum information protocols. For each of the protocols we will first explain our formulation of the environment and the techniques we used. Then we discuss the solutions the agent finds before finally comparing them to the established protocols. A detailed description of the environments together with additional results can be found in the Appendix.

A. Quantum teleportation
The agent is tasked to find a way to transmit quantum information without directly sending the quantum system to the recipient. As an additional resource, a maximally entangled state shared between sender and recipient is available. The agent can apply operations from a (universal) gate set locally. This task challenges the agent, without any prior knowledge, to find the best (shortest) sequence of operations out of a large number of possible action sequences, which grows exponentially with the sequence length.
We describe the learning task as follows: There are two qubits $A$ and $A'$ at the sender's station and one qubit $B$ at the recipient's station. Initially, the qubits $A$ and $B$ are in a maximally entangled state $|\Phi^+\rangle$, and the input qubit $A'$ holds the state $|\Psi\rangle$ that is to be transmitted. The setup is depicted in Fig. 2a. For this setup we consider two cases: the agent is equipped with either a Clifford gate set or a universal gate set. In both cases the agent can perform single-qubit measurements, but multi-qubit operations can only be applied on qubits at the same station (in this case, only between $A$ and $A'$). The task is considered to be successfully solved if the qubit at $B$ is in state $|\Psi\rangle$. In order to ensure that this works for all possible input states, instead of using random input states, we make use of the Jamiołkowski fidelity [41,42] to evaluate if the protocol proposed by the agent is successful. This means we require that the overlap of the Choi-Jamiołkowski state [41] corresponding to the effective map generated by the suggested protocol with the Choi-Jamiołkowski state corresponding to the optimal protocol is equal to 1.
In Fig. 2b the learning curves, i.e., the number of operations the agent applies to reach a solution at each trial, are shown. If no solution is found, the PS agent tries a maximum of 50 operations before the environment is reset and a new trial is started. We see that the average number of operations decreases below 50, which means that the PS agent finds solutions. The average lengths of these solutions decrease over time as the agent keeps finding better and better solutions based on its experience. We observe that the learning curve converges to some average number of operations in both cases, using a Clifford (blue) and a universal (green) gate set. However, the mean squared deviation does not go to zero. This can be explained by looking at the individual learning curves of two example agents in Fig. 2c: the agent does not arrive at a single solution for this problem setup, but rather four different solutions. These solutions can be summarized as follows (up to different orders of commuting operations):
• Apply $\mathrm{CNOT}_{A' \to A}$ and then $H_{A'}$, where $H$ is the Hadamard gate and CNOT is the controlled-NOT operation.
• Measure qubits A and A in the computational basis.
• Depending on the measurement outcomes, either apply 1, X, Y or Z (decomposed to the elementary gates of the used gate set) on qubit B.
We see four different solutions in Fig. 2c as four horizontal lines, which appear because of the probabilistic nature of the quantum communication environment. The agent learns different sequences of gates because different operations are needed, depending on measurement outcomes the agent has no control over. Four appropriate correction operations of different length (as seen in Fig. 2c), which are needed in order for the agent to successfully transmit quantum information at each trial, complete the protocol. This protocol found by the agent is identical to the well-known quantum teleportation protocol [8].
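The three steps listed above can be checked numerically with a small state-vector simulation. The sketch below is an illustration and not the environment used for the agent: the function name `teleport`, the qubit ordering (input qubit, sender's half of the Bell pair, recipient's qubit), and the use of a single test state instead of the Jamiołkowski fidelity are assumptions. The correction is written as $Z^{m_1} X^{m_2}$, which coincides with the list 1, X, Y, Z up to a global phase.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
# CNOT with the first qubit as control and the second qubit as target
CNOT = np.array(
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex
)


def teleport(psi, outcome):
    """Return (probability, corrected state of B) for one measurement outcome.

    psi     -- normalized input state of the input qubit (length-2 vector)
    outcome -- (m_in, m_a): computational-basis results on the input qubit
               and on the sender's half of the Bell pair
    """
    bell = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)  # |Phi+> on (A, B)
    state = np.kron(psi, bell)               # qubit order: input, A, B
    state = np.kron(CNOT, I2) @ state        # CNOT: input qubit -> A
    state = np.kron(H, np.eye(4)) @ state    # Hadamard on the input qubit
    m_in, m_a = outcome
    # Project the two sender qubits onto |m_in, m_a>.
    amplitudes = state.reshape(2, 2, 2)[m_in, m_a, :]
    prob = np.vdot(amplitudes, amplitudes).real
    b_state = amplitudes / np.sqrt(prob)
    # Outcome-dependent correction on the recipient's qubit.
    correction = np.linalg.matrix_power(Z, m_in) @ np.linalg.matrix_power(X, m_a)
    return prob, correction @ b_state
```

Each of the four outcomes occurs with probability 1/4 and requires a different correction, which is the reason the agent has to learn four action sequences rather than one.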
Note that because we used the Jamiołkowski fidelity to verify that the protocol implements the teleportation channel for all possible input states, it follows that the same protocol can be used for entanglement swapping if the input qubit at $A'$ is part of an entangled state.

B. Entanglement purification
Noise and imperfections are a fundamental obstacle to distributing entanglement over long distances, so a strategy to deal with these is needed. One possible idea is to use a larger amount of entanglement in the form of multiple Bell pairs, each of which may have been affected by noise during the initial distribution, and try to obtain fewer, less noisy pairs from them. The agent again has to rely on using only local operations at the two different stations that are connected by the Bell pairs. Specifically, we provide the agent with two noisy Bell pairs $\rho_{A_1 B_1} \otimes \rho_{A_2 B_2}$ as input, where $\rho$ is of the form

$$\rho = F |\Phi^+\rangle\langle\Phi^+| + \frac{1-F}{3} \left( |\Psi^+\rangle\langle\Psi^+| + |\Phi^-\rangle\langle\Phi^-| + |\Psi^-\rangle\langle\Psi^-| \right).$$

Here $|\Phi^\pm\rangle$ and $|\Psi^\pm\rangle$ denote the standard Bell basis and $F$ is the fidelity with respect to $|\Phi^+\rangle$. This starting situation is depicted in Fig. 3a. The agent is tasked with finding a protocol that probabilistically outputs one copy with increased fidelity. However, it is desirable to obtain a protocol that does not only result in an increased fidelity when applied once, but consistently increases the fidelity when applied recurrently, i.e., on two pairs that have been obtained from the previous round of the protocol. In order to make such a recurrent application possible while dealing with probabilistic measurements, identifying the branches that should be reused is an integral part.
To this end, a different technique than before is employed. Rather than simply obtaining a random measurement outcome every time the agent picks a measurement action, the agent needs to provide potentially different actions for all possible outcomes. The actions taken on all the different branches of the protocol are then evaluated as a whole. This makes it possible to calculate the result of the recurrent application of that protocol separately for each trial. The agent is rewarded according to both the overall success probability of the protocol and the obtained increase in fidelity.
The agent is provided with a Clifford gate set and single-qubit measurements. Qubits labeled $A_i$ are held by one party and those labeled $B_i$ are held by another party. Multi-qubit operations can only be applied on qubits at the same station. The output of each of the branches is enforced to be a state with one qubit on side A and one on side B, along with a decision by the agent whether to consider that branch a success or a failure for the purpose of iterating the protocol. Since this naturally needs two single-qubit measurements, with two possible outcomes each, there are four branches that need to be considered.
In Fig. 3b we see reward values that 100 agents obtained for the protocols applied to initial states with fidelities of F = 0.73. The reward is normalized such that the entanglement purification protocol presented in Ref. [10] would obtain a reward of 1.0. All the successful protocols found start the same way (up to permutations of commuting operations): they apply $\mathrm{CNOT}_{A_1 \to A_2} \otimes \mathrm{CNOT}_{B_1 \to B_2}$ followed by measuring qubits $A_2$ and $B_2$ in the computational basis. In some of the protocols two of the previously discussed four branches are marked as successful, while others only mark one particular combination of measurement outcomes. The latter therefore have a smaller probability of success, which is reflected in the reward. However, looking closely at the distribution in Fig. 3b we can see that these cases correspond to two variants with slightly different rewards. Those variants differ in the operations that are applied on the output copies before the next purification step. The variant with slightly lower reward applies the Hadamard gate on both qubits: $H \otimes H$. The protocol that obtains the full reward of 1.0 applies $\sqrt{-iX} \otimes \sqrt{-iX}$ and is depicted in Fig. 3c. This protocol is equivalent to the well-known DEJMPS protocol [10] for an even number of recurrence steps, but requires a shorter action sequence for the gate set provided to the agent. We discuss this solution in more detail, as well as an additional variant of the environment with automatic depolarization after each recurrence step, in Appendix B.
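The effect of the common first part of these protocols can be followed with simple bookkeeping, assuming the input pairs are diagonal in the Bell basis and all gates are perfect. The sketch below is an illustration under these assumptions, with hypothetical function names; it tracks the four Bell-state weights through one bilateral CNOT step in which both branches with coinciding measurement outcomes are accepted, and models $H \otimes H$ as an exchange of the $|\Phi^-\rangle$ and $|\Psi^+\rangle$ weights.

```python
def werner(fidelity):
    """Bell-diagonal weights (phi+, phi-, psi+, psi-) of the input state."""
    q = (1.0 - fidelity) / 3.0
    return (fidelity, q, q, q)


def purification_step(weights):
    """Bilateral CNOT from pair 1 to pair 2, then measure pair 2.

    The branches with coinciding outcomes on A2 and B2 are accepted.
    Returns (success_probability, weights of the remaining pair).
    """
    a, b, c, d = weights
    p_success = (a + b) ** 2 + (c + d) ** 2
    new_weights = (
        (a * a + b * b) / p_success,  # phi+
        2 * a * b / p_success,        # phi-
        (c * c + d * d) / p_success,  # psi+
        2 * c * d / p_success,        # psi-
    )
    return p_success, new_weights


def apply_hadamards(weights):
    """H on both qubits exchanges the phi- and psi+ weights."""
    a, b, c, d = weights
    return (a, c, b, d)
```

For F = 0.73 a single step succeeds with probability 0.7048 and raises the fidelity to about 0.768. Without a local operation between the rounds, the weight that the step moves into $|\Phi^-\rangle$ is fed back into the next round, which illustrates why the variants found by the agent differ in the operations applied before the next purification step.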

C. Quantum repeater
Entanglement purification alone certainly increases the distance over which one can distribute an entangled state of sufficiently high fidelity. However, the reachable distance is limited because at some point too much noise will accumulate such that the initial states will no longer have the minimal fidelity required for the entanglement purification protocol. Hence, one splits up the channels into smaller segments. Now the agent has to deal with two such channel segments that distribute noisy Bell pairs with a common station in the middle, as depicted in Fig. 4a. In this scenario the challenge for the agent is to use the protocols of the previous sections in order to distribute an entangled state over the whole distance.
To this end the agent may use the previously discovered protocols for teleportation/entanglement swapping and entanglement purification as elementary actions, rather than individual gates.
The task is to find a protocol for distributing an entangled state between the two outer stations with a threshold fidelity of at least 0.9, all the while using as few initial states as possible. The initial Bell pairs are considered to have initial fidelities of F = 0.75. Furthermore, the CNOT gates used for entanglement purification are considered to be imperfect, which we model as local depolarizing noise with reliability parameter $p$ acting on the two qubits involved, followed by the perfect CNOT operation [11]. The effective map $M^{a \to b}_{\mathrm{CNOT}}$ is given by

$$M^{a \to b}_{\mathrm{CNOT}}\, \rho = \mathrm{CNOT}_{ab} \left[ D_a(p) D_b(p) \rho \right] \mathrm{CNOT}_{ab}^{\dagger},$$

where $D_i(p)$ denotes the local depolarizing noise channel with reliability parameter $p$ acting on the $i$-th qubit:

$$D_i(p)\, \rho = p \rho + \frac{1-p}{4} \left( \rho + X_i \rho X_i + Y_i \rho Y_i + Z_i \rho Z_i \right),$$

with $X_i$, $Y_i$, $Z_i$ denoting the Pauli matrices acting on the $i$-th qubit. While the point of such an approach only begins to show for much longer distances, which we take a look at in Sec. IV, some key concepts can already be observed at small scales.
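As an illustration of this noise model, the following sketch applies the noisy CNOT to a two-qubit density matrix. It assumes the standard form of the local depolarizing channel, $D_i(p)\rho = p\rho + \frac{1-p}{4}(\rho + X_i \rho X_i + Y_i \rho Y_i + Z_i \rho Z_i)$, and restricts the register to the two qubits involved in the gate; the function names are hypothetical.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
# CNOT with qubit 0 as control and qubit 1 as target
CNOT = np.array(
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex
)


def on_qubit(op, i):
    """Embed a single-qubit operator on qubit i (0 or 1) of two qubits."""
    return np.kron(op, I2) if i == 0 else np.kron(I2, op)


def depolarize(rho, i, p):
    """Local depolarizing noise with reliability parameter p on qubit i."""
    mixed = sum(
        on_qubit(s, i) @ rho @ on_qubit(s, i).conj().T for s in (I2, X, Y, Z)
    )
    return p * rho + (1.0 - p) / 4.0 * mixed


def noisy_cnot(rho, p):
    """Depolarize both qubits, then apply the perfect CNOT."""
    rho = depolarize(depolarize(rho, 0, p), 1, p)
    return CNOT @ rho @ CNOT.conj().T
```

For p = 1 the map reduces to the perfect CNOT, and for p = 0 both qubits are replaced by the maximally mixed state before the gate is applied.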
The agent naturally tends to find solutions that use a small number of actions in an environment that is similar to a navigation problem. However, this is not necessarily desirable here because the resources, i.e., the number of initial Bell pairs, are the figure of merit in this scenario rather than the number of actions. Therefore an appropriate reward function for this environment takes the used resources into account.
In Fig. 4b the learning curve of the best of 128 agents in terms of resources used is depicted. Looking at the best solutions, the key insight is that it is beneficial to purify the short-distance pairs a few times before connecting them via entanglement swapping, even though this way more actions need to be performed by the agent. This solution is in line with the idea of the established quantum repeater protocol [12].

IV. SCALING QUANTUM REPEATER
The point of the quantum repeater lies in its scaling behavior, which only starts to show when considering longer distances than just two links. This means we have to consider starting situations of variable length as depicted in Fig. 1d, using the same error model as described in Sec. III C. In order to distribute entanglement over varying distances, the agent needs to come up with a scalable scheme. However, both the action space and the length of action sequences required to find a solution would quickly grow unmanageable with increasing distances. Furthermore, an RL agent learns a solution for a particular situation and problem size rather than finding a universal concept that can be transferred to similar starting situations and larger scales.
To overcome these restrictions, we provide the agent with the ability to effectively outsource finding solutions for distributing an entangled pair over a short distance and reuse them as elementary actions for the larger setting. This means that, as a single action, the agent can instruct multiple sub-agents to come up with a solution for a small distance and then pick the best action sequence among those solutions. This process is illustrated in Fig. 5a.
Again, the aim is to come up with a protocol that distributes an entangled pair over a long distance with sufficiently high fidelity, while using as few resources as possible.

FIG. 5. a The main agent that is tasked with finding a solution for the large scale problem (repeater length 4 in this example) delegates the solution of a subsystem of length 2 to another agent. That agent comes up with a solution for the smaller scale problem and that action sequence is then applied to the larger problem. This counts as one single action for the main agent. For settings even larger than this, the sub-agent itself can again delegate the solution of a sub-sub-system to yet another agent. b-c Scaling repeater with forced symmetric protocols with initial fidelities F = 0.75. Gate reliability parameter p = 0.99. The threshold fidelity for a successful solution is 0.9. The red line corresponds to the solution with approximately $0.518 \times 10^{8}$ resources used. b Best solution found by an agent with repeater length 8. c Relative resources used by the agent's solution compared to a symmetric strategy for different repeater lengths. d-e Scaling repeater with asymmetric initial fidelities (0.8, 0.6, 0.8, 0.8, 0.7, 0.8, 0.8, 0.6). The threshold fidelity for a successful solution is 0.9. d Best solution found by an agent with gate reliability p = 0.99 outperforms a strategy that does not take the asymmetric nature of the initial state into account (red line). e Relative resources used by the agent's solution compared to a symmetric strategy for different reliability parameters. The jumps in the relative resources used are likely due to threshold effects or agents converging to a very short sequence of block actions that is not optimal.

A. Symmetric protocols
First, we take a look at a symmetric variant of this setup: The initial situation is symmetric and the agent is only allowed to perform actions in a symmetric way. If it applies one step of an entanglement purification protocol on one of the initial pairs, all the other pairs need to be treated in the same way. Similarly, entanglement swapping is always performed at every second station that is still connected to other stations. In Fig. 5b-c the results for various lengths of Bell pairs with an initial fidelity of F = 0.75 are shown. We compare the solutions that the agent found with a strategy that repeatedly purifies all pairs up to a chosen working fidelity followed by entanglement swapping (see Appendix D 3). For lengths greater than 8 repeater links, the agent still finds a solution with desirable scaling behavior while only using slightly more resources.
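The comparison strategy can be sketched as a simple resource count. The following is a strongly simplified illustration and not the procedure of Appendix D 3: it assumes perfect gates, pairs that are brought back to the one-parameter form after every step, the corresponding recurrence formula for one purification step, and a number of links that is a power of two. All function names are hypothetical. Resources are counted as the expected number of elementary pairs consumed per distributed pair.

```python
def purify(fidelity):
    """One purification step on two pairs of equal fidelity (perfect gates).

    Returns (success_probability, output_fidelity).
    """
    q = (1.0 - fidelity) / 3.0
    p_success = fidelity**2 + 2 * fidelity * q + 5 * q**2
    return p_success, (fidelity**2 + q**2) / p_success


def swap(f1, f2):
    """Fidelity after connecting two pairs by entanglement swapping."""
    return f1 * f2 + (1.0 - f1) * (1.0 - f2) / 3.0


def purify_to(fidelity, resources, target, max_steps=100):
    """Purify repeatedly until the target fidelity is reached."""
    for _ in range(max_steps):
        if fidelity >= target:
            return fidelity, resources
        p_success, fidelity = purify(fidelity)
        # Two input pairs are consumed; failed attempts must be repeated.
        resources = 2.0 * resources / p_success
    raise ValueError("target fidelity not reached")


def symmetric_strategy(n_links, f_initial, working, threshold):
    """Purify all pairs to the working fidelity, swap, and repeat."""
    fidelity, resources = f_initial, 1.0
    while n_links > 1:
        fidelity, resources = purify_to(fidelity, resources, working)
        fidelity, resources = swap(fidelity, fidelity), 2.0 * resources
        n_links //= 2
    return purify_to(fidelity, resources, threshold)
```

A call such as `symmetric_strategy(8, 0.75, 0.9, 0.9)` returns the final fidelity and the expected number of elementary pairs under these assumptions; the numbers are not comparable to those in Fig. 5, which include gate noise.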

B. Asymmetric setup
The more interesting scenario is when the initial Bell pairs are subjected to different levels of noise, e.g., when the physical channels between stations are of different length or quality. In this scenario symmetric protocols are not optimal.
We consider the following scenario: 9 repeater stations connected via links that can distribute Bell pairs of different initial fidelities (0.8, 0.6, 0.8, 0.8, 0.7, 0.8, 0.8, 0.6). In Fig. 5d the learning curve in terms of resources for the agent that can delegate work to sub-agents is shown. The gate reliability of the CNOT gates used in the entanglement purification protocol is p = 0.99. The obtained solution is compared to the resources needed for a protocol that does not take into account the asymmetric nature of this situation and that is also used as an initial guess for the reward function (see Appendix D 3 for additional details of that approach). Clearly the solution found by the RL agent is preferable to the protocol tailored to symmetric situations. Fig. 5e shows how that advantage scales for different gate reliability parameters p.

V. SUMMARY AND OUTLOOK
We have demonstrated that reinforcement learning can serve as a highly versatile and useful tool in the context of quantum communication. When provided with a sufficiently structured task environment including an appropriately chosen reward function, the learning agent will retrieve (effectively re-discover) basic quantum communication protocols like teleportation, entanglement purification, and the quantum repeater. We have developed methods to state challenges that occur in quantum communication as RL problems in a way that offers very general tools to the agent while ensuring that relevant figures of merit are optimized.
We have shown that stating the considered challenges as an RL problem is beneficial and offers advantages over using optimization techniques as discussed in section II.
Regarding the question to what extent programs can help us in finding genuinely new schemes for quantum communication, it has to be emphasized that a significant part of the work consists in asking the right questions and identifying the relevant resources, both of which are central to the formulation of the task environment and are provided by researchers. However, it should also be noted that not every aspect of designing the environment is necessarily a crucial addition, and many details of the implementation are simply an acknowledgment of practical limitations like computational runtimes. When provided with a properly formulated task, a learning agent can play a helpful, assisting role in exploring the possibilities.
In fact, we used the PS agent in this way to demonstrate that the application of machine learning techniques to quantum communication is not limited to rediscovering existing protocols. The PS agent finds adapted and optimized solutions in situations that lack certain symmetries assumed by the basic protocols, such as the qualities of physical channels connecting different stations. We extended the PS model to include the concept of delegating parts of the solution to other agents, which allows the agent to effectively deal with problems of larger size. Using this new capability for long-distance quantum repeaters with asymmetrically distributed channel noise, the agent comes up with novel and practically relevant solutions.
We are confident that the presented approach can be extended to more complex scenarios. We believe that reinforcement learning can become a practical tool to apply to quantum communication problems that do not have a rich spectrum of existing protocols, such as designing quantum networks, especially if the underlying network structure is irregular.

... available, thereby reducing the number of actions that the agent can choose. In order to reduce the huge action space further, the requirement that the final state of one sequence of gates needs to be a two-qubit state shared between A and B is enforced by removing actions that would destructively measure all qubits on one side. The accept and reject actions are essential because they allow identifying successful branches.
Percepts: The agent only uses the previous actions of the current trial as a percept. Z-measurements with different outcomes will produce different percepts.
Reward: The protocol suggested by the agent is performed recurrently ten times. This is done to ensure that the solution found is a viable protocol for recurrent application, because it is possible that a single step of the protocol might increase the fidelity but further applications of the protocol could undo that improvement. The reward function is given by
\[ R = \max\left(0,\ \mathrm{const} \times \sqrt[10]{\prod_{i=1}^{10} p_i}\ \Delta F\right), \]
where $p_i$ is the success probability (i.e. the combined probability of the accepted branches) of the $i$-th step, $\Delta F$ is the increase in fidelity after ten steps, and the constant is chosen such that the known protocols [9] or [10] would receive a reward of 1.
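As a minimal sketch of this reward, assuming the garbled formula reads as the geometric mean of the ten step success probabilities multiplied by the fidelity gain and clipped at zero, one could write the following. The function name and the default value of `const` are illustrative and not taken from the original work.

```python
import math


def purification_reward(success_probs, delta_f, const=1.0):
    """Reward for a recurrently applied purification protocol.

    success_probs: success probability p_i of each purification step.
    delta_f: increase in fidelity after all steps.
    const: normalization constant (hypothetical default; in the text it is
           fixed so that the known protocols receive a reward of 1).
    """
    if not success_probs:
        raise ValueError("need at least one step")
    if any(p < 0.0 or p > 1.0 for p in success_probs):
        raise ValueError("probabilities must lie in [0, 1]")
    # Geometric mean of the step success probabilities (assumed reading).
    geo_mean = math.prod(success_probs) ** (1.0 / len(success_probs))
    return max(0.0, const * geo_mean * delta_f)
```

With this form, halving the success probability of every step halves the reward, and a protocol that lowers the fidelity receives no reward at all.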
Problem-specific techniques: To evaluate the performance of an entanglement purification protocol that is applied in a recurrent fashion, it is necessary to know which actions are performed and especially whether the protocol should be considered successful for all possible measurement outcomes. Therefore, it is not sufficient to use the same approach as for the teleportation challenge and simply consider one particular measurement outcome for each trial. Instead, the agent is required to choose actions for all possible measurement outcomes every time it chooses a measurement action. This means we keep track of multiple separate branches (and the associated probabilities) with different states of the environment. The average density matrix of the branches that the agent decides to keep is the state that is used for the next purification step. We choose to do it this way because it allows us to obtain a complete protocol that can be evaluated at each trial, and the agent is rewarded according to the performance of the whole protocol.
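The branch bookkeeping described here can be sketched as follows. As a simplifying assumption, each branch state is stored as a tuple of Bell-diagonal coefficients rather than a full density matrix, and the function name and data layout are hypothetical.

```python
def combine_accepted_branches(branches):
    """Average the states of all accepted branches.

    branches: list of (probability, state, accepted) triples, where `state`
              is a normalized tuple of coefficients (a stand-in for the
              density matrix) and `accepted` marks branches the agent keeps.
    Returns (success_probability, averaged_state); the state is None when
    no branch is accepted.
    """
    accepted = [(p, state) for p, state, keep in branches if keep]
    p_success = sum(p for p, _ in accepted)
    if p_success == 0:
        return 0.0, None
    dim = len(accepted[0][1])
    # Probability-weighted average, renormalized by the success probability.
    averaged = tuple(
        sum(p * state[i] for p, state in accepted) / p_success
        for i in range(dim)
    )
    return p_success, averaged
```

The returned success probability is the combined probability of the accepted branches, and the averaged state is what would enter the next purification step.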

Discussion
As discussed in the main text, the agent found an entanglement purification protocol that is equivalent to the DEJMPS protocol [10] for an even number of purification steps.
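Let us briefly recap how the DEJMPS protocol works: Initially we have two copies of a state $\rho$ that is diagonal in the Bell basis and can be written with coefficients $\lambda_{ij}$:
\[ \rho = \lambda_{00} |\Phi^+\rangle\langle\Phi^+| + \lambda_{01} |\Psi^+\rangle\langle\Psi^+| + \lambda_{10} |\Phi^-\rangle\langle\Phi^-| + \lambda_{11} |\Psi^-\rangle\langle\Psi^-| . \tag{B1} \]
The effect of the multilateral CNOT operation $\mathrm{CNOT}_{A_1 \to A_2} \otimes \mathrm{CNOT}_{B_1 \to B_2}$ followed by measurements in the computational basis on $A_2$ and $B_2$ and postselected for coinciding measurement results is:
\[ \tilde{\lambda}_{00} = \frac{\lambda_{00}^2 + \lambda_{10}^2}{N}, \quad \tilde{\lambda}_{01} = \frac{\lambda_{01}^2 + \lambda_{11}^2}{N}, \quad \tilde{\lambda}_{10} = \frac{2\lambda_{00}\lambda_{10}}{N}, \quad \tilde{\lambda}_{11} = \frac{2\lambda_{01}\lambda_{11}}{N}, \tag{B2} \]
where $\tilde{\lambda}_{ij}$ denote the new coefficients after the procedure and $N = (\lambda_{00} + \lambda_{10})^2 + (\lambda_{01} + \lambda_{11})^2$ is a normalization constant and also the probability of success. If this map is applied recurrently without any additional intervention, not only the desired coefficient $\lambda_{00}$ (the fidelity) will be amplified, but both $\lambda_{00}$ and $\lambda_{10}$.
To avoid this and only amplify the fidelity with respect to $|\Phi^+\rangle$, the DEJMPS protocol calls for the application of $\sqrt{-iX} \otimes \sqrt{iX}$ on both copies of $\rho$ before applying the multilateral CNOTs and performing the measurements. The effect of this operation is to exchange the two coefficients $\lambda_{10}$ and $\lambda_{11}$, thus preventing the unwanted amplification of $\lambda_{10}$. So the effective map at each entanglement purification step is the following:
\[ \tilde{\lambda}_{00} = \frac{\lambda_{00}^2 + \lambda_{11}^2}{N}, \quad \tilde{\lambda}_{01} = \frac{\lambda_{01}^2 + \lambda_{10}^2}{N}, \quad \tilde{\lambda}_{10} = \frac{2\lambda_{00}\lambda_{11}}{N}, \quad \tilde{\lambda}_{11} = \frac{2\lambda_{01}\lambda_{10}}{N}, \tag{B3} \]
with $N = (\lambda_{00} + \lambda_{11})^2 + (\lambda_{01} + \lambda_{10})^2$. In contrast, the solution found by the agent calls for $\sqrt{-iX} \otimes \sqrt{-iX}$ to be applied, which instead exchanges two different coefficients, $\lambda_{00}$ and $\lambda_{01}$, for an effective map:
\[ \tilde{\lambda}_{00} = \frac{\lambda_{01}^2 + \lambda_{10}^2}{N}, \quad \tilde{\lambda}_{01} = \frac{\lambda_{00}^2 + \lambda_{11}^2}{N}, \quad \tilde{\lambda}_{10} = \frac{2\lambda_{01}\lambda_{10}}{N}, \quad \tilde{\lambda}_{11} = \frac{2\lambda_{00}\lambda_{11}}{N}, \tag{B4} \]
and $N = (\lambda_{01} + \lambda_{10})^2 + (\lambda_{00} + \lambda_{11})^2$. Note that the maps (B3) and (B4) are identical except that the roles of $\lambda_{k0}$ and $\lambda_{k1}$ are exchanged. It is clear that applying the agent's map twice will have the same effect as applying the DEJMPS protocol twice, which means that for an even number of recurrence steps they are equivalent. As a side note, the other protocol that was found by the agent, as described in the main text, applies such an additional operation before each entanglement purification step as well: Applying $H \otimes H$ on $\rho$ exchanges $\lambda_{10}$ and $\lambda_{01}$. This also yields a successful entanglement purification protocol, however with a slightly worse performance.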
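The claimed equivalence for an even number of steps can be checked numerically. The sketch below assumes Bell-diagonal states stored as a tuple (λ00, λ01, λ10, λ11), and it assumes the bilateral-CNOT map has the standard recurrence form consistent with the normalization constants quoted above. All function names are illustrative.

```python
def bilateral_cnot_step(lam):
    """Bilateral CNOT, measurement and postselection on coinciding outcomes.

    lam: coefficients (l00, l01, l10, l11) of a Bell-diagonal state.
    Returns (new_coefficients, success_probability).
    """
    l00, l01, l10, l11 = lam
    n = (l00 + l10) ** 2 + (l01 + l11) ** 2
    new = (
        (l00 ** 2 + l10 ** 2) / n,
        (l01 ** 2 + l11 ** 2) / n,
        2 * l00 * l10 / n,
        2 * l01 * l11 / n,
    )
    return new, n


def dejmps_step(lam):
    """DEJMPS: exchange l10 and l11 first, then apply the bilateral CNOT step."""
    l00, l01, l10, l11 = lam
    return bilateral_cnot_step((l00, l01, l11, l10))


def agent_step(lam):
    """Agent's variant: exchange l00 and l01 first, then the same CNOT step."""
    l00, l01, l10, l11 = lam
    return bilateral_cnot_step((l01, l00, l10, l11))


def iterate(step, lam, rounds):
    """Apply `step` repeatedly; return final coefficients and all step probabilities."""
    probs = []
    for _ in range(rounds):
        lam, p = step(lam)
        probs.append(p)
    return lam, probs
```

For example, starting from (0.7, 0.1, 0.1, 0.1), `iterate(dejmps_step, lam, 2)` and `iterate(agent_step, lam, 2)` return the same coefficients and success probabilities, whereas after a single step the two outputs differ by an exchange of the second index.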

Automatic depolarization variant
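We also investigated a variant where, after each purification step, the state is automatically depolarized before the protocol is applied again. That means if the first step brought the state up to the new fidelity $F$, it is then brought to the form:
\[ \rho = F |\Phi^+\rangle\langle\Phi^+| + \frac{1-F}{3} \left( |\Psi^+\rangle\langle\Psi^+| + |\Phi^-\rangle\langle\Phi^-| + |\Psi^-\rangle\langle\Psi^-| \right). \]
This can always be achieved without changing the fidelity [9].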
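A minimal sketch of one purification step with automatic depolarization is given below. It assumes perfect gates and a depolarized state whose three non-target Bell coefficients are all equal to (1 − F)/3. The function names are illustrative.

```python
def depolarize(lam):
    """Replace Bell-diagonal coefficients (l00, l01, l10, l11) by the
    depolarized form with the same fidelity l00."""
    f = lam[0]
    q = (1.0 - f) / 3.0
    return (f, q, q, q)


def purify_then_depolarize(f):
    """One recurrence step on two depolarized pairs of fidelity f.

    Returns (new_fidelity, success_probability). Assumes noiseless gates.
    """
    l00, l01, l10, l11 = depolarize((f, 0.0, 0.0, 1.0 - f))
    # Bilateral CNOT, measurement, and postselection on coinciding outcomes.
    p_success = (l00 + l10) ** 2 + (l01 + l11) ** 2
    new = (
        (l00 ** 2 + l10 ** 2) / p_success,
        (l01 ** 2 + l11 ** 2) / p_success,
        2 * l00 * l10 / p_success,
        2 * l01 * l11 / p_success,
    )
    # Depolarization keeps only the fidelity of the output state.
    return depolarize(new)[0], p_success
```

Because the depolarization discards everything except the fidelity, the whole recurrence reduces to a map on the single number F, which is why additional fidelity-preserving local operations have no effect in this variant.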
In Fig. 6 the obtained reward for 100 agents for this alternative scenario is shown. The successful protocols consist of applying $\mathrm{CNOT}_{A_1 \to A_2} \otimes \mathrm{CNOT}_{B_1 \to B_2}$ followed by measuring qubits $A_2$ and $B_2$ in the computational basis. Optionally, some additional local operations that do not change the fidelity itself can be added, as the effect of those is undone by the automatic depolarization. Similar to the scenario in the main text, there are some solutions that only accept one branch as successful, which means they only get half the reward as the success probability at each step is halved (center peak in Fig. 6). The protocols for which two relevant branches are accepted are equivalent to the entanglement purification protocol presented in [9].

... stages of the protocol. This means the sub-agents are tasked with finding block actions for a wide variety of initial fidelities, so a new problem needs to be solved for each new situation. In order to speed up the trials, we save situations that have already been solved by sub-agents in a big table and reuse the found action sequence if a similar situation arises.
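Such a table of solved situations could be sketched as below. The text does not specify when two situations count as similar, so rounding the initial fidelities to a fixed number of decimals is purely an assumption here. The class name, method names and the action labels in the usage note are hypothetical.

```python
class SolutionTable:
    """Cache of action sequences found by sub-agents, keyed by the
    (rounded) initial fidelities of the sub-problem."""

    def __init__(self, decimals=2):
        # Assumption: situations agreeing to `decimals` places are "similar".
        self.decimals = decimals
        self._table = {}

    def _key(self, fidelities):
        return tuple(round(f, self.decimals) for f in fidelities)

    def lookup(self, fidelities):
        """Return a stored action sequence, or None if nothing similar is known."""
        return self._table.get(self._key(fidelities))

    def store(self, fidelities, action_sequence):
        """Remember the action sequence found for this situation."""
        self._table[self._key(fidelities)] = list(action_sequence)
```

A main agent would call `lookup` before delegating, and only start a new sub-agent when it returns `None`; for instance, after `store((0.8004, 0.6001), ["purify", "swap"])`, a later `lookup((0.7996, 0.5999))` returns the stored sequence.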

a. Symmetric variant
We force a symmetric protocol by modifying the actions as follows:
Actions:
• Purify all pairs with one entanglement purification step.
• Entanglement swapping at every second active station.
• Block actions of shorter lengths that have been obtained in the same, symmetrized manner.

Additional results and discussion
We also investigated different starting situations for this setup. Here we discuss two of them: First, we applied the agent that is not restricted to symmetric protocols to a symmetric starting situation. The results for initial fidelities F = 0.7 can be found in Fig. 7a-b. In general, the agent finds solutions that are very close but not equal to the working-fidelity strategy described in D 3. Remarkably, for some reliability parameters p the agent even finds a solution that is slightly better, either by slightly changing the order of operations or by exploiting a threshold effect, where omitting an entanglement purification step on one of the pairs is still enough to reach the desired threshold fidelity.
Finally, we also looked at a situation that is highly asymmetric with starting fidelities (0.95, 0.9, 0.6, 0.9, 0.95, 0.95, 0.9, 0.6). Thus there are high-quality links on most connections, but two links suffer from very high levels of noise. The results depicted in Fig. 7c-d show that the advantage over a working-fidelity strategy is even more pronounced.

Working-fidelity strategy
This is the strategy we use to determine the reward constants for the quantum repeater environments and was presented in [12]. This strategy leads to a resource requirement per repeater station that grows logarithmically with the distance.
For repeater lengths with $2^k$ links it is a fully nested scheme and can therefore be stated easily:
1. Pick a working fidelity $F_w$.
2. Purify all pairs until their fidelity is $F \geq F_w$.
3. Perform entanglement swapping at every second active station such that there are half as many repeater links left.
4. Repeat from step 2 until only one pair remains (and therefore the outermost stations are connected).
We then optimize the choice of $F_w$ such that the resources are minimized for the given scenario.
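The nested scheme can be sketched in a few lines under strong simplifying assumptions that are not those of the original work: all pairs are Werner states, gates and measurements are perfect, purification is a depolarized recurrence step, and resources are counted as the expected number of elementary pairs. The function names, the grid of candidate working fidelities and the target fidelity argument are illustrative.

```python
def purify_werner(f):
    """One recurrence step on two Werner pairs: (new_fidelity, success_prob)."""
    q = (1.0 - f) / 3.0
    p_success = f * f + 2.0 * f * q + 5.0 * q * q
    return (f * f + q * q) / p_success, p_success


def swap_werner(f1, f2):
    """Fidelity after noiseless entanglement swapping of two Werner pairs."""
    return f1 * f2 + (1.0 - f1) * (1.0 - f2) / 3.0


def working_fidelity_strategy(num_links, f_init, f_work, max_rounds=200):
    """Nested scheme for 2**k identical links.

    Returns (final_fidelity, expected number of elementary pairs used).
    """
    if num_links < 1 or num_links & (num_links - 1):
        raise ValueError("num_links must be a power of two")
    f, cost = f_init, 1.0  # by symmetry all pairs on one level are identical
    links = num_links
    while links > 1:
        rounds = 0
        while f < f_work:  # step 2: purify until F >= F_w
            if f <= 0.5 or rounds >= max_rounds:
                raise ValueError("working fidelity not reachable")
            f, p = purify_werner(f)
            cost = 2.0 * cost / p  # two inputs per attempt, 1/p attempts on average
            rounds += 1
        f = swap_werner(f, f)  # step 3: swap at every second station
        cost *= 2.0
        links //= 2
    return f, cost


def optimize_working_fidelity(num_links, f_init, f_target, candidates):
    """Grid search: cheapest F_w whose final pair reaches f_target, or None."""
    best = None
    for f_work in candidates:
        try:
            f_final, cost = working_fidelity_strategy(num_links, f_init, f_work)
        except ValueError:
            continue
        if f_final >= f_target and (best is None or cost < best[2]):
            best = (f_work, f_final, cost)
    return best
```

For example, `optimize_working_fidelity(8, 0.75, 0.9, [0.80 + 0.01 * i for i in range(20)])` returns a triple of working fidelity, final fidelity and expected resources. Because gate noise is ignored, the resulting numbers are not comparable to the resource counts quoted in the figures.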
As we are dealing with repeater lengths that are not a power of 2 as part of the delegated subsystems discussed in the main text, the strategy is adjusted as follows for those cases.
1. Pick a working fidelity $F_w$.
2. Purify all pairs until their fidelity is $F \geq F_w$.
3. Perform entanglement swapping at the station with the smallest combined distance of its left and right pair (e.g. 2 links + 3 links). If multiple stations are equal in that regard, pick the leftmost station.
4. Repeat from step 2 until only one pair remains (and therefore the outermost stations are connected).
Then, we again optimize the choice of $F_w$ such that the resources are minimized for the given scenario.
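The station selection rule in step 3 can be illustrated with a short sketch that tracks only the lengths of the current pairs, measured in elementary links. The purification in step 2 is deliberately omitted, and the function names are illustrative.

```python
def pick_swap_station(pair_lengths):
    """Index i of the station between pair i and pair i + 1 whose left and
    right pairs have the smallest combined length; ties go to the leftmost."""
    if len(pair_lengths) < 2:
        raise ValueError("need at least two pairs to swap")
    combined = [
        pair_lengths[i] + pair_lengths[i + 1]
        for i in range(len(pair_lengths) - 1)
    ]
    return combined.index(min(combined))  # list.index returns the first minimum


def swap_order(num_links):
    """Sequence of swaps as (station_index, left_length, right_length)."""
    lengths = [1] * num_links
    history = []
    while len(lengths) > 1:
        i = pick_swap_station(lengths)
        history.append((i, lengths[i], lengths[i + 1]))
        # The two neighbouring pairs are merged into one longer pair.
        lengths[i:i + 2] = [lengths[i] + lengths[i + 1]]
    return history
```

For a repeater with 5 links, `swap_order(5)` yields `[(0, 1, 1), (1, 1, 1), (1, 2, 1), (0, 2, 3)]`, so the last swap connects a pair spanning 2 links with a pair spanning 3 links.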

FIG. 2. Reinforcement learning a teleportation protocol. a Initial setup: the agent is tasked to teleport the state of qubit A to B using a Bell pair shared between A and B. b Learning curves of the PS agents: average number of actions performed in order to teleport an unknown quantum state in the case of having Clifford gates (blue) and universal gates (green) as a set of available actions. The curves represent an average over 500 agents. The shaded areas show mean squared deviation ±σ/3. c Two learning curves (blue and green) of two independent PS agents. Four solutions of different lengths are found by the agents.

FIG. 3. Reinforcement learning an entanglement purification protocol. a Initial setup of the quantum communication environment: two entangled noisy pairs shared between two stations A and B. b Cumulative reward obtained by 100 agents for their protocols found after $5 \times 10^5$ trials. c Illustration of the best protocol found by the agent: Apply bilateral CNOT operations and measure one of the pairs. If the measurement outcomes coincide, the protocol is considered successful and $\sqrt{-iX}$ is applied on both remaining qubits before the next entanglement purification step.

FIG. 4. Reinforcement learning a quantum repeater protocol. a Initial setup for the length 2 quantum repeater environment. The agent is provided with many copies of noisy Bell states with initial fidelities F = 0.75 that can be purified individually or connected via entanglement swapping at station I. b Learning curve in terms of resources (used initial Bell pairs) for the best of 128 agents with gate reliability parameter p = 0.99. The known repeater solution (red line) is reached.

FIG. 5. Reinforcement learning a scalable quantum repeater protocol. a Illustration of delegating a block action. The main agent that is tasked with finding a solution for the large-scale problem (repeater length 4 in this example) delegates the solution of a subsystem of length 2 to another agent. That agent comes up with a solution for the smaller-scale problem, and that action sequence is then applied to the larger problem. This counts as one single action for the main agent. For settings even larger than this, the sub-agent itself can again delegate the solution of a sub-sub-system to yet another agent. b-c Scaling repeater with forced symmetric protocols with initial fidelities F = 0.75. Gate reliability parameter p = 0.99. The threshold fidelity for a successful solution is 0.9. The red line corresponds to the solution with approximately $0.518 \times 10^8$ resources used. b Best solution found by an agent with repeater length 8. c Relative resources used by the agent's solution compared to a symmetric strategy for different repeater lengths. d-e Scaling repeater with asymmetric initial fidelities (0.8, 0.6, 0.8, 0.8, 0.7, 0.8, 0.8, 0.6). The threshold fidelity for a successful solution is 0.9. d Best solution found by an agent with gate reliability p = 0.99 outperforms a strategy that does not take the asymmetric nature of the initial state into account (red line). e Relative resources used by the agent's solution compared to a symmetric strategy for different reliability parameters.

FIG. 6. Entanglement purification environment with automatic depolarization after each purification step. The figure shows the rewards obtained by 100 agents for their protocols found after $5 \times 10^5$ trials.

FIG. 7. a-b Scaling repeater with 8 repeater links with symmetric initial fidelities of 0.7. a Best solution found by an agent for gate reliability p = 0.99. b Relative resources used by the agent's solution compared to the working-fidelity strategy for different gate reliability parameters. c-d Scaling repeater with 8 repeater links with very asymmetric initial fidelities (0.95, 0.9, 0.6, 0.9, 0.95, 0.95, 0.9, 0.6). c Best solution found by an agent for gate reliability p = 0.99. d Relative resources used by the agent's solution compared to the working-fidelity strategy for different gate reliability parameters.
FIG. 1. Illustration of the reinforcement learning agent interacting with the environment. The agent performs actions that change the state of the environment, while the environment communicates information about its state to the agent. The reward function is customized for each environment. The initial states for the different environments we consider here are illustrated: a Teleportation of an unknown state. b Entanglement purification applied recurrently. c Quantum repeater with entanglement purification and entanglement swapping. d Scaling quantum repeater concepts to distribute long-distance entanglement.