Emergence of Exploitation as Symmetry Breaking in Iterated Prisoner's Dilemma

In society, mutual cooperation, defection, and asymmetric exploitative relationships are common. Whereas cooperation and defection are studied extensively in the literature on game theory, asymmetric exploitative relationships between players are little explored. In a recent study, Press and Dyson demonstrate that if only one player can learn about the other, asymmetric exploitation is achieved in the prisoner's dilemma game. In contrast, however, it is unknown whether such one-way exploitation is stably established when both players learn about each other symmetrically and try to optimize their payoffs. Here, we first formulate a dynamical system that describes the change in a player's probabilistic strategy with reinforcement learning to obtain greater payoffs, based on the recognition of the other player. By applying this formulation to the standard prisoner's dilemma game, we numerically and analytically demonstrate that an exploitative relationship can be achieved despite symmetric strategy dynamics and symmetric rules of the game. This exploitative relationship is stable, even though the exploited player, who receives a lower payoff than the exploiting player, has optimized his/her own strategy. Whether the final equilibrium state is mutual cooperation, defection, or exploitation crucially depends on the initial conditions: Punishment against a defector oscillates between the players, and thus a complicated basin structure to the final equilibrium appears. In other words, slight differences in the initial state may lead to drastic changes in the final state. Considering the generality of the result, this study provides a new perspective on the origin of exploitation in society.


I. INTRODUCTION
Equality is not easily achieved in society; instead, inequality among individuals is common.
Exploitative behavior, in which one individual receives a greater benefit at the expense of others receiving lower benefits, is frequently observed. Of course, such exploitation can originate from a priori differences in individual capacities or environmental conditions. However, such exploitation is also developed and sustained historically. Even when inherent individual capacities or environmental conditions are not different, and even when individuals are able to choose other actions to escape exploitation and optimize their benefits, exploitation somehow remains.
In this study, we consider how such exploitation emerges and is sustained. Of course, completely addressing this question is too difficult, as the answer may involve economics, sociology, history, and so forth. Instead, we simplify the problem by adopting a game-theoretic framework, and investigate whether exploitative behavior can emerge a posteriori as a result of dynamics in individuals' cognitive structures. We check whether "symmetry breaking" can occur when individuals have symmetric capacities and environmental conditions. Then, we investigate whether one player may choose an action to accept a lower score than the other, even though both players have the same payoff matrix and even though the exploited player can potentially recover the symmetry and receive the same payoff as the exploiting player.
For this analysis, we adopt the celebrated prisoner's dilemma game, which can potentially exhibit the exploitation of one player by another. In this game, both players can independently choose cooperation or defection. Regardless of the other player's choice, defection is more beneficial than cooperation, but the payoff when both players defect is lower than that when both players cooperate. In this game, an exploitative relationship is represented by unequal cooperation probabilities between the players, as a defector can get higher benefit at the expense of a cooperator.
In the prisoner's dilemma game, the emergence and sustainability of cooperation, even though defection is any individual player's best choice, has been extensively investigated [2,3]. Cooperation can indeed emerge in repeated games in which each player chooses his/her own action (cooperation or defection) depending on the other's previous actions. In other words, a cooperative relationship emerges with the potential for punishment. Players cooperate conditionally with cooperators and defect against defectors (e.g., by a tit-for-tat (TFT) strategy). In evolutionary games, cooperation is known to stably emerge from the introduction of a "space structure" [4], "hierarchical structure" [5], or "stochastic transition of rule" [6], and so forth, in which a certain punishment mechanism against defection is commonly adopted.
In contrast to the intensive and extensive studies on cooperative relationships, however, studies on exploitative relationships (i.e., asymmetric cooperation between two players) are limited.
A recent study proposes zero-determinant strategies [1], classified as one-memory strategies, in which one player stochastically determines whether to cooperate or defect depending on the previous actions of both players. If a player one-sidedly adopts and fixes a zero-determinant strategy while the other accordingly optimizes his/her own strategy, the former player can exploit the latter. Here, however, the study focuses only on one-way learning; hence, the two players have different abilities from the beginning. Thus, whether reciprocal optimization between two symmetric players can generate an exploitative relationship remains unresolved.
Indeed, in studies of evolutionary games with zero-determinant strategies, cooperation [7] or generosity [8] is promoted, rather than the fixation of an exploitative relationship.
Besides evolutionary games, a learning process, the coupled replicator model, was introduced into game theory for reciprocal changes in strategies [9][10][11]. Such models use a deterministic reinforcement learning process in which every player has a probability distribution that provides a probabilistic strategy for taking actions. During a repeated game, a player changes his/her strategy following the resulting payoff. Thus, if the other player's strategy is fixed, a player increases his/her own payoff throughout the repeated game. When this coupled replicator model is adopted for the prisoner's dilemma, however, neither exploitation nor cooperation emerges, because the players in the model have no memories.
In this study, we extend the model in the context of the prisoner's dilemma such that the conditional strategy depends on the previous action. Referring to the other's behavior is justified by the ability to build a model of the other's strategy [12][13][14]. Then, we discuss whether an exploitative relationship emerges even under reciprocal optimization. We also demonstrate that a small difference in initial strategies is amplified, leading to the exploitation of one player by the other.

II. MODEL
We study the well-known prisoner's dilemma (PD) game (see Fig. 1 for the payoff matrix), in which each of two players, referred to as players 1 and 2, chooses to cooperate (C) or defect (D). Thus, a game involves one of four possible action profiles, CC, CD, DC, and DD, where the left (right) letter shows player 1's (2's) choice. For actions CC, CD, DC, and DD, player 1's score is given by R, S, T, and P, respectively. In the PD game, defection is more beneficial regardless of the other player's action, meaning that both T > R and P > S hold. In addition, mutual cooperation (CC) is more beneficial than mutual defection (DD), meaning that R > P holds. A repeated game requires the additional condition that 2R > T + S. In other words, sequential cooperation (i.e., always choosing CC) is more beneficial than reciprocal defection and cooperation (i.e., repeatedly alternating between CD and DC). We next define a class of strategy (see Fig. 1), in which one player stochastically determines whether to choose C or D based on the other player's action in the previous round. Player 1's strategy is given by two variables that represent the probabilities of cooperation in the next round, x_C and x_D, when player 2 was previously a cooperator or defector, respectively.
Conversely, x̄_C := 1 − x_C (x̄_D := 1 − x_D) indicates the probability that player 1's present action is D when the other's previous action is C (D). Throughout this study, we use the definition X̄ := 1 − X. Similarly, player 2's strategy is given by y_C and y_D. These strategies include several well-known strategies as extreme cases: All-D (x_C = x_D = 0), All-C (x_C = x_D = 1), and TFT (x_C = 1, x_D = 0).
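This strategy class can be sketched in code. The following minimal Python illustration (the names `act` and `play_round` are ours, not from the text) represents a strategy as the pair (x_C, x_D) and samples one round of the repeated game, with each player reacting to the other's previous action:

```python
import random

# A strategy in the class defined above is the pair (x_C, x_D): the
# probabilities of cooperating when the opponent's previous action
# was C or D, respectively.
ALL_D = (0.0, 0.0)
ALL_C = (1.0, 1.0)
TFT = (1.0, 0.0)  # cooperate after C, defect after D

def act(strategy, opponent_prev, rng=random.random):
    """Sample this player's action given the opponent's previous action."""
    p_cooperate = strategy[0] if opponent_prev == "C" else strategy[1]
    return "C" if rng() < p_cooperate else "D"

def play_round(s1, s2, prev):
    """One round; prev = (player 1's, player 2's) previous actions."""
    a1 = act(s1, prev[1])  # player 1 reacts to player 2's previous action
    a2 = act(s2, prev[0])  # player 2 reacts to player 1's previous action
    return (a1, a2)
```

For the deterministic extreme cases the behavior is fixed; e.g., two TFT players starting from mutual cooperation keep cooperating forever.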

A. Repeated game for fixed strategies
Before considering the dynamics of each player's strategy, we consider each player's resulting action and payoff when the strategies (i.e., x C , x D , y C , and y D ) are fixed. We assume that (CC, CD, DC, DD) is played with probability p := (p CC , p CD , p DC , p DD ) T in the previous period.
Then, the probabilities of the occurrence of (CC, CD, DC, DD) in the next round are obtained by operating the 4 × 4 Markov matrix M, which is given by

M = ( x_C y_C    x_D y_C    x_C y_D    x_D y_D
      x_C ȳ_C    x_D ȳ_C    x_C ȳ_D    x_D ȳ_D
      x̄_C y_C    x̄_D y_C    x̄_C y_D    x̄_D y_D
      x̄_C ȳ_C    x̄_D ȳ_C    x̄_C ȳ_D    x̄_D ȳ_D ).   (1)

For a given fixed (x_C, x_D, y_C, y_D), the probability is updated as p′ = M p. Thus, after a sufficient number of iterated games, the probabilities converge to an equilibrium, p_e. Here, this equilibrium state is uniquely determined, at least when 0 < x_C, x_D, y_C, y_D < 1 is satisfied, by the full connectivity of M. The equilibrium state, p_e, is represented as the eigenvector of the above matrix corresponding to the eigenvalue 1, which is written with only two variables, x_e and y_e, as p_e = (x_e y_e, x_e ȳ_e, x̄_e y_e, x̄_e ȳ_e)^T (see the Supporting Information for the derivation). Here, note that each player unconditionally cooperates with probability x_e or y_e in the equilibrium state, which are given by

x_e = (x_D + (x_C − x_D) y_D) / (1 − (x_C − x_D)(y_C − y_D)),
y_e = (y_D + (y_C − y_D) x_D) / (1 − (x_C − x_D)(y_C − y_D)),   (2)

which solve the self-consistency relations

x_e = x_C y_e + x_D ȳ_e,  y_e = y_C x_e + y_D x̄_e.   (3)

At the equilibrium state, the payoff of player 1 (2), denoted by u_e (v_e), is given by

u_e = R x_e y_e + S x_e ȳ_e + T x̄_e y_e + P x̄_e ȳ_e,
v_e = R x_e y_e + T x_e ȳ_e + S x̄_e y_e + P x̄_e ȳ_e.   (4)

We emphasize that the equilibrium state for a repeated game is denoted by the subscript e, but it is unrelated to the equilibrium of the learning dynamics discussed in the following subsection.
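As a consistency check of the Markov matrix and the product-form equilibrium, the following Python sketch (our own illustration; the layout assumes the state order (CC, CD, DC, DD) with the left letter denoting player 1) builds M, power-iterates it to the stationary distribution, and compares the result with p_e = (x_e y_e, x_e ȳ_e, x̄_e y_e, x̄_e ȳ_e)^T:

```python
import numpy as np

def markov_matrix(xC, xD, yC, yD):
    """Transition matrix M over states (CC, CD, DC, DD).
    Player 1's cooperation probability depends on player 2's previous
    action (prev CC/DC -> x_C, prev CD/DD -> x_D), and vice versa."""
    p1 = np.array([xC, xD, xC, xD])  # P(player 1 plays C | previous state)
    p2 = np.array([yC, yC, yD, yD])  # P(player 2 plays C | previous state)
    return np.array([p1 * p2,                # next state CC
                     p1 * (1 - p2),          # next state CD
                     (1 - p1) * p2,          # next state DC
                     (1 - p1) * (1 - p2)])   # next state DD

def equilibrium(xC, xD, yC, yD):
    """Closed-form unconditional cooperation probabilities (x_e, y_e)."""
    d = 1.0 - (xC - xD) * (yC - yD)
    return (xD + (xC - xD) * yD) / d, (yD + (yC - yD) * xD) / d

def stationary(M, iters=200):
    """Power iteration converges to the unique stationary distribution."""
    p = np.full(4, 0.25)
    for _ in range(iters):
        p = M @ p
    return p
```

With interior strategies (all probabilities strictly between 0 and 1), the iterated distribution matches the product form to numerical precision.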

B. Learning dynamics of strategies
Next, we consider the dynamic changes in strategies created by a reinforcement learning process. During a repeated game, every player takes actions following his/her own strategy and reinforces the probability of cooperation or defection depending on the gained payoff. Here, we assume that the strategy updates occur much more slowly than the repetition of games does.
Under this assumption, every player can accurately evaluate the benefit gained by a single action and update his/her own strategy to increase his/her payoff, assuming that the other player's strategy is fixed.
First, we compute player 1's payoff resulting from a cooperative action in a single game, which is denoted by u_C. By assuming the repeated-game equilibrium, we calculate the payoff using p = p_1C := (y_e, ȳ_e, 0, 0)^T, because CC (CD) occurs with probability y_e (ȳ_e) and neither DC nor DD is chosen. Note that p_e is not updated during the repeated game. Because the deviation from p_e caused by the single action influences all subsequent rounds, u_C accumulates these contributions as

u_C = u_e + Σ_{t=0}^{∞} (R, S, T, P) M^t (p_1C − p_e).   (5)

In the same way, the probability distribution following player 1's defection, p_1D, and the resulting payoff u_D are given by p_1D := (0, 0, y_e, ȳ_e)^T and

u_D = u_e + Σ_{t=0}^{∞} (R, S, T, P) M^t (p_1D − p_e).   (6)

Second, we consider the update of x_C by player 1 based on the above payoffs u_C and u_D.
The advantage of cooperation relative to the average payoff is given by u_C − (x_C u_C + x̄_C u_D) = x̄_C (u_C − u_D). Then, x_C increases proportionally to this advantage. Note that since player 2's previous action and player 1's present action need to be C and C, respectively, the probability of using the strategy component x_C is given by y_e x_C.
Then, we obtain the evolution of x_C over time as

ẋ_C = x_C x̄_C y_e (u_C − u_D).   (7)

Here, (u_C − u_D) can be computed explicitly (see the Supporting Information for a detailed calculation). The dynamics of x_D are similarly obtained as

ẋ_D = x_D x̄_D ȳ_e (u_C − u_D).   (8)

In the same way, the dynamics of player 2's strategy are given by

ẏ_C = y_C ȳ_C x_e (v_C − v_D),   (9)
ẏ_D = y_D ȳ_D x̄_e (v_C − v_D).   (10)

Note that x_e and y_e are also time-dependent, because they are given as functions of the time-dependent variables (x_C, x_D, y_C, y_D).
The above learning dynamics can be divided into three terms. For example, we focus on the dynamics of x_C, given by Eq. 7. The first term, x_C x̄_C, represents frequency-dependent selection.
When x_C is close to 0 or 1, evolution proceeds slowly because the non-dominant action rarely appears; thus, the evolution of the strategy takes a long time under a biased probability. The second term, y_e, represents the dependence of the evolutionary speed of x_C upon its frequency of use, because the other player cooperated with probability y_e in the previous round. The third term, u_C − u_D, indicates that the rate of change of the strategy is proportional to the difference between the payoffs resulting from C and D, owing to the reinforcement learning.
The learning dynamics extend the previous "coupled replicator model" [9][10][11] to include memory of the other's previous action. Indeed, in the coupled replicator model, reinforcement learning of conditional strategies is not adopted. The first term, the effect of frequency-dependent selection, is common to the present and previous models. However, the second term, i.e., the effect of conditional time evolution, is not found in the previous studies [9][10][11]. A term corresponding to our third term, i.e., the effect of the payoff gap, exists therein, but the computation of the payoff differs. Specifically, in the previous studies, only the payoff in the present period is considered, because the deviation from the equilibrium state is completely relaxed within a single game and no conditional strategies are used. In contrast, in the present model, we need to consider the whole process by which a deviation from the equilibrium state affects future periods over the long term, as shown in Eqs. 5 and 6.

C. Intuitive interpretation of the model
The above equilibrium state (Eq. 2) and learning dynamics (Eqs. 7, 8, and 10) seem complicated at first glance. However, we can intuitively interpret them by employing the concept of the response function [15].
First, we introduce the response function. We consider the situation in which player 2 cooperates with probability y independent of player 1's previous actions. Player 1, whose strategy is given by x_C and x_D, then also becomes an unconditional cooperator with probability f_x(y) = y(x_C − x_D) + x_D (see the Supporting Information for a detailed calculation). Indeed, against y = 1 (i.e., a pure cooperator), f_x(1) = x_C holds, whereas, against a pure defector, f_x(0) = x_D holds. Since f_x is player 1's probability of cooperating given player 2's probability of cooperating, we call it the "response function", following the previous studies [15].
Second, the equilibrium probabilities of cooperation, x_e and y_e in Eq. 2, are interpreted as the crossing point of the two response functions, as shown in Fig. 2. In other words,

x_e = f_x(y_e),  y_e = f_y(x_e)   (11)

hold. Indeed, Eq. 11 is equivalent to Eq. 3. [Fig. 2: The response functions f_x(y) and f_y(x), determined by the strategies x_C, x_D, y_C, and y_D. Their crossing point (black dot) agrees with (x_e, y_e), each player's probability of cooperating in the equilibrium of the repeated game.]
Third, the above learning dynamics (Eqs. 7, 8, and 10) can be compactly rewritten by using the response functions (see the Supporting Information for a detailed calculation). Here, we focus on Eq. 7 as an example. The second term, y_e, corresponds to the shift of the crossing point caused by a change in x_C; indeed, we obtain

∂x_e/∂x_C = y_e / (1 − (x_C − x_D)(y_C − y_D)).   (12)

In addition, the third term, u_C − u_D, corresponds to the gradient of a player's payoff along the other player's response function. In other words, we obtain

u_C − u_D = (∂u_e/∂x_e + (y_C − y_D) ∂u_e/∂y_e) / (1 − (x_C − x_D)(y_C − y_D)).   (13)

By combining Eqs. 12 and 13, with the extra factors cancelling, we can rewrite Eq. 7 as

ẋ_C = x_C x̄_C ∂u_e/∂x_C.   (14)

The same form holds for the dynamics of x_D, y_C, and y_D. The learning dynamics are thus interpreted as the combination of the frequency-dependent selection term, x_C x̄_C, and the adaptive learning term, ∂u_e/∂x_C.
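The gradient form of the learning dynamics, ẋ_C = x_C x̄_C ∂u_e/∂x_C, can be checked numerically. The sketch below (our own illustration; the function names are ours, and derivatives are estimated by central differences rather than computed analytically) evaluates u_e from the closed-form equilibrium, verifies that the x_C- and x_D-gradients stand in the ratio y_e : ȳ_e expected from the conditional-use factor, and performs Euler steps of the gradient dynamics:

```python
import numpy as np

T, R, P, S = 5.0, 3.0, 1.0, 0.0  # standard payoff matrix used in the text

def equilibrium(xC, xD, yC, yD):
    """(x_e, y_e): crossing point of the two response functions."""
    d = 1.0 - (xC - xD) * (yC - yD)
    return (xD + (xC - xD) * yD) / d, (yD + (yC - yD) * xD) / d

def u_e(xC, xD, yC, yD):
    """Player 1's equilibrium payoff as a function of both strategies."""
    xe, ye = equilibrium(xC, xD, yC, yD)
    return R*xe*ye + S*xe*(1-ye) + T*(1-xe)*ye + P*(1-xe)*(1-ye)

def partial(f, args, i, h=1e-6):
    """Central-difference partial derivative of f in its i-th argument."""
    a, b = list(args), list(args)
    a[i] += h
    b[i] -= h
    return (f(*a) - f(*b)) / (2 * h)

def step(s, dt=0.01):
    """One Euler step of xdot_i = x_i (1 - x_i) du_e/dx_i for both players."""
    xC, xD, yC, yD = s
    v_e = lambda xC, xD, yC, yD: u_e(yC, yD, xC, xD)  # symmetric game
    return (xC + dt * xC*(1-xC) * partial(u_e, s, 0),
            xD + dt * xD*(1-xD) * partial(u_e, s, 1),
            yC + dt * yC*(1-yC) * partial(v_e, s, 2),
            yD + dt * yD*(1-yD) * partial(v_e, s, 3))
```

The gradient ratio check mirrors the interpretation of the second term: x_C is exercised only when the other player cooperated (probability y_e), and x_D only otherwise (probability ȳ_e).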

III. ANALYSIS OF LEARNING EQUILIBRIUM
Now, we actually simulate the above learning dynamics. Fig. 3 shows the final states (x_e^*, y_e^*) reached from various initial states (x_C^o, x_D^o, y_C^o, y_D^o). Below, the superscript o (*) denotes an initial (a final) value of the learning dynamics. Here, instead of directly plotting the four-dimensional (x_C^*, x_D^*, y_C^*, y_D^*), we plot only its two-dimensional projection (x_e^*, y_e^*), which is the crossing point generated by the response functions.
From the figure, we see that in the case of T − R − P + S ≤ 0, only (1) the pure DD state (x_e^* = y_e^* = 0) and (2) the pure CC state (x_e^* = y_e^* = 1) can be achieved. In the case of T − R − P + S > 0, however, (3) the intermediate states 0 < x_e^*, y_e^* < 1, which include the case of x_e^* = y_e^*, can also be achieved. We now analyze these fixed points mathematically.
(1): The pure DD fixed point is given by x_e^* = y_e^* = 0, which requires x_D^* = y_D^* = 0, whereas x_C^* and y_C^* are arbitrary. Then, the linear stability analysis shows that the fixed point is stable if u_C^* − u_D^* ≤ 0 and v_C^* − v_D^* ≤ 0 are additionally satisfied. These conditions are equivalent to x_C^*, y_C^* ≤ (P − S)/(T − P). Thus, the pure DD fixed-point attractor exists on a two-dimensional plane with continuous values of x_C and y_C.
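The threshold (P − S)/(T − P) for the stability of the pure DD plane can be probed numerically. In the sketch below (our own illustration; names and the perturbation size are arbitrary choices), we move x_D slightly away from 0 and check the sign of the payoff gradient in that direction, using the standard matrix (5, 3, 1, 0), for which (P − S)/(T − P) = 0.25: the gradient is negative (deviating toward cooperation lowers the payoff, so the DD plane attracts) for x_C = y_C = 0.2, and positive for x_C = y_C = 0.5.

```python
import numpy as np

T, R, P, S = 5.0, 3.0, 1.0, 0.0  # (P - S)/(T - P) = 0.25

def equilibrium(xC, xD, yC, yD):
    d = 1.0 - (xC - xD) * (yC - yD)
    return (xD + (xC - xD) * yD) / d, (yD + (yC - yD) * xD) / d

def u_e(xC, xD, yC, yD):
    """Player 1's equilibrium payoff."""
    xe, ye = equilibrium(xC, xD, yC, yD)
    return R*xe*ye + S*xe*(1-ye) + T*(1-xe)*ye + P*(1-xe)*(1-ye)

def d_ue_dxD(xC, xD, yC, yD, h=1e-6):
    """Numerical gradient of player 1's payoff in x_D, the direction
    that perturbs the pure DD plane."""
    return (u_e(xC, xD + h, yC, yD) - u_e(xC, xD - h, yC, yD)) / (2 * h)

eps = 1e-3  # small cooperation probability after the other's defection
low = d_ue_dxD(0.2, eps, 0.2, eps)   # below threshold: expected negative
high = d_ue_dxD(0.5, eps, 0.5, eps)  # above threshold: expected positive
```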
(2): The pure CC fixed point is given by x_e^* = y_e^* = 1. Here, x_C^* = y_C^* = 1 is clearly required by x_e^* = y_e^* = 1, whereas x_D^* and y_D^* are arbitrary. Then, the fixed point is linearly stable if u_C^* − u_D^* ≥ 0 and v_C^* − v_D^* ≥ 0, which impose upper bounds on x_D^* and y_D^* (see the Supporting Information). Thus, the pure CC fixed-point attractor also exists on a two-dimensional plane in which x_D^* and y_D^* continuously change. Note that these bounds imply that both players sufficiently punish the other's defection.
The pure DD and CC states are well known as the Nash equilibrium and a Pareto optimum, respectively. Because the dominance of these states has been extensively studied, their achievement here is not surprising. In these pure states, no exploitation appears, and both players' actions and payoffs are symmetric. Other states on the boundary of actions (such as x_e = 1, y_e = 0) cannot be stable fixed points (see the Supporting Information for details). The only other fixed points are given by the next case.
(3): When both x_e^* and y_e^* are neither 0 nor 1, the fixed-point condition ẋ_C = ẋ_D = ẏ_C = ẏ_D = 0 requires

u_C^* = u_D^*,  v_C^* = v_D^*.   (15)

From Eq. 15, together with Eq. 3, the set of (x_C^*, x_D^*, y_C^*, y_D^*) achieving a given (x_e^*, y_e^*) is uniquely determined (Eq. 16; see the Supporting Information for the explicit form). Because Eq. 15 imposes only two constraints on the four strategy variables, the fixed points of the learning dynamics again exist on a two(= 4 − 2)-dimensional space, and all such fixed points are represented by just two variables, (x_e^*, y_e^*). According to Eq. 16, there is a one-to-one correspondence between the four-dimensional fixed-point strategies (x_C^*, x_D^*, y_C^*, y_D^*) and (x_e^*, y_e^*). Accordingly, we plot (x_e^*, y_e^*) in Fig. 3 instead of the four-dimensional space of fixed points; this representation will also be used later.
Although such two-dimensional fixed points exist for all sets of T, R, P, and S, not all of them are reachable from the initial conditions. We further study the stability of the fixed points by performing linear stability analysis around them. Here, we recall that there are only two constraints on the four-dimensional dynamics. Thus, two of the four eigenvalues are always zero, and the stability is neutral within the two-dimensional fixed-point space. We therefore examine the stability via the other two eigenvalues, as seen in Fig. 4-(A). The figure shows that in the case of T − R − P + S ≤ 0, none of these novel fixed points is linearly stable.
Thus, only the symmetric states, pure DD and CC, are achieved by learning dynamics.
In contrast, in the case of T − R − P + S > 0, a two-dimensional part of the fixed points satisfies linear stability. For almost all of these points, x_e^* ≠ y_e^* holds. Because x_e^* ≠ y_e^* is equivalent to the payoff inequality u_e^* ≠ v_e^*, we refer to such states as exploitative relationships, in which one player receives more benefit than the other. Such stable two-dimensional exploitation also appears even if the update speeds of the strategies are changed (see the Supporting Information for the detailed results).

B. Characterization of the exploitative relationship
We now characterize the exploitative state by comparing the payoffs for T − R − P + S > 0. Fig. 4-(B) shows both players' payoffs at the stable fixed points. Especially when the degree of exploitation, |x_e^* − y_e^*|, is large, the exploiting player obtains a payoff higher than that under the pure CC, that is, R. Thus, one player is motivated to exploit the other rather than to cooperate reciprocally. However, the exploited player also receives a payoff higher than that under the pure DD, that is, P. Here, P is the minimax payoff in the prisoner's dilemma, and thus a player who optimizes his/her own strategy obtains at least P. Therefore, this player has a motivation to accept exploitation over mutual defection.
The exploitative relationship is characterized by the following two sets of relations (see the Supporting Information for a detailed derivation). First, both x_C − x_D > 0 and y_C − y_D > 0 hold.
Here, x_C − x_D is the difference in cooperativity depending on the other player's previous action, which equals the gradient of player 1's response function. Because both values are positive, both players are less cooperative against defection in the last round. Thus, the exploitative relationship is supported by reciprocal punishment. Second, all of ∂x_C^*/∂x_e^* > 0, ∂x_D^*/∂x_e^* > 0, ∂x_C^*/∂y_e^* < 0, and ∂x_D^*/∂y_e^* < 0 hold. Thus, an increase in exploitation from player 1 to player 2 (i.e., a decrease in x_e^* or an increase in y_e^*) leads to a decrease of x_C^*, x_D^* and an increase of y_C^*, y_D^*. To summarize, it should be noted that the exploitative relationship is stabilized by both players. The exploiting player guarantees that the other player receives a payoff higher than that under the pure DD through appropriate punishment with small x_C^* and x_D^*. On the other hand, the exploited player accepts that the other player receives a payoff higher than that under the pure CC, but simultaneously secures a payoff higher than that under the pure DD by utilizing weak punishment with large y_C^* and y_D^*. Importantly, this exploitative relationship is completely different from that observed by Press and Dyson, because it is achieved as a result of both players' optimization.
The condition T − R − P + S > 0 can be intuitively interpreted from the perspectives of both the exploiting and the exploited players. From the perspective of the exploiting player, the condition written as T − R > P − S implies that a player's change of action from C to D is more beneficial when the other plays C than D. In other words, the exploiting player is more motivated to defect than the exploited player is. In contrast, from the perspective of the exploited player, the condition written as R − T < S − P means that a player's change of action from D to C is less costly when the other plays D than C. In other words, the exploited player is more motivated to cooperate than the exploiting player is. Thus, the exploitative relationship is stabilized by both players: the exploiting (exploited) one's motivation to defect (cooperate) is stronger. The condition T − R − P + S > 0 is known as "submodularity" of the PD in economics [16], so we adopt the term "submodular PD" for this condition. In addition, the same condition is also observed in a biological study [17]. However, why and how such a condition leads to exploitation is noted here for the first time.
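The submodularity condition amounts to a one-line check. The helper below (our own naming) also verifies the PD conditions of Sec. II; for example, the standard matrix (5, 3, 1, 0) is submodular, whereas (3, 2, 1, 0) is a valid PD with T − R − P + S = 0 and hence not submodular:

```python
def is_valid_pd(T, R, P, S):
    """Prisoner's dilemma conditions from Sec. II: T > R > P > S and
    the repeated-game condition 2R > T + S."""
    return T > R > P > S and 2 * R > T + S

def is_submodular(T, R, P, S):
    """The condition T - R - P + S > 0 under which, per the analysis
    above, stable exploitative fixed points exist."""
    return T - R - P + S > 0
```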

IV. TRANSIENT DYNAMICS TO THE LEARNING EQUILIBRIUM
In § 3, we analyzed the fixed points and the linear stability in their neighborhoods. However, this analysis is limited to a small portion (i.e., the neighborhood of an at most two-dimensional space) of the whole four-dimensional phase space given by (x_C, x_D, y_C, y_D). We now study the transient dynamics that reach the learning equilibrium from arbitrary initial conditions (x_C^o, x_D^o, y_C^o, y_D^o) of the two players.

A. Characterization of transient dynamics
Although the attractors consist of the pure DD, the pure CC, and exploitative states of various degrees with two-dimensionality, the transient dynamics are categorized into the following several cases.
Case (1): Direct convergence to a cooperative relationship. As easily guessed, a large x_C and a small x_D encourage the other player to cooperate by punishing the other's defection. Thus, as shown in Fig. 5-(A), both players' strategies approach such punishing strategies, and a cooperative relationship is directly achieved. Here, we emphasize that the extreme limit of the punishment strategy is given by x_C = 1 and x_D = 0, which is the TFT strategy. In general, strategy (x'_C, x'_D) is closer to TFT than strategy (x_C, x_D) is when both x'_C ≥ x_C and x'_D ≤ x_D are satisfied. When only one of the inequalities holds, however, which strategy is closer to TFT is not defined.

Case (2): Exploitative relationship as a failure to reach cooperation. Fig. 5-(B) shows an example of a trajectory that reaches an asymmetric relationship in which one player exploits the other. Initially, one player is closer to TFT than the other is. Both players pursue a cooperative relationship by punishing each other (as in Case (1)) in the beginning, but the latter player becomes too cooperative to punish the other. At time 10 in Fig. 5-(B), for instance, the former player (player 1) takes advantage of the latter's generous strategy (i.e., too much unconditional cooperation) and switches to defection, while the latter (player 2) does not increase punishment, so as to maintain the previous high probability of cooperation, which further increases player 1's defection probability. A finite degree of exploitation from player 1 to player 2 is thus fixed, and an exploitative relationship is achieved.
Case (3): Cooperative relationship recovered from exploitation. As seen in Fig. 5-(C), the initial difference in the strategies is larger than that in Case (2). The player closer to TFT initially starts to exploit the other (as in Case (2)). This exploitation, however, is too strong to be stable, and the exploited player increases punishment, leading to the cooperative relationship found in Case (1).
Case (4): Reversed exploitative relationship. An exploitative relationship is constructed between asymmetric players as in Case (2), but now the relationship is reversed: the player who is initially farther from TFT exploits the closer player, as seen in Fig. 5-(D).
The degree of punishment oscillates over time, and the identity of the more cooperative player switches.
If the difference in initial strategies increases further, the oscillation lasts longer, and which player exploits the other follows a complicated switching pattern. Finally, a reversed exploitative relationship is achieved.
When PD is not submodular, an asymmetric initial state also reaches the pure CC, as shown in Fig. 6. Thus, the basin structure, i.e., how each initial state reaches its final state, is simple. On the other hand, when PD is submodular, the basin structure is complicated, as seen in Fig. 7, in which the pure CC and DD states and exploitative relationships of various degrees are achieved. Slight differences in initial states lead to changes in the final state, especially near the boundary of the basin of the pure DD.
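The basin structure can be explored numerically by integrating the gradient form of the learning dynamics, ẋ_i = x_i(1 − x_i) ∂u_e/∂x_i, from a grid of initial strategies and labeling each final state. The following sketch (our own illustration; the thresholds, step sizes, and clipping are arbitrary choices, and derivatives are taken by central differences) classifies a single initial condition:

```python
import numpy as np

T, R, P, S = 5.0, 3.0, 1.0, 0.0  # submodular payoff matrix: T - R - P + S = 1

def equilibrium(xC, xD, yC, yD):
    d = 1.0 - (xC - xD) * (yC - yD)
    return (xD + (xC - xD) * yD) / d, (yD + (yC - yD) * xD) / d

def u_e(xC, xD, yC, yD):
    xe, ye = equilibrium(xC, xD, yC, yD)
    return R*xe*ye + S*xe*(1-ye) + T*(1-xe)*ye + P*(1-xe)*(1-ye)

def v_e(xC, xD, yC, yD):
    return u_e(yC, yD, xC, xD)  # the game is symmetric under player exchange

def classify(s, steps=2000, dt=0.01, h=1e-6, eps=1e-4):
    """Integrate the gradient dynamics from s = (xC, xD, yC, yD) and
    label the final state by its equilibrium cooperation probabilities."""
    s = np.asarray(s, dtype=float)
    for _ in range(steps):
        ds = np.empty(4)
        for i, f in enumerate((u_e, u_e, v_e, v_e)):
            a, b = s.copy(), s.copy()
            a[i] += h
            b[i] -= h
            ds[i] = s[i] * (1 - s[i]) * (f(*a) - f(*b)) / (2 * h)
        s = np.clip(s + dt * ds, eps, 1 - eps)  # keep probabilities valid
    xe, ye = equilibrium(*s)
    if xe > 0.95 and ye > 0.95:
        return "CC"
    if xe < 0.05 and ye < 0.05:
        return "DD"
    return "exploitation" if abs(xe - ye) > 0.05 else "intermediate"
```

Scanning `classify` over a grid of initial conditions yields a two-dimensional slice of the basin structure analogous to Fig. 7.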
From Fig. 7-(C), we observe successive changes through Cases (1)-(4) and further oscillations of punishment as the difference between the players' initial strategies grows. In addition, note that the payoff (and action) at the basin boundary between the pure CC and exploitation (i.e., between Cases (1) and (2)) is discontinuous: the collapse of cooperation results in a rather large degree of asymmetry in the payoffs. This discontinuous transition is due to positive feedback.

V. DISCUSSION

In this study, we formulated novel learning dynamics in which two players mutually update their probabilistic conditional strategies through a repeated game. This learning process is decomposed into frequency-dependent selection (i.e., the term x_C x̄_C) and adaptive learning (i.e., the term ∂u_e/∂x_C).
We analyzed the fixed-point attractors of the dynamical system of strategies. Interestingly, in addition to the pure DD and CC states, two-dimensional neutral fixed points with an exploitative relationship can be stably reached if the PD is submodular. Even though the two players follow the same learning dynamics and intend to optimize their payoffs, an asymmetric relationship can be achieved under certain conditions. Accordingly, when we observe an exploitative relationship, it is difficult to infer why one side is exploited by the other.
Our novel finding is that the exploitative relationship is stabilized by both the exploiting and exploited players. The exploiting player receives a higher payoff than the other player does and often receives a higher payoff than that under the pure CC. The exploited player receives a lower payoff than the other player does but secures at least the minimax payoff, which is obtained under the pure DD. In addition, this exploitative relationship is structured by asymmetric punishments against the other player's defection. Both players punish each other, but the exploiting (exploited) player defects (cooperates) more than the other does.
We then analyzed the transient dynamics for reaching the exploitative state. For submodular PD, the feedback of punishment leads to a temporal oscillation in which the eventual relationship shifts from cooperation to exploitation, back to cooperation, to exploitation by the other player, and so forth, depending on how close the initial strategies are to TFT. The basin structure is complicated, and slight differences in the initial strategies can lead to drastic changes in the final state.
Complicated strategies with memories of many previous actions are sometimes studied by using multi-agent learning models, such as coupled neural networks. As a result of reciprocal learning, the emergence of an exploitative relationship [18] and the endogenous acquisition of punishment [19] have been observed at some stage of the iterated PD. However, whether such a state is stable or transient has not been explored. Analyzing such states is rather difficult, because the dynamics are nondeterministic and extremely high-dimensional, as is generally the case in machine-learning studies. In contrast, our model is deterministic and low-dimensional, so that the stationary exploitative state can be clearly analyzed, which will also provide a basis for studying the behavior of complicated multi-agent systems.
Note that the PD game is the classic paradigm for the study of cooperation and defection.
Thus, the results of this study have general implications for the issues of cooperation, exploitation, and defection. Here, it is interesting to note that the emergence of exploitation depends on the payoff matrix (T, R, P, S). We have shown that submodular PD (which includes the standard case adopted in most previous studies, i.e., the matrix (5, 3, 1, 0)) generally justifies exploitation from both the exploiting and the exploited player's perspectives.
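As a quick arithmetic check, the standard matrix indeed satisfies the two inequalities used in the supplementary analysis of submodular PD ($T + S > 2P$ and $T + S < 2R$):

```latex
\begin{aligned}
(T, R, P, S) = (5, 3, 1, 0): \quad & T + S = 5 > 2P = 2,\\
& T + S = 5 < 2R = 6.
\end{aligned}
```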
It is often thought that exploitative relationships result from differences in players' abilities or environmental conditions. Whether and how players with the same learning abilities evolve toward the "symmetry breaking" associated with exploitation has remained unknown. We have shown that exploitation can emerge even between players with the same learning rule and the same payoff matrix, based only on differences in their initial strategies. Furthermore, the complicated basin structure that we observe implies that slight differences in the initial strategies can lead to an unexpected exploitative relationship with regard to which player exploits the other. This result provides a novel perspective on the origins of exploitation and complex societal relationships.

VI. ACKNOWLEDGEMENT
The authors would like to thank E. Akiyama, T.

VII. COMPUTATION OF EQUILIBRIUM STATE

In this section, we compute $p^e$, the equilibrium state of the repeated game in the main manuscript, which satisfies $M p^e = p^e$. Assuming $0 < x_C, x_D, y_C, y_D < 1$, there exists only one equilibrium state $p^e$, by the full connectivity of $M$. We can then separate $M$ as $M = X \otimes Y$. Here, $\bar{\cdot}$ is defined as $(1 - \cdot)$, as in the main manuscript. Note that $\otimes$ does not represent the usual Cartesian product but operates periodically between $Y$ and $X$. In other words, we can also separate $p^e$ as $p^e = a \otimes b$, with $a = (x^e, \bar{x}^e)^T$ and $b = (y^e, \bar{y}^e)^T$, where $Y$ ($X$) acts on $a$ ($b$) and feeds the output into $b$ ($a$). Regardless of this complicated operation rule, since the equilibrium state is unique and fixed, we simply obtain the pair $(x^e, y^e)$ self-consistently. Finally, we obtain the equilibrium state of the repeated game as $p^e = (x^e y^e, x^e \bar{y}^e, \bar{x}^e y^e, \bar{x}^e \bar{y}^e)^T$.
A. Introduction of response function and interpretation of equilibrium state

The above separation of $M$ is useful for understanding not only the equilibrium state but also the dynamical process itself, through the idea of a response function [Fujimoto2019]. We generally consider the situation in which every player acts unconditionally on the other's previous action; in other words, $p$ is written as $(x, \bar{x})^T \otimes (y, \bar{y})^T$ with $0 < x, y < 1$. Then, we can also separate the next-period state $p'$, and from Eqs. 18 and 22, we derive $f_x$ and $f_y$ as
$$f_x(y) = x_D + (x_C - x_D)\, y, \qquad f_y(x) = y_D + (y_C - y_D)\, x.$$
Thus, $f_x$ (i.e., 1's probability of cooperating in the next period) is a linear function of 2's probability of cooperating in the present period, $y$, taking the value $x_C$ ($x_D$) for $y = 1$ ($0$). The equilibrium state is the crossing point of the two response functions, $x^e = f_x(y^e)$ and $y^e = f_y(x^e)$.

VIII. DERIVATION OF LEARNING DYNAMICS

In this section, we derive the learning dynamics represented by Eqs. 7, 9, and 10 in the main manuscript. For this, we need to compute $u_C - u_D$. Here, we define $p_{2C} := (x^e, 0, \bar{x}^e, 0)$ and $p_{2D} := (0, x^e, 0, \bar{x}^e)$, which are straightforwardly derived. In the same way, we obtain $v_C - v_D$.

A. Interpretation of learning dynamics

The learning dynamics of $x_C$, $x_D$, $y_C$, and $y_D$ are intuitively interpreted by using the above response functions. We first focus on the second term, in other words, $y^e$, $\bar{y}^e$, $x^e$, and $\bar{x}^e$ in the equations for $\dot{x}_C$, $\dot{x}_D$, $\dot{y}_C$, and $\dot{y}_D$, respectively. We directly obtain
$$\left(\frac{\partial x^e}{\partial x_C}, \frac{\partial x^e}{\partial x_D}, \frac{\partial y^e}{\partial y_C}, \frac{\partial y^e}{\partial y_D}\right) \propto (y^e, \bar{y}^e, x^e, \bar{x}^e).$$
These equations show that the movement of the crossing point $(x^e, y^e)$ under a change of $x_C$, $x_D$, $y_C$, or $y_D$ is proportional to $(y^e, \bar{y}^e, x^e, \bar{x}^e)$, i.e., to the second term in Eqs. 7, 9, and 10 in the main manuscript, respectively.
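The factorized equilibrium state described above can be checked numerically. The following sketch assumes reactive strategies (player 1 cooperates with probability $x_C$ or $x_D$ after 2's cooperation or defection, and analogously for player 2); the state ordering (CC, CD, DC, DD) and the strategy values are illustrative choices, not taken from the paper.

```python
# Numerical check: the stationary state of the repeated game factorizes as
# p_e = (x_e*y_e, x_e*(1-y_e), (1-x_e)*y_e, (1-x_e)*(1-y_e)),
# with (x_e, y_e) the crossing point of the linear response functions.

def transition_matrix(xC, xD, yC, yD):
    """Row-stochastic M over states (CC, CD, DC, DD); state = (1's, 2's action)."""
    rows = []
    for a1, a2 in [(1, 1), (1, 0), (0, 1), (0, 0)]:
        p1 = xC if a2 == 1 else xD   # 1 reacts to 2's previous action
        p2 = yC if a1 == 1 else yD   # 2 reacts to 1's previous action
        rows.append([p1 * p2, p1 * (1 - p2), (1 - p1) * p2, (1 - p1) * (1 - p2)])
    return rows

def stationary(M, n=2000):
    p = [0.25] * 4                    # power iteration from the uniform state
    for _ in range(n):
        p = [sum(p[i] * M[i][j] for i in range(4)) for j in range(4)]
    return p

xC, xD, yC, yD = 0.9, 0.2, 0.8, 0.3   # illustrative interior strategies
p_e = stationary(transition_matrix(xC, xD, yC, yD))

# Crossing point of x = f_x(y) = xD + (xC - xD) y and y = f_y(x) = yD + (yC - yD) x:
a, b = xC - xD, yC - yD
x_e = (xD + a * yD) / (1 - a * b)
y_e = yD + b * x_e
product = [x_e * y_e, x_e * (1 - y_e), (1 - x_e) * y_e, (1 - x_e) * (1 - y_e)]
assert all(abs(pi - qi) < 1e-9 for pi, qi in zip(p_e, product))
```

The product form holds because, for reactive strategies, each player's next action depends only on the other's previous action, so the joint distribution stays factorized under the dynamics.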
Next, we focus on the third term, in other words, $u_C - u_D$ in $\dot{x}_C$, $\dot{x}_D$ and $v_C - v_D$ in $\dot{y}_C$, $\dot{y}_D$, respectively. It is directly obtained that
$$\frac{du(x^e, f_y(x^e))}{dx^e} = \left.\frac{\partial u}{\partial x}\right|_{\rm eq} + (y_C - y_D)\left.\frac{\partial u}{\partial y}\right|_{\rm eq} = -\{y^e(T-R) + \bar{y}^e(P-S)\} + (y_C - y_D)\{x^e(R-S) + \bar{x}^e(T-P)\},$$
and similarly for $v$. Here, $u(x, f_y(x))$ and $v(y, f_x(y))$ denote 1's and 2's own payoffs, respectively, computed while recognizing the other's response function. Thus, Eqs. 29 show that each player's payoff gap between C and D is proportional to the gradient of the player's own payoff along the other's response function.

IX. ANALYSIS OF FIXED POINTS
In this section, we prove that the only stable fixed points on the boundary of actions, i.e., with $x^e = 0, 1$ or $y^e = 0, 1$, are pure CC ($x^e = y^e = 1$) and DD ($x^e = y^e = 0$). To begin, we categorize the fixed points on the boundary into (1) $x^e = y^e = 0$, (2) $x^e = 0$ and $0 < y^e < 1$, (3) $x^e = 0$ and $y^e = 1$, and (4) $0 < x^e < 1$ and $y^e = 1$; the remaining boundary cases are obtained from these by the symmetry of the setup.
(2): From $x^e = 0$ and $0 < y^e < 1$, we get $x_C = x_D = 0$ and $y_D = y^e$. However, since $v_C - v_D < 0$ always holds here, $\dot{y}_D$ is negative, which results in a decrease of $y^e$. In other words, the states with $x^e = 0$ and $0 < y^e < 1$ are not fixed points at all.
(3): From $x^e = 0$ and $y^e = 1$, we get $x_C = 0$ and $y_D = 1$. Then, since $\dot{x}_C = \dot{x}_D = \dot{y}_C = \dot{y}_D = 0$ holds, these states can be fixed points. However, since $v_C - v_D < 0$ always holds, $\dot{y}_D$ is negative in the neighborhood of all these fixed points. Therefore, all the states with $x^e = 0$ and $y^e = 1$ are unstable, resulting in deviation from them.
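The inequality $v_C - v_D < 0$ invoked in cases (2) and (3) can be checked in one line from the gradient form of the payoff gap along the opponent's response function derived above (player 2's counterpart of the expression for player 1): with $x_C = x_D = x^e = 0$,

```latex
v_C - v_D \;\propto\; \frac{dv(y^e, f_x(y^e))}{dy^e}
= -\{x^e(T-R) + \bar{x}^e(P-S)\} + (x_C - x_D)\{y^e(R-S) + \bar{y}^e(T-P)\}
= -(P - S) < 0,
```

since $P > S$ in the prisoner's dilemma.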
(4): From $0 < x^e < 1$ and $y^e = 1$, we get $x_C = x^e$ and $y_C = y_D = 1$. However, since $u_C - u_D < 0$ always holds here, $\dot{x}_C$ is negative, which results in a decrease of $x^e$. In other words, the states with $0 < x^e < 1$ and $y^e = 1$ are not fixed points at all.

X. CHARACTERIZATION OF EXPLOITATIVE RELATIONSHIP
In this section, we derive the characteristics of the exploitative relationship: (1) each player punishes the defector to some degree, and (2) one player moves closer to being a pure defector as the degree of exploitation ($x^e - y^e$) increases.
First, (1) requires $x_C - x_D > 0$ in equilibrium for all $0 \le x^e, y^e \le 1$. This is proven using the condition for the prisoner's dilemma ($T > R > P > S$).
Second, (2) requires all of $\partial x_C/\partial y^e < 0$, $\partial x_D/\partial y^e < 0$, $\partial x_C/\partial x^e > 0$, and $\partial x_D/\partial x^e > 0$. In the proof, we additionally use the conditions $T + S > 2P$ and $T + S < 2R$, which are derived from the submodularity condition.

A. Analysis of exploitation
In this section, we derive the boundaries of the exploitative relationship. From Eqs. 16 in the main manuscript, we obtain the boundary conditions for $(x^e, y^e)$; the boundary conditions for $y_C^* = 1$ and $y_D^* = 0$ are obtained in the same way. Fig. S1 shows each of the boundary conditions of the exploitative relationship on the $(x^e, y^e)$-plane. Furthermore, we also compute the most exploitative state from 1 to 2, where $y^{*e} - x^{*e}$ is maximized within the region of stable fixed points. As shown in Fig. S1, such a state is equivalent to the crossing point of the lines in $(x^e, y^e)$ given by the conditions $y_C^* = 1$ and $x_D^* = 0$. The most exploitative relationship from 2 to 1 is obtained in the same way.
As an example, we concretely obtain the maximal degree of exploitation that can be established. From this, we can confirm that the exploiting side 1 receives a higher payoff than under pure CC ($u^{*e} > R$), and the exploited side receives a higher payoff than under pure DD, i.e., at least the minimax payoff ($v^{*e} > P$).

XI. DEPENDENCE ON LEARNING SPEEDS
In this section, we study how the region of stable fixed points changes when a difference in learning speeds between the players is introduced. Here, we define $S_{1C}$, $S_{1D}$, $S_{2C}$, and $S_{2D}$ as the speeds for the relaxation of $x_C$, $x_D$, $y_C$, and $y_D$, respectively. We thus extend our learning dynamics (Eqs. 7, 9, and 10 in the main manuscript) as
$$\dot{x}_C = S_{1C}\, x_C \bar{x}_C\, y^e (u_C - u_D), \qquad \dot{x}_D = S_{1D}\, x_D \bar{x}_D\, \bar{y}^e (u_C - u_D),$$
and analogously for $\dot{y}_C$ and $\dot{y}_D$ with $S_{2C}$ and $S_{2D}$. Of course, the case $S_{1C} = S_{1D} = S_{2C} = S_{2D}$ (i.e., symmetric learning speeds) is equivalent to the original model. We now consider two kinds of asymmetry. One is the asymmetry between C and D, i.e., $S_C := S_{1C} = S_{2C}$ and $S_D := S_{1D} = S_{2D}$ but $S_C \neq S_D$. The other is the asymmetry between players, i.e., $S_1 := S_{1C} = S_{1D}$ and $S_2 := S_{2C} = S_{2D}$ but $S_1 \neq S_2$. Here, recall that the dynamics of $x_C$ ($y_C$) are slower near the boundary $x_C = 1$ ($y_C = 1$) because of the frequency-dependent selection term.
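As a numerical illustration of the extended dynamics, the following self-contained sketch multiplies each right-hand side by its speed factor. It is a reconstruction under the same assumptions as before (reactive strategies, linear response functions, payoff gaps proportional to the gradient along the opponent's response function), not the authors' code; the initial strategies are illustrative. Setting a player's speeds to zero freezes that player's strategy exactly.

```python
# Learning dynamics with per-variable speed factors S_1C, S_1D, S_2C, S_2D.
# All speeds equal to 1 recovers the symmetric model.

T, R, P, S = 5.0, 3.0, 1.0, 0.0

def step(s, speeds, dt=0.01):
    xC, xD, yC, yD = s
    S1C, S1D, S2C, S2D = speeds
    a, b = xC - xD, yC - yD
    xe = (xD + a * yD) / (1.0 - a * b)        # equilibrium cooperation probs
    ye = yD + b * xe
    gu = -(ye * (T - R) + (1 - ye) * (P - S)) + b * (xe * (R - S) + (1 - xe) * (T - P))
    gv = -(xe * (T - R) + (1 - xe) * (P - S)) + a * (ye * (R - S) + (1 - ye) * (T - P))
    return (xC + dt * S1C * xC * (1 - xC) * ye * gu,
            xD + dt * S1D * xD * (1 - xD) * (1 - ye) * gu,
            yC + dt * S2C * yC * (1 - yC) * xe * gv,
            yD + dt * S2D * yD * (1 - yD) * (1 - xe) * gv)

s = (0.6, 0.4, 0.6, 0.4)
for _ in range(1000):
    s = step(s, (1.0, 1.0, 0.0, 0.0))  # extreme asymmetry: player 2 does not learn
# player 2's strategy (s[2], s[3]) remains exactly (0.6, 0.4)
```

The extreme case $S_2 = 0$ reduces to the one-sided learning setting of Press and Dyson, which is why intermediate speed asymmetries interpolate between the symmetric model and one-way adaptation.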
Fig. S2-(A) shows how the region of stable fixed points changes when the learning of cooperation is faster than that of defection ($S_C \ge S_D$). In this case, the faster the learning of cooperation is, the more difficult it is to achieve a state with $0 < x_C < 1$ and $0 < y_C < 1$. Therefore, the region in which stable fixed points exist is limited to around the boundary of either $x_C = 1$ or $y_C = 1$ (compare Fig. S2-(A) with Fig. S1). Fig. S2-(B) shows how the region of stable fixed points changes when the learning of defection is faster than that of cooperation ($S_D \ge S_C$). For the same reason, the region in which stable fixed points exist is limited to around the boundary of either $x_D = 0$ or $y_D = 0$ (compare Fig. S2-(B) with Fig. S1). Around the boundary of $y_C^* = 1$ (and also in the vicinity of $y_D^* = 0$), the oscillatory dynamics hardly appear because of the relative slowness of $y_C$ and $y_D$.
Note also that the most exploitative state from 1 to 2, in which $x_C^* = 1$ and $y_D^* = 0$ are satisfied, is stable for all learning speeds. Such a state is easy to achieve, because the dynamics of two of the four variables are slow around it.