Ubiquitous power law scaling in nonlinear self-excited Hawkes processes

The origin(s) of the ubiquity of probability distribution functions (PDF) with power law tails is still a matter of fascination and investigation in many scientific fields from linguistic, social, economic, computer sciences to essentially all natural sciences. In parallel, self-excited dynamics is a prevalent characteristic of many systems, from the physics of shot noise and intermittent processes, to seismicity, financial and social systems. Motivated by activation processes of the Arrhenius form, we bring the two threads together by introducing a general class of nonlinear self-excited point processes with fast-accelerating intensities as a function of"tension". Solving the corresponding master equations, we find that a wide class of such nonlinear Hawkes processes have the PDF of their intensities described by a power law on the condition that (i) the intensity is a fast-accelerating function of tension, (ii) the distribution of marks is two-sided with non-positive mean, and (iii) it has fast-decaying tails. In particular, Zipf's scaling is obtained in the limit where the average mark is vanishing. This unearths a novel mechanism for power laws including Zipf's law, providing a new understanding of their ubiquity.

Introduction. Many different types of data in the natural and social sciences exhibit power law density distributions of the size or frequencies of their characteristic variables. Namely, the probability densify function (PDF) P (S) of a variable S is given by P (S) ∼ 1/S 1+α for large S values, with α > 0. Many mechanisms have been proposed to rationalise it [1][2][3][4], such as proportional growth with additional conditions [5], family transformation of the Bose-Einstein distribution [6], least-effort principles [7], optimisation between efficiency and faithfulness of self-reproduction [8] and so on.
Self-excited point processes assume that past events strongly influence the occurrence of future events. The Hawkes process [9] is the simplest such process, where the intensity (probability per unit time that a new event occurs) is linear in the sum of the triggering influence of all past events. In the last decade, the Hawkes process and generalisations have enjoyed an explosive growth in the investigation of their properties and in a large set of applications in all fields of knowledge [10][11][12][13].
Theoretically challenging, nonlinear self-excited processes have been scarcely investigated [14,15] except for a few special cases [16], even if they are a priori more suited to represent the interplay between stochasticity and nonlinear dynamics in many complex systems. Here, we study a class of nonlinear Hawkes processes characterised by fast-accelerating intensities as a function of an auxiliary field called the "tension", and report the first explicit solution that is applicable to a wide class of nonlinear Hawkes processes. We find that this class of nonlinear Hawkes family universally exhibits intensity distributions with power law tails. In particular, Zipf's scaling nat-urally appears when the distribution of marks is symmetric. A weaker condition is that the average mark is vanishing. These models are motivated by activation processes of the Arrhenius form, which are relevant in many applications in physics and also in seismicity and finance modelling as explained below.
Model. The key ingredients of the nonlinear self-exciting Hawkes process considered here are the intensity λ(t) and tension ν(t). Let us introduce a time series {t i } i , representing the timestamps of events, such as earthquakes, retweets on Twitter, or neural discharges in a brain. The intensity λ(t) fully characterises the statistics of occurrence of events, such that an event occurs with probability λ(t)dt during the interval [t, t + dt). We assume that the intensity is a nonlinear positive and monotonically-increasing function of the system tension ν(t), λ(t) = g(ν(t)) , where g(ν) is called the tension-intensity map. The tension ν(t) quantifies the total stress due to historical events, such as resulting from elastic deformations of the crust induced by earthquakes. In finance, λ(t) can represent the rate of volatility jumps and ν(t) is the rate of financial returns whose amplitude exceeds some threshold. The tension at a given time is obtained as the sum of perturbations over all past events (see Fig. 1(a) for a realisation), such that where each event i has a mark y i distributed according to the PDF ρ(y) and N (t) is the number of events during [0, t).
Conditions. There is a large variety of nonlinear Hawkes processes defined via the pair of functions g(ν) and ρ(y).
Here, the Bachmann-Landau-like inequality notation a(x) > O(b(x)) means lim x→∞ a(x)/b(x) = ∞. Also, the condition (iii) essentially means that fat-tail mark distributions, such as power law distributions, are out of scope in this Letter. Remarkably, all nonlinear Hawkes processes satisfying these three conditions have their steady-state intensity PDFs obeying the universal power law scaling, as we show below. Typical analytical forms satisfying condition (i) include g(ν) ∝ ν n with n > 2 and which is motivated by the physics of rupture [17] and earthquakes [18,19], modelled as activated processes following an Arrhenius law. Indeed, we assume that the tension ν is proportional to the seismic energy E (itself proportional to the total mechanical stress in the Earth crust), and let us assume that an earthquake happens if the system's state jumps over an energy barrier E 0 from a metastable state to another. According to the Arrhenius law, the escape rate is proportional to e −β(E0−E) ∝ e βν with a disordered-enhanced effective inverse temperature ∼ β [17], consistently with Eq. (3). This exponential dependence (3) also encompasses the class of multifractal processes emerging from the interplay between exponential activation and long memory [20], which have been shown to be relevant to model financial volatility [21]. By construction, the tension is dependent on all the marks of previous events, while future marks are drawn independently of the past history. This is consistent with the empirical unpredictability of earthquake magnitudes. Condition (ii) guarantees the stationarity of the model as a result of the cumulative contribution of the negative marks y i < 0, which prevent ν from diverging. An event with a negative (positive) mark y i is likely to inhibit (induce) future events. The coexistence of events that inhibit and of events that promote future activity in our nonlinear Hawkes model is a fundamental extension to the general class of Hawkes processes. This allows us to realistically account for ubiquitous inhibitory effect in real complex systems, such as the random mechanical stress-relaxation after earthquake in seismology, or inhibitory synaptic potentials in neural networks. Note that, in contrast, the standard Hawkes process and many other versions only have positive marks, corresponding to taking into account excitations exclusively.
Power law intensity PDF. When conditions (i)-(iii) are satisfied, the steady-state PDF of the intensity λ is analytically given by with c * being the nonnegative root of Φ(c * ) = 0 (see SM). This formula readily reduces to various power law asymptotic forms, such as λ 0 1 n (for g(ν) λ 0 ν n , n > 2). (5) Beyond conditions (i)-(iii), no other properties or details, including the shape of the memory function, change the robust classes given by expressions (5).
Zipf's law. Result (5) implies that Zipf's scaling appears as an important subclass of the non-linear Hawkes processes as a special case a = 0: except for minor logarithmic corrections. The condition a = 0 is realised exactly when c * = 0, which holds for zero-mean mark m = 0, implying a = c * /h(0) = 0. This is for instance realised for symmetric mark distribution ρ(y) = ρ(−y). Approximate Zipf's distributions are obtained when a = c * /h(0) is small, which occurs for large h(0). Symmetric mark distributions are realised in the physics of earthquakes as discussed in [19]. Indeed, the stress perturbations induced by a (small) earthquake correspond to the stress field of a double-couple, which can be simply represented by a concentrated set of four forces of the same norm, summing to zero (zero total force) and with total torque also equal to zero. A large earthquake is just a set of double-couple sources places along its fault surface. The stress induced by a doublecouple has a nice butterfly symmetry with four lobes, two positive and two negative ones, and is perfectly symmetric. With the correspondence that the tension ν is proportional to stress, and is given by (2), and that the exponential intensity function (3) derives from the physics of earthquake nucleation with Arrhenius law with an effective temperature [19], this justifies the symmetric property of the distribution of marks for earthquakes.
The results of numerical simulations for m = 0 (zero mean marks) are presented in Fig. 1. Panel a) shows a typical realisation of ν for case (3), while panel b) shows the derived temporal evolution of λ. Panel c) shows the corresponding steady-state PDF of λ obeying Zipf's law (see also SM for numerical simulations for the negative-mean cases m < 0).
Field-master equation. Our general result for a large class of memory functions can be derived using our recently introduced field-master-equation framework [22,23] (see also Appendix for the technical detail). The main idea is to convert the original low-dimensional non-Markovian stochastic process onto a high-dimensional Markovian field dynamics. This technique is called Markov embedding and has been applied for some special cases, such as memory functions composed of discrete sums of exponentials (see Refs. [24,25] for the generalised Langevin equation and Refs. [26,27] for Hawkes processes).
The Markov embedding scheme can be formulated for the nonlinear Hawkes process (1) with (2) as follows. Let us decompose the memory kernel h(t) as a continuous sum of ex-ponentials. This amounts to representing h(t) as a Laplacelike transform of another functionh(x) of the auxiliary variable x ∈ (0, ∞): Based on this decomposition, the original process (1) with (2) is equivalent to a Markovian stochastic partial differential equation (SPDE) for the excess tension with the total tension ν(t) = ∞ 0 dxz(t, x) (see Figs. 2a and 2b for schematics of the Markov embedding scheme) and the compound Poisson noise ξ P ρ(y);λ(t) = Remarkably, while the original process is non-Markovian in a one-dimensional space ν(t), the field dynamics is Markovian The equivalence between the original nonlinear Hawkes process (1) with (2) and the SPDE (8) can be shown as follows: the formal solution of the SPDE (8) is given by The total tension is then given by . This is equivalent to (2). We thus find that the Markovian SDE (8) is a correct representation after Markov embedding.
The SPDE (8) can be regarded as the "physical dynamics" of the field variable {z(t, x)} x∈R + , since x can be considered as the "physical position" in R + := (0, ∞) on which the field is evaluated. This interpretation has the advantage that the functional methods for various SPDEs of stochastic field dynamics are available for advanced analytics (e.g., the functional Fokker-Planck equations for the reaction-diffusion equations [28]).
Since the SPDE (8) is Markovian, we can obtain the corresponding master equation. By introducing the probability density functional (PDF) P t [z] := P t [{z(t, x)} R + ], the field master equation is given by with the advective and jump Liouville operators L A and L J , respectively, defined by with G[z] := g( ∞ 0 z(t, x)dx) and ρ(y) is the mark distribution.
Note that P t [z] is a path probability measure: the probability is given by P t [z]Dz that the configuration of the FIG. 2: (a-b) Schematic of the Markov embedding from the one-dimensional non-Markovian process ν(t) to the infinite-dimensional Markovian field dynamics {z(t, x)} x∈R + . The original process (a) is non-Markovian because its time evolution (1) with (2) depends on all the history {ν(s)} s≤t . On the other hand, the field dynamics (b) is Markovian because its time evolution (8) depends only on the current configuration of the field variable {z(t, x)} x∈R + . Note that the auxiliary field variable x ∈ R + is introduced according to Eqs. (7) and (8) and is interpreted as a "position" at which the field is evaluated. The decay speed is faster for smaller x, while it is slower for larger x according to Eq. field variable is nearly-equal to {z(t, x)} R + , where Dz := x∈R + dz(x) is the path-integral volume element. In addition, the ensemble average A is given by the path integral A := AP t [z]Dz. Technically, the field master equation (9) should be interpreted as a formal limit from discrete underlying descriptions according to the standard convention (see Ref. [28] and Appendix). The steady-state solution P ss [z] is related to the steady-state intensity PDF P ss (λ) Derivation. We now provide an outline of the theoretical derivation of the solution of the field-master equation (see Appendix for more detailed calculations and an illustrativecase study with the exponential memory). Let us introduce φ[z] := G[z]P ss [z] to rewrite Eq. (9) in the steady state as (11) Since the first term is negligible for large z assuming the condition (i), the asymptotic solution satisfies Under conditions (ii) and (iii), its solution is given by Here C 0 is an arbitrary functional without W as an argument. After a path integral for marginalisation, we obtain (13) Equation (4) then follows by the change of variable ν → λ.
Intuition. Let us consider the case of an exponential growing intensity (3) and a simple exponential memory kernel h(t) = (n/τ )e −t/τ , where the integral of the memory, n = ∞ 0 h(t)dt > 0, would be interpreted as the branching ratio in the linear case. Suppose that the initial tension is zero and thus the initial intensity is λ 0 . Naively, one could infer that the typical waiting time till the next event, the expected event interval (EEI), is given by 1 λ0 . Choosing the parameters such that τ 1 λ0 would imply that the influence of an event in triggering future events is extremely localised temporally and one should expect no intermittency, no power laws and a rather trivial behaviour. This reasoning is wrong as it neglects the nonlinear nature of the model with strong feedback loops. Indeed, defining the small parameter η := λ 0 τ 1, after one event occurs with positive mark y > 0, the intensity is given instantaneously by λ(t) = λ 0 e βλ0yn/η and the corresponding EEI is of the order of 1 λ(t) = 1 λ0 e −βλ0yn/η . Paradoxically, as the memory τ of the event is vanishingly smaller than the naive characteristic time scale 1 λ0 , the time needed for the next event to be triggered becomes exceedingly smaller, since 1 λ0 e −βλ0yn/η τ 1 λ0 for sufficiently small η such that η ln 1 η βλ 0 yn. Hence, in contradiction with the naive view, a very short memory enhances triggering and creates a very rich bursty dynamics of events. Readily generalised to multiple events, this reasoning gives an intuition on the basic source of the scale-free nature of the power-law intensity PDF (5), suggesting the absence of both characteristic intensity and EEI.
Number-of-events statistics. The intensity PDF is a fundamental quantity to characterise temporal properties of point processes and allows one to derive various other quantities. One such variable that is directly observable is the total number of events N twin occurring in a finite time window t win . Assuming symmetric mark distributions ρ(y) = ρ(−y), we show that Zipf's law also holds for the distribution of N twin . For simplicity, we consider the diffusive scaling limit (i.e., essentially equivalent to the system-size expansion [28], an established perturbative method invented by van Kampen [29] based on realistic scaling assumptions; see SM for a brief review) by introducing a small parameter : with -independent functionsḡ andρ. We focus on the case withḡ(ν) = λ 0 e βν . In this diffusive limit, corresponding to the mark size being typically much smaller in absolute value than the tension at any given time, the statistics of N twin obeys Zipf's law for a sufficiently short time window t win (see Fig. 2c): as an intermediate asymptotics [30] with upper cutoff N cut = O( −2 ). This relation can be derived from a superposition of Poisson statistics. Let us consider a long time series in [0, T ) and then randomly select a timepoint τ ∈ [0, T ). For a sufficientlyshort time window [τ, τ + t win ), we can assume that λ(t) is constant and the number of events obeys the Poisson statistics P (N twin |λ) = (λt win ) Nt win e −λtwin /N twin !. Choosing τ randomly and neglecting dependences between count numbers across different windows, the unconditional distribution is given by the superposition of the Poisson distribution as This examples shows that Zipf's law (6) for the intensity PDF P ss (λ) is directly relevant to Zipf's laws for other observable quantities.
Theoretically, relation (15) is expected to hold only up to the cutoff N cut (see Appendix), which diverges as → 0, guaranteeing the robust universality of Zipf's law for P ss (N twin ) in the diffusive limit. Beyond the cutoff, we numerically observe a fatter tail stemming from dependences between count numbers in adjacent time windows, which becomes dominant at very high count numbers, as can be seen from its impact on ν given by (2).
Conclusion. As power laws are widely observed in many complex systems, our theoretical finding suggests the nonlinear self-excited mechanism as an explanation for the universality of power laws. Intuitively, these properties emerge from the intricate interplay between a kind of multiplicative process, memory and endogeneity / reflexity. Our new tools and results will be useful for data analysis of real complex systems. Interested readers are referred to Ref. [31] for more mathematical details. Let us first focus on the case of a superposition of exponentials: For this case, Eq. (2) can be converted into the following Markovian dynamics, with the state-dependent Poisson noise ξ P ρ(y);λ(t) , defined by where t i is the ith event time and y i is a random number obeying a given distribution ρ (y). Note that the probability that an event occurs within interval [t, t + dt) is given by This technique [26,27] is called Markov embedding, where low-dimensional non-Markovian dynamics is converted onto higherdimensional Markovian dynamics.
We note that this Markov embedding framework is sufficiently general since any memory kernel h(t) can be written as a continuous sum of exponentials (via the Laplace transformation), which can be approximated by the discrete-sum formula (A1), such that In this sense, the discrete representationh k corresponds to the continuous function representationh(x) via this relationship. This method can be formally generalised for the general continuous sum of exponentials as shown in Appendix B 4.

Numerical scheme
We have numerically studied Eq. (2) based on the Monte Carlo simulations of Eq. (A2) for Fig. 1. Let us introduce a discretised time series, with mean m and variance σ 2 . The time step ∆s k in Eq. (A7) must be sufficiently small, such that λ(s i )∆s i 1. We therefore employ an adaptive scheme with ∆t (1) max and ∆t (2) max . In addition, we introduce a finite cutoff for the tension-intensity map, with λ max = 10 7 , to control rounding error. For the numerical trajectory generated by Eq. (A7), we obtain the empirical steady intensity distribution as under the assumption of ergodicity. In addition, we have applied a parallel computing method, with a total number N PC of parallel threads.
a. Numerical simulation for the negative mean mark case m < 0 Since the PDF obeying Zipf's law is shown for the zero mean mark case in the main text as Fig. 1c, here we provide numerical simulations for the cases with negative mean marks m ≤ 0, where the exponent deviates from Zipf's scalings. Our theory predicts the relation for the Gaussian mark distribution ρ(y) = e −(y−m) 2 /(2σ 2 ) / √ 2πσ 2 (see Appendix F 4 c for the derivation), which agrees with the numerical intensity PDFs as shown in Fig. 3.

b. Parameters
In our work, we set the parameters summarised in Table I for numerical simulations.

Appendix B: Master equation
Since Eq. (A2) is Markovian, we can derive the corresponding master equation (i.e., the time-evolution equation for the probability density function (PDF)). Let us consider the phase point z := (z 1 , . . . , z K ) and its PDF P t (z). Indeed, the PDF satisfies the following master equation where we introduce G(z) := g K k=1 z k andh := h 1 , . . . ,h K . In the following, we focus on the case with non-positive mean mark: Remarkably, the solution explodes for m > 0 (see Appendix F 3 for an intuitve discussion on the condition of the explosive solutions).

Derivation
Equation (B1) can be derived as follows. Let us introduce an arbitrary function f (z). The time-evolution of f (z) is given by with jump size y obeying a given PDF ρ(y). We take an ensemble average to obtain Here we integrate by part to obtain We also apply a variable transformation z + yh → z to obtain This yields the identity Since this identity holds for an arbitrary function f (z), we obtain Eq. (B1).

Solution for zero-mean mark distributions
The asymptotic solution of the master equation (B1) can be obtained as follows. Let us define the steady PDF P ss (z) := lim t→∞ P t (z) and φ(z) := G(z)P ss (z) to rewrite Eq. (B1) as for t → ∞. For large z, the first term is negligibly small for the fast-accelerating intensity G(z) We thus obtain Let us apply a variable transformation from z = (z 1 , . . . , z K ) to Z := (W, Z 2 , . . . Z K ) with which leads to By defining with Z := (Z 2 , . . . , Z K ), we can rewrite Eq. (B9) as This form of the integral equation is useful because the dependence on y disappears for Z , and can be regarded as an effectively one-dimensional integral equation.
With the condition that the mark distribution has zero mean m := ∞ −∞ yρ(y)dy = 0 with fast-decaying tail, the solution of this integral equation is given by with arbitrary functions C 0 (Z ) and C 1 (Z ) that do not have W as an argument (see Appendix F 4 for the derivation). As confirmed soon later, the natural boundary condition requires C 1 (Z ) = 0 and thus the general solution is finally given by The tension distribution in the steady state P ss (ν) := lim t→∞ δ(ν − ν(t)) is given by marginalisation of the full distribution as for large ν, assuming that ∞ −∞ C 0 (z 2 , . . . , z K )Π K j=2 dz j is finite (see also Appendix F 5 for the detailed calculation below). This implies Eq. (C7) in the main text. This implies that the steady intensity PDF is given by where we have used the Jacobian relationship P ss (λ) = |dν/dλ|P ss (ν), representing the conservation of probability under a change of variable.
a. Consistency with the natural boundary condition.
Technically, the steady state solution of the master equation should satisfiy the natural boundary condition, requiring a vanishing probability current at z → ∞. Here we impose this condition on the steady state solution (B14). For large z, the master equation is asymptotically given by which is equivalent to after the variable transformation z → Z := (W ; Z ) defined by Eq. (B10). The probability current is defined by the Kramers-Moyal expansion: (B20) The natural boundary condition requires the vanishing probability current at W = +∞ as Since α 1 = m = 0 for the zero-mean mark m = 0, in the steady state, the natural boundary condition is given by which requires that C 1 (Z ) = 0.
b. Polynomial intensity case: g(ν) ∝ ν n for some n > 2 We next study the power law forms of the intensity, by assuming various tension-intensity maps. Let us first consider the polynomial case of g(ν) = λ 0 + λ 1 ν n for some n > 2. This means that This means that the asymptotic form is given by the quasi-Zipf law: c. Superpolynomial intensity case g(ν) > O(ν n ) with any n > 2 To develop some intuition, let us consider two typical cases. One typical case is given by g(ν) = λ 0 e βν , implying that Another typical case is given by the super-exponential case g(ν) = λ 0 e βν n with n > 1. This case implies This means that Zipf's law holds up to the minor logarithmic factor. Let us generalise these Zipf's law for general superpolynomial cases as follows. Considering the relation we obtain For most of physically motivated functions λ = g(ν), the logarithmic contribution from {log λ} −1 is subleading compared with Zipf's part λ −2 except for the polynomial intensity. Indeed, by assuming the asymptotic balance between the logarithmic and the power law parts as (d/dν) log λ(ν) Cλ a with some real numbers a = 0 and C < ∞, we deduce the polynomial intensity λ(ν) = g(ν) ∝ ν −1/a as the corresponding exception. Thus, we find that the logarithmic factor {(d/dν) log λ(ν)} −1 is a minor correction term for the superpolynomial cases.

Solution for negative-mean mark distributions
The above calculation can be generalised by assuming that the mean mark is negative and that the probability of the positive marks is nonzero: where C 0 (Z ) and C 1 (Z ) are arbitrary functions without W as an argument (see Appendix F 4 for the derivation) and c * > 0 is the unique positive root of Φ(c * ) = 0 for the moment-generating function defined by We can prove that Φ(c) = 0 has only two roots at c = 0 and c = c * > 0 as shown in Appendix F 4. The outline of the proof is as follows: since the second order derivative of Φ(x) is always positive (see Eq. (F18) below), the first order derivative is an increasing function of x. If the mean of y is zero, at x = 0, the first order derivative is equal to zero, and thus positive for x > 0. This proves that c * = 0 for m = 0. If the mean of y is negative, the first order derivative is negative at x = 0 but it increases and passes positive for larger x. Thus Φ(x) first decreases below 0 and then crosses it again at some c * , which is the solution. Finally, the natural boundary condition requires that C 1 (Z ) must be zero as shown later soon: C 1 (Z ) = 0. We then obtain the general solution The tension distribution in the steady state P ss (ν) := lim t→∞ δ(ν − ν(t)) is given by marginalisation of the full distribution as which is derived in Appendix F 6. This implies that the steady intensity PDF is given by where we have used the Jacobian relationship P ss (λ) = |dν/dλ|P ss (ν), representating the probability conservation.
a. Consistency with the natural boundary condition.
Let us confirm the consistency of the general solution (B31) for the natural boundary condition. By substituting the general solution (B31) in the probability current J ss (W ; Z ), we obtain where we have used the expansion of the generating function Φ(x) = ∞ n=1 (α n /n!)x n . Since Φ(c * ) = 0, we obtain J ss (W ; Z ) = mC 1 (Z ). Because the natural boundary condition requires lim W →∞ J ss (W ; Z ) = 0 for any Z , the function C 1 (Z ) must be zero.
Using formula (B36), the steady PDF is given by Here we can drop the sub-dominant double-logarithmic correction, since We thus obtain the asymptotic formula for the super-exponential cases to leading order: which obeys quasi-Zipf's scaling with the correction in the exponent due to the asymmetry of the mark distribution. One can notice that the correction term (log λ/λ 0 ) 1/n in Eq. (B41) can be also regarded as subleading in the sense that lim λ→∞ 1 log λ β −1 log λ λ0 1/n = 0, deducing the Zipf's law P ss (λ) ∝ λ −2 for large λ. However, one should be cautious in dropping this term for practical analyses, since the convergence speed is slow.

Markov embedding and field master equation for continuous sum of exponentials
We have formulated the Markov embedding method for the nonliear Hawkes process with the discrete sum of exponentials (A1) and have derived the corresponding master equation (B1). Here we formally generalise this methodology for the most general case of continuous sum of exponentials: On the basis of this decomposition, the original nonlinear Hawkes process is converted into a Markovian stochastic partial differential equation (SPDE).
This conversion implies that the original one-dimensional non-Markovian process ν(t) is equivalent to an infinite-dimensional Markovian process described by {z(t, x)} x∈(0,∞) . Here x ∈ (0, ∞) is the label of the auxiliary variables {z(t, x)} x∈(0,∞) distributed on the auxiliary field (0, ∞). The master equation corresponding to the SPDE (B47) can be formally written with the formalism of functional calculus. Indeed, by introducing the probability density functional P [z] := P t [{z(x)} x ] and the intensity functional G[z] := g ∞ 0 dxz(t, x) , the field master equation [22,23] is given by which should be interpreted as a formal limit from the discrete representation (B1) according to the standard convention [28], and thus has the same asymptotic solution (B16). The functional description for the field master equation (B48) is formally introduced as follows. Let us discuss the nonlinear Hawkes process (A2) for the discrete sum of exponentials (A1), whose master equation is given by Eq. (B1). Here we introduce a lattice for x with interval dx, such that for the non-negative integer k = 1, . . . , K. Equation (A2) is then rewritten as The corresponding master equation is given by which follows the convention [28]. We note that the rigorous foundation for the functional description has not been established yet [28], and constitutes a problem out of scope of our paper.

Appendix C: Illustrative case
As an appendix, let us focus on the illustrative case where the memory function h(t) =he −t/τ is a single exponential and the distribution of marks is symmetric: ρ(y) = ρ(−y), in order to provide an intuitive understanding of the underlying generating mechanism of the Zipf and quasi-Zipf laws. In this case, the original model can be converted into a simple process obeying the stochastic differential equation (SDE) in terms of the compound Poisson process ξ P ρ(y);λ with jump-size distribution ρ(y) and corresponding intensity λ = g(ν). We apply the diffusive approximation: ξ P ρ(y);λ(t) ≈ 2Dg(ν)ξ G , with the standard white Gaussian noise ξ G and D := (h 2 /2) ∞ −∞ y 2 ρ(y)dy, which requires that the second-order moment of the PDF ρ(y) exists. This can be shown by the Kramers-Moyal (KM) expantion and truncating its series up to the second order. Indeed, the KM expantion of the master equation is given by with the kthe-order KM coefficient defined by By truncating the KM expansion up to the second-order, we obtain which is equivalent to Eq. (C1). Since 2Dg(ν) ν for fast-accelerating intensities, we obtain a ν-dependent diffusion process We note that the truncation of the KM expansion can be proved by making assumption of the diffusive scaling using the systemsize expansion (see Appendix D for the mathematical detail).

Assuming the fast-acceralating intensities
Let us consider the case of fast-accelerating intensities g(ν) > O(ν 2 ). For this case, the linear part −ν/τ becomes smaller than the diffusive term 2Dg(ν)ξ G at large ν: This means that the diffusive model (C3) can be regarded as the inhomogeneous diffusive process for large ν without relaxation term: which corresponds to This means that the corresponding steady state PDF is asymptotically given by which is consistent with the normalisation condition ∞ −∞ dνP ss (ν) = 1 under the assumption of fast-accelerating intensities g(ν) > O(ν 2 ). Since this asymptotic solution is consistent with the normalisation condition, the solution (C7) is the correct asymptotic form for P ss (ν). Since λ = g(ν), the identity P ss (ν)dν = P ss (λ)dλ expressing the conservation of probability under a change of variable leads to the following expression for the PDF of λ: This recovers the universal and quasi-Zipf's laws (6) for both superpolynomial and polynomial fast-accelerating intensity as shown in Appendix B 2.

a. Intuitive understanding
An intuitive understanding that the SDE (C3) leads to a stationary solution and thus a bona fide PDF for ν is obtained by applying Ito's lemma on the change of variable ν → χ := e −(β/2)ν for the case (3) with D = 1/2 and λ 0 = 1, leading to where we use the mathematical notation dB := ξ G dt to represent the increment of the standard Brownian (or Wiener) process. Expression (C9) describes the motion of a Brownian particle in the potential V (χ) = − χ µ χ dχ = (1/2τ )χ 2 ln(χe −2 ) − (β 2 /8) ln χ, from which the drift force µ χ derives. The behaviour of λ at large values is controlled by the dynamics of ν at large positive values, which corresponds to χ close to 0. As χ approaches 0, µ χ diverges on the positive side and repeals χ from the origin. Thus ν and λ never diverges. When χ grows, µ χ becomes negative and also diverge in amplitude, pushing it back to smaller values, thus preventing ν to become too negative and therefore stopping λ from being too small.

The case of the quadratic Hawkes process
As a marginal case, let us consider the case with a quadratic intensity which does not belong to the fast-accelerating intensity and is out of scope of our main manuscript. A part of this case is studied in Ref. [16], and this non-linear Hawkes process is called the quadratic Hawkes (QHawkes) processes exhibiting quite different behaviour (e.g., interested readers should see Ref. [16], where a special case of the QHawkes process is investigated by assuming the diffusive limit and the exponential memory kernel h(t) =he −t/τ . Before our work, this was the only study where an explicit asymptotic solution of the nonlinear Hawkes processes was given for a specific setup). The QHawkes process belongs to another class of nonlinear Hawkes processes, because the relaxation term −ν/τ and the diffusive term 2Dg(ν)ξ G ∝ √ 2λ 1 Dνξ G are of the same order for large ν: Indeed, the QHawkes is essentially similar to the Kesten process [32] because the diffusive model (C3) can be asymptotically regarded as a continuous version of the Kesten process as with the increment of the standard Brownian process dB := ξ G dt. Since the solution of the Kesten processes are known to obey non-universal power laws (in the sense that the power law exponents continuously vary according to system parameters), the intensity distribution of the QHawkes process also obeys a power law relation, with a non-universal positive number a which can take any positive number according to system parameters, such as D, τ and λ 1 . We note that this result can be confirmed by directly solving the Fokker-Planck equation (C2c), even without knowing the theoretical background of the linear Kesten processes. This is in contrast with the universal Zipf's law in our work under the symmetric assumption ρ(y) = ρ(−y) with a fixed power law exponent independent of system parameters. Thus, the QHawkes process is essentially different from our nonlinear Hawkes models because the QHawkes process can be regarded as a linear-Kesten family member while our setup belongs to nonlinear Kesten families.
Appendix D: Diffusive asymptotics: the system-size expansion Here we briefly review the system-size expansion, an established perturbative method for systems with small-noise or with weak-coupling. This method is relevant to the diffusive approximation in Appendix C and Zipf's law (15) for the number of events in the main text.
In Appendix C, we have used the diffusive approximation to derive multiplicative Gaussian noise terms √ λξ G from the Poisson noise terms ξ P ρ(y);λ . This approximation is asymptotically correct by assuming the diffusive scaling with a small parameter > 0. We also assume thatḡ(ν) andρ(Y ) are -independent with scaled mark Y := y/ . We note that this method is essentially equivalent to the system-size expansion (or the Ω expansion), invented by van Kampen [29]; historically is often written by := 1/Ω with large parameter Ω, called the system size (see a recent related Letter [34] and a review [35] for more detail). The intuitive explanation of this scaling (D1) is given as follows: the compound Poisson noise is given by ξ P ρ(y);g(ν) = Here we assume that the mark y i is sufficiently small; y i is proportional to a small parameter as with -independent mark Y i and mark distributionρ(y). Considering the Jacobian relation associated with the preservation of probability, we obtain In this sense, the scaling (D3) can be regarded as a small-noise limit or a weak-interaction limit. However, if we naively take the small noise limit → 0, the effect of the noise completely disappears. To keep the minimal effect of the noise, let us assume that the intensity is sufficiently large as We thus obtain the diffusive scaling (D1). In this sense, this scaling implies that the mark size is small but the frequency ∼ −2 is sufficiently high. We thus obtain the following specific form of the nonlinear Hawkes process (see Fig. 4 for typical trajectories): Under this assumption, we can prove that the KM expansion (C2a) converges to the Fokker-Planck equation (C2c). Indeed, the KM coefficients (C2b) have the scalings The KM expansion (C2a) therefore can be transformed as We thus asymptotically obtain the Fokker-Planck equation (C2c) for the diffusive scaling. In addition, this Fokker-Planck equation is equivalent to a multiplicative Langevin dynamics described by by using the Ito convention for the small limit. This methodology can be readily generalised for general memory kernel h(t), by considering the system-size expansion for the field-master equation.
Appendix E: Zipf's law for the events-number statistics We have studied Zipf's law for the steady distribution of intensity, P ss (λ) ∝ λ −1−a by assuming the symmetry ρ(y) = ρ(−y). Since the intensity is one of the fundamental characteristic quantities for point processes in general, our finding will be useful for understanding various Zipf's law even for other quantities. Here we discuss its application to the number of events occurring during a finite time window t win as an example.
Let us consider a long-time interval [0, T ) and randomly select a time point t * ∈ [0, T ). We then count the number of events during an interval [t * , t * +t win ) as N twin (t * ) to observe the corresponding PDF P t=t * (N twin ). In the steady state, we can assume that P t=t * (N twin ) does not depend on the selection of t * and therefore we write P t=t * (N twin ) by P ss (N twin ).
Here we derive Zipf's law for P ss (N twin ) using Zipf's law for P ss (λ), by focusing on the exponential tension-intensity map The basic idea of the derivation is to use the superposition of the Poisson distributions for a sufficiently short time window t win . For a short time window t win , let us assume that the intensity is approximately constant during [t * , t * + t win ): λ(t) ≈ const. for t ∈ [t * , t * + t win ). Under this assumption, the number of events obeys the Poisson distribution: Since t * is randomly selected, the PDF of λ(t * ) obeys Zipf's law (6) for P ss (λ) ∝ λ −2 . We thus derive Zipf's law for the unconditional PDF of the number of events as This equation assumes that one can neglect the dependence between the realisations of λ's in subsequent windows. This assumption is likely incorrect for large realisations of λ, giving a clue as to the origin of the deviation from Zipf's law beyond the cut-off value N cut = O( −2 ). The study of this strong dependence regime is left for a future work. While the idea of the superposition of the Poisson statistics works formally, the criteria for the sufficiently-short time window is ambigious for the case of nonlinear Hawkes processes, because the intensity obeys the scale-free distribution (i.e., Zipf's law) and has no clear characteristic timescales as the result of the intemittent properties of the nonlinear Hawkes processes. Because of this scale-free nature, the convergenve of the superposition technique (E3) is not uniform in terms of N twin and has a finite cutoff N cut .
To clarify this technical issue, let us consider the dimensional analysis of the nonlinear Hawkes processes by assuming the diffusive scaling limit (D1). Under this condition, the Markov-embedding representation (A2) of the nonlinear Hawkes process is approximated by for small . Assuming that ν is approximately constant ν ≈ ν * := ν(t * ) during [t * , t * + t win ), the typical displacement ∆z diff due to diffusion is given by Let us estimate the typical value of z. Since the steady PDF of tensions ν is given by P ss (ν) ∼ e −βν , the typical value of ν is given by β −1 . Since ν is directly related to z k as ν = K k=1 ν k , the typical value of z k is also the order of β −1 : z * k = β −1 . We thus estimate that the time window t win is sufficiently short if the following condition is satisfied: .
Remarkably, λ(t * ) obeys a scale-free distribution P ss (λ) ∝ λ −2 and has no specific characteristic value. It means that this criteria has an explicit dependence on the value of λ(t * ). Mathematically, this means that the short time window approximation does not uniformly hold in terms of λ if the time window is fixed. Since the number of events during t win is estimated to be N ≈ λt win , the cutoff of the PDF derived from the superposition relation (E3) is estimated to be In this sense, Zipf's law for P ss (N twin ) holds only up to the cutoff N twin N cut as Since the cutoff diverges as lim ↓0 N cut = ∞, this asymptotic relation holds for a wide regime for small and should be regarded as an intermediate asymptotics [30] for the diffusive limit (see Fig. 5 for the numerical confirmation).
(a) (b) (c) FIG. 5: Numerical confirmation of Zipf's law for the number of events Nt win during a short time twin. By setting twin = 10 −5 , we numerically observe the PDF of Nt win for 2 = 10 −4 , 10 −5 , 10 −6 . Zipf's law Pss(Nt win ) ∝ N −2 t win was found to hold up to Ncut = O( −2 ) as theoretically predicted, supporting that Zipf's law is indeed valid as intermediate asymptotics. Beyond the cutoff N > Ncut, we numerically found a fatter tail characterised by a power law exponent 3/2. This finding is an interesting issue, requiring further investigation.

Numerical scheme
For the numerical simulation of Figure 2c, we set (λ 0 , β, σ, λ max , t win , ∆t  In this Letter, the Bachmann-Landau equality and inequality notation is defined by Using this notation, we obtain

On the Laplace-like transformation
In the main text, we have introduced a Laplace-like transformation This transformation is similar to the Laplace transformation. Indeed, by introducing the variable transformation x := 1/s, we obtain with the Laplace representation H(s). This calculation implies thath(x) is equivalent to x −2 H(x −1 ).

On the condition for explosive solutions
Here we present an intuitive discussion on the existence of solutions for nonpositive mean mark m ≤ 0. To capture intuitively the nature of the dynamics, let us truncate the Kramers-Moyal expansion (B37) up to the second order: Recall that the function G has been introduced in equation (B1) as G(z) := g K k=1 z k . This Fokker-Planck equation is equivalent to the following stochastic differential equation with the standard Brownian motion B t . For positive mean mark m > 0, the time-evolution is explosive. Indeed, the noise term is negligible for large W → ∞, and the effectively deterministic dynamics is obviously explosive as soon as G grows faster than linearly as a power law or exponential function. On the other hand, the negative-mean case m < 0 is not explosive. This intuitive discussion is consistent with the rigorous mathematical results on singular stochastic processes [33]. In addition, the rigorous results in Ref. [33] guarantee that the solution is not explosive even for the marginal case m = 0. Indeed, according to [33], for an SDE .

(F12)
According to [33], the non-positive-mean case with m ≤ 0 is classified as recurrent, without explosion to ∞, because On the other hand, for positive mean mark m > 0, the system is classified as explosive. Indeed, we obtain because G −1 (x) decays faster than x −2 on the assumption of the fast-accelerating intensity.

Derivation of solution (B14)
Here we show that the solution of the integral equation This means that Φ(x) has no more than one minimum. Here we assume that ρ(y) decays sufficiently fast and Φ(x) exists. The general solution of Eq. (F15) is given by the superposition of exponentials (i.e., the two-sided Laplace representation), with the ith zero point c i , satisfying Φ(c i ) = 0 and c i < c i+1 .
a. Assuming a zero-mean mark distribution Let us assume that the mean mark is zero: Because the minimum value of Φ(x) is given by Φ(0) = 0, the only real solution of Φ(c) = 0 is therefore given by c = 0. Interestingly, c = 0 is the double root of Φ(c) = 0 and thus a special treatment is necessary: One of the basic solutions of Eq. (F15) is given by a constant function φ(ν) i C i e −ciν = C 0 . (F22) In addition, the affine function φ(ν) C 0 + C 1 ν (F23) with another constant C 1 is also a solution. Indeed, we obtain the consistent relation Considering these properties, the schematic picture of Φ(x) is given by Fig. 6b and all the roots of Φ(c) = 0 are given by c = 0 and c = c * > 0. We thus find that the solution of Eq. (F15) is given by For the zero-mean mark limit. For a reference, let us consider the zero-mean mark limit m ↑ 0. Interestingly, for infinitesimal negative m, the positive root of Φ(c) = 0 approaches zero, such that c * ↓ 0 for m ↑ 0. Then, the solution (F28) can be expanded as up to the second order by assuming infinitesimal positive c * . Here we replace C 0 := C 0 + C 1 and C 1 := −c * C 0 to obtain φ(ν) i C i e −ciν = C 0 + C 1 ν + O(C 1 c * 1 ).
By taking the limit m ↑ 0 and thus c * ↓ 0, we obtain the affine form of the solution (F23) and the specific values of the constants C 0 and C 1 .
c. Example: Gaussian mark distribution As an example, let us consider the case of the Gaussian mark distribution: with mean m and variance σ 2 . The moment-generating function is given by dyρ(y)(e xy − 1) = e mx+σ 2 x 2 /2 − 1.
This means that the non-zero root of Φ(c * ) = 0 is given by