Robust identification of controlled Hawkes processes

The identification of Hawkes-like processes can pose significant challenges. Even with substantial amounts of data, standard estimation methods show significant bias or fail to converge. To overcome these issues, we propose an alternative approach based on an expectation-maximization algorithm, which exploits the internal branching structure of the process, thus improving convergence behavior. Furthermore, we show that our method provides a tight lower bound for maximum-likelihood estimates. The approach is discussed in the context of a practical application, namely the collection of outstanding unsecured consumer debt.


I. INTRODUCTION
In contrast to the exogenous intensity of an inhomogeneous Poisson point process, the intensity of a Hawkes process is self-exciting: it depends endogenously on the arrival history [1,2]. Any arrival event induces an intensity jump which dissipates through a memory kernel, and this in turn influences the probability of the next arrival event. The first applications of Hawkes processes appeared in seismology, for the analysis of earthquakes and associated aftershock sequences [3]. Since then, self-exciting processes have proved useful across numerous other fields, including finance [4], marketing [5], and neuroscience [6], to name a few. The performance of the underlying parametric models depends first and foremost on a correct model specification. Here we focus on the identification of a class of linear controlled marked Hawkes processes, where the arrival events include scalar marks and the arrival intensity is regulated by an impulse control. This class of controlled self-exciting processes was considered by Chehrazi and Weber [7] to predict the repayment behavior of unsecured loans placed in credit collections. In this application, the collector has at its disposal a set of account-treatment actions (e.g., establishing first-party contact or sending a notice letter) to exert pressure on the debtor. A similar class of processes was used by Rambaldi et al. [8] to model foreign-exchange price dynamics subject to exogenous deterministic jumps in the form of news about macroeconomic events.
In the credit-collection example, a misspecification of model parameters leads to faulty predictions of account values and suboptimal account-treatment schedules.¹ Although standard identification techniques, such as maximum-likelihood estimation (MLE), may well be asymptotically consistent, the corresponding estimators tend to exhibit a significant bias as soon as the available data are sparse. This is the case in many practical applications, such as credit collections, where a delinquent account usually features only a few repayment events over the collection history. In addition, it is often difficult to compute the best-fit parameters because of nonconcavities and near-vanishing gradients of the objective function, which lead to ill-conditioned iterations. To improve the convergence behavior and estimation performance of standard MLE methods, we propose a robust estimation method based on an expectation-maximization (EM) algorithm. The latter exploits the branching structure of the process, featuring a primal-dual-type approximation. In each iteration, first the lower bound for the likelihood function is updated ("expectation step") before the parameter estimate is reoptimized ("maximization step"). Using a fairly generic setup (in the context of credit collections, to fix ideas), we show that the EM algorithm achieves substantial improvements in convergence behavior and thus increased robustness with respect to a broad range of starting values for the parameter vector.

¹For optimal closed-loop control of repayment processes, see Chehrazi et al. [9].

A. Literature
Due to its relative simplicity, MLE is a common inference method for point processes specified via conditional intensity. A semiclosed form for the estimator was derived by Rubin [10], who established a link between the conditional density function of the interarrival times and the intensity for regular point processes. The performance of MLE was tested on seismic data by Ozaki [11], who also introduced a computationally efficient recursive simplification for MLE and derived the Jacobian and Hessian of the corresponding likelihood function. Determining the MLE estimator then amounts to solving a nonconvex program using appropriate optimization machinery, with the Jacobian and Hessian readily available in the univariate case.
Although Ogata [12] proved that the associated MLE estimator is asymptotically normal, efficient, and consistent, the amount of data available for fitting is often insufficient for attaining the asymptotic regime.
The resulting estimates tend to be heavily biased or, worse, the estimator fails to converge, in many practical applications. For example, such convergence issues were noted in the case of our class of controlled Hawkes processes by Chehrazi and Weber [7], who proceeded, somewhat ad hoc, to filter out unlikely local minima using a Cramér-von Mises goodness-of-fit criterion. The generically poor and unreliable convergence of the MLE estimator is further exacerbated by the log-likelihood function's frequently multimodal or extremely flat behavior near its critical points, resulting overall in a lack of apparent well-posedness [13], in the sense that close initialization values can produce very different estimation results. Veen and Schoenberg [14] documented these anomalies for the popular seismological spatiotemporal epidemic-type aftershock sequence (ETAS) model [15-17], highlighting the low curvature of the ETAS log likelihood, which deteriorates the performance of the numerical optimization routine. Furthermore, multimodality of the log likelihood has been empirically confirmed, since for different starting values the optimizer generically converges to different local minima. To overcome the associated computational challenges, the authors suggested taking advantage of the natural branching structure of the process [18], framing the estimation as an incomplete-data problem where the information about which event triggers other events is unobservable. Building on the case presented by Veen and Schoenberg for the ETAS model, we develop an adapted version of the expectation-maximization algorithm suited for our class of controlled Hawkes processes. Furthermore, we improve the original method by an additional term that shifts the EM-objective function (i.e., the "expected complete log likelihood"; see Sec. III B) such that, at the optimum, it becomes a tight lower bound to the log-likelihood function.

B. Outline
The remainder of this paper is organized as follows. In Sec. II, we introduce the class of linear controlled Hawkes processes and showcase its importance in two practical examples. Section III first reviews the MLE estimator, pinpointing its shortcomings, and then constructs our estimation method based on the EM algorithm. In Sec. IV, we compare the two methods in terms of their respective convergence stability and bias. Section V concludes.

II. CONTROLLED HAWKES PROCESSES

A. Definition
The intensity of a (linear) controlled Hawkes process (CHP) is given by

λ(t) = μ(t) + a(t) + Σ_{i: τ_i < t} g(t − τ_i, m_i),    (1)

where τ_i ≥ 0 denotes the ith arrival time and m_i ∈ ℝ is the corresponding mark, for i ≥ 1. The background intensity rate μ(t) is a deterministic function of time t ≥ 0, the function g : ℝ₊ × ℝ → ℝ₊ is a (non-negative-valued) memory kernel, and the (open-loop) control a(t) is assumed to be a right-continuous function of the form

a(t) = Σ_{j: ϑ_j < t} ℓ_j(t − ϑ_j),    (2)

where each ℓ_j : ℝ₊ → ℝ₊ denotes a (non-negative-valued) exogenous kernel and each ϑ_j is an instant at which the control variable a undergoes a jump, for j ≥ 1. The corresponding control impulses are dissipated via the (non-negative-valued) exogenous kernels ℓ_j, as opposed to the endogenous kernel g, which governs the memory from self-excitation. We assume that on any finite time interval [0, t] the number of impulses is finite, and the intervention times ϑ_j are known in advance. We also assume that both types of kernels satisfy the usual stationarity condition, so that ∫₀^∞ max{ℓ_j(t), g(t, m)} dt < 1 for all relevant j and marks m.³ The σ-algebra H_t = σ({(τ_i, m_i)} ∪ {ϑ_j} : τ_i < t, ϑ_j < t) describes the process history, including mark sizes m_i and impulse times ϑ_j. A sample intensity path is shown in Fig. 1.
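To make Eq. (1) concrete, the intensity at any time t can be evaluated directly from the arrival history. The following minimal sketch (with illustrative helper names that are not from the paper, and an exponential kernel of the kind used in Example 2 below) evaluates λ(t) given a list of past arrivals:

```python
import math

def intensity(t, mu, a, g, events):
    """CHP intensity per Eq. (1): lambda(t) = mu(t) + a(t) + sum over
    past arrivals (tau_i < t) of the memory kernel g(t - tau_i, m_i)."""
    return mu(t) + a(t) + sum(g(t - tau, m) for tau, m in events if tau < t)

# Illustrative exponential specification (hypothetical parameter values).
kappa, d10, d11 = 0.7, 0.5, 1.0
g = lambda s, m: (d10 + d11 * m) * math.exp(-kappa * s)  # memory kernel
mu = lambda t: 0.2                                       # background rate
ctrl = lambda t: 0.0                                     # no control impulses
lam = intensity(2.0, mu, ctrl, g, [(0.5, 0.3), (1.5, 0.8)])
```

With no past arrivals, the intensity reduces to the base rate μ(t) + a(t); each recorded arrival adds a decaying contribution.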

B. Examples
The practical relevance of CHPs is illustrated by the following two examples.
Example 1: Trading with macroeconomic news. Rambaldi et al. [8] analyze the impact of macroeconomic news on market activity, measured by the rate of change in the best quotes. They consider a Hawkes process driven by an endogenous and an exogenous kernel. The self-excitation effect is described by an unmarked endogenous kernel in the form of a linear combination of exponentials, and the effect of (recurring) macroeconomic news by an exogenous kernel in the form of a single exponential.

³In applications with a finite observation horizon, it is sufficient to impose integrability of the kernels on the observation horizon.

FIG. 1. Sample path of the stochastic intensity as a function of the arrival time.

While best-quote changes occur at the random instants τ_i, the arrival times ϑ_j of news releases are known in advance, rendering the exogenous kernel deterministic. With the control function a(t) ≡ Σ_{j: ϑ_j < t} ℓ_N(t − ϑ_j), the intensity of the controlled Hawkes process is then given by Eq. (1).
Example 2: Credit collections. A CHP was used by Chehrazi and Weber [7] to predict the repayment behavior of holders of credit-card accounts in default. A delinquent account with outstanding balance B(0) > 0 placed into collections at time t = 0 is credited with relative repayments r_i at times {τ_i}_{i∈ℕ} until the outstanding debt is paid in full. The sequence (τ_i, r_i), for i ≥ 1, constitutes a marked point process with intensity dynamics that can be described by a mean-reverting stochastic differential equation together with an initial condition:

dλ(t) = κ[λ_∞ − λ(t)] dt + [δ_10, δ_11] dJ(t) + da(t),  λ(0) = λ_0,    (3)

where the two-dimensional jump process J(t) = [N(t), R(t)] represents the unmarked and marked versions of the same counting process; indeed, N(t) = Σ_i 1_{τ_i < t} captures the holder's willingness to pay and R(t) = Σ_i 1_{τ_i < t} r_i his or her ability to pay. The parameter λ_∞ represents the long-run steady state to which the repayment intensity reverts at the rate κ, while δ_1 = [δ_10, δ_11] denotes the sensitivity to the self-exciting two-dimensional jumps. The control variable a(t) is assumed to be a deterministic, nondecreasing, right-continuous, piecewise-constant function, taking values in ℝ₊. The exogenous kernel is

ℓ_j(t) = δ_{2,l(ϑ_j)} e^{−κt},

where the parameter vector δ_2 = (δ_21, δ_22, …, δ_2M) contains the sensitivities of the repayment intensity to M different account-treatment actions, and the mapping l : ℝ₊ → ℕ₊ describes the type l(ϑ_j) of the action taken at time ϑ_j. In practice, the impulses map to the available collector actions, which can vary from mild (inducing smaller intensity jumps, e.g., by sending a letter of notice or making phone calls) to severe (inducing larger intensity jumps, e.g., by filing a lawsuit). Overall, the repayment intensity evolves according to Eq. (1) for m_i ≡ r_i. In this, the deterministic drift,

μ(t) = λ_∞ + (λ_0 − λ_∞) e^{−κt},

is exponentially mean reverting, and the triggering kernel,

g(t − τ_i, r_i) = (δ_10 + δ_11 r_i) e^{−κ(t−τ_i)},

describes the effect of a repayment-event arrival at time τ_i on the repayment intensity, for all t ≥ τ_i.
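Synthetic repayment histories of this kind can be generated by Ogata-style thinning. The sketch below (hypothetical function and parameter names, not the authors' implementation) uses the exponential specification of Example 2; the dominating rate exploits the fact that, between candidate points, every intensity term decays except for mean reversion toward λ_∞ and control impulses, which are bounded separately:

```python
import math, random

def simulate_repayments(T, lam_inf, lam0, kappa, d10, d11, controls, seed=0):
    """Thinning simulation of the repayment CHP; `controls` is a list of
    (theta_j, delta2_j) impulse times and sizes; marks r_i ~ U[0, 1]."""
    rng = random.Random(seed)
    events = []  # accepted (tau_i, r_i) pairs

    def lam(t):
        base = lam_inf + (lam0 - lam_inf) * math.exp(-kappa * t)
        exo = sum(d2 * math.exp(-kappa * (t - th)) for th, d2 in controls if th < t)
        endo = sum((d10 + d11 * r) * math.exp(-kappa * (t - tau))
                   for tau, r in events if tau < t)
        return base + exo + endo

    t = 0.0
    while True:
        # dominating rate: current level + mean-reversion target + all
        # not-yet-realized impulse sizes (every other term only decays)
        lam_bar = lam(t) + lam_inf + sum(d2 for th, d2 in controls if th >= t)
        t += rng.expovariate(lam_bar)
        if t >= T:
            break
        if rng.random() * lam_bar <= lam(t):  # accept with prob lam(t)/lam_bar
            events.append((t, rng.random()))
    return events
```

The acceptance ratio λ(t)/λ̄ implements the usual thinning correction, so accepted points follow the intensity in Eq. (1).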

C. Branching structure
The branching structure presents an augmented view of a Hawkes point process, based on the Poisson cluster-process representation introduced by Hawkes and Oakes [18]. It maps event arrivals to clusters, each of which begins with an immigrant arrival following an inhomogeneous Poisson process of base-rate intensity μ(t) + a(t). Subsequently, every immigrant generates its own offspring following an inhomogeneous Poisson process with intensity given by the triggering kernel g, and this cascades through all offspring, thus generically clustering the event arrivals. Conceptually, all events fall into two categories: immigrant arrivals and offspring arrivals. Offspring events are triggered by existing events in the process, while immigrant events arrive independently, without being preceded by a parent event. This conceptual separation provides additional inner structure to the process. More specifically, the branching structure of the ith arrival at time τ_i is described by a mapping u : ℝ₊ → ℕ, so that

u(τ_i) = 0 if the ith arrival is an immigrant, and u(τ_i) = i′ if it is an offspring of the i′th arrival, with i′ < i.    (4)

FIG. 3. (a) Decomposition of the arrival process of Fig. 2 into a base-rate process and arrival-triggered inhomogeneous Poisson processes. Notice that the ratio of the intensity induced by a particular arrival to the total intensity determines the probability that the event is an offspring of that particular arrival. (b) Representation of the branching distribution: each row designates the probability mass function for the particular arrival; the diagonal represents the probabilities of events being immigrants.
In practice, the branching structure is usually unobservable. Yet, conditional on a set of process parameters and a sample sequence (τ_i, m_i), it is possible to recover its probability distribution:

P(u(τ_i) = i′ | H_T) = g(τ_i − τ_{i′}, m_{i′}) / λ(τ_i) for 1 ≤ i′ < i,  P(u(τ_i) = 0 | H_T) = [μ(τ_i) + a(τ_i)] / λ(τ_i).    (5)

Thus, it is possible to probabilistically assign any arrival i to being an immigrant or an offspring [Fig. 3(b)]. As shown in Fig. 3(a), a Hawkes process can be decomposed into a base-rate process and a sum of arrival-triggered inhomogeneous Poisson processes. The resulting (probabilistic) branching structure can be used to perform efficient numerical simulation [19] or, as we show in Sec. III, to improve process identification. Remark 1. The branching structure implies an intuitive isoperimetric constraint on the triggering kernel g that ensures the stability of the system. Indeed, the average number of direct (i.e., first-order) offspring generated by a single event is the expected branching ratio ν = E_m[∫₀^∞ g(t, m) dt], whereby the point process remains stable if and only if ν < 1.⁵
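The branching distribution in Eq. (5) can be tabulated directly, row by row, as in Fig. 3(b). A sketch (illustrative helper names; `p[i][j]` for j < i is the offspring probability, and the diagonal entry the immigrant probability):

```python
def branching_probs(times, marks, mu, a, g):
    """Row i of the returned matrix is the branching pmf of event i:
    p[i][j] = g(t_i - t_j, m_j) / lam(t_i) for j < i (offspring of j),
    p[i][i] = (mu(t_i) + a(t_i)) / lam(t_i) (immigrant), as in Eq. (5)."""
    n = len(times)
    p = [[0.0] * n for _ in range(n)]
    for i in range(n):
        t = times[i]
        weights = [g(t - times[j], marks[j]) for j in range(i)]
        base = mu(t) + a(t)
        total = base + sum(weights)  # equals the intensity lam(t_i)
        for j in range(i):
            p[i][j] = weights[j] / total
        p[i][i] = base / total       # diagonal: immigrant probability
    return p
```

By construction each row sums to one, and the very first event is an immigrant with probability one.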

III. IDENTIFICATION
Our estimation procedure is presented in the context of Example 2, concerning the collection on defaulted credit-card accounts. Repayments follow a CHP with the intensity in Eq. (1), conditional on the parameter vector θ = (κ, λ_0, λ_∞, δ_10, δ_11, δ_2), a known distribution F of the relative repayments (marks) m_i = r_i, and a given sequence of account-treatment times {ϑ_j}_{j∈ℕ}. The information from a realization of such a process then consists of event times {τ_i}_{i∈ℕ}, associated marks {r_i}_{i∈ℕ} (representing a sample draw from the relative-repayment distribution F), and associated account-treatment times {ϑ_j}_{j∈ℕ}. Note that the components δ_2^(j) of the parameter vector δ_2 usually take values in a finite set D with n_A elements, corresponding to the finitely many available actions, some of which (e.g., phone calls or text messages) may be applied repeatedly to the same account.

⁵Stability is viewed here in the sense that the ratio of the total number of events N(t) to the number M(t) of immigrant events remains bounded.
In the remainder of this section, we assume that K paths H_T^k (for k ∈ K) have been observed over a finite time interval [0, T], corresponding to an account portfolio K = {1, …, K}. The joint information is summarized by H_T^K = {H_T^k : k ∈ K}.

A. Maximum-likelihood estimation
The conventional MLE procedure directly solves

θ̂_MLE ∈ arg max_{θ∈Θ} L(θ | H_T^K),    (6)

where the incomplete-data log likelihood is given by

L(θ | H_T^K) = Σ_{k∈K} [ Σ_{i: τ_i^k < T} ln λ^k(τ_i^k) − ∫₀^T λ^k(s) ds ],

with λ^k denoting the intensity along the kth observed path. The descriptor incomplete was coined by Veen and Schoenberg [14]; it emphasizes the fact that the estimator does not use additional branching-structure information. The incomplete log-likelihood estimator derived in this manner is asymptotically normal, efficient, and consistent. However, it suffers from the following two notable defects that significantly deteriorate its performance: (a) A closed-form solution to the maximization problem (6) is rarely available. Moreover, the efficiency of first- and second-order numerical methods is often poor, as in many cases the log likelihood is extremely flat; see Fig. 4. Along certain trajectories, even large disturbances reduce the log likelihood only marginally. For instance, in the (λ_∞, κ) subspace the parameter λ_∞ can be increased by a factor of 2 without significantly impacting the objective function.
(b) Even in the simplest case of a constant-rate exponential Hawkes process, the log likelihood can be multimodal [21]. Specifically, the log likelihood is concave only when κ is fixed. For more complicated models, such as the repayment process in Example 2, the optimization program is guaranteed to be nonconvex. Even if the log likelihood is unimodal, due to the extreme flatness near the MLE estimates θ̂, the objective function can become numerically multimodal as a result of rounding errors.
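For the exponential-kernel specification of Example 2, the compensator ∫₀^T λ(s) ds in the incomplete log likelihood integrates in closed form, which is what a direct MLE implementation evaluates at each candidate θ. A minimal single-path sketch (hypothetical names, not the authors' code):

```python
import math

def log_likelihood(times, marks, T, lam_inf, lam0, kappa, d10, d11, controls):
    """Incomplete-data log likelihood for one path:
    sum_i ln lam(tau_i) - integral_0^T lam(s) ds, exponential kernels."""
    def lam(t):
        base = lam_inf + (lam0 - lam_inf) * math.exp(-kappa * t)
        exo = sum(d2 * math.exp(-kappa * (t - th)) for th, d2 in controls if th < t)
        endo = sum((d10 + d11 * r) * math.exp(-kappa * (t - tau))
                   for tau, r in zip(times, marks) if tau < t)
        return base + exo + endo

    ll = sum(math.log(lam(t)) for t in times)
    # compensator, term by term (each exponential integrates exactly)
    comp = lam_inf * T + (lam0 - lam_inf) / kappa * (1 - math.exp(-kappa * T))
    comp += sum(d2 / kappa * (1 - math.exp(-kappa * (T - th)))
                for th, d2 in controls if th < T)
    comp += sum((d10 + d11 * r) / kappa * (1 - math.exp(-kappa * (T - tau)))
                for tau, r in zip(times, marks) if tau < T)
    return ll - comp
```

Summing this quantity over the portfolio K and handing the sum to a numerical optimizer is exactly the procedure whose flatness and multimodality are discussed in (a) and (b) above.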
Although our main focus is to showcase how the branching structure can be employed in the estimation, we note several possible workarounds for MLE-convergence problems; see Fig. 5. The simplest way to prevent the MLE estimator from getting stuck at a local minimum is to solve the optimization program (6) repeatedly from multiple starting values and to retain the best result.

FIG. 5. Estimation results for starting values drawn relative to the reference parameters in Table I. Black crosses denote starting values; blue diamonds denote estimation results; the green circle marks the location of the reference parameters. (a) In 39 out of 100 cases, the optimization converged to absurdly large values and was registered as failed (red squares). (b) Although the MLE procedure converged in all 100 cases, the two discovered minima are local, both far from the reference parameters.

The disadvantage of this method is its computational cost, as will be shown in Sec. IV; its advantage, compared to the EM-based algorithm presented below, is negligible. Another highly popular technique relies on the regularization of the estimator, imposing a coefficient penalty in an L₁ or L₂ norm [22,23]. In the context of Hawkes processes, Guo et al. [24] proved that the regularized estimator is stable. However, the exact effects of the regularizer on convergence are still not well understood; that is, despite being functional in practice, it remains a "black-box solution."

B. Expectation-maximization algorithm
The expectation-maximization algorithm is based on the branching-structure representation introduced in Sec. II C. The idea is to provide the estimator with additional structural information about the process conditional on the observed sample in order to improve the fitting procedure, with the aim of circumventing the problems of ill-conditioning and lack of convergence that are prevalent in the standard MLE procedure.

Complete maximum-likelihood estimator
For a known branching structure described by the mapping u in Eq. (4), one obtains the complete-data log-likelihood function as a sum of two terms, L₁ and L₂.
(i) Log likelihood of immigrant events arriving with base-rate intensity μ(t) + a(t):

L₁^k = Σ_{i: u(τ_i^k)=0} ln[μ(τ_i^k) + a(τ_i^k)] − ∫₀^T [μ(s) + a(s)] ds,

where μ(t) = λ_∞ + (λ_0 − λ_∞) e^{−κt} is the deterministic intensity of the inhomogeneous Poisson process for the immigrants and ℓ_j(t − ϑ_j) = δ_{2,l(ϑ_j)} e^{−κ(t−ϑ_j)} is the effect of the action carried out at time ϑ_j, for all accounts k ∈ K.
(ii) Cumulative log likelihood of offspring events generated, respectively, by the different inhomogeneous Poisson processes with intensity g(t − τ_i, r_i) = (δ_10 + δ_11 r_i) e^{−κ(t−τ_i)}, for t ∈ [τ_i, T]:

L₂^k = Σ_i Σ_{i′: u(τ_{i′}^k)=i} ln g(τ_{i′}^k − τ_i^k, r_i^k) − Σ_i ∫_{τ_i^k}^T g(s − τ_i^k, r_i^k) ds,

for all accounts k ∈ K. Summing L₁ + L₂ over the available sample paths in the account portfolio K, the complete log likelihood of the branching process, with the intensity in Eq. (1), becomes

L_c(θ | H_T^K, u) = Σ_{k∈K} (L₁^k + L₂^k).

Note that the construction of the complete log likelihood takes into account that the endogenous processes generating the offspring arrivals are mutually independent and independent of the exogenous process generating the immigrant arrivals [18].
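Given a known branching map u, the bookkeeping of L₁ + L₂ is mechanical. The sketch below (hypothetical names and calling convention; `u[i] == -1` encodes an immigrant, `u[i] == j` an offspring of event j) takes the base rate, its integral, the kernel, and the kernel's integral as callables so that any specification can be plugged in:

```python
import math

def complete_ll(times, marks, u, T, base, base_int, g, g_int):
    """Complete-data log likelihood L1 + L2 for a known branching map u.
    base(t) = mu(t) + a(t); base_int = integral of base over [0, T];
    g_int(s, m) = integral of g(v, m) over [0, s]."""
    # L1: immigrants against the base-rate inhomogeneous Poisson process
    L1 = sum(math.log(base(times[i])) for i in range(len(times)) if u[i] == -1)
    L1 -= base_int
    # L2: each offspring against its parent's triggered Poisson process
    L2 = sum(math.log(g(times[i] - times[j], marks[j]))
             for i, j in enumerate(u) if j >= 0)
    L2 -= sum(g_int(T - tau, m) for tau, m in zip(times, marks))
    return L1 + L2
```

Summing `complete_ll` over the portfolio reproduces the complete log likelihood of the branching process described above.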
As the branching structure is unobservable, the complete log likelihood is generally unavailable. It is therefore natural to resort to the expected complete log likelihood (ECLL), conditional on the observed portfolio history H_T^K:

Q(θ, θ′) = E_u[ L_c(θ | H_T^K, u) | H_T^K, θ′ ],

where the expectation is taken over the branching distribution in Eq. (5), evaluated at the parameter θ′.

EM algorithm
The expectation-maximization algorithm is initialized with a parameter value θ_0 obtained by using prior experience or an educated guess. The first step of the two-step iteration procedure (in iteration n, for n ≥ 0) consists of computing the conditional ECLL of the branching structure, termed Q(θ, θ_n): the probability distribution of the branching structure in Eq. (5) is conditioned on the best available parameter estimate θ_n, while the process parameters enter through the unknown parameter θ, given the available portfolio data H_T^K. In the second step, one then performs a maximization of Q(θ, θ_n) with respect to θ, resulting in the next iterate, θ_{n+1}.
Expectation step (E step). Using standard notation from the unsupervised-learning literature, where the EM algorithm is sometimes used for clustering purposes [25], the conditional ECLL becomes

Q(θ, θ_n) = E_u[ L_c(θ | H_T^K, u) | H_T^K, θ_n ],    (8)

where each admissible branching structure is weighted by its probability under Eq. (5), evaluated at θ_n. Note that the endogenous kernel g is computed conditional on the "true" parameter θ. Maximization step (M step). Based on the current parameter estimate θ_n, the next iterate is determined as a result of maximizing the conditional ECLL:

θ_{n+1} ∈ arg max_{θ∈Θ} Q(θ, θ_n),    (10)

where the compact parameter set Θ, a subset of the positive orthant, is chosen by the user so as to limit the search; the maximization is carried out using standard numerical tools.
Termination. Starting with the initial seed θ_0, one iterates through the expectation and maximization steps until the termination condition,

|Q(θ_{n+1}, θ_{n+1}) − Q(θ_n, θ_n)| < ε,    (11)

is satisfied for a sufficiently small tolerance ε > 0. The procedure is summarized hereafter.
Initialize seed θ_0 ∈ Θ, fix a tolerance ε ∈ (0, 1), and set n ← 0, δ ← 1;
while δ ≥ ε:
  E step: compute Q(·, θ_n) via Eq. (8);
  M step: θ_{n+1} ← arg max_{θ∈Θ} Q(θ, θ_n);
  δ ← |Q(θ_{n+1}, θ_{n+1}) − Q(θ_n, θ_n)|; n ← n + 1.

Convergence. Dempster et al. [26] show that the sequence [Q(θ_n, θ_n)]_{n∈ℕ} is increasing and bounded, so that it must converge ([27], p. 55). However, there is no guarantee that the limit of the maximizing sequence (see, e.g., [28], Chap. 8) is indeed associated with a global extremum. Conceptually, the EM estimates are expected MLE estimates. Dempster et al. [26] also establish that estimates obtained using the EM algorithm are consistent, just as standard MLE estimates (based on the incomplete log-likelihood function).
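The iteration structure above can be sketched generically, with the E step and M step supplied as callables (a sketch under assumed interfaces, not the authors' code: `e_step(theta)` builds the function θ ↦ Q(θ, θ_n) from the branching probabilities at θ_n, and `m_step(Q)` returns its maximizer over the search set):

```python
def em_fit(theta0, e_step, m_step, eps=1e-6, max_iter=200):
    """Alternate E and M steps until successive objective values
    Q(theta_n, theta_n) differ by less than eps."""
    theta, q_old = theta0, None
    for _ in range(max_iter):
        Q = e_step(theta)   # E step: theta -> Q(theta, theta_n)
        theta = m_step(Q)   # M step: theta_{n+1} in argmax Q(., theta_n)
        q_new = Q(theta)
        if q_old is not None and abs(q_new - q_old) < eps:
            break
        q_old = q_new
    return theta
```

In a toy setting where the ECLL does not depend on the current iterate and is maximized over a grid, the loop converges in two passes.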
Remark 2 (EM produces lower bound for MLE). Solving the MLE problem (6) numerically usually entails local approximations of the objective function, followed by choosing an appropriate increment in the direction of steepest ascent. By contrast, the EM algorithm produces a local approximation of the objective conditional on the model parameters and the distribution of the branching structure as a latent variable. This local approximation constitutes a lower bound for the incomplete log likelihood [29]. The EM algorithm alternates between updating the lower bound (E step) in Eq. (8) and updating the parameter estimate (M step) in Eq. (10) until the termination condition in Eq. (11) is satisfied. Thus, by construction, L(θ̂ | H_T^K) ≥ Q(θ̂, θ̂), as shown in Appendix A. Intuitively, direct maximization can be viewed as fitting a single point process with a specified intensity function, whereas maximizing the conditional ECLL (via the EM algorithm) simultaneously fits N(T) + 1 inhomogeneous Poisson processes,⁶ each weighted by its corresponding branching-structure probability.
Although the objective function in Eq. (8) minorizes the log likelihood, we note that this lower bound is generally not tight. By taking into account the entropy of the branching distribution, the following result corrects this shortcoming and provides a tight lower bound, guaranteeing that the approximation of the log likelihood becomes exact at the optimal EM estimate. Theorem 1 (Representation). For all θ ∈ Θ, the incomplete log likelihood can be written in the form

L(θ | H_T^K) = Q(θ, θ) + E(θ),

where the non-negative defect,

E(θ) = −Σ_{k∈K} Σ_i Σ_{i′} P(u(τ_i^k) = i′ | H_T^k, θ) ln P(u(τ_i^k) = i′ | H_T^k, θ),

describes the entropy of the branching distribution given the observed history H_T^K.

⁶Any immigrant arrival triggers an offspring process, and so does each offspring arrival. Hence, there is a process for each arrival [altogether N(T) processes] and one immigrant process. The N(T) + 1 processes are coupled by the branching distribution in Eq. (5), which depends on the parameter vector and the observed sample data.
Proof. See Appendix A. Theorem 1 implies that the conditional ECLL Q(θ, θ_n) can be "adjusted" using the defect to become a tight lower bound for the log likelihood, as follows:

Q̃(θ, θ_n) = Q(θ, θ_n) + E(θ_n).

This adjusted (conditional) ECLL satisfies Q̃(θ_n, θ_n) = L(θ_n | H_T^K), so the bound is attained at the current iterate. The adjusted ECLL not only establishes direct comparability with the incomplete log likelihood, but it also significantly reduces the number of iterations needed for convergence. To illustrate the procedure, an iteration of the EM algorithm is sketched in Fig. 6, with additional details provided in Appendix A. Henceforth, all mentions of "ECLL" refer to the adjusted ECLL with objective function Q̃ (instead of Q).
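The entropy defect itself is a simple functional of the branching-probability rows; a sketch (illustrative name, with the usual convention 0 · ln 0 = 0):

```python
import math

def branching_entropy(p):
    """Entropy of the branching distribution: -sum over all rows i and
    entries q = p[i][j] of q * ln(q), skipping zero probabilities."""
    return -sum(q * math.log(q) for row in p for q in row if q > 0.0)
```

A degenerate branching distribution (every assignment certain) has zero entropy, so the adjustment vanishes exactly when the branching structure is effectively observable.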

IV. SIMULATION
For a systematic comparison of the proposed EM algorithm with the standard MLE procedure, we are particularly interested in its convergence performance with respect to randomized initial values θ_0. Convergence performance is key in practice, since an appropriate parameter range is difficult to determine ex ante. Even "educated guesses" for θ_0 are bound to stray significantly from the ("true") reference value θ_r. The latter is used in our broad numerical experiment to generate synthetic collections data in the context of Example 2 in Sec. II B.

A. Data
It is important to note that credit-collections data by their very nature are relatively sparse: a significant portion of accounts does not exhibit any repayments.⁷ This is compensated by the transversal experience across an account portfolio K containing K sample paths. Throughout the numerical experiment, we consider a CHP driven by the intensity in Eq. (1), generated with the reference parameters specified in Table I. The marks (relative repayments) are assumed to be independent and identically distributed (i.i.d.) uniformly, i.e., r_i ∼ U[0, 1]. For each of the K = 500 accounts in the portfolio K, we consider L = 100 sample paths, referred to as "scenarios." Each scenario ℓ ∈ {1, …, L} generates a history H_T^K(ℓ), based on which the model identification is performed using the two alternative methods (MLE and EM). This is done for M = 100 random seeds θ_0^(1), …, θ_0^(M), obtained as realizations of the random variable θ_r diag(γ), where γ = (γ_g) is a vector of the same length as θ_r with entries of the form γ_g = 10^{β_g/(20 dB)} describing the gain (positive or negative). In the numerical experiment, gains are considered such that β_g ∈ [−26 dB, 26 dB], corresponding to amplitude distortions γ_g in the interval [1/20, 20].⁸ Each scenario history H_T^K(ℓ) corresponds to data from a portfolio of K treated accounts, with observation horizon T = 100, where each account k ∈ K is associated with an observed repayment sequence {(τ_i^k, r_i^k)} and a sequence of three control impulses (account treatments) at the i.i.d. times ϑ_j^k ∼ U[0, T] (ordered such that ϑ_1^k < ϑ_2^k < ϑ_3^k).

⁷Even an "empty" sample path conveys valuable information about the underlying process and thus cannot be discarded.
⁸Here we consider an i.i.d. uniform distribution of γ_g on [1/20, 20]. We have also run the entire study for a uniform distribution in the dB space, i.e., for β_g uniformly distributed on [−26 dB, 26 dB].
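The seed-randomization scheme can be sketched in a few lines (hypothetical function name; gains γ_g are drawn i.i.d. uniformly on [1/20, 20], which corresponds to amplitude distortions of roughly ±26 dB):

```python
import random

def random_seed(theta_r, rng=None):
    """Draw a starting value theta_0 = theta_r * diag(gamma) with
    componentwise gains gamma_g ~ U[1/20, 20]."""
    rng = rng or random.Random()
    return [rng.uniform(1.0 / 20.0, 20.0) * th for th in theta_r]
```

Repeating the draw M times yields the batch of initial guesses θ_0^(1), …, θ_0^(M) used in the experiment.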
Table II compares the average performance of the MLE estimator θ̂_MLE and the EM estimator θ̂_EM over 100 random seeds, distributed uniformly within ±25% of the reference parameter values. It also indicates how the length of the observation horizon T impacts the respective accuracy of the two estimators. Interestingly, both methods produce very similar results, although EM tends to be computationally more expensive. This behavior is somewhat expected, as both methods yield MLE estimates, with the EM algorithm relying on additional information related to the branching structure of the repayment process. The real advantage of the EM algorithm over direct MLE maximization becomes apparent when considering initial seeds θ_0 at a significant distance from the reference parameter θ_r or when limiting the observation horizon.

B. Results
Consider first the convergence of the estimation in several two-dimensional subspaces of the parameter space Θ. For this, the starting values θ_0^(m), for m ∈ {1, …, M}, are set to their reference value θ_r, while the investigated pair of parameter components is randomly varied between −26 dB and +26 dB (corresponding to a maximum variation by a factor of 1/20 or 20) relative to the corresponding reference values. Figure 7 shows that MLE fails to converge to the reference values for more than half of the initial seeds (in 52 of 100 instances). We note that the optimizer designated all of the estimation results (blue diamonds) as local minima (implying that a step in any direction would not improve the objective function). To visualize the performance of the estimator in the complete parameter space, we use error plots investigating the relationship θ_0^(m) → θ̂^(m). Figure 8 indicates that the EM algorithm not only succeeds in bypassing erroneous local minima but also does so with reduced variance in the estimation results.

Another deficiency we encountered is that the MLE estimator failed to converge entirely for a subset of starting values, as can be seen in Figs. 5 and 9. Again, this behavior was not registered for the EM algorithm, except for cases with starting points deviating by more than +10 000% (corresponding to about 40 dB) from the reference values.
Remark 3 (Numerical conditioning). The superior convergence performance of the EM algorithm has two possible sources. First, the properties derived for the MLE estimator hold only asymptotically. Although both θ̂_MLE and θ̂_EM are consistent, in a limited sample the two estimates can differ, so that θ̂_MLE ≠ θ̂_EM and L(θ̂_MLE; H_T^K) ≠ L(θ̂_EM; H_T^K). In our application, the number of repayments may not be large enough to attain an asymptotic regime in the numerical maximization of the incomplete log likelihood. On the other hand, it might be enough for the EM algorithm to produce accurate results, due to the additional branching-structure information captured by the EM estimator.
Furthermore, given the EM construction as a lower bound for the MLE, it is intuitively expected that the EM-objective function will exhibit a larger "curvature" compared to the incomplete log likelihood. Indeed, as shown in Fig. 10, the EM objective appears to be a better-conditioned objective function for the same problem. We showcase this property using the condition number of the Hessian matrix, which is intricately linked to convergence performance. In particular, we focus on the difference between the condition numbers of the MLE and EM surfaces. Computationally, we find that the EM objective has a better-conditioned Hessian on average in 80% of all points in the search space; see Fig. 11. Nevertheless, it is important to remember that both methods are local techniques, so neither can guarantee the attainment of a global maximizer. Classical benchmark. As indicated in Sec. III A, the erroneous local minima, and hence the convergence issues, can be circumvented using certain heuristic techniques. Disregarding for a moment the computational burden of repeating the optimization for M initial guesses, we characterize every batch of starting values by the single vector of estimates θ̄ that produces the largest incomplete log likelihood, for MLE and EM, respectively:

θ̄ ∈ arg max { L(θ̂^(m) | H_T^K) : m ∈ {1, …, M} }.

To evaluate the accuracy of the estimation with a single number, we define the (aggregate) relative error for the best (respectively, worst) case using the 2-norm ‖·‖ as

ē = ‖θ̄ − θ_r‖ / ‖θ_r‖ and e̱ = ‖θ̲ − θ_r‖ / ‖θ_r‖,

where θ̲ denotes the batch estimate with the smallest incomplete log likelihood.
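The benchmark selection and error metric amount to two one-liners; a sketch with illustrative names:

```python
import math

def relative_error(theta_hat, theta_r):
    """Aggregate relative error: ||theta_hat - theta_r||_2 / ||theta_r||_2."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(theta_hat, theta_r)))
    den = math.sqrt(sum(b ** 2 for b in theta_r))
    return num / den

def best_estimate(candidates, loglik):
    """Select, from a batch of per-seed estimates, the one with the
    largest incomplete log likelihood (the 'classical benchmark')."""
    return max(candidates, key=loglik)
```

Replacing `max` with `min` in `best_estimate` yields the worst-case estimate used for e̱.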
The evidence from our data indicates that the worst-case and best-case EM estimates, measured in terms of the complete log-likelihood function value, are almost indistinguishable. The difference between the highest and lowest values of the complete log likelihood, over the batch of initial guesses, was on average on the order of 10⁻². This means that the sample distributions of the relative error for the worst-case and best-case EM estimates are almost identical. This contrasts dramatically with the MLE relative-error distribution presented in Fig. 9, where the difference between the best and worst estimates can be extremely large. These empirical relationships are captured in Table IV. The superior convergence performance of the EM algorithm and the negligible difference in the best-to-worst comparison speak for EM as the more robust of the two methods. Throughout the numerical experiment we have not observed a single instance of the direct MLE procedure outperforming the EM in terms of convergence. Given the substantial number of scenarios tested, we believe that this is a representative and significant result.
Remark 4 (Action Sensitivity). It is worth pointing out that EM may improve the estimation performance of MLE even in settings where the control process a(t) has limited significance.^{10} Figure 12 showcases the estimation performance, measured in terms of the relative errors of both estimation methods, for various values of δ_2. We consider a setup similar to that of the previous section (i.e., ten initial guesses for the solver and ten independently generated portfolios for each value of δ_2). In addition, we employ the same filtering technique based on the log-likelihood function value to separate the best MLE and the worst EM estimates. As expected, the relative error of the best MLEs closely corresponds to the EM estimates; see Fig. 12(a). However, when considering the average relative error of all solutions (not just the best), we observe the previously recorded divergence of the MLE as a result of disparate local likelihood optima; see Fig. 12(b). This suggests that EM can be the preferred estimation method even for uncontrolled (i.e., autonomous) Hawkes processes.

^{10} For any given realization of a(t), for t ≥ 0, the importance of the control process for the evolution of the arrival intensity is fully described by the (non-negative) sensitivity parameter δ_2. The case of an autonomous Hawkes process (without control) corresponds to the degenerate situation δ_2 = 0.

FIG. 12. Impact of the sensitivity parameter δ_2 on the relative error. Reported values are averages over ten independently generated portfolios. (a) Relative error of the estimates producing the highest log likelihood over ten initial guesses. (b) Average relative error of all estimates over ten initial guesses. Error bands mark the applicable coefficients of variation.

V. CONCLUSION
We have constructed an alternative estimation method for (linear) controlled Hawkes processes based on the EM algorithm. Compared to conventional maximum-likelihood estimation, the presented method exhibits substantially more robust behavior with respect to convergence and the choice of initial guesses. The robustness was tested on extensive synthetic credit-collections data mirroring the sparse repayment observations encountered in practice. The EM algorithm performed reliably well across all scenarios and produced maximum-likelihood estimates with a variance significantly below that of the standard MLE method. The bias of the EM method was assessed by comparing the best-case MLE to the worst-case EM estimate, measured in the value of the log-likelihood function. The difference in the estimates produced was inconsequential (±0.02%), suggesting that the EM algorithm provides a significant stability gain and is therefore the advised method for the estimation of linear CHPs. Our findings suggest that EM is a viable alternative to the conventional MLE and, in applications where a rich history of observations is unavailable, a superior estimation method. In cases where direct maximization is preferred, the EM algorithm may be used to obtain bootstrapped initial seeds for the model parameters in question. On the theoretical side, we have shown that the (non-negative) difference between the incomplete log likelihood and the expected complete log likelihood (ECLL) is given by the entropy of the branching distribution, thus establishing a lower bound for the incomplete log likelihood that becomes tight at the optimal EM estimator.

ACKNOWLEDGMENTS
The authors wish to thank N. Chehrazi, two anonymous referees, as well as participants of the INFORMS Annual Meeting in Phoenix, Arizona, for helpful comments and suggestions. Support for this research by the Swiss National Science Foundation (under Grant No. 105218-179175) is gratefully acknowledged.

APPENDIX A: ANALYTICAL DETAILS
ECLL forms a lower bound for the incomplete log likelihood. For simplicity we assume that a = 0 (or else consider μ̃ = μ + a instead of μ). Comparing the classical incomplete log likelihood and the expected complete log likelihood (as in a log-likelihood-ratio test) yields the difference of the two objectives. To obtain the last inequality, note first that μ and g take non-negative values and that the branching probabilities P[u_i = j | ·] lie in [0, 1]. By the concavity of the natural logarithm and Jensen's inequality we get

The right-hand side can then be rewritten in the stated form. Finally, since the logarithm is an increasing function, the resulting expression is bounded as required, which implies the inequality in question.
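The key step of the lemma is Jensen's inequality for the concave logarithm: for probability weights p_i and positive values x_i, log(Σ_i p_i x_i) ≥ Σ_i p_i log x_i. A quick numerical sanity check (illustrative only, with randomly drawn weights and values) confirms the inequality:

```python
import numpy as np

rng = np.random.default_rng(0)

# Jensen's inequality for the concave logarithm: for any probability
# weights p over positive values x, log(p @ x) >= p @ log(x).  This is
# the step that moves the expectation over the branching distribution
# outside the logarithm in the lower bound.
for _ in range(1000):
    x = rng.uniform(0.1, 10.0, size=5)   # positive values (mu, g >= 0)
    p = rng.dirichlet(np.ones(5))        # probability weights
    assert np.log(p @ x) >= p @ np.log(x) - 1e-12
```

Equality holds exactly when all x_i with positive weight coincide, mirroring the tightness condition of the bound.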
Proof of Theorem 1. Without loss of generality, we set K = 1, so that there is only a single data path in a singleton portfolio, and the superscript k is dropped for notational convenience. Assume a realization X = {(τ_1, r_1), (τ_2, r_2), . . . , (τ_n, r_n)} of a CHP given by Eq. (1), with a branching structure described by the latent variable Y = {y_1, y_2, . . . , y_n} (i.e., y_i denotes the ancestor of the ith arrival).^{11} That is, X is the incomplete data, with complete data given by Z = (X, Y). Furthermore, we assume a density p(X|θ) of the observed variable, an arbitrary density q(Y) of the latent variable, and a joint density p(X, Y|θ) of the observed and hidden variables. In the setting of CHPs, we can identify the first and the last of these densities with the incomplete and complete log likelihoods, i.e., L(θ) = ln p(X|θ) and L_C(θ) = ln p(X, Y|θ). Let G be a lower bound to the log-likelihood function, parametrized by θ and the density q(Y), such that L(θ) = G(q, θ) + D[q ‖ p(·|X, θ)], where D[q ‖ p(·|X, θ)] denotes the Kullback-Leibler divergence (relative entropy) of q with respect to p(·|X, θ).^{12} Clearly, the bound G becomes tight if and only if the two distributions are identical. The tight lower bound G can therefore be expressed [for q(·) = p(·|X, θ)] as the sum of the expected complete log likelihood and the entropy of the branching distribution, where the penultimate equality is obtained using the law of total probability. The two terms correspond to the ECLL in Eq. (8) and the adjustment term in Eq. (13), respectively. Consequently, the branching distribution p(·|X, θ) is given, for any branching-structure realization Y, by the product over arrivals of the individual ancestor probabilities (see also Fig. 3). Finally, we recover the stated identity, which concludes the proof.

EM algorithm. Building on the proof of Theorem 1 (and preserving the notation used there), the EM algorithm can be described as follows.
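The E-step of this algorithm computes the branching distribution p(·|X, θ). The sketch below illustrates it for a deliberately simplified case, an unmarked, uncontrolled Hawkes process with exponential kernel g(t) = α e^{-κt}; the kernel form, the parameter names, and the function itself are assumptions, not the paper's exact CHP specification (which additionally carries marks and the control a(t)).

```python
import numpy as np

def branching_probabilities(taus, mu, alpha, kappa):
    """E-step sketch: probability that arrival i was triggered by an
    earlier arrival j (y_i = j, columns 1..n) or is an immigrant from
    the base rate mu (y_i = 0, column 0), for an unmarked, uncontrolled
    Hawkes process with kernel g(t) = alpha * exp(-kappa * t)."""
    n = len(taus)
    P = np.zeros((n, n + 1))
    for i in range(n):
        w = np.zeros(n + 1)
        w[0] = mu                          # base-rate (immigrant) weight
        for j in range(i):                 # only earlier arrivals can trigger
            w[j + 1] = alpha * np.exp(-kappa * (taus[i] - taus[j]))
        P[i] = w / w.sum()                 # normalize over candidate ancestors
    return P

# Three hypothetical arrival times; each row of P sums to one, and the
# first arrival is necessarily an immigrant.
P = branching_probabilities(np.array([0.5, 0.8, 3.0]), mu=0.4, alpha=0.9, kappa=1.2)
```

The M-step would then re-estimate θ by maximizing the ECLL with these ancestor probabilities held fixed, and the two steps alternate until convergence.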

Symbol         Description                                                  Range
a(t)           Account-treatment schedule                                   R_+
g(t, m)        Triggering kernel                                            R_+
L(·)           Incomplete log-likelihood function                           R
L_C(·)         Complete log-likelihood function                             R
m_i            Relative repayment at time t = τ_i                           R_+
H_t            Available information at time t                              -
(N(t), R(t))   Repayment process                                            N × R_+
N(t)           Repayment counting process                                   N
Q(θ, θ_n)      Expected complete log-likelihood function                    R
R(t)           Cumulative relative-repayment process                        R_+
r_i            Relative repayment at time t = τ_i                           [0, 1]
t              Current time                                                 R_+
T              Observation period                                           R_++
δ_1            Sensitivity of intensity with respect to J                   R_++^{dim(J)}
δ_2            Sensitivity of intensity with respect to a                   R_++^N
κ              Mean-reversion parameter                                     R_++
λ(t)           Intensity process                                            R_+
λ_0            Initial value of intensity [λ(0) = λ_0]                      R_++
λ_∞            Long-run stationary value of intensity                       R_+
μ(t)           Base-rate intensity                                          R
θ              Vector of process parameters [θ = (κ, λ_0, λ_∞, δ_1, δ_2)]
ϑ_j            Time of jth account-treatment action (j ≥ 1)                 R_++
τ_i            Arrival time of ith repayment (i ≥ 1)                        R_++