Early Real-time Estimation of Infectious Disease Reproduction Number

When an infectious disease strikes a population, the number of newly reported cases is often the only available information that one can obtain during early stages of the outbreak. An important goal of early outbreak analysis is to obtain a reliable estimate for the basic reproduction number, $R_{0}$, from the limited information available. We present a novel method that enables us to make a reliable real-time estimate of the reproduction number at a much earlier stage compared to other available methods. Our method takes into account the possibility that a disease has a wide distribution of infectious period and that the degree distribution of the contact network is heterogeneous. We validate our analytical framework with numerical simulations.

The basic reproduction number, R 0 , is a fundamental characteristic of an infectious disease. It is generally defined as the expected number of new infections caused by a typical individual during the whole period of his/her infection, in a totally susceptible population (Anderson & May, 1991;Anderson & Britton, 2000;Dietz, 1993). Because R 0 is a simple scalar quantity, and perhaps because in many circumstances it determines the expected final size of an outbreak (Kermack & McKendrick, 1927;Dietz, 1993;Ma & Earn, 2006;Arino et al., 2007;Brauer, 2008) it has been widely used as a gold standard to gauge the degree of threat that a specific infectious agent will pose as an outbreak progresses (Hethcote, 2000;Lipsitch et al., 2003;Fraser et al., 2009). While it is clear that knowing the value of R 0 can be very useful to policymakers who are responding to an infectious disease invasion, it is not straightforward to obtain a reliable estimate of R 0 , especially in early stages of an outbreak before large scale uncontrolled transmission has taken place and before the basic biology and transmission pathways of the pathogen have been characterized.
Early in an outbreak, the pattern of disease spread is predominantly under the influence of the probabilistic nature of infection transmission. Consequently, a wide array of ultimate outcomes is possible, ranging from the outbreak fizzling out even in the absence of intervention to circumstances where the initial stage expands into a large-scale epidemic. Once a full-blown epidemic develops, several assumptions can be made that simplify the estimation of R 0 , as has been discussed in detail in the literature (Anderson & May, 1991;Dietz, 1993;Hethcote, 2000;Ma & Earn, 2006). In many circumstances, however, it is crucial to assess the impact of various intervention strategies before a large-scale epidemic occurs. In doing so, stochastic manifestations of disease transmission as well as the underlying structure of the contact network 2 should be taken into account. The first aspect has been widely studied during the past. For example, the Reed-Frost model is a particular case of a chain-binomial stochastic model, where each infected individual can infect susceptible individuals independently, and those individuals are assumed to have the same contact rate with one another (Abbey, 1952;Addy et al., 1991;Ball, 1986). Another example is the methodology developed by Becker (Becker, 1976) and Ball and Donnelly (Ball & Donnelly, 1995) based on a branching process susceptible-infectedrecovered (SIR) model. Branching processes have received wide attention because they facilitate the evaluation of the reproduction number as well as the final epidemic size and epidemic probability (Guttorp, 1991). More recently, the 2003 global outbreak of severe acute respiratory syndrome (SARS) inspired the development of new methodologies based on knowledge of the daily number of new cases and the distribution of the serial interval between successive infections (Wallinga & Teunis, 2004;Cauchemez et al., 2006a,b). Unfortunately, none of these methods take into consideration the underlying contact network behind the epidemic process.
In this paper, we propose a new methodology to estimate the basic reproduction number together with other useful epidemiological quantities at an early stage of an outbreak, while accounting for both the stochastic nature of the transmission process and the underlying structure of the contact network. The main parameters used in this approach, i.e., the quantities we require in order to apply the method, are the infection rate probability, the removal rate probability and the number of newly-infected individuals per unit of time (case-notification data). The infection rate quantifies crucial information about the biology of the disease. For instance, for a viral infection, the infection rate, in principle, may be related to the viral load of infectious individuals. The removal rate indicates how quickly the infectious individuals are removed from disease dynamics. This rate can be related to death or various means of interven-3 tion such as quarantine, reduction of social activity due to severity of illness, behavior change, or other factors. The product of infection rate probability and the cumulative removal rate probability is related to the generation interval distribution (Wallinga & Lipsitch, 2007) (see section VI). The three main parameters will be discussed in more detail throughout the paper.
We will derive a stochastic renewal equation that allows us to understand how the number of newly-infected individuals is related to the preceding number of infected cases. Unlike existing approaches (e.g., Wallinga and Teunis (Wallinga & Teunis, 2004)), our method incorporates the degree distribution of the social contact network. We will show how knowledge of the social contact network can be utilized to obtain a more reliable estimate of the reproduction number. This paper is organized as follows. In Section 2 we introduce the ideas of contact network epidemiology and define the concepts of infectious rate, recovery rate and transmissibility. In Section 3 we show some examples of the spread of an infectious agent on a contact network. In Section 4 we outline a general framework to estimate the reproductive number assuming that all information about a specific realization of the epidemic process up to time t is known. In Section 5 we derive several expressions that allow us to estimate the basic reproduction number, R 0 , at an early stage of an outbreak and show some examples of the methodology. Finally, in section 6 we show the relationship between our results and the methodology developed by Wallinga and Lipsitch (Wallinga & Lipsitch, 2007), which is a special case of the framework presented in this paper.

Network basis
In this section we introduce briefly the ideas of contact network epidemiology and define the concepts of infectious rate, recovery rate and transmissibility. We map a collection of N individ-uals to a network in which each vertex represents an individual and each edge shows a pathway of possible infection transmission between two individuals. The degree of each individual, i.e., the number of contacts that s/he has -or the number of edges emanating from a vertex -is denoted by κ, and p k denotes the degree distribution, i.e., the probability that a randomly chosen vertex has degree k (k contacts). Several important quantities can be derived once a network's degree distribution is known. All moments of the degree distribution can be obtained since k n = ∞ κ=0 k κ p κ . For n = 1, k is the average number of nearest neighbors of a randomly chosen individual, which we denote z 1 . The average number of second degree neighbors of a randomly selected individual, z 2 , can be expressed as k 2 − k (Newman, 2002). To estimate R 0 , we count the number of edges along which an individual can infect others, once that individual has become infected. This quantity, the excess degree, is the number of edges emanating from a vertex (individual), excluding the edge that was the source of the focal individual's infection. In fact, the average excess degree is simply the ratio z 2 z 1 (Noel et al., 2009).
We denote the time at which an individual acquires the infection, by t I , the time since acquiring infection by τ = t − t I (also known as age of infection) and the total time that the individual can cause infection (time-to-removal) by t R . While harboring the infection, the person is first exposed (i.e., infected but not infectious) and then infectious (either symptomatically or asymptomatically). The individual may also recover, by which we mean only that s/he can no longer transmit the infection, not that s/he has necessarily completely cleared the pathogen; for some diseases, after a temporary recovery, the person might become infectious again. Knowing that an individual acquires infection at a given time t I , various states of infectiousness of this individual can be encapsulated within the hazard of infection or infectivity function, λ i (τ ).
The infectivity function measures the instantaneous risk of disease transmission across an edge.

5
This means that the conditional probability that infection occurs across and edge between times τ and τ + dτ given it did not occurred by time τ is equal to λ i (τ )dτ . Typically λ i (τ ) is zero initially (i.e., during the latent period), increases to a certain level and then declines (during the infectious period) before finally vanishing and remaining zero (i.e., in the case of permanent recovery or death). Figure 2 shows four hypothetical infectivity functions, the first of which is the typical case. In practice, the functional form of λ i (τ ) should be estimated from the actual transmission profile corresponding to a specific disease. Note that because of the limited restrictions on the functional form of λ i (τ ), must be a non-negative integrable function, the methodology we present here applies to any staged progression transmission process, such as an SE 1 E 2 ....E m I 1 I 2 ....I n R process, where S, E i and I j (i = 1, . . . , m and j = 1, . . . , n ) and R are the susceptible, exposed (infected but not infectious), infectious and recovered classes, respectively.
Using the infectivity function, we can evaluate the probability that transmission occurs across an edge during a specific time period. In particular, given the removal time t R of an individual, the transmissibility T (τ, t R ) -the probability that the individual transmitted the disease to one of its neighbors before τ units of time (since acquiring the infection) satisfies (Cox & Oakes, 1984;Newman, 2002): Let φ(τ, t R ) denote the logarithmic derivative of T (τ, t R ), which we will refer to as the infectivity rate i.e., let 6 Then it follows that T (τ, t R ) can also be expressed as: In general, the time-to-removal period, t R , varies from one individual to another and there is no a priori knowledge of the exact value of this quantity for each individual. Therefore we must account for its variability as well. Let λ r (t R ) denote the removal function or recovery hazard, i.e., the instantaneous rate of recovery for an individual t R units of time after acquiring the infection. This means that the conditional probability that an individual is removed between times t R and t R + dt R given it was not removed by time denote the probability that an individual has time-to-removal period ≥ t R , then (Cox & Oakes, 1984) subject to the condition Ψ(∞) = 0. We define the probability density function Using equation (1) and ψ(t R ) one can calculate the expected value of the transmissibility across an edge τ units of time after infection as: Finally, we define the overall expected transmissibility, i.e., the transmissibility when we wait for an adequately long time (t → ∞), as The reproduction number can now be expressed as R 0 = z 2 z 1 T , and represents the average number of infections caused by an infected individual before it is removed. exponentially and therefore can be expressed as:

Disease dynamics on networks
Progression of disease spread starts to decline as the total number of infected cases becomes comparable to the size of network. This is when the finite-size effects become important (Noel et al., 2009) (declining regime). All three regimes are represented in figure 2.  1)), N = 100, 000 nodes and {λ i = 0.12771, λ r = 0.25}. We use the tilde notation to make the distinction between the realization of a stochastic process (with tilde) and its expected (average) value (without tilde). Using equation (3)  and 40 ≥ t time units, respectively. Also, in the right panel of figure 3, we show the variation of ln(J(t)) over time. The two solid lines y 1 , y 2 ≈ αt with α 1 = α 2 = 0.26 are the tangents to the curves in the exponential regime. Figure 4 shows the same results for an exponential network with p k = 1 − e −1/κ e −k/κ and κ = 4 (z2/z1 = 7.0416). In this example, the stochastic, exponential and declining regimes are approximately between 0 ≤ t < 10, 10 ≤ t < 20 and 20 ≤ t time steps, respectively. The lines y 1 , y 2 ≈ αt with α 1 = α 2 = 0.516 are the tangents to the curves in the exponential regime. Here, we used epidemic simulation outputs to generate "synthetic" time series data. In practice, α can be estimated from real-life time series data if the outbreak progresses beyond the stochastic regime. As can be seen in figures 3 and 4, the number of newly infected cases at time t,J(t), generally resembles a noisy Gaussian-type function if the pattern of disease spread has a chance to grow substantially.

Stochastic dynamics of disease
In this section we outline a general framework to estimate the reproductive number assuming that all information about a specific realization of the epidemic process up to time t is known. LetΨ(t − τ, t)J(t − τ ) denote the fraction of individuals infected at time t − τ that remain infectious at time t. As mentioned before, the tilde is used to denote that a quantity corresponds to a specific realization of the epidemic process. LetZ(t − τ, t) denote the average (per individual) excess degree at time t. The aggregated excess degree (total number of links) of the individuals infected at time t − τ that are still infectious at time t is then given bỹ J(t) is the only dependent variable in the above formula.
As an outbreak progresses, at any given time there is a population of infected individuals, where the total number of infected cases is given bỹ which in turn, implies thatR The total excess degree of removed individuals is given bỹ It is worth emphasizing thatZ r (t) is the total number of edges that could have transmitted the infection by already removed individuals by time t, while only a fraction of them actually transmitted the disease successfully. This latter fraction is given byĨ r (t) +R r (t), whereĨ r andR r are the number of infectious and removed individuals at time t, infected by individuals who are already removed at time t. In the same manner, one can defineĨ i (t) andR i (t) as the number of infectious and removed individuals who acquired infection from those who remain themselves infectious at time t. Fig. 6 illustrates how individuals in various states infect each other, where the arrows show "who is infected from whom". More precisely, the arrows establish the causality link by showing where the superscript index (the predecessor) may come from.
For instance, a recovered/removed predecessor, denoted by superscript r inR r (t), can only come from a group whose members were recovered while their predecessors were still infectious (R i ), or from a group in which the person and his/her predecessor are removed (denoted by the self-loop fromR r to itself). Similarly, the recovered predecessor (r) of an infected person (Ĩ r (t)), may only come from a group of recovered individuals whose respective predecessors were infectious (R i (t)) or from a group whose respective predecessors were already recovered (R r (t)). The expected transmissibility of the removed individualsT r (t) at time t can thus be defined asT It is not possible to obtain a closed form equation forĨ r andR r in terms of the above-mentioned quantities, because the information about who gets infected from whom is not included in equation (8). However, estimates for the expected values for these quantities can be calculated and are given in the next section. This will allow us to estimate the basic reproduction number

Quasi-stochastic dynamics of disease spread
In this section we derive several expressions that allow us to estimate the basic reproduction number, R 0 , at an early stage of an outbreak and present some examples of the methodology.
Before going into the details, we introduce the following identity equation, which is helpful in establishing a link between the concept in the previous and the current section: Although the previous formalism is very general, in almost all practical situations, it is very difficult to obtainZ(t − τ, t), unless one is able to perform a very precise contact-tracing investigation. As a result, we prefer to approximate the excess degree per individual by the "expected excess degree", i.e.,Z(t − τ, t) ≃ z 2 z 1 . It should be noted that this approximation is only valid when the following three conditions are met: a) the effect of degree correlation 14 between neighbors is negligible; b) the number of infected individuals is large enough to provide adequate sample size; and c) the number of infected individuals is small enough comparing to the network size. Since our objective is to calculate R 0 at an early stage, these assumptions remain completely plausible. It should also be noted that the expected excess degree at t = 0 isZ(0) = z 1 and rapidly tends to z 2 z 1 as time progresses. This leads to a more precise equation which might be more suitable whenR(t) is rather small. As before, we also assume that the time to removal t R has the probability density ψ(t R ).
ReplacingZ(t − τ, t) by z 2 z 1 in (8), we obtain a new expression for the number of new infections whereR 0 (t, τ ) ≡T (t, τ, t R ≥ τ ) z 2 z 1 is the only independent random variable.
The expected number of infectious and removed individuals are given bỹ

Figures 7 and 8 show the number of infectious (left panel) and removed (right panel) individuals
for a disease that spreads on binomial and exponential networks, respectively. Each panel shows three curves. The curve comprising of red circles corresponds to the computer simulation of an epidemic on the network; during the simulation the new case counts are recorded to create a synthetic "time series" for J(t). The blue curve corresponds to the closed form equations (16) or (17) (forĨ(t) orR(t)), and the green one corresponds to the scenarios when the identity in equation (14) is used. These figures show a perfect agreement between these three curves. It should be noted that in the quasi-stochastic analysis, sinceZ(t − τ, t) is approximated by z 2 z 1 , equation (12) takes the following simple form The expected transmissbility of removed individualsT r (t) can be obtained as an extension of Eq. (6) by replacing the removal rate distribution ψ(t R ) with the conditional distribution of removal given it occured before time t, defined asψ r (t, t R ). The quantityψ r (t, t R )dt R is proportional to the number of individuals already removed by time t that were removed after This probability function, after incorporating the proper normalization, can be written as The expected transmissibility of removed individuals can then be calculated from We now define the reproduction number of the removed individuals as Combining the last equation with equations (13) and (18) one obtains that Now, considering the symmetry between different compartments in figure 6, it is straightforward to derive the following equation: Using equation (22), the basic reproduction number R 0 = T z 2 z 1 can then be estimated as On the right hand side of the last expression, the expected value ofĨ r (t) +R r (t) can be calculated asĨ where η(t, t ′ , τ ) is the fraction of infected individuals who are removed by time t and might have infected others at time t ′ ≤ t ; ζ(t ′ , τ ) is the probability function that an infection at time t ′ was caused by any of these individuals. LetJ i (t, τ ) ≡J(t − τ )Ψ(τ ), then η(t, t ′ , τ ) can be written as Fig. 9 illustrates the concept. It shows how the new infection rate at time t ′ is due to the infectious rate J i (t ′ , τ ′ ) a fraction of whom J i (t, τ ) stays infectious at a later time t. It is straightforward to write ζ(t ′ ; τ ) as Substituting the expressions of η and ζ in Eq. (25) we obtaiñ which only depends on case notification data and on the disease transmissibility. For a disease with constant removal rate λ r , we haveJ and as a resultĨ r (t) +R r (t) =R(t). Interestingly, in this case we also obtainĨ i (t) +R i (t) =Ĩ(t) and R i (t) =Ĩ i (t). This means that the first fraction on the right hand side in Eq. (24) equals unity, and therefore T r (t) z 2 z 1 = 1. Then the expression for R 0 (t) takes the simpler form

Numerical examples
To test the framework presented so far, we performed epidemic spread simulations on the two networks introduced earlier (binomial and exponential) and in each case collected the "time series" of case counts resulted from the simulations. In Fig. 10 22), respectively. The discrepancy at the very early stage is mainly due to approximatingZ(t) with z 2 z 1 . The small asymptotic deviation of our estimate for the exponential network comes from finite size effects (Noel et al., 2009). It is worth noting that T r (t) is a functional of λ i (t), λ r (t) and J(t) (see equations (20) and (19)). Therefore, when λ r is constant, the condition T r (t) z 2 z 1 = 1 allows one to evaluate, from  Figure 12: Estimates for the value of z 2 z 1 for the binomial (left) and exponential (right) networks, when λ i and λ r are known. a given time series, one of the three quantities λ i , λ r or z 2 z 1 , if the other two quantities are assumed to be known (this statement holds even if λ r is a function of time, in which case a more elaborate equation (22) should be used). For instance, we simulated again the epidemic on the binomial and exponential networks presented earlier, but this time with constant values for λ r and λ i . Using these values and the obtainedJ(t), we calculated the value of z 2 z 1 , which is presented in figure (12) for the binomial (left panel) and exponential (right panel) networks.
The green line shows our estimates using the above condition, while the red line is calculated directly from the network structure.  Figure 13: Estimates for the value of λ i for the binomial (left) and exponential (right) networks, when z 2 z 1 and λ r are known.
As a second example, we used the same epidemic simulations, but this time we used as input the values for λ r and z 2 z 1 that were derived directly from simulations and estimated the value of λ i which is shown in figure 13 for the binomial (left panel) and exponential (right panel) networks. Here again, one would observe very good agreement between the value of λ i used in the simulations (black line) and its estimated value from the condition T r (t) z 2 z 1 = 1 (red circles).

Exponential regime and generation interval distribution
In this section we show the relationship between our results and the methodology developed by Wallinga and Lipsitch (Wallinga & Lipsitch, 2007), which is a special case of the framework presented in this paper. In the exponential regime, one can ignore the stochastic fluctuation of the quantities and replace them with their expected values, since the stochastic effects become much less pronounced. In particular, we can re-write equation (15) by ignoring the stochastic effects as following We define the expression inside the bracket as χ(τ ′ ), which is the expected infection rate per infected individual. Equation (30) then takes the following simple form which is the well known Lotka renewal equation (Lotka, 1939;Kot, 2002).
From the definition of χ(τ ) it is clear that ∞ 0 χ(τ )dτ = R 0 . In the exponential regime, after replacing J(t) ≈ exp(−αt), we obtain It is worth mentioning that the exact exponential regime can be reached in the limit t → ∞ for an infinite-size network and that is why the upper limit in the above integral is set to ∞.
This means that there should be a slight deviation from the exponential behavior for a "finitesize" system at "finite time" once the outbreak has surpassed the stochastic regime. Following the line of argument in Wallinga and Lipsitch (Wallinga & Lipsitch, 2007) we find that whereχ(τ ) = χ(τ )/R 0 is the generation interval distribution. This quantity relates the basic reproduction number to the Laplace transform of the generation interval distribution in the asymptotic case (infinite-size, infinite-time). It is straightforward to obtain the following equation 23 This equation describes the relationship between the generation interval distribution and the derivative of transmissibility with respect to the duration of infection.
Equation (32) has a simpler form for constant λ r and λ i . This can be obtained by replacing This leads to α = 0.26084 and α = 0.51626 for the binomial and exponential networks, respectively, which are in excellent agreement with results shown in figures 3 and 4.

Conclusions
Using concepts from network theory, we presented a method which provides a reliable estimate of the basic reproduction number, R 0 , at an early stage of an outbreak. We provided the details of calculations and compared our results at each step against simulations. Case notification data (time series) is the main input to this analytical framework. As an outbreak begins to unfold, the pattern of spread depends substantially on the structure of the underlying contact network. This dependency, in fact, manifests itself in the formation of the time series of newly infected cases. The proposed methodology highlights the interplay between the heterogeneity in contacts (network structure), the estimates of the basic reproductive number and infection 24 transmissibility. Depending on the circumstances, this methodology can be used to infer other useful quantities as well. For infectious pathogens that cause repeated outbreaks, there is enough empirical evidence to establish the distribution of the duration of infectiousness as well as the recovery rate of individuals. In this case, in addition to the basic reproductive number, the proposed methodology can shed light on the structure of the underlying contact network by estimating the ratio z 2 z 1 . This is an important piece of information, because in many circumstances it is not possible to capture and build a detailed contact network among individuals based on some network generative rules. The importance of this quantity becomes more apparent when an emerging infectious disease strikes a population. In this circumstance, there is much less information on the characteristics of the disease such as the duration of infectiousness and recovery rate, which in turn determine the transmissibility of disease. Knowledge of the transmissibility of a disease during an early stage of an epidemic can play a crucial role, as effective and cost-effective public health intervention strategies hinge on the degree of contagiousness of a disease. On the other hand, before the spread of disease becomes rampant, the structure of the contact network within a population remains more or less stable. Therefore, the estimated value of z 2 z 1 obtained during epidemic lulls from the time series corresponding to common infections can be used to estimate the transmissibility of an emerging infectious disease at an early stage of an emerging infectious outbreak. We demonstrated this concept by two examples. Our estimate for the reproduction number very quickly tends to the real value, thus enabling epidemiologists and policymakers to identify the optimal control strategies in real time. 25