Individual- and pair-based models of epidemic spreading: master equations and analysis of their forecasting capabilities

Mathematical modeling of disease spreading plays a crucial role in understanding, controlling and preventing epidemic outbreaks. In a microscopic description of the propagation of a disease over the complex network of human contacts, the probability that an individual is in a given state (susceptible, infectious, recovered etc) depends on the state of its neighbors in the network. Thus it depends on the state of pairs of nodes, which in turns depends on triples, in a hierarchy of dynamical dependencies. In order to produce models that are at the same time reliable and manageable, one has to understand how to truncate such a hierarchy, and how the chosen order of approximation affects the ability of the model to forecast the real temporal evolution of an epidemics. In this paper we provide a systematic analysis of the reliability (under different hypotheses on the quantity and quality of available data) of the predictions obtained by truncating the hierarchy either at the level of individuals or at the level of pairs. We find that pair-based models are reliable tools both for estimating the epidemiological parameters and for forecasting the temporal evolution of the epidemics, under all conditions taken into account in our work. However, a pair-based approach provides a much better prediction of an individual-based one, only if better data, namely information on the state of node pairs, are available. Overall, our results suggest that more refined mathematical models need to be informed by improved contact tracing techniques to better support decision on policies and containment measures to adopt.


I. INTRODUCTION
The last decades have witnessed the emergence of new infectious diseases and the resurgence of old ones [1][2][3].Examples include the outbreak of Ebola in West Africa [4], the epidemic of Zika virus in North and South America [5], and the global spreading of respiratory diseases such as influenza A(H1N1) [6] and, more recently, COVID-19 [7,8], all of which have risen international public health concern.In this context, mathematical modeling of disease spreading has played a fundamental role in the understanding, control and prevention of epidemic outbreaks, guiding the policy-making processes through quantitative analyses [9][10][11].As the spreading of an infectious disease within a population is heavily affected by the precise patterns of contacts among individuals and by their mobility habits, complex networks, which allow to represent the intricate structure of human interactions [12][13][14], are a fundamental tool for analyzing the dynamics of an epidemic.A variety of techniques exists to model the dynamics of a disease spreading on a network [10,11], ranging from macroscopic compartmental models based on ordinary differential equations [15,16], to more sophisticated data-driven agentbased microscopic simulations [17][18][19] and metapopulation structured models [20,21].In particular, deterministic representations of compartmental models can be formulated, aiming at describing the epidemic process in terms of the temporal evolution of the probabilities that a node is in a given state.To this purpose, one can adopt either a top-down approach, which consists in defining all the possible configurations of the network and the mechanisms ruling transitions from one network state to another, or a bottom-up approach, focusing instead on the state of single nodes [11].The latter approach leads to a hierarchy of coupled differential/master equations.Indeed, the dynamics of a node typically depends on the state of the node itself and of its neighbors, thus on the state of pairs of nodes, which in turns depends on triples, in a hierarchy of dynamical correlations and dependencies.
Such hierarchies are very common in kinetic and statistical physics.For example, classical derivation of the Boltzmann equation of rarefied gas dynamics is based on the so-called BBGKY hierarchy, formally obtained by the integration of the Liouville equation of a system of N particles undergoing binary collisions: the evolution equation for the k-particle distribution function depends on the (k + 1)-particle distribution function [22].As another example, in Extended Thermodynamics, the evolution equations of moments of order k of the distribution function depend on the moments of order k + 1 [23].In all cases, to be of practical use the hierarchies have to be truncated by introducing some approximation.For instance, in the case of the Boltzmann equation, the hierarchy is closed at the level of single particle distribution function by the so-called Stosszahlansatz and propagation of chaos, while in extended thermodynamics, the closure is usually based on the so-called maximum entropy principle [22,23].
In the context of mathematical epidemiology, similar bottom-up approaches have been adopted to develop individual and pair-based continuous-time models on networks, and applied, for instance, to the SIR process in the homogeneous [24] and heterogeneous [25] mean-field approximation.Individual and pair-based approximations for the SIR process have also been analyzed in the discrete-time case where, unlike the standard continuoustime BBGKY-type hierarchy, the master equations describing the dynamics of the system are expressed in terms of joint probabilities, whose order is governed by the network structure itself via the degrees of individual nodes [26], thus resulting in a richer although more complicated mathematical structure.A relevant question concerning these kinds of hierarchies is at which order one needs to truncate them in order to get accurate predictions about an epidemic process.Indeed, while it has been shown how to close the system of equations at various orders [27], it is less clear how the chosen approximation affects the features of the model, and, more specifically, its ability to forecast the dynamical evolution of the state variables.In particular, a systematic analysis of the reliability of the predictions obtained by different hierarchy closures and under different hypotheses on the quantity and quality of available data, is still lacking.Indeed, the problem of data availability is of utmost importance in the context of the COVID-19 pandemic, where the epidemic process has been proven difficult to characterize and the mechanisms at work have not yet been fully unveiled [28][29][30].In this paper we address the issues above focusing on the SEAIR model, a compartmental model with five compartments.This is a generalization of the SIR model accounting for two criti-cal features of infectious diseases such as the COVID-19, namely the existence of a latency period and the presence of asymptomatic carriers.While being rather simple, the SEAIR model allows to study the crucial case of two infectious population, namely the symptomatic and the asymptomatic, with the latter that, given the difficulties in its measurement, had a significant role in the COVID-19 pandemic [8,[31][32][33].In more detail, first we derive the discrete-time master equations for the SEAIR model with two different orders of approximation, closing the system both at the level of individuals (individual-based population models) and at the level of pairs (pair-based population models), and deriving the corresponding set of master equations.We then compare these two approximations, analyzing to which extent they can capture the temporal evolution of the SEAIR epidemic process.To do so, we consider scenarios of increasing complexity, with the hypotheses on the amount of available data gradually becoming more strict from case to case.
The paper is organized as follows.First, we describe the microscopic mechanisms underlying the epidemic process and discuss the mathematical modeling approaches to describe it in Sec.II.We then derive an individual-based and a pair-based version of the SEAIR model in Sec.III.Assuming to know the epidemiological characteristics, in Sec.IV we compare the ability of the models in reproducing the dynamics of an outbreak.In Sec.V we analyse the predictive capability of the models when parts of the information on the epidemic are not available.Finally, we summarize and discuss the main findings of the work in Sec.VI

II. MODELING AN EPIDEMIC PROCESS
In compartmental models the population is partitioned into several states, namely the compartments, representing the different stages of the disease course.For instance, in the SIR model, which is one of the simplest compartmental models, an individual can either be susceptible (S), infected (I), or recovered/removed (R).To model the spreading of a disease within a population, the most important step is to characterize the processes governing the transitions of individuals from one disease stage, to another i.e. from a compartment to another [10].In the SIR model, the contagion is defined by two fundamental mechanisms.First, we have a two-body nonlinear process, representing the infection of a susceptible individual (S) by an infected one (I), which acts as a mediator of the transition.Second, we have a one-body linear process, describing the recovery (R) of an infected individual (I).The transitions from one compartment to another can be formally expressed with the following kinetic equations where β and µ represent, in the case of a continuous-time model, the transition rates for the infection and recovery processes, while they can be interpreted as transition probabilities in the case of a discrete-time model.These are two tunable control parameters of the model, which can be fixed based on the previous knowledge we have of the disease under study.For instance, the parameter regulating the recovery process, i.e. µ, can be set from the knowledge of the typical infectious period of the disease, i.e. the average time during which an individual remains infectious before recovering.In a discrete time model, µ represents the probability that an infected individual recovers in a time unit ∆t, typically one day.The expected number of time units for recovery is therefore 1/µ.The infectious period for influenza, for example, commonly lasts about 2-8 days, while, for smallpox, it can last over 20 days [34], resulting in a smaller value of µ for smallpox, compared to the value characterizing influenza.
From compartmental models the basic reproduction number, R 0 , of an infectious disease [35] can be evaluated.This parameter, which represents the average number of secondary infections produced by an infectious individual at an early stage of an epidemic, is of utmost relevance in mathematical biology, as it determines whether an infection can spread across the population.Quantitatively, when R 0 < 1, i.e. when the number of new infections per infectious individual is less than one, the infection is not able to spread.Otherwise, when R 0 > 1, an epidemic outbreak can occur [36].In the simple case of the SIR model, R 0 is given by the ratio between the infection and the recovery rates/probabilities, i.e.R 0 = β/µ.
In this paper we consider a discrete-time compartmental model with five compartments, in which, at any time, an individual can be either susceptible (S), exposed to the disease but still unable to spread it (E), asymptomatic infectious (A), symptomatic infectious (I), or recovered/removed (R).We refer to this compartmental model as the SEAIR model [37,38].This is a generalization of the SIR model, as it incorporates two additional states, i.e. the E and A compartments, which describe two essential aspects of several infectious diseases, namely the existence of a latency period and the presence of asymptomatic carriers [39].In particular, the presence of infectious individuals with no symptoms of the disease (state A) may play a critical role in an epidemic outbreak, as it has been observed, for instance, in the recent COVID-19 pandemic [8,[31][32][33].In the SEAIR model the progress of an individual through the disease stages is determined by the following kinetic equations (2) that are graphically illustrated in the flow diagram of Fig. 1.At variance with the SIR model, the SEAIR model accounts for two different types of infection mechanisms, as a susceptible individual (in state S) may be infected upon a contact with either an asymptomatic (in state A) or a symptomatic (in state I) infectious individual.These processes are described by the first two kinetic equations, and are governed by two infection probabilities, β A and β I respectively, which may take different values.
These probabilities depend on several factors.For relatively short time step ∆t of the discrete-time model, we can imagine that the probability β A that a susceptible individual gets infected by an asymptomatic one in such a time step can be expressed as β A = ν A ×p A ×∆t, where ν A represents the rate of encounters of an asymptomatic individual with other individuals, and p A the probability that one such encounter transmits the infection.The contact frequency depends on the awareness of the health condition: susceptible individuals tend to avoid contacts with symptomatic people more than with other susceptible or asymptomatic individuals.Furthermore, the probability that one encounter transmits the infection, p A , depends both on the intrinsic transmissibility of the disease, and on the preventive measures adopted, such as social distancing, use of masks and disinfectant, and so on [40].For example, in the case of COVID-19, it has been shown that the absence of symptoms, such as coughing and sneezing, applies a smaller a-priori transmissibility of the disease, and therefore a small value of p I .It is therefore reasonable to assume β A < β I [41,42].However, as asymptomatic individuals can be unaware about their condition, there are high chances that they will maintain their habitual social behavior.Therefore, as asymptomatic individuals can spread the virus without knowing it, they constitute an important public-health risk.Indeed, it is reasonable to consider β A > β I , assuming that there is a higher chance of dangerous contacts with subjects with no symptoms, thus incorporating such increased contact rates in the value of the transmission probability p A , and consequently on β A [43].Hence, the possibility to tune separately the values of the two parameters β I and β A , and to consider both the regime with β A > β I and the one with β A < β I , is an important aspect of the SEAIR model.The other important feature, which makes the SEAIR different from the SIR model, is the addition of the E compartment accounting for the presence of individuals exposed to the disease but still unable to spread it.Notice, from the flow diagram in Fig. 1, that in the SEIAR model the newly infected individuals first move to the E state, meaning that they are not immediately able to spread the disease.Before becoming infectious, the exposed individuals undergo a latency period, after which they are able to spread the disease as asymptomatic carriers.The transition from E to A is governed by the probability α EA , which is the parameter controlling the average duration of the latency period τ = ∆t/α EA .Asymptomatic individuals can eventually develop symptoms, and this is taken into account by the transition from state A to state I, which takes place with probability α AI .Subsequently, the individuals recover, or are removed, either with or without having previously shown symptoms, with two different probabilities, µ I and µ A respectively.In conclusion, the SEAIR model is based on five compartments and is ruled by six different tunable parameters β A , β I , α EA , α AI , µ A , µ I .
As the disease can only spread through encounters between infectious and susceptible individuals, the progress of an epidemic outbreak ultimately depends on the patterns of contacts among the individual themselves.These are strongly ruled not only by social habits, but also by the geographical distributions of the individuals and by the way these distributions change in time.Humans travel across a hierarchy of characteristic spatial scales, such as neighbourhoods, cities and countries, so that their highly heterogeneous mobility influences the way in which they interact [44][45][46][47][48]. Therefore, a crucial aspect of any realistic modeling of epidemic spreading is how to model the contact patterns within a population.Large-scale agent-based simulations [19] and structured metapopulation models based on data-driven mobility schemes at the interpopulation level [49] are usually complemented by models amenable to mathematical analysis [50] that capture the influence of human behaviour and the existence of complex social structures.In this context, complex networks [14] have revealed particularly useful as they allow to represent the physical contact patterns that result from real movements of individuals between specific locations [17], and also to investigate in a controlled way how the social structure of a population affects the evolution and outcome of an epidemic.
Here, we will focus on network modelling of disease spreading.Namely, we will implement the stochastic processes that describe the infectious disease on an undirected graph G = (V, E), with N = |V| nodes and K = |E| links.In the following we will indicate as G = {g ij } (with g ij = 1 if node i and j are connected, while g ij = 0 otherwise) the N × N adjacency matrix of such graph.In this framework, each node i ∈ V of the graph represents an individual of the population, and can be in one of the five different states of SEAIR model.Each edge (i, j) ∈ E represents a contact along which the infection can take place.Nodes change states according to Eqs. (2), where the first two kinetic equations are implemented over the links of the graph.Different graph topologies will be studied in this article.For simplicity, however, the graphs considered here will always be fixed in time, which means that the pattern of contacts is assumed not to change during the evolution of the disease.This is, however, a strong assumption, as humans tend to react to a disease by avoiding contacts with infected individuals.This leads to a rewiring of links depending on the states of the nodes that affects the dynamics of the disease, which in turn influences the rewiring process.Temporal networks and time-varying graphs are an important subfield of network science [51][52][53], and sim- ple epidemic models, such as the SIS, have been studied on temporal networks [54,55] and on adaptive networks, i.e. on networks whose structure is coevolving with the disease [56].Although the SEAIR model can be straightforwardly implemented on a time-varying graph, this is beyond the scope of this paper.
The dynamics of the disease spreading on a network can be studied in several different ways [10,11].Focusing on probabilistic methods, one can decide to investigate a system numerically, performing extensive stochastic simulations of the epidemic process on the network, or, alternatively, one can develop a deterministic representation of the process, writing the evolution equations for the probability that the network is in a given state.In general, the deterministic equations can be derived at the level of each single node or at the population (meanfield) level.The first approach requires to consider a large number of evolution equations that can be impractical and computationally costly to integrate.Furthermore, empirical data on both the contact network and the state of the nodes may be limited or unavailable.For instance, the information on the state of the individuals is usually provided at a coarser grain, i.e. at a population level [43,57].Finally, from a microscopic stochastic simulation on a large network, it is more difficult to understand how the emerging collective behavior is related to the underlying parameters, which could be better interpreted by a mesoscopic coarse-grained modeling.For these reasons, it is common to assume that the individuals are homogeneously mixed and interact with each other completely at random.Under this hypothesis, each node of the network can be considered statistically equivalent to any other, which permits to describe the system at a population level, drastically reducing the number of evolution equations.
In Section III, we develop two population-level models for the contact-based SEAIR epidemic process described in Eqs.(2).First, we assume statistical independence at the level of nodes, deriving a model we refer to as the individual-based SEAIR model.Next, we assume statistical independence at the level of node pairs, crucially incorporating the dynamical correlations within the network in a model we refer to as the pair-based (or pairwise) SEAIR model.

III. POPULATION-LEVEL MODELS
To begin with, here we discuss the dynamical variables characterizing the individual-based and the pairwise SEAIR models.For the individual-based model, the description of the system is carried out at the level of single individuals, i.e. the nodes, while for the pairwise, one has to express the dynamics at the level of pairs of individuals, i.e. the edges.Mathematically, this means to write a set of master equations governing the temporal evolution of either the probability that a node is in a given compartment, in the case of an individual-based model, or the probability that an edge is in a certain state, in the case of a pairwise model.Let us denote as Ω the set of possible compartments, i.e., Ω = {S, E, A, I, R} in the SEAIR model.Adopting a standard notation [11], we indicate as X i t the probability that a node i belongs to the compartment X ∈ Ω at time t.For the sake of convenience, let us denote as U the state of a node which is not infectious, namely a node that belongs to either S, E or R. We denote with U i t the probability that a node i is the state U at time t.Such a probability can be evaluated as U i t = S i t + E i t + R i t .Under the homogeneous mixing hypothesis, each node of the network is assumed to be statistically equivalent to any other, meaning that X i t = X j t for i, j = 1, . . .N .Hence, for the population level model, we can drop the node index, as we have where [X] t represents the expected number of individuals in the compartment X at time t, and N is the number of individuals in the population.Here, X t is the probability that a generic node of the network is found in the state X at time t, and represents the fundamental variable of the individual-based model.Analogously, the probability that a generic node is in a non-infectious state U at time Similarly, we denote as X i Y j the probability that the edge (i, j) is in state (X i , Y j ).Again, in a mean-field approximation, we can drop the indices as we have , ∀i, j = 1, . . .N, X, Y ∈ Ω (4) where [XY ] t represents the expected number of pairs in state (X, Y ) at time t, K is the number of edges in the network, and XY t represents the probability that a generic link of the network is found in the state (X, Y ) at time t.The quantities XY t , with X, Y ∈ Ω, will be the fundamental variables of the pair-based model.Note that from Eq. ( 4) it follows that XY t = Y X t .The individual (node) state probabilities in any compartment, X t , can be obtained as the marginal probabilities of the pair (edge) state probabilities as In the most general case, the notation X i1 Y i2 . . .Z in t denotes the probability that a given connected subgraph induced by nodes i 1 ,i 2 ,. ..,i n is found in the state (X i1 Y i2 . . .Z in ) at time t.Once again, under the homogeneous mixing hypothesis, we can write such a quantity as XY . . .Z t .Now, in order to develop an individual-based model for the SEAIR dynamics, one has to write a set of equation describing how the individual state probabilities X t in Eq. ( 3) evolve in time.To do so, however, it is necessary to introduce an hypothesis of statistical independence at the level of the individuals.Indeed, the infection processes described in Eq. ( 2) require to consider the joint probability that an individual is susceptible and that one of its contacts in the network is infectious, either asymptomatic or symptomatic, i.e. the probabilities SA t and SI t , respectively.Therefore, to write a set of equations involving only the individual state probabilities, i.e. a closed-form expression, one has to express these higher-order probabilities in terms of the quantities X t .Mathematically, this is done by assuming statistical independence at the level of the nodes, i.e. writing XY t ≈ X t Y t .However, this assumption overlooks the impact of dynamical correlations that indeed exist within the contact network, e.g.infected nodes are more likely to be in contact with other infected nodes [11].
To take these correlations into account, one can develop a pair-based model of the SEAIR dynamics, describing how the pair state probabilities in Eq. ( 4) evolve in time [10].As we will see thereafter, even the pairwise model will require to express some high-order joint probabilities in terms of lower-order probabilities, in this case X t and XY t .
In the rest of this section, we formulate the master equations for both the individual-based and the pairwise models, and discuss under which hypotheses a closed set of equations can be obtained.

A. Individual-based SEAIR model
Here we derive the individual-based population-level model of the SEAIR epidemic process.As previously discussed, under the hypothesis of homogeneous mixing, we can write a set of evolution equations at the population level for the probability X t that a generic node in the network belongs to a given compartment X at time t.Following the transition diagram in Fig. 1, we can write the master equations (also known as rate equations) of the system as where the term Π X→Y denotes the probability that at a generic node there is a transition from state X at time t to state Y at time t + 1.
As we have seen, in the epidemic process defined by the kinetic equations in Eq. ( 2) we identify two classes of transitions.On the one hand, we have the class of two-body nonlinear processes, for which the transition of a given node from one compartment to another needs to be mediated by the interactions with other nodes, as the node can only be infected by one of its neighbors.On the other hand, we have the class of one-body linear processes, for which the transitions of an individual from a state to another does not depend on the state of its contacts.Consistently, we distinguish two kinds of transition probabilities, with the one-body processes being described by linear probability terms, whereas the two-body processes give rise to nonlinear terms.
Let us first consider an example of linear transition.The probability of moving from state I to state R, which corresponds to the recovery process of a symptomatic individual, can be written as where µ I represents the recovery probability for the symptomatic individuals.As the recovery of a symptomatic individual does not depend on the state of the other individuals, the transition probability is only determined by the probability of being in state I at time t, i.e. the term is linear in I t .Similarly, we can write the remaining linear transition probabilities as where α EA , α AI and µ A represent the probability that an individual becomes infectious, develops symptoms and recovers while asymptomatic, respectively.
As an example of nonlinear term, let us now consider the transition probability regulating the infection of a susceptible individual, i.e. the probability Π S→E .At variance with the recovery processes, an infection can only occur if a susceptible individual interacts with an infectious one.In other words, the transition probability of a node from state S to state E depends on the state of its nearest neighbors.Therefore, in order to express this term at a population level, we should first develop a node-level description, and thereafter introduce the homogeneous mixing hypothesis.To do this, let us denote the probability that a node i moves from state S to state E at time t as Π Si→Ei .If we consider the simple case of a node i of degree k i = 1, i.e. connected to only one other node, say j, we can write the transition probability as In the case where k i = 2, we have instead In the general case when node i is connected to k other nodes, i.e. k i = k, as shown in Fig. 2, we can write the transition probability as where β A and β I are the infection probabilities for the asymptomatic and symptomatic, respectively.Each term on the right-hand side is the product of two terms, namely the joint probability that node i belongs to the state S and its neighborhood is in a certain state, multiplied by the conditional probability that i gets infected over the next time step, given that particular state of its neighborhood.Now, going back to Eqs. ( 6), we note that they are exact but not closed, because Eq. ( 11) involves the joint probability of the (k + 1)-uple formed by the node and its k neighbors.Indeed, the evolution of such joint probability, in turn, would depend on the joint probability of larger set of nodes, giving rise to a hierarchy of coupled equations of increasing complexity.As mentioned in the Introduction, this type of hierarchies are very common in statistical physics.Examples are the BBGKY hierarchy in kinetic theory [22], and the moment hierarchy in extended thermodynamics [23].As it would be unpractical or even impossible to deal with the full hierarchy of evolution equations, it is common to close the system by expressing the higher-order joint probabilities in terms of lower-order ones.In the context of epidemic models, various closure methods have been adopted to develop individual and pair based models of continuoustime stochastic systems, both in the homogeneous and heterogeneous mean field approximation [11,24,25].
In this section, we close the system of equations ( 6) at the level of individual nodes by assuming statistical independence of their states.Under this hypothesis, we can approximate the (k + 1)-body joint probabilities as: Given the expressions in Eq. ( 12) we can rewrite the joint probability terms appearing in Eq. (11).Then, by considering the individuals within the populations to be homogeneously mixed and by assuming that the number of contacts of each node is fixed and equal to k (i.e.k i = k ∀i), after some algebra (see Appendix A for details), we can express the population-level transition probability from state S to state E as To summarize, a closed-form, individual-based population-level SEAIR model is obtained by replacing in Eq. ( 6) the transition term Π S→E as in Eq. ( 13), and the linear transition probabilities given by Eq. ( 7) and Eq.(8).
To conclude, hereafter we report the value of the basic reproduction number, i.e.R 0 , which permits to determine whether the infectious disease is able to spread across the population.We compute it by using the nextgeneration matrix (NGM) approach [35,58], developed for discrete-time epidemic models [59] (more details on the method are reported in Appendix D).We have

B. Pair-based SEAIR model
At variance with the individual-based model of Section III A, for which we have obtained the master equations for the individual probabilities X t , in order to derive a pair-based (or pairwise) population-level SEAIR model, we need to write down a closed set of equations describing the temporal evolution of the pair probabilities XY t .To do so, we first need to list all the possible transitions that may occur among the pair states.This conceptual step is not straightforward, as the number of ways in which a pair can move from one state to another according to Eq. ( 2) is considerably larger than in the case of individual transitions.For instance, a pair in state (S, S) can progress either to one of the states (S, E) or (E, S), when only one node of the pair is infected, or to the state (E, E), when both nodes are infected at the same time.Let us come back for a moment to the diagram of the individual transitions shown in Fig. 1.We can interpret the flow diagram as a directed graph and associate to it an adjacency matrix A I .If we label the model compartments such that S is indexed by 1, E by 2, and so on, then we have that (A I ) nm = 1 (with n, m = 1, . . .5) if there is a transition from compartment m to compartment n, and (A I ) nm = 0 otherwise.Note that, for any n, (A I ) nn = 1, as an individual in state n can remain in it, although, for the sake of clarity, the self-loops are not represented in Fig. 1.Therefore, the adjacency matrix A I corresponding to the diagram in Fig. 1

is given by
Analogously, the flow diagram at the level of pair transitions can be represented by an adjacency matrix A P .Remarkably, since the transition of a pair from a state to another is determined by the progress of an individual from a compartment to another, we can evaluate the matrix A P from the matrix A I as where the symbol ⊗ denotes the matrix direct product (also known as Kronecker product).In components, matrix A P is given as which means that there is a link from pair (i, ) to pair (j, m) in the pairwise graph if and only if there exist both edges (i, j) and ( , m) in the individual graph.The set of all the possible (twenty-five) different pair states in the pairwise SEAIR model is shown in Fig. 3, together with the flow diagram associated to the adjacency matrix A P .Again, as in Fig. 1, self-loops are not displayed.Finally, from this diagram, we can write the master equations for the pairwise SEAIR model as where the term Π XY →X Y represents the transition probability from the pair state (X, Y ) to the pair state (X , Y ).Note that, as Due to the symmetries in the pair states, the equations for the states (X , Y ) and (Y , X ) are equivalent.For this reason, we have only reported here the system of fifteen distinct equations (over the total of twenty-five) needed to fully characterize the system.
As for the individual-based model, we distinguish two classes of transition probabilities, i.e. linear and nonlinear ones.In particular, the class of nonlinear transition probabilities consists of all the terms involving at least one node in state S, as the probability of being infected (and consequently the probability of remaining susceptible) depends on the state of the nearest neighbors of the node.
Again, let us first consider the linear transition probabilities.As an example we focus on the case of two symptomatic infected individuals, one of which recovers while the other remains infectious, i.e. on the transition from state (I, I) to state (I, R).The probability of such transition can be written as: where the term (1 − µ I )µ I considers the two independent processes taking place in the pair transition, namely the recovery of the first node of the pair, occurring with probability µ I , and the persistence of the infection in the second, occurring with probability (1 − µ I ).
As an example of nonlinear transition probability, let us consider the transition from state (S, E) to state (E, A), i.e. the case of a susceptible node connected to an exposed one, with the former that gets infected, while the latter becomes asymptomatic infectious.Similarly to the individual-based model formulation, as the infection of a susceptible node depends on the state of its neighbors, we first need to develop a node-level description of the process.Hence, let us denote the susceptible node as i, the exposed node as j, and derive the expression for the probability, Π SiEj →EiAj , that the pair (i, j) moves from state (S i , E j ) to state (E i , A j ) at time t.We assume that the nodes i and j are connected on the contact graph G, i.e. according to the adjacency matrix G = {g ij }, and we consider the subgraph induced by the pair (i, j).Such subgraph consists of nodes i and j and of all their L neighbours.An example in which node i is in state S, node j is in state E and k i = k j = k is shown in Fig. 4 The transition probability can be written as: where h 1 , h 2 , ..., h L label the L nodes in the neighborhood of link (i, j).
Each term on the right-hand side of Eq. ( 17) is given by the probability that the subgraph induced by (i, j) is in a certain configuration, conditional to (i, j) being in state (S i , E j ), times a term which, in turn, is given by the product of the probability that node i gets infected and the probability that node j becomes infectious over the next time step.The different terms in Eq. ( 17) take into account all the possible different states of the neighbourhood of link (i, j).
Similarly to what observed in the individual-based model, given the current form of the transition probability, also Eqs. ( 15) are exact but not closed.This time, however, we close the system equations at the level of links by assuming statistical independence in the states of node pairs.In other words, to close the system we need to find an approximating function F that allows to write the higher-order joint probability as (18) At variance with the individual-based model, for which the expression of F is straightforward, several moment closures exist for the pairwise model, whose quality depends on the topology of the underlying contact network [60].Here we consider a closure which is exact under 4. Graphical representation of the subgraph induced by a link (i, j).Note that we are considering a configuration where each of the two nodes has k neighbors and there are no triangular loops, so that L = 2k − 2. Here, node i is in state S, node j is in state E, nA (mA) indicates the number of neighbors of i (of j) in state A, and, similarly, nI (mI ) those in state I.
the assumption that the network contains no cycles of any order.Though this assumption does not hold in real finite systems, the closure above provides a valuable approximation for the analysis of dynamic processes in large sparse networks (such as sparse random regular graphs and Erdös-Renyi random graphs), in which the number of cycles is negligibly small.Taking into account these considerations, the subgraph G ij = (V ij , E ij ) induced by (i, j) is in the form shown in Fig. 4, and, according to Ref. [26], the following approximating function F can be adopted Note that the numerator consists in the product of the state probabilities of each link in E ij .Note also that S i t and E j t are the marginals probabilities evaluated from Eq. ( 5) and that the closure in Eq. ( 19) is consistent with e marginal probabilities it is constructed from, namely The joint probabilities in Eq. ( 17) can then be rewritten using the expression of the approximating function in Eq. (19).We can finally introduce the homogeneous mixing hypothesis to formulate the master equations at the level of population, which permits to drop the node indices in the probability terms.Assuming each node to be connected to k neighbors, after some manipulation=s (detailed in Appendix B), we can write the transition probability Π SE→EA as This can be further simplified, so to write an expression which is similar to the one obtained for the individualbased model, i.e.Eq. ( 13).We have Note that as the transition of an individual from state E to state A is independent on the state of the neighbors, the transition probability does not depend on the state probabilities EX but on α EA only.
The other transition terms appearing in Eq. ( 15) are in the form of Eq. ( 16), if linear, or of Eq. ( 22), if nonlinear, and are derived following similar arguments.They are fully reported in Appendix C. Eq. ( 15) along with Eq. ( 16) and Eqs.(C1)-(C10) represent a closed set of equations for the pairwise population-level SEAIR model.
We conclude the Section, as done for the individualbased model, here we report the expression of the basic reproduction number for the pairwise SEAIR model, evaluated through the NGM approach (see Appendix D).In this case we have

IV. REPRODUCING THE EVOLUTION OF AN EPIDEMIC
The individual-based and the pairwise model provide an approximate description of a disease spreading over a network.Our goal is now to study to which extent, and under which conditions, these approximate descriptions can capture the time evolution of an epidemic outbreak.We will do this by performing different types of test.In this section we will first concentrate on the simpler case in which the parameters that regulate the disease spreading are known, while in the next section we will consider the case in which such parameters need to be extracted from the data.
Under the assumption that the model parameters are given, we follow the time course of the disease using the different models, starting from the same initial conditions, and compare their capacity in predicting different aspects of a disease outbreak, such as the time of the epidemic peak, its height, and the final number of infected individuals.To illustrate the results of our analysis, we focus on two different network topologies, namely a random r-regular graph and an Erdös-Renyi (ER) graph [14].We start with random regular graphs that, for their characteristics, are the network topologies that best match the hypothesis under the derivation of the closure (19).Indeed, in a random r-regular graph each node has the same degree, equal to r, and the degree distribution is a Kronecker delta, i.e., P (k) = δ r,k .In addition, the expected number of triangles is asymptotically equal to (r − 1) 3 /6 [61], therefore, their effect is negligible in large networks.
To simulate the epidemic process on a network, we consider, for each time step, the kinetic equations at the level of single nodes.At each iteration we inspect all the nodes in states E, A and I.For each of them, we run a Bernoulli process to determine whether the node transits to another state or not.Additionally, for each of the nodes in states A or I we consider each of their susceptible neighbors to determine if it gets infected or not, with probability β A or β I , respectively.These stochastic simulations are performed on networks with N = 500 nodes and r = 5.The initialization of the stochastic process is done by uniformly sampling 2% of the nodes and assuming them to be infected.Half of these infectious nodes were initialized as asymptomatic infectious individuals, while the other half as symptomatic ones.All the remaining nodes were set in the susceptible state.For each case study investigated, a number M of runs of the stochastic simulations is considered, each time reinitializing the network by randomly choosing the nodes to set in the A or I state.
As an example, the evolution of the disease obtained for a given set of parameters, namely β I = 0.4, β A = 0.6, α EA = 0.3, α AI = 0.2, µ A = 0.15, µ I = 0.3, S(0) = 99.98,E(0) = 0, A(0) = 0.01, I(0) = 0.01, R(0) = 0, is shown in Fig. 5.Each of the five dynamical variables of the SEAIR model, namely the percentage of nodes respectively in the S, E, A, I and R state, are reported.The results of M = 1000 stochastic simulations are represented as grey lines, while the average is plotted as a dashed black line.The evolution predicted by the individual-and the pair-based model are also reported as orange and blue solid lines, respectively, with the blue solid line, which appears to be almost superimposed to the dashed black line.We observe that the individualbased model substantially overestimates the number of infections, whereas the dynamics predicted by the pairwise model well reproduces the average behavior of the stochastic simulations.These findings are in agreement with the results obtained for the SIR model in Ref. [26], supporting the conclusion that pairwise models are very good approximations of the dynamics of an epidemic on random regular graphs independently from the specific type of epidemic process considered.
In order to show how the two models perform on a network that deviates from the assumptions underlying the derivation of the pairwise model, we here consider ER random graphs.In this case, the degree distribution is notably binomial peaked at k = N p, where p is the probability that a link between two nodes exists.Notice that, while for random r-regular graphs the degree distribution is a Kronecker delta, for ER graphs the variance on the node degrees is nonzero, i.e., σ 2 = N p(1−p).Similarly to the case of random regular graphs, triangles can also be neglected in ER random graphs, as their expected number is asymptotically equal to k 3 /6 [14].Again, we fix the set of parameters (e.g. the same as in the previous example) and we compare the time evolution of the epidemics obtained for the individual-based and pairwise models to the results of M = 1000 runs of the stochastic process on ER random networks with N = 500 nodes and p = 0.01.Fig. 6 shows that, despite being less accurate than in the previous example, the dynamics of the pairwise model is still in good agreement with the average behavior of the stochastic simulations, whereas the individual-based model fails to reproduce several features of the time evolution of the epidemic outbreak.Ideed for both network topologies considered above, the comparison between the models and the stochastic simulations shows that the individual-based model overestimates the total number of carriers of the disease, and underestimates the peak time and the duration of the epidemic process.This is consistent with the results of a more systematic analysis that we have carried out by varying the model parameters.
Figs. 7 and 8 illustrate some of the results of this analysis for the case of random r-regular graphs, where, in particular, we have varied the infection probability for the asymptomatic individuals in the range β A ∈ [0, 1] for different values of β I , and monitored several macroscopic quantities of interest, such as the final size of the recovered population, and the value and time of the peak in the number of symptomatic infected individuals.Note that, in this case, we have considered a larger network, i.e.N = 2000, and a smaller fraction of initial infected individuals, i.e. 10 −3 , so to better analyze the behavior of the system at an early stage of the epidemic outbreak.
Let us first study the behavior of the final size of recovered population, i.e.R ∞ (Fig. 7).Coherently with the previous results, as the individual-based model overestimates the number of infections occurring at each time step, the final size of recovered individuals is close to one even for relatively small values of β A (see for instance the case in which β I = 0.9), failing to predict the exact value R ∞ .At the same time, the pairwise model provides a good prediction of R ∞ for different values of β I and in the entire range of β A considered.Similarly, concerning the size and the time of the peak of symptomatic individuals, (Fig. 8), we observe that pairwise model is able to give better predictions compared to the individual-based one.Again, the individual-based model overestimates the size of the peak and anticipates its actual time, while the pairwise model is in good agreement with the numerical simulations.The differences between the peak time observed in the stochastic simulations and that predicted by the pairwise model around β A ≈ 0.32 are mainly due the phenomenon of stochastic fade-out [62] that is more pronounced for R 0 slightly greater than one.
A further remarkable aspect of the pairwise model is that it provides a more precise estimation of the critical values of β A and β I at which the epidemic outbreak occurs.In both figures and for both deterministic models, we also display the value of β A for which, given the value of the other parameters, the basic reproduction number R 0 goes to zero.We observe that the pairwise model seems to correctly predict the threshold value of β A , for different β I .Contrarily, the individual-based model predicts smaller threshold values, meaning that there is a range of β A for which the model wrongly forecasts the onset of an epidemic outbreak.In particular, as shown in Fig. 7, for β I > 0.4 the individual-based model does not admit a stable disease-free equilibrium, i.e. the infection always spreads for any value of β A , while the pairwise model correctly predicts the epidemic threshold.
Altogether, the results presented in this section clearly show that taking into account the correlations in the network of contacts is a fundamental step for an accurate description of the spreading of an epidemic in a population.Indeed the pairwise model outperforms the individualbased one, especially in reproducing the real evolution of the disease.Still, it is worth noting that the analysis presented above has been carried out under the ideal condition that all parameters and variables are known.In practice, however, this can be an oversimplifying and unrealistic hypothesis.For instance, some epidemiological parameters, such as the infection probabilities, may be hard to measure directly, and thus need to be estimated from empirical data on the disease spreading.Furthermore, some of the system variables can be unmeasurable.For example, it may not be possible to trace infected individuals during the incubation period, as the carriers themselves may be unaware of having been exposed to the disease.These considerations motivate the analysis that will be presented in the next section, where we will compare the deterministic models adopting a different perspective.Instead of assessing the discrepancies in the dynamics given the epidemiological parameters, we will fit the models to data from numerical simulations and analyze the differences in the predictions of both the parameters of the model and the evolution of the state variables.

V. DEALING WITH INCOMPLETE INFORMATION
In this section we concentrate on the case in which the epidemiological parameters are not known a priori, but need to be extracted from data.We will consider three separate cases in increasing order of complexity.We first start with the hypothesis that all the variables are known and they are so in the whole time interval considered.Then, we examine the case in which all the variables are known, but only up to a certain time.Finally, we assume that only a subset of the system variables is effectively available for fitting.For each of the three cases, we consider the problem of determining the parameters by fitting the available data and, whenever appropriate, that of predicting the temporal evolution of the system variables.The main focus would be again to compare the results of the individual-based and of the pairwise model.

A. Identifying the epidemiological parameters
We first study the problem of evaluating the epidemiological parameters when all the state variables are known for the entire course of the outbreak.As epidemic spreading models are often applied to the study of novel infectious diseases, they can be crucial for the determination of the more elusive epidemiological parameters, such as the infection probabilities or the incubation period, for which direct measurements can be difficult to perform.For this reason it is fundamental to assess the reliability of the estimation of the epidemiological parameters when using a model to fit the data.Here, we perform a series of numerical experiments aimed at comparing the individual-based and the pairwise deterministic SEAIR model in performing this task.We proceed in the following way.We fix a set of epidemiological parameters p = (β A , β I , α EA , α AI , µ A , µ I ), which are assumed to be unknown to the deterministic models and need to be determined through the fit (note that, instead, the initial conditions are assumed to be known quantities).The synthetic data are obtained by running a series of microscopic simulations on a given contact network and averaging the results over M different realizations of the stochastic process.The time-series obtained in this way plays the role of the empirical data that the deterministic models have to reproduce at their best.Note that, since real data usually consists in the daily/weekly number of individuals in a given state [57], as synthetic time-series we consider the temporal evolution of the fractions of individuals in each compartment, which we will denote with X t .Therefore, to determine the epidemiological parameters, we have to fit the corresponding quantities X t to the synthetic data X t .It is important to remark that for the individual-based model these correspond to the state variables of the system, whereas for the pairwise model they are function of the state variables (the pair probabilities), through relation (5).For each of the two models we perform an optimization procedure to derive the set of parameters p that yield the smallest fitting error.The model parameters are estimated by minimizing the root mean squared error (RMSE) between the time-series produced by the deterministic model, generically represented as X t , and the corre- sponding average X t of the microscopic simulations.We denote with Ω the set of compartments to which we fit the models, which in general can differ from Ω.In this first example, we will consider Ω = Ω, so that |Ω | = 5.The length of the simulation, T , is chosen such that the system has reached a stationary state, i.e. the epidemic outbreak has ended, as every infected individual has eventually recovered.Finally, as a measure of the discrepancy between the model estimation and the actual parameters we evaluate the quantity D(p) = p− p / √ 6, where v denotes the Euclidean norm and 6 is the number of components in vector p [63].
An example of the results is shown in Fig. 9, where the model predictions are reported together with the numerical simulations in the case of random r-regular graphs with N = 500 nodes and r = 3 and of the set of parameters listed in the figure caption.Averages over M = 1000 runs of the stochastic microscopic simulation have been considered.The system is initialized by picking a number of randomly selected nodes corresponding to the 2% of the total population and setting their initial state, either as asymptomatic (A) or infected (I) with equal probability.Fig. 9 shows that the dynamical evolution of the state variables generated by both the individual-based and the pairwise model are in good agreement with the stochastic simulation, with the pairwise model still providing better results.Indeed, for the individual-based model we  Hence, both the fitting error E fit and the discrepancy in the estimation of the parameters D(p) obtained by adopting the pairwise model are about two orders of magnitude smaller than the corresponding ones deduced by the individual-based model, so we conclude that the pairwise approximation provides a more reliable "prediction" of the epidemiological parameters.
These results are consistent with the conclusions of the previous section.In fact, when the deterministic models are informed of the epidemiological parameters, then the dynamics of the pairwise model closely matches the microscopic simulations, whereas the individual-based model largely overestimates the number of infections occurring at each time step.In the case considered in this section in which the epidemiological parameters are assumed to be unknown, both models are able to generate time-series that are close to the real time evolution of the epidemics.But in order to do so, the individual-based model has to use a set of parameters that significantly differs from that of the stochastic simulations.Still, the individual-based model has a higher fitting error.In the pairwise case, instead, when the parameters are known, the model behavior is close to that of the microscopic simulations, such that, when the parameters are obtained through fitting, their estimates slightly differ from those used to generate the data.
Fig. 9 provides an illustrative example obtained for a single set of the epidemiological parameters.As the error in the parameter estimation does depend, in general, on the vector p itself, we have repeated the previous analysis on the same contact network represented by a random r-regular graph with N = 500 and r = 3, but considering different sets of epidemiological parameters.In particular, we have systematically varied two parameters of the system, β A and β I , which are the ones ruling the disease transmission.For each pair (β A , β I ) we have calculated the average over M = 1000 runs of the stochastic epidemic process, applied the fitting procedure for the two deterministic models, and computed the error D(p).Overall, the previous results show that the pairwise approximation provides a more reliable estimation in the entire space of parameters.This suggests that the use of a pairwise model is preferable even when the epidemiological parameters of an infectious diseases are not known.

B. Forecasting the epidemic evolution
So far, we have seen how to extract the epidemiological parameters when data on the entire course of the epidemic spreading are available.However, mathematical models are also useful to forecast the evolution of a disease spreading [7,28,64].For instance, they play a crucial role in policy making, as their predictive power allows to estimate in advance the possible effects of different containment measures.On the grounds of this, here we compare the capability of the individual-based and of the pairwise models in forecasting the progress of an epidemic when empirical data is only available for a limited time interval.Similarly to the previous case, we assume all the epidemiological parameters p = (β A , β I , α EA , α AI , µ A , µ I ) are unknown quantities to be determined by fitting the deterministic models to the numerical simulations.This time, however, instead of considering the time-series S t , E t , A t , I t , and R t , over the entire time range T = 55, we fit the models over a limited time interval [0, T fit ].Then, we compare how the SEAIR models predict the evolution of the state variables in the remaining time interval [T fit , T ].Again, the fit is performed by minimizing the mean squared error (RMSE) between the predicted and the simulated time-series, where the average of M = 1000 microscopic simulations on a random r-regular graph with r = 3 and N = 500 nodes is used as the ground truth.We initialize the process by setting 1% of the network nodes in the A state, and another 1% of them in the I state.
As a first example, we fit the SEAIR models up to T f in = 20, and illustrate in Fig. 11 the evolution of the state variables.Both the individual-based and the pairwise model reproduce quite well the temporal evolution of S t , E t , A t and R t .However, as concerns the fraction symptomatic infectious I t , the individual-based model forecasting appears less reliable, while the pairwise model correctly predicts the time course of the variable.Moreover, if we consider the value of D(p), we can see that, in order to reproduce the system dynamics, the individual-based model has to rely on a set of epidemiological parameters that are substantially different from the one used to run the stochastic simulations.On the contrary, the pairwise model remains able to predict the correct set of parameters.Indeed, for the individualbased model we find D ind (p) = 0.20, while we obtain D pair (p) = 2.4 • 10 −3 for the pairwise model.
In the previous example, we have compared the predictions of the deterministic models to the numerical simulations of the epidemic process by taking a single value of T fit only.However, as the performance of the models may depend on the particular value of T fit considered, a more systematic analysis is required.We do this by varying the value of the fitting time interval in the range T fit ∈ [2,50], and monitoring the prediction error on the state variables in the time interval [T fit , T = 55].As a measure of the model performance we consider the mean squared error between the predicted and the simulated  time-series relative to all the state probabilities, namely Note that, since we consider a mean squared error, the value of E pred should not depend on the length of the time interval itself.
Fig. 12 shows the results.Two aspects are worth remarking.First, the value of E pred relative to the pairwise model is generally lower compared to that of the individual-based model, meaning that the former provides more reliable predictions on the temporal evolu-tion on the state variables.Second, the performance of the individual-based model is heavily influenced by the value of T fit , with the prediction error that decreases as the fitting time interval increases.Instead, the pairwise model provides a good prediction on the system dynamics even when it has at disposal only a limited amount of data, as one can see from the value of E pred , which remains stable in the entire range of values of T fit .
In conclusion the pairwise model provides a more reliable forecasting of the temporal evolution of the state variables than the individual-based model.Furthermore, in those cases where the performances of the two models are comparable, the pairwise model yields a better prediction of the epidemiological parameters, consistently with the results obtained in the previous subsection.

C. Estimating the unmeasurable variables
Finally, we work under the worst condition (among those investigated), in which some of the dynamical variables are not measurable.At variance with the previous case study, where we have evaluated the epidemiological parameters assuming to be able to measure each compartment X ∈ Ω, here we suppose to be able to measure only a subset of them, fitting the deterministic models to the corresponding time series only.Understanding how the models perform under these circumstances is of crucial practical relevance in the context of outbreaks such as that of COVID-19, where the disease is also spread by asymptomatic carriers that are difficult to trace.
In the analysis of the SEAIR models, we assume to be able to detect those individuals who are either symptomatic infectious (I) or recovered (R), considering unmeasurable the other states.We can consider two cases, which represent different degrees of data availability.In the first case, since empirical data are usually provided at an individual level, we assume the fractions of symptomatic and recovered individuals (I t , and R t ) to be available data, while the fractions of susceptible, exposed and asymptomatic individuals (S t , E t , and A t ) are unknown.Notice that we are here assuming to be able to trace the asymptomatic individuals once they recover.This corresponds to the ideal case in which serological tests are systematically performed on the population, allowing to detect those individuals who have recovered without having been diagnosed with the disease.Accordingly, to evaluate the epidemiological parameters, we fit the variables I t and R t to the corresponding time series.As discussed in Section V A, there is a fundamental difference in the way in which the individualbased and the pairwise models are fitted.Indeed, while in the individual-based model the probabilities X t are the state variables, in the pairwise model these quantities are derived from the pair probabilities through relation (5).This means that, at variance with the individualbased model, for the pairwise model a function of the state variables is used in the data fitting.Comparison between stochastic simulations, individual-based model (orange solid line) and pair-based model (blue solid line) when the measurable state variables are fitted over the entire time course of the epidemic process (i.e.T fit = T ).The stochastic simulations are run on a random r-regular graph, with the same settings of Fig. 5, and their results are represented as solid gray lines, while the average is plotted as a dashed black line.
In the second case, in addition to I t and R t , we assume to have a larger set of measured time series, including, in particular, some of the pair variables.Indeed, when we have information on the contact network structure, if we are able to detect those individuals that are either in state I or in state R, we can also measure the pairs formed by individuals in one of these two states.To account for this scenario, we have considered the fraction of pairs composed either by symptomatic or recovered individuals as measurable variables, i.e.II t , IR t , and RR t , while the remaining fractions are considered unknown (as implicitly assumed before).Under these hypotheses, for the pairwise model these data do not have to be fitted as a function of the state variables, since they can be directly associated to the corresponding (measurable) pair variables II t , IR t , and RR t .Summing up, in this second case, we fit the individual-based model to I t and R t , and the pairwise model to I t , R t , II t , IR t and RR t .
We begin our analysis from this latter case, which relies on the assumption that the contact network structure is known.Then we discuss the implications of fitting the pairwise model when such data are not available.
The dynamics of the SEAIR models is ultimately determined by two sets of quantities, namely the epidemiological parameters p and the initial condition of the state variables.Since, in the previous sections, we have assumed all the dynamical variables to be known at any Comparison between stochastic simulations, individual-based model (orange solid line) and pair-based model (blue solid line) when the measurable state variables are fitted up to T fit = 20 (red dotted line).The stochastic simulations are run on a random r-regular graph, with the same settings of Fig. 5, and their results are represented as solid gray lines, while the average is plotted as a dashed black line.
time, we have considered the initial conditions as known quantities, fitting the SEAIR models to empirical data only to evaluate p. Now, as some of the variables are unknown, also their initial conditions need to be estimated from the fit.It is worth noting that the pairwise model has a larger number of unknown initial conditions compared to the individual-based model.This means that, at variance with the cases in Sections V A and V B, where the number of unknown parameters was equal for both models, in the present scenario one needs to determine a larger set of parameters for the pairwise model.
We start by considering Fig. 13 and Fig. 14, in which we compare the prediction of the individual-based and of the pairwise model to the stochastic simulations, for two different values of T fit .As in the previous case, the numerical data are generated performing M = 1000 microscopic simulations on a random r-regular graph with N = 500 and r = 3, and taking the average dynamics as the ground truth.First, we discuss the case T fit = T , that represents an important case study occurring in the analysis of past epidemic outbreaks.In this scenario, we assume we know the dynamics of the measurable variables I t and R t over the entire time course of the epidemic, while the other variables remain unknown.In other words, we here analyze the capability of the models to determine the temporal evolution of the unmeasurable variables once the epidemic outbreak is over.As shown in Fig. 13, the pairwise model is in a much better agreement with the stochastic simulations compared to the individual-based one.
Conversely, to illustrate an example in which data is only available for a limited time interval, we consider the case T fit = 20.The results for this case study are displayed in Fig. 14.First, we note that, compared to the individual-based model, the pairwise model better forecasts the temporal evolution of the measurable variables I t and R t .Second, the pairwise model provides a good prediction of the time course of the unknown variables.In particular, we remark that the pairwise model prediction of both A t and S t is in good agreement with the numerical simulations.As regards the probability of being exposed E t , though it overestimates the average behavior of the stochastic simulations E t , we note that time series predicted by the pairwise model represents a significant improvement compared to the prediction of the individual-based model.Overall, the pairwise model provides a reliable prediction of the dynamics of both measured and unmeasured variables, outperforming the individual-based model.
Then, we study the predictive capability of the deterministic models as function of T fit .Fig. 15 displays, for both the individual-based and the pairwise models, the prediction error E unm on the unmeasured individual-level time series, namely where Θ = Ω \ Ω = {S, E, A} represents the set of unmeasured compartments.Note that, since we are interested in the prediction over the entire course of the epidemic, we considered the range [0, T ].Consistently with the previous results, we note that, for different values of T fit , the pairwise model generally provides a lower value of E unm compared to individualbased model, meaning that the former furnishes a more reliable prediction of the temporal evolution of the unknown variables.
The results discussed so far indicate that the pairwise model outperforms the individual-based model in pre- dicting the dynamics of the fractions of individuals in the unmeasured compartments.However, in order to obtain a more reliable prediction on quantities that are difficult or even impossible to measure, e.g. the asymptomatic infectious population, it is necessary to gather more and different data about the quantities that are accessible.In particular, here we assumed we are able to measure not only the fraction of individuals that are symptomatic infectious or recovered, but also the fraction of couples formed by individuals in one of those states.
To conclude, we discuss how important these data are for the prediction, by showing what happens when they are not available.Fig. 16 displays the prediction of the fraction of asymptomatic infectious A t (which we assumed to be unmeasurable) by the individual-based and pairwise model.While the individual-based model is fitted only to I t and R t , the pairwise model is fitted either using or not the fractions of pairs IR t , IR t and RR t .As it can be noticed, when only the individual probabilities are fitted, the pairwise model is not able to provide a good prediction for A t , performing as badly as the individual-based model.
Overall, the results of this section suggest that using a model that accounts for the dynamical correlations existing within the contact network can ensure a more accurate evaluation of the unknown quantities of the system.However, this approach demands to gather more refined data on how the disease spreads through the network itself, using, for instance, more detailed contact tracing techniques [65].

VI. CONCLUSIONS
In this work, we have developed two discrete-time population-level SEAIR models, providing a coarse grained description of the spreading of an infectious disease through a network of contacts.First, under the hypothesis of statistical independence at the level of nodes, we have derived the master equations for an individualbased model.Second, we have considered a more complex pairwise approximation, describing the system at the level of node pairs.This allowed us to account for the dynamical correlations existing within the contact network.
Assuming we know the epidemiological parameters characterizing the disease spreading, we have compared the deterministic models to numerical simulations of the epidemic dynamics.We have analyzed to which degree the predictions of the SEAIR models agreed with the results of stochastic simulations carried out over two different network topologies, namely the random r-regular graph and the Erdös-Renyi graph.Consistently with previous research [26], we found that the pairwise model is able to reproduce, for both topologies, the temporal evolution of the state variables, whereas the individual-based model overestimates the number of new infections occurring at each time step.This result was consistently observed when important parameters of the model, such as the symptomatic and asymptomatic transmission probabilities, were varied.
We have then considered the more realistic case in which the epidemiological parameters are unknown and need to be estimated by fitting the SEAIR models to empirical data.We have examined three case studies of increasing complexity.First, we have assumed to know the time-course of all the state variables for the entire duration of the epidemic outbreak, analyzing the capability of the deterministic models to estimate the epidemiological parameters.Compared to the individual-based model, the pairwise model performed better, as the error on the parameters resulted generally lower and independent on their specific value chosen for the numerical simulations.This aspect is crucial in those practical situations where one needs to know the exact value of the parameters, for instance to evaluate the possible effects of issuing/lifting a containment policy, which can be assumed to have an impact only on specific parameters (for instance, the obligation to wear a face mask likely affects only the transmission probabilities β A and β I ).
Second, we have assumed to know the dynamics of the state variables for a limited time interval only, analyzing the capability of both the individual and the pairwise model to forecast the evolution of the pandemic.Our results show that the individual-based model is not able to correctly forecast the dynamics of all the state variables, while the prediction of the pairwise model is in good agreement with the numerical simulations.Furthermore, the pairwise model still provides a better estimate of the epidemiological parameters even under these conditions.This higher reliability of the pairwise approach can play a fundamental role in the policy-making process, where accurately forecasting the evolution of the epidemic is crucial.
Third, we considered a more realistic scenario, assuming we are able to measure only a subset of the model compartments.In particular, we have assumed to know the fractions of symptomatic infected and recovered nodes, i.e.I t and R t .Additionally, we have considered to be able to measure the pairs in which the nodes are either in the infectious symptomatic or in the recovered states, i.e.II t , IR t and RR t .With this information available, the fractions of individuals in the unmeasured compartments, i.e. S t , E t and A t , as predicted by the pairwise model were found in good agreement with the numerical simulations.On the contrary, the individual-based model was not able to estimate their temporal evolution.Since the asymptomatic individuals can constitute a public-health risk, as they are able to spread the virus without knowing it, and since they can be difficult to trace, assessing their number within the population becomes crucial, and the pairwise approach can provide an important instrument to cope with their presence.
Overall, the results presented in this paper show that the pairwise modeling paradigm is a reliable tool for estimating the epidemiological parameters and forecasting the evolution of the epidemics.In particular, we showed that a pairwise approach, altogether with additional information regarding the contacts of the infectious individuals, provides a much better prediction of the unmeasurable variables of the system compared to an individual-based one.Moreover, such a modeling approach allows one to evaluate the size of the asymptomatic population, which, in the context of the COVID-19 pandemic, represents a critical issue.To fulfill this objective, the pairwise model requires to refine the data collection procedures so as to include information on the network structure.For this reason, extending the use of contact tracing techniques can play a crucial role in informing mathematical modeling, thus leading to more reliable predictions on the course of epidemics.
Finally, we note that the generality of the approach here discussed paves the way to applications to other classes of epidemic models (possibly including other compartments or an explicit dependency of the contact networks on time), which are possible directions for future work.can be chosen among the p infected neighbors.More compactly, we can write Given that U t = 1 − A t − I t , it is worth noting that the term k p p n A n t I p−n t U k−p t corresponds to the probability mass function of the multinomial distribution (see [66], Chapter 6).With this in mind, we can manipulate Eq. (A4) so to have ) where we have defined Note also that we added to the summation the term relative to p = 0, since this is equal to zero, and that we have considered that summing the probability mass function over all the possible values of p and n gives one.By recalling the multinomial theorem we can simplify Eq. (A5), obtaining Finally, by substituting , we obtain for Π S→E the expression in Eq. ( 13).
APPENDIX B: DERIVATION OF EQ. ( 22) Here we derive the expression for the nonlinear transition probability Π SE→EA of the pairwise SEAIR model, i.e.Eq. (22).We begin by considering Eq. ( 17), introducing the assumption that no triangular loops exist within the network and that the population is homogeneously mixed.In this case, all nodes have the same number of neighbors, k, so that the number of neighbors of a pair of nodes is L = 2k − 2. A graphical representation of the subgraph induced by the pair in state (S, E) under these hypotheses is provided in Fig. 4. Eq. ( 17) can written as Each term SE . . .Z t counts L + 2 = 2k elements.The first two, i.e. S and E, represent the state of the pair of nodes inducing the subgraph, the following k − 1 indicate the state of the neighborhood of the first node (the one in state S), while the remaining k −1 denote the state of the neighborhood of the second node (in state E).Eq. (B1) can be simplified taking into account the property of symmetry of the joint probabilities.Indeed, since under the homogeneous mixing hypothesis the nodes of the neighborhood are statistically equivalent to one another, the probability terms only depend on the number of infected nodes in each neighborhood, while they do not depend on which particular node is infected.For example, when only one symptomatic infectious node is present in the subgraph, assuming it to be connected to the node in state S, we have k − 1 equivalent terms, namely SEIU U . . .U t = SEU IU . . .U t = . . .
Therefore, we can rewrite Eq. (B1) as ) We note that each term is characterized by the product of four binomial factors, i.e. k−1 p p n k−1 q q m , representing the number of possible combinations of the neighboring nodes.The first two binomial coefficients are relative to the neighborhood of the node in state S, while the others concern the neighbors of the node in state E. For each of them, the first binomial factor indicates the number of ways p (or q) infected nodes can be chosen among the k − 1 neighbors of a node, while the second corresponds to the number of ways n (or m) asymptomatic infectious nodes can be chosen among the p (or q) infected neighbors.
As explained in Section III B, given the expression in Eq. (B1), the system (15) is not closed.To close it at the level of pairs, we consider the approximation in Eq. ( 19), which can be also rewritten at a population-level by dropping the node indices.With reference to the configuration of Fig. 4, we can approximate the probability that the node S is connected to n I neighbors in state I, to n A neighbors in state A, and to k − 1 − n I − n A neighbors in state U , while the node in state E is connected to m I neighbors in state I, to m A neighbors in state A, and to k − 1 − m I − m A neighbors in state U , as follows By substituting the closure (B4) in Eq. (B3), we obtain the expression for the transition probability in Eq. ( 21).As the transition of an individual from state E to state A is independent on the state of the neighbors, we expect the transition probability Π SE→EA to be independent of the state probabilities EX .Consistently, we note that the second double summation term, which accounts for all the possible configurations of the neighborhood of the node in state E, sums to one.To show this, we first define the quantities Note that, despite it is not explicitly indicated, the terms ε A , ε I and ε U clearly depend on time.Each term of the second double summation in Eq. ( 21) can now be rewritten as which corresponds to the probability mass function associated to the multinomial distribution.Since we sum over all the possible values of q and m, the second summation in Eq. ( 21), as we expect, is equal to one.We can thus rewrite the equation as This expression can be further simplified.First, similarly to what we have done for the state probabilities EX , we introduce the notation Then, we can manipulate Eq. (B8) so to have where we have defined x 1 = σ U , x 2 = σ A (1 − β A ) and x 3 = σ I (1 − β I ).Note also that we added to the summation the term relative to p = 0, since this is equal to zero.By using the multinomial theorem (A6) of power k − 1, we can greatly simplify the expression of the transition probability as Finally, by substituting , and by returning to the usual notation for the state probabilities, we obtain for Π SE→EA the expression in Eq. (22).Here we present the complete list of the transition probability terms of the SEAIR pairwise model at the population-level.These terms can be derived in a way analogous to Eq. ( 17), following the algebraic passages detailed in Appendix B. We begin by considering the nonlinear terms, and in particular the probability that a link in state (S, S) transits to another state.i.e. (E, E) or (S, E).Let us first write an expression for the term Π SS→EE in the form of Eq. ( 21).We have where we note that the two double summation are equal.
Similarly to what we have done for the expression of Π SE→EA , we can rewrite the previous equation as ) where the second power comes from the fact that both nodes undergo the same transition, i.e. S → E. The term in the square bracket represents the probability that a node in state S is infected by at least one of its neighbors, which can be either in state A or in state I. Therefore, to write the expression of Π SS→SE , we have to consider the probability of the complementary event, namely the probability that none of the infected neighbors transmits the disease to the susceptible node.We thus have (C3) Then, we account for all the possible transitions of links in state (S, E).Starting from the expression for Π SE→EA in Eq. ( 22), we can write the remaining terms by considering the probability of the complementary events, obtaining We now consider the possible transitions from state (S, A).Let us take into account the probability term Π SA→EI and let us come back for a moment to the expression with the double summation.This reads where we have already written the probability that the node in state A transits to state I as α AI (see Appendix B for more details).It is worth pointing out that, differently from the previous cases, the previous expression has a term (1 − β A ) raised to (n + 1)-th power, which comes from the fact that, as we are considering the links in state (S, A), the node in state S will always have an infectious neighbor.Following the same algebraic steps described in Appendix B, it is easy to verify that this expression can be simplified as (C6) Once again, by considering the probability of the complementary events, we can write the remaining transition probabilities for the state (S, A).Note that, since a node in state A can transit to two different states, i.e. to state I with probability α AI and state R with probability µ A , the probability that the node remains in state A is given by (1 − α AI − µ A ).We have (C7) Analogously to the previous case, it is easy to write the transition probability terms from the state (S, I).We have As concerns the state (S, R), as R is an absorbing state, i.e. the node remains in R with probability equal to 1, the only possible transition is determined by the probability that the node in state S transits to state E, namely Finally, we report the linear transition probability terms appearing in Eq. (15).All these terms can be expressed in the form of Eq. ( 16), so that they read:

(C10)
Note that the probability Π AI→IR consists of two terms as there are two possible ways in which a link in state (A, I) can transits to state (I, R).First, the node in state A can transits in state I (the asymptomatic infectious individual develops the symptoms) while the node in state I transits to state R (the symptomatic infectious individual recovers).Second, the node in state A can transits to state R (the asymptomatic infectious individual recovers) while the node in state I remains in it.In other words, a link in state (A, I) can transits either to state (I, R) or to state (R, I), according to different probabilities.Coherently, a link can transit to the state (I, R) coming from two different states, namely (A, I) and (I, A), according to the same probabilities, which justifies the use of a single transition probability term Π AI→IR in Eq. (15).In this Appendix, we show how to derive the expression of R 0 for both the individual-based and the pairwise SEAIR model.To do so, we use the next-generation matrix (NGM) approach [35], developed for discrete-time epidemic models [59].To begin with, we briefly discuss the method, and then we apply it to the deterministic models introduced in our work.
To to calculate the basic reproduction number, rather than the full system of master equations, the subsystem describing the evolution of the infected states may be considered.Here, we follow the terminology used in [35] and indicate with the term 'infected states' those compartments that are infectious (A and I) or have been exposed to the disease (E).As a vanishing fraction of individuals in an infected state indicates that the infection dies out, R 0 can be derived studying the condition under which the disease-free equilibrium (DFE), i.e. the equilibrium at which the fraction of individuals in an infected compartment is zero, becomes unstable.Note that this equilibrium always exists.Hence, in the context of the SEAIR models, instead of considering all the states, we will only focus on the variables involving the E, A and I compartments.Hereafter, we will generally denote with X(t) ∈ R d the vector containing the value of the subsystem dynamical variables at time t.To study the stability of the DFE, one can linearize the infected subsystem around it, writing the corresponding master equations as X(t + 1) = (T + Σ)X(t), (D1) where T is called the transmission matrix, as it accounts for the nonlinear probability terms, i.e. the disease transmission, while Σ is the transition one, which accounts for the linear transitions within the system.From the matrices T and Σ, the so-called next-generation matrix (NGM) [35] can be computed as where 1 d is the identity matrix of size d.The basic reproduction number can be calculated as the spectral radius of this matrix In fact, it is possible to prove that if R 0 < 1 the DFE is asymptotically stable, whereas it is unstable if R 0 > 1 [35,58,59,67].

Individual-based model
Here we show how to construct the NGM for the individual-based SEAIR model.In this case, to calculate the value of R 0 we can focus on the dynamics of E t , A t and I t , which are the variables representing the fraction of infected individuals within the population.Thus, we define X(t) = ( E t , A t , I t ) T .Linearizing around the DFE the equations relative to the infected subsystem in Eqs.(6), we obtain the following transmission and transition matrices: whose spectral radius ρ(K) gives the value of R 0 reported in Eq. ( 14).

Pairwise model
Here, we calculate the NGM for the pairwise SEAIR model.As mentioned above, we have to focus on all variables involving the E, A and I compartments, which in the pairwise model are the pair probabilities EX t , AX t and IX t , with X ∈ Ω.However, the subsystem of interest consists of a set of twelve master equations of Eqs.(15), which can be unfeasible to deal with.To reduce the number of equations, we can do the following consideration.Given the relation on the marginal probabilities (5), when the pair probabilities approach zero, the individual probabilities go to zero as well.In other words, if no link in the network has one infected node at one of its end, then no node in the network will be infected.Therefore, considering Eq. ( 5) and the expression of the transition probabilities, i.e.Eqs. ( 16), (22) and (C1)-(C10), we can rewrite the system (15) as where, for the purpose of simplification, we have used the notation (D8) Note that Eqs.(D7) represent a closed set of equations.As we are interested in analyzing the early stage of the epidemic, we can linearize the infected subsystem, which consists in the equations describing the dynamics of E t , A t , I t , SE t , SA t , SI t , around the DFE, characterized by S t+1 ≈ SS t+1 ≈ 1, while all the other variables approaches zero.We find The last step is to compute the spectral radius ρ((K)) that gives the expression of the basic reproduction num-

FIG. 1 .
FIG. 1. Graphical representation of the SEAIR model considered in this work.The model has five compartments, representing the susceptible (S), exposed (E), asymptomatic infectious (A), symptomatic infectious (I), and recovered/removed (R) state.Allowed transitions from one state to another are indicated by arrows, so that the flow diagram of the model can be represented as a directed graph with an associated adjacency matrix AI (see Section III B).Transition probabilities are governed by the six parameters reported close to the arrows and in Eq. 2. Edges in black represent linear transitions between states, while the edge in red refers to the nonlinear one.

FIG. 2 .
FIG. 2. Graphical representation of a node i with its k neighbors.Here, nA indicates the number of neighbors of i in state A, and, similarly, nI those in state I.

FIG. 3 .
FIG. 3. Graphical representation of the twenty-five different pair states (i.e.states of pairs of node) of the pairwise SEAIR model, and of all the possible transitions from one state to another described by a directed flow graph with an associated adjacency matrix AP .Edges in black represent linear transitions between states, while edges in red refer to nonlinear ones.

FIG. 6 .
FIG. 6.Comparison between stochastic simulations, individual-based model (orange solid line) and pair-based model (blue solid line) for an Erdös-Renyi graph with N = 500 nodes and p = 0.01.The numerical simulations and the model dynamics are evaluated using the same settings of Fig.5.The stochastic simulations are represented as solid gray lines, while the average is plotted as a dashed black line.

9 FIG. 7 .
FIG. 7. Final size of the recovered population R∞ as a function of βA, for different of βI .The numerical simulations are performed on a random r-regular graphs with N = 2000 nodes and r = 3, with αEA = 0.3, αAI = 0.2, µA = 0.4, µI = 0.5.The solid lines represent the value of R∞ as predicted by the individual-based (orange) and by the pairwise model (blue), while the black dashed line represents the average value of R∞ over M = 1000 stochastic simulations.The vertical dashed lines represent the value of βA at which the value of R0 is equal to one for both the individual-based (orange) and pairwise model (blue).

FIG. 8 .
FIG. 8. Size (left) and time (right) of the peak in the number of symptomatic individuals, as a function of βA, for βI = 0.3.The numerical simulations are performed with the same settings of Fig. 7.The solid colored lines refers to the individualbased (orange) and the pairwise model (blue), while the black dashed line represents the stochastic simulations.The vertical dashed lines represent the value of βA at which R0 = 1 for both the individual-based (orange) and pairwise model (blue).

FIG. 9 .
FIG. 9.Comparison between stochastic simulations, individual-based model (orange solid line) and pair-based model (blue solid line) when all the state variables are fitted over the entire course of the epidemic.The stochastic simulations are run on a random r-regular graph, with the same settings of Fig.5.The stochastic simulations are represented as solid gray lines, while the average is plotted as a dashed black line.

FIG. 10 .
FIG. 10.Error D(p) in the estimation of the epidemiological parameters as a function of βA and βI for the individual-based (left) and the pairwise (right) SEAIR models.The deterministic models are fitted to stochastic simulations performed on a r-regular graphs with N = 500 and r = 3, with αEA = 0.3, αAI = 0.2, µA = 0.15 and µI = 0.3.

Fig. 10 displays
D(p) as a function of β A and β I .Two things are worth noticing.Firstly, the estimation error for the pairwise model (on the right) is generally lower compared to that of the individual-based model (on the left).Second, the value of D(p) for the individual-based model depends smoothly on p, whereas, for the pairwise model, such a dependence appears to be strongly affected by random fluctuations, because of its much smaller value.

FIG. 11 .
FIG. 11.Comparison between stochastic simulations, individual-based model (orange solid line) and pair-based model (blue solid line) when all the state variables are fitted up to T fit = 20 (red dotted line).The stochastic simulations are run on a random r-regular graph, with the same settings of Fig.5.The stochastic simulations are represented as solid gray lines, while the average is plotted as a dashed black line.

FIG. 12 .
FIG.12.Value of the prediction error E pred as a function of the fitting range T fit , for both the individual-based (orange) and the pairwise (blue) models.

FIG. 13 .
FIG. 13.Comparison between stochastic simulations, individual-based model (orange solid line) and pair-based model (blue solid line) when the measurable state variables are fitted over the entire time course of the epidemic process (i.e.T fit = T ).The stochastic simulations are run on a random r-regular graph, with the same settings of Fig.5, and their results are represented as solid gray lines, while the average is plotted as a dashed black line.

FIG. 14 .
FIG. 14.Comparison between stochastic simulations, individual-based model (orange solid line) and pair-based model (blue solid line) when the measurable state variables are fitted up to T fit = 20 (red dotted line).The stochastic simulations are run on a random r-regular graph, with the same settings of Fig.5, and their results are represented as solid gray lines, while the average is plotted as a dashed black line.

FIG. 15 .
FIG. 15.Value of the prediction error Eunm as a function of the fitting range T fit , for both the individual-based (orange) and the pairwise (blue) models.

FIG. 16 .
FIG. 16.Comparison between stochastic simulations (dashed black), individual-based model (solid orange) and pairwise model for the prediction of the asymptomatic state of Fig. 14.The pairwise model has been fitted both including (solid blue) and excluding (dashed-dotted blue) the time series IIt, IRt and RRt from the set of measurable variables.Note that the solid blue line is superimposed to the dashed black one.