Network archaeology: phase transition in the recoverability of network history

Network growth processes can be understood as generative models of the structure and history of complex networks. This point of view naturally leads to the problem of network archaeology: Reconstructing all the past states of a network from its structure—a difficult permutation inference problem. In this paper, we introduce a Bayesian formulation of network archaeology, with a generalization of preferential attachment as our generative mechanism. We develop a sequential Monte-Carlo algorithm to evaluate the posterior averages of this model, as well as an efficient heuristic that uncovers a history well-correlated with the true one, in polynomial time. We use these methods to identify and characterize a phase transition in the quality of the reconstructed history, when they are applied to artificial networks generated by the model itself. Despite the existence of a no-recovery phase, we find that non-trivial inference is possible in a large portion of the parameter space as well as on empirical data.

Unequal distributions of resources are ubiquitous in the natural and social world [1]. While inequalities abound in many contexts, their impact is particularly dramatic in complex networks, whose structure is heavily constrained in the presence of skewed distributions. For instance, the aggregation of edges around a few hubs determines the outcome of diseases spreading in a population [2], the robustness of technological systems to targeted attacks and random failures [3], or the spectral properties of many networks [4]. It is therefore not surprising that much effort has been devoted to understanding how skewed distributions come about in networks. Many of the satisfactory explanations thus far uncovered have taken the form of constrained growth processes: the rich-get-richer principle [5], sampling space reduction processes [6] and latent fitness models [7] are all examples of growth processes that lead to a heavy-tailed distribution of the degrees.
A common characteristic shared by these processes is that they do not (nor are they expected to) give a perfect account of reality [8]. Their rules are simple, and only capture the essence of the mechanisms at play, glossing over details [9]. But despite these simplifications, growth processes endure as useful models of real complex systems. At a macroscopic level, their predictions have often been found to fit the statistics of real networks to surprising degrees of accuracy [10]. At a microscopic level, they have been shown to act effectively as generative models of complex networks [11,12], i.e., as stochastic processes that can explain the details of a network's structure [13,14]. This point of view has led, for example, to powerful statistical tests that can help determine how networks grow [15,16].
The notion of growth processes as generative models is now being pushed further than ever before [16]. The burgeoning field of network archaeology [17], in particular, builds upon the idea that growth processes are generative models of the history of complex networks, able to reveal the past states of statically observed networks. This point of view is perhaps most clearly stated in the bioinformatics literature, which seeks to reconstruct ancient protein-protein interaction (PPI) networks to, e.g., improve PPI network alignment algorithms [18,19] or understand how the PPI networks of organisms are shaped by evolution [20]. Indeed, almost all algorithmic solutions to the PPI network archaeology problem are based on explicit models of network growth (variations on the duplication-divergence principle), and take the form of parsimonious inference frameworks [20-22]; greedy local searches informed by models [17,23-25]; or maximum likelihood inference of approximative [26], graphical [19], and Bayesian [27] models of the networks' evolution.
Less obvious is the fact that a second body of work, rooted in information theory and computer science, also makes the statement that growth processes can generate the history of real complex networks. This second strand of literature [28-35] focuses on temporal reconstruction problems on tree-like networks generated by random attachment processes [5,36]. It has led thus far to efficient root-finding algorithms (whose goal is to find the first node) [28-31], and to approximative reconstruction algorithms on trees [32-34]. Applying any of these algorithms to a real network amounts to assuming that growth processes (here, random attachment models) are likely generative models.
The goal of this paper is to investigate classical growth processes as generative models of the histories of networks, from the point of view of Bayesian statistics and hidden Markov processes. This is made possible by recent advances in particle filtering methods and network inference [27,37,38]. Our contribution is threefold. One, we give a latent variable formulation of the network archaeology problem for a generalization of the classical preferential attachment (PA) model [5,39-41]. We derive all the tools necessary to infer history using the model, including a sampling algorithm for its posterior distribution adapted from Ref. [37], optimal estimators of the history, as well as efficient heuristics well correlated with these estimators. Two, we establish the extent to which complete history recovery is possible, and, in doing so, identify a phase transition in the quality of the inferred histories (i.e., we find a phase where recovery is impossible, and a phase where it is achievable in large networks). Three, we demonstrate with numerical experiments that we can extract temporal information from a real, statically observed network, here a phylogenetic tree of the Ebola virus. We conclude by listing a number of important open problems.

FIG. 1 (caption). (a) An artificial network generated by the model (γ = 1, b = 0.9, T = 50; see main text). Since the network is artificial, its true history, i.e., the time of arrival of its edges, is known. The width and color of edges encode this history; older edges are drawn with thick, dark strokes, while younger edges are drawn using thin, light strokes. The age of nodes is encoded in their radius. Our goal is to infer these times of arrival as precisely as possible, using the network structure as our only input. (b) Expected time of arrival, computed with 10^5 samples of the posterior distribution of the model. The correlation of the inferred and real history equals ρ = 0.81 (see Materials and Methods for details).

A. Bayesian network archaeology
A network G generated by a growth process is, by construction, associated with a history X, i.e., a series of events that explains how G evolved from an initial state G_0. We consider the loosely defined goal of reconstructing X, using the structure of G as our only source of information (see Fig. 1). Formally, this is an estimation problem in which the history X is a latent variable, determined by the structure of the network. The relationship between the network and its history is expressed using Bayes' formula as

P(X|G, θ) = P(G|X, θ) P(X|θ) / P(G|θ),   (1)

where we will assume, for the sake of simplicity, that the growth process parameters θ can be estimated reliably and separately from G, and that the conditional posterior P(X|G, θ) captures most of our uncertainty on X.
To correctly define the probabilities appearing in (1), we first separate histories in two categories: Those that are consistent with G, and those that are not. We say that a history is consistent with a network if it has a non-zero probability, however small, of being the true history of the network. The likelihood P(G|X, θ) thus acts as a logical variable that enforces this consistency: It is equal to one if and only if the history X is consistent with the observed network G, and it is equal to zero otherwise. A complete specification of the probabilities appearing in (1) is obtained upon choosing a growth process: This fixes the prior P(X|θ), as well as the evidence P(G|θ), because the latter is a sum of P(X|θ) over all histories consistent with G.
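The indicator likelihood is easy to make concrete. The sketch below (our own illustrative code, not from the paper) checks whether a proposed edge ordering is consistent with G under a process that adds one edge at a time and introduces at most one new node per edge:

```python
def is_consistent(history, edges):
    """Indicator likelihood P(G|X): a history is consistent with G when it
    orders exactly the observed edges and, after the seed edge, every new
    edge touches at least one node that is already present (the growth
    process never joins two unseen nodes)."""
    canon = lambda es: sorted(tuple(sorted(e)) for e in es)
    if canon(history) != canon(edges):
        return False  # the history must be a permutation of G's edges
    seen = set(history[0])
    for u, v in history[1:]:
        if u not in seen and v not in seen:
            return False  # impossible event: edge between two unseen nodes
        seen.update((u, v))
    return True
```

In the notation of Eq. (1), `is_consistent(X, E(G))` is precisely the value of P(G|X, θ), which here does not depend on θ.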
In this latent variable formulation of the network archaeology problem, reconstructing the past amounts to extracting information from G via the posterior distribution P(X|G, θ). Doing so is not as straightforward as it first appears. The posterior distribution may be heavily degenerate, or even uniform over the set of all histories consistent with G [35] (see Supplementary Information). Therefore, a useful and attainable goal cannot be to find the one true history X(G) of G, because this history is often not identifiable. It turns out that another inference task, with a subtly different definition, is both challenging and achievable: That of finding a history as correlated with the ground truth X(G) as possible. We henceforth adopt this maximization as our inference goal.

B. Random attachment model
For the sake of concreteness, we will discuss network archaeology in the context of a specific growth process, although different choices of model can be equally fruitful [27,37]. We use a variant of the classical PA model that incorporates both a non-linear attachment kernel [40] and densification events, i.e., attachment events between existing nodes [39,41-43].
In our model, a new undirected edge is added at each time step, starting from an initial network G_0 comprising a single edge. With probability 1 − b the new edge connects an existing node to a new node, and it connects two existing nodes with complementary probability. Whenever an existing node i is involved at time t + 1, it is chosen randomly with probability proportional to k_i(t)^γ, where k_i(t) is its degree (number of neighbors) at time t, and γ is the exponent of the attachment kernel (see Supplementary Information for an overview of the model).
The parameter b ∈ [0, 1] controls the density, and γ ∈ ℝ controls the strength of the rich-get-richer effect. We refer to these parameters collectively with θ = (γ, b). We recover the classical PA model with (γ = 1, b = 1); the random attachment model with (γ = 0, b = 1) [36]; one of the models of Aiello, Chung and Lu with γ = 1, b ∈ [0, 1] [43,44]; and an undirected version of the Krapivsky-Redner-Leyvraz generalization if γ is free to vary and b = 1 [40]. The model technically generates multigraphs for any b smaller than one, although the proportion of redundant edges and self-loops is numerically found to vanish for all b > 0 in the large network limit when γ < 1. It is thus a reasonable model of multigraphs, but also a good approximation of large sparse networks, with few or no redundant edges and self-loops.
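As a concrete reference, the growth process can be simulated in a few lines. The sketch below is our own minimal implementation of the rules just described (function and variable names are illustrative, not the code used in the paper); it takes b as the probability of attaching to a new node, as defined above.

```python
import random

def grow_network(T, gamma, b, seed=None):
    """Simulate the generalized preferential-attachment model: start from a
    single seed edge (0, 1); at each subsequent step add one edge. With
    probability b the edge joins an existing node to a brand-new node,
    otherwise it joins two existing nodes. Existing endpoints are drawn
    with probability proportional to degree**gamma."""
    rng = random.Random(seed)
    degree = {0: 1, 1: 1}
    edges = [(0, 1)]

    def pick_existing():
        # Roulette-wheel draw proportional to k_i(t)**gamma.
        nodes = list(degree)
        weights = [degree[n] ** gamma for n in nodes]
        r = rng.random() * sum(weights)
        for n, w in zip(nodes, weights):
            r -= w
            if r <= 0:
                return n
        return nodes[-1]

    for _ in range(T - 1):
        if rng.random() < b:                    # new-node event
            u, v = pick_existing(), len(degree)  # v is the new node
            degree[v] = 0
        else:                                   # densification event
            u, v = pick_existing(), pick_existing()
        degree[u] += 1
        degree[v] += 1
        edges.append((u, v))
    return edges
```

With b = 1 every event adds a node, so a run of T steps returns a tree on T + 1 nodes; with b < 1 the same routine produces multigraphs with cycles, self-loops and parallel edges.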

C. Inference algorithms
According to the model, every event marks the arrival of precisely one new edge. This allows us to represent histories compactly as an ordering of the edges of G in discrete time t = 0, …, T − 1, an arbitrary time-scale defined in terms of events [1,37]. Estimating the history then amounts to estimating the arrival times τ_X(e) of the edges e ∈ E(G) in the ground truth history X.
A good estimator τ̂(e) of the arrival time of edge e is the posterior average

τ̂(e) = ∑_{X ∈ Ψ(G)} τ_X(e) P(X|G, θ),   (2)

where Ψ(G) is the set of histories consistent with G. This estimator effectively combines histories, overcoming the degeneracy of the posterior distribution. It is straightforward to show that it minimizes the expected mean-squared error (MSE) on τ_X(e), and we therefore refer to it as the MMSE estimator of the arrival time. Unfortunately, calculating the complete set of MMSE estimators {τ̂(e)} is an intractable task, because there are far too many histories consistent with networks of even moderate sizes (the bound |Ψ(G)| = O(E!) holds, sometimes tightly so). Hence, we resort to approximations. We consider algorithms that fall in two broad categories (see Materials and Methods): Sampling methods that approximate (2), and structural methods that only rely on G to make predictions, forgoing explicit knowledge of the posterior distribution P(X|G, θ). The structural methods are much faster than sampling, but also less accurate, because they rely on network properties that are known to correlate only loosely with age in random attachment processes. We use two such properties: The degree of nodes [5,45], and a generalization of the k-core decomposition of networks, known as the onion decomposition (OD) [46], because these two network properties encode temporal information [5,45,47]. We note that the OD is closely related to the peeling method introduced in Ref. [34] to tackle the archaeology problem in the case (γ = 1, b = 1). In both cases, we order nodes according to the network property, and then induce a ranking on the edges.
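Both families of algorithms are simple to sketch. Given (weighted) posterior samples of the history, the MMSE estimator of Eq. (2) is a weighted average of arrival times; and the layered structure used by the OD can be approximated by repeatedly stripping minimum-degree nodes. The code below is a simplified sketch of both ideas (our names, not the paper's; the true OD peels within each k-core, which this loop only approximates):

```python
from collections import defaultdict

def mmse_arrival_times(sampled_histories, weights=None):
    """Monte Carlo approximation of the MMSE estimator of Eq. (2):
    average each edge's arrival time over (weighted) posterior samples."""
    if weights is None:
        weights = [1.0] * len(sampled_histories)
    total = sum(weights)
    est = defaultdict(float)
    for hist, w in zip(sampled_histories, weights):
        for t, edge in enumerate(hist):
            est[tuple(sorted(edge))] += w * t / total
    return dict(est)

def onion_layers(edges):
    """Simplified onion-style decomposition: repeatedly strip all nodes of
    minimum remaining degree; a node's layer is the round of its removal.
    Higher layers are more central and, heuristically, older."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v); adj[v].add(u)
    layer, layers = 0, {}
    remaining = set(adj)
    while remaining:
        k = min(len(adj[n] & remaining) for n in remaining)
        layer += 1
        peel = {n for n in remaining if len(adj[n] & remaining) <= k}
        for n in peel:
            layers[n] = layer
        remaining -= peel
    return layers
```

An edge ranking then follows by sorting edges on, e.g., the sum of their endpoints' layers, with more central edges deemed older.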

D. Inference on artificial trees
We begin by testing the inference algorithms on trees drawn from the generative model itself (i.e., we set b = 1 and consider that γ is a free parameter). We compute the quality of a recovery using the Pearson product-moment correlation of the estimated arrival times and the ground truth.
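For reference, the quality measure is the standard Pearson correlation between estimated and true arrival times; a dependency-free sketch:

```python
def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length lists
    (here: estimated vs. true edge arrival times)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)
```

A correlation of 1 means the inferred ordering matches the ground truth exactly (up to an affine rescaling of times); 0 means no temporal information was recovered.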
The average achieved correlation is shown as a function of the attachment kernel γ in Fig. 2 (a), on small networks (T = 50). We distinguish two regimes based on the performance of the degree estimators: The regime γ > 0, characterized by skewed distributions of degrees, and the homogeneous regime γ < 0. The three methods behave similarly in the former regime: They first yield a relatively large correlation at γ = 0, and their quality then quickly plummets with growing γ, ultimately converging to a null average correlation for sufficiently large values of γ. The MMSE estimators remain slightly superior to the OD estimators throughout, and they both outperform the degree estimators by a significant margin. In contrast, the gap between methods is much larger in the homogeneous regime. While the quality of the OD and MMSE estimators increases with decreasing γ, the correlation achieved by the degree estimators goes in the opposite direction and shrinks as γ decreases, eventually reaching 0 (not shown in the figure).
A better numerical portrait of the dependence of the attained correlation on γ is shown in Fig. 3 (a), where we apply the efficient OD method to increasingly large networks. We find that for most values of γ > 1, the average correlation attained by the OD decreases as T^{−δ(γ)} with δ(γ) > 0. If γ is close enough to 1, however, the average correlation becomes independent of T.

E. Inference on artificial networks with cycles
It is clear that trees offer an easier challenge than general networks, because long-range loops (i) drastically increase the number of histories consistent with G, and (ii) introduce uncertainties in the ordering of large subsets of edges. To get a better understanding of the inference process, we therefore repeat the above numerical experiments on more general networks that include cycles, generated with b < 1. The outcomes of our experiments are summarized in Fig. 2 (b) and Fig. 3 (b).
Allowing for cycles leads to three notable differences. First, we find that near-perfect recovery is no longer possible in the γ < 0 regime. Second, the separation between the MMSE estimators and the structural methods (OD, degree) becomes more pronounced for all γ. Third and finally, the transition becomes sharper and it occurs at a lower value of γ_c(b); notice the much sharper decline in Fig. 3 (b).

FIG. 4 (caption). The horizontal solid lines show the accuracy of randomly constructed sets R (no information retrieval), while the horizontal dotted lines show the expected success rate in the limit γ → −∞, where a simple peeling technique is optimal. The sampled root sets are computed for γ ∈ {−1, −1/2, 0, 1/2, 1}, using n = 10^5 samples.

F. A different task: root-finding
Inferring the complete history of a network is only one of many possible problems that fit within the Bayesian formulation of network archaeology. Any other temporal inference task may be attacked with the same set of tools. As an example of the versatility of the framework, let us treat one such problem: Finding the root (the first edge) of G [31].
In line with Refs. [28-30], we can give a solution to this problem in terms of sets: We define a procedure that returns a set R of K(ε) edges, and guarantees that it contains the first edge with probability 1 − ε. The size K depends on the acceptable error rate ε; larger sets cast a wider net, and are therefore more likely to contain the root.
To compute R, we use a strategy based on the marginalization of the distribution P(X|G, θ). We first obtain the probability P[τ_X(e) = 0] that an edge e is the first, for each e ∈ E, via

P[τ_X(e) = 0] = ∑_{X ∈ Ψ(G)} I[τ_X(e) = 0] P(X|G, θ),   (3)

where I[S] is the indicator function, equal to 1 if the statement S is true, and to 0 otherwise. We then define R as the set formed by the K edges that have the largest posterior probability P[τ_X(e) = 0], which we again evaluate by sampling. For comparison, we also infer the root with the much faster onion decomposition, by constructing R with the K most central edges (with ties broken at random). The accuracy of the resulting algorithms is shown as a function of γ in Fig. 4. We distinguish, again, two main phases: Accurate recovery is possible in the strongly homogeneous regime γ ≪ 0, but the success rate diminishes with growing γ, reaching a non-informative limit in the regime γ ≫ 0.
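Given posterior samples of the history, the construction of R can be sketched as follows (illustrative code, our names: it estimates P[τ_X(e) = 0] by the fraction of sampled histories that start with e, i.e., Eq. (3) with uniform sample weights):

```python
from collections import Counter

def root_set(sampled_histories, K):
    """Build a credible set R for the first edge: estimate the marginal
    P[tau_X(e) = 0] by the fraction of sampled histories whose first edge
    is e, then keep the K most probable edges."""
    counts = Counter(tuple(sorted(h[0])) for h in sampled_histories)
    ranked = sorted(counts, key=counts.get, reverse=True)
    return set(ranked[:K])
```

Calibrating K against the acceptable error rate ε then amounts to growing the set until the estimated probability mass inside R reaches 1 − ε.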

A. Of information and phase transitions
In the Results section, we have shown that the history encoded in a network's structure can be recovered to varying degrees of accuracy, depending on the parameters of the generative model. These variations in accuracy, we now argue, are attributable to changes in the abundance of equivalent edges, i.e., edges that can never be ordered because they are structurally indistinguishable.

Model phenomenology
To better understand the origin of variations in recovery quality, we have to analyze the generative model itself. Let us focus for the moment on the special case b = 1 and γ ∈ ℝ, thoroughly analyzed in Ref. [40]. This model has many known phases, characterized by different degree distributions. In the limit γ → −∞, the model generates long paths, where every node has degree 2 except for the two end-nodes, of degree 1. For all negative values of γ, the model favors homogeneous degrees. When γ = 0, the degree distribution is geometric, of mean 2 (since we recover the uniform attachment model [36]). In the interval 0 < γ < 1, the degree distribution takes the form of a stretched exponential, with an asymptotic behavior fixed by γ. At precisely γ = 1, the attachment kernel becomes linear and the networks are scale-free: The degree distribution follows a power-law of exponent −3. In the interval 1 < γ < 2, the networks condense through a rapid succession of phase transitions at γ_m = (m + 1)/m for m ∈ ℕ*. When γ > γ_m, the number of nodes of degree greater than m becomes finite. As a result, an extensive fraction of the edges aggregates around a single node (the condensate), and this fraction grows with increasing γ [48]. The condensation is complete at γ = 2, where the model enters a winner-takes-all scenario characterized by a central node that monopolizes nearly all the edges.
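One of these claims is easy to probe numerically: a geometric degree distribution of mean 2 puts probability 1/2 on degree one, so at γ = 0 about half the nodes should be leaves. A self-contained sketch (our code; with b = 1 and γ = 0 the model reduces to a random recursive tree):

```python
import random

def leaf_fraction_uniform_tree(n_nodes, seed=0):
    """Grow a random recursive tree (gamma = 0, b = 1): each new node
    attaches to a uniformly chosen existing node. Returns the fraction of
    leaves, which should approach 1/2 for large trees, consistent with the
    geometric degree distribution of mean 2."""
    rng = random.Random(seed)
    degree = [1, 1]                      # seed edge between nodes 0 and 1
    for _ in range(n_nodes - 2):
        target = rng.randrange(len(degree))  # uniform attachment
        degree[target] += 1
        degree.append(1)                 # the newcomer is a leaf
    return sum(1 for d in degree if d == 1) / len(degree)
```

The same experiment with a degree-proportional draw (γ = 1) yields a visibly heavier tail, in line with the power-law phase described above.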

No-recovery phase
The results of Figs. 2-4 show that there is a region of this parameter space (γ ≫ 0) where correlated recovery is, without a doubt, impossible in the limit of large system sizes. Furthermore, the scaling analysis of Fig. 3 suggests that this is only true for some γ ≥ γ_c, the hallmark of a phase transition at γ_c. We define this transition as follows: On one side of γ_c, there exists an algorithm that returns estimators {τ̂(e)} attaining a non-vanishing average correlation with the ground truth in the limit of large network sizes, while no such method exists on the other side of the divide, in the no-recovery phase.
The exact position γ_c of the threshold is of theoretical interest. It is certainly no greater than 2, because the edges of the star-like graphs typical of this regime [40], i.e., full condensates, are intrinsically unorderable. If we are able to find positively correlated histories beyond γ = 2 in Fig. 2, it is only because the system is small. Scale the network up and our predictive power vanishes (see Fig. 3). But while it is clear from the phenomenology of the generative model that the transition lies at some γ_c ≤ 2, its exact location is harder to pinpoint.
We can give a lower bound on γ_c by showing that an algorithm attains a non-vanishing average correlation at some γ ≤ γ_c. This is what is done numerically in the finite-size scaling analysis of Fig. 3, where we use the OD algorithm to infer the history of larger and larger networks. The OD fails at some 11/10 ≤ γ ≤ 5/4, because it achieves an average correlation that decreases with growing T when γ = 5/4, while it attains a non-decreasing correlation when γ ≤ 11/10. Unless the scaling behavior changes outside of the inspected range of network sizes, this tells us that the phase transition occurs at some γ > 11/10.

Origin of the no-recovery phase
The appearance of a condensate from γ > 1 onwards gives a nice qualitative explanation as to why there should be a phase transition in recovery quality at all. When an edge attaches to the condensate, the temporal information it carries becomes inaccessible. Furthermore, because these edges are added throughout the growth process, any estimation technique that tries to find a total ordering will conflate old and new edges in a single class. The diminishing correlation of the estimators in the regime γ > 1 is hence at least in part attributable to the presence of this condensate.
To quantify the impact of equivalent edges on the attained correlation, we verify how the average information content (IC) of the generated networks scales with network size, for different values of γ (see Materials and Methods for details). In a nutshell, the IC is an information theoretic quantity that measures the prevalence of equivalent edges in a network [49]; it scales linearly with log(|E|) when there are Θ(|E|) sets of equivalent edges in G, and it goes to a constant when there are only a finite number of them. Importantly, the IC is defined in terms of truly equivalent edges, i.e., those that are indistinguishable up to an automorphism of the network [50], not only the edges of the condensate. What we want to verify is whether good performance correlates with the presence of many distinguishable sets of edges, i.e., a relatively large IC.
Our results are shown in Fig. 5 (a). The scaling behavior of the IC confirms that there is an extensive number of equivalence classes when γ = 1, despite the fact that the automorphism group is known to be non-trivial [51]. Figure 5 (b) shows the difference between the true information content and the information content obtained by assuming that the equivalence classes of edges are determined only by the degrees of the nodes at the ends of edges (a coarsening of the true equivalence classes). Because the difference is close to zero for high values of γ, this second figure tells us that most of the equivalence classes are degree classes in this regime. The figure also tells us that many new equivalence classes are created as γ approaches 1, precisely in the regime where the OD does well. Coupled with Fig. 3, Fig. 5 shows that an abundance of equivalent edges, specifically those of the condensate, implies poor performance.
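The degree-pair coarsening used in Fig. 5 (b) is straightforward to implement. The sketch below (illustrative code; the true IC is computed from automorphism orbits, which this code deliberately ignores) groups edges by the sorted degrees of their endpoints and returns the entropy of the resulting partition as a crude stand-in for the information content:

```python
from collections import Counter
from math import log2

def degree_class_entropy(edges):
    """Coarsen edge equivalence classes by the (sorted) degree pair of an
    edge's endpoints, and return the entropy of the resulting partition.
    A single class (e.g., a star) gives 0; many distinguishable classes
    give a large value."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    classes = Counter(tuple(sorted((degree[u], degree[v]))) for u, v in edges)
    m = len(edges)
    return -sum(c / m * log2(c / m) for c in classes.values())
```

On a star, every edge falls in the same degree class and the entropy vanishes, mirroring the unorderable condensate; on a path, the end edges are distinguishable from the interior ones.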

Nearly perfect recovery
While the abundance of nodes of low degree leads to the onset of a no-recovery phase in the regime γ ≫ 0, the exact opposite effect happens at the other extreme of the spectrum, when γ ≪ 0. In this regime, the networks are effectively grown as a random path, where all nodes are of degree two except the two end-nodes, of degree one. All edges are clearly ordered up to a mirror symmetry. Standard concentration inequalities tell us that the time of arrival of any edge can be identified up to an uncertainty of vanishing size in the large T limit. Near perfect recovery is therefore trivial: Peeling the path symmetrically from both sides yields a close approximation of the arrival time of every edge.
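In this path limit the peeling estimate takes only a few lines. A sketch (our code): because of the mirror symmetry, only the distance to the nearer end-node is identifiable, and that distance serves as the estimated relative age of an edge.

```python
from collections import defaultdict

def peel_path(edges):
    """Order the edges of a path by peeling symmetrically from both ends:
    an edge's estimated (relative) age is its distance to the nearer
    end-node, which is invariant under the mirror symmetry."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    ends = [n for n in adj if len(adj[n]) == 1]
    order, node, prev = {}, ends[0], None
    for i in range(len(edges)):
        nxt = next(x for x in adj[node] if x != prev)
        # min(i, L-1-i): distance from the nearer end of the path
        order[tuple(sorted((node, nxt)))] = min(i, len(edges) - 1 - i)
        node, prev = nxt, node
    return order
```

Note that the output does not depend on which end the walk starts from, which is exactly the residual uncertainty left by the mirror symmetry.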

Effect of cycles
The above conclusions explicitly rely on the fact that the networks are trees, i.e., that b = 1. It turns out that many of these conclusions carry over to the case b < 1, as highlighted by the similarity of the numerical results in Figs. 2-3. There are essentially two areas where allowing for cycles brings notable differences: Near-perfect recovery becomes impossible when γ ≪ 0, and the correlation attained by our best methods decreases faster with growing γ.
The disappearance of the near perfect recovery phase in the regime γ ≪ 0 is attributable to the appearance of random long-range connections. These connections close dangling paths and erase all temporal information along them. The net result is that the limit γ → −∞ actually poses a hard challenge for any b < 1, mirroring the limit γ → ∞. The near perfect recovery phase is thus highly uncharacteristic of the general model.
If the correlation diminishes more abruptly with growing γ when b < 1, it is because cycles, self-loops and parallel edges (all allowed motifs in the regime b < 1) accentuate the condensation phenomenon. A typical network realization in the super-linear regime γ > 1 with b < 1 comprises: Many unorderable self-loops centered on the condensate; a number of parallel edges connecting high degree nodes; and star-like node arrangements around high-degree nodes. The information sinks are thus both larger and better interconnected than in the case b = 1.

B. On the quality of the inference methods
Our numerical results suggest that even though the naive degree-based approach works relatively well in the regime γ > 0, the MMSE estimators and the onion decomposition actually perform much better across the board. Of these two methods, the correlation attained by the MMSE estimators is perhaps the least surprising; after all, they build on an explicit knowledge of the growth model. In fact, we can show (see Supplementary Information) that using anything but the MMSE estimators yields suboptimal results on average. The result that needs explaining, then, is the excellent average correlation achieved by the OD.
A simple combinatorial argument can explain this performance. Notice that for a network G, the vast majority of histories X ∈ Ψ(G) place central edges at the beginning, and peripheral edges at the end [31,52] (both in trees and general networks). This is a consequence of the fact that there are many more consistent ways of enumerating the graph G starting from its center than from its periphery. The net result is that the MMSE estimators of the central edges are heavily skewed towards early arrival times, while those of the edges in the periphery are skewed towards later times (see (2)). In other words, the separation in layers uncovered by the OD is contained directly in the posterior distribution: the onion decomposition is good because it somehow approximates the optimal MMSE estimators. This is a useful connection, because the OD is much more efficient than sampling: The OD returns its final point estimates in O(|E| × log |V|) steps [46], whereas a single sample is generated in roughly as many steps by the sampling algorithm.

C. New insights on related works
At this point, we ought to discuss important connections with related work on root-finding [28-31] and complete history inference [34,35], on random and preferential trees (i.e., for the models (γ, b) = (0, 1) and (1, 1)). These previous analyses establish optimal algorithms to find node orderings; so let us define the operator τ_X(v), identical to τ_X(e) but defined on vertices.
The most comprehensive root-finding method is put forward in Ref. [28]. Their strategy is to compute the number ϕ(v) = |{X | τ_X(v) = 0}| of histories rooted on v, and to return the K nodes with the largest ϕ(v). They show that this algorithm can be employed to construct sets of constant size that contain the root with a fixed error rate ε < 1 as T goes to infinity, and that the case γ = 0 is easier than the case γ = 1 (smaller sets are needed to attain the same error rate ε). Our results (Fig. 4) corroborate these observations and put them in the broader context of Bayesian inference with general γ (a generalization suggested in Ref. [28]). For example, notice that ϕ(v) is in fact proportional to the posterior probability of Eq. (3), ϕ(v) ∝ P[τ_X(v) = 0], when the distribution P(X|G, γ) is uniform. Thus the estimators of Refs. [28,29,31] can in fact be seen as outputting the K first maxima of the marginal distribution for the first node of G, assuming a uniform posterior distribution. The general estimator appearing in (3) accounts for general parametrizations, and points the way to obvious extensions to more complicated root graphs [30].
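For γ = 0 the posterior P(X|G, γ) is uniform over consistent histories, so ϕ(v) reduces to counting the increasing orderings of the tree rooted at v. This count is given by the classical hook-length-type formula n!/∏_u s_u, where s_u is the size of the subtree rooted at u; a sketch (our implementation of the formula, not the algorithm of Ref. [28]):

```python
from math import factorial

def count_rooted_histories(adj, root):
    """Count node orderings of a tree that start at `root` and only add
    nodes adjacent to the already-built subtree: n! divided by the product
    of all subtree sizes (the hook length formula for trees)."""
    sizes = []

    def subtree(node, parent):
        s = 1
        for child in adj[node]:
            if child != parent:
                s += subtree(child, node)
        sizes.append(s)  # record the size of the subtree rooted here
        return s

    n = subtree(root, None)
    prod = 1
    for s in sizes:
        prod *= s
    return factorial(n) // prod
```

On a 3-node path, rooting at the center admits two orderings (one per leaf order) while rooting at a leaf admits exactly one, matching the intuition that central nodes dominate the count ϕ(v).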
Turning to related works on complete history recovery, we note that the recent analysis of Ref. [34] also relies on a peeling algorithm, almost isomorphic in its action to the onion decomposition [46]; it is shown in this reference that the algorithm performs extremely well on scale-free trees (γ = 1 and b = 1), in line with our analysis. Importantly, the notion of inference quality used in Ref. [34] is different from ours, because the authors point out that there is a trade-off between precision and density (number of ordered pairs of nodes). In particular, when the peeling algorithm is allowed to withhold judgment on contentious pairs of nodes, one obtains nearly perfectly accurate estimates, at the cost of a small estimate density. Our analysis showcases another aspect of the power of peeling-type algorithms: They are also almost as effective at extracting information as the best estimators (MMSE), when all pairs must be ordered.

D. Application: Phylogenetic tree of the Ebola virus
The end goal of network archaeology is to uncover temporal information from real, statically observed networks not explicitly generated by any growth process. It was shown recently that growth processes can be hard to tell apart from one another, even when perfect temporal data is available [16]. One consequence is that the generalization of PA, and many more models, can actually make sense as an effective model of the growth of some real networks. The details ultimately do not matter much.
As a proof of concept, we apply our method to the inferred phylogenetic tree of the Ebola virus for the 2013-2016 West African Ebola epidemic [53,54]. The extensive coverage of the surveillance and sequencing effort for this epidemic [55] means that the metadata is a close approximation of the strain emergence times. Our goal is to find an ordering of the emergence of all strains consistent with the metadata.
In Fig. 6, we show that all the inference methods recover some level of temporal information; statistical inference, however, does much better than the others, regardless of the measure of quality used. The naive estimators yield correlations of ρ_degree = 0.152 and ρ_OD = 0.150 with the known metadata, while we find ρ_MMSE = 0.456 with the sampling method (using 25 000 samples). Furthermore, while all the methods resolve pairs of mutations separated by any number of time steps better than chance, the MMSE estimators greatly outperform the other techniques. This performance gap is here due to the presence of equivalent edges; the OD identifies only 26 sets of distinguishable edges, while the degree estimators identify 47. In contrast, the true MMSE estimators would be able to order all pairs of edges not on the same orbit (1588 distinguishable sets of edges), a property that is retained by the subsampled estimators.
Similar analyses are not possible for most epidemics, due to a lack of data. For epidemics that are more sparsely monitored than the 2013-2016 Ebola outbreak, reconstruction typically relies on models that are hard to parametrize (transmission rates, mutation rates, demographics, etc.) [56]. The quality of our results suggests that reconstructing the history of phylogenetic trees should be one of the interesting avenues for future network archaeology, especially since the method does not rely on parameters that require additional data sources.

FIG. 6 (caption, partial). (left) […] [53], while the remainder of the nodes represent inferred common ancestors (n = 959, in blue), and edges are the most likely mutations [54]. We find that this network is best modeled by γ = −0.71, associated with a KS-statistic of D* = 0.17 and a p-value P[D > D*] = 0.53 under the random model, signaling a good fit (see Materials and Methods). (right) Fraction of the edge pairs correctly ordered by the estimators, when separated by at least t_min time steps in the metadata. The many ties found by the OD and degrees are broken at random.

E. Looking ahead: challenges and generalizations
The opportunities brought about by network archaeology are tantalizing. In bioinformatics alone (the only field where it has found widespread adoption thus far [27,57]), network archaeology with duplication-divergence models has already yielded insights into the past states of real PPI networks [17,20] and improved on network alignment [18,19]. Generalization to models that are relevant to social and technological networks will allow us to answer new questions about the past of statically observed systems, and to improve on network analysis techniques [58].
Our paper shows an example of how to carry out this analysis almost automatically. With a model specified as a Markov chain, a sequential Monte-Carlo algorithm based on Refs. [37,38,59-61] provides weighted histories that can be aggregated into MMSE estimators, yielding optimal estimators of the true history of a network. This recipe will always work, as long as histories leave tangible traces (i.e., there are no deletion events [21]).
That said, our analysis is of course far from complete, and leaves a number of important theoretical and computational problems open. First, while we have provided compelling evidence for the existence of a scalable inference phase and a no-recovery phase, we have not pinpointed the location γ_c of the transition that separates them. Our numerical analysis suggests that it lies at some rational value γ_c = (m + 1)/m for 2 ≤ m ≤ 10 when b = 1, but finding the exact location will require further analytical work, perhaps in the spirit of [51]. Second, the phenomenology of the observed phase transition is strikingly similar to that observed in many disordered models [62,63]. While growth models are formally out-of-equilibrium processes, and thus cannot be obviously mapped onto disordered models, it will be important to establish how the network archaeology phase transition fits within the broader family of phase transitions in Bayesian inference problems. Finally, we have argued for using the OD [46] in large networks because sequential Monte-Carlo is ultimately not scalable. This substitution will most likely not work with all models, because it relies on a correlation, specific to the class of models studied here, between the posterior averages of arrival times and the order in which a network is peeled. As a result, the next step for general network archaeology should be to derive efficient approximation methods that work with general models, to allow for flexible network archaeology. These methods will have to handle models specified as chains P(X|θ) with some arbitrary notion of consistency P(G|X, θ). The relaxation technique of Ref. [64] for permutation inference comes to mind, but so do dynamical variants of message passing [65], perhaps in the spirit of Ref. [66].

Generative model
We consider a growth model that generalizes the classical preferential attachment (PA) model of Barabási and Albert [5]. The important features of the generalization are a non-linear attachment kernel k^γ [40] and the possibility for new links to connect pairs of existing nodes [15,39,43].
The model generates sequences of undirected multigraphs G_0, ..., G_{T−1}, where G_{t−1} has one fewer edge than G_t. At time step t, we first draw a random node from V_t, the node set of G_t prior to any modifications of the graph's structure. This node is selected from a categorical distribution of weights

u_i(γ, V_t) = k_i(t)^γ / Σ_{j ∈ V_t} k_j(t)^γ ,

where γ ∈ ℝ is the exponent of the attachment kernel, and where k_i(t) is the degree of node i at time t, before the new edge is added. We then complete the edge with a new node with probability b, or with an existing node with probability 1 − b (a densification event). When densification events occur, the second node is also selected from a categorical distribution of weights {u_i(γ, V_t)}. We refer to a particular sequence of graphs ending with G as a possible history for G. Histories are encoded as tuples X_{0:T−1} = (G_0, ..., G_{T−2}, G), where X_t = G_t is the graph at time t (we drop the subscripts whenever the intended meaning is clear, as in the main text). Since the above growth model is a Markov process, the prior distribution P(X_{0:T−1}|γ, b) can be written as a product of transition probabilities ∏_{t=1}^{T−1} P(X_t|X_{t−1}, γ, b) with

P(X_t|X_{t−1}, γ, b) = u_{v_1}(γ, V_t) × b^{1−ξ} × [(1 − b) u_{v_2}(γ, V_t)]^ξ ,

where we have saved space by using ξ to denote an indicator variable that is equal to 1 when the transition X_{t−1} → X_t is a densification event (and 0 otherwise), and where v_1 and v_2 denote the nodes involved in the transition at time t. The posterior probability P(X_{0:T−1}|G, γ, b) of history X_{0:T−1} is obtained by conditioning on the observed labeled multigraph G (see (1), main text). The likelihood P(G|X_{0:T−1}, γ, b) appearing in this formula is extremely simple: It equals 1 if and only if the history is consistent with G (equivalently: feasible [37]), and 0 otherwise. Hence the posterior distribution can be written as

P(X_{0:T−1}|G, γ, b) = P(X_{0:T−1}|γ, b) I[X_{0:T−1} ∈ Ψ(G)] / P(G|γ, b) ,

where I[S] is the indicator function, and where the evidence

P(G|γ, b) = Σ_{X ∈ Ψ(G)} P(X|γ, b)

is given by a sum over the set Ψ(G) of histories consistent with G.
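As a concrete illustration, the growth process just described can be simulated in a few lines. The following is a minimal sketch, not the reference implementation: the function name and the labeling of nodes by arrival order are ours, and G_0 is taken to be a single edge. It returns the list of edges in their order of arrival, i.e., the ground-truth history of the generated multigraph.

```python
import random

def grow_network(T, gamma, b, seed=None):
    """Generate one history of the generalized preferential attachment
    model: a first node v1 is drawn with weight k^gamma; with probability
    b the edge is completed by a brand-new node, and with probability
    1 - b (a densification event) by a second existing node drawn with
    the same weights.  Returns the edges in order of arrival."""
    rng = random.Random(seed)
    edges = [(0, 1)]                      # G_0: a single edge
    degree = {0: 1, 1: 1}
    for _ in range(T - 1):
        nodes = list(degree)
        weights = [degree[u] ** gamma for u in nodes]
        v1 = rng.choices(nodes, weights=weights)[0]
        if rng.random() < b:              # attach a brand-new node
            v2 = len(degree)              # new nodes labeled by arrival order
        else:                             # densification event
            v2 = rng.choices(nodes, weights=weights)[0]
        edges.append((v1, v2))
        degree[v1] = degree.get(v1, 0) + 1
        degree[v2] = degree.get(v2, 0) + 1
    return edges
```

Note that densification events may create self-loops or multi-edges, consistent with the multigraph setting above.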

Inference task
Let τ_X(e) ∈ {0, ..., T − 1} denote the position of the edge e ∈ E(G) in history X (we also call it its arrival time). Our goal is to give the best possible estimate of {τ_X(e)}_{e ∈ E(G)}, using only the structure of G, where X is the real history of G, i.e., the ground truth. By convention, we express both the estimators and the true history in the time scale t = 0, ..., T − 1 where T = |E|. Note, however, that the estimator τ(e) of τ_X(e) need not be an integer, or distinct from other estimators (i.e., we allow τ(e) = τ(e′) for e ≠ e′).
We quantify the quality of the estimators {τ(e)} of the true arrival times using the Pearson product-moment correlation coefficient

ρ = Σ_e (τ(e) − τ̄)(τ_X(e) − τ̄) / √[ Σ_e (τ(e) − τ̄)² Σ_e (τ_X(e) − τ̄)² ] ,   (9)

where τ̄ = (T − 1)/2, and where we have dropped the subscript for the sake of conciseness. In taking the sum, it is assumed that the edges are distinguishable (this matters for multigraphs with self-loops). The correlation takes values in [−1, 1], where |ρ| = 1 indicates a perfect recovery up to a time reversal, and where ρ = 0 indicates that no information is extracted from the graph at all. It is not affected by an arbitrary linear transformation of the timescales, and it penalizes spurious ordering of tied events.
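For reference, the quality measure of Eq. (9) is the standard Pearson correlation between the two lists of arrival times. A minimal sketch (the function name is ours), which centers each series on its own mean; for the estimators considered here both means equal τ̄ = (T − 1)/2:

```python
from math import sqrt

def history_correlation(tau_true, tau_hat):
    """Pearson correlation of Eq. (9) between the true arrival times and
    their estimators, listed edge by edge in the same order."""
    n = len(tau_true)
    m_true = sum(tau_true) / n            # equals (T - 1)/2 on 0, ..., T-1
    m_hat = sum(tau_hat) / n              # preserved by the ranking estimators
    cov = sum((a - m_true) * (b - m_hat) for a, b in zip(tau_true, tau_hat))
    var_true = sum((a - m_true) ** 2 for a in tau_true)
    var_hat = sum((b - m_hat) ** 2 for b in tau_hat)
    return cov / sqrt(var_true * var_hat)
```

Perfect recovery gives ρ = 1, and a perfect time reversal gives ρ = −1, as stated above.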

Limits and difficulty of the inference task
It is impossible to order two edges of G if they are indistinguishable up to an automorphism of G.The difficulty of the recovery problem is, as a result, determined by the distribution of equivalent edges in G.
Equivalent edges can be formally defined in terms of node orbits [67]. An orbit is a set of nodes that maps onto itself under the automorphisms of the graph. The decomposition into orbits of the line graph L(G) of G gives the equivalent edges of G. L(G) is constructed by assigning one node to every edge of G and by connecting two nodes if the corresponding edges share an endpoint; a node orbit in L(G) corresponds to a set of indistinguishable edges in G because the transformation from G to L(G) is reversible and unique except in one pathological case [68].
The information content of G [49,69] is a natural summary of the distribution of equivalent edges in G; it is defined as

S(G) = − Σ_{i=1}^{q} (|C_i| / |E|) log(|C_i| / |E|) ,

where C_1, ..., C_q are the q < |E| orbits of L(G). If all orbits are finite, then the information content of G is of order log |E|.
Conversely, if one extensive orbit accounts for the totality of edges (e.g., when G is a star graph), then S(G) is zero. In general, if there are ℓ orbits that account for a non-vanishing fraction α = α_1 + ... + α_ℓ of all edges, then S(G) ≈ (1 − α) log |E| − Σ_i α_i log α_i. We note that the standard definition of S is in terms of nodes; we have here opted for edge partitions since edges are the basic unit of our inference.
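Given the orbit sizes |C_1|, ..., |C_q| of L(G), the information content is the Shannon entropy of the induced partition of the edges. A minimal sketch (the function name is ours; computing the orbits themselves requires an automorphism solver and is not shown):

```python
from math import log

def information_content(orbit_sizes):
    """Information content S(G): the Shannon entropy of the partition of
    the |E| edges into the orbits C_1, ..., C_q of the line graph L(G)."""
    E = sum(orbit_sizes)
    return -sum((c / E) * log(c / E) for c in orbit_sizes)
```

A single extensive orbit gives S(G) = 0 (the star graph case above), while |E| singleton orbits give the upper bound S(G) = log |E|.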

Structural estimation algorithms
We regroup in this section all the methods that forgo an explicit knowledge of the posterior distribution P(X_{0:T−1}|G, γ, b) to make predictions. All these methods follow the same general formula: We first rank the edges based on some network property P (e.g., the edge's centrality), and then output the ranks directly as estimated arrival times. Whenever the edges of a subset S ⊆ E are indistinguishable according to property P, we give them the same rank λ(S), reflecting the uncertainty of their true ordering. To set λ(S), we require that the average time of arrival τ̄ be preserved; this constraint forces λ(S) = t + (m + 1)/2, where m = |S| and t + 1, ..., t + m are the ranks that would have been assigned to the edges of S, had they been ordered. This choice is optimal in the sense that assigning any other rank to the edges of S would not reliably increase the overall correlation if τ̄ is preserved (see Supplementary Information). It also has the added benefit that the Pearson correlation of (9) can then be computed directly as Spearman's rank correlation, using the true and inferred arrival times as ranks.
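The tie-handling rule λ(S) = t + (m + 1)/2 amounts to assigning average ranks, exactly as in Spearman's rank correlation. A minimal sketch (the function name is ours) that ranks items in increasing order of an arbitrary property:

```python
def tied_ranks(values):
    """Rank items in increasing order of `values`, giving each set S of
    tied items the common rank lambda(S) = t + (m + 1)/2, where t items
    precede the set and m = |S|.  This preserves the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    t, i = 0, 0
    while i < len(order):
        j = i
        while j < len(order) and values[order[j]] == values[order[i]]:
            j += 1                        # extend the current tied group
        m = j - i
        for k in range(i, j):
            ranks[order[k]] = t + (m + 1) / 2
        t += m
        i = j
    return ranks
```

By construction, the ranks of any input sum to 1 + 2 + ... + T, so the average time of arrival is preserved, as required above.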

Degree-based estimation
Our first structural estimator is based on the observation that the nodes that arrive earlier in a growth process have, on average, a larger degree [5,45]. We use the degree of nodes to induce a ranking of edges as follows. Let (k_e^low, k_e^high) denote the degrees of the nodes connected by edge e, with k_e^low ≤ k_e^high. We rank edges in descending order of k_e^high, and break ties with k_e^low when possible.
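A minimal sketch of this degree-based ranking (the function name is ours), reusing the average-rank convention for edges that remain indistinguishable:

```python
from collections import Counter

def degree_rank_edges(edges):
    """Degree-based estimator: rank edges in descending order of the
    larger endpoint degree k_high, breaking ties by the smaller endpoint
    degree k_low; edges with identical (k_high, k_low) pairs share the
    average rank lambda(S) = t + (m + 1)/2."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    keys = []
    for u, v in edges:
        lo, hi = sorted((deg[u], deg[v]))
        keys.append((-hi, -lo))           # rank 1 goes to the largest k_high
    order = sorted(range(len(edges)), key=lambda i: keys[i])
    ranks = [0.0] * len(edges)
    t, i = 0, 0
    while i < len(order):
        j = i
        while j < len(order) and keys[order[j]] == keys[order[i]]:
            j += 1
        m = j - i
        for k in range(i, j):
            ranks[order[k]] = t + (m + 1) / 2
        t += m
        i = j
    return ranks
```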

Layer-based estimation
Our second structural estimator is based on the onion decomposition (OD), a generalization of the k-core decomposition. In the classical O(|E| log |V|) algorithm for the k-core decomposition [46,70], a network is "peeled" by repeatedly removing the node of lowest degree and adjusting the degrees. This process can be batched by removing all nodes of the same current degree simultaneously. The coreness of a node is then given by its degree when it is removed. Unlike the classical k-core algorithm, the OD treats batches of simultaneous removals as separate layers; this assigns both a coreness and a layer number to each node. To turn these numbers into ranks, we assume that nodes with the lowest coreness numbers appeared last and that, within a coreness class, the first removed nodes are the youngest. A simple modification allows the algorithm to order edges: An edge is assigned to a class as soon as one of its nodes is peeled away. All edges removed in the same pass are declared tied. Because we effectively discard the coreness number to order edges and nodes, the OD is almost equivalent to the peeling algorithm of Ref. [34], which proceeds by iteratively removing the lowest degree nodes, but without batching.
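A sketch of the batched peeling that underlies this edge ordering (the function name is ours; since the coreness number is effectively discarded when ordering edges, we track only the pass at which each edge is peeled):

```python
from collections import defaultdict

def onion_edge_ranks(edges):
    """Batched peeling: each pass removes every node whose current degree
    is at most the running core threshold; an edge is peeled as soon as
    one endpoint is removed.  Edges peeled in later passes are declared
    older, and edges peeled in the same pass are tied (average ranks)."""
    deg = defaultdict(int)
    incident = defaultdict(list)
    for i, (u, v) in enumerate(edges):
        deg[u] += 1
        deg[v] += 1
        incident[u].append(i)
        incident[v].append(i)
    alive_nodes, alive_edges = set(deg), set(range(len(edges)))
    peel_pass = [0] * len(edges)
    threshold, layer = 0, 0
    while alive_nodes:
        layer += 1
        threshold = max(threshold, min(deg[u] for u in alive_nodes))
        batch = {u for u in alive_nodes if deg[u] <= threshold}
        for u in batch:
            for i in incident[u]:
                if i in alive_edges:      # peel edge with its first endpoint
                    alive_edges.discard(i)
                    peel_pass[i] = layer
                    a, b = edges[i]
                    deg[a] -= 1
                    deg[b] -= 1
        alive_nodes -= batch
    # later passes are older: rank 1 = oldest, ties within each pass
    order = sorted(range(len(edges)), key=lambda i: -peel_pass[i])
    ranks = [0.0] * len(edges)
    t, i = 0, 0
    while i < len(order):
        j = i
        while j < len(order) and peel_pass[order[j]] == peel_pass[order[i]]:
            j += 1
        m = j - i
        for k in range(i, j):
            ranks[order[k]] = t + (m + 1) / 2
        t += m
        i = j
    return ranks
```

On a path graph, for instance, the two end edges are peeled first (youngest, tied) and the middle edge last (oldest).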

Principled algorithms
In contrast with the structural algorithms, principled algorithms explicitly harness the posterior distribution P(X 0:T −1 |G, γ, b) to make inference.

Estimators
The minimum mean square error (MMSE) estimator

τ(e) = ⟨τ_X(e)⟩_P = Σ_{X ∈ Ψ(G)} τ_X(e) P(X|G, γ, b) ,   (11)

appears to be the best choice of estimation function for τ(e) given P(X|G, γ, b), because using it for all edges yields an MMSE history that maximizes the expectation of the correlation a posteriori (see Supplementary Information).
Sequential Importance Sampling

Equation (11) is intractable for all but the simplest graphs, because the support of the posterior distribution is extremely large. This forces us to resort to sampling methods to approximate τ(e) as

τ(e) ≈ (1/n) Σ_{i=1}^{n} τ_{x_i}(e) ,   (12)

where x_i is a random history of length T drawn from the posterior distribution P(X|G, γ, b).
In practice it is hard to sample from the posterior directly, so we will prefer a transformation of (12). Given an "easy to sample" Markov process that enumerates the edges of G with probability Q(X_{0:T−1}|G) = ∏_{t=1}^{T−1} Q(X_t|X_{t−1}, G), we can express the MMSE estimators as

τ(e) ≈ Σ_{i=1}^{n} ω(x_i|G, γ, b) τ_{x_i}(e) / Σ_{i=1}^{n} ω(x_i|G, γ, b) ,  x_i ∼ Q ,   (13)

in the large n limit, which provides an alternate method for computing the estimators, now using a proposal distribution Q. This re-weighting scheme is known as the importance sampling (IS) method [60,61,71], and the recursive structure of the weights,

ω_t = ω_{t−1} P(X_t|X_{t−1}, γ, b) / Q(X_t|X_{t−1}, G) ,   (14)

yields a sequential importance sampling (SIS) algorithm [59] where one can update the weights efficiently. It turns out that for problems with the structure of network archaeology, the ESS of a population of samples generated with a SIS algorithm will go to 0 unless Q is very close to P [59]. A natural extension of SIS, called adaptive sequential Monte-Carlo (SMC), is designed to address this problem. In the SMC algorithm, we generate a set H(t) = {X^(i)_{0:t}}_{i=1,..,n} of n ≫ 1 histories in parallel, using the distribution Q. We keep track of their weights W(t) = {ω_i(X^(i)_{0:t})}_{i=1,..,n} as time moves forward, using (14) and an initialization at ω_i = 1. In doing so, we monitor the ESS, and we re-sample the set of histories whenever the ESS falls below a set threshold (n/2, the rule of thumb advocated by Ref. [59], appears to work well in practice). That is, we duplicate some histories and overwrite others, so as to obtain a new set H′(t) of uniformly weighted histories (and therefore one associated with an ESS of n). We implement the resampling step by drawing n indices {a_i}_{i=1,..,n} from the multinomial distribution of probabilities W̄ = {ω_i / Σ_j ω_j}_{i=1,..,n}, and setting H′ = {X^(a_i)_{0:t}} [59]. We then reset all weights to 1 and continue the propagation forward in time. See Refs. [37,38] for an alternative derivation of what is in effect the same algorithm, in the language of particle filtering and bridge sampling.
Note that while the resampling step increases the ESS by design, it does so at a cost. After resampling a few times, the first steps of the histories in H(t) converge to a small subset of possibilities, because any history that has evolved in an unlikely direction is culled early on. Since some of the erased histories could have eventually evolved to a high weight state, we can actually end up with poorer inference results (an effect called path degeneracy [59]). Our experiments (see Supplementary Information) suggest that the downsides of path degeneracy outweigh the benefits of an increased ESS on loopy graphs and very heterogeneous trees. As a result, we use resampling only when b = 1 and γ < 1.
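The ESS monitoring and multinomial resampling steps of the SMC algorithm can be sketched as follows (function names are ours; the forward propagation and weight updates of Eq. (14) are omitted):

```python
import random

def ess(weights):
    """Effective sample size, Eq. (15): (sum_i w_i)^2 / sum_i w_i^2."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)

def resample(histories, weights, rng):
    """Multinomial resampling: draw n indices a_i with probability
    proportional to the weights, duplicate the selected histories, and
    reset every weight to 1 (restoring an ESS of n)."""
    n = len(histories)
    picks = rng.choices(range(n), weights=weights, k=n)
    return [list(histories[a]) for a in picks], [1.0] * n
```

In the SMC loop, `resample` would be invoked whenever `ess(weights)` drops below the n/2 threshold discussed above.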

Proposal distribution
We use a variation on snowball sampling [73,74] as the proposal distribution Q. A snowball sample is a random recursive enumeration of a graph, rooted at a randomly selected seed. More explicitly, given a random initial edge e_0 (the seed), define the boundary as the set of all non-enumerated edges that share at least one node with e_0. Then select an edge e_1 from the boundary uniformly at random, and update the boundary by adding all new edges reached with e_1, repeating the process for e_2, e_3, ..., until the graph is exhausted. It is easy to see that the transition probability of the snowball sampling algorithm is

Q(X_t|X_{t−1}, G) = 1 / |Ω(X_{t−1})| ,

where Ω(X_t) denotes the boundary at step t, with the convention that |Ω(X_0)| = |E|.
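A minimal sketch of this snowball proposal (the function name is ours; it assumes a connected graph), returning one enumeration together with its probability Q(X_{0:T−1}|G):

```python
import random

def snowball_enumeration(edges, rng):
    """Draw one snowball enumeration of a connected graph G (a list of
    edges) and return it with its proposal probability: the seed edge is
    uniform over all |E| edges, and each subsequent edge is uniform over
    the current boundary."""
    T = len(edges)
    incident = {}
    for i, (u, v) in enumerate(edges):
        incident.setdefault(u, []).append(i)
        incident.setdefault(v, []).append(i)
    seed = rng.randrange(T)                # seed: |Omega(X_0)| = |E|
    order, done = [seed], {seed}
    prob = 1.0 / T
    boundary = set()

    def grow(edge_id):
        # add every non-enumerated edge that shares a node with edge_id
        for node in edges[edge_id]:
            for i in incident[node]:
                if i not in done:
                    boundary.add(i)

    grow(seed)
    while boundary:
        prob *= 1.0 / len(boundary)        # Q(X_t|X_{t-1}, G) = 1/|Omega|
        e = rng.choice(sorted(boundary))
        boundary.discard(e)
        done.add(e)
        order.append(e)
        grow(e)
    return order, prob
```

On a triangle, for instance, every enumeration has probability (1/3)(1/2)(1/1) = 1/6, whatever the random draws.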

FIG. 1: Reconstructing the history of a growing network. (a) An artificial network generated by our generalization of the preferential attachment model (with parameters γ = −1.1, b = 0.9, T = 50, see main text). Since the network is artificial, its true history, i.e., the time of arrival of its edges, is known. The width and color of edges encode this history; older edges are drawn with thick, dark strokes, while younger edges are drawn using thin, light strokes. The age of nodes is encoded in their radius. Our goal is to infer these times of arrival as precisely as possible, using the network structure as our only input. (b) Expected time of arrival, computed with 10^5 samples of the posterior distribution of the model. The correlation of the inferred and real history equals ρ = 0.81 (see Materials and Methods for details).

FIG. 2: Effect of the rich-get-richer phenomenon on recovery quality. Average correlation attained by the minimum mean-squared error (MMSE) estimators and two efficient methods based on network properties (a degree-based method and the onion decomposition [46]), on artificial networks of T = 50 edges generated using our generalization of the preferential attachment model. (a) Tree-like networks generated with b = 1. The special points corresponding to the uniform attachment model and the classical preferential attachment model are indicated with dotted vertical lines, at γ = 0 and γ = 1. The point where infinitely large networks fully condensate is shown with a solid vertical line at γ = 2 [40]. (b) Typical diagram for networks with cycles, here at b = 0.75. Each point is obtained by averaging the correlation obtained on m different network instances, with m = 40 for b = 1 and m = 250 for b = 0.75. We use n Monte Carlo samples to approximate the MMSE estimators, with n = 10^5 for b = 1 and n = 5 × 10^6 for b = 0.75.

FIG. 4: Root-finding on artificial trees. Success rate of the root-finding algorithms, with sets R of sizes K = 5 and K = 20, on artificial networks of T = 50 edges. The results of the OD are shown with solid lines and symbols, while the sampling results are shown with symbols and error bars of one standard deviation. The horizontal solid lines show the accuracy of randomly constructed sets R (no information retrieval), while the horizontal dotted lines show the expected success rate in the limit γ → −∞, where a simple peeling technique is optimal. The sampled root sets are computed for γ ∈ {−1, −1/2, 0, 1/2, 1}, using n = 10^5 samples.

FIG. 5: Finite size scaling of the average information content. (a) Average information content S(G) of tree networks (b = 1), as a function of network size T, for exponents γ ∈ {5/4, ..., 11/10, 1} of the non-linear kernel. Curves that lie over (orange) and below (blue) the observed critical γ_c are highlighted, matching those marked in Fig. 3 (a). The dotted line is the upper bound log(T) on the information content. (b) Information content not accounted for by the degree classes, i.e., S(G) − S_deg(G), where S_deg(G) is the Shannon entropy of the partition obtained by classifying edges according to their nodes' degrees, see main text.

FIG. 6: Application to the phylogenetic tree of strains of the Ebola virus. (left) Leaves (n = 1238, in orange) represent strains of Ebola sequenced during the 2013-2016 West African outbreak [53], while the remainder of the nodes represent inferred common ancestors (n = 959, in blue), and edges are most likely mutations [54]. We find that this network is best modeled by γ = −0.71, associated with a KS statistic of D* = 0.17 and a p-value P[D > D*] = 0.53 under the random model, signaling a good fit (see Materials and Methods). (right) Fraction of the edge pairs correctly ordered by the estimators, when separated by at least t_min time steps in the metadata. The many ties found by OD and degrees are broken at random.

Sequential Monte-Carlo

SIS methods encounter practical problems when the variance of the weights is large: A few samples dominate the others, leading to a poor characterization of the posterior distribution. The effective sample size [72]

ESS({x_i}|G, γ, b) := [ Σ_{i=1}^{n} ω(x_i|G, γ, b) ]² / Σ_{i=1}^{n} ω(x_i|G, γ, b)² ,   (15)

quantifies this effect. An ESS close to n indicates that all samples contribute equally.