Percolation and the effective structure of complex networks

Analytical approaches to modeling the structure of complex networks can be divided into two groups according to whether they consider an intensive (e.g., fixed degree sequence and random otherwise) or an extensive (e.g., adjacency matrix) description of the network structure. While extensive approaches, such as the state-of-the-art Message Passing Approach (MPA), typically yield more accurate predictions, intensive approaches provide crucial insights into the role played by any given structural property in the outcome of dynamical processes. Here we introduce an intensive description that yields almost identical predictions to the ones obtained with the MPA for bond percolation. Our approach distinguishes nodes according to two simple statistics: their degree and their position in the core-periphery organization of the network. Our near-exact predictions highlight how accurately capturing the long-range correlations in network structures makes it possible to easily and effectively compress real complex network data.

The structure of real complex networks lies somewhere in-between order and randomness [1][2][3], with the consequence that it cannot typically be fully characterized by a concise set of synthesizing observables. This irreducibility explains why most theoretical approaches to model complex networks are inspired by statistical physics in that they consider ensembles of networks constrained by the values of observables (e.g., density of links, degree-degree correlations, clustering coefficient, degree/motif distribution) and otherwise organized randomly. These approaches have three notable advantages. First, they usually lend themselves to analytical treatment. Second, they are intensive in network size, meaning that their complexity scales with the support of the observables (i.e., sublinearly with the numbers of nodes and links). Third, they provide null models, many of which have led to the identification of fundamental properties characterizing the structure of real complex networks [4,5].
Despite important leaps forward in recent years, these approaches still fail to capture enough information to systematically provide accurate quantitative predictions of most dynamical processes on real complex networks. The reason for this shortcoming is that the properties from which the ensembles are constructed are not constraining enough; the ensembles are "too large" such that the original real networks are exceptions, rather than typical instances, in the ensembles. As a result, the current state-of-the-art approach, the so-called message passing approach (MPA) [6], requires the whole structure to be specified as an input (i.e., the adjacency matrix, or a transformation thereof). This method is interesting because it is mathematically principled, meaning that it yields exact results on trees, and offers inexact, albeit generally good, predictions on networks containing loops (i.e., most real complex networks) [7].
However, by considering the whole structure of networks and thereby considering every link on an equal footing, the accuracy of the MPA comes at a significant computational and conceptual cost. First, its time and space complexity are extensive in the number of links and therefore in the size of the network. Second, and most importantly, it does not provide any insight into the role played by any given structural property in the outcome of a dynamical process. With the MPA, getting good predictions comes at the expense of understanding what led to that outcome.
In this paper, we bridge the gap between intensive and extensive approaches to the mathematical modeling of bond percolation on networks. We introduce a random network ensemble that relies solely on an intensive description of the network structure and that nevertheless yields predictions comparable to the ones from the MPA for most of the 111 real complex networks considered in this study. This ensemble is based on the onion decomposition (OD), a refined k-core decomposition [8]. Critically, the OD can be translated into local connection rules allowing an exact mathematical treatment using probability generating functions (pgf) in the limit of large network size. Like the MPA, this approach leads to exact predictions on trees, and it highlights the critical contribution of the OD to an accurate effective mathematical description of real complex networks.

RESULTS AND DISCUSSIONS
Most analytical models of complex networks rely on some variation of the tree-like approximation, which assumes that complex networks have essentially no loops beyond some local structure of interest [9,10]. While this approximation is inaccurate for the vast majority of real complex networks, it nevertheless allows an elegant mathematical treatment which typically works surprisingly well [11]. In the case of the MPA, the tree-like approximation implies that much of the information given to the model is thrown away: loops are included in the input (i.e., the adjacency matrix) only to be mathematically ignored. We here propose to limit the information we give to our model by compressing complex networks following their tree-like decomposition. We therefore rely on a known peeling process, which iteratively removes leaves (i.e., the peripheral nodes of the network) to calculate the depth of every node in the effective tree.
Taking this information into account, we then focus on predicting the outcome of bond percolation on complex networks: a canonical problem of network science analogous to many applied problems such as disease propagation or network resilience [12]. Given a network structure, this simple stochastic process consists of occupying each original link independently with probability p. We aim to predict the size of the largest connected component composed of occupied links, S, as well as the percolation threshold, $p_c$, above which that component corresponds to a macroscopic fraction of the network. The outcome of percolation depends on structural properties at all scales, thus making it a good benchmark for theoretical network models.
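The bond percolation process described above can be simulated directly, which is a useful numerical reference point for any analytical prediction. The following is a minimal sketch, not the procedure used in this study: the function name `largest_component_fraction`, the link-list input format and the union-find bookkeeping are illustrative assumptions.

```python
import random


def largest_component_fraction(n_nodes, links, p, rng):
    """Occupy each link independently with probability p and return the
    relative size S of the largest component made of occupied links."""
    parent = list(range(n_nodes))  # union-find forest
    size = [1] * n_nodes           # component size, valid at the roots

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for u, v in links:
        if rng.random() < p:  # the link is occupied
            ru, rv = find(u), find(v)
            if ru != rv:
                if size[ru] < size[rv]:
                    ru, rv = rv, ru
                parent[rv] = ru  # merge the smaller tree into the larger one
                size[ru] += size[rv]
    return max(size[find(u)] for u in range(n_nodes)) / n_nodes


# Hypothetical usage: one realization on a ring of 6 nodes.
ring = [(i, (i + 1) % 6) for i in range(6)]
S = largest_component_fraction(6, ring, 0.5, random.Random(42))
```

Averaging the returned value over many realizations for a range of p gives a numerical estimate of the curve S(p).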

Onion decomposition
The k-core decomposition is a well-known network metric that identifies a set of nested maximal subnetworks (the k-cores) in which each node shares at least k links with the other nodes [17,18]. A node belonging to the k-core but not to the (k + 1)-core is said to be of coreness k and to be part of the k-shell. Nodes with a high coreness are generally seen as more central whereas nodes with a low coreness are seen as being part of the periphery of the network. The onion decomposition (OD) refines the k-core decomposition by assigning a layer l to each node to further indicate its position within its shell (e.g., in the middle of the layer or at its boundary). The OD therefore unveils the internal organization of each centrality shell and, unlike the original k-core decomposition, can be used to assess whether the structure of a core is more similar to a tree or to a lattice, among other things [8].
The OD of a given network structure is obtained via the following pruning process (see Fig. 1). First we remove every node with the smallest degree, $k_{\min}$; the coreness of these nodes is equal to $k_{\min}$ and they are part of the first layer (l = 1). Removing these nodes may yield nodes whose remaining degree is now equal to or smaller than $k_{\min}$; these nodes must also be removed and have a coreness of $k_{\min}$ as well, but are part of the second layer (l = 2). If removing nodes of the second layer yields new nodes with a remaining degree equal to or lower than $k_{\min}$, they will be part of the third layer (l = 3), will have a coreness of $k_{\min}$ and will also be removed. This process is repeated until no nodes with a remaining degree equal to or lower than $k_{\min}$ are left. We then update the value of $k_{\min}$ to reflect the lowest remaining degree and repeat this whole process until every node has been assigned a coreness and a layer (the layer number keeps increasing such that each layer corresponds to a unique coreness).
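The pruning process above can be sketched in a few lines. This is a deliberately simple illustration that rescans the remaining nodes at every layer, so it is not the efficient implementation of Ref. [8]; the function name `onion_decomposition` and the adjacency-dictionary input are assumptions made for the example.

```python
def onion_decomposition(adj):
    """adj: dict mapping each node to the set of its neighbours
    (simple undirected network). Returns two dicts, (coreness, layer)."""
    degree = {u: len(vs) for u, vs in adj.items()}  # remaining degrees
    remaining = set(adj)
    coreness, layer = {}, {}
    current_layer = 0
    while remaining:
        # Lowest remaining degree defines the coreness of the next shell.
        k_min = min(degree[u] for u in remaining)
        while True:
            to_remove = [u for u in remaining if degree[u] <= k_min]
            if not to_remove:
                break  # shell exhausted; k_min must be updated
            current_layer += 1  # layers keep increasing across shells
            for u in to_remove:
                coreness[u] = k_min
                layer[u] = current_layer
                remaining.discard(u)
            # Update the remaining degree of the surviving neighbours.
            for u in to_remove:
                for v in adj[u]:
                    if v in remaining:
                        degree[v] -= 1
    return coreness, layer


# Hypothetical usage: a path 0-1-2-3-4 is peeled from both ends inward.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
coreness, layer = onion_decomposition(path)
```

On the path, every node has coreness 1, while the layers are 1 for the two end nodes, 2 for their neighbours and 3 for the middle node.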
An efficient implementation of this procedure has a run-time complexity of O(L log N), where L and N are respectively the number of links and nodes, which implies that the OD can be quickly obtained for virtually any real complex network [8]. Most importantly, nodes belonging to the same layer are topologically similar with regard to the mesoscale centrality organization of the network. Because the layer of a node is only weakly related to its degree (i.e., the coreness of a node provides a lower bound to its degree), the layer-degree pair can be used to indicate not only how well a node is connected, but also its "topological position" in the network. It therefore allows us to discriminate central nodes from peripheral ones which, based on their degree alone, would have otherwise been deemed identical.

Effective random network ensemble: the LCCM
From the pruning process described above, it can be concluded that a node of coreness c belonging to the l-th layer is in one of two scenarios. 1) It must have exactly c links to nodes in layers $l' \geq l$ if layer l is the first layer of the c-shell (i.e., nodes in layer l − 1 belong to the c'-shell with c' < c). 2) Otherwise, if it is not in the first layer of its c-shell, it must have at least c + 1 links to nodes of layers $l' \geq l - 1$ and at most c links to nodes of layers $l' \geq l$. The distinction between the two scenarios is that nodes not in the first layer of their shell require at least one link to the previous layer to anchor them to their own layer. Also, the common feature of these scenarios is that a node of coreness c needs at least c links with nodes of equal or greater coreness.

[Figure caption, partially recovered: The Layered and Correlated Configuration Model assigns an ID to every node corresponding to its degree and its position in the core-periphery structure of the network. Degrees are not shown to lighten the presentation. Stubs are colored according to the layer to which they point: red if they point to more central layers and black if they point to the previous layer. There are no green stubs in this example. (c) The Configuration Model assigns an ID to every node according to its degree before randomly connecting them, thereby destroying the mesoscopic and macroscopic structure of the original network. The Correlated Configuration Model fixes the number of links between different degree classes, and would therefore prohibit components formed by two nodes with degree 1, but would otherwise be very similar to the configuration model shown here.]
By rewiring the links of a given network using a degree-preserving procedure [19,20] while ensuring that the aforementioned rules are respected at all times, it is possible to explore the ensemble of all possible simple networks with the same fixed layer-degree sequence (i.e., the sequence of every pair (l, k) in the original network). Exactly preserving the layers, and thus the coreness of every node, is of critical significance since previous rewiring approaches could only approximately preserve the k-core decomposition [21].
Additionally, the layer-degree pair assigned to each node can be used to enforce two-point correlations (i.e., the (layer-degree)-(layer-degree) correlations), thus reducing the size of a random network ensemble. This correlated ensemble can be explored via a double link swap Markov chain method preserving both the layer-degree sequence and the number of links within and between every node class (i.e., nodes with the same layer-degree pair). One way to implement this method is by first choosing one link at random (e.g., joining nodes A and B) and then choosing another link at random (e.g., joining nodes C and D) among the links that are attached to at least one node whose layer-degree pair is the same as that of one of the two nodes connected by the first link (e.g., A and C have the same layer-degree pair) [22]. The two links are then swapped (e.g., A becomes connected to D and B to C) if no self-link or multi-link would be created. Doing so ensures that both the degree sequence and the two-point correlations are preserved at all times. We call layered and correlated configuration model (LCCM) the ensemble of maximally random networks with a given joint layer-degree sequence and (layer-degree)-(layer-degree) correlations.
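A single step of the double link swap described above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation of Ref. [22]: the function name `class_preserving_swap`, the link-list representation and the way the second link is drawn are all hypothetical choices. The sketch only enforces that A and C share a class and that no self-link or multi-link appears; it does not re-check the connection rules on the layers, which a full LCCM sampler would also have to verify before accepting a swap.

```python
import random


def class_preserving_swap(edges, node_class, rng, max_tries=100):
    """Try one swap (A-B, C-D) -> (A-D, B-C) with class(A) == class(C).
    edges is a list of 2-tuples modified in place.
    Returns True if a swap was performed, False otherwise."""
    edge_set = {frozenset(e) for e in edges}
    for _ in range(max_tries):
        i = rng.randrange(len(edges))
        a, b = edges[i]
        if rng.random() < 0.5:
            a, b = b, a  # choose which end of the first link plays A
        # Second link: any other link with an end in the class of A.
        candidates = []
        for j, (c, d) in enumerate(edges):
            if j == i:
                continue
            if node_class[c] == node_class[a]:
                candidates.append((j, c, d))
            if node_class[d] == node_class[a]:
                candidates.append((j, d, c))
        if not candidates:
            continue
        j, c, d = rng.choice(candidates)
        if a == d or b == c:
            continue  # would create a self-link
        if frozenset((a, d)) in edge_set or frozenset((b, c)) in edge_set:
            continue  # would create a multi-link
        edges[i] = (a, d)
        edges[j] = (b, c)
        return True
    return False
```

Because A and C belong to the same class, the new links A-D and B-C join the same pairs of classes as the old links C-D and A-B, so both the degree of every node and the number of links between every pair of classes are left unchanged.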
Since it preserves both the degree sequence and the degree-degree correlations, the LCCM is a subset of two commonly used random network ensembles defined by the configuration model (CM) [23] and the correlated configuration model (CCM) [24], the latter being known for its fair accuracy in many applications [11]. The LCCM, however, distinguishes itself from these models (and other variants) by enforcing a mesoscopic organization via the layers of the OD. This feature has the critical advantage of making the LCCM a mathematically principled approach in the sense that it exactly preserves the structure of a wide variety of trees (see Fig. 2). As we show below, this mesoscopic information accounts for a significant portion of the gap between the predictions of the intensive configuration models and the extensive, current state-of-the-art MPA.

[Figure caption, partially recovered: ... [13]. (upper right) PGP web of trust [14]. (lower left) A subset of the Internet at the autonomous level [15]. (lower right) Protein-protein interaction network of Homo sapiens [16]. The insets show the absolute value of the difference between the MPA and the CM, the CCM and the LCCM as a function of p, as well as an enlargement of the region around the percolation threshold. The largest connected component was used for all datasets.]
Given the connection rules stated in the previous section, the LCCM requires keeping track of the number of links that each node in each layer l shares with nodes i) in layers $l' \geq l$, ii) in layer $l' = l - 1$ and iii) in layers $l' < l - 1$. We identify the corresponding half-links as red, black and green stubs, respectively. For instance, a link between nodes in layers 3 and 5 consists of a red stub stemming out of the node in layer 3 paired with a green stub belonging to the node in layer 5. Note that a link between two given layers can only consist of a unique pair of stub colors, and the only allowed combinations are red-red, red-black and red-green.
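The coloring of stubs depends only on the layers of the two nodes joined by a link, which the following sketch makes explicit. The function names `stub_color` and `link_colors` are illustrative and not part of the original formalism.

```python
def stub_color(layer_from, layer_to):
    """Colour of the half-link stemming from a node in layer_from and
    pointing to a node in layer_to."""
    if layer_to >= layer_from:
        return "red"    # same or more central layer
    if layer_to == layer_from - 1:
        return "black"  # previous layer
    return "green"      # any layer below the previous one


def link_colors(layer_u, layer_v):
    """Pair of stub colours (seen from u, seen from v) for a link u-v."""
    return stub_color(layer_u, layer_v), stub_color(layer_v, layer_u)


# The example from the text: a link between layers 3 and 5.
colors = link_colors(3, 5)  # -> ("red", "green")
```

Since at least one end of any link points to a layer that is equal to or higher than its own, every link carries at least one red stub, which is why only the red-red, red-black and red-green combinations occur.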
From the link correlation matrix L, whose entries specify the fraction of links within and between every class of nodes, we can derive the function (see Methods)
\[ \phi_{lk}(\mathbf{x}) = \sum_{k_r, k_b, k_g} P_{lk}(k_r, k_b, k_g) \, (x^r_{lk})^{k_r} (x^b_{lk})^{k_b} (x^g_{lk})^{k_g} \tag{1} \]
generating the probability $P_{lk}(k_r, k_b, k_g)$ that a node in class (l, k) has $k_r$ red stubs, $k_b$ black stubs and $k_g$ green stubs, given the connection rules of the LCCM. From the same link correlation matrix, we can also derive the functions (see Methods)
\[ \gamma^\alpha_{lk}(\mathbf{x}) = \sum_{l', k', \alpha'} Q^\alpha_{lk}(l', k', \alpha') \, x^{\alpha'}_{l'k'} \tag{2} \]
for every $\alpha \in \{r, b, g\}$, generating the probability $Q^\alpha_{lk}(l', k', \alpha')$ that a stub of color $\alpha$ stemming from a node of class (l, k) is attached to a stub of color $\alpha'$ belonging to a node in class (l', k'). Combining these two functions yields the pgf generating the distribution of the number of nodes of each class that are neighbors of a randomly chosen node of class (l, k)
\[ g_{lk}(\mathbf{x}) = \phi_{lk}\big( \gamma^r_{lk}(\mathbf{x}), \gamma^b_{lk}(\mathbf{x}), \gamma^g_{lk}(\mathbf{x}) \big) . \tag{3} \]
Note that this pgf also includes the colors of the stubs through which these neighbors are connected to the node of class (l, k). Similarly, the number of such nodes that can be reached from a node of class (l, k) that has itself been reached by one of its stubs of color $\alpha$ is generated by
\[ f^\alpha_{lk}(\mathbf{x}) = \frac{1}{\langle k^\alpha_{lk} \rangle} \left. \frac{\partial \phi_{lk}}{\partial x^\alpha_{lk}} \right|_{x^\beta_{lk} = \gamma^\beta_{lk}(\mathbf{x}) \ \forall \beta} , \tag{4} \]
where $\langle k^\alpha_{lk} \rangle = \partial \phi_{lk}(\mathbf{1}) / \partial x^\alpha_{lk}$ is the average number of stubs of color $\alpha$ that nodes of class (l, k) have.

To compute the size of the extensive component, we assume that the networks in the ensemble are locally tree-like, which occurs in the limit of large network size or when the detailed structure of matrix L only permits exact trees (i.e., when loops are structurally impossible). We define $a^\alpha_{lk}$ as the probability that attempting to reach a node in class (l, k) by one of its stubs of color $\alpha$ does not eventually lead to the extensive component. Denoting by p the probability that links are occupied, the probabilities $\{a^\alpha_{lk}\}$ are the solution of
\[ a^\alpha_{lk} = 1 - p + p \, f^\alpha_{lk}(\mathbf{a}) \tag{5} \]
for all l, k and $\alpha$.
This last expression encodes the simple self-consistent argument that attempting to reach the node will not lead to the extensive component if 1) the link is unoccupied, which occurs with probability 1 − p, or if 2) the link is occupied, with probability p, but the attempts to reach the other neighbors of the node that has just been reached all fail, which occurs with probability $f^\alpha_{lk}(\mathbf{a})$. Note that this argument relies on the assumption that the states of these neighbors are independent, which is true for a tree-like structure. Having solved Eq. (5), the relative size of the extensive component, S, is then given by the probability that a randomly chosen node belongs to it
\[ S = 1 - \sum_{l,k} P(l, k) \, g_{lk}(\mathbf{a}) , \tag{6} \]
where P(l, k) is the fraction of nodes in class (l, k), which can be extracted from the link correlation matrix L (see Methods). Notice that since we assume the networks of the ensemble to be tree-like, the relative size of the extensive component if nodes (instead of links) were occupied with probability p is simply $S_{\mathrm{site}} = pS$ to account for the probability that the initial randomly chosen node is occupied. Note also that the percolation threshold, $p_c$, is the value of p at which a = 1 becomes an unstable solution of Eq. (5) (see Methods), which corresponds to the emergence of the extensive component.
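A self-consistent equation of the form a = 1 − p + p f(a) is typically solved by fixed-point iteration. The sketch below illustrates the procedure on a deliberately simplified stand-in rather than on the LCCM itself: a single node class with fixed degree k treated as in the plain configuration model, for which f(a) = a^(k−1) and S = 1 − a^k. The function names, the starting point and the tolerance are assumptions made for the example.

```python
def solve_fixed_point(f, p, a0=0.5, tol=1e-12, max_iter=100000):
    """Iterate a <- 1 - p + p * f(a) until successive values differ
    by less than tol (or max_iter is reached)."""
    a = a0
    for _ in range(max_iter):
        a_new = 1.0 - p + p * f(a)
        if abs(a_new - a) < tol:
            return a_new
        a = a_new
    return a


def extensive_component_regular(p, k=3):
    """Relative size S of the extensive component for the single-class
    stand-in (every node has degree k): f(a) = a**(k-1), S = 1 - a**k."""
    a = solve_fixed_point(lambda x: x ** (k - 1), p)
    return 1.0 - a ** k


# Hypothetical usage for k = 3, where the nontrivial solution is
# a = (1 - p) / p above the threshold p_c = 1/2.
S_above = extensive_component_regular(0.75)
S_below = extensive_component_regular(0.40)
```

In the full formalism the scalar a becomes the vector of all $a^\alpha_{lk}$ and the same iteration is applied to every component simultaneously; convergence slows down markedly close to the percolation threshold.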

Effective tree-like structure
Because it is a subset of both the CM and the CCM, the ensemble defined by the LCCM should, in principle, have a smaller cardinality than the ensembles considered by the former two. Consequently, if the mesoscale structural information provided by the layers l is of any significance, we expect the predictions of the LCCM to be the closest to the ones obtained with the MPA. Figures 3 and 4 confirm this expectation. In fact, our results demonstrate that identifying nodes using their layer in the OD alongside their degree does not merely improve the predictions; it drastically changes their nature, making them qualitatively very similar to the ones of the MPA when not strikingly quantitatively identical. As shown in Fig. 3, the LCCM reproduces the general shape of the curves, has the same number of inflection points, and always predicts a connected network when all links are occupied (i.e., S must be 1 at p = 1 since we considered the largest connected component of every dataset). Interestingly, only the LCCM and the MPA are able to capture the mesoscopic core-periphery and/or modular structures that were numerically shown to lead to smeared (or double) phase transitions [25] such as the one observed on the protein-protein interaction network.
Perhaps most importantly, the LCCM approximates to high accuracy the percolation threshold predicted by the MPA, as seen in Fig. 4(left), with a relative error of less than 1.5% for 75% of the 111 network datasets considered. Additionally, Fig. 4(right) shows the expected error on the size of the extensive component averaged over the entire range of occupation probability p. When using the LCCM to compress the network structure, we find the error, relative to the MPA, to be of the order of $10^{-3}$ for 75% of the datasets considered, an improvement of at least one order of magnitude over existing approaches. Altogether, these results indicate that categorizing nodes with the classes (l, k) captures critical features of the local and mesoscopic tree-like organization of many real complex networks, thus offering an intensive effective description of their structure.

CONCLUSION
We introduced a random network ensemble that relies solely on an intensive description of the network structure and that nevertheless yields predictions for percolation that are either essentially quantitatively identical, or at least strikingly qualitatively similar, to the ones obtained with the state-of-the-art MPA. This ensemble assigns two structural features to each node, its degree k (local) and its position l in the onion decomposition of the network (mesoscale), and creates links according to simple connection rules that exactly preserve these two features. This ensemble lends itself to exact analytical calculations using probability generating functions in the limit of large network size, and is mathematically principled, meaning that it leads to exact predictions on trees, like the MPA, but unlike other intensive approaches such as the configuration model and its variants.
The accuracy of the predictions of the LCCM shows that the OD easily captures important features of the mesoscale structural organization of many real complex networks, and that this information should be leveraged by future generations of models of complex networks.
For instance, Eq. (1), which provides the distribution of different link types (e.g., the number of links leading to lower or higher layers) for any node, could be straightforwardly included in equations for other problems such as the Susceptible-Infectious-Susceptible (SIS) dynamics. It would thus be possible to track the fraction of infected nodes with a given pair (l, k) whose time evolution would be driven by the transmission events along the connections prescribed by Eqs. (3)-(4). In a purely numerical context, and using a simpler, less accurate version of the LCCM, this approach was already shown to lead to predictions of SIS dynamics that are an order of magnitude more precise than other network models [8]. More generally, the pair (l, k) is a straightforward and computationally inexpensive observable to characterize and rank nodes based on their local connectivity (through k) and global centrality (through l).
Finally, the accuracy of the LCCM strongly suggests that the long-range correlations induced by the OD effectively emulate the correlations considered in the MPA, and, consequently, that a large chunk of the structural properties behind the accuracy of the MPA now lend themselves to intensive analytical treatment. This opens the way for future work to focus on bringing the analytical modeling of complex networks beyond the ubiquitous tree-like approximation. Doing so should provide a unified framework for random graphs, regular structures like lattices, and the complex networks that lie in-between.

Link correlation matrix
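We define the symmetrical link correlation matrix L whose elements, $L_{lk,l'k'}$, correspond to the fraction of links between nodes of class (l, k) and (l', k'). It has the following properties
\[ \frac{1}{2} \sum_{l,k} \sum_{l',k'} (1 + \delta_{ll'} \delta_{kk'}) L_{lk,l'k'} = 1 \tag{7} \]
since each type of link appears twice in the matrix except for the links connecting nodes of the same class (i.e., diagonal elements), and
\[ \sum_{l',k'} (1 + \delta_{ll'} \delta_{kk'}) L_{lk,l'k'} = \frac{2 k P(l, k)}{\langle k \rangle} , \tag{8} \]
where P(l, k) is the fraction of nodes belonging to the class (l, k) and $\langle k \rangle = \sum_{l,k} k P(l, k)$ is the average degree.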
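As a sanity check, the link correlation matrix can be built from a link list and a class label per node, and its two properties verified numerically. The sketch below rests on our reading of those properties (the entries weighted by 1 + δ sum to 2 over the whole matrix, and to 2kP(l, k)/⟨k⟩ over the row of class (l, k)); the function names and the dictionary representation of L are illustrative assumptions.

```python
from collections import Counter


def link_correlation_matrix(edges, node_class):
    """Return a dict L[(class_a, class_b)] giving the fraction of links
    joining the two classes; the dict is symmetric by construction."""
    counts = Counter()
    for u, v in edges:
        a, b = node_class[u], node_class[v]
        counts[(a, b)] += 1
        if a != b:
            counts[(b, a)] += 1  # off-diagonal entries appear twice
    n_links = len(edges)
    return {pair: c / n_links for pair, c in counts.items()}


def check_properties(L, class_fraction, mean_degree, tol=1e-12):
    """Check the normalisation of L and the per-class stub balance.
    Classes are (layer, degree) tuples; class_fraction maps them to P(l, k)."""
    total = 0.5 * sum((2.0 if a == b else 1.0) * w for (a, b), w in L.items())
    ok = abs(total - 1.0) < tol
    for cls, frac in class_fraction.items():
        row = sum((2.0 if a == b else 1.0) * w
                  for (a, b), w in L.items() if a == cls)
        ok = ok and abs(row - 2.0 * cls[1] * frac / mean_degree) < tol
    return ok


# Hypothetical usage: the path 0-1-2-3-4 with (layer, degree) classes.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
node_class = {0: (1, 1), 1: (2, 2), 2: (3, 2), 3: (2, 2), 4: (1, 1)}
L = link_correlation_matrix(edges, node_class)
```

For this path, half of the links join classes (1, 1) and (2, 2) and the other half join classes (2, 2) and (3, 2), so the four nonzero entries of L all equal 0.5.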

Distribution of the number and of the color of stubs
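The connection rules of the LCCM indicate that a node of degree k in layer l and coreness $c_l$ has at most $c_l$ red stubs. Since red stubs are defined as half-links toward nodes in layers $l' \geq l$, they represent a fraction
\[ \frac{1}{2} \sum_{l' \geq l} \sum_{k'} (1 + \delta_{ll'} \delta_{kk'}) L_{lk,l'k'} \tag{9} \]
of all stubs in the network ensemble, where $\delta_{ll'} \delta_{kk'}$ accounts for the fact that a link connecting two nodes of class (l, k) contributes two red stubs. This last quantity would be equal to
\[ \frac{c_l P(l, k)}{\langle k \rangle} \tag{10} \]
if every one of these nodes had exactly $c_l$ red stubs. Consequently, since the LCCM only dictates bounds on the number of stubs of each color, the probability that a node of degree k in layer l has exactly $k_r$ red stubs is simply
\[ \binom{c_l}{k_r} \left( p^r_{lk} \right)^{k_r} \left( 1 - p^r_{lk} \right)^{c_l - k_r} , \tag{11} \]
where
\[ p^r_{lk} = \frac{\sum_{l' \geq l} \sum_{k'} (1 + \delta_{ll'} \delta_{kk'}) L_{lk,l'k'}}{2 c_l P(l, k) / \langle k \rangle} . \tag{12} \]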
Note that whenever layer l is the first layer of its core (i.e., when $c_l > c_{l-1}$), Eq. (12) reduces to $p^r_{lk} = 1$, meaning that each node has exactly $c_l$ red stubs, as prescribed by the connection rules of the LCCM.
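The ratio defining $p^r_{lk}$ can be evaluated directly from the link correlation matrix. The following sketch does so on a hand-built example, a star with three leaves, whose leaves form class (l = 1, k = 1) and whose center forms class (l = 2, k = 3), both of coreness 1. The function name `red_stub_probability` and the dictionary representation of L are assumptions made for the illustration.

```python
def red_stub_probability(L, cls, coreness, class_fraction, mean_degree):
    """p^r for class cls = (layer, degree): weighted fraction of links
    toward layers >= l, divided by 2 * c_l * P(l, k) / <k>."""
    layer = cls[0]
    red = sum((2.0 if other == cls else 1.0) * w
              for (a, other), w in L.items()
              if a == cls and other[0] >= layer)
    return red / (2.0 * coreness * class_fraction[cls] / mean_degree)


# Star with three leaves: every link joins classes (1, 1) and (2, 3).
L = {((1, 1), (2, 3)): 1.0, ((2, 3), (1, 1)): 1.0}
class_fraction = {(1, 1): 0.75, (2, 3): 0.25}
mean_degree = 1.5

p_leaf = red_stub_probability(L, (1, 1), 1, class_fraction, mean_degree)
p_center = red_stub_probability(L, (2, 3), 1, class_fraction, mean_degree)
```

The leaves sit in the first layer of their shell and therefore get a probability of 1 (each has exactly one red stub), whereas the center only has links to the previous layer, so its probability is 0 and all three of its stubs are black.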
Similarly, the fraction of half-links shared with nodes in layers $l' < l - 1$ (i.e., green stubs) is
\[ \frac{1}{2} \sum_{l' < l-1} \sum_{k'} L_{lk,l'k'} . \tag{13} \]
The maximal value of this quantity, however, varies as a function of l. If the layer is the first layer of its shell (i.e., if $c_l > c_{l-1}$), then each node has $c_l$ red stubs and up to $k - c_l$ green stubs according to the connection rules. If $c_l = c_{l-1}$, nodes that have exactly $c_l$ red stubs can have up to $k - c_l - 1$ green stubs since they must have at least one black stub, and can have up to $k - c_l$ otherwise. The maximal value of Eq. (13) can therefore be summarized as [Eq. (14), not recovered from the source], such that the probability that a node of degree k in layer l has exactly $k_g$ green stubs is [Eq. (15), not recovered from the source], with
\[ p^g_{lk} = \frac{\sum_{l' < l-1} \sum_{k'} L_{lk,l'k'}}{2 (k - c_l) P(l, k) / \langle k \rangle} . \tag{16} \]
Combining Eqs. (11) and (15) yields the probability $P_{lk}(k_r, k_b, k_g)$ that a node in layer l and of degree k has $k_r$, $k_g$ and $k_b$ red, green and black stubs, respectively [Eq. (17), not recovered from the source]. Finally, after some elementary algebra, it can be shown that the generating function $\phi_{lk}(\mathbf{x})$ associated with this distribution is [Eq. (18), not recovered from the source].

Transition probabilities
With the distribution of the number of stubs of each color that nodes have being provided by Eq. (18), the only missing quantities are the transition probabilities: the probability $Q^\alpha_{lk}(l', k', \alpha')$ that a stub of color $\alpha$ stemming from a node of class (l, k) leads to a stub of color $\alpha'$ attached to a node of class (l', k'). Once more, this information can be extracted from the link correlation matrix L.
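Let us recall that black stubs stemming from nodes of class (l, k) can only lead to red stubs attached to nodes in the previous layer (i.e., $l' = l - 1$), which can be summarized by
\[ Q^b_{lk}(l', k', r) = \frac{\delta_{l',l-1} L_{lk,l'k'}}{\sum_{k''} L_{lk,(l-1)k''}} , \tag{19} \]
where the denominator is proportional to the fraction of all stubs that are black and that stem from nodes of class (l, k). Similarly, since green stubs can only lead to red stubs attached to nodes in layers $l' < l - 1$, we have
\[ Q^g_{lk}(l', k', r) = \frac{L_{lk,l'k'}}{\sum_{l'' < l-1} \sum_{k''} L_{lk,l''k''}} \quad \text{for } l' < l - 1 \text{ (and zero otherwise)} . \tag{20} \]
Because red stubs can lead to all three colors of stubs, we first consider the case where a red stub leads to a black stub (i.e., to a node in layer $l' = l + 1$), which corresponds to
\[ Q^r_{lk}(l', k', b) = \frac{\delta_{l',l+1} L_{lk,l'k'}}{\sum_{l'' \geq l} \sum_{k''} (1 + \delta_{ll''} \delta_{kk''}) L_{lk,l''k''}} , \tag{21} \]
where the denominator is proportional to the fraction of all stubs that correspond to red stubs stemming from nodes of class (l, k). In the case of red stubs leading to red stubs (i.e., links between nodes in the same layer), we need to double the contribution of $L_{lk,lk}$ since each link between nodes of the same class contributes two red stubs, which yields
\[ Q^r_{lk}(l', k', r) = \frac{\delta_{ll'} (1 + \delta_{kk'}) L_{lk,l'k'}}{\sum_{l'' \geq l} \sum_{k''} (1 + \delta_{ll''} \delta_{kk''}) L_{lk,l''k''}} . \tag{22} \]
The case of red stubs leading to green stubs is similar to Eq. (20) and is straightforward to obtain:
\[ Q^r_{lk}(l', k', g) = \frac{L_{lk,l'k'}}{\sum_{l'' \geq l} \sum_{k''} (1 + \delta_{ll''} \delta_{kk''}) L_{lk,l''k''}} \quad \text{for } l' > l + 1 \text{ (and zero otherwise)} . \tag{23} \]
Finally, by injecting Eqs. (19)-(23) in Eq. (2), we obtain
\[ \gamma^b_{lk}(\mathbf{x}) = \frac{\sum_{k'} L_{lk,(l-1)k'} \, x^r_{(l-1)k'}}{\sum_{k'} L_{lk,(l-1)k'}} , \tag{24a} \]
\[ \gamma^g_{lk}(\mathbf{x}) = \frac{\sum_{l' < l-1} \sum_{k'} L_{lk,l'k'} \, x^r_{l'k'}}{\sum_{l' < l-1} \sum_{k'} L_{lk,l'k'}} , \tag{24b} \]
\[ \gamma^r_{lk}(\mathbf{x}) = \frac{\sum_{k'} (1 + \delta_{kk'}) L_{lk,lk'} \, x^r_{lk'} + \sum_{k'} L_{lk,(l+1)k'} \, x^b_{(l+1)k'} + \sum_{l' > l+1} \sum_{k'} L_{lk,l'k'} \, x^g_{l'k'}}{\sum_{l' \geq l} \sum_{k'} (1 + \delta_{ll'} \delta_{kk'}) L_{lk,l'k'}} . \tag{24c} \]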
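The transition probabilities described above can be computed from the link correlation matrix by collecting the relevant entries of the row of class (l, k) and normalizing them. The sketch below follows our reading of the rules stated in this section (black and green stubs always lead to red stubs; red stubs lead to red, black or green stubs depending on the layer of the neighbor, with same-class links counted twice). The function name `transition_probabilities`, the color codes "r", "b", "g" and the dictionary representation of L are assumptions.

```python
def transition_probabilities(L, cls, color):
    """Return {(neighbour_class, neighbour_colour): probability} for a stub
    of the given colour ("r", "b" or "g") stemming from class cls = (l, k)."""
    layer = cls[0]
    weights = {}
    for (a, other), w in L.items():
        if a != cls or w == 0:
            continue
        other_layer = other[0]
        if color == "r" and other_layer >= layer:
            # Colour of the stub at the far end, as seen from the neighbour.
            if other_layer == layer:
                far = "r"
            elif other_layer == layer + 1:
                far = "b"
            else:
                far = "g"
            # A link within a class contributes two red stubs.
            weights[(other, far)] = (2.0 if other == cls else 1.0) * w
        elif color == "b" and other_layer == layer - 1:
            weights[(other, "r")] = w
        elif color == "g" and other_layer < layer - 1:
            weights[(other, "r")] = w
    total = sum(weights.values())
    if total == 0:
        return {}  # the class has no stub of this colour
    return {key: w / total for key, w in weights.items()}


# Hypothetical usage: the path 0-1-2-3-4 with (layer, degree) classes.
L = {((1, 1), (2, 2)): 0.5, ((2, 2), (1, 1)): 0.5,
     ((2, 2), (3, 2)): 0.5, ((3, 2), (2, 2)): 0.5}
Q_red = transition_probabilities(L, (2, 2), "r")
```

In this example the red stubs of class (2, 2) can only reach the black stubs of class (3, 2), so the returned dictionary has a single entry with probability 1.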

Percolation threshold
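The value of the percolation threshold, $p_c$, can be computed analytically by a linear stability analysis of the solution a = 1 of Eq. (5). Substituting $a^\alpha_{lk} = 1 - \varepsilon^\alpha_{lk}$, where $\varepsilon^\alpha_{lk} \ll 1$, yields
\[ \varepsilon^\alpha_{lk} = p \sum_{l', k', \alpha'} \frac{\partial f^\alpha_{lk}(\mathbf{1})}{\partial x^{\alpha'}_{l'k'}} \, \varepsilon^{\alpha'}_{l'k'} \tag{25} \]
when limiting the expansion of $f^\alpha_{lk}(\mathbf{1} - \boldsymbol{\varepsilon})$ to the first order. The last equation can be rewritten as an eigenvalue problem
\[ \boldsymbol{\varepsilon} = p \, \mathbf{M} \boldsymbol{\varepsilon} , \tag{26} \]
thus indicating that the fixed point a = 1 loses its stability (i.e., the extensive component emerges) when the largest eigenvalue of pM exceeds 1. The percolation threshold, $p_c$, therefore equals the reciprocal of the largest eigenvalue of M which, by virtue of the Perron-Frobenius theorem, is real and positive. The elements of M can be written as
\[ \frac{\partial f^\alpha_{lk}(\mathbf{1})}{\partial x^{\alpha'}_{l'k'}} = \frac{1}{\langle k^\alpha_{lk} \rangle} \sum_{\alpha''} \frac{\partial^2 \phi_{lk}(\mathbf{1})}{\partial x^\alpha_{lk} \, \partial x^{\alpha''}_{lk}} \, \frac{\partial \gamma^{\alpha''}_{lk}(\mathbf{1})}{\partial x^{\alpha'}_{l'k'}} , \tag{27} \]
where the derivatives are calculated directly from Eq. (18) and Eqs. (24a)-(24c). While the derivatives of $\gamma^\alpha_{lk}(\mathbf{x})$ are straightforward, the derivatives of $\phi_{lk}(\mathbf{x})$ require special care with respect to the value of $k - c_l$. To facilitate the numerical implementation of the formalism, we provide the explicit expression of the derivatives of $\phi_{lk}(\mathbf{x})$.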
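Once the matrix M has been assembled, the threshold follows from its largest eigenvalue. A minimal sketch using plain power iteration is given below; it assumes that M is non-negative with a single dominant eigenvalue (power iteration may fail to converge otherwise, in which case a general eigenvalue solver should be used instead). The function name `percolation_threshold` and the list-of-lists representation are illustrative.

```python
def percolation_threshold(M, n_iter=1000):
    """Estimate p_c = 1 / (largest eigenvalue of M) by power iteration.
    M is a square non-negative matrix given as a list of rows."""
    n = len(M)
    v = [1.0] * n  # strictly positive starting vector
    lam = 0.0
    for _ in range(n_iter):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(abs(x) for x in w)  # max-norm estimate of the eigenvalue
        if lam == 0.0:
            return float("inf")  # M v vanished: no finite threshold
        v = [x / lam for x in w]
    return 1.0 / lam


# Hypothetical usage with a 2 x 2 matrix whose largest eigenvalue is 3.
p_c = percolation_threshold([[2.0, 1.0], [1.0, 2.0]])
```

For a single class of nodes with fixed degree 3 treated as in the plain configuration model, M reduces to the 1 x 1 matrix [[2.0]] and the sketch returns the familiar threshold 1/2.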
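Let us recall that $c_l \neq c_{l-1}$ and $p^r_{lk} = 1$ whenever $k = c_l$ since these nodes are in the first layer of their core by definition, and that we set $c_1 \neq c_0$ to simplify the notation. Note also that
\[ \sum_\alpha \frac{\partial \phi_{lk}(\mathbf{1})}{\partial x^\alpha_{lk}} = \sum_\alpha \langle k^\alpha_{lk} \rangle = k \]
and
\[ \sum_{\alpha, \alpha'} \frac{\partial^2 \phi_{lk}(\mathbf{1})}{\partial x^\alpha_{lk} \, \partial x^{\alpha'}_{lk}} = k (k - 1) \]
for $\alpha, \alpha' \in \{r, g, b\}$ and regardless of the value of $k - c_l$, as expected.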