Multilayer stochastic block models reveal the multilayer structure of complex networks

In complex systems, the network of interactions we observe between a system's components is the aggregate of the interactions that occur through different mechanisms or layers. Recent studies reveal that the existence of multiple interaction layers can have a dramatic impact on the dynamical processes occurring on these systems. However, these studies assume that the interactions between a system's components in each one of the layers are known, while typically for real-world systems we do not have that information. Here, we address the issue of uncovering the different interaction layers from aggregate data by introducing multilayer stochastic block models (SBMs), a generalization of single-layer SBMs that considers different mechanisms of layer aggregation. First, we find the complete probabilistic solution to the problem of finding the optimal multilayer SBM for a given aggregate observed network. Because this solution is computationally intractable, we propose an approximation that enables us to verify that multilayer SBMs are more predictive of network structure in real-world complex systems.

The development of tools for the analysis of real-world complex networks has significantly advanced our understanding of complex systems in fields as diverse as molecular and cell biology [1], neuroscience [2], biomedicine [3,4], ecology [5,6], economics [7], and sociology [8]. One of the main successes of the network approach has been to unravel the relationship between the modular organization of interactions within a complex system [9], and the function and temporal evolution of the system [10][11][12][13]. As a result, a large body of research has been devoted to the detection of the modular structure (or community structure) of complex networks, that is, to the division of the nodes of the network into densely connected subgroups [14].
Stochastic block models (SBMs) [15][16][17] are a class of probabilistic generative network models that provide a more general description of the (mesoscopic) group structure of real-world networks than modular models. In SBMs, nodes are assumed to belong to groups and connect to each other with probabilities that depend only on their group memberships. The simple mathematical form of SBMs has enabled not only the identification of generalized community structures in networks [17][18][19][20][21][22][23][24][25][26], but also the use of network inference as a predictive tool to detect missing and spurious links in empirical network data [27], to predict human decisions [28,29] and the appearance of conflict in work teams [30], and to identify unknown interactions between drugs [31].
While these approaches have pushed forward our understanding of complex network structure, a limitation is that they rely on the premise that there is a single mechanism that describes the connectivity of the network, even though we know that real-world networks are often the result of processes occurring on different "layers" (for example, social networks comprise relationships that arise in the family layer, and others that arise in the professional layer) [32]. Moreover, it is increasingly clear that the multilayer structure of complex networks can have a dramatic impact on the dynamical processes that take place on them [33][34][35][36][37]. Unfortunately, we often lack information about the different layers of interaction and can only observe projections of these multilayer interactions into an aggregate network in which all links are equivalent.
Here, we address precisely the problem of unraveling the underlying multilayer structure in real-world networks. To do so, we first introduce the family of multilayer SBMs that generalizes single-layer SBMs to situations where links arise in different layers and are aggregated through different mechanisms. Although there have been proposals to extend the concept of modularity to multilayer networks [38], ours represents a pioneering attempt to extend generative group-based models to multilayer systems, and to study those models rigorously using tools from statistical physics.
Second, we give the probabilistically complete solution to the problem of inferring the optimal multilayer SBM for a given aggregate network. Because this solution is computationally intractable, we propose an approximation that enables us to objectively address the question of whether an observed network is likely to be the projection of multiple layers. Our results suggest that many real-world networks are indeed projections.

I. MULTILAYER STOCHASTIC BLOCK MODELS
In our approach, nodes interact in different layers. In each one of these layers $\ell = 1, \dots, L$ we define an SBM as follows: each node $i$ belongs to a specific group $\sigma_i^\ell$, and links between pairs of nodes belonging to groups $\alpha$ and $\beta$ in layer $\ell$ exist with probability $q^\ell_{\alpha\beta}$. The observed adjacency matrix $A^O$ is an aggregate that results from the combination of the links in each of the layers, but where all information about the layers has been lost (Fig. 1). We call this model the multilayer SBM.
Here we consider the simplest case of two layers, $L = 2$. In this case, there are two combinations with a plausible physical interpretation: i) the AND combination of layers, in which $A^O_{ij} = 1$ if, and only if, $(i, j)$ are connected in both layers (Fig. 1(a)); ii) the OR combination of layers, in which $A^O_{ij} = 1$ if $(i, j)$ are connected in at least one layer (Fig. 1(b)). Each of these two mechanisms is plausible for specific scenarios. For example, the AND model is a plausible model for in vivo protein interactions, because in order for proteins to interact in the cell it is necessary for them to be capable of physically interacting (that is, to be linked in the layer of in vitro physical interactions) and to be expressed simultaneously in the same cellular compartment (that is, to be linked in the coexpression layer). The OR model is a plausible model for the effective on-line social network through which memes spread [39], because some people use Facebook to share memes, others use Twitter, and others use both.
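The two aggregation mechanisms can be illustrated with a minimal sketch. The function names `sample_sbm_layer` and `aggregate`, and the use of NumPy, are assumptions made for illustration and are not part of the original work; the code only assumes that each layer is an undirected SBM without self-loops.

```python
import numpy as np

def sample_sbm_layer(groups, q, rng):
    """Sample a symmetric, loop-free adjacency matrix from a single-layer SBM.

    groups[i] is the group label of node i; q[a, b] is the probability of a
    link between a node in group a and a node in group b (q is symmetric).
    """
    groups = np.asarray(groups)
    n = len(groups)
    probs = q[np.ix_(groups, groups)]                 # n x n link probabilities
    upper = np.triu(rng.random((n, n)) < probs, k=1)  # sample each pair i < j once
    return (upper | upper.T).astype(int)

def aggregate(a1, a2, mode):
    """Combine two layers into the observed network with the AND or OR rule."""
    if mode == "AND":
        return a1 & a2   # link observed only if present in both layers
    if mode == "OR":
        return a1 | a2   # link observed if present in at least one layer
    raise ValueError("mode must be 'AND' or 'OR'")
```

For the same pair of layers, the AND aggregate is never denser than either layer and the OR aggregate is never sparser, which is the density asymmetry discussed later for the synthetic benchmarks.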
In principle, we would like to identify which pair of partitions $(P_1, P_2)$ (in layers 1 and 2, respectively) best describes the observed aggregate topology. The probabilistically complete way to solve this problem is to obtain the joint probability $P(P_1, P_2 \mid A^O)$ that $P_1$ and $P_2$ are the true partitions of the nodes given the aggregate observed network. This distribution is given by Eq. (1), where $Q^\ell$ is a matrix whose elements $q^\ell_{\alpha\beta}$ represent the probability that a link exists between a pair of nodes belonging to groups $\alpha$ and $\beta$ in layer $\ell$, and $\mathcal{D}Q^\ell \equiv \prod_{\alpha \le \beta} \int_0^1 dq^\ell_{\alpha\beta}$ is the integral over all possible values of these probabilities.
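As a sketch of the form such a posterior takes, assuming only Bayes' theorem and marginalization over the connection probabilities (the normalization and the exact arrangement of the prior in the published Eq. (1) are assumptions here):

```latex
P(P_1, P_2 \mid A^O) \;\propto\; \int \mathcal{D}Q^1 \, \mathcal{D}Q^2 \;
  P(A^O \mid P_1, P_2, Q^1, Q^2) \; P(P_1, P_2, Q^1, Q^2)
```

The first factor in the integrand is the likelihood of the aggregate network under a given pair of layer models, and the second is the prior over partitions and connection probabilities.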
This integral can be computed both for AND combinations and for OR combinations of the two layers; for simplicity, here we focus on the AND model and discuss the OR model in the Appendices. Because in an SBM each link is independent of the others, and in the AND model a link has to be present in both layers to appear in the observed aggregate network $A^O$, the AND likelihood is

$$P(A^O \mid P_1, P_2, Q^1, Q^2) = \prod_{\alpha \le \beta} \prod_{\gamma \le \delta} \left(q^1_{\alpha\beta} q^2_{\gamma\delta}\right)^{n^1_{\alpha\beta\gamma\delta}} \left(1 - q^1_{\alpha\beta} q^2_{\gamma\delta}\right)^{n^0_{\alpha\beta\gamma\delta}}, \tag{2}$$

where $n^1_{\alpha\beta\gamma\delta}$ is the number of links between pairs of nodes that are in groups $\alpha$ and $\beta$, respectively, in layer 1, and in groups $\gamma$ and $\delta$, respectively, in layer 2, and $n^0_{\alpha\beta\gamma\delta}$ is the number of non-links between such pairs of nodes. Assuming a uniform distribution for the prior [27,40], we can plug Eq. (2) into Eq. (1) and integrate to find Eq. (3) (see Appendices), where, for clarity, we have used the shorthand $r \equiv \alpha\beta$ and $s \equiv \gamma\delta$, $m_r \equiv \sum_s m_{rs}$, and $m_s \equiv \sum_r m_{rs}$. Given Eq. (3), which is the complete probabilistic description of the multilayer SBM, one could in principle find the partitions $P_1$ and $P_2$ that maximize $P_{\mathrm{AND}}(P_1, P_2 \mid A^O)$. If this were possible, one would be able to perfectly disentangle the two SBMs responsible for the observed links, even though the observation did not have explicit information about the layers. It would also be possible to compare regular SBMs to multilayer SBMs to determine if a multilayer model is more or less appropriate to describe a given network. Unfortunately, the expression above becomes numerically intractable even for a small number of groups, and therefore one needs to make approximations that simplify the problem.
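The AND likelihood described above can be evaluated directly over node pairs, which is equivalent to grouping pairs by their block memberships in both layers. The following is a minimal sketch, not the authors' implementation; the function name `and_log_likelihood` is hypothetical, and it assumes all connection probabilities lie strictly between 0 and 1.

```python
import numpy as np

def and_log_likelihood(adj, groups1, groups2, q1, q2):
    """Log-likelihood of an observed network under the two-layer AND model.

    A pair (i, j) is linked with probability q1[g1_i, g1_j] * q2[g2_i, g2_j],
    because the link must be present in both layers independently.
    Assumes 0 < q < 1 for every entry so that the logarithms are finite.
    """
    groups1 = np.asarray(groups1)
    groups2 = np.asarray(groups2)
    p = q1[np.ix_(groups1, groups1)] * q2[np.ix_(groups2, groups2)]
    iu = np.triu_indices(len(groups1), k=1)   # each unordered pair once
    a, p = adj[iu], p[iu]
    return float(np.sum(a * np.log(p) + (1 - a) * np.log1p(-p)))
```

Evaluating this quantity is cheap for fixed partitions and probabilities; the intractability discussed in the text comes from integrating over the probabilities and exploring the space of pairs of partitions.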

II. LINK RELIABILITY WITH APPROXIMATE MULTILAYER STOCHASTIC BLOCK MODELS
We propose an approximation that makes it possible to work with multilayer SBMs. We start by noting that any multilayer SBM can be represented as a single-layer SBM (Fig. 2(a)) [41]. In the single-layer SBM, each group comprises the nodes that belong to the same pair of groups $\alpha$ in $P_1$ and $\beta$ in $P_2$ in the multilayer SBM (and only those); we call the single-layer partition the intersection partition. Moreover, if group $r$ in the intersection partition corresponds to groups $\alpha$ in $P_1$ and $\beta$ in $P_2$, and group $s$ in the intersection partition corresponds to groups $\gamma$ in $P_1$ and $\delta$ in $P_2$, then the probability of connection in the single-layer SBM is $q^{\mathrm{AND}}_{rs} = q^1_{\alpha\gamma} q^2_{\beta\delta}$ (for simplicity, we again focus on the AND model and leave the OR model for the Appendices). This fully determines the single-layer SBM.
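The mapping from a pair of layer partitions to the intersection partition, and from the two layer matrices to the single-layer AND matrix, can be sketched as follows. The function names are hypothetical and chosen for illustration only.

```python
import numpy as np

def intersection_partition(groups1, groups2):
    """Return the intersection partition of two layer partitions.

    Two nodes share an intersection group only if they share a group in
    layer 1 and in layer 2. Also returns, for each intersection group r,
    the pair (alpha, beta) of layer groups it corresponds to.
    """
    labels = {}
    pairs = zip(np.asarray(groups1).tolist(), np.asarray(groups2).tolist())
    inter = np.array([labels.setdefault(p, len(labels)) for p in pairs])
    return inter, list(labels)   # dict keys keep first-seen order

def and_block_matrix(pair_of, q1, q2):
    """Single-layer matrix with q_and[r, s] = q1[alpha, gamma] * q2[beta, delta],
    where r corresponds to (alpha, beta) and s to (gamma, delta)."""
    a = np.array([p[0] for p in pair_of])
    b = np.array([p[1] for p in pair_of])
    return q1[np.ix_(a, a)] * q2[np.ix_(b, b)]
```

Note that different pairs $(P_1, P_2)$ can produce the same intersection partition, which is why the approximation introduced next cannot distinguish between them.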
Here, we make the following approximation: we keep the information of the partitions $P_1$ and $P_2$ in the intersection partition, but consider that the matrix elements $q^{\mathrm{AND}}_{rs}$, while each being the result of the product of two factors, are all independent of each other (see Fig. 2(b)). Since this approximation is equivalent to integrating separately every term with a different $(\alpha, \beta, \gamma, \delta)$ combination in Eq. (2), it follows that the integrated likelihood depends exclusively on the intersection partition. In other words, within this approximation all pairs of partitions $(P_1, P_2)$ with the same intersection partition $P_I$ are equally likely, and it is no longer possible to uniquely determine the multilayer SBM that best describes the observed topology.

FIG. 2. Exact and approximate multilayer SBM ensembles. (a) Two independent single-layer SBMs aggregated using the AND mechanism. Each single-layer SBM is represented by its node-to-node connection probability matrix (represented by the shades of green; note that node ordering is different in each SBM). The aggregation of the two layers can also be represented as a single-layer SBM, in which each group comprises the nodes that belong to the same pair of groups $\alpha$ in layer 1 and $\beta$ in layer 2; this is the intersection partition $P_I$. Moreover, if group $r$ in $P_I$ corresponds to groups $\alpha$ in $P_1$ and $\beta$ in $P_2$, and group $s$ in $P_I$ corresponds to groups $\gamma$ in $P_1$ and $\delta$ in $P_2$, then the probability of connection in the single-layer SBM is $q^{\mathrm{AND}}_{rs} = q^1_{\alpha\gamma} q^2_{\beta\delta}$. (b) For a fixed pair of partitions $P_1$ and $P_2$, we integrate over the ensemble of all possible probability matrices $Q^1$ and $Q^2$ (Eq. (3)). For each pair $(Q^1, Q^2)$, the resulting $q^{\mathrm{AND}}_{rs}$ are highly correlated. In our approximation, we assume that the elements $q^{\mathrm{AND}}_{rs}$ are randomly drawn and independent of each other.
Despite this limitation, our approximation still enables us to address the fundamental question of whether real-world networks are better described by single-layer or multilayer models. Specifically, in what follows we compare the predictive power of single-layer and multilayer SBMs in the problem of detecting missing and spurious links in noisy networks [27]; we argue that, if (approximate) multilayer SBMs yield better predictions on real networks, then there is evidence to suggest that these networks are likely the outcome of multilayer processes (despite being observed as single-layer aggregates).
In the problem of assessing link reliability [27,42], the goal is to compute the probability $P(A_{ij} = 1 \mid A^O)$ that a link between nodes $i$ and $j$ truly exists ($A_{ij} = 1$) given a noisy network observation $A^O$, which contains false positives (spurious interactions that are reported but do not truly exist) and false negatives (missing interactions that truly exist but are not reported). We call the probability $R_{ij} = P(A_{ij} = 1 \mid A^O)$ the reliability of the link. In general, for any set $M$ of models (single-layer SBMs, AND-multilayer SBMs or OR-multilayer SBMs), the reliability is given by Eq. (4) [27], where $Z$ is a normalization constant.
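As a sketch of the general form such a reliability takes, assuming a standard Bayesian average over the model ensemble (the exact arrangement of terms in the published Eq. (4) is an assumption here, and $\mathcal{M}$ denotes an individual model in the set $M$):

```latex
R_{ij} = P(A_{ij} = 1 \mid A^O)
       = \frac{1}{Z} \int_{M} d\mathcal{M} \;
         P(A_{ij} = 1 \mid \mathcal{M}) \; P(A^O \mid \mathcal{M}) \; P(\mathcal{M}),
\qquad
Z = \int_{M} d\mathcal{M} \; P(A^O \mid \mathcal{M}) \; P(\mathcal{M})
```

Each model contributes its own estimate of the link probability, weighted by how well it explains the observed network and by its prior plausibility.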
In the case of multilayer SBMs, the integral over the ensemble of models $M$ requires: i) the integration over the connection probabilities $Q^1$ and $Q^2$ (akin to what we did to obtain Eq. (1)); ii) the sum over all pairs of partitions $P_1$ and $P_2$. Within our approximation, the first step can be carried out analytically but the second cannot (Appendices). However, always within our approximation, one can exploit the fact that the integral in Eq. (4) depends exclusively on the intersection partition $P_I$ and map the sum over pairs of partitions onto a sum over a single partition. By doing so we obtain the expression for the link reliability in Eq. (5) (see Appendices for the analogous expression for the OR model), where the sum is over all possible intersection partitions (that is, all single-layer partitions), $n^1_{\alpha\beta}$ is the number of links between groups $\alpha$ and $\beta$ in the intersection partition, $n_{\alpha\beta} = n^0_{\alpha\beta} + n^1_{\alpha\beta}$ is the number of pairs of nodes in groups $\alpha$ and $\beta$, and $D(P_I)$ is the number of pairs $(P_1, P_2)$ that have the same intersection partition $P_I$ (see Appendices). The energy function $H$ is given by Eq. (6), where the sum is over all distinct pairs of groups in $P_I$.
As in [27], the expression for the link reliability (Eq. (5)) is analogous to an ensemble average of an observable in statistical mechanics, giving $H(P_I)$ the meaning of an energy associated with a specific intersection partition. We can use a Markov chain Monte Carlo algorithm to compute $R_{ij}$ numerically (see Supplementary Material for details). As it turns out, $H(P_I)$ is equal to the energy obtained assuming a single SBM (Eq. S2, [27]) plus a term that arises because the probability matrix elements associated with the intersection SBM are the result of a product of two probabilities. In a Bayesian context, we can interpret this term and the degeneracy $D(P_I)$ as non-uniform priors for the intersection partitions.
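The ensemble average over partitions can be estimated with a Metropolis sampler. The sketch below is not the authors' algorithm: it uses only an assumed single-layer energy, $H(P) = \sum_{\alpha \le \beta} [\ln(n_{\alpha\beta} + 1) + \ln \binom{n_{\alpha\beta}}{n^1_{\alpha\beta}}]$, and the corresponding single-layer estimator $(n^1 + 1)/(n + 2)$, and it omits the additional multilayer term and the degeneracy $D(P_I)$ discussed above. All function names are hypothetical.

```python
import math
import numpy as np

def block_counts(adj, groups, k):
    """Number of links and of node pairs between every pair of the k groups."""
    onehot = np.eye(k, dtype=int)[groups]        # n x k membership matrix
    sizes = onehot.sum(axis=0)
    links = onehot.T @ adj @ onehot              # diagonal counts each link twice
    pairs = np.outer(sizes, sizes)
    np.fill_diagonal(links, np.diag(links) // 2)
    np.fill_diagonal(pairs, sizes * (sizes - 1) // 2)
    return links, pairs

def energy(adj, groups, k):
    """Assumed single-layer energy: sum of ln(n + 1) + ln C(n, n1) over blocks."""
    links, pairs = block_counts(adj, groups, k)
    h = 0.0
    for n1, n in zip(links[np.triu_indices(k)], pairs[np.triu_indices(k)]):
        h += math.log(n + 1) + math.lgamma(n + 1)
        h -= math.lgamma(n1 + 1) + math.lgamma(n - n1 + 1)
    return h

def sample_reliability(adj, n_steps, rng):
    """Metropolis estimate of link reliabilities, averaging over partitions."""
    n = adj.shape[0]
    groups = rng.integers(0, n, size=n)          # allow up to n groups
    h = energy(adj, groups, n)
    total, kept = np.zeros((n, n)), 0
    for step in range(n_steps):
        proposal = groups.copy()
        proposal[rng.integers(n)] = rng.integers(n)   # move one node (symmetric)
        h_new = energy(adj, proposal, n)
        if h_new <= h or rng.random() < math.exp(h - h_new):
            groups, h = proposal, h_new
        if step >= n_steps // 2:                 # discard the first half as burn-in
            links, pairs = block_counts(adj, groups, n)
            estimate = (links + 1) / (pairs + 2)
            total += estimate[np.ix_(groups, groups)]
            kept += 1
    return total / kept                          # diagonal entries are not meaningful
```

Recomputing the energy from scratch at every step keeps the sketch short; a practical sampler would update only the blocks affected by the proposed move.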

III. VALIDATION OF LINK RELIABILITY ESTIMATION IN MODEL NETWORKS
Now that we are able to estimate link reliabilities using single-layer SBMs [27] and our approximation to two-layer (AND and OR) SBMs (Eq. (5)), we compare the performance of these approaches at detecting missing and spurious interactions. Our expectation is that if real-world networks are truly the result of the aggregation of multiple layers, assuming a two-layer structure should result in higher accuracy.
To identify the limits of detectability of the 2-layer SBM model, we first construct a set of multilayer test networks that have a clearly differentiated block structure in each of two layers, and that are aggregated using the AND and OR models (see Methods and Fig. 3). We consider the predictive power of each of the approaches at detecting [27,42]: i) missing links (we remove a fraction $f$ of the links and compute the fraction of times that a removed link has a higher reliability than a link not present in the original network, that is, the AUC statistic); ii) spurious links (we add a fraction $f$ of links and compute the fraction of times that an added link has a lower reliability than a link present in the original network, that is, the AUC statistic).
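The AUC statistic used in both tests is the fraction of (positive, negative) pairs that are ranked in the expected order. A minimal sketch of the missing-link protocol follows; the function names are hypothetical, and counting ties as one half is an assumption not stated in the text.

```python
import numpy as np

def auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs in which the positive scores higher.

    Ties are counted as one half (an assumption of this sketch).
    """
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

def remove_links(adj, fraction, rng):
    """Remove a fraction of the links of an undirected network.

    Returns the noisy observation and the list of removed pairs (i, j), i < j.
    """
    iu, ju = np.nonzero(np.triu(adj, k=1))
    n_remove = int(round(fraction * len(iu)))
    chosen = rng.choice(len(iu), size=n_remove, replace=False)
    observed = adj.copy()
    observed[iu[chosen], ju[chosen]] = 0
    observed[ju[chosen], iu[chosen]] = 0
    return observed, list(zip(iu[chosen].tolist(), ju[chosen].tolist()))
```

For missing links, `scores_pos` would hold the reliabilities of the removed links and `scores_neg` those of pairs that are unlinked in the original network. For spurious links, the roles are reversed: `scores_pos` holds the reliabilities of the true links and `scores_neg` those of the added ones.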
For AND networks (Fig. 3(a-f)) we find that, for the detection of both missing and spurious links, the 2-layer approach outperforms the single-layer approach, especially: (i) when the number of distinct node groups in the intersection partition and the connectivity grow; (ii) for small or moderate noise levels (fraction of removed/added links). Only when the structure of the blocks becomes very blurry do we observe that the single-layer approach works better (but in this region all approaches do in fact work poorly).
For OR networks (Fig. 3(g-i)), the 2-layer approach again outperforms its single-layer counterpart in most situations. In this case, however, the largest improvements in performance happen for the hard cases with lower connectivity. This can be explained by noting that the OR model tends to generate very dense networks, whereas aggregate AND networks are sparser than the networks in each of the layers. Therefore, in general we expect the AND model to produce better results in real-world networks.

IV. MULTILAYER STOCHASTIC BLOCK MODELS ARE MORE PREDICTIVE FOR REAL NETWORKS
Finally, we compare the performance of the single-layer and multilayer approaches on three real-world networks: (i) the air transportation network in Eastern Europe [43]; (ii) the neural network of C. elegans [44]; and (iii) the email network within a university [45]. Our results show that the two-layer AND model provides a better description of these real-world networks, since both missing and spurious interactions are consistently detected more accurately by the multilayer SBM approach (the improvement is slight but, in most cases, significant), especially for low observational noise.

V. DISCUSSION
We have introduced the family of multilayer SBMs, which generalizes single-layer SBMs to situations where links arise in different layers and are aggregated through different mechanisms. We have also given the probabilistically complete solution to the problem of inferring the optimal multilayer SBM for a given aggregate network, and proposed a tractable approximation that enables us to objectively address the question of whether an observed network is best described as the projection of multiple layers or as a single layer. Our results suggest that many real-world networks are indeed projections.
Although, as mentioned above, there have been proposals to extend the concept of modularity to multilayer networks [38], ours represents a pioneering attempt to extend stochastic block models to multilayer systems. In this regard, it is important to stress that in this work we are concerned with the learning of multilayer models from aggregate networks where all information about the layers has been lost; in this sense, our work is different from previous attempts to do inference of stochastic block models on multigraphs where the layers themselves are observed [29].
Our work is also different from works on link prediction using latent feature models [46][47][48]. An important difference between latent feature approaches and ours is that latent feature models consider that the probability of existence of a link is a function of the weighted sum of the interactions at the different layers; therefore, they do not allow a physical interpretation of what each layer is and of how layers are combined. All in all, latent feature models are very well suited for the inference of unobserved links, but due to the intricacies of the models and the difficulty of interpreting their "parameters," it is not clear whether they are appropriate to address the question of whether a real network is really the outcome of a multilayer process (and they may also be prone to overfitting when observational data are noisy).
Our multilayer SBM is the simplest group-based multilayer model one can propose. We believe that its detailed analysis will open the door to a better understanding of the structure of real complex networks.

FIG. 3. Performance of missing and spurious link identification on synthetic aggregated 2-layer networks. Each row corresponds to a different collection of 2-layer SBMs, which are illustrated in (a, d, g). Dark green corresponds to high connection probability $h_p$ and light green to low connection probability $l_p$, and we generate synthetic networks varying two parameters: the low-to-high connectivity ratio $\alpha = l_p/h_p < 1$, and the average connectivity $k$ (Appendices). We compare the performance (AUC) at detecting missing links (b, e, h) and spurious links (c, f, i) of the approximate multilayer SBM approach, $AUC_{2L}$, against that of the single-layer SBM approach, $AUC_{1L}$. The size of the circles represents the $AUC_{2L}$ of the multilayer approach. The color of the circles represents the logarithm of the ratio $AUC_{2L}/AUC_{1L}$, so that blue circles correspond to instances where the multilayer approach outperforms the single-layer approach, and conversely for red circles.

FIG. 4. Performance of missing and spurious link identification on real networks. To compare the performance of the different approaches at detecting missing links (a, c, e), we randomly remove a fraction of the links (false negatives) from the real network and calculate the reliability of each unobserved link. Then we rank the links by decreasing score and calculate how often a removed link (false negative) has a higher reliability than a link that is truly non-existent in the real network (true negative). Analogously, to detect spurious links (b, d, f) we add a fraction of links (false positives), calculate the reliability of the observed links, and calculate how often an added link (false positive) has a lower reliability than a link that truly exists in the real network (true positive). We tested the analysis on three real networks: (a, b) the air transportation network in Eastern Europe; (c, d) the neural network of C. elegans; (e, f) the email network within an organization.