Link prediction in multiplex networks via triadic closure

Link prediction algorithms can help to understand the structure and dynamics of complex systems, to reconstruct networks from incomplete data sets and to forecast future interactions in evolving networks. Available algorithms based on similarity between nodes are bounded by the limited amount of links present in these networks. In this work, we reduce this latter intrinsic limitation and show that different kind of relational data can be exploited to improve the prediction of new links. To this aim, we propose a novel link prediction algorithm by generalizing the Adamic-Adar method to multiplex networks composed by an arbitrary number of layers, that encode diverse forms of interactions. We show that the new metric outperforms the classical single-layered Adamic-Adar score and other state-of-the-art methods, across several social, biological and technological systems. As a byproduct, the coefficients that maximize the Multiplex Adamic-Adar metric indicate how the information structured in a multiplex network can be optimized for the link prediction task, revealing which layers are redundant. Interestingly, this effect can be asymmetric with respect to predictions in different layers. Our work paves the way for a deeper understanding of the role of different relational data in predicting new interactions and provides a new algorithm for link prediction in multiplex networks that can be applied to a plethora of systems.

Network science has been established as a pivotal tool to characterize the structure of real-world complex systems, that involves multiple types of relations among their fundamental components [1]. One of the most important challenges within the complex systems framework is to elucidate which entities are related to which others and what are the types of these relationships [2]. Within the network science domain, this scientific task translates into a link prediction problem, that attempts to estimate the likelihood of the existence of a link between two nodes, based on the observed links and attributes of nodes [3,4]. Link prediction algorithms are extremely helpful in at least two directions: to reconstruct networks from incomplete data sets and to forecast future interactions in evolving networks. Examples of the first application can be found in biological networks, such as protein interaction networks, where many links are still unknown and their existence must be demonstrated by expensive experiments [5]. Prediction algorithms help in focusing experimental efforts toward those links most likely to exist. The second task, link forecasting, is routinely applied in online social networks, such as Facebook. New friendships are indeed recommended based on link prediction algorithms, so that individuals can efficiently find peers they are interested in [6,7].
The link prediction problem is a long-standing challenge at the intersection between computer science and statistical physics communities. Traditional algorithms include Markov chains and statistical models [8], while recent approaches from the physics community, such as random walk processes and maximum-likelihood methods, have been considered [9,10]. Link prediction algorithms can be classified mainly into two categories: similarity-based methods and probabilistic models [11,12]. Since the latter can be computationally unfeasible for large networks, a lot of attention has been devoted to the creation of good similarity scores. Many of these similarity methods are based on the same basic idea, that two nodes are likely to be linked if they share a common neighbor [13,14]. Despite its simplicity, this concept has proven to be quite useful for highly assortative networks, such as scientific collaboration networks [15]. However, as signaled by Jia et al. [16], the prediction power of any similarity-based link prediction algorithm is bounded due to the limited amount of links present in the network.
In network science, richly structured data can be represented by multilayered networks, in which each layer accounts for a different type of interaction [17,18]. For instance, social interactions can have different purposes (e.g., leisure versus work) and happen through various communication channels, including face-to-face interactions, e-mail, Facebook, phone calls, and so on. The idea of predicting links in multilayer networks has been explored during the last decade from several different points of view. For instance, Davis et al. [19] proposed a technique to include multirelational data for link prediction from a probabilistic point of view. Similarly, several extensions of probabilistic models to multilayer networks have been proposed [20][21][22][23][24][25]. Other works, rather than focusing on incorporating new data to already existing networks, used multilayer structures to focus on the temporal evolution of the networks [26,27]. Several studies extended the notion of neighborhood to multilayer networks [28][29][30][31], focusing on networks of two layers. However, a fundamental question is still unanswered: How can different kinds of relational data be exploited to improve the prediction of new interactions? For instance, to which extent are face-to-face interactions predictive of new Facebook friendships? Interestingly, it has been recently shown that the multiplex network representation can be redundant in some cases, as the information encoded in some layers can be effectively included in others, reducing the number of layers [32]. Therefore, how can link prediction algorithms optimize the information structured in a multiplex network representation, that can be suboptimally organized?
In this Rapid Communication, we address these questions by proposing a metric for link prediction in multiplex networks, based on a generalization of the Adamic-Adar method for single-layered networks [14]. Our metric fully exploits the complexity of the relationships that might be established across the fundamental components of complex systems, by considering all possible triadic closures in the corresponding multiplex representation. We show that this score, that can be applied to any multiplex topology composed by an arbitrary number of layers, is able to outperform other metrics based on single-layered similarity between nodes, across several social and biological systems. We show that the information encoded in different layers can be asymmetric with respect to the link prediction problem: For example, face-to-face interactions can be partially predictive of new Facebook friendships, but not vice versa.
We consider eight different data sets spanning several types of social, biological, and technological systems, represented as multiplex networks: (i) Copenhagen networks study (CNS), where four layers represent physical proximity, phone calls, text messages, and Facebook friendships among university students [33]. (ii) C. elegans genetic (CEG): Genetic and protein interactions of the C. elegans, where three layers represent direct, physical, and additive genetic interactions [34]. (iii) C. elegans neural (CEN): A neural network of the C. elegans, where three layers represent electric, chemical monoadic, and chemical polyadic interactions [35], (iv) CS-Aarhus (CSA): A social network of employees of the Computer Science Department at Aarhus, where five layers represent Facebook, leisure, work, coauthorship, and lunch interactions [36]. (v) CKM physicians (CKM): A social network of physicians, where three layers represent who they ask for advice, who they discuss cases with, and who are their friends [37]. (vi) EU air (EUA): An air transportation network of Europe, where 27 layers represent airlines routes [38]. (vii) Lazega (LAZ): A social network of partners and associates of a corporate law partnership, where three layers represent cowork, friendship, and advice [39]. (viii) Vickers (VIC): A social network of students in a school in Victoria, Australia, where three layers represent who they get on with, who are their best friends, and who they prefer to work with [40]. See Table S1 of the Supplemental Material (SM) [41] for details about the data sets.
In the following, we will contrast different algorithms for link prediction on these data sets. The quality of link prediction algorithms can be evaluated by two metrics: the receiver operating characteristics (ROC) curve, with the corresponding area under the curve (AUC) value, and the precision. The precision can be computed as n * /n, where n is the number of new links that we want to predict and n * is the amount of correct predictions among the top n links. Thus, it provides complementary information to the one given by the AUC. It is important to highlight that, due to the limited amount of links present in a network, the AUC of any similarity-based link prediction algorithm is bounded [16]. For instance, if similarity is based on common neighbors, two nodes without any neighbor in common will have a score equal to zero. The number of scoreless links bounds the maximum and minimum values of the AUC to AUC min = 1 2 (1 + p 1 )(1 − p 2 ) and AUC max = AUC min + p 1 p 2 , where p 1 (p 2 ) is the fraction of links with a score different from 0 among those links that will (will not) exist in the future (see Sec. 2 of the SM [41] for details). Note that only when p 1 = p 2 = 1, i.e., there are no scoreless links, it holds AUC min = 0 and AUC max = 1.
We propose a generalization of the Adamic-Adar (AA) score [14], one of the most common and successful methods for link prediction in social networks. The AA score between nodes u and v is given by the number of common neighbors weighted by their degree, where (u) represents the set of neighbors of node u and k w = | (w)| is the degree of node w. In a multiplex network, the AA score can be applied to different layers, depending on which layer α the set of neighbors w ∈ α (u) ∩ α (v) is considered, where α (u) represents the set of neighbors of node u in layer α. For example, let us consider that we are interested in predicting future phone calls among the participants in the CNS. The classic AA method considers the set of neighbors in the same phone calls layer. If the AA score is applied to the layer representing Facebook friendship, instead, the rationale is that two individuals are more likely to interact offline (phone call) if they share many friends on Facebook (i.e., common neighbors in the Facebook layer). The same reasoning applies to other layers. Table I shows the precision and AUC values (together with its theoretical bounds) to predict phone calls, obtained for the AA method applied to each layer of the CNS (excluding physical proximity interactions for being much denser than others). Interestingly, while the maximum precision (0.04) is obtained by applying the AA score to the same calls layer, the maximum AUC (0.69) is obtained by considering the Facebook layer. This implies that this kind of interactions (Facebook friendship) include useful information to predict new links not encoded in the phone calls layer. This is also reflected in a larger maximum theoretical bound of the AUC for the Facebook layer with respect to the phone calls layer. Note also that by using the aggregated network, in which all layers are projected onto a single one, one obtains maximum AUC but zero precision. This observation shows the need to go beyond singlelayered scores and combine them into a more general metric that fully exploits the multiplex nature of the networks taken into account. Note, indeed, that single-layered metrics considered triadic relations among three nodes u, v, and w, in which the two links u-w and v-w both lie in the same layer. However, triadic relations in multiplex networks can be far richer [42,43]. Figure 1 shows different kinds of triadic relations in multiplex networks. Let us indicate as x the layer on which the link u-v is to be predicted. One can distinguish four types of triadic relations depending on the location of the (u, w) and (v, w) links: Within this formalism, one can consider a score that counts the common neighbors closing triads of each type, and weight each contribution by the logarithm of the degree, as in the Adamic-Adar score, This expression is the generalization of the Adamic-Adar score for multiplex networks (MAA) with an arbitrary number of layers, in which the links to be predicted all lay in the same layer x. Several considerations are in order. First, the contribution of each triad (u, v, w) ∈ T αβ is weighted by the square root of the logarithm of the degree of node w in the two layers involving α and β. With this choice, the original weight 1/ ln(k w ) is naturally recovered for α = β = x. Second, note that different layers of a multiplex network may show very different densities, as shown in Table S1 of the SM [41]. In case of similarity scores based on the number of common neighbors, as in this case, denser layers will have more triads and thus will be less informative. We take this into account by weighting the contribution of each type of triadic relation by the square root of the average degree of the layers involved, √ k α . Third, the coefficients η xα before each term allow us to control the relative weight of each type of triadic closure in the total score of the link. We choose them in a way that η xα corresponds to the weight of layer α. Without lack of generality, we choose α η xα = 1. Fourth, the application of the AA score to layer α, corresponding to triad closures [ Fig. 1(c)], is recovered by setting η xα = 1. The original AA score in single-layered networks [ Fig. 1(a)] is recovered by simply setting η xx = 1. Figure 2 shows the AUC [ Fig. 2(a)] and precision [ Fig. 2(b)] of the MAA metric as a function of the coefficients η xα , for three of the eight data sets under consideration. Others are shown in Figs. S1 and S2 of the SM [41]. For the sake of convenient visualization, we consider only three layers for each network, to visualize the three coefficients in a triangle. For each network, we consider the prediction of links in each of the three layers. The coefficient η xα indicates the weight of layer α in the prediction of new links in layer x. For instance, in the CNS data set, the coefficient η 12 indicates the weight of Facebook friendship (represented in layer 2) in predicting new phone calls (layer 1). One can see that, in most cases, the maximum value of the AUC and the precision is achieved for nontrivial combinations of the coefficients, i.e., different from η xx = 1 which corresponds to the classical AA score, showed in the left corner of triangles. This is particularly true for the precision, whose maximum is achieved in some cases in the middle of the triangle, i.e., with similar contributions for each layer, as in the case of the CKM or CNS networks. The exact values of the coefficients maximizing AUC and precision for each data set are reported in Table S2 of the SM [41].
Therefore, Fig. 2 shows that the prediction of a certain kind of links can be improved by exploiting additional, related information, encoded in other layers. For instance, Facebook friendship can help in predicting new calls [i.e., the maximum AUC for this task is obtained for η 12 = 0.40-see plot (i) of Fig. 2(a) and Table S2 of the SM [41]], or additive genetic interactions and physical association can be predictive Varying the values of two coefficients, the third is naturally fixed. Each column corresponds to a different data set, represented as a multiplex network of three layers. Each row corresponds to a prediction in a different layer x (see Table S2 of the SM [41] for the corresponding interactions). A cross indicates the maximum value for each plot, corresponding to the combination of coefficients (η x1 , η x2 , η x3 ), reported in Table S2 of the SM [41], that maximizes AUC or precision for the prediction of new links in layer x.
of direct protein interactions in C. elegans [i.e., the maximum precision is obtained for η 13 = 0.17 and η 12 = 0.22-see plot (ii) of Fig. 2(b) and Table S2 of the SM [41]]. Interestingly, this effect can be asymmetric: New offline interactions (calls and texts) are not predictive of Facebook friendships, as the corresponding coefficients η 21 and η 23 for this prediction task are zero. This is shown in plots (vii) of Figs. 2(a) and 2(b): The maximum value of the precision and AUC is obtained for η 22 1, in the left corner of the plots (see also Table S2 of the SM [41]). This implies that not all layers add valuable information for a specific link prediction task. In this case, a complete multiplex representation is redundant and such a layer can be effectively included in the others without missing relevant information.
Furthermore, we test if the MAA metric is able to optimally extract information from the multiplex representation, compared with the AA score applied to the aggregate network, that includes the same amount of information. Table II shows that the MAA metric outperforms the classical AA score with respect to both AUC and precision, in all data sets under consideration. Finally, in Table II we compare the MAA metric with other, state-of-the-art metrics for link prediction applied to the aggregated network representation, that includes all information available, in particular, common neighbors (CN), Jaccard's coefficient (JC), and preferential attachment (PA), which are based on the one-step neighborhoods of the nodes such as the AA score [44], and the Katz distance [45], which instead is based on path length. Table II shows the prediction of links in the first layer of each data set, and predictions in other layers are shown in Tables SIII and SIV of the SM [41], with similar results. One can see that the MAA metric outperforms the precision of all other metrics in all but one case (the Lazega data set), while it outperforms the AUC of other methods in five of eight data sets under consideration.
Before concluding, we stress that the metric encoded in Eq. (2) is different from previous extensions of link prediction to multilayer networks. Similarly, other approaches calculate the score of each layer and aggregate all of them (possibly with some weights), effectively neglecting structures of types T xα and T αβ [25,[46][47][48].
To sum up, we proposed a general method for link prediction that fully exploits different kinds of relational data encoded in several social and biological networks. Our metric is a generalization of the Adamic-Adar score for multiplex networks with an arbitrary number of layers, and it is able to  outperform single-layered AA scores in all considered data sets. The MAA metric also outperforms several well-known link prediction algorithms, such as the Jaccard's coefficient or the Katz distance. The coefficients η xα that maximize the MAA score have an interesting interpretation, as they correspond to the weight to be assigned to each layer in order to optimize the information structured in the network for the link prediction task, indicating which layers are redundant. Interestingly, this effect can be asymmetric with respect to predictions in different layers. The computational complexity of the MAA metric is similar to other similarity-based scores.
With respect to the classical AA score, it increases with the number of layers in the multiplex network, which is usually small. Note that the triadic relationships need to be computed just once and stored, then the whole range of coefficients can be scanned to obtain the ones that maximize the MAA score. In future works, it would be interesting to generalize to multiplex networks other metrics based on single layers, such as the Katz distance, which is based on paths that can be reconstructed across layers.
We acknowledge support from Intesa Sanpaolo Innovation Center. Y.M. acknowledges partial support from the Government of Aragón and FEDER funds, Spain through Grant No. E36-20R to FENOL, and by MINECO and FEDER funds (Grant No. FIS2017-87519-P). D.P. and M.S. acknowledge financial support from the project Casa nel Parco (POR FESR 14/20 -CANP -Cod. 320 -16) funded by Regione Piemonte. The funders had no role in study design, data collection, and analysis, decision to publish, or preparation of the manuscript.