Mapping flows on sparse networks with missing links

Errors in empirical data for constructing networks can cause community-detection methods to overfit and highlight spurious structures with misleading information about the organization and function of a complex system. Here we show how to detect significant flow-based communities in sparse networks with missing links using the map equation. Since the map equation builds on the Shannon entropy estimation, it assumes complete data such that analyzing undersampled networks can lead to overfitting. To overcome this problem, we incorporate a Bayesian approach with assumptions about network uncertainties into the map equation framework. We validate the Bayesian estimate of the map equation with cross-validation, using Grassberger entropy estimation in the map equation so that we can compare unbalanced splits. Results on both synthetic and real-world networks show that the Bayesian estimate of the map equation provides a principled approach to revealing significant structures in undersampled networks.


I. INTRODUCTION
The function in social and biological systems emerges from how their components interact [1,2]. The interactions often consist of movements by some entity, such as passengers, money, or information, that generate systemwide flows across interaction networks with non-random structure, typically modular. Therefore, unraveling how these systems work requires an understanding of how the network flows organize in communities [3-7]. But identifying flow communities is only useful if they are significant.
Optimizing the map equation, an information-theoretic objective function that finds regularities in network flows by estimating a modular description length of those flows, with the search algorithm Infomap gives significant flow-based communities when enough links are known [4,8]. However, if many links are missing, the map equation may highlight spurious communities resulting from mere noise. While there are generative methods that can deal with uncertain network structures, including link-prediction algorithms [9-11] and network reconstruction approaches that often build on the stochastic block model [12-16], no method can reliably identify flow-based communities in networks with missing links.
The map equation estimates the modular description length of network flows with the Shannon entropy [17]. With missing data, the Shannon entropy underestimates the actual entropy of the complete data [18]. Consequently, when a network has many missing links, the map equation underestimates the actual description length of the complete network, capitalizes on details in the observed network, and favors network partitions with many small communities. Moreover, underestimating the entropy in networks with missing links also causes problems for standard procedures that evaluate model-prediction performance, including cross-validation: When the modular description length depends on the number of observed links, it depends on the number of cross-validation folds such that only balanced but wasteful two-fold splits of a network into training and test networks give useful results.
* jelena.smiljanic@umu.se
To overcome these problems, we present two regularization methods based on entropy estimation for undersampled discrete data. First, we incorporate a Bayesian approach in the map equation framework [19] and derive a closed-form formula for the posterior mean of the map equation under the Dirichlet distribution of network flows. Second, to enable more effective cross-validation, we replace the entropy terms in the map equation with Grassberger entropy estimation [20].
We show that the Bayesian estimate of the map equation outperforms the standard map equation in the undersampled regime, on both synthetic and real-world networks. Also, compared with the degree-corrected stochastic block model [21,22], this approach gives solutions that are more robust to changes in the fraction of removed links in the analyzed networks. Moreover, with Grassberger entropy estimation, the modular description length becomes nearly independent of the amount of data: Instead of wasteful equal splits, we can use most links in the training network to detect communities with Infomap and validate them using the remaining links in the test network. The two complementary solutions help us reduce overfitting and allow us to detect significant flow-based communities in networks with many missing links.

II. MAPPING FLOWS ON COMPLETE NETWORKS
The map equation is an information-theoretic objective function for community detection based on the minimum description length principle, which states the equivalence between data compression and identifying regularities in data. Building on this principle, the map equation estimates the per-step theoretical lower limit of the average code word length needed to describe network flows with a modular description [4,8]. When the links themselves do not represent flows, we can model the network flows with a random walker traversing the network. The goal is to identify the network partition that maximally compresses the modular description, which, at the same time, best captures the modular regularities of the network flows.
For simplicity, we consider modular descriptions with a two-level community hierarchy. In a network that has a well-defined community structure, the network flows stay for a relatively long time within communities. Therefore, instead of using unique code words for each node, the map equation reuses short code words in modular codebooks for encoding movements of the random walker between nodes within the same community for better compression. For a uniquely decodable description, this approach requires an additional index codebook to encode transitions between communities.
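The map equation measures the theoretical lower limit of the code length using the Shannon entropy [17]. For partition M of nodes $\alpha = 1, \ldots, V$ in communities $i = 1, \ldots, m$, the map equation takes as input the probability that the random walker enters community i, $q_{i\curvearrowleft}$, the probability to visit node $\alpha$, $p_\alpha$, and the probability to exit community i, $q_{i\curvearrowright}$. With $p_{i\circlearrowright} = q_{i\curvearrowright} + \sum_{\alpha \in i} p_\alpha$ for the total use rate of module codebook i, the average per-step code length needed to describe random walker movements within community i is
$$H(\mathcal{P}_i) = -\frac{q_{i\curvearrowright}}{p_{i\circlearrowright}} \log_2 \frac{q_{i\curvearrowright}}{p_{i\circlearrowright}} - \sum_{\alpha \in i} \frac{p_\alpha}{p_{i\circlearrowright}} \log_2 \frac{p_\alpha}{p_{i\circlearrowright}}.$$
Similarly, the average per-step code length needed to describe random walker transitions between communities is given by
$$H(\mathcal{Q}) = -\sum_{i=1}^{m} \frac{q_{i\curvearrowleft}}{q_{\curvearrowleft}} \log_2 \frac{q_{i\curvearrowleft}}{q_{\curvearrowleft}},$$
where $q_{\curvearrowleft} = \sum_{i=1}^{m} q_{i\curvearrowleft}$ is the total use rate of the index codebook. Therefore, we can express the map equation as the sum of the average code length of all codebooks weighted by their use rate:
$$L(M) = q_{\curvearrowleft} H(\mathcal{Q}) + \sum_{i=1}^{m} p_{i\circlearrowright} H(\mathcal{P}_i).$$
To identify the partition that minimizes the map equation, the search algorithm Infomap explores the space of possible solutions in a stochastic and greedy fashion.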
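As a minimal sketch of the two-level code length described above, the following Python function evaluates the map equation for an undirected, unweighted network and a given partition. It assumes that visit rates are proportional to node degrees and that enter and exit rates coincide, which holds for undirected links. The names `map_equation` and `module_of` are hypothetical and chosen for illustration; this is not the Infomap implementation and it performs no search over partitions.

```python
from collections import defaultdict
from math import log2


def map_equation(edges, module_of):
    """Two-level map equation in bits per step (undirected, unweighted)."""
    degree = defaultdict(int)
    exit_links = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if module_of[u] != module_of[v]:
            # A link between modules contributes exit flow to both modules.
            exit_links[module_of[u]] += 1
            exit_links[module_of[v]] += 1

    total = 2 * len(edges)
    p = {node: k / total for node, k in degree.items()}
    modules = {module_of[node] for node in degree}
    q = {i: exit_links[i] / total for i in modules}
    q_total = sum(q.values())

    # Index codebook: q * H(Q); skipped automatically when no flow leaves.
    length = -sum(qi * log2(qi / q_total) for qi in q.values() if qi > 0)

    # Module codebooks: p_i * H(P_i) for every module i.
    for i in modules:
        members = [node for node in degree if module_of[node] == i]
        use_rate = q[i] + sum(p[node] for node in members)
        if q[i] > 0:
            length -= q[i] * log2(q[i] / use_rate)
        length -= sum(p[node] * log2(p[node] / use_rate) for node in members)
    return length


# Two triangles joined by one link.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_modules = map_equation(edges, {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1})
one_module = map_equation(edges, {node: 0 for node in range(6)})
print(two_modules, one_module)
```

For this toy network, the two-module partition gives a shorter code length than the one-module partition, which is the compression gain that the map equation rewards.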

III. MAPPING FLOWS ON SPARSE NETWORKS WITH MISSING LINKS
Combined with Infomap, the map equation is an accurate method for community detection when complete network data are available [23]. However, empirical network data can contain incomplete data or measurement errors that cause missing and spurious links. When applying the map equation to such uncertain network data, it may identify spurious communities with misleading information about the underlying network structure and function.
We focus on the case where some of the links are missing, so-called false-negative links, which is a common problem in social and biological networks. When links are missing, the sample estimates of the random walker's transition probabilities lose precision. Worse yet, when plugging in these estimates into the Shannon entropy, the obtained entropy estimator has a negative bias and underestimates the entropy terms of the index and module codebooks [18]. As a consequence, the map equation favors more and smaller communities [11] (Fig. 1). This effect is obvious when so many links are missing that actual communities become sparse or even form disconnected components. Then the map equation cannot detect the actual communities, instead overfitting and identifying spurious communities from mere noise in the network.
To overcome overfitting, we incorporate a Bayesian estimate of the map equation.
An informative comparison between the standard map equation and a map equation with corrected entropy terms must take into account the structural properties of the detected communities. The simplest way would be to compare detected communities to planted communities. However, this approach only works for synthetic networks, not for real networks. Even for synthetic networks, comparing communities has limitations: Below the detectability limit [24], no method can recover the planted partition, and we need an alternative approach to evaluate the performance. We use cross-validation to test for overfitting.
The idea is first to split the network data into training and test sets and then use Infomap to identify the partition that maximally compresses the description length of the training network. If Infomap successfully recovers a significant partition of the training network, it will also successfully compress the description length of the test network. The opposite happens when there is not enough evidence in the data. Then Infomap overfits on the training network and detects a partition with a long description length of the test network. Thus, if Infomap detects a significant partition without overfitting, we should have $L_{\mathrm{test}}(M)/L_{\mathrm{train}}(M) \approx 1$. Conversely, if Infomap overfits, we expect $L_{\mathrm{test}}(M)/L_{\mathrm{train}}(M) > 1$. However, the description length varies with the fraction of observed links for the standard map equation (Fig. 2). Consequently, this procedure only works for equal splits when the standard map equation underestimates the true description length of training and test networks to the same degree. Since equal splits waste half of the links on the test network, the training network of already sparse networks will be severely undersampled and possibly below the detectability limit. To make the description length less dependent on the fraction of observed links, we incorporate Grassberger entropy estimation [20] into the map equation. The map equation with Grassberger entropy estimation enables effective cross-validation.
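The cross-validation ratio described above can be sketched as follows. This is an illustrative Python example under stated assumptions: the helper names `split_links`, `codelength`, and `validation_ratio` are hypothetical, the code length is the standard two-level map equation for undirected, unweighted links written in its expanded form, and a planted partition stands in for the partition that Infomap would detect on the training network.

```python
import random
from collections import defaultdict
from itertools import combinations
from math import log2


def plogp(x):
    return x * log2(x) if x > 0 else 0.0


def codelength(edges, module_of):
    """Standard two-level map equation (bits) for undirected, unweighted links."""
    degree, cut = defaultdict(int), defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if module_of[u] != module_of[v]:
            cut[module_of[u]] += 1
            cut[module_of[v]] += 1
    total = 2 * len(edges)
    inside = defaultdict(float)  # summed node visit rates per module
    for node, k in degree.items():
        inside[module_of[node]] += k / total
    q = {i: cut[i] / total for i in inside}  # exit rate equals enter rate
    return (plogp(sum(q.values()))
            - 2 * sum(plogp(x) for x in q.values())
            - sum(plogp(k / total) for k in degree.values())
            + sum(plogp(q[i] + inside[i]) for i in inside))


def split_links(edges, r, seed=0):
    """Move a random fraction r of the links to the test set."""
    shuffled = list(edges)
    random.Random(seed).shuffle(shuffled)
    n_test = round(r * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]


def validation_ratio(train, test, module_of):
    """L_test(M) / L_train(M) for one fixed partition M."""
    return codelength(test, module_of) / codelength(train, module_of)


# Two 5-node cliques joined by one link; the planted partition is a stand-in
# for the partition a community-detection run would return on `train`.
edges = (list(combinations(range(5), 2))
         + list(combinations(range(5, 10), 2)) + [(4, 5)])
planted = {node: node // 5 for node in range(10)}
train, test = split_links(edges, r=0.5)
print(validation_ratio(train, test, planted))
```

A ratio near 1 indicates that the partition compresses the test links about as well as the training links, whereas a ratio well above 1 signals overfitting. With the standard map equation, this reading is only meaningful for equal splits, as the text explains.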

A. Bayesian estimate of the map equation
Different methods have been proposed to address the problem of entropy underestimation [19,20,25-28]. Methods based on bias reduction [20,25,26] have a large variance in the undersampled regime, and they cannot prevent overfitting of the map equation. Instead, we use a Bayesian approach proposed by Wolpert and Wolf in Ref. [19] to estimate functions of probability distributions. This method not only prevents overfitting to noisy structures better than other Bayesian estimators but also enables an analytical estimation of the map equation and a computationally efficient implementation in Infomap.
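In general, we seek the Bayesian estimator $\hat{f}_B$ of a function $f(\boldsymbol{\rho})$ that takes a discrete probability distribution $\boldsymbol{\rho} = (\rho_1, \rho_2, \ldots, \rho_m)$ as input. When $\boldsymbol{\rho}$ is not given and we have only observations $\boldsymbol{n} = (n_1, n_2, \ldots, n_m)$, with $\sum_{i=1}^{m} n_i = N$, sampled according to the distribution $\boldsymbol{\rho}$, we must estimate $f(\boldsymbol{\rho})$ using the observed data $\boldsymbol{n}$. The Bayesian estimator for $f(\boldsymbol{\rho})$ is the posterior average,
$$\hat{f}_B = \int f(\boldsymbol{\rho}) P(\boldsymbol{\rho}|\boldsymbol{n}) \, d\boldsymbol{\rho},$$
where $P(\boldsymbol{\rho}|\boldsymbol{n})$ is the posterior over the unknown distribution $\boldsymbol{\rho}$ given by Bayes' rule,
$$P(\boldsymbol{\rho}|\boldsymbol{n}) = \frac{P(\boldsymbol{n}|\boldsymbol{\rho}) P(\boldsymbol{\rho})}{P(\boldsymbol{n})}.$$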
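To obtain $P(\boldsymbol{\rho}|\boldsymbol{n})$, we choose an appropriate prior probability distribution $P(\boldsymbol{\rho})$ and use the fact that the likelihood is multinomial, $P(\boldsymbol{n}|\boldsymbol{\rho}) \propto \prod_{i=1}^{m} \rho_i^{n_i}$, and that the total probability of the data is $P(\boldsymbol{n}) = \int d\boldsymbol{\rho} \, P(\boldsymbol{n}|\boldsymbol{\rho}) P(\boldsymbol{\rho})$.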
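To illustrate the posterior average for a single entropy term, the sketch below compares the plug-in Shannon entropy with its posterior mean under a symmetric Dirichlet prior, using the known closed form $\psi(U+1) - \sum_i (u_i/U)\,\psi(u_i+1)$ with $u_i = n_i + a$ and $U = \sum_i u_i$. The function names, the symmetric prior value, and the small pure-Python digamma routine are assumptions made for this example; they are not taken from the paper's implementation.

```python
from math import log


def digamma(x):
    """Digamma function for x > 0 via recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:  # shift the argument up, where the series is accurate
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))
    return result + log(x) - 0.5 / x - series


def plugin_entropy(counts):
    """Plug-in Shannon entropy in nats from observed counts."""
    total = sum(counts)
    return -sum(n / total * log(n / total) for n in counts if n > 0)


def bayes_entropy(counts, prior=1.0):
    """Posterior mean of the entropy (nats) under a symmetric Dirichlet prior."""
    u = [n + prior for n in counts]
    U = sum(u)
    return digamma(U + 1.0) - sum(ui / U * digamma(ui + 1.0) for ui in u)


counts = [3, 1, 0, 0]  # four outcomes, only four observations
print(plugin_entropy(counts), bayes_entropy(counts))
```

With so few observations, the plug-in value is well below the posterior mean, which shows on a small scale the negative bias that the Bayesian treatment is meant to counteract.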
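Applied to the map equation, we seek the Bayesian estimator of $f(\boldsymbol{\rho}) = L(M)$. Assuming undirected and unweighted links, the transition rate estimates are [29]:
$$p_\alpha = \frac{k_\alpha}{\sum_{\beta=1}^{V} k_\beta}, \qquad q_{i\curvearrowleft} = \frac{k_{i\curvearrowleft}}{\sum_{\beta=1}^{V} k_\beta}, \qquad q_{i\curvearrowright} = \frac{k_{i\curvearrowright}}{\sum_{\beta=1}^{V} k_\beta},$$
where $k_\alpha$ is the degree of node $\alpha$ and $k_{i\curvearrowleft} = k_{i\curvearrowright}$ is the degree of module i, the number of links that connect nodes of module i with nodes of other modules j, $j \neq i$. However, when the information about links is incomplete, the actual values of node and module degrees can deviate from these estimates. Therefore, we must apply a probabilistic approach, or the map equation will overfit and exploit spurious network structures.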
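To develop a Bayesian treatment of the map equation, we specify a prior distribution $P(p_\alpha, q_{i\curvearrowleft}, q_{i\curvearrowright})$ over the transition rates $p_\alpha$, $q_{i\curvearrowleft}$, and $q_{i\curvearrowright}$. A convenient choice is the Dirichlet distribution, which has simple analytical properties and can be interpreted as a probability distribution over the multinomial distribution of the transition rates,
$$P(\boldsymbol{p}, \boldsymbol{q}_{\curvearrowleft}, \boldsymbol{q}_{\curvearrowright}) = \frac{\Gamma(A)}{\prod_{\alpha=1}^{V} \Gamma(a_\alpha) \prod_{i=1}^{m} \Gamma(a_{i\curvearrowleft}) \Gamma(a_{i\curvearrowright})} \prod_{\alpha=1}^{V} p_\alpha^{a_\alpha - 1} \prod_{i=1}^{m} q_{i\curvearrowleft}^{a_{i\curvearrowleft} - 1} q_{i\curvearrowright}^{a_{i\curvearrowright} - 1},$$
with $A = \sum_{\alpha} a_\alpha + \sum_{i} (a_{i\curvearrowleft} + a_{i\curvearrowright})$. Here $\Gamma(x)$ is the gamma function and $a_1, \ldots, a_V$, $a_{1\curvearrowleft}, \ldots, a_{m\curvearrowleft}$, and $a_{1\curvearrowright}, \ldots, a_{m\curvearrowright}$ are the parameters of the distribution. While the rates $p_\alpha$, $q_{i\curvearrowleft}$, and $q_{i\curvearrowright}$ together do not sum to one as the Dirichlet distribution requires, we can use normalized transition rates because the map equation is scale invariant (see Appendix A).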
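We obtain the posterior distribution of the transition rates in Eq. (5) by multiplying the Dirichlet prior by the likelihood function and normalizing:
$$P(\boldsymbol{p}, \boldsymbol{q}_{\curvearrowleft}, \boldsymbol{q}_{\curvearrowright} | \boldsymbol{k}, \boldsymbol{a}) \propto \prod_{\alpha=1}^{V} p_\alpha^{k_\alpha + a_\alpha - 1} \prod_{i=1}^{m} q_{i\curvearrowleft}^{k_{i\curvearrowleft} + a_{i\curvearrowleft} - 1} q_{i\curvearrowright}^{k_{i\curvearrowright} + a_{i\curvearrowright} - 1}.$$
By combining this distribution and the expanded form of the map equation, in Eq. (4), and integrating, we obtain a closed-form expression for the posterior average of the map equation in terms of $u_x = k_x + a_x$ and the digamma function $\psi(x)$.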
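As a hedged sketch of how such a closed form can arise, assume that all node, enter, and exit rates are jointly normalized and follow a Dirichlet posterior with parameters $u_x = k_x + a_x$. The total $U$ and the aggregated parameters $u_{\curvearrowleft}$ and $u_{i\circlearrowright}$ are notation introduced only for this illustration, and the published expression may differ in normalization. The derivation uses two facts: a sum of Dirichlet components is Beta distributed, and $\mathbb{E}[s \ln s]$ for a Beta variable has a digamma form.

```latex
% Expanded form of the two-level map equation
L(M) = q_{\curvearrowleft}\log_2 q_{\curvearrowleft}
     - \sum_{i} q_{i\curvearrowleft}\log_2 q_{i\curvearrowleft}
     - \sum_{i} q_{i\curvearrowright}\log_2 q_{i\curvearrowright}
     - \sum_{\alpha} p_\alpha \log_2 p_\alpha
     + \sum_{i} p_{i\circlearrowright}\log_2 p_{i\circlearrowright}

% For s = \sum_{x \in S} \rho_x with \rho ~ Dirichlet(u), u_S = \sum_{x \in S} u_x,
% U = \sum_x u_x, we have s ~ Beta(u_S, U - u_S) and
\mathbb{E}[s \ln s] = \frac{u_S}{U}\left[\psi(u_S + 1) - \psi(U + 1)\right]

% The coefficients of \psi(U + 1) sum to zero over the five groups of terms, so
\mathbb{E}[L(M)] = \frac{1}{U \ln 2}\Big[
       u_{\curvearrowleft}\,\psi(u_{\curvearrowleft} + 1)
     - \sum_{i} u_{i\curvearrowleft}\,\psi(u_{i\curvearrowleft} + 1)
     - \sum_{i} u_{i\curvearrowright}\,\psi(u_{i\curvearrowright} + 1)
     - \sum_{\alpha} u_\alpha\,\psi(u_\alpha + 1)
     + \sum_{i} u_{i\circlearrowright}\,\psi(u_{i\circlearrowright} + 1)\Big],
\quad
u_{\curvearrowleft} = \sum_i u_{i\curvearrowleft},\;
u_{i\circlearrowright} = u_{i\curvearrowright} + \sum_{\alpha \in i} u_\alpha
```

Each $x \log x$ term of the plug-in expression is thus replaced by a smoother $u\,\psi(u+1)$ term, which is what allows an implementation to update the estimate incrementally when nodes move between modules.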
Proper selection of prior parameters $\boldsymbol{a}$ is important for good performance. The parameters $\boldsymbol{a}$ reflect our prior assumption of the link distribution in the network before we observed the network data. After seeing the data, we update our assumption by increasing the value of $a_x$ by $k_x$ and obtain the posterior distribution. For a sparse, undersampled network, therefore, the prior parameters $\boldsymbol{a}$ dominate the posterior link distribution. Conversely, as the network density increases, the posterior distribution becomes sharply peaked and the network data dominate the posterior link distribution.
We consider as an uninformative prior an Erdős-Rényi network with V nodes, where each pair of nodes is connected with some constant probability p [30]. The average degree is $\langle k \rangle = pV$ and sets the prior parameters $\boldsymbol{a}$. We aim to choose the average degree $\langle k \rangle$ such that the prior prevents the map equation from overfitting in the undersampled network, but also enables the map equation to detect well-formed communities. Since the random network experiences a phase transition from disconnected to connected phase at $\langle k \rangle = \ln(V)$ [30], $a \sim \ln(V)$ is a principled prior. For $\langle k \rangle$ below $\ln(V)$, the random network has isolated components and the prior cannot prevent overfitting, while for $\langle k \rangle$ above $\ln(V)$ well-formed communities can merge such that the map equation underfits.
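A short numerical illustration of this choice, written as a sketch with hypothetical helper names: `node_prior` returns the node prior parameter as a constant times ln(V), and `prior_weight` reports which share of the posterior parameter $u = k + a$ comes from the prior for a node of observed degree k.

```python
from math import log


def node_prior(num_nodes, C=1.0):
    """Prior parameter for a node: C * ln(V), with C a tunable constant."""
    return C * log(num_nodes)


def prior_weight(observed_degree, num_nodes, C=1.0):
    """Share of the posterior parameter u = k + a contributed by the prior."""
    a = node_prior(num_nodes, C)
    return a / (observed_degree + a)


V = 1000
print(node_prior(V))        # prior parameter for V = 1000
print(prior_weight(2, V))   # low observed degree: the prior dominates
print(prior_weight(50, V))  # high observed degree: the data dominate
```

The same prior therefore has a strong effect on poorly observed nodes and a weak effect on well observed ones, in line with the reasoning about sparse and dense networks above.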
This Bayesian estimate of the map equation extends to weighted networks where complete information about link weights is missing. If the link weights represent flows such that no flow modeling is necessary, the method also works for directed networks.
We have implemented the Bayesian estimate of the map equation in Infomap, available for anyone to use [31]. While we restrict our paper to the two-level formulation of the map equation for the sake of simplicity, we also provide the code for the Bayesian estimate of the multilevel map equation (see Appendix B).

B. The map equation with Grassberger entropy estimation
To perform cross-validation, we randomly remove a fraction r of links from the network, using them as test data, and the remaining links as training data. With E for the total number of links in the network and $k_\alpha$ for the degree of node $\alpha$, the probability that $k'_\alpha$ links of node $\alpha$ are removed after removing $E' = rE$ edges from the network follows the hypergeometric distribution:
$$P(k'_\alpha) = \frac{\binom{k_\alpha}{k'_\alpha} \binom{E - k_\alpha}{E' - k'_\alpha}}{\binom{E}{E'}}.$$
If E, E', and $k_\alpha$ are sufficiently large, the hypergeometric distribution converges toward the Poisson distribution,
$$P(k'_\alpha) \approx \frac{\lambda^{k'_\alpha} e^{-\lambda}}{k'_\alpha!},$$
where the parameter $\lambda = E' k_\alpha / E = r k_\alpha$ such that $\langle k'_\alpha \rangle = r k_\alpha$.
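The link-removal statistics above can be checked numerically. The following sketch compares the hypergeometric probability that a node loses a given number of links with its Poisson approximation; the function names and the example values (10,000 links, 10% removed, node degree 20) are assumptions chosen for illustration.

```python
from math import comb, exp, factorial


def hypergeom_pmf(removed, degree, num_links, num_removed):
    """Probability that `removed` of a node's `degree` links are among the
    `num_removed` links drawn without replacement from `num_links` links."""
    return (comb(degree, removed)
            * comb(num_links - degree, num_removed - removed)
            / comb(num_links, num_removed))


def poisson_pmf(removed, lam):
    return lam ** removed * exp(-lam) / factorial(removed)


E, r, k = 10_000, 0.1, 20
E_removed = int(r * E)
lam = r * k  # expected number of removed links for this node

for j in range(6):
    print(j, hypergeom_pmf(j, k, E, E_removed), poisson_pmf(j, lam))

mean_removed = sum(j * hypergeom_pmf(j, k, E, E_removed) for j in range(k + 1))
print(mean_removed)  # mean of the hypergeometric distribution, equal to r * k
```

The two distributions share the same mean, and their probabilities stay close for these values, which supports treating the observed degrees in the test network as approximately Poisson distributed.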
The method we propose for effective cross-validation is based on Grassberger entropy estimation. For a given incomplete set of observations $(n_1, n_2, \ldots, n_m)$, Grassberger entropy estimation assumes that they come from Poisson distributions with mean values $(z_1, z_2, \ldots, z_m)$ and aims to construct a function $\phi(n)$ that minimizes the error $|z_i \ln(z_i) - E[n_i \phi(n_i)]|$ across all values of $z_i$ [20]. The solution that minimizes the error is a recursive function $\phi(n) = G_n$ defined as
$$G_1 = -\gamma - \ln 2, \qquad G_2 = 2 - \gamma - \ln 2, \qquad G_{2n+1} = G_{2n}, \qquad G_{2n+2} = G_{2n} + \frac{2}{2n+1},$$
where $\gamma$ is Euler's constant [20]. While we cannot use Grassberger entropy estimation for weighted or directed networks, where visit rates correspond to the PageRank of the nodes [8], it does work for unweighted and undirected networks, where node visit and module transition rate estimates are given by link counts, Eqs. (8)-(10). Assuming incomplete observations, we can incorporate Grassberger entropy estimation into the map equation, Eq. (13), by replacing its entropy terms with their Grassberger estimates. Grassberger entropy estimation also works for the multilevel formulation of the map equation [32].
Grassberger entropy estimation can dramatically reduce the code length dependency on network density in cross-validation. For planted partitions, when the test network uses as few as 5% of the links, the difference in code length between training and test networks is negligible (Fig. 2(b)). Because Grassberger entropy estimation makes the modular description length nearly independent of the amount of data, we can use most links in the training network to reliably detect communities with Infomap.
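The recursion for $G_n$ can be sketched in a few lines of Python. The example below implements the recursion from Ref. [20] and the corresponding entropy estimate $\ln N - (1/N)\sum_i n_i G_{n_i}$ for a single distribution; the function names are hypothetical, and applying the estimator to every term of the map equation is not shown.

```python
from math import log

EULER_GAMMA = 0.5772156649015329


def grassberger_G(n):
    """G_n from the recursion G_1, G_2, G_{2n+1} = G_{2n}, G_{2n+2} = G_{2n} + 2/(2n+1)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    if n == 1:
        return -EULER_GAMMA - log(2)
    g = 2 - EULER_GAMMA - log(2)  # G_2
    for m in range(1, n // 2):    # climb G_2 -> G_4 -> ... up to the even index <= n
        g += 2 / (2 * m + 1)
    return g                      # odd n shares the value of the even index n - 1


def plugin_entropy(counts):
    """Plug-in Shannon entropy in nats."""
    total = sum(counts)
    return -sum(n / total * log(n / total) for n in counts if n > 0)


def grassberger_entropy(counts):
    """Grassberger entropy estimate in nats: ln N - (1/N) * sum_i n_i * G_{n_i}."""
    total = sum(counts)
    return log(total) - sum(n * grassberger_G(n) for n in counts if n > 0) / total


counts = [1, 1, 1, 1]  # heavily undersampled: every outcome seen only once
print(plugin_entropy(counts), grassberger_entropy(counts))
```

For this undersampled example, the Grassberger estimate is larger than the plug-in value, reflecting the correction for unobserved outcomes. Because the estimator does not know the number of possible outcomes, it can exceed the logarithm of the number of observed ones.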

IV. RESULTS AND DISCUSSION
We first analyze a synthetic network with planted community structure and a real-world Jazz collaboration network [33]. We generate the synthetic network with the Lancichinetti-Fortunato-Radicchi (LFR) method [34]. It has V = 1000 nodes, average node degree $\langle k \rangle = 16$, and nodes partitioned into M = 35 communities. The mixing parameter $\mu = 0.3$ is the probability that a randomly chosen link will connect nodes from different communities. In the Jazz collaboration network, each node represents a band and two nodes are connected if there is at least one musician who has played in both bands. The number of nodes is 198 and the number of links is 2,742. For this network, there is no information about ground-truth communities and no consensus about an optimal community partition [35,36]. To illustrate the behavior of the community detection method in sparse networks with missing links, we randomly remove a fraction r of links from the networks, where for each value of r, we average the results over 100 samplings.
Using these two networks, we compare the performance of the standard map equation, the Bayesian estimate of the map equation with different values of Dirichlet prior parameter $a_\alpha$, and the degree-corrected stochastic block model [21,22]. We are interested in the number of communities, the adjusted mutual information (AMI), and cross-validation results. Since the map equation and the degree-corrected stochastic block model use stochastic search algorithms to detect communities, we average the results over ten searches for each of 100 network samplings.
We analyze the Bayesian approach for prior $a \sim \ln(V)$. For the node degree, therefore, we use $a_\alpha = C \ln(V)$, where $\alpha = 1, \ldots, V$ and C is a constant that we need to specify. For the module degree, we use $a_{i\curvearrowleft} = a_{i\curvearrowright} =$ [...]
[Figure caption fragment] [...] provides the best solution: when sufficient network data are available it distinguishes significant communities from mere noise, while in the undersampled regime it detects no community structure. Results are averaged over 100 network samplings and ten algorithm searches. The standard error of the mean is never higher than 0.58.
[...] more than approximately 55% of the links (Fig. 3(a)). As we remove more links, the network also becomes sparse within communities. In the undersampled regime below the detectability limit where it is not possible to recover the planted partition, the map equation overfits to random fluctuations and favors more, smaller communities. In Fig. 3(a), we also show the number of communities that Infomap returns for the Bayesian estimate of the map equation for different values of the prior constant C. For C < 1, the random prior network is weakly connected and cannot prevent overfitting in the undersampled regime. In contrast, for C > 1, the random prior network is densely connected and prevents the map equation from distinguishing communities in the original network from noise induced by the prior: the map equation underfits even when sufficient network data are available. The results confirm that $a_\alpha = \ln(V)$ with prior constant C = 1, at the critical point where a giant connected component emerges in the random prior network, best balances over- and underfitting and prevents detection of spurious communities. Moreover, the amount of noise that this prior network induces in the original network is so low that it does not wash out any significant community structure. The degree-corrected stochastic block model shows a different behavior. As the fraction of removed links increases, the number of communities that the degree-corrected stochastic block model detects decreases continuously. For example, it detects more communities than in the planted partition when fewer than 30% of the links are removed from the synthetic network.
Similar behaviors appear accentuated when we apply the methods to the real-world Jazz collaboration network (Fig. 3(b)). For the standard map equation, the number of detected communities increases with the number of missing links, whereas the degree-corrected stochastic block model shows the opposite trend. Unlike when applied to the synthetic network, the various map equation variants favor different partitions already before removing any links. The Bayesian estimate of the map equation detects a smaller number of communities than the standard map equation, and its performance depends on the choice of the prior. For $a_\alpha = 0.5 \ln(V)$, the average number of communities is relatively stable when more than 50% of the links remain. However, if we remove more than 50% of the links, the number of communities increases. That is, the prior is too low. As for the synthetic network, the prior $a_\alpha = 2 \ln(V)$ is too high and causes underfit: the method detects no community structure when we remove more than 10% of the links. Again, $a_\alpha = \ln(V)$ offers a good tradeoff. The number of communities is approximately constant as long as at least 50% of the links remain and then decreases to 1 when fewer than 40% of the links remain, when the method deduces that there no longer exists any significant community structure. We illustrate differences in the community structure of the Jazz collaboration network induced by missing links for the standard and Bayesian map equation with alluvial diagrams [37] (Fig. 4). A prior assumption of missing links prevents the map equation from splitting communities when they lose links.

B. Adjusted mutual information
Adjusted mutual information (AMI) is a standard measure to compare two different partitions [38]. For the synthetic network, we compare identified partitions to the planted partition. The standard map equation successfully recovers the planted partition when more than 60% of the links are available (AMI = 1). When we remove more links, the accuracy decreases (Fig. 5(a)). The Bayesian estimate of the map equation with prior constant C = 0.5 has almost the same accuracy. If we use C = 1 instead, the method performs slightly better when we remove 40-60% of the links. Again, the Bayesian estimate of the map equation with prior constant C = 1 deduces that there no longer exists any significant community structure when we remove more than 65% of the links and AMI = 0.
To measure the AMI for the Jazz collaboration network, which has no planted partition, we compare the partitions that the community detection methods return for networks with different fractions of missing links to the partitions they return for the complete network. For the complete network, we measure the average AMI over ten searches. Figure 5(b) shows that the Bayesian estimate of the map equation with prior $a_\alpha = \ln(V)$ is the most consistent method when it is possible to detect significant communities.

C. Cross-validation
Cross-validation allows us to compare model-selection performance without planted partitions. We validate the significance of network partitions returned by Infomap for training networks that retain a fraction 1 − r of the links using the standard map equation and the Bayesian estimate of the map equation (Fig. 6).
As the link density of the training network decreases below the detectability limit, the partition obtained using the standard map equation poorly compresses the test network; the standard map equation mistakes noisy substructures in the sparse training networks for actual communities. In contrast, the Bayesian estimate of the map equation with prior constant C ≥ 1 achieves better performance on sparse training networks and identifies partitions with fewer communities that better compress the test network.
To complement these results with other networks, we provide summary statistics for six real-world networks often used to evaluate the performance of community detection algorithms in Table I. The networks include a collaboration network in Astrophysics extracted from the arXiv (AstroPh) [39], the network of e-mails exchanged between members of the University Rovira i Virgili (email) [40], a collaboration network of authors with Erdős number 1 (ErdősN1) [41], the American College Football network (football) [42], the PGP social network of trust (PGP) [43], and the network of political weblogs (polblogs) [44]. In all networks, the standard map equation returns partitions with a higher number of communities when links are missing. Except for the football network, the number of detected communities increases three or more times compared with the number of communities detected in the complete network. In all networks but the polblogs network, the training partitions with many communities poorly compress the test network. In contrast, except for the AstroPh network, the Bayesian estimate of the map equation with prior constant C = 1 highlights partitions with fewer communities that better compress the test networks.

V. CONCLUSION
We have derived a Bayesian estimate of the map equation that imposes prior information about the network structure to reduce overfitting for sparse networks with missing links. Using an uninformative Dirichlet prior, we show that the Bayesian estimate of the map equation outperforms the standard map equation on both synthetic and real-world networks with missing links. With a properly chosen prior constant, the proposed method successfully balances the impact of the imposed prior against the observed network data: The Bayesian estimate of the map equation provides a principled approach to reducing overfitting and detecting significant communities in two or more levels. We also show how to validate that the communities are significant using cross-validation. We anticipate that more reliable flow-based community detection of undersampled networks will be useful in many applications, including better prediction of missing links.
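APPENDIX A
Proposition. The map equation is a scale-invariant function.
Proof. If we scale the transition rates $p_\alpha$, $q_{i\curvearrowleft}$, and $q_{i\curvearrowright}$ by a constant K, where K > 0, and change L(M) to [...] $K q_i \log_2(K q_i)$ [...] it holds [...] such that [...]
Now we can use [...]
$$[\ldots] = \int L(M) \, P(\boldsymbol{p}, \boldsymbol{q}_{\curvearrowleft}, \boldsymbol{q}_{\curvearrowright} | \boldsymbol{k}, \boldsymbol{a}) \, d\boldsymbol{p} \, d\boldsymbol{q}_{\curvearrowleft} \, d\boldsymbol{q}_{\curvearrowright}, \qquad \text{(A9)}$$
where the posterior probability distribution is
$$P(\boldsymbol{p}, \boldsymbol{q}_{\curvearrowleft}, \boldsymbol{q}_{\curvearrowright} | \boldsymbol{k}, \boldsymbol{a}) \propto \prod_{\alpha=1}^{V} (p_\alpha)^{k_\alpha + a_\alpha - 1} \prod_{i=1}^{m} (q_{i\curvearrowleft})^{k_{i\curvearrowleft} + a_{i\curvearrowleft} - 1} (q_{i\curvearrowright})^{k_{i\curvearrowright} + a_{i\curvearrowright} - 1}. \qquad \text{(A10)}$$
As a result we obtain [...] where $u_x = k_x + a_x$ and $\psi$ is the digamma function, $\psi(x) = \frac{d}{dx} \ln \Gamma(x)$.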
APPENDIX B
[...] is the total code word use rate of module codebook $ij \ldots l$.
To obtain the Bayesian estimate of the multilevel map equation, we use Eq. (B1) to calculate the posterior average according to Eq. (4). Following the same procedure described in Sec. III A, we obtain a formula for the Bayesian estimate of the multilevel map equation [...]