Reconstructing networks with unknown and heterogeneous errors

The vast majority of network datasets contains errors and omissions, although this is rarely incorporated in traditional network analysis. Recently, an increasing effort has been made to fill this methodological gap by developing network reconstruction approaches based on Bayesian inference. These approaches, however, rely on assumptions of uniform error rates and on direct estimations of the existence of each edge via repeated measurements, something that is currently unavailable for the majority of network data. Here we develop a Bayesian reconstruction approach that lifts these limitations by not only allowing for heterogeneous errors, but also for single edge measurements without direct error estimates. Our approach works by coupling the inference approach with structured generative network models, which enable the correlations between edges to be used as reliable uncertainty estimates. Although our approach is general, we focus on the stochastic block model as the basic generative process, from which efficient nonparametric inference can be performed, and yields a principled method to infer hierarchical community structure from noisy data. We demonstrate the efficacy of our approach with a variety of empirical and artificial networks.


I. INTRODUCTION
The study of network systems of various kinds constitutes a significant fraction of contemporary interdisciplinary research in physics, biology, computer science and social sciences, among other disciplines [1].This is motivated in large part by the surging availability of network data during the past couple of decades, which describe the detailed interactions among constituents of large-scale complex systems, such as transportation networks, cell metabolism, social contacts, the internet, and various others.Despite the widespread growth of this field, its relative infancy is still noticeable in some aspects.In particular, even though sophisticated and successful models of network structure and function have been proposed, as well as powerful data analysis methods, most studies of empirical data are performed without taking into account measurement error.Most typically, real networks are represented as adjacency matrices, sometimes enriched with additional information such as edge weights and types, as well as various kinds of node properties, the validity of which is simply taken for granted.But as is true for any empirical scenario, network data is subject to observational errors: parts of the network might not have been recorded, and the parts that have might be wrong.Although this problem has been recognized in the past in several studies [2][3][4][5][6][7][8][9][10][11], the practice of ignoring measurement error is still mainstream, and robust methods to take it into account are underdeveloped.This is in no small part due to the fact that most available network data contain no quantitative error assessment information of any kind, thus preventing primary experimental uncertainties to be propagated up the chain of analysis.* t.peixoto@bath.ac.uk In this work we formulate a principled method to reconstruct networks that have been imperfectly measured.We do so by simultaneously formulating generative models of network structure -that incorporate degree heterogeneity, modules and hierarchies -as well as models of the noisy measurement process.By performing Bayesian statistical inference of this joint model, we are able to reconstruct the underlying network given an imperfect measurement affected by observational noise.Importantly, our method works also when a single measurement of the underlying network has been made, and the noise magnitudes are unknown.This means it can be directly applied to the majority of network data without available error estimates.In addition to this, our method is capable of extracting hierarchical modular structure from such noisy networks, thus generalizing the task of community detection to this uncertain setting.
Our method is equally applicable when information on measurement error is available, either as repeated measurements or as estimated edge probabilities.For this class of data, we construct a general model that allows for heterogeneous errors, that vary in different parts of the network.We show strong empirical evidence for the existence of this kind of heterogeneity, and demonstrate the efficacy of our method to include it in the reconstruction.
Our method shares some underlying similarities with well known model-based approaches of edge prediction [5,6], but is different from them in fundamental aspects.Most importantly, model-based edge prediction methods yield relative probabilities of edges existing or not, given a generative model fitted to the observed data.These relative probabilities can be used to reconstruct a network provided one knows how many edges are missing or spurious.Our method obviates the need for this information (which is in general unknown), and yields not only a reconstructed network, but also the uncertainty estimate that must come with it, via a posterior distribution over all possible reconstructions.Thus our method realizes the underlying promise of reconstruction that motivates most edge prediction methods, but in a principled and nonparametric way.
We form the basis of our reconstruction scenario on Ref. [10], which defined a statistical inference method based on multiple measurements of network data, but here we use a different approach based on nonparametric Bayesian inference, combined with community detection.This yields a more powerful method that, differently from Ref. [10], can be applied also when the network data does not contains any kind of primary error estimate, such as when the edges and nonedges have been measured only once.
This work is organized as follows.In Sec.II we formulate our Bayesian reconstruction framework.In Sec.II A we present our measurement model, and in Sec.II B we illustrate the use of our reconstruction method with some examples.In Sec.II C we perform a detailed analysis of the reconstruction performance of the method, as well as its use to provide estimates of various network properties.In Sec.II D we employ our approach to some empirical network data without primary error estimates, and evaluate their reliability.In Sec.II E we extend our method to heterogeneous errors, and use it to analyze network data with multiple measurements.In Sec.III we show how our method can be extended to situations where the arbitrary error estimates are extrinsically provided, and we finalize in Sec.IV with a conclusion.

II. BAYESIAN NETWORK RECONSTRUCTION
The scenario we consider is one where instead of a direct observation of a network A, we perform a noisy measurement D that contains only indirect information about A. The task of network reconstruction is then to obtain A from D. The approach we take is to perform statistical inference, where first we model the network generating process via a probability where θ are arbitrary model parameters.The entire data generating process is then completed by modelling also the noisy measurement, conditioned on the generated network A (the "true" network) and some further parameters φ.Given this general setup, the reconstruction procedure consists of determining A from the posterior distribution P (A|D) = P (D|A)P (A) where P (D|A) = P (D|A, φ)P (φ) dφ, is the marginal probability of the measurements D, and is the prior probability for A, summed over all possible parameter choices, weighted according to their (hyper-)prior probabilities.The remaining term P (D) = A P (D|A)P (A) is a normalization constant that corresponds to the total probability -or evidence -for the observed measurement.In the above, the probabilities P (θ) and P (φ) encode our prior knowledge (or lack thereof) about the network generation and measurement processes, respectively.With these at hand, Eq. 3 assigns the probability of a given network A being responsible for measurement D. Importantly, this distribution defines an ensemble of possibilities for the underlying network A that incorporates the amount uncertainty resulting from the measurement.This contrasts with reconstruction approaches that attempt to reproduce a single network, although within the above framework we could also attempt to find the single most likely reconstruction that maximizes Eq. 3, i.e. a maximum posterior point estimate.However, as we will see below, this is not the most appropriate point estimate, as it tends to incorporate noise from the data, biasing the reconstruction.Instead, we should consider the consensus of the full posterior distribution, which can also give us an estimation of uncertainty.
The above framework is general, and can be used for any kind of generative and measurement processes.Here, we are interested in those that can be used to describe the large-scale modular structures of networks, characterized by the partition of the nodes into groups b = {b i }, where b i ∈ {1, . . ., B} is group membership of node i.The simplest and most commonly used model for this is the stochastic block model (SBM) [12], where ω rs is the probability of an edge existing between nodes of groups r and s.Alternatively, we could also consider a more realistic version called the degree-corrected SBM (DC-SBM) [13], where λ rs controls the number of edges between groups r and s and κ i the expected degree of node i.This model variant decouples the degrees from the group memberships, allowing for arbitrary degree variability inside modules, a feature often found to be more compatible with real networks [14].(Note that the DC-SBM generates multigraphs with A ij ∈ N, whereas the SBM above generates simple graphs with A ij ∈ {0, 1}, as our framework requires.In appendix D we amend this inconsis-tency.)Using the above, we compute the marginal network probability as with integrated over the remaining model parameters, weighted by their respective prior probabilities.However, although Eq. 9 can be computed exactly [14], the complete marginal of Eq. 8 cannot, as it involves an intractable sum over all possible network partitions.Hence, instead of computing directly the posterior of Eq. 3, we obtain the joint posterior 1 P (A, b|D) = P (D|A)P (A|b)P (b) which involves only quantities that can be computed exactly, except P (D), which as we will shortly see, is unnecessary for the inference procedure.We do the above without any loss, as the original posterior of Eq. 3 can be obtained by marginalization, i.e.
This means that if we can sample from the joint posterior P (A, b|D), we can compute any estimate ŷ of a network property y(A) (e.g. the clustering coefficient) over the full marginal P (A|D) by averaging it over the joint posterior, i.e.
The procedure we use to sample from the posterior distribution is Markov chain Monte Carlo (MCMC).We consider move proposals of the kind P (b |A, b) and P (A |A, b) for the partition and network, respectively, and accept the proposal according to the Metropolis-Hastings [15,16] probability 1 It is important to distinguish between the network generation given by the prior of Eq. 9 and the reconstruction given by the posterior of Eq. 10.The former is a generative process that, even if it closely captures the large-scale structure present in the underlying network, it may deviate from it in important ways, e.g.lack an abundance of triangles or other properties not well described by the SBM, and thus generates the true network with only a very small probability.In contrast, the posterior of Eq. 10 corresponds to a distribution of networks that are "centered" around the observed data, and will incorporate features that are present in it, even if they are not well described by the SBM prior (such as clustering, and other "small-scale" properties).
which enforces detailed balance.If the move proposals are ergodic, i.e. they allow every network A and partition b to be proposed eventually, this algorithm will generate samples from the posterior distribution P (A, b|D) after a sufficiently large number of iterations (usually determined by requiring that statistical properties of the chain, such as average log-probability, become stationary).The ratio in Eq. 13 can be determined exactly without computing the intractable constant P (D) in Eq. 10, making this method asymptotically exact.We give more technical details of our MCMC procedure in Appendix B. The above setup is still sufficiently general that it can be used with any variant of the SBM.In particular, here we will make extensive use of the hierarchical DC-SBM (HDC-SBM) [14,17], which differs from the DC-SBM in that a nested hierarchy of priors and hyperpriors is used in place of the single prior P (λ|b) for the connections between groups.In this model, groups are clustered hierarchically into meta-groups, yielding a nested hierarchical partition {b l }, where b l is the partition of the groups in level l.As discussed in Refs.[14,17], this choice of structured priors removes a tendency of noninformative priors to underfit [18], and enables the detection of structures at multiple scales, while at the same time remaining unbiased with respect to different types of mixing patterns.Its posterior distribution is obtained in the same fashion, following the framework above, simply by replacing b → {b l }.
In the following, whenever we mention that we sample from the posterior P (A|D), it is meant we sample from the joint posterior P (A, b|D), and marginalize over b, as described above.The same is true when using the hierarchical model, i.e. we sample from P (A, {b l }|D), and marginalize over the hierarchical partitions {b l }.
The main difference from typical community detection based on statistical inference is that here we are not only interested in detecting modules in networks, but also inferring the network itself.Therefore, both the network and its partition into (hierarchical) groups are inferred from indirect data.As we will see, the simultaneous detection of modules offers a substantial advantage to the reconstruction task, as it allows correlations among edges to inform it.This means that we are able to perform reconstruction in situations which would otherwise be impossible.But before we proceed, we need to model the measurement process itself, as we do in the following.

A. Noisy network measurements
Here we will consider the scenario used in Ref. [10], where the edges of a network are measured directly and repeatedly, but the process is noisy, and potentially distorts the network.In particular, we will assume that for each node pair (i, j) we perform n ij distinct measurements, and record x ij positive outcomes, i.e. an edge is observed.For each observation, we have a probability p of observing a missing edge (i.e. a false negative) and a probability q of observing a spurious edge (i.e. a false positive), depending in each case if the underlying network possesses or not an edge (i, j).Thus, for each edge the observation probability is distributed according to a binomial distribution, with a success rate that depends on whether an edge exists in the underlying network, i.e.
Thus, the joint likelihood for the whole set of measure- written in terms of the following summary quantities, where M is the total number of measurements (edge or nonedge), X is the total number of observed edges, E is the total number of measured edges and T is the total number of correctly observed edges. 2 From this, we also identify the total number of false positives (spurious edges) as X − T and of false negatives (missing edges) as E − T .
To proceed with our calculation we need to specify the degree of prior knowledge we have on the error rates p and q.We can express this most naturally with a Beta distribution, where B(x, y) = Γ(x)Γ(y)/Γ(x+y) is the Euler beta function, and Γ(x) is the gamma function, and likewise for P (q|µ, ν), with hyperparameters µ and ν.As illustrated in Fig. 17 of appendix A, a value of α = β = 1 encodes a maximum amount of prior ignorance with respect to p, which is then uniformly distributed in the unit interval.Conversely, values α → ∞ and β → ∞ converge to a Dirac delta function centered at α/(α + β), amounting to a maximum certainty for a particular value of p, and therefore intermediary values of α and β interpolate between these two extremes (and analogously for q with µ and ν).With this, we can compute the integrated likelihood The noninformative case α = β = µ = ν = 1 simplifies further to The above noninformative generative process can also be equivalently interpreted as first choosing the number of false positives X − T uniformly from the interval [0, M − E] and then selecting them uniformly at random from the possible set with M−E X −T elements, and similarly choosing the number of false-negatives E − T uniformly in the interval [0, E] and the false-negatives from the set of size E E−T = E T .With the integrated likelihood in place, we can finally complete the posterior distribution of Eq. 3 with D = (n, x), which in this case becomes, P (A|n, x, α, β, µ, ν) = P (x|n, A, α, β, µ, ν)P (A) P (x|α, β, µ, ν) .
For P (A) we will use the SBM and sample A using MCMC from the joint posterior P (A, b|n, x, α, β, µ, ν), as discussed previously.
Even though we have integrated over the error probabilities p and q in the above, we can nevertheless obtain their posterior estimates by averaging from the above posterior P (p|n, x, α, β, µ, ν) = A P (p|n, x, A, α, β)P (A|n, x, α, β, µ, ν), (22) using the posterior for p conditioned on the network A, and likewise for q with (24) In the following, we will most often assume the noninformative case α = β = ν = µ = 1, corresponding to the maximum lack of prior knowledge about the measurement noise.In order to unclutter our expressions, if this is the case we will simply omit those hyperparameters from the posterior distribution, i.e.P (A|n, x) ≡ P (A|n, x, α = 1, β = 1, µ = 1, ν = 1).

Single edge measurements
As we increase the number of measurements n ij of each pair of nodes, we should expect also to increase the accuracy of the reconstruction, resulting in a posterior distribution P (A|n, x) that is very sharply peaked around the true underlying network.Although this scenario is plausible, and indeed desirable under controlled experimental conditions, this is not representative of the majority of the network data that are currently available.In fact, inspecting comprehensive network catalogs such as KONECT [19] and ICON [20] reveals a very pauper set of network data that can be cast under this setting of repeated measurements.On the contrary, the vast majority of them offer only a single adjacency matrix without quantitative error estimates of any kind.Needless to say, this is no reason to assume that they do not, in fact, contain errors, only that they have not been assessed or published.
Here we propose an approach of assessing the uncertainty of this dominating kind of network data by interpreting it as a single measurement with unknown errors rates, using the framework outlined above.In more detail, we assume that n ij = 1 for every pair i, j and that the single measurements x ij ∈ {0, 1}, correspond to the reported adjacency matrix.The lack of knowledge about the underlying error rates p and q can be expressed by choosing α = β = µ = ν = 1, in which case it is assumed that they both lie a priori anywhere in the unit interval. 3t first we may wonder if this approach has any chance of succeeding, since the lack of knowledge about the error rates means that the network could have been modified in arbitrary ways, such that the true underlying network is radically different from what has been observed.Indeed, if we define the distance between measured and generated networks, (25) which equals the sum of false negatives and false positives, we have that according to Eq. 20, the expected distance over many measurements is which is half the maximum possible distance of N 2 , which might lead us to conclude that our noise model will invariably destroy the network beyond the possibility of reconstruction, regardless of its original structure.What changes this picture is the fact that the posterior distribution P (A|x, n) of Eq.21 will in fact be more concentrated on the generated network than the implied by the above, and ultimately will depend crucially on our generative process P (A).The first point can be made by assuming a fully random generative model, which means that the true networks being measured are assumed to be completely random, given a particular density ω.The full prior can be obtained by a noninformative assumption P (ω) = 1, which yields with E = i<j A ij = E being the total number of edges, which is equivalent to sampling to the total number of edges from the interval [0, N  2 ] and then a fully random graph with that number of edges.Combining this with Eq. 20, yields the posterior distribution, which can be written as the product of two conditional probabilities, with corresponding to the uniform sampling of A with exactly E − T false negatives and X − T false positives, and with [• • • ] being the Inverson bracket that equals 1 if the condition inside it is true, or 0 otherwise, determines the posterior probability of the number of false negatives and false positives, up to a normalization constant.Although this distribution decays for values of E larger than 0, the decay is slow with ∼ 1/E, and hence, on average, the inferred networks A sampled from P (A|x, n) will be dense, yielding large distances d(A, A * ) if the true generated network A * is sparse.Although the posterior distribution of false negatives and positives resulting from P (T , E|x) is not uniformly distributed in the allowed interval, it is also not sufficiently concentrated to enable any reasonable accuracy in the reconstruction, regardless of how large the network is.What changes this considerably is to replace the fully random model of Eq. 28 by a more structured model.The key observation here is that the modifications induced by the error rates p and q affect uniformly every edge and nonedge, and thus with structured models we can exploit the observed correlations in the measurements x to infer the underlying network A, and in fact even the error rates p and q, which are a priori unknown.
We illustrate this by considering the non-degreecorrected SBM, where networks are generated with probability The final likelihood for the measurements x in this case will be identical to an effective SBM, given by where are the new effective SBM probabilities that have been scaled and shifted by the measurement noise.Suppose, for simplicity, that we know the true network partition b, and that the number of groups is very small compared to the number of nodes in each group.In this situation, the posterior distribution for ω should be tightly peaked around the maximum likelihood estimate ω , where e rs = ij x ij δ bi,r δ bj ,s is the number of observed edges between groups r and s (or twice that for r = s) and n r is the number of nodes in group r.The joint posterior distribution for p and q will then be asymptotically

Probability density
(p, q) = (0, 0) (p, q) = (0.5, 10 −7 ) (p, q) = (0.7, 10 −5 ) (p, q) = (0.9, 10 −4 ) (c) Figure 1.(a) Illustration of a hypothetical measured network, with a priori unknown errors, but from which error estimates can be made: the lack of edges between groups 2 and 3, 3 and 4, 2 and 4, and 1 and 3 implies that the probability q of missing edges is likely to be low.Similarly, the large internal density of group 3 (which forms a clique of 10 nodes) implies that the missing edge probability p must be low as well.(b) How the network in (a) would look like for higher values of p and q.(c) The distribution of marginal edge probabilities pij between every node pair, for a fit of the HDC-SBM on the openflights data (see Appendix E), measured with different values of the noise parameters (p, q).As the noise magnitudes increase, the probabilities become less heterogeneous, and concentrate in narrower intervals.Hence, the inference of broad connection probabilities from data rules out the existence of strong noise in the measurement.
given by up to normalization, where [• • • ] is again the Inverson bracket.The constraints above imply that the inferred error rates will be bounded by the maximum and mini-mum inferred connection probabilities, i.e.
q ≤ min rs e rs n r n s , These bounds mean that if we have not observed many edges between groups r and s, this implies that q could not have been very large.If instead we do observe many edges between these groups, then this means that the value of p could not have been very large either (see Fig. 1a and b).This holds for every pair of groups r and s, but the values of p and q are global.Therefore, as long as the inferred SBM probabilities are sufficiently heterogeneous, they should constrain the inferred error rates to narrow intervals -which will also constrain the inferred number of false negatives and false positives (see Fig. 1c). 4 On the other hand, if the model probabilities are homogeneous, the posterior distribution for the errors will be broad, and the quality of the reconstruction will be poor.Therefore, the success of this approach depends ultimately on the observed networks being sufficiently structured, and of our models being capable of describing them.
The above means that we have a better chance of accurate reconstruction if our models are capable of detecting heterogeneous connection probabilities among nodes.A fully uniform model like the Erdős-Renyi of Eq. 28 (equivalent to a SBM with only one group) will exhibit the worse possible performance.The DC-SBM, on the other hand, should in general perform better than the SBM, since it is capable of capturing degree heterogeneity inside groups, which is a common feature of many networks [13,14].The HDC-SBM [14,17] should perform even better, since its tendency not to underfit means it can detect statistically significant structures at smaller scales.
Finally, it must also be noted that when performing only single measurements, there remains an unavoidable identification problem, where it becomes impossible to fully distinguish a network that has been sampled from a SBM with parameters ω and error rates p and q from the same network sampled from a SBM with parameters ω given by Eq. 36 and error rates p = q = 0 (and in fact any interpolation between these two extremes).This uncertainty, however, will be reflected in the variance of the posterior distribution, and serves as a worse-case estimation of the error rates, which ultimately can be improved either by incorporating better prior knowledge (e.g. via the hyperparameters α, β, ν and µ) or performing multiple measurements.

B. Empirical examples
Before we proceed further with a systematic analysis of our reconstruction method, we illustrate its behavior with some empirical data that are likely to contain errors and omissions.We begin with the network of social associations between 62 terrorists responsible for the 9/11 attacks [21,22].The existence of an edge between two terrorists is established if there is evidence they interacted directly in some way, e.g. if they attended the same college or shared an address.Clearly, this approach is inherently unreliable, as investigators may either fail to record evidence, or the evidence recorded may be simply erroneous.Nevertheless, although this potential unreliability was acknowledged in Refs.[21,22], is was not assessed quantitatively, and the data presented there is a single adjacency matrix with no error estimates.Therefore it serves as a suitable candidate for the application of our reconstruction method.When applied to this dataset, our approach yields the results seen in Fig. 2, which shows the marginal posterior probability of each possible edge in the network, in addition to the hierarchical modular structured captured by the HDC-SBM.Our method identifies the organization into a few largely disconnected cells, typical of terrorist groups.When ranking the potential edges according to their marginal posterior probability, as shown in Fig. 2c, we have that all observed edges are more likely to be true edges than any of the nonedges, indicating a fair degree of inferred reliability.The observed nonedges have a probability substantially smaller than the observed edges of being edges, with the sole exception of a connection between Mohamed Atta (one of the main leaders) and Waleed al-Shehri, which was not considered in Refs.[21,22], but to which our method ascribes a reasonably high probability of 0.48.Atta is connected to all members of al-Shehri's group, and according to the HDC-SBM the sole missing link between them is therefore suspicious.Indeed, journalistic reports place both individuals occasionally sharing an apartment in Berlin, 5 and meeting at least once in Spain, 6 prior to the attacks, which seems to corroborate our reconstruction.The remaining observed nonedges have a probability of 0.15 or smaller, which should not be outright discarded, and could serve as candidates for further investigation.
We now move to another social network, namely the interactions between 34 members of a karate club, originally studied by Zachary [23].This network has been widely used to evaluate community detection methods, after its use for this purpose in Ref. [24].It was recorded just before the split of the club in two disjoint groups

Posterior marginal probability of an edge
Observed edge Observed nonedge (c) Figure 2. Network of social associations between 9/11 terrorists [21,22].This network was measured by potentially unreliable means, but no quantitative error estimates are known, and no repeated measurements were made.In (a) and (b) is shown the inferred network according to our method -which does not require direct error estimates or repeated measurements -where the edge thickness indicates the posterior marginal probability of an edge existing.In (a) the inferred hierarchical structure is shown, with pie charts on the nodes indicating the marginal probabilities of group memberships, and in (b) a spatial layout of the same network shows the lowest level of the hierarchy as the node colors.The edge shown in red is inferred as existing with a large probability, despite not being measured.Other potentially missing edges are also shown in red, with a probability given by their thickness and opacity.In (c) is shown the marginal probability of edge existence for all node pairs, indicating a fair amount of inferred reliability -with the exception of the single missing edge highlighted in (a) and (b) -despite the lack of direct error estimates in the data.The horizontal line marks a 1/3 probability as a visual aid.The missing edge corresponds to a connection between Mohamed Atta and Waleed Alshehri, which was not considered in Refs.[21,22], but is corroborated by reports that they shared an apartment in Berlin, and met previously in Spain.
after a conflict, and many community detection methods are capable of accurately predicting the split by detecting communities from this snapshot.However, not only does the original publication of Ref. [23] omits any assessment of measurement uncertainties, but also it clearly contains one obvious error: the adjacency matrix A published in the original study, although it is supposed to be symmetric, contains two inconsistent entries with A ij = A ji , for (i, j) = (23,34), creating an ambiguity about the existence of this particular edge. 7The authors of Ref. [24] made the decision of assuming A 23,34 = 1, even though there seems to be no obvious reason to decide either way a priori.The vast majority of other works in the area followed suit (possibly inadvertently), thus incorporating this potential, though arguably small, error in their analysis.Here we tackle this reconstruction problem by mapping the uncertain dataset of Ref. [23] to our framework.Since each node pair (i, j) was also presented reversed (j, i), we consider these as independent measurements, such that n ij = 2 for every pair (i, j).Since the measurements were consistent for all but one pair, we have x ij = 2 or 0, except for the offending entry with 7 To the best of our knowledge, this issue was first identified by Aaron Clauset [25], who assembled the alternative dataset with A 23,34 = 0 and hence 77 edges (as opposed to the more common variant with A 23,34 = 1 and 78 edges) and made it available in his website c.a. 2015, http://santafe.edu/~aaronc/data/zkcc-77.zip.
x (23,34) = 1.Based on this we employed our reconstruction approach to obtain P (A|n, x), using as generative processes the Erdős-Rényi (ER) model (equivalent to a SBM with only one group, B = 1), the configuration model (CM) (equivalent to a DC-SBM with B = 1) and the HDC-SBM.As we see in Fig. 3, the ER model is incapable of disambiguating the data, as it cannot be used to detect any structure in it, and ascribes a posterior probability of 0.5 to the uncertain edge.Both the CM and the HDC-SBM, however, ascribe high probabilities for the edge, of 0.87 and 0.93, respectively.The CM approach is able to recognize that since node 34 is a hub in the network, an edge connecting to it more likely to occur than not, and the HDC-SBM can further use the fact that both nodes belong to the same group.Therefore, it seems like the choice made by the authors of Ref. [24] of assuming A 23,34 = 1 was fortuitous, and the de facto instance of this network used by the majority of researchers is the one mostly likely to correspond to the original study.
In the following we move to a systematic analysis of the reconstruction method, based on empirical and simulated data.

C. Reconstruction performance
Before we evaluate the performance of the reconstruction approach, we must first decide how to quantify it.As a criterion of how close an inferred network Â is to the true network A * underlying the data we will use the distance of Eq. 25, A successful reconstruction method should seek to find an estimate Â that minimizes this distance.However, since we do not have direct access to the true network A * , the best we can do is to consider the average distance over the posterior distribution given the noisy data, where is the marginal posterior probability of edge (i, j).If we minimize d( Â) with respect to Â, we obtain for π ij = 1/2.Eq. 44 defines what is called a maximum marginal posterior (MMP) estimator, and it leverages the consensus of the entire posterior distribution of all possible networks for the estimation of every edge.Operationally, it can be obtained very easily by sampling networks from the posterior distribution, and computing how often each edge is observed, yielding an estimate for π and hence Â.
Given the above criterion, we evaluate the reconstruction performance by simulating the noisy measurement process.We do this by taking a real network A * (which for this purpose we are free to declare to be error free), and obtaining a measurement x given error rates p and q, and measuring each edge and nonedge the same number of times n ij = n.We choose p arbitrarily and , where E is the number of edges in A * , so that the measured networks have the same average density as A * .Given a final measurement x, we sample inferred networks A from the posterior distribution P (A|x, n) and compute the MMP estimate Â from the marginal distribution π.The quality of the reconstruction is then assessed according to the similarity to the true network A * , S( Â, A * ) ∈ [0, 1], defined as where d( Â, A * ) is the distance defined in Eq. 25.A value of S( Â, A * ) = 1 indicates perfect reconstruction, and S( Â, A * ) = 0 the situation where Â and A * do not share a single edge. 8 8 Note that S( Â, A * ) differs from the measure of accuracy commonly used in binary classification tasks, defined as the fraction of entries in A * (both zeros and ones) that were correctly estimated in Â, which in this case amounts to 1 − d( Â, A * )/ N 2 .This is because we are more typically interested in reconstructing sparse networks, where the number of zeros (nonedges) is far larger than ones (edges), such that d( Â, A * ) N 2 , for all choices of sparse Â and A * , causing the accuracy to approach one simply because Â shares most of its nonedges with A * , even if they do not have a single edge in common.The similarity S( Â, A * ) fixes this problem by normalizing instead by the total number of edges observed in both networks.Note, however, that a value of S( Â, A * ) = 0 does not imply that the distance d( Â, A * ) is maximal, only that it is large enough for both networks not to share any edge.[(d) and (g)] KL divergence between true and inferred degree distributions, as discussed in the text.In all cases [(a) to (h)] the dashed curve shows the corresponding value obtained directly with the measured data with n = 1, and the solid horizontal line marks the true value corresponding to perfect reconstruction.
In Figs.4a and e are shown the results of this procedure with the political blogs and openflights networks (see Appendix E).As a baseline, in both figures we show the direct similarity S(x, A * ) of the data obtained with n = 1 to the true network A * , as dashed curves.In both cases the similarity of the inferred network S( Â, A * ) to the true network is larger than the one obtained with the direct observation S(x, A * ) for the vast majority of the parameter range, indicating systematic positive reconstruction even with single measurements.Expectedly, the quality of reconstruction increases progressively with a larger number of measurements n, with the similarity eventually approaching one.Although perfect reconstruction is not possible with single measurements when the noise is large, it is a noteworthy and nontrivial fact that the distance to the true network always decreases when performing it.This is only possible due to the use of a structured model such as the HDC-SBM that can recognize the structure in the data and extrapolate from it.If one would use a fully random model in its place, the similarity would be zero in the entire range, if n = 1 (although it would improve for n > 1).
A particularly interesting outcome of the successful reconstruction is that the noise magnitudes p and q can be determined as well, even though they are not a priori known.As shown in Fig. 5 the posterior probability for p and q are very close to the true values used, even for single measurements.(The precision of the inferred values of q is generally higher than of p, as we are dealing with sparse networks, with vastly more nonedges than edges.) For the openflights data the accurate noise recovery only occurs for moderate magnitudes, and a strong discrepancy is observed for values around p 0.5.In such situations, prior knowledge of the noise values could have aided the reconstruction for n = 1, but otherwise any benefit from this information would have been marginal.Again, the noise recovery becomes asymptotically exact as we increase the number of measurements, and is already very accurate for n = 2.
We note that the results of Fig. 4 remain largely unchanged if the underlying network considered is sampled from the DC-SBM with parameters inferred from the original data (not shown).

Estimating summary quantities
In addition or instead of the network itself, we may want to estimate a given scalar observable y(A) that acts as a summary of some aspect of the network's structure.In this case, we should seek to minimize the squared error with respect to the true network A * , where ŷ is our estimated value.Like before, without knowing A * the best we can do is minimize the squared error averaged over the posterior distribution, Minimizing σ 2 ŷ with respect to ŷ yields the posterior mean estimator, We can also obtain the uncertainty of this estimator by computing its variance of Eq. 47, so that the uncertainty of ŷ is summarized by its standard deviation, σ ŷ .
It is important to emphasize that in general ŷ = y( Â), with Â being the MMP estimator of Eq. 44.In other words, the best estimate for y(A * ) (i.e. with minimal squared error) is not the same as the value obtained for the best estimate of A * (i.e. with minimal distance).
In Figs.4b, c, f, and g we see the results of the same experiment described above, where we attempt to recover the average local clustering coefficient and the degree assortativity of the original network.As with the similarity, the inferred values are closer to the true network's.However, in this case the values for n = 1 are substantially closer to the true value for a large range of noise magnitudes, and is often indistinguishable from it.This means that even in situations where the posterior distribution of inferred networks yields a relatively poor similarity to the true network, as it cannot precisely correct the altered edges and nonedges, it still shares a high degree of statistical similarity with it, and can accurately reproduce these summary quantities.

Estimating degree distributions
We can also estimate degree distributions pk , defined as the probability that a node has degree k, by treating them like a collection of scalar measurements, and minimize the squared error k (p k − p k (A)) 2 averaged over the posterior distribution, which yields the same posterior mean estimator used so far, The same estimator is also obtained when minimizing the Kullback-Leibler (KL) divergence, over the posterior, which offers a more convenient way to compare distributions, as it can be interpreted as the amount of information "lost" when pk is used to approximate p k (A).
For the estimation of the degree probabilities p k (A) for each individual network sampled from the posterior, we model the degrees k = {k i } as a multinomial distribution9 where n k is the number of nodes of degree k.The probabilities themselves are modelled by a uniform Dirichlet mixture, i.e., sampled uniformly from a simplex constrained by the normalization where K is the largest possible degree.With this, the the posterior mean becomes This estimation is superior to the more naive p k = n k /N , as it is less susceptible to statistical fluctuations due to lack of data, such as when n k = 0, although it approaches it for N K and n k 1.In Figs.4d and h are shown the KL divergence between the inferred and true distributions, for the same experiments as before.Like with the local clustering and assortativity coefficients, the reconstructed degree distributions remain very close to the true one, despite the continuously decreasing similarity for larger noise magnitudes.In Fig. 6 can be seen the true, measured and reconstructed distributions for the political blogs network, for a value of (p, q) = (0.41, 0.0094).Despite the relatively high noise magnitudes, a single measurement of the network does fairly well in reconstructing the original distribution, failing mostly only for degrees zero and one, despite the significant distortion caused by the noisy measurement process.

Edge prediction: network de-noising and completion
The reconstruction task we have been considering shares many similarities with the task of model-based edge prediction [5,6], but is also different from it in some fundamental aspects.Most typically, edge prediction is formulated as a binary classification task [7], in which to each missing (or spurious) edge is attributed a "score" (which may or may not be a probability), so that those that reach a pre-specified discrimination threshold are classified as true edges (or true nonedges).This threshold is an input of the procedure, and usually the quality of the classification is assessed by integrating the true positive rate versus the false positive rate [a.k.a. the Receiver Operating Characteristic (ROC) curve] for all discrimination threshold values.This yields the Area Under the Curve (AUC), which lies in the unit interval, and can be equivalently interpreted as the probability that a randomly selected true positive will be ranked above a randomly chosen true negative.Thus, a value of 1/2 indicates a performance equivalent to a random guess, and a value of 1 indicates "perfect" classification (note that a classifier with AUC value of 1 still requires the correct discrimination threshold as an input to fully recover the network).
In contrast, the reconstruction task considered here yields a full posterior distribution P (A|n, x) for the inferred network A. Although this can be used to perform the same binary classification task, by using the posterior marginal probabilities π ij as the aforementioned "scores," it contains substantially more information.For example, the number of missing and spurious edges (and hence the inferred probabilities p and q) are contained in this distribution, and thus do not need to be pre-specified.Indeed, our method lacks any kind of ad hoc input, such as a discrimination threshold (note that the threshold 1/2 in the MMP estimator of Eq. 44 is a derived optimum, not an input).This means that absolute assessments such as the similarity of Eq. 45 can be computed instead of relative ones such as the AUC.Furthermore, the reconstruction approach can be used to recover summary quantities and perform error estimates, which is usually not directly possible in the binary classifier framing.In addition, reconstructed networks can contain spurious and missing edges simultaneously, whereas with traditional edge prediction methods, they each require their own binary classification (with their own discrimination thresholds).
When doing edge prediction, one often distinguishes recovering from the effects of noise (i.e. an edge has been transformed into a nonedge, or vice versa) -to which we refer as de-noising -and from a lack of observation (i.e. a given entry in the adjacency matrix is unknown) -to which we refer as completion.In each scenario the scores are computed differently, yielding different classifiers.When performing reconstruction with our method, we inherently allow for any arbitrary combination of denoising and completion: if an entry is not observed, it has a value of n ij = 0, which is different from it being observed with n ij > 0 as a nonedge x ij = 0.If the noise parameters p and q are zero, recovery via the posterior distribution amounts to a pure completion task for the entries with n ij = 0, and likewise we have a pure denoising task if n ij > 0 for every pair (i, j), otherwise we have a mixture of these two tasks.
In Fig. 7 we illustrate some of these tasks, performed using our framework for the openflights dataset, which we found to be representative of the majority investigated.In Fig. 7a and b are shown the results for edge (q = 0) and nonedge (p = 0) de-noising, respectively.Given that this network is sparse, the probability of an edge is on average much smaller than that of a nonedge, which means that the edge de-noising task is significantly harder than nonedge de-noising, for which very high accuracy can be obtained even for n = 1 measurement per edge.Nevertheless, positive reconstruction is possible in each case, approaching a similarity of 1 as the number of measurements is increased.
We also perform network completion by choosing a fraction f of edges or nonedges, for which zero measurements are performed, n ij = 0, while the remaining entries are observed n times, n ij = n.In Fig. 7c and d are shown the reconstruction results for edge and nonedge completion, respectively.Like for de-noising, nonedge completion is easier, approaching near perfection for the entire range of parameters, and for the same reason as before.For the completion tasks, however, the number of observations n for the non-affected entries has a negligible effect in the reconstruction, and we observe near-optimal performance already for n = 1.
Although the number of edges and nonedges affected is the same for both our de-noising and completion examples, the latter yields a larger rate of successful reconstruction for both edges and nonedges.This is understood by noting that these tasks have a different number of unknowns.In the case of edge completion, on the one hand, for a given finite fraction f of non-observed edges, we have O(E) unknowns, which for sparse networks is O(N ).For edge de-noising, on the other hand, for any fraction p of missing edges, for sparse networks we have in principle O(N 2 ) possibilities for their placements, corresponding to all observed nonedges.For nonedge denoising and completion, the difficulty is comparable: For any fraction f = O(1/N ) left unobserved, or q = O(1/N ) transformed into spurious edges, there are O(N ) unknowns, if the network is sparse.However, the actual number of unknowns for nonedge completion is strictly smaller, as it must involve only the fraction not observed, whereas for de-noising it involves every observed edge.
This difference in performance shows how the correct interpretation of the data can be crucial -as absence of evidence is not evidence of absence.Unfortunately, most available datasets fail to make this distinction, including those few which actually provide some amount of error assessments, as they do not indicate which pairs of nodes have not been measured at all.

Detectability of modular structures
Our approach generalizes the task of community detection for networks with measurement errors.However, even in the case of error-free networks with planted community structure, this task is not always realizable.This is most often illustrated with a simple SBM parametrization known as the planted partition model (PP), with equal-sized groups, n r = N/B.As has been shown in Ref. [29], the detection of communities from networks sampled from this model undergoes as phase transition, and becomes impossible for parameter values satisfying where k = N [ω in +(B −1)ω out ]/B is the average degree of the network.This transition means that even though a PP model may contain assortative community structure with ω in > ω out , the individual samples from the generative model will be indistinguishable from a fully random graph if the inequality of Eq. 55 is fulfilled, and hence will contain no information useful for the recovery of the planted communities.When considering measured networks, it is expected that the introduced errors will make the detection task more difficult, as the noise will remove information from the data.As we have seen in Sec.II A 1, when a single measurement of a SBM network is made with noise parameters p and q, it becomes indistinguishable from a SBM sample with effective probabilities ω , given by Eq. 36.Applying this to the PP model, yields a transition according to For positive error magnitudes p > 0 or q > 0, the above threshold will be larger than Eq.55.This highlights how measurement noise can hinder the detection of largescale structures if they are sufficiently weak, and induce a phase transition in their detection.This also means that the reconstruction of the networks themselves will be affected by the same transition, as our approach hinges on the detectability of these large-scale structures.In Fig. 8 are shown the reconstruction results for PP network samples with B = 2 groups, for simulated measurements always using q = 0, but with either p = 0 or p = 1/2.Without measurement noise, p = 0, the detectability of the planted partition is possible all the way down to the detectability threshold of Eq. 55.Despite the lack of noise, the similarity with the true network is only slightly above 0.6 in the detectable region.This is because the probabilities in this ensemble are not sufficiently heterogeneous to rule out high noise values, as some of the empirical networks we have considered.Below the transition, the similarity falls to zero, as the network becomes indistinguishable from a fully random one.Interestingly, this partial uncertainty about the network does not affect the inference of the node partition.If we increase the noise to p = 1/2, the partition recovery is possible up to the threshold of Eq. 56 when only n = 1 measurements are made.However, after sufficiently increasing n, the effects of noise are diminished, and the original threshold can be achieved.In this case, the similarity also becomes high even below the detectability threshold, where the community structure itself cannot be recovered.This is because the repeated measurements themselves yield sufficient information about the network structure, and the reconstruction no longer needs to rely on the network structure itself.

D. Reconstruction of empirical data and uncertainty assessment
A central advantage of our method is that it can be used to reconstruct noisy networks when only a single measurement has been made for each entry in the adjacency matrix, and no error assessment is known.As the majority of network data can be cast into this framework, our method can be used to reconstruct them and give uncertainty assessments for quantities of interest.In this section we discuss a few empirical examples.
We focus first on the neural network of the Caenorhabditis elegans worm.It has been used extensively as a model organism, and it had its full neural network mapped in 1986 by White et al [28].The network measurement has been done by electron microscopy of transverse serial sections of the animal's body of about 50 nm thickness, amounting to around 8000 images.Based on these images, the network was reconstructed by painstaking manual tracing of the neuron paths across the different images.The reliability of the reconstruction procedure was discussed in Ref. [28], where human error in tracing the neuron bundles, the orientation of the neurons with respect to the transverse section, and poor image quality were identified as the main sources of potential errors.White et al. employed a series of error mitigating procedures, such as detecting basic connection inconsistencies, exploiting the partial bilateral symmetry for suspect connections, and comparing with independent reconstructions of parts of the network.Although the authors of that work profess to be "reasonably confident" that the structure they present is "substantially correct," they do not exclude the possibility of remaining errors, nor do they quantify in any way the uncertainty of their measurements.Furthermore, the data commonly used for network analysis, which we also use here, has been manually compiled by Watts et al. [30], based on the original data of Ref. [28], and may contain further errors.The resulting data we use amount to N = 302 nodes and E = 2, 345 directed edges (note that five nodes were excluded in Ref. [30] for not having any connections. We include these nodes in our analysis, as it is suspicious that isolated neurons can exist, and thus is probably a symptom of missing data).When we employ our reconstruction procedure on the C. elegans data, we find the results shown in Figs. 9 and 10, and summarized in Table I.The MMP estimate of this network contains Ê = 2, 773 edges, but the posterior distribution is significantly broad, and contains on .Average similarity between the posterior samples and the measured C. elegans data as function of the hyperparameter β (with α = 1), which controls the prior belief on the probability p of missing edges (the average of which is shown in the x axis).For reference, the similarity for the MMP estimate is also shown.
average E = 3, 950 edges, meaning that there are many potential edges with low but non-negligible probabilities.We note that our reconstruction connects the isolated nodes in the data to the main hub in the network, which is an important neuron situated in the head of the worm.As seen in Fig. 10a the inferred degree assortativity coefficient is compatible with the value measured directly from data, and our method is capable of providing a confidence interval for this estimation.The same is not true for the average local clustering coefficient, as seen in Fig. 10b, which is not compatible with the value measured directly from data with any reasonable confidence.
For the C. elegans data, the inferred error rates are (p, q) = (0.4,6 × 10 −5 ).Although this corresponds to a very high accuracy with respect to spurious edges, it indicates a low accuracy with respect to missing edges, and it implies that almost half of the original edges were misrepresented as nonedges.Although the consensus of the posterior distribution (represented by the MMP estimate) is reasonably close to the original data, with a similarity of 0.93, the similarity averaged over the posterior distribution is only 0.74 indicating a fair amount of uncertainty.This seems to contradict the qualitative assessment of Ref. [28], which argued in favor of the reliability of their data.This discrepancy can be interpreted in two ways: 1.The assessment in Ref. [28] was too optimistic, and the data contains indeed more errors than anticipated; 2. The data actually contains fewer errors than our method predicts, but the true network itself is not sufficiently structured to rule out errors in a manner that can be exploited by our method.However, even if case 2 happens to be true, our method correctly projects an agnostic prior assumption about the error rates onto the posterior distribution, after being informed by the data.This means that more confidence on the data and the existence of fewer errors must be accompanied by either more data (e.g.repeated measurements), or a more refined prior information on the error rates, obtained either by calibration or a quantitative study of the methods employed in Ref. [28].As an illustration, in Fig. 11 is shown the posterior similarity with the date obtained with different choices of the hyperparameter β, using α = 1, which control the prior knowledge on the value of p, with an average given by p = α/(α + β).A high accuracy of the data, with inferred similarities approaching one, is only achieved by a prior belief on p being on the order of 0.01 or smaller.This means that one should trust the claimed high accuracy in Ref. [28] only if one is confident that the probability of an edge not being recognized as such was below one percent.This might very well be true, but would need to be substantiated with further evidence.Although in situations such as these our method cannot fully resolve the discrepancy without further data, it serves as the appropriate framework in which to place the issue, and shows that any analysis that takes the original measured data for granted, ignoring potential errors, inherently assumes more reliability than can be inferred from the data alone.
For other kinds of data, it is possible to obtain very accurate reconstructions with single measurements.As an example, we consider the network of collaborations in papers published in the cond-mat section of the arxiv.orgpre-print website in the period spanning from January 1, 1995 and March 31, 2005, where authors are nodes, and an edge exists if two authors published a paper together [31].This network was compiled by crawling through the website interface, and could contain errors due to incorrect parsing. 10When reconstructed using our method, however, we find that it is remarkably accurate, with very low error rates inferred as (p, q) = (3×10 −5 , 3×10 −9 ).As can be seen in Fig. 12, all inferred properties match very closely the direct measurementalthough our reconstruction is still useful in providing error estimates for them.
In Table I we provide a summary of reconstruction results with our method to several empirical networks.We observe a tendency of larger networks to be more accurate than smaller ones.This is not a trivial result of there being more data, but rather of these larger networks containing stronger structures which are informative of low measurement noise.If these networks were fully random, their reconstruction accuracy would have been very poor, regardless of their size. 10These kinds of data also tend suffer from name ambiguity problems, where the same author appears under different names, due, for example, to alternative spellings.But since this causes node duplications to occur, it cannot be corrected with our method, which can address only spurious and missing edges.

E. Heterogeneous errors
So far we have considered only the situation where the error probabilities p and q are the same for every pair of nodes in the network.Although it is easy to imagine a simplified scenario where the same measurement instrument is used in every case, it is also easy to imagine situations where this is not an adequate representation of how measurement is made.For example, in the case of the C. elegans neural network, the spatial proximity of the neurons may make it harder or easier to measure the edges and nonedges, thus impacting their error probabilities.
With this in mind, it is easy to generalize our framework to allow for individual error probabilities p ij and q ij , for missing and spurious edges between nodes i and j, respectively.Given a true underlying entry A ij between these two nodes, its measurement probability is Table I.Reconstruction results for empirical networks with single measurements per edge, and no available primary error assessments.Similarity refers to the average of S(A, A * ) over the posterior distribution.For each quantity (number of edges, degree assortativity, average local clustering) is shown the value directly obtained from the data (direct) and the average over the posterior distribution (estimated).The value of Be is the posterior average of the effective number of inferred communities e H(n) , with H(n) = − r (nr/N ) ln(nr/N ), where nr is the number of nodes in group r, being the entropy of the group size distribution.The values p and q are the posterior averages of the error rates.In all cases, the parentheses indicate the standard deviation over the posterior distribution.Dataset descriptions are given in Appendix E. given by Using the same Beta priors as before, we can integrate over p ij and q ij , obtaining With this we have the full conditional distribution for the measured network, ) with which we can obtain the posterior distribution of Eq. 3.However, unlike the case with uniform errors, the choice of hyperparameters is now vital.The noninformative assumption α = β = µ = ν = 1 applied above makes the likelihood independent of the planted network A, rendering the data completely uninformative as well.This means we must have some global information that specifies how the values of p ij and q ij are distributed.Although we could simply set (or fit) the values of the hyperparameters to values different from one, we favor a nonparametric approach, and we include the hyperparameters in the posterior distribution, P (A, b, α, β, µ, ν|n, x) = P (x|n, A, α, β, µ, ν)P (A|b)P (b)P (α, β, µ, ν) P (x|n) (60) which requires their own hyperprior distribution P (α, β, µ, ν).Here we will be agnostic and use a constant prior P (α, β, µ, ν) ∝ 1, with an unspecified and unnecessary normalization constant, as it cancels out in the posterior distribution. 11The inference algorithm is the same as before, but in addition to move proposals for the network A and node partition b, we make also move proposals for the hyperparameters.Like in the uniform case, we can obtain the posterior distribution for the error probabilities via their conditional posteriors, i.e.
and likewise for q ij with averaged over the posterior distribution.Posterior probability density We note that for heterogeneous error rates, the case with single measurements n ij = 1 become less interesting.If we replace n ij = 1 and x ij ∈ {0, 1} in the above equations, they become identical to Eq. 15 for the case with uniform errors, if we make the substitution In this situation, only the prior averages of p ij and q ij matter, not their variance.A uniform prior for α, β, µ and ν is equivalent to Beta priors with parameters (1, 0) for p and q computed via the equation above, 12 and hence this approach becomes completely identical to the one with uniform errors considered before.Therefore, there is no sufficient data in the single measurement case to detect heterogeneous errors of this kind, and thus a meaningful use of this method is confined to data with n ij > 1.
Note also that this implies that any error heterogeneity present in the data will be conflated with underlying network structure when single measurements are made.Ul-timately, this conflation can only be resolved by making multiple measurements.We consider two datasets which contain multiple measurements, in order to compare both approaches.We consider the reality mining dataset, which recorded proximity interactions between voluntary students over time [32].Following Ref. [10], as measurements we considered the state of the network during eight consecutive Wednesdays in March and April of 2005, so chosen to avoid weekly periodic events.In addition, we consider the human connectome, using data from the Budapest Reference Connectome [33] (which itself is based on primary data from the Human Connectome Project [34]).This dataset contain records of the neuronal connections of 418 individuals, each of which we considered as a separate measurement.
For both datasets considered -as it is arguably always true whenever multiple network measurements are made -it is debatable whether there is really a true single network behind the measurements, as our method assumes.For example, in the reality mining dataset, the underlying network could be changing over time, and the connectome can vary between individuals for physiological reasons, rather than measurement error.In each case, however, we are free to keep the mathematical structure of our model in place, and change its interpretation.We could, for instance, assume that the single network being inferred amounts simply to a consensus or a blue print of the network, and the "error" rates p ij and q ij indicate the variability of each single edge or nonedge around this blue print.Since both scenarios are generally conflated when making this kind of measurement, we can choose the interpretation that is most suitable according to the context.
In Fig. 13a and d are shown the distributions of the measured frequencies of edge occurrences, x ij , for both datasets.For the human connectome, we observe a very broad distribution, with occurrences present in the entire possible range.In Fig. 13b and e we see the simulated results by sampling parameters from the posterior distribution and generating new data from them, using in this case the model with uniform errors.Whereas the results for reality mining are reasonably close to the data, the results for the human connectome show an obvious discrepancy, where the generated data is concentrated around two modes, corresponding to the frequencies of edges and nonedges.Indeed, for the uniform model this separation is guaranteed to occur for any given p = 1/2 and q = 1/2 and a sufficiently large number of measurements.The fact that this is not observed in the data is a clear indication that the error rates are not uniform (or alternatively, but mathematically equivalently, that there is no single network behind the measurements).Indeed when using the nonuniform model, it recovers the observed frequency almost perfectly, as seen in Figs.13c  and f.
If we look more closely at the human connectome data, we see that both approaches give us different pictures Table II.Reconstruction results for empirical networks with multiple measurements per edge.For each quantity is show the value obtained using either the uniform or the nonuniform model, as indicated.The value of Be = e H(n) is the effective number of inferred communities, computed as H(n) = − r (nr/N ) ln(nr/N ), where nr is the number of nodes in group r.The values p and q are the posterior averages of the error rates.In all cases, the parentheses indicate the standard deviation over the posterior distribution.Dataset descriptions are given in Appendix E.
of the underlying network structure.As is summarized in Table II, the uniform model yields a sparser network, which nevertheless seems more finely structured, with close to 100 effective groups detected.Conversely, the nonuniform model yields a denser network, with a more uniform structure, and only half as many identified groups.In Fig. 14 we see more clearly the differences between both results.Both are capable of uncovering the hemispherical divisions and the partial bilateral symmetry of the connectome.The nonuniform model can detect a larger number of edges, but it yields larger probabilities of missing edges p ij which are heterogeneously distributed.In Fig. 14c it can be seen that the inferred p ij are strongly correlated with the detected group structure, and in particular seem to indicate a rather stable set of edges (low p ij ) that belong mostly to the left hemisphere.The uniform model, on the other hand, incorporates the variability of edge occurrences in the model itself, subdividing the groups further to accommodate it.Therefore, the nonuniform model gives a more faithful separation between the consensus and the variability around it.
In Fig. 15 we can see the posterior distributions of p ij , q ij for the nonuniform model, as well p and q for the uniform model, showing how the former is indeed significantly more heterogeneous than the latter.In Fig. 15c is also shown the distribution of posterior probabilities π ij for both models, in addition to the naive estimate πij = x ij /n ij .This naive estimate is crude, as it does not differentiate between the different sources of error (spurious or missing edge), and does not take into account the observed correlations between the different entries.Indeed, as the Fig. 15c shows, it leads to very different results, which are not correctly justified, and should be avoided.

III. INCORPORATING EXTRINSIC UNCERTAINTY ESTIMATES
So far we have considered only situations where direct error estimates on the edges originate from repeated measurements.However, there are situations where primary error estimates are made under different formats.Here we consider the scenario of Ref. [35], where an arbitrary measurement process is made which yields uncertainty assessments for each node pair, Q ij ∈ [0, 1], interpreted as conditionally independent probabilities, i.e.
In principle, we could use these probabilities as they are, and generate networks and measure their properties from this distribution.But we could also extract from this information the measurement process which it represents, and couple it with our reconstruction approach.This gives us the advantage of being able to use the large scale structure in the data to better inform our estimates of the underlying network.
The distribution P Q (A|Q) implies the following noisy measurement process, with normalization constant If we assume the prior on the edge uncertainties are identically distributed and conditionally independent, i.e.
we have Combining these together we have The above depends on an unknown prior P Q (Q).Determining it would require us to delve into the details of how this measurement is made, which is unavailable to us if all we know is P Q (A|Q).However since it is only a multiplicative constant that does not depend on the data or any latent variable, it will not affect the posterior distribution, and thus we do not need to determine it.The single aspect of this distribution that is relevant is its average, Q.By allowing only for a minor violation of the Bayesian ansatz, we can estimate this directly from data With this, we can couple this arbitrary noise generating process with our overall framework by taking D = Q, and obtaining the posterior distribution where P (A) assumes that the network has been generated by a SBM.Note that P (A|Q) = P Q (A|Q), as we are keeping the same noise generating process, but changing our prior assumption about the data.As desired, our prior is structured, and is capable of detecting large-scale patterns -latent groups of nodes and their probabilities of connections, as well as node degrees and hierarchical structure -to inform our inference.This also highlights the versatility of our framework, as we are free to replace the measurement model as appropriate.
Although our derivation is somewhat different, equations Eq. 65 to 71 above are the same as in Ref. [35].The resulting posterior of Eq. 72, however, is different, as our approach is nonparametric, and hence can be used to infer the number of groups, and does not involve any approximations that rely on the network being sparse or locally tree-like.
In Fig. 16 we show the results for the protein-protein interaction network of Escherichia coli, for which error estimates in the form of Q ij probabilities are provided [36].The probabilities are computed in an elaborate manner by combining seven sources of evidence for the existence of an interaction between two proteins.As seen in the figure, our method is able to detect prominent large-scale features that help shape the posterior distribution.The resulting posterior probabilities are fairly different from the primary error estimates, showing that these observed correlations can be very informative for the reconstruction process.

IV. CONCLUSION
We have presented a general nonparametric Bayesian network reconstruction framework that couples a noisy measurement model with the stochastic block model (SBM) as a generative process.The posterior distribution of this joint model yields simultaneously an ensemble of possibilities for the underlying network, as well as its large-scale hierarchical modular organization.As we have shown, this joint identification of the network structure enables the existence of correlations in the measured data to inform the network reconstruction.As a consequence, our method can be employed also when a single measurement of the network has been made -which is not possible with methods that do not exploit such correlations -and the error probabilities are unknown.This property makes our approach applicable to the dominating set of network datasets that do not provide primary error estimates of any kind, and can extract from them not only the most likely underlying network, but also error estimates for arbitrary network properties.
We have shown that our general methodology is versatile, allowing for different noise models.We have considered the situation where the error probabilities are heterogeneous, showing strong evidence for its existence in empirical data, and demonstrated the efficacy of our modified approach in capturing it.We have also shown how extraneous uncertainty estimations obtained with arbitrary methods can be incorporated into our approach, without requiring a detailed model for their generation.
The approach we have proposed is open ended, and admits many extensions and generalizations.For example, although the SBM can be used to exploit edge correlations if favor of reconstruction, this can be further improved by considering more realistic models that include other kinds of correlations such as triadic closure [37] or latent spaces [38,39].Furthermore, there is a wide range of possibilities for other kinds of noise models different from the ones considered here, including missing and duplicated nodes, and edge endpoint swaps (e.g. that can occur from crossings in imaging data).Additionally, network data often come with a wealth of node and edge annotations [40,41], with important special cases being weighted [42,43] and multilayer [44,45] networks.These extra data are potentially useful for reconstruction, although they also contain their own measurement errors.Determining the most appropriate and effective manner to model and exploit this extra information in reconstruction seems like fertile grounds for future work.
biasing with respect to group assortativity.The parameters d and do not affect the correctness of the algorithm, only the mixing time, which is typically not very sensitive, provided they are chosen within a reasonable range (we used d = 0.01 and = 1 throughout).When using the HDC-SBM, we used the variation of the above for hierarchical partitions described in Ref. [14].The move proposals above require only minimal bookkeeping of the number edges incident on each group, and can be made in time O(k i ), which is also the time required to compute the ratio in Eq.B3, independent on how many groups are currently occupied.
For the network move proposals we could have used simple edge/nonedge flips with with δ ∈ {−1, 1}.But in fact, since we operate with latent multigraphs, the moves are slightly different, as described in Appendix D. The correctness of the algorithm does not depend on the order or the frequency with which we attempt to update the entries (i, j), provided they are all eventually updated, so in principle we could choose them randomly each time.However, we have found this leads to poor mixing times, since most entries correspond to nonedges A ij = 0 which tend to remain in that state.Instead, we choose the entries to update with a probability given by the current SBM, with being the probability of selecting node i from its group b j , proportional to its current degree plus one, and m rs = e rs + 1 tu e rs + 1 is the probability of selecting groups (r, s), where e rs = ij A ij δ bi,r δ bj ,s .The above probabilities guarantee that every entry will be eventually sampled, but it tends to probe denser regions more frequently, which we found to typically lead to faster mixing times.This sampling can be done in time O(1), simply by keeping urns of vertices and edges according to the group memberships.The time required to compute the ratio in Eq.B3 is also O(1) for the DC-SBM and O(L) for the HDC-SBM, where L is the hierarchy depth, again independent of the number of occupied groups.
When combining both move proposals above for the partition and network, the time required to perform V node proposals and M edge proposals is O( k V + M ), where k is the average degree, which allows for the inference of very large networks, with up to millions of edges.A reference implementation of the above algorithm is freely available as part of the graph-tool library [47].
which is done simply by sampling from P (A, G, b|D) and ignoring the actual magnitudes of the G ij values, and the diagonal entries.This yields an almost identical MCMC algorithm to the one described in Appendix B, with the only difference that we keep track of the values of G ij , which are no longer binary, but automatically give us A ij [which are used for the computation of P (D|A)].The move proposals of the entries of G ij are done by unity changes, if G ij = 0 and δ = 1, 0 otherwise, (D8) again for δ ∈ {−1, 1}.
In the case of the DC-SBM, the degree correction happens for the multigraph G, and only indirectly for A. But since our model is nonparametric, and the degrees of G are also generated from their own priors, this gives us a perfectly valid and useful degree-corrected model for A as well.

Appendix E: Datasets
Here we give brief descriptions of the datasets used in this work, with properties listed in tables I and II.

Data without primary error estimates
Karate club: Social network between 34 members of a Karate club [23].The version used in Table I is the same one used in Ref. [24], with A 23,34 = 1 and hence 78 edges in total.In Table II, it was assumed that each repeated entry of the adjacency matrix reported in Ref. [23] amounted to a different measurement, so that n ij = 2 and x ij = 2A ij for all (i, j), except for x 23,34 = 1.9/11 terrorists: Social associations between 62 terrorists responsible for the 9/11 attacks [21,22].
American football: Network of American football games between Division IA colleges during the regular season in fall of 2000 [24].
Network scientists: Coauthorship network of scientists working on network science [51].
C. elegans neural: Directed neural network of the Caenorhabditis elegans worm [28], manually compiled by Watts et al. [30], based on the original data.The 5 nodes with zero degree omitted in Ref. [30] were included in our analysis, resulting in N = 302 nodes in total.
Power grid: Western states power grid of the United States [30].
Political blogs: Citations between political blogs during the 2004 presidential election in the United States [53].
DBLP citations: Citation network of DBLP, a database of scientific publications [54].
Openflights: Directed network of flights between worldwide airports, collected from the community-driven website http://www.openflights.org.
Reactome: Network of protein-protein interactions in humans.[55] cond-mat: Network of collaborations in papers published in the cond-mat section of the arxiv.orgpre-print website in the period spanning from January 1, 1995 and March 31, 2005 [31].
Enron email: Emails sent between employees of Enron between 1999 and 2003 [56].
Linux source: Network of Linux source code files, with directed edges denoting that they include each other [19].
Brightkite: Online social network from the defunct brightkite website.
PGP: Global web of trust of the Pretty-Good-Privacy (PGP) encryption protocol.Nodes are public keys, and directed edges indicate that one key digitally signed another [57].
Internet AS: Directed network of internet autonomous systems, ca.2009, as measured by the Center for Applied Internet Data Analysis (CAIDA), available at https://www.caida.org/data/.
Web Stanford: Directed network of hyperlinks between the web pages from the website of the Stanford University [58].
Flickr: Network of images in the image-sharing site http://flickr.com,where two images are connected if they share metadata, such tags, groups or location [59].

Data with primary error estimates
Reality mining: Proximity interactions between voluntary students over time [32].As measurements we considered the state of the network during eight consecutive Wednesdays in March and April of 2005.
School friends: Directed network of friendship between primary and high-school students [60].Each student have been asked repeatedly to list his or her best 5 female and 5 male friends.
Human connectome: Neuronal connections in the human brain, measured for 418 individuals, each of which we considered as a separate measurement [33].

Figure 3 .
Figure 3. Inferred Zachary's karate club network using the uncertain data from the original publication[23], which contains an ambiguous edge(23,34), as explained in the text.(a) Layout of the reconstructed network showing the posterior edge probabilities as edge thickness, according to the HDC-SBM, and the ambiguous edge in red.The node colors correspond to a sample from the posterior distribution of the node partitions.(b) Posterior probability density of the probability of edge(23,34), conditioned on the remaining edges and model parameters, for the different model variants indicated in the legend and explained in the text.The vertical dashed lines indicate the distribution averages, corresponding to the marginal posterior probability of the edge.

Figure 4 .
Figure 4. Reconstruction performance for political blogs (top row) and openflights (bottom row) networks.In each case, the empirical network was considered as the true network, and simulated measurements were made for several values of missing edge probability p, with a spurious edge probability q = pE/[ N 2 − E]. [(a) and (e)] Similarity of the MMP estimator to the true network, S( Â, A * ), as a function of p, and for several values of the number of repeated measurements, n. [(b), (c), (f), and (g)] Posterior average local clustering and degree assortativity coefficients, according to the same legend as (a) and (e).[(d) and (g)] KL divergence between true and inferred degree distributions, as discussed in the text.In all cases [(a) to (h)] the dashed curve shows the corresponding value obtained directly with the measured data with n = 1, and the solid horizontal line marks the true value corresponding to perfect reconstruction.

Figure 5 .
Figure 5. Inferred values of noise magnitude p and q as a function of the planted values, for the same simulated measurements described in Fig. 4, for the political blogs [(a) and (b)] and openflights [(c) and (d)] networks.

Figure 7 .
Figure 7. (a) Edge de-noising reconstruction performance for the openflights data, as a function of the missing edge probability p, for various n, and q = 0.The dashed curve shows the corresponding value obtained directly with the measured data with n = 1, and the inset shows the difference between the curve for n = 1 and the dashed curve.(b) Same as (a) but for nonedge de-noising, with p = 0.The values of q were chosen to yield the same number of affected nonedges as edges in (a).(c) Edge completion reconstruction performance as a function of fraction f of unobserved edges.The dashed line shows the value of similarity obtained by considering the unobserved edges as nonedges.(d) Same as (c) but for nonedge completion, as a function of the fraction f of unobserved nonedges.The dashed line shows the value of similarity obtained by considering the unobserved nonedges as edges.

Figure 8 .
Figure 8.(a) Normalized mutual information (NMI) between planted and inferred partitions for a PP model with N = 10 4 , B = 2, k = 10, and measurement errors q = 0 and p given in the legend, together with the number of measurements n.The black solid line marks the threshold of Eq. 55, and the blue dashed line the threshold of Eq. 56 with (p, q) = (1/2, 0).(b) Same as in (a), but for the similarity S( Â, A * ) between the inferred and true networks.

Figure 9 .
Figure 9. (a) Measured neural network of the C. elegans worm [28].(b) Marginal posterior distribution πij of the edges according to our reconstruction method, shown as edge colors.(c) Maximum marginal posterior (MMP) estimate of the network, with inferred missing edges shown in red, and spurious edges shown in green.

Figure 10 .
Figure 10.Reconstruction statistics for the neural network of C. elegans.(a) Posterior distribution of the degree assortativity coefficient.The black dashed line marks the mean of the distribution, and the blue dashed line the value obtained for the MMP estimate, Â.The red solid line marks the value computed directly from the data.(b) Same as (a) but for the average local clustering coefficient.(c) Measured and estimated degree distributions.(d) Posterior distributions for the error probabilities p and q.

3 β 11
Figure 11.Average similarity between the posterior samples and the measured C. elegans data as function of the hyperparameter β (with α = 1), which controls the prior belief on the probability p of missing edges (the average of which is shown in the x axis).For reference, the similarity for the MMP estimate is also shown.

Figure 12 .
Figure 12.Reconstruction statistics for the co-authorship network of arxiv.org.(a) Posterior distribution of the degree assortativity coefficient.The black dashed line marks the mean of the distribution, and the blue dashed line the value obtained for the MMP estimate, Â.The red solid line marks the value computed directly from the data.(b) Same as (a) but for the average local clustering coefficient.(c) Measured and estimated degree distributions.(d) Posterior distributions for the error probabilities p and q.

Figure 13 .
Figure 13.Distribution of edge occurrences, xij, for the reality mining (top row) and human connectome (bottom row) datasets.[(a) and (d)] Empirical data.[(b) and (e)] Generated from inferred parameters, according to the uniform model.[(c) and (f)] Generated from inferred parameters, according to the nonuniform model.

Figure 14 .
Figure 14.Reconstruction results for the human connectome.(a) Marginal posterior distribution of edges πij and inferred hierarchical partition, according to the model with uniform errors.The upper hierarchy branch corresponds to the right hemisphere.(b) Same as (a) but with the nonuniform model.(c) Inferred missing edge probabilities pij for the nonuniform model.(d) Same as (c) but for the spurious edge probabilities, qij.

Figure 15 .
Figure 15.Inferred uncertainties for the human connectome.(a) Posterior distribution of pij and qij, using the nonuniform model.(b) Posterior distribution of p and q, using the uniform model.(c) Distribution of posterior marginal edge probabilities πij, according to both model variants, as well as the naive estimate πij = xij/nij.

Figure 16 .
Figure 16.(a) Inferred E. coli protein interaction network, according to uncertain data Q, using the MMP estimator from the posterior P (A|Q).(b) Difference between (a) and the MMP estimator using the original uncertainties Q directly, via PQ(A|Q) (Eq.65).Green edges are those that are added in (a), and red ones are removed.(The hierarchical partition is the same as in (a), and is shown only as a visual aid.)(c) Distribution of marginal posterior probabilities πij and original uncertainties Qij.