Randomization algorithms for large sparse networks

In many domains it is necessary to generate surrogate networks, e.g., for hypothesis testing of different properties of a network. Generating surrogate networks typically requires that different properties of the network are preserved, e.g., edges may not be added or deleted and edge weights may be restricted to certain intervals. In this paper we present an efficient property-preserving Markov chain Monte Carlo method termed CycleSampler for generating surrogate networks in which (1) edge weights are constrained to intervals and vertex strengths are preserved exactly, and (2) edge and vertex strengths are both constrained to intervals. These two types of constraints cover a wide variety of practical use cases. The method is applicable to both undirected and directed graphs. We empirically demonstrate the efficiency of the CycleSampler method on real-world data sets. We provide an implementation of CycleSampler in R, with parts implemented in C.


I. INTRODUCTION
In many applications it is useful to represent relationships between objects with a network in which vertices correspond to objects of interest and associations between objects are expressed with directed or undirected edges. The edges can also be weighted. Given such a network, one might be interested in questions such as community detection [1], clustering coefficients [2,3], centrality measures [4], shortest path distributions [5], or different measures of information propagation [6]. However, it is often useful to study whether a possibly interesting finding from a given network reflects a real phenomenon, or if it is merely caused by noise. A simple approach to this is to compare the original finding to findings from surrogate networks that share some relevant properties with the original network, but are otherwise inherently "random". For example, communities found in the original network should probably exhibit greater structure than communities in appropriately randomised networks. Usual solutions thus involve generating a number of surrogate networks by fixing some network properties of interest, and then drawing a uniform sample from the set of all networks satisfying the given properties.
Existing methods for sampling networks can be assigned into two categories: property-preserving and structure-preserving methods. Property-preserving methods [2,7-9] do not preserve network topology, i.e., they can introduce new edges and remove existing ones.
These approaches can be viewed as always considering a fully connected clique, inside which the edge weights are rearranged. Their aim is to preserve some network property of interest, such as node degrees or node weights. Property-preserving methods can be further divided into those preserving the property exactly or in expectation. For example, preserving node degrees exactly is relatively straightforward using, e.g., edge swaps [2,10]. Preserving higher-order statistics is often possible only in expectation [8]. That is, the expected value of the property remains equal to some given constraint, but its observed value in an individual sample may deviate from this constraint.
Structure-preserving methods [11], on the other hand, keep the network topology fixed (new edges are not inserted and existing ones are not removed), but usually maintain the desired property, e.g., node weights, only in expectation. Such approaches are usually based on maximum entropy models where surrogate networks are sampled simply by drawing edge weights from a parametrized i.i.d. distribution. While these methods are often computationally quite efficient, maintaining the desired property only in expectation may not be enough.
It is, e.g., conceivable that without additional constraints the network property of interest may occasionally take values that cannot be observed in real networks, and in these cases the sampled node weights may hence be unsuitable for the task of comparing an original finding to "random findings".
As a toy example, consider the network shown in Fig. 1, with six nodes and seven edges.
This network describes telephone calls between six individuals (the nodes) over an observation period of 24 hours. An edge between two nodes represents the total cumulative call duration (in hours) between two individuals. Our goal is now to sample networks having exactly the same edges as in the original network, i.e., we assume that people cannot communicate outside their friendship network and hence no new edges are created. We consider two different cases. First, we consider the case where the call durations between two individuals (the edge weights) can vary within an interval, but the node weights (sum of edge weights adjacent to a node) must stay fixed at their original values. That is, the total duration spent on the phone by every individual must remain the same. Second, we consider the case where both edge and node weights are allowed to vary within a given interval. In both cases the interval width is constrained by the simple fact that during a 24-hour period a person cannot spend more than 24 hours on the phone in total. Table I shows the range (min and max) of the edge and node weights for 10 000 samples for the two cases described above, obtained using (i) the method presented in this paper, termed CycleSampler, and (ii) a maximum entropy model. In Case 1, the CycleSampler method preserves the node weights exactly. (Notice that the range of node weights matches the node weights given in Fig. 1.) In Case 2 the CycleSampler method simultaneously preserves both edge and node weights within the interval 0-24 hours, while the maximum entropy method preserves these only in expectation. It is clear that the maximum entropy model easily satisfies constraints on edge weights, but violates the 24-hour node weight constraint. This is because the edge weights are sampled i.i.d. and thus for nodes having a large degree the sum of sampled weights on adjacent edges can easily increase beyond the maximum value allowed.
This limitation of the maximum entropy model is further highlighted in the numerical example presented in Figure 2. This example shows that even in a very simple case the majority of samples from a maximum entropy model will not satisfy hard constraints on node weights, even if the expected value of the node weights is preserved. Furthermore, the resulting distribution of edge weights is not uniform. The CycleSampler method proposed in this paper, on the other hand, produces a uniform sample that satisfies all constraints.

FIG. 1. Example network, with the edge weight denoting the total call duration in hours between two persons. The table on the right shows the node weights (sum of adjacent edge weights) for each node of the network.
Summary of contributions. In this paper we present the CycleSampler method, which is a structure-preserving sampling method for edge weights that explicitly maintains node weights within a given interval. This interval can be set to have zero width, in which case the node weights are maintained exactly in the sampled networks. The approach can be viewed as a generalisation of the property-preserving MCMC algorithm described in [2] to the structure-preserving case. However, the requirement to not introduce new edges or remove existing ones presents some nontrivial algorithmic challenges.
In short, our approach samples uniformly from the null space of the given network's incidence matrix (a binary matrix where vertices are rows and edges are columns), which requires constructing a basis for this null space. It is crucial that the basis is sparse, since the null space may be very high-dimensional (in the millions), and keeping a dense basis in memory, as found by textbook methods, is infeasible for larger networks. The problem of finding a sparse basis for general matrices has been studied previously [12,13], and some variants of it are NP-hard [14]. However, a sparse basis for the null space of an incidence matrix can be constructed very efficiently using a spanning tree of the original network [15]. This results in an efficient and scalable algorithm that can generate surrogate networks having millions of edges, while preserving node weights as described. We also present an empirical evaluation of the method on real-world data sets.

II. PROBLEM DEFINITION AND METHOD

A. Notation and problem statement

We consider a graph G = (V, E) with vertices V and edges E. The weight of an edge e ∈ E is denoted w(e) ∈ ℝ. We assume that there are no self-loops in the observed graph, i.e., |e| = 2 for all e ∈ E.
We use the neighbourhood function n(v), where v ∈ V, to represent the set of edges connected to a vertex v, i.e., n(v) = { e ∈ E | v ∈ e }. We define the weight of a vertex v ∈ V as the sum of the weights of the edges connected to it, i.e., W(v) = Σ_{e ∈ n(v)} w(e). In colloquial terms, our task is to obtain a uniform sample of edge weights w(e), such that the weights of every edge and every vertex remain within a given interval. This problem can be formally defined as follows.

Problem 1. Given a graph G = (V, E) with edge weights w(e), lower and upper bounds w_min(e) ≤ w(e) ≤ w_max(e) for every edge e ∈ E, and W_min(v) ≤ W(v) ≤ W_max(v) for every vertex v ∈ V, obtain a sample uniformly at random from the set of allowed edge weights W*, given by

W* = { w* ∈ ℝ^|E| | w_min(e) ≤ w*(e) ≤ w_max(e) for all e ∈ E, and W_min(v) ≤ Σ_{e ∈ n(v)} w*(e) ≤ W_max(v) for all v ∈ V }.   (3)

Existing network sampling algorithms cannot be directly applied to solve Problem 1 for two reasons. Firstly, we aim to preserve network structure, i.e., we can only modify weights on existing edges, not introduce new edges. Secondly, we allow vertex weights to vary within given intervals, while other approaches either aim to maintain them exactly, or only in expectation.
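To make the notation concrete, the following Python sketch (all names are hypothetical; the paper's actual implementation is in R and C) computes the neighbourhood function n(v), the vertex weights W(v), and membership in the allowed set W* for a toy graph:

```python
# Toy graph: edges as (vertex, vertex) pairs, plus edge weights w(e).
edges = {"e1": ("a", "b"), "e2": ("b", "c"), "e3": ("a", "c")}
w = {"e1": 1.0, "e2": 2.0, "e3": 0.5}

def n(v):
    """Neighbourhood function: the set of edges connected to vertex v."""
    return {e for e, (u1, u2) in edges.items() if v in (u1, u2)}

def W(v, weights):
    """Vertex weight: sum of the weights of the edges adjacent to v."""
    return sum(weights[e] for e in n(v))

def in_W_star(weights, edge_bounds, vertex_bounds):
    """Check that every edge weight and every vertex weight lies in its interval."""
    ok_edges = all(lo <= weights[e] <= hi
                   for e, (lo, hi) in edge_bounds.items())
    vertices = {v for pair in edges.values() for v in pair}
    ok_vertices = all(vertex_bounds[v][0] <= W(v, weights) <= vertex_bounds[v][1]
                      for v in vertices)
    return ok_edges and ok_vertices
```

A sampled weight vector is accepted exactly when `in_W_star` holds for it, which mirrors the definition of W* above.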

B. Interval constraints on vertices
In Problem 1 we allow the weight of each vertex v ∈ V to vary on the interval [W_min(v), W_max(v)]. This sampling problem becomes easier if these interval constraints are replaced with equality constraints, i.e., a variant of the problem where the vertex weights are preserved exactly.
To do this, we define an equivalent graph containing self-loops (edges of type e_v = {v}) where the vertex weights are fixed, i.e., W(v) = W_max(v) for all v ∈ V. The variability in the vertex weights is absorbed in the self-loops. This graph with self-loops can be constructed using the transformation shown in Fig. 3. Using this scheme, any graph with interval constraints on vertex weights and no self-loops can be transformed to a graph with equality constraints on vertex weights and self-loops, and vice versa.
We can now rewrite Eq. (3) as follows:

W* = { w* ∈ ℝ^|E| | w_min(e) ≤ w*(e) ≤ w_max(e) for all e ∈ E, and Σ_{e ∈ n(v)} (2/|e|) w*(e) = W(v) for all v ∈ V }.   (4)

The problem hence becomes to obtain a uniform sample from the set W* of Eq. (4). Note that in Eq. (4) the set of edges E now contains self-loops on those vertices that originally had interval constraints. In the following we therefore assume, without loss of generality, that there are only equality constraints on vertex weights, i.e., W_min(v) = W_max(v) = W(v) for all v ∈ V.

FIG. 3. Transformation between interval constraints and equality constraints on vertex weights. The two graphs are equivalent in the sense that the allowed ranges of the weights of the edges e₁, e₂, and e₃, as well as of the vertex weight (without the self-loop) w(e₁) + w(e₂) + w(e₃), are the same for both graphs. It follows that a set of allowed weights W* given by Eq. (3), defined without self-loops and with ranges for vertex weights, can be equivalently defined by a graph with self-loops and fixed vertex weights.
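A minimal Python sketch of this transformation follows. It assumes the incidence-matrix convention A_ve = 2·I(v ∈ e)/|e| introduced below, under which a self-loop contributes twice its weight to its vertex; the exact parametrisation used in Fig. 3 of the paper may differ.

```python
def add_self_loops(vertex_bounds):
    """For each vertex with interval constraint [W_min, W_max], fix the
    vertex weight at W_max and add a self-loop whose weight may vary on
    [0, (W_max - W_min) / 2].  Since a self-loop counts twice towards the
    vertex weight, the loop absorbs exactly the allowed slack."""
    fixed_weights, loop_bounds = {}, {}
    for v, (w_min, w_max) in vertex_bounds.items():
        fixed_weights[v] = w_max
        if w_max > w_min:  # add a self-loop only if there is slack to absorb
            loop_bounds[v] = (0.0, (w_max - w_min) / 2.0)
    return fixed_weights, loop_bounds
```

Vertices whose weight was already fixed (zero-width interval) get no self-loop, so the transformed graph is no larger than necessary.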

C. High level approach
We continue by outlining a sketch of our solution to Problem 1. At a high level our algorithm is a Markov chain Monte Carlo method similar to the algorithm proposed in [2].
It starts from the observed set of edge weights, introduces a small perturbation to a few of these at every step, and runs until convergence. Let C denote a collection of subsets of E, i.e., every E′ ∈ C is some (small) set of edges. The algorithm in [2] as well as ours can be sketched within a common framework as follows:
1. Initially, let the current state be the observed set of edge weights.
2. Select some E′ ∈ C uniformly at random.
3. Perturb the weights of every edge in E′ so that all constraints remain satisfied. (Exactly how this is done is described in detail below.)
4. Repeat steps 2-3 until convergence.
The main difference between the method in [2] and the method presented here concerns what the collection C contains. It is crucial to make sure that C is constructed such that the resulting Markov chain indeed converges to a uniform distribution over W * . In other words, every point in W * must be reachable from every other point in W * by a sequence of steps defined by C.
In the algorithm of [2], C contains all possible cycles of length four. Since in [2] the underlying graph is assumed to be a clique, there are plenty of such cycles, and it can be shown that these are enough for the Markov chain to reach a uniform distribution. Also, it is fairly easy to see that in cycles of length four the edge weights can always be adjusted in a simple manner so that all constraints remain satisfied. However, in our case finding a suitable C is complicated by the requirement of not introducing new edges. Simply choosing all cycles of length four from G is not enough. In the remainder of the paper we discuss our main technical contribution: an approach for constructing C in general undirected graphs (not only cliques) so that the resulting Markov chain converges to the uniform distribution over W * . In addition, we show how a simple transformation of the input graph allows us to extend the approach also to directed graphs.

D. Solution for undirected graphs
We first describe the solution for general undirected graphs. We assume for simplicity of discussion, and without loss of generality, that the input graph G consists of a single connected component. (In general, each connected component of the graph can be sampled independently by the sampling process introduced later.) The problem of constructing a suitable collection C of edge sets becomes easier if we view our sampling problem in terms of systems of linear equations. This way we can express our problem using known concepts from linear algebra.
As a first step, define the incidence matrix A ∈ {0, 1, 2}^{|V|×|E|} of the graph G in the usual manner as A_ve = 2·I(v ∈ e)/|e|. Here v ∈ V and e ∈ E, and I(·) is an indicator function which equals unity if its argument is true and is zero otherwise. Also, let W ∈ ℝ^|V| denote the vector of observed vertex weights defined by W_v = W(v) for all v ∈ V, and denote by w* ∈ ℝ^|E| the vector of edge weights. Given these, sampling uniformly from W* of Eq. (4) is equivalent to the problem of sampling uniformly from the set

S = { w* ∈ ℝ^|E| | A w* = W and w_min(e) ≤ w*(e) ≤ w_max(e) for all e ∈ E }.

By our assumption the original observed weight vector w is in W*. It follows that Aw = W and therefore w ∈ S.
For the moment, let us focus only on the underdetermined linear system Aw* = W. A simple known property of such systems is that their solution space can be expressed as the sum of a single known solution, such as w ∈ S, and any vector x from the null space of A,

Null(A) = { x ∈ ℝ^|E| | Ax = 0 }.

It is easy to see that Aw* = W, where w* = w + x, for any x ∈ Null(A). Because of the constraints on edge weights, we cannot simply use any x ∈ Null(A). Instead, x must come from a convex subset of Null(A). Therefore, the problem of sampling uniformly from S is equivalent to the problem of sampling uniformly from said convex subset of Null(A).
This is a known problem and could be solved using textbook methods, such as those described, e.g., in [16]. Those approaches, however, must usually compute a basis for Null(A), which in general is a dense matrix of size |E| × dim(Null(A)), where the dimension of the null space of A is of the same order of magnitude as |E|. While this is not a problem as long as the incidence matrix A is fairly small, even storing such a matrix clearly becomes infeasible for very large networks. However, since A is the incidence matrix of a network, a sparse basis is easily constructed by combining cycles of G, as shown, e.g., in [15].
In short, this works as follows. We first find a spanning tree T of G. Every edge that does not belong to T clearly induces a cycle when combined with edges in T . Given T , every such even-length cycle is directly an element of the basis, while odd-length cycles are paired together to form elements of the basis until it is complete. Details are given in Section II F below. Note that such a sparse basis is easily represented by a collection C of subsets of edges, together with appropriate weights for every edge.
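A minimal Python sketch of this construction follows (hypothetical data structures; the paper's implementation is in R and C). It finds a BFS spanning tree and classifies each non-tree edge by the parity of the cycle it induces; the parity of the path through the root equals the parity of the true graph cycle, because tree edges shared by the two root paths cancel in pairs.

```python
from collections import deque

def spanning_tree_cycles(vertices, edges):
    """Find a BFS spanning tree of the graph and, for every non-tree edge,
    the list of edges on the cycle it closes through the root.  Returns
    (tree_edges, even_cycles, odd_cycles); even cycles correspond to the
    'clean' edges of the paper, odd ones to the 'dirty' edges."""
    adj = {v: [] for v in vertices}
    for e, (u, v) in edges.items():
        adj[u].append((e, v))
        adj[v].append((e, u))
    root = next(iter(vertices))
    parent = {root: (None, None)}          # vertex -> (parent vertex, tree edge)
    queue = deque([root])
    while queue:                            # breadth-first search
        u = queue.popleft()
        for e, v in adj[u]:
            if v not in parent:
                parent[v] = (u, e)
                queue.append(v)
    tree_edges = {e for (_, e) in parent.values() if e is not None}

    def path_to_root(v):
        path = []
        while parent[v][0] is not None:
            u, e = parent[v]
            path.append(e)
            v = u
        return path

    even_cycles, odd_cycles = [], []
    for e, (u, v) in edges.items():
        if e in tree_edges:
            continue
        # Close the tree paths from u and v through the root with edge e;
        # edges shared by the two paths appear twice and cancel conceptually.
        cycle = [e] + path_to_root(u) + path_to_root(v)
        (even_cycles if len(cycle) % 2 == 0 else odd_cycles).append(cycle)
    return tree_edges, even_cycles, odd_cycles
```

On a 4-cycle every non-tree edge induces an even cycle, whereas on a triangle the single non-tree edge induces an odd cycle, matching the bipartiteness criterion discussed below.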

E. Directed graphs
Next we show that the above discussed solution to the sampling problem for undirected graphs (Problem 1) can also be applied to directed graphs, by first transforming the directed graph into an equivalent undirected graph. The algorithm proposed in this paper can therefore directly be used to sample weights for both directed and undirected graphs.
In a directed graph every edge e ∈ E is an ordered pair e = (u, v) with u, v ∈ V. We denote the outgoing edges of a vertex v ∈ V by n_out(v) = { e ∈ E | e = (v, u) for some u ∈ V } and the incoming edges as n_in(v) = { e ∈ E | e = (u, v) for some u ∈ V }. The outgoing weight of a vertex is given by W_out(v) = Σ_{e ∈ n_out(v)} w(e) and the incoming weight by W_in(v) = Σ_{e ∈ n_in(v)} w(e). We are now ready to define the sampling problem for directed graphs.
Problem 2. Given a directed graph G_D = (V, E) with edge weights w(e), lower and upper bounds on every edge weight, and lower and upper bounds on the outgoing and incoming weights of every vertex, obtain a sample uniformly at random from the set of allowed weights W*_D, given by

W*_D = { w* ∈ ℝ^|E| | w_min(e) ≤ w*(e) ≤ w_max(e) for all e ∈ E, and W_out,min(v) ≤ Σ_{e ∈ n_out(v)} w*(e) ≤ W_out,max(v) and W_in,min(v) ≤ Σ_{e ∈ n_in(v)} w*(e) ≤ W_in,max(v) for all v ∈ V }.

We can solve Problem 2 by the algorithm used to solve Problem 1 by noticing that a directed graph can be easily transformed to an equivalent undirected graph, as stated by the following theorem.
Theorem 1. Let G_D = (V, E) be a directed graph. Construct an undirected graph G = (V′, E′) by splitting every vertex v ∈ V into two vertices v_out and v_in, and by mapping every directed edge e = (u, v) ∈ E to the undirected edge f(e) = {u_out, v_in} ∈ E′, with the bounds given by w_min(f(e)) = w_min(e) and w_max(f(e)) = w_max(e) for the edges, and by the outgoing and incoming weight bounds of v for the vertices v_out and v_in, respectively. Now, if w* is a uniform sample from W* we can obtain a uniform sample w*_D from W*_D in a straightforward way by setting w*_D(f(e)) ← w*(e) for all e ∈ E.
Proof. The proof follows directly from the definitions.
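As an illustration of this vertex-splitting transformation, the following Python sketch (a hypothetical representation; vertex copies are tagged "out" and "in") maps a directed edge list to its undirected bipartite counterpart:

```python
def directed_to_undirected(d_edges):
    """Sketch of the transformation in the theorem above: every vertex v
    of the directed graph becomes two vertices (v, 'out') and (v, 'in'),
    and a directed edge (u, v) becomes the undirected edge between
    (u, 'out') and (v, 'in').  The resulting graph is bipartite, so the
    outgoing/incoming weight constraints become ordinary vertex-weight
    constraints on the two copies."""
    return {e: ((u, "out"), (v, "in")) for e, (u, v) in d_edges.items()}
```

Because the mapping f is a bijection between the directed and undirected edge sets, sampled weights transfer back by w*_D(f(e)) ← w*(e), as stated in the theorem.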

F. Details of the algorithm
In this Section we give a detailed description of the proposed algorithm, an overview of which is given above.

The CycleSampler Algorithm
Assume the vectors y_i, where i ∈ [l] (and l is no smaller than the dimension of the null space Null(A)), span the null space Null(A), i.e., any null space vector x ∈ Null(A) can be formed as a linear combination of the vectors y_i. Given this basis, we can obtain samples as follows:
1. First, transform the graph to a form where the vertex weights are fixed and the variability in them is described by self-loops, as in Fig. 3.
2. Initially, let the current state be the observed set of edge weights, w* ← w, with the weights of self-loops initially set to zero.
3. Select an index i ∈ [l] uniformly at random, and draw α uniformly at random from the interval of values for which w* + αy_i ∈ W*.
4. Update w* ← w* + αy_i and repeat from step 3 above.
Note that because W* is a simple convex space (an |E|-dimensional rectangle) we can find the interval of allowed values of α in step 3 efficiently.

Theorem 2. The Markov chain defined by the steps above converges to the uniform distribution over W*.

Proof. This follows from the facts that (i) because the y_i span the null space, all of the points of the null space can be reached with a non-vanishing probability, and (ii) the transition probability from a state w*₁ ∈ S to a state w*₂ ∈ S is equal to the transition probability from state w*₂ to state w*₁, i.e., the Markov chain satisfies the detailed balance condition for a uniform distribution.
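As an illustration of step 3, because W* is an axis-aligned rectangle, the set of allowed step sizes α is the intersection of one interval per edge on which y_i is non-zero. A Python sketch (hypothetical names; sparse basis vectors are stored as dicts from edge to coefficient):

```python
import random

def alpha_interval(w, y, bounds):
    """Interval of step sizes alpha for which w + alpha*y stays inside the
    rectangle W*: intersect one interval per edge touched by y."""
    lo, hi = float("-inf"), float("inf")
    for e, ye in y.items():
        if ye == 0:
            continue
        a = (bounds[e][0] - w[e]) / ye
        b = (bounds[e][1] - w[e]) / ye
        lo, hi = max(lo, min(a, b)), min(hi, max(a, b))
    return lo, hi

def mcmc_step(w, y, bounds, rng):
    """One step of the sampler: draw alpha uniformly from the allowed
    interval and move the current state along the basis vector y."""
    lo, hi = alpha_interval(w, y, bounds)
    alpha = rng.uniform(lo, hi)
    for e, ye in y.items():
        w[e] += alpha * ye
```

Since every basis vector lies in Null(A), the vertex weights Aw* are unchanged by `mcmc_step`, while the edge bounds are respected by construction of the α interval.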
It remains to find a complete basis y i of the null space Null(A). This can be done using a spanning tree of the connected graph G. In the actual implementation we use a spanning tree found by a standard breadth-first search, but any instance of a spanning tree can be used. We choose one of the vertices (it does not matter which) as the root vertex v root ∈ V of the tree. We denote by E s ⊆ E the |V | − 1 edges that appear in the spanning tree and by F = E \ E s the remaining |E| − |V | + 1 edges that do not appear in the spanning tree.
To make the derivation easier to follow we provide an example of the introduced concepts later in Sec. II G which can be read in parallel with the derivation below.
We further denote by E(v) ⊆ E_s the set of edges in the spanning tree between vertex v ∈ V and the root vertex. For the root vertex v_root we have E(v_root) = ∅. We define the depth depth(e), where e ∈ E_s, of the edge e in the spanning tree to be the number of edges between e and the root vertex, the edges adjacent to the root vertex having a depth of zero. We further define the depth of a vertex v by its distance from the root vertex, i.e., depth(v) = |E(v)|.

We define a cycle c(v, v′) ∈ ℝ^|E| for each edge {v, v′} ∈ F by

c(v, v′) = n({v, v′}) − Σ_{e ∈ E(v)} (−1)^{depth(v)−1−depth(e)} n(e) − Σ_{e ∈ E(v′)} (−1)^{depth(v′)−1−depth(e)} n(e),   (14)

where n(e) ∈ {0, 1}^|E| is a 0-1 vector defined by n(e)_{e′} = I(e = e′); the signs alternate along the paths from v and v′ to the root. We use the shorthand c(e) = c(v, v′) for an edge e = {v, v′} ∈ F. We further split the edges not in the spanning tree F into clean edges,

F_c = { {v, v′} ∈ F | depth(v) + depth(v′) is odd },

and dirty edges,

F_d = { {v, v′} ∈ F | depth(v) + depth(v′) is even }.

Notice that the set of clean edges F_c cannot contain self-loops, but the set of dirty edges F_d may contain self-loops (i.e., v = v′). The non-zero elements of a cycle introduced by a clean edge form a graph cycle with an even number of edges, while a cycle introduced by a dirty edge contains a graph cycle with an odd number of edges. Indeed, by taking a spanning tree of a graph and adding an edge not in the tree we always get a unique graph cycle, and these cycles form the cycle basis of the graph. In particular, it is well known that the cycle basis of a graph contains only even-length cycles if and only if the graph is bipartite. Therefore F_d = ∅ is equivalent to the statement that the graph G is bipartite.
We construct a basis of the null space as follows.
(i) For each clean edge e ∈ F_c we define a basis vector by the respective cycle,

y_i = c(e).   (17)

For an example, see Fig. 6a. The non-zero elements of the basis vector y_i form a graph cycle of even length, with alternating weights of ±1.
(ii) For each pair of dirty edges e₁ ∈ F_d and e₂ ∈ F_d where e₁ ≠ e₂ the basis vector is given by a linear combination of the two cycles,

y_i = c(e₁) ± c(e₂),   (18)

with the sign chosen so that the contributions of the two cycles at the root vertex cancel. For an example, see Fig. 6b-d. Again, the non-zero elements of the basis vector y_i form a graph cycle of even length.
The number of distinct basis vectors defined by Eqs. (17) and (18) is therefore |F_c| + |F_d|(|F_d| − 1)/2. We show that the basis vectors of Eqs. (17) and (18) indeed lie in the null space Null(A) and together span it (Lemmas 1, 2, and 3). The following theorem follows directly from Lemmas 1, 2, and 3 above.

Theorem 3. The vectors defined by Eqs. (17) and (18) span the null space Null(A) of the incidence matrix A.

G. Illustrating Example
In this section we provide a small example graph illustrating the above discussed concepts.
Consider again the network introduced above in Fig. 1, with six vertices V = {1, 2, 3, 4, 5, 6} and seven edges. We now add two self-loops to vertices 1 and 6 in this network. The root vertex is given by v_root = 3. This graph is shown in Fig. 4 (the root of the spanning tree is marked with grey) and the corresponding matrix A is shown in Table II. The cycle vectors are shown in Fig. 5 and the basis of the null space in Fig. 6. A different root vertex could be used when constructing the spanning tree, which would lead to a different set of vectors that span the null space; however, any choice of spanning tree or root vertex will span the same null space.

III. EXPERIMENTAL EVALUATION
To demonstrate the scalability of the method described in this paper we perform two sampling experiments on seven publicly available sparse real-world networks. These networks are all examples of recommendation data sets (note that this does not limit the generality of the discussion). In recommendation data, a user provides a rating for a given item, i.e., the data items are triplets of the form (user, item, rating). The properties of the networks are presented in Table III.

TABLE II. The six uppermost rows show the incidence matrix A for the graph in Fig. 4. The next four rows show the graph cycles c_i, which are also shown graphically in Fig. 5; see Eq. (14) for the definition. The four lowermost rows show the basis vectors y_i, also shown graphically in Fig. 6; see Eqs. (17) and (18) for the definitions.

We performed a number of preprocessing steps for the networks. In the first experiment we sample networks where the vertex weights are preserved exactly, while the edge weights w(e) are allowed to vary on the interval (0, 1), i.e., the range of the edge weights in the original network. In the context of recommender systems this means that the total ratings given by a user and the total ratings received by an item are both preserved exactly (i.e., the ratings for a given user are just allocated differently). In the second experiment we sample networks where both edge and vertex weights are allowed to vary. The edge weights are again constrained to the interval (0, 1) while the vertex weights W(v) are constrained to the interval [0.9 · W(v), 1.1 · W(v)] for each vertex v ∈ V of the original network. In the context of recommender systems this means that the total ratings given by a user and the total ratings received by an item cannot vary more than ±10% from the value observed in the original data set.
In both experiments we consider the scalability of the sampling method presented in this paper in terms of the convergence rate of the sampler. Studying the convergence of a Markov chain is a nontrivial problem; here we consider convergence in terms of how the Frobenius norm between the edge weight vector of the observed network (w) and the jth sample from the sampler (w*_j) evolves. The sampler presented in this paper is implemented in C, as an extension to R [17], and is freely available for download from http://github.com/edahelsinki/cyclesampler.
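For a vector of edge weights this norm is simply the Euclidean distance between the observed and sampled weight vectors; a minimal sketch of the diagnostic (hypothetical representation of the weight vectors as dicts):

```python
import math

def distance_from_start(w0, wj):
    """Convergence diagnostic used in the experiments: the Frobenius
    (Euclidean) norm between the observed edge-weight vector w0 and the
    j-th sample wj.  The statistic should plateau once the chain mixes."""
    return math.sqrt(sum((w0[e] - wj[e]) ** 2 for e in w0))
```

Plotting this distance against the number of steps, as in Figure 7, gives a simple visual check of whether the sampler has drifted far enough from its starting state.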

A. Results
In the experiments we set a target of 100 000 samples, with a cutoff on the total sampling time. The initialisation and sampling times (both in seconds) of the sampler are presented in Table III, recorded on a standard laptop equipped with a quad-core 2.6 GHz Intel Core i7 processor and 20 GB of RAM, running a 64-bit version of R (v. 3.5.1) on Linux. The initialisation time (t_init) is the time required to set up the sampler, which consists of determining the spanning tree and identifying the cycles. The sampling time (t_sample) for a particular network is the time required to take a number of steps equal to the dimensionality of its null space |C|, i.e., |E| − |V| + 1 in experiment 1 and |E| + 1 in experiment 2 (all of our networks are bipartite). The subscripts 1 and 2 are used to denote the times for experiments 1 and 2, respectively. The initialisation and sampling times increase as the dimensionality of the null space of the network increases. This is also reflected in the times for experiment 2, where the dimensionality of the null space is higher due to the addition of one self-loop per vertex, required to preserve vertex weights on an interval (see Section II B).
The initialisation time is about a second for small networks (Last.fm, MovieLens 100k) and less than 10 minutes even for the TasteProfile network with tens of millions of edges.
Similarly, the time needed to produce a sample ranges from a fraction of a second for the small networks to about 1.5 minutes for the largest network.
The convergence results from experiments 1 and 2 are shown in Figure 7, which shows the evolution of the Frobenius norm between the starting state and the sampled state. The results from experiment 1, where the vertex weights are preserved exactly, are shown in Figure 7(a). We notice that on the order of 1 000|C| steps are needed for the sampler to converge for all data sets except for Last.fm, which has not converged after 100 000|C| steps.
The results from experiment 2, where the vertex weights are preserved on an interval, are shown in Figure 7(b). The sampler clearly converges more slowly for all data sets than in experiment 1; the number of steps required for convergence appears to be on the order of 10 000|C| steps, approximately a tenfold increase compared to experiment 1. Here the sampler has not yet converged for Last.fm, MovieLens 20M, and TasteProfile.

IV. CONCLUSIONS
Sampling networks has important applications in multiple domains where it is necessary to investigate and understand the significance of different phenomena described by the structure of the network. Such networks are often sparse and large, and consequently generating surrogate networks adhering to specific constraints is a difficult problem. In this paper we introduced CycleSampler, a novel Markov chain Monte Carlo method that allows sampling of both undirected and directed networks with interval constraints on both edge and node weights.
The presented method provides an efficient means for sampling large networks, and we provided an empirical evaluation demonstrating that the method scales to large sparse real-life networks. We believe that the CycleSampler method has applications in many domains, and we also release an open-source implementation of the method as an R package.

TABLE III. Properties of the networks. The networks are sorted in order of an increasing number of edges. The columns are as follows: rows and columns give the full size of the data matrix and density is the number of nonzero entries. The number of edges, vertices and the dimensionality of the null space (in experiment 1) are given by |E|, |V| and |C| = |E| − |V| + 1, respectively. In experiment 2 the dimensionality of the null space is |E| + 1 due to the addition of one self-loop per vertex. The initialisation time for the sampler (e.g., finding the spanning tree and enumerating cycles) and the time needed to take a number of steps equal to the dimensionality of the null space of the network are shown in the columns t_init and t_sample. The times are in seconds; the subscript 1 refers to experiment 1 whereas the subscript 2 refers to experiment 2. E: http://labrosa.ee.columbia.edu/millionsong/tasteprofile and [22].

ACKNOWLEDGMENTS
This work was funded by the Academy of Finland (decisions 326280 and 326339) and