A measure for characterizing heavy-tailed networks

Heavy-tailed networks are often characterized in the literature by their degree distribution's similarity to a power law. However, many heavy-tailed networks in real life do not have power-law degree distributions, and in many applications the scale-free nature of the network is irrelevant so long as the network possesses hubs. Here we present the Cooke-Nieboer index (CNI), a non-asymptotic measure of the heavy-tailedness of a network's degree distribution which does not presume a power-law form. The CNI is easy to calculate, and clearly distinguishes between networks with power-law, exponential, and symmetric degree distributions.

shape; the exponent α is known as the tail index of the distribution [12][13][14][15] and the network. This asymptotic measure usually depends on a very small fraction of nodes of the network, those residing in the distribution's tail, and philosophically it reinforces the semantic equivalence between heavy-tailed and scale-free networks.
As an alternative, we present a new measure, called the Cooke-Nieboer index, which quantifies the heavy-tailedness of a network. This measure does not presume that the distribution is scale-free, nor is it asymptotic: it is applied to the entire degree distribution rather than just to its tail. After defining the measure, we investigate its value for several theoretical distributions and synthetic networks in order to understand its properties. We end by applying our measure to real-world networks, comparing it with the tail index α and the "strength" of a power-law fit as discussed in [4].

A. The Obesity Index
In the probability literature [3], a distribution f(x) is said to be heavy-tailed if
$$\int_{-\infty}^{\infty} e^{\lambda x} f(x)\,dx = \infty \quad \text{for all } \lambda > 0. \tag{2}$$
Most heavy-tailed distributions of interest fall into a subcategory known as the subexponential distributions, defined as follows [16]: if $X_1, \ldots, X_n$ are independent and identically distributed (i.i.d.) random variables chosen from a subexponential distribution, then
$$\lim_{x \to \infty} \frac{P(X_1 + \cdots + X_n > x)}{P(\max(X_1, \ldots, X_n) > x)} = 1 \quad \text{for all } n \ge 1. \tag{3}$$
In other words, the sum of the random variables is likely to be large if and only if their maximum is likely to be large. This is the principle of a single big jump [3].
(For example, if the cost of cleaning up from natural disasters follows a subexponential distribution, then the total cost of cleanup in any given year will be roughly equal to the cost of the largest disaster that year.) Power-law distributions and regularly varying distributions [7] are subsets of the set of subexponential distributions.
To characterize the "subexponentiality" of a distribution X, Cooke and Nieboer [12] suggest a measure known as the obesity index, defined as follows: select four i.i.d. random samples from the distribution and label them in ascending order, so that $X_1 \le X_2 \le X_3 \le X_4$. The obesity index is then
$$\mathrm{Ob}(X) = P(X_1 + X_4 > X_2 + X_3 \mid X_1 \le X_2 \le X_3 \le X_4). \tag{4}$$
If the distribution is symmetric, then the quantities $X_4 + X_1$ and $X_2 + X_3$ are equally likely to be larger, and so its obesity index is one-half [12]. For a subexponential distribution, on the other hand, $X_4$ will probably be larger than the other three variables combined, in which case $X_1 + X_4$ must certainly be greater than $X_2 + X_3$, and the probability is much greater than one-half. The obesity index is a probability, and so ranges from zero to one. Like skewness and kurtosis, it is independent of offset and positive scaling of the distribution: i.e.,
$$\mathrm{Ob}(aX + b) = \mathrm{Ob}(X) \quad \text{for } a > 0. \tag{5}$$
Multiplying the distribution by a negative number reverses the inequality in Eq. 4, however, so that
$$\mathrm{Ob}(aX + b) = 1 - \mathrm{Ob}(X) \quad \text{for } a < 0. \tag{6}$$

B. The Cooke-Nieboer Index
For a given distribution X, we define the Cooke-Nieboer index (CNI) Θ(X) in a similar way.
Definition: Let $X_1, \ldots, X_4$ be four i.i.d. random samples chosen from a particular distribution X, such as the degree distribution of a network, labeled in ascending order as in Eq. 4. We define
$$\Theta(X) = E\{\mathrm{sgn}(X_1 + X_4 - X_2 - X_3)\}, \tag{7}$$
where E{·} signifies the expectation value and sgn(x) is the signum function
$$\mathrm{sgn}(x) = \begin{cases} 1 & x > 0 \\ 0 & x = 0 \\ -1 & x < 0. \end{cases} \tag{8}$$
For later convenience, we define
$$\Phi(X) = X_1 + X_4 - X_2 - X_3, \tag{9}$$
so that Θ(X) = E{sgn(Φ(X))}.
The CNI differs from the obesity index in three ways: (i) the CNI ranges from −1 to 1, so that for symmetric distributions, Θ = 0; (ii) it accounts for the finite probability that $X_1 + X_4 = X_2 + X_3$ in discrete distributions; and (iii) it avoids the term "obesity", which may cause confusion in applications of network science to health issues. Otherwise, for a continuous distribution X, the two measures are simply related:
$$\Theta(X) = 2\,\mathrm{Ob}(X) - 1. \tag{10}$$
The exact CNI can be calculated for a finite distribution with N data points $x_i$ by considering every combination of four points (including duplicates):
$$\Theta(X) = \frac{1}{N^4} \sum_{i,j,k,l=1}^{N} \mathrm{sgn}(\Phi(x_i, x_j, x_k, x_l)). \tag{11}$$
This naive algorithm runs in $O(N^4)$ time. If the data points are non-negative integers $x_i = 0, 1, \ldots, M$ and $n_a$ is the number of data points equal to a, then we can use the form
$$\Theta(X) = \frac{1}{N^4} \sum_{a,b,c,d=0}^{M} n_a n_b n_c n_d\,\mathrm{sgn}(\Phi(a, b, c, d)) \tag{12}$$
instead, which runs in $O(M^4)$ time.
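As a rough illustration of Eqs. 11 and 12, the following Python sketch computes the exact CNI of a small sample both ways. The function names (`sgn`, `phi`, `cni_naive`, `cni_counts`) and the example data are assumptions made here for illustration; they are not taken from the paper's own code. The counted version loops only over the distinct values present, which is equivalent to Eq. 12 because values with $n_a = 0$ contribute nothing.

```python
from collections import Counter
from itertools import product


def sgn(x):
    """Signum function (Eq. 8): returns -1, 0, or 1."""
    return (x > 0) - (x < 0)


def phi(a, b, c, d):
    """Eq. 9, applied to four values after sorting them in ascending order."""
    x1, x2, x3, x4 = sorted((a, b, c, d))
    return x1 + x4 - x2 - x3


def cni_naive(data):
    """Eq. 11: average sgn(phi) over all N^4 ordered 4-tuples (repeats allowed)."""
    n = len(data)
    total = sum(sgn(phi(*quad)) for quad in product(data, repeat=4))
    return total / n**4


def cni_counts(data):
    """Eq. 12: the same sum, grouped by distinct value and weighted by counts."""
    counts = Counter(data)
    n = len(data)
    total = 0
    for quad in product(counts, repeat=4):
        weight = 1
        for value in quad:
            weight *= counts[value]
        total += weight * sgn(phi(*quad))
    return total / n**4


# Hypothetical example data, chosen only to exercise the two functions.
symmetric = [1, 2, 2, 3]            # symmetric about 2
skewed = [1, 1, 1, 1, 1, 2, 2, 9]   # many small values and one large value

print(cni_naive(symmetric), cni_counts(symmetric))
print(cni_naive(skewed), cni_counts(skewed))
```

Both functions should agree on any input; the symmetric sample gives zero, and the skewed sample gives a positive value.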
For larger distributions it is sufficient to use a Monte Carlo simulation such as the one in Fig. 1, calculating Φ repeatedly until some desired standard error $\sigma_{\bar{x}}$ is reached. Figure 2 shows that the CNI calculated this way is normally distributed for multiple types of distributions, with a standard deviation equal to $\sigma_{\bar{x}}$. The number of steps required to reach a desired standard error is proportional to $\sigma_{\bar{x}}^{-2}$, with a coefficient depending on the type of distribution (Fig. 3).
FIG. 3. For the distributions in Fig. 2, the number of steps N required to reach a particular standard error SE, where a step is a single calculation of Φ (Eq. 9). All three curves closely obey the relationship $SE \propto 1/\sqrt{N}$ after one thousand steps.
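Fig. 1 is not reproduced in this text, so the following is only a minimal Python sketch of a Monte Carlo estimate of this kind, not the author's code. The function name `cni_monte_carlo`, its parameters, and the batch size are assumptions. It stores every value of sgn Φ and recomputes the standard error after each batch, which is simple but not efficient.

```python
import math
import random


def cni_monte_carlo(sampler, target_se=0.01, batch=1000,
                    max_steps=10_000_000, rng=None):
    """Estimate the CNI by repeatedly evaluating sgn(phi) (Eq. 9).

    `sampler(rng)` must return one random draw from the distribution.
    Sampling stops once the standard error of the mean is at or below
    `target_se`, or once `max_steps` evaluations have been made.
    Returns (estimate, standard error, number of steps).
    """
    rng = rng or random.Random()
    signs = []
    while len(signs) < max_steps:
        for _ in range(batch):
            x1, x2, x3, x4 = sorted(sampler(rng) for _ in range(4))
            phi = x1 + x4 - x2 - x3
            signs.append((phi > 0) - (phi < 0))
        n = len(signs)
        mean = sum(signs) / n
        variance = sum((s - mean) ** 2 for s in signs) / n
        se = math.sqrt(variance / n)
        if se <= target_se:
            break
    return mean, se, n


# Exponential distribution with rate 1 (the seed is arbitrary).
theta, se, steps = cni_monte_carlo(lambda r: r.expovariate(1.0),
                                   rng=random.Random(1))
print(f"CNI ~ {theta:.3f} +/- {se:.3f} after {steps} steps")
```

For an exponential sampler the estimate should land near 1/2, the value implied by an obesity index of 3/4.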

III. DISTRIBUTIONS
We saw in Section II B that Θ = 0 for symmetric distributions. It is shown in [12] that an exponential distribution $P(x) = \lambda e^{-\lambda x}$ has an obesity index of 3/4 regardless of scale, and thus, according to Eq. 10, Θ = 1/2. Using these two values as boundaries, we divide distributions into three regimes:
1. High-CNI distributions, with Θ > 0.5. These are the subexponential distributions, which have heavier tails than the exponential distribution. They include the power-law distributions, whose CNIs (as shown in Fig. 4) range from Θ = 1 for α = 0 to Θ = 0.5 as α → ∞.
2. Low-CNI distributions, with 0 ≤ Θ ≤ 0.5.
3. Negative-CNI distributions, with Θ < 0. These are distributions which have a preponderance of large values and fewer small values: a distribution that grows rather than decays.
FIG. 4. The CNI of a power-law distribution $1/x^{\alpha+1}$ as a function of its tail index α, calculated via numerical simulation. The grey area highlights the region where most "scale-free" networks are found, between α = 1 and α = 2 [4,17]. Ref. [12] calculates the CNI at these values as $2\pi^2 - 19 \approx 0.739$ and $1185 - 120\pi^2 \approx 0.647$, respectively. There is no closed form for this curve, but it is close to the expression $\frac{1}{2} + \frac{1}{2}(1 + \alpha)^{-10/9}$, which is shown as a dashed line.
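The numbers quoted in the Fig. 4 caption can be checked with a few lines of arithmetic. This Python sketch evaluates the two closed-form values and the approximate expression at α = 1 and α = 2; the function name `cni_power_law_approx` is a hypothetical label for the dashed-line expression.

```python
import math


def cni_power_law_approx(alpha):
    """Approximate expression quoted in the Fig. 4 caption."""
    return 0.5 + 0.5 * (1 + alpha) ** (-10 / 9)


# Closed-form values quoted from Ref. [12] at alpha = 1 and alpha = 2.
closed_form = {1: 2 * math.pi**2 - 19, 2: 1185 - 120 * math.pi**2}

for alpha, exact in closed_form.items():
    approx = cni_power_law_approx(alpha)
    print(f"alpha={alpha}: closed form {exact:.3f}, approximation {approx:.3f}")
```

At these two points the approximation stays within about 0.01 of the closed-form values, and at α = 0 it returns exactly 1, matching the upper end of the range given for power laws.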

A. Bimodal Distribution
To understand how this calculation works, it is useful to consider the simple bimodal distribution
$$X = \begin{cases} a & \text{with probability } p \\ b > a & \text{with probability } 1 - p. \end{cases} \tag{13}$$
If we choose four samples from this distribution, and 0 ≤ s ≤ 4 of them are a, it is simple to show that Φ (Eq. 9) is equal to zero if s is even, Φ < 0 if s = 1, and Φ > 0 if s = 3. Thus we can calculate the CNI of this distribution precisely:
$$\Theta(p) = 4p^3(1 - p) - 4p(1 - p)^3 = 4p(1 - p)(2p - 1). \tag{14}$$
Note that the result does not depend on the values a and b. Fig. 5 shows a graph of this polynomial. Where the distribution is symmetric, at p = 0, 0.5, and 1, the CNI is zero. When the smaller values are predominant, as in typical degree distributions, the CNI is positive, with a maximum value of $\Theta = 2\sqrt{3}/9 \approx 0.385$ at $p = (3 + \sqrt{3})/6 \approx 0.79$. When there are more large values than small values, however, the CNI is negative. Note that a bimodal distribution can never reach the "high-CNI" regime. By contrast, trimodal distributions

B. Changing the Number of Samples
A natural question to ask regarding our definition is whether there is something special about choosing four samples. In fact, we can generalize Eq. 9 to use any number S of samples $X_i$:
$$\Phi_S(X) = \frac{1}{2}\left(\max_i X_i + \min_i X_i\right) - \frac{1}{S}\sum_{i=1}^{S} X_i.$$
For S = 4 this is Φ(X)/4, which has the same sign as Eq. 9. The first term $\frac{1}{2}(\max X_i + \min X_i)$ is the halfway point between the largest and smallest values, and could be thought of as the "geometric center" of the samples, while the second term is the mean. When one of the samples is much larger than the others, the mean falls to the negative side of the geometric center, and Φ(X) is positive. This makes the CNI a type of skewness measure for the distribution.
Figure 6 shows the modified CNI $\Theta_S$ (using S samples) for several basic random distributions. The value for a flat distribution remains zero throughout, but for others, $\Theta_S$ increases monotonically as the number of samples increases, compressing the "high-CNI" regime and expanding the "low-CNI" regime. The value S = 4 evenly divides the high and low regimes, and so is a reasonable choice for this paper.
Notice that changing the value of S does not change the ordering of these distributions, but this is not true in general. The generalization of Eq. 14 for S samples is
$$\Theta_S(p) = \sum_{s=1}^{S-1} \binom{S}{s} p^s (1 - p)^{S-s}\,\mathrm{sgn}(2s - S),$$
and we can show that $\Theta_4(0.77) = 0.383$ is less than $\Theta_4(0.79) = 0.385$, but $\Theta_7(0.77) = 0.732$ is greater than $\Theta_7(0.79) = 0.729$.
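The reordering described above can be checked numerically. The sketch below assumes the S-sample rule applied to the bimodal distribution of Section III A: if s of the S draws equal the smaller value, the generalized Φ is positive when s > S/2, negative when s < S/2, and zero when s = S/2 or when all draws are equal. The function name `bimodal_cni` is hypothetical.

```python
from math import comb


def bimodal_cni(p, samples=4):
    """CNI of a two-valued distribution using `samples` draws per step.

    `p` is the probability of the smaller value. The number s of draws
    equal to the smaller value is binomially distributed; each outcome
    with 0 < s < samples contributes its probability times sgn(2s - samples).
    """
    total = 0.0
    for s in range(1, samples):
        prob = comb(samples, s) * p**s * (1 - p) ** (samples - s)
        sign = (2 * s > samples) - (2 * s < samples)
        total += sign * prob
    return total


for p in (0.77, 0.79):
    print(p, round(bimodal_cni(p, 4), 3), round(bimodal_cni(p, 7), 3))
```

With S = 4 the function reduces to 4p(1 − p)(2p − 1), and it reproduces the four values quoted in the text, including the change in ordering between S = 4 and S = 7.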

IV. NETWORKS
For an undirected, unweighted network G, we define Θ(G) to be the CNI of its degree distribution; that is,
$$\Theta(G) = E\{\mathrm{sgn}(\Phi(k_{n_1}, k_{n_2}, k_{n_3}, k_{n_4}))\},$$
where $n_i \in G$ are nodes and $k_{n_i}$ is the degree of node $n_i$ in G. (For weighted networks, one can let $k_{n_i}$ be the total weight of the edges connected to $n_i$; there is no need for this to be an integer.) Networks with symmetric degree distributions, such as complete graphs $K_n$ and cycle graphs $C_n$, have Θ = 0. Because Θ has the same scaling independence as the obesity index (Eq. 5), Θ(G ∪ G) = Θ(G), although the measure is not otherwise additive. From Eq. 6 it can be shown that
$$\Theta(\bar{G}) = -\Theta(G),$$
where $\bar{G}$ is the complement of G.
FIG. 8. The CNI for Barabási-Albert networks of N = 100,000 nodes, as a function of the minimum degree m. The black dot marks the mean value over 100 sampled networks, the error bars show the standard deviation, and the grey dots mark all values. Note the unusual value at m = 1. The dashed line shows the CNI of a power-law distribution $k^{-3}$, which is the value to which we expect all of these values to converge [2,18] as N → ∞.
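The definition of Θ(G) and the complement relation can be tried on small graphs. The Python sketch below uses only the standard library; the helper names (`degree_sequence`, `cni`, `complement`) and the six-node example graphs are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations, product


def degree_sequence(n_nodes, edges):
    """Degrees of nodes 0..n_nodes-1 in an undirected simple graph."""
    degree = [0] * n_nodes
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree


def cni(values):
    """Exact CNI of a finite sample, grouping equal values (cf. Eq. 12)."""
    counts = Counter(values)
    total = 0
    for quad in product(counts, repeat=4):
        x1, x2, x3, x4 = sorted(quad)
        phi = x1 + x4 - x2 - x3
        weight = (counts[quad[0]] * counts[quad[1]]
                  * counts[quad[2]] * counts[quad[3]])
        total += weight * ((phi > 0) - (phi < 0))
    return total / len(values) ** 4


def complement(n_nodes, edges):
    """All node pairs that are not edges of the given simple graph."""
    present = {frozenset(e) for e in edges}
    return [pair for pair in combinations(range(n_nodes), 2)
            if frozenset(pair) not in present]


n = 6
cycle = [(i, (i + 1) % n) for i in range(n)]   # C_6: every node has degree 2
star = [(0, i) for i in range(1, n)]           # one hub and five leaves

theta_cycle = cni(degree_sequence(n, cycle))
theta_star = cni(degree_sequence(n, star))
theta_star_bar = cni(degree_sequence(n, complement(n, star)))
print(theta_cycle, theta_star, theta_star_bar)
```

The cycle gives zero, and the star and its complement give values of equal magnitude and opposite sign. The star's degree distribution is two-valued, so its CNI follows Eq. 14 with p = 5/6.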
Erdős-Rényi random networks G(n, p) primarily fall in the "low-CNI" regime (Fig. 7), with the value of Θ depending strongly on the average degree ⟨k⟩ = np of the network. The CNI is never negative, but can be zero up until a certain threshold (⟨k⟩ ≈ 0.07 in the figure), although the average CNI rises steadily with average degree. The average CNI reaches a maximum value before decreasing until it reaches zero again when ⟨k⟩ = n − 1. The significance of the shape of this curve, particularly the threshold where the CNI stops being zero, warrants further study.
Figure 8 shows that Barabási-Albert networks are high-CNI networks, as expected, with values close to that measured in Fig. 4 for a power-law degree distribution with α = 2. Notice, however, that the CNI depends on the parameter m, which specifies the minimum degree of the network, or alternatively, the number of nodes each new node attaches to when added to the network. This seems to contradict [2], which says that the infinite-network degree distribution should be $P(k) \propto k^{-3}$ independent of the minimum degree m. This may be a finite-size effect, as Barabási-Albert networks are known to converge slowly to their infinite state [19]. Recall also that the degree distribution is a discrete distribution, unlike the continuous distribution discussed in Fig. 4.
The discrepancy may also be due to the non-asymptotic nature of the CNI measure. According to [18], the degree distribution P(k) of such a network should approach
$$\lim_{N \to \infty} P(k) = \frac{2m(m + 1)}{k(k + 1)(k + 2)}.$$
For measures that only apply to the tail of the distribution, this can be approximated as $P(k) \propto k^{-3}$; but when the entire distribution is taken into account, as it is with the CNI, the dependence on m may be more pronounced. The precise reason for this discrepancy is worthy of further study, as is the jump in value between m = 1 and m = 2.
Another interesting synthetic network is a partial periodic lattice (PPL), in which each node in a lattice with periodic boundary conditions is connected to each of its m nearest neighbors with probability p. The CNI of such a network is a (4m − 1)-degree polynomial in p. Figure 9 shows this polynomial $\Theta_{\mathrm{lattice}}(p)$ for a few values of m. Such a network is in the low-CNI regime when p < 0.5 and there are few well-connected nodes; when p > 0.5, there are a larger number of high-degree nodes, and Θ < 0. It is a coincidence that the transition between these regimes occurs at the same value as the bond percolation threshold of the square lattice [20].
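The sign change at p = 0.5 can be explored with a short calculation. The sketch below rests on an assumption that is not stated in the text: that the degree of a PPL node can be treated as a Binomial(m, p) random variable, so that the CNI is an exact sum over four independent draws from that distribution. The function names are hypothetical.

```python
from itertools import product
from math import comb


def binomial_pmf(m, p):
    """P(degree = k), k = 0..m, if each of m links is present with probability p."""
    return [comb(m, k) * p**k * (1 - p) ** (m - k) for k in range(m + 1)]


def cni_from_pmf(pmf):
    """Exact CNI of an integer-valued distribution given as pmf[k] = P(X = k)."""
    total = 0.0
    for quad in product(range(len(pmf)), repeat=4):
        x1, x2, x3, x4 = sorted(quad)
        phi = x1 + x4 - x2 - x3
        weight = pmf[quad[0]] * pmf[quad[1]] * pmf[quad[2]] * pmf[quad[3]]
        total += weight * ((phi > 0) - (phi < 0))
    return total


def lattice_cni(m, p):
    """CNI under the assumed Binomial(m, p) degree distribution."""
    return cni_from_pmf(binomial_pmf(m, p))


for p in (0.3, 0.5, 0.7):
    print(p, lattice_cni(4, p))
```

Under this assumption the m = 1 case reduces to the bimodal result of Eq. 14 with the roles of the two values exchanged, giving 4p(1 − p)(1 − 2p), which is a cubic and so consistent with the stated degree 4m − 1.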

V. REAL-LIFE NETWORKS
We now apply our measure to a set of real-life networks. We work with the same sample of 927 networks, drawn from the ICON database [21], that is studied in [4]. Following that paper's lead, each nonsimple network (i.e., one that is directed, weighted, multipartite, or multiplanar) is used to generate a collection of unweighted, undirected simple graphs, according to criteria described in [4]. We define Θ of a network to be the median CNI of the network's collection of graphs.
... [4], we exclude from the "super-weak" category those networks that satisfy the "weakest" condition.
Figure 10 shows the distribution of the networks' median CNI, Θ. The average median CNI for all networks is Θ = 0.32 ± 0.27, but the distribution is bimodal, with one peak around Θ = 0.5 and one just below Θ = 0. The negative-CNI peak is made up mostly of planar graphs, specifically United States road networks [22] and fungal growth networks [23]; their negative CNI is reminiscent of the partial periodic lattices considered in Section IV. Excluding these two outlying groups, the average CNI is Θ = 0.49 ± 0.15, on the boundary between the high- and low-CNI regimes.
Fig. 10 also breaks the distribution down into the strength classifications used in [4], according to how well a power law fits each collection of simple graphs. Most of the strongest fits to the power-law model have high CNI, though some dip below 0.5, most significantly the protein-protein interaction network in Mus musculus [24] with Θ = 0.39. However, 30% of networks in the "weak" category and below are also high-CNI. Overall, 31% of our chosen networks lie in the high-CNI regime; another 24% are close, in the 0.4 ≤ Θ < 0.5 range (suggesting a new "mid-CNI" regime). Scale-free networks might be rare, but high-CNI networks are not.
Another way to classify the heaviness of a network's tail is with its tail index α, found by fitting the tail of the degree distribution to a power law $x^{-\alpha-1}$ [4,8,14,15].
Figure 11 shows the CNI of each of our simple graphs versus its tail index: the two values have a moderate negative correlation, as one might expect, with a Pearson correlation coefficient of r = −0.38. The border between the high- and low-CNI regimes occurs at α = 2.3, close to the upper bound α = 2 often cited [4,17] for those networks which are "scale-free".
However, there are times when the two quantities differ in surprising ways. Consider the set of affiliation networks between board directors on Norwegian public limited companies [25], determined monthly from 2006 through 2009. These networks have a tail index which varies between 1 and 5.5 (see Fig. 12b), but their CNI is a fairly constant Θ = 0.656 ± 0.007 throughout. Do the networks vary significantly or not? If we look at the degree distributions (Fig. 12a) from two particular months (May 2006 and August 2006) with very different tail indices (α = 5.0 and α = 1.2, respectively), we see that the two histograms are quite similar, suggesting that the CNI is a more accurate representation of their heavy-tailed nature.

VI. CONCLUSION
We have introduced the Cooke-Nieboer index as a new and potentially useful method for characterizing heavy-tailed networks. The CNI divides networks into three regimes: high-CNI, which includes scale-free networks and other networks with heavy tails; low-CNI, which includes random and regular networks; and negative-CNI, which includes planar networks which are mostly connected. While presented here in the context of simple graphs, it is easily generalized to apply to weighted and directed networks.
We have shown (Fig. 11) that our measure is loosely correlated with the tail index of networks, but with certain differences. Philosophically, the CNI avoids the question of whether heavy-tailed networks can be classified as "scale-free" or not. The CNI is also non-asymptotic, but whether this is an improvement on the tail index may depend on the application or one's point of view: the tail index is more sensitive to small changes in the tail, as seen in Fig. 12, but two distributions with the same tail may have considerably different CNIs, depending on the rest of the distribution. Perhaps the two measures may serve complementary roles, each characterizing certain network behaviors well. We hope that this measure will find applications in studies of epidemics, network fragility, and other fields where the distinction between a power-law network and a heavy-tailed network may be important.
There are a number of interesting results in this paper which warrant further study. The upper limit on the CNI of a bimodal distribution (Fig. 5) means that a star network, for instance, could never be high-CNI, and there are similar pathological instances of networks which are clearly hub-dominated but which have Θ < 0.5. This could be written off as a mathematical curiosity, but there may be a modification that can address this problem.
The structure of the graph in Fig. 7, which shows the distribution of CNI values for Erdős-Rényi networks, has several curious features. Why is there a threshold average degree ⟨k⟩ beyond which one no longer finds networks with Θ = 0? Does the value of ⟨k⟩ that maximizes the average CNI correspond to any other thresholds known to occur in random networks?
We also saw in Fig. 8 that the CNI for a Barabási-Albert network depends on the parameter m. The degree distributions of these networks are known to approach a constant power law in the infinite limit independent of the minimum degree, so why is there a steady distinction in the CNI, and why is the CNI so much lower in the m = 1 case?
In conclusion, we hope that this measure is useful to the network science community at large.
We thank Anne Broido and Aaron Clauset for making their data available in a convenient format at https://github.com/adbroido/SFAnalysis; we relied heavily on their data in Section V. We also thank Phil Chodrow and Nicole Eikmeier for useful conversations.

APPENDIX: AN EFFICIENT CNI ALGORITHM
One can improve the speed of Fig. 1 by implementing a running standard error, such as with Welford's online algorithm [26]. However, one can do even better by exploiting the fact that the quantity being averaged, sgn Φ, takes only one of three values. Suppose we take N sets of four samples from our distribution and calculate $x_i = \mathrm{sgn}\,\Phi_i$ for each one. If we define $D \equiv \sum_i x_i$, then the CNI is Θ = D/N. The variance of this measurement is
$$\sigma^2 = \frac{1}{N}\sum_i x_i^2 - \langle x_i \rangle^2, \quad \langle x_i \rangle = \frac{D}{N}.$$
Because $x_i^2$ is either zero or one, $\sum_i x_i^2 = N - Z$, where Z is the number of times that $\Phi_i = 0$. Thus the variance can be written
$$\sigma^2 = \frac{N - Z}{N} - \frac{D^2}{N^2},$$
and thus the squared standard error is
$$\sigma_{\bar{x}}^2 = \frac{\sigma^2}{N} = \frac{1}{N}\left(1 - \frac{Z}{N} - \frac{D^2}{N^2}\right).$$
This confirms the result seen in Fig. 3 that $1/\sqrt{N}$ is an upper bound and a good approximation for $\sigma_{\bar{x}}$, so long as Z and D are both much smaller than N.
The code in Fig. 13 uses this insight to calculate the standard error, and calculates the CNI almost 3 times faster than code using the Welford algorithm, and 75 times faster than the code in Fig. 1.
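Fig. 13 is not reproduced in this text. The following Python sketch shows one way the counting idea could be implemented, assuming the squared standard error is computed as ((N − Z)/N − (D/N)²)/N; it is not the author's code, and the function name, parameters, and batch size are hypothetical. No timing claims are made for this sketch.

```python
import math
import random


def cni_counter_monte_carlo(sampler, target_se=0.005, batch=1000,
                            max_steps=10_000_000, rng=None):
    """Monte Carlo CNI that tracks only three integers: N, D, and Z.

    N counts the steps, D is the running sum of sgn(phi), and Z counts
    the steps with phi == 0. The standard error is recomputed from these
    counters after every batch, so no per-step values are stored.
    Returns (estimate, standard error, number of steps).
    """
    rng = rng or random.Random()
    n = d = z = 0
    se = math.inf
    while n < max_steps and se > target_se:
        for _ in range(batch):
            x1, x2, x3, x4 = sorted(sampler(rng) for _ in range(4))
            phi = x1 + x4 - x2 - x3
            if phi > 0:
                d += 1
            elif phi < 0:
                d -= 1
            else:
                z += 1
        n += batch
        variance = (n - z) / n - (d / n) ** 2
        se = math.sqrt(max(variance, 0.0) / n)
    return d / n, se, n


# A symmetric discrete distribution with frequent ties, so that Z > 0.
theta, se, steps = cni_counter_monte_carlo(lambda r: r.randint(0, 9),
                                           rng=random.Random(2))
print(f"CNI ~ {theta:.3f} +/- {se:.3f} after {steps} steps")
```

Because the sampler here is uniform on the integers 0 to 9, which is symmetric, the estimate should be close to zero, and the reported standard error never exceeds $1/\sqrt{N}$.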