Spectral entropies as information-theoretic tools for complex network comparison

Any physical system can be viewed from the perspective that information is implicitly represented in its state. However, the quantification of this information when it comes to complex networks has remained largely elusive. In this work, we use techniques inspired by quantum statistical mechanics to define an entropy measure for complex networks and to develop a set of information-theoretic tools, based on network spectral properties, such as Renyi q-entropy, generalized Kullback-Leibler and Jensen-Shannon divergences, the latter allowing us to define a natural distance measure between complex networks. First we show that by minimizing the Kullback-Leibler divergence between an observed network and a parametric network model, inference of model parameter(s) by means of maximum-likelihood estimation can be achieved and model selection can be performed appropriate information criteria. Second, we show that the information-theoretic metric quantifies the distance between pairs of networks and we can use it, for instance, to cluster the layers of a multilayer system. By applying this framework to networks corresponding to sites of the human microbiome, we perform hierarchical cluster analysis and recover with high accuracy existing community-based associations. Our results imply that spectral based statistical inference in complex networks results in demonstrably superior performance as well as a conceptual backbone, filling a gap towards a network information theory.


I. INTRODUCTION
Shannon's entropy [1] and information-theoretic derived measures have been successfully applied in a range of disciplines, from revealing time scale dependence in neural coding [2,3] to quantifying quantum information [4][5][6] and the complexity of genetic sequences [7,8] or to unravel the mesoscale organization of interconnected systems [9][10][11][12], to cite just a few emblematic achievements. However, when it comes to complex networks, an appropriate definition of entropy has remained elusiveapplicability being often limited to the probability distribution of some network descriptor (such as the normalized distribution of node degrees).
Complex network theory dates to the discovery of several fundamental features [13,14] and complex networks have been widely used to model and better understand the organization of complex systems and their dynamics [15][16][17][18][19][20], the controllability of their constituents [21] and their resilience to structural and dynamical perturbations [22][23][24][25][26][27]. A recent and ongoing drive in complex networks is to merge ideas with quantum information science, including entangled networks for communication [28][29][30], developing a theory of node centrality in quantum walks on graphs [31] and in the detection of community structures formed in quantum systems [32].
In the same spirit as the Church-Turing-Deutsch principle [33], like all physical systems it is possible to view complex networks in terms of information processing, in which information changes in time from an input to an output state of a system. This then necessitates a method to quantify this information, its time-dependence in terms of storage and transfer of information, ideally between levels of a multilayer system, which can further be imperfect due to randomness or noise. A similar challenge was addressed by quantum theorists beginning decades ago when faced with quantifying informatic properties of quantum states [4][5][6][34][35][36][37]. Indeed, the modern theory of quantum information, as the name suggests, is fundamentally built upon the quantification of physical information, entropic based quantifications of non-locality, entanglement [34] and the inherent complexity of the quantum model [35]. These decades of research have placed entropic measures central to the modern theory, leading to quantum generalizations of Shannon's classical information theory [4][5][6][34][35][36][37]. However, for a variety of reasons, these results can't be applied directly to complex networks, although classical information-theoretic tools have been successfully adopted to solve specific problems [9,[38][39][40][41][42][43][44][45][46].
Here we address a similarly motivated challenge: inspired by how entropy is calculated in quantum systems, we define an interconnectivity-based density matrix to calculate the von Neumann entropy of a network. We analytically prove that our definition satisfies the desired additivity properties similar to quantum thermodynamic entropy-though the proof of this differs tremendously from the quantum case-which opens the door to avoid the shortcomings, recently pointed out in [47], imposed by sub-additivity failing as found in past approaches.
We exploit this entropy to develop a set of information-theoretic tools which apply to complex networks, such as Rényi q-entropy, generalized Kullback-Leibler and Jensen-Shannon divergences, thus providing a backbone to an information-theoretic approach to network science. Our framework, based on spectral properties of networks, allows one to probe contemporary problems faced in complex network science. For instance, we show that Kullback-Leibler minimization can be used to infer the parameter(s) of a model aimed to fit an observed network. By exploiting results of classical information theory, we are able to introduce the spectral counterpart of likelihood maximization and use it in practical applications to fit network models and perform model selection based on appropriate Akaike and Bayesian information criteria, as well as minimum description length. The strength of our approach relies on the fact that it uses the network as a whole, instead of a subset of network's descriptors, to attack parameter inference and model selection problems. Another important byproduct of the proposed information theoretic framework is the possibility to quantify the distance between complex networks. This problem is of high interest in recent applications concerning multilayer systems, such as time-varying and multiplex networks [48][49][50]. This type of systems consist of nodes replicated across several networks that exhibit different types of relationships or connectivity [51]. The possibility to compare layers from an information-theoretic point of view has been recently explored to aggregate them in order to reduce the structure and the complexity of biological, transportation and social multiplex networks [47]. To numerically probe our method, we cluster the layer of an empirical system and compare it against the existing classification.

II. VON NEUMANN ENTROPY OF A COMPLEX NETWORK
In quantum mechanics, probability distributions are encoded by density matrices. A density matrix ρ is a Hermitian and positive semi-definite matrix, with trace equal to unity which is used to represent both mixed and pure quantum states. A system is in a pure state |ψ , if and only if the bound Tr ρ 2 ≤ 1 is saturated. The density matrix admits a spectral decomposition as for an orthonormal basis {|φ i }, where λ i are nonnegative eigenvalues which sum up to 1. The density matrix allows to define the von Neumann entropy by i.e. it is equal to the Shannon entropy of the eigenvalues of the density matrix, where by convention 0 log 2 0 := 0. In this work we are not limited in having a quantum setup where the matrix ρ can be called a "density matrix" in the physical sense, but instead build on this idea to define a matrix from a network that satisfies the same mathematical properties of a density matrix.
First, such a density matrix should be positive definite and symmetric, therefore ρ = Z −1 e −βH -with H a symmetric matrix with non-negative eigenvalues, Z and β real numbers -is a suitable candidate. Second, the eigenvalues of ρ must sum to unity, thus imposing the constraint Z = Tr e −βH . Here, we use the following density matrix, defined by where L = D−A denotes the combinatorial graph Laplacian, being A the adjacency matrix of the network and D the diagonal matrix of node degrees. Eq. (3) resembles the Gibbs state of a quantum system where the Hamiltonian equals the Laplacian matrix, i.e. H = L, where β is a scale parameter. It is worth remarking that this functional form appears in many problems of physical interest. For instance, if β parameterizes time, setting Z = 1 and H = L, i.e. the Laplacian matrix, then e −Lt is the evolution matrix governing purely diffusive dynamics and its trace N i=1 e −λit is N times the average return probability, i.e. the probability that a walker starting at any node i will return to its origin at time t. Another case of interest is when β parameterizes temperature of a quantum system with Hamiltonian H and Z = Tr e −βH , so that ρ would provide the Gibbs state of the system in equilibrium at finite temperature. Although we are interested mainly in nonnegative values of β, it is interesting to show that when β = −1, Z = 1 and H = A we recover the well-known communicability matrix introduced by Estrada [52][53][54][55][56][57] (see Ref. [58] for a thorough review) or a normalized version of it, if Z = Tr e A .
In Ref. [59], a matrix is built in such a way that its mathematical properties satisfy the requirements of a density matrix. For an undirected complex network, L is symmetric and positive semi-definite but its eigenvalues do not sum to unity. However, the eigenvalues of the proposed matrix, defined by ρ bgs = L/ Tr L, evidently do, therefore ρ bgs is a mathematically suitable density matrix [45,59] and hence, Eq. (2) can in principle be applied. This approach has been recently generalized to the case of multilayer systems [51], composite networks where units exhibits different types of relationships that are generally modeled as different layers and used to reduce their structure [47].
However, it has been recently found that the von Neumann entropy calculated from the rescaled Laplacian ρ bgs does not satisfy the sub-additivity property in some critical circumstances [47]. For instance, let ρ bgs be the rescaled Laplacian matrix of a network with N nodes and |E| 1 edges, let σ bgs be the rescaled Laplacian matrix of a network with N nodes and just one (undirected) edge and let τ bgs be the rescaled Laplacian of the network obtained by summing up, entry wise, the corresponding adjacency matrices of the previous two networks. Since S(σ bgs ) = 0, the sub-additivity property S(τ bgs ) ≤ S(ρ bgs ) + S(σ bgs ) is not always satisfied because, as we will see later, S(τ bgs ) ≥ S(ρ bgs ) very often.
While this peculiar behavior can in fact be exploited in certain situations [47], one is interested in preserving sub-additivity, because in general it is expected that the entropy of a composite system A + B is equal or smaller than the sum of the entropy of A and B. In the case of complex networks, it seems intuitively reasonable but has remained to be proven-see Appendix A-that the aggregation of two graphs is expected to have lower entropy than the sum of the entropy of each graph separately. The entropy of the density matrix introduced to complex networks in this work -that we name spectral entropy-presents some interesting features that will be discussed later. A visualization of the density matrix corresponding to different network models is shown in Fig. 1.

III. VON NEUMANN ENTROPY OF STANDARD NETWORK MODELS
For simplicity, in the following we will use ρ = ρ G where there will be no ambiguity about which density matrix is under consideration. Indeed, we will also use the notation S(G) = S(ρ G ) to indicate the entropy of network G.
From the eigen-decomposition of the Laplacian matrix L = QΛQ −1 , where Λ is the diagonal matrix of Laplacian's eigenvalues, it is straightforward to show that where λ i (L) indicates the i-th eigenvalue of L. The spectral entropy of ρ can be rewritten as It follows that provides the relationship between the eigenvalues of the density and the Laplacian matrices. Using Eq. (6), the spectral entropy of the network G reduces to A possible interpretation of this entropy will be given later. Here, it is worth discussing some very special cases, where the spectral entropy following from our definition provides very different results from BGS (S bgs = S(ρ bgs )).
Networks of isolated nodes. First, let us consider the case of a network with no links among nodes. The eigenvalues of the combinatorial Laplacian are all 0, Z = N and S = log 2 N , i.e. the entropy is maximum regardless of β-whereas S bgs is not defined.
Single-link networks. Let us consider a network with only one link between node i to node j, i.e. A = E(ij) + E(ji) being E(ij) the canonical matrix with all zero entries except in i−th-j−th component. It is straightforward to show that the eigenvalues of the combinatorial Laplacian are all 0 except one whose value is 2. While S bgs = 0, it follows that where Z = N − 1 + e −2β and the asymptotic behavior is log 2 N for any β. Our spectral entropy scales with the size of the network when only one directed link is present. Complete networks. In the case of the clique network, there are N − 1 eigenvalues equal to N , from which where Z = 1 + (N − 1)e −βN and the asymptotic behavior is log 2 N for small β and 0 for large β.
Complex network models. We show in Fig. 2 the von Neumann entropy as a function of 1/β for Erdös-Rényi [60], Watts-Strogatz [61] and K-regular networks, for different values of their parameters, the link probability p link , the rewiring probability p rew and the number K of node neighbors, respectively. Intriguingly, in all three network types, when β is high the entropy tends to zero whereas it approaches the theoretical limit log 2 N when β is small enough.
Sub-additivity of spectral entropy. Let us consider an undirected network G of N nodes changing over time, with adjacency matrix A(0) at time t = 0. At each time step t a pair of two nodes, chosen uniformly randomly and not yet connected, is linked by one undirected link. This is equivalent to having another network G (t) consisting of just one link and N − 2 isolated nodes, whose If G(t − 1) and G (t) have no edges in common, the above operation is equivalent to the union of the two graphs, another typical approach to aggregate networks. One immediately asks if the spectral entropy defined in this work and the BGS entropy satisfy the sub-additivity property such that We show in Fig. 3 the distribution of ∆S = S(G(t)) − [S(G(t − 1)) + S(G (t))] obtained by calculating both quantum and BGS entropy for an ensemble of Erdös-Rényi networks. Similar results are obtained by using ensembles of Watts-Strogatz and K-regular networks.
As proven in the Appendix, the spectral entropy does not violate the sub-additivity property regardless of the value of β, whereas the S bgs often violates these sensible additivity requirement. See Appendix A for further details.

IV. RÉNYI ENTROPY OF COMPLEX NETWORKS
The von Neumann entropy is just a special case of the more general Rényi entropy of a quantum system, defined by [36] It is widely used in quantum information theory and quantum computing to quantify entanglement [62] and correlations in physical systems [63], it can be expressed as families of tensor network contractions [64] and has been found to share a close relationship to free energy [65]. By using Eq. (6), the Rényi spectral entropy of a complex network is therefore given by in terms of the eigenvalues of the Laplacian matrix. In fact, this entropy generalizes other entropic measures.
As q approaches 0, S q increasingly weights all eigenvalues more equally, approaching the Hartley entropy. In the limit of q → 1 Rényi entropy approaches the spectral entropy. As q approaches ∞, S q becomes dominated by the high-probability events and it converges to the min-Entropy. The case with q = 2, recovers the collision entropy-in the case of quantum systems, this is connected to the purity of the system. In fact, S 2 identically vanishes for ρ 2 = ρ, i.e. for pure states. We show in Fig. 4 entropy as a function of q at different values of β for Erdös-Rényi, Watts-Strogatz and K-regular networks, respectively, for different values of their parameters. It is common to assess that the Watts-Strogatz network reduces to a K-regular graph for p rew = 0 and approaches an Erdös-Rényi network for p res = 1. However, the behavior of Rényi entropy as a function of q shows that there are some significant differences at least in the latter case, where the entropy is considerably larger for the Watts-Strogatz model than the Erdös-Rényi one.

V. GENERALIZED QUANTUM DIVERGENCES BETWEEN TWO COMPLEX NETWORKS
One of the main goal of information theory is to quantify the amount of information about a probability distribution -generally obtained from empirical measurements -provided that one has full information about another probability distribution -e.g. the model. This goal is achieved by introducing relative entropies or, equivalently, information divergences.
Similarly, the introduction of divergences (a.k.a. quantum relative entropy) in quantum information theory is foundational to the quest to understand differences between quantum states, quantum and classical information, the quantification of the thermodynamic cost of communication as well as optimal protocols to transfer information-see e.g. the reviews [6] and [66].
The quantum Rényi entropy can be used to define the quantum Rényi divergence, also known as q−relative Rényi entropy, by which reduces to the quantum Kullback-Leibler divergence (or, equivalently, the quantum relative entropy) for q → 1.
In general such divergences are not symmetric and bounded, making difficult certain comparison uses. An alternative measure is the quantum q−Jensen-Shannon where µ = ρ+σ 2 is usually called mixture matrix. It can be proven that J 1 2 q defines a true metric for 0 ≤ q < 2 [67][68][69] and can be used as a measure of distinguishability [47] or similarity [70].
Our formalism allows to use the family of q−quantum divergences to compare two networks and this opens up a new approach to attack two fundamental problems in network science.

A. Maximum Likelihood Estimation and Model Selection
For a given model and its corresponding set of parameters, the likelihood function measures the probability of observing the data according to the model parameters. Therefore, it is sufficient to perform maximum likelihood estimation to obtain the parameters of the model that better reproduce the data, according to the model.
While it is straightforward to define the likelihood function for probability distributions, it is challenging to define a similar concept in the context of density matrices. In the case of complex networks the challenge is complicated by the lack of an appropriate probability distribution. In fact, one usually designs a model, or a set of models, to reproduce only some salient features of an observed complex network, often limiting the comparison between model and data to a few descriptors such as degree distribution, degree correlations, clustering or path statistics.
Quantum divergences provide a grounded and unifying approach to this problem. Let us start from the classical problem where an empirical probability distribution P (x) is obtained from observing the N outcomes {x i } (i = 1, 2, ..., N ) of a stochastic variable X . Let Q(x; Θ) be a model to approximate P (x), which depends on one (or more) parameter(s), here indicated by Θ. In this context, the Kullback-Leibler divergence measures the information gain when the model Q(x; Θ) is used to explain the observation P (x), and it can be written as If the model Q is plausible, there exist a value Θ such that the divergence is minimum and we are interested in finding such a value by minimizing the divergence with respect to Θ. We notice that the first term in the righthand side of Eq. (16) does not depend on Θ and therefore plays no role in the minimization procedure. By noticing that where δ is the Dirac function, the second term in the right-hand side of Eq. (16) reduces to which is proportional to the negative log-likelihood function. Here, the pre-factor can be safely neglected during the minimization procedure. Therefore, by minimizing the Kullback-Leibler divergence one effectively maximizes the log-likelihood function: We use the proposed framework to achieve a similar result in the case of density matrices. Let ρ be the density matrix of an empirical network and let σ(Θ) a model for such a network, depending on one (or more) parameter(s), here indicated by Θ. By starting from Eq. (14), and by arguments similar to the classical case, it is straightforward to show that By comparing the right-hand side of Eq. (18) and Eq. (19), we define the network log-likelihood function by where the likelihood function can be calculated by exploiting the properties of the matrix exponential as
This result allows us to obtain a maximum-likelihood estimation of parameter(s) Θ = {θ 1 , θ 2 , ..., θ d } by minimizing the Kullback-Leibler divergence between networks. The covariance matrix corresponding to this estimation is the Fisher information matrix (see Appendix B), whose classical counterpart is equivalent to the Hessian of the Kullback-Leibler divergence and it is used to assess the quality of the spectral likelihood estimate. If Θ is an unbiased estimator of Θ, the associated covariance matrix satisfies the Cramer-Rao bound cov(Θ ) ≥ I −1 (Θ ).
We can exploit this finding to go beyond model fitting, defining an operative procedure for model selection.
In fact, generally more than one model is used to understand the data and a fundamental problem in data analysis, known as model selection, is to quantify which model out of a set of candidates is the best one in reproducing the data. One solution to this problem has been given by Akaike, who proposed an information criterion (AIC) for this purpose [71], by showing that the expected value of the relative cross-entropy term in the Kullback-Leibler divergence equals the log-likelihood of the model given the data plus a penalizing constant term which accounts for the number of free parameters. The AIC is given by where k is the number of parameters of the model and we plug Eq. (20) into Eq. (23) for applications to complex networks. In practice, given a set of models M = {M 1 , M 2 , ..., M n }, with number of parameters k 1 , k 2 , ..., k n and likelihood L 1 , L 2 , ..., L n , respectively, the most suitable candidate to explain the data is the one being a trade-off between having as small as possible divergence from the data and as small as possible number of parameters, i.e. the one such that AIC is minimum. Similarly, other model selection criteria can be extended from information theory to the complex network framework. This is the case of Bayesian information criterion (BIC) defined by and Fisher information approximation (FIA) defined by where the last term penalizes a model because of its geometric complexity [72,73], a typical concept in information geometry [74]. In classical statistical inference, it has been shown [72] that the FIA quantifies the length of the shorted description of the data given the model and, as a consequence, its minimization corresponds to finding the minimum description length (MDL) of the network. The MDL principle [75], also known as a formalization of Occam's razor, represents data and models as codes to be compressed, where the model better compresses data to provide its best description. This principle is one of the most important concepts in information theory and its interpretation, in our context, opens the door for future studies.
To probe the power of the proposed method we performed several tests on synthetic networks. We generate a network from a specific model with a given value of the parameter(s), assuming it is the observed network, and a set of networks from the same model by varying the value of the parameter(s). Therefore, we perform a maximum-likelihood parameter estimation based on Kullback-Leibler minimization to find the value of the parameter(s) that better fit the observation.
In Fig. 5A we consider the case of an Erdös-Rényi network model (N = 200), being link probability p link the parameter to fit and p link = 0.05 its expected value. By scanning over the range allowed for p link , we sample an ensemble of 100 realizations for each value of the link probability and compare each network against the observed one to calculate the average Kullback-Leibler divergence. In principle, the procedure depends on the value of β, therefore we perform the analysis for different values of this parameter in order to understand which value (or range of values) provides the best fit. We consider specific values of β for this purpose, more specifically we calculate the β such that the entropy normalized to its maximum value, i.e. log 2 N , gets a specific real value c(β ) between 0 and 1. We choose values for c(β ) ranging between 0.01 and 0.9, and the results show that the global minima corresponds or are very close to the expected value, for c(β ) ≤ 0.5. The best performance is obtained for c(β ) < 0.1. This analysis suggests a rule of thumb to choose a specific value of β for fitting purposes: the region close to the critical point -where entropy changes from 0 to a positive value -provides the most performant range for β.
In Fig. 5B we show the Kullback-Leibler divergence of Watts-Strogatz's network model (N = 200) for different values of the parameters K and p rew against a Watts-Strogatz network with K = 6 and p rew = 0.2, assumed to be the empirical data. The result shows that the most likely region of the parameter space, i.e. the one where the model is more informative about the data, is successfully identified by the Kullback-Leibler minimization procedure previously described. The result is very interesting because only a single realization of the model, instead of an ensemble as in the previous case, has been used for each pair of parameters, suggesting that the procedure is robust against sample size while reducing the computational cost of the calculation. Fig. 5C shows the Kullback-Leibler divergence of a stochastic block-model [76] (N = 200) for different values of the intra-and inter-community probability parameters, p intra and p inter , and number of equally sized blocks against a network obtained from the same model with 8 blocks, p intra = 0.5 and p inter = 0.05, assumed to be the empirical data. As for the Erdös-Rényi case, we sample an ensemble of 100 realizations for each triad of parameters, to calculate the average Kullback-Leibler divergence between model and a observation. We show the results for different block sizes and the overall minimum is found for 8 blocks, p intra = 0.6 and p inter = 0.05, in excellent agreement with expectation.

B. Clustering layers of multilayer systems
The second application concerns the more recent problem of identifying layers of a multiplex network or snapshots of a time-varying network [51] that provide redundant information about the system and that might be aggregated to reduce the complexity of the system [47]. In this case, by using appropriate quantum divergences it is possible to calculate a distance between layers or snapshots and devise exact or heuristics procedures to cluster them. With the recent rise of interest in multilayer systems [48][49][50], this problem became very relevant for practical applications, e.g. to demonstrate the validity of a multilayer approach to modeling complex systems as the human brain [77], and new approaches to cluster and aggregate layers have been proposed more recently [78,79].
Here, we use the networks built from analysis of the structure and function of the human microbiome, consisting of 18 layers of a multiplex network, each one corresponding to a body site. Recently such layers have been partitioned into community types, by using Dirichlet multinomial mixture models, that may be associated with complex diseases [56]. We use the same data and calculate the Jensen-Shannon distance between each pair of layers for different values of the β parameter. We show in Fig. 6 the resulting hierarchical clustering of the layers, in good agreement with the result reported in [56]. A more quantitative comparison against the results in [56] is difficult because only p−value upper bounds are reported in the study for community-type associations. Nevertheless, our method is able to correctly reproduce the clustering of sites within certain body regions. For instance, the community consisting of hard palate, tongue dorsum, saliva, palatine tonsils and throat, as well as the community made by vaginal introitus, mid vagina and posterior fornix and the smaller ones made by sub-and supragingival plaque and left/right retroauricular crease. For ranges of β within one order of magnitude, the differences between hierarchies are small and not significant. Here we show the results for β = 0.1 and β = 10, with cophenetic distance equal to 0.84 and Baker's gamma correlation coefficient equal to 0.68. By performing the same analysis on hierarchies where labels are reshuffled uniformly random, we reject the null hypothesis that the two dendrograms are uncorrelated (P < 0.001), support-ing the fact that the result is robust to the choice of β.

VI. DISCUSSION
The use of von Neumann entropy is central to modern quantum information theory, with many new insights, uses and interpretations arising often. In the context of complex networks, spectral entropy might be considered as a measure of the information content of the system, although this interpretation is not easy to accept without the appropriate formalization typical of classical and quantum information theory.
Here, we observe that networks of isolated nodes (which challenge the notion of a network itself) and fullyconnected networks, i.e. cliques, have maximal entropy. Of course, connectivity is not the feature common to such extremal network scenarios. We investigated several alternative explanations, but the most promising one is based on the number of possible configurations obtained by reshuffling the connections among nodes, that is equivalent to randomly reshuffling the entries of the adjacency matrix. In fact, all possible reshuffled configuration of both the network of isolated nodes and the clique network do not alter the system. In other words, the graph isomorphism problem becomes trivial, as all possible networks are not distinguishable from each other, thus maximizing the uncertainty about the system and, consequently, the entropy. In the case of K-regular networks the situation is a bit different, because not all reshuffled configurations will keep them unaltered: therefore such networks are expected to exhibit lower entropy than log 2 N . It is interesting the case of an Erdös-Rényi network, which is characterized by stochastically homo-geneous connectivity. The entropy for this type of networks is almost always zero or log 2 N , except for a narrow range of β. This is compatible with our interpretation, because on average almost any reshuffled configuration of this type of networks will resemble the original. According to this interpretation, extremely sparse networks and extremely dense networks should exhibit almost maximum entropy. We have performed additional numerical experiments and theoretical analysis of toy models (path graphs, lattices, etc.) and we have verified our expectation.
The proposed framework is entirely based on the computation of eigenvalues of density matrices and their products. The computational complexity of such operation is generally expected to be polynomial and to scale as O(n γ ), with 2 < γ < 2.4, making challenging calculations for networks with hundreds of thousands nodes. However, the recent advances in parallel operations on dense symmetric matrices make possible to solve eigenvalue problems for matrices with size of the order of 10 5 in less than five minutes on standard computer machines [80].

VII. CONCLUSIONS AND OUTLOOK
The concept of thermodynamic entropy has been crucial to understand the structure and the dynamics of complex systems. The generalization of this concept to quantum mechanics by von Neumann was a milestone in the field, allowing, among other applications, to characterize the mixedness of quantum states.
An appropriate definition of entropy has been lacking in the field of complex networks, mainly because there is no simple way to define a probability distribution able to represent, without loss of information, the network. Previous attempts to define the entropy of a complex network where mainly based on the calculation of Shannon entropy of the probability distribution of some descriptor(s), while neglecting other information.
More recently, the idea that a quantum-like entropy might be introduced for complex networks has been explored. Here, we have demonstrated that previous attempts fail in preserving fundamental properties of entropy, like sub-additivity. Motivated by this finding, we have proposed a density matrix, inspired by the Gibbs state of a quantum system, which non-trivially depends on the combinatorial Laplacian matrix of the network. One of the main advantages of this approach, rooted in spectral theory, is that the resulting entropy does not depend on the distribution of a specific network descriptor but it depends on the network as a whole.
In this study, we have shown, analytically and numerically, that our density matrix allows to preserve the basic properties of an entropy. More specifically, we have demonstrated that the entropy of the system obtained by aggregating two different networks must be equal to or smaller than the sum of the entropy of the two non-aggregated networks.
This definition of entropy has allowed us to define Rényi spectral entropy and generalized relative entropies, also known as quantum divergences. By using Kullback-Leibler divergence and well-known results in classical information theory we have devised a maximum-likelihood approach to model fitting, entirely based on the eigenvalue spectra of density matrices. Our numerical experiments confirmed that by minimizing the Kullback-Leibler divergence between observed network and models, we obtain the maximum spectral likelihood estimation of parameters. This result might be of special interest for applications devoted to unravel the meso-scale organization of a network, among others.
Finally, by using the Jensen-Shannon divergence, whose square root provides a metric, we have shown that our framework can be applied to quantify the spectral distance between two networks. More specifically, this result is of interest for applications in multilayer network research, where the distance among layers can be measured and used to hierarchically cluster them. As a representative example of the power of our methodology, we have shown that this approach recovers with excellent accuracy the community-based classification of human microbiome sites.
This study opens interesting perspectives on future works. For instance, while our inference procedure does not provide the community label to assign to each node, as in community-detection methods based on stochastic block-modeling [81][82][83][84][85][86][87], it provides a fast way, based on network's spectral properties, to explore the parameter space in order to identify the most informative region about the data. An intriguing possibility to explore is how this result can complement existing network inference approaches [88].
A future application of our framework-of great interest for the community of network scientists-is the quantification of how much information is needed for correctly learning the parameters of a model, that should not be confused with the Cramer-Rao bound. This represents a crucial issue in network science: connectivity of empirical networks is often sampled through ad hoc algorithms, as in the case of virtual social systems, the Internet or the World Wide Web. Quantifying to which extent it is possible to learn the parameter(s) of a network model in absence of partial connectivity information is a tantalizing possibility, expected to deepen our understanding of networked systems.
The information-theoretic quantities introduced in this work open the intriguing possibility to extend information geometry to complex networks. In information geometry parametric statistical models define a Riemannian manifold endowed with Fisher information matrix as a metric tensor [89,90]. The importance of information manifolds for statistical inference is well established and found applications to machine learning [74,91,92] and phase transitions in quantum systems [93][94][95][96][97][98]. It will be interesting to explore to which extent the success of information geometry can be ported to statistical inference of complex networks, where the metric tensor of the underlying information manifold is given by the spectral Fisher information matrix.
The potential of the proposed methodologies goes beyond network science and we envision important contributions to quantum physics, especially to the emergent field of Hamiltonian learning [99,100] and inference of quantum complex network models based on qubit entanglement [29,30].

ACKNOWLEDGMENTS
The authors thanks Alex Arenas for useful discussions. MDD acknowledges financial support from the Spanish program Juan de la Cierva (IJCI-2014-20225). JDB acknowledges IARPA and the Foundational Questions Institute (FQXi) for financial support.
Appendix A: Sub-additivity of the von Neumann entropy for aggregate networks Let A and B be the adjacency matrices of two networks G A and G B , respectively, with N nodes. Let C = A + B be the adjacency matrix obtained by their sum, which represents a new network G C obtained by aggregating G A and G B . In the following we will show that S(G C ) ≤ S(G A ) + S(G B ), i.e. that the sub-additivity property of the entropy is always satisfied. For sake of simplicity, we will use the notation S X = S(G X ).
Let us indicate with L A , L B and L C the Laplacian matrices of G A , G B and G C , respectively, and let ρ A , ρ B and ρ C indicate the corresponding density matrices defined as in Eq. (3).
The Kullback-Leibler divergence of two density matrices is always non-negative, as a consequence of Klein's inequality with f (X) = X ln X. Therefore from which log 2 Z X ≥ 0.
By summing up such non-negative terms, the following inequality D(ρ C ||ρ A ) + D(ρ C ||ρ B ) + +β Tr [L A ρ A ] + β Tr [L B ρ B ] + log 2 Z C ≥ 0 (A4) holds. The above inequality can be expanded into By exploiting the fact that, for a Gibbs-like state, the von Neumann entropy is given by S(ρ X ) = β Tr [L X ρ X ] + log 2 Z X (A6) and L C = L A + L B , it follows that which leads to S A + S B ≥ S C .