Configuration model for correlation matrices preserving the node strength

Correlation matrices are a major type of multivariate data. To examine the properties of a given correlation matrix, a common practice is to compare the same quantity between the original correlation matrix and reference correlation matrices, such as those derived from random matrix theory, that partially preserve properties of the original matrix. We propose a model to generate such reference correlation and covariance matrices for a given matrix. Correlation matrices are often analysed as networks, which are typically heterogeneous across nodes in terms of each node's total connectivity to other nodes. Against this background, the present algorithm generates random networks that preserve the expectation of the total connectivity of each node to other nodes, akin to configuration models for conventional networks. Our algorithm is derived from the maximum entropy principle. We apply the proposed algorithm to measurement of clustering coefficients and community detection, both of which require a null model to assess the statistical significance of the obtained results.


I. INTRODUCTION
Correlation matrices are a major form of multivariate data in various domains. Examples include financial time series [1,2], behavioural and questionnaire data in psychology [3], genetic interactions [4-6], neuroscience [7-9] and climate science [10]. Although pairwise correlation does not always reflect physical connection or direct interaction between two entities, correlation matrices, whose entries represent the strength of correlation between a pair of entities (which we call nodes in the rest of the paper), are conventionally used as a relatively inexpensive substitute for direct connection.
Major analysis tools for correlation matrix data include principal component analysis [11], factor analysis [12], Markowitz's portfolio theory in mathematical finance [13] and random matrix theory [1,2]. A more recent approach to correlational data is network analysis. With this approach, the first task is usually either to threshold on the value of the pairwise correlation to define an unweighted (i.e., binary) network or to adopt the value of the pairwise correlation as the edge weight to define a weighted network. Then, one examines properties of the obtained network. Network analysis provides information that is different from the information obtained with other methods, such as the distance between two nodes, the centrality (i.e., importance) of the nodes according to various criteria and network motifs (i.e., overrepresented small subgraphs) [14-16]. Network analysis of correlation matrices is common across disciplines [3-6, 9, 10, 17-23].
However, there are fundamental technical problems in applying standard methods of network analysis to correlation matrix data. First, correlation networks tend to suffer from type-1 errors (i.e., false positives) because pairwise correlation does not differentiate between direct effects (i.e., nodes v_1 and v_2 are correlated because they directly interact) and indirect effects (i.e., v_1 and v_2 are correlated because nodes v_1 and v_3 interact and v_2 and v_3 interact) [24-26]. Second, when analysing a correlation matrix as an unweighted network, no consensus exists regarding the choice of the threshold value, despite the evidence that results are sensitive to the threshold (e.g., Ref. [27]). Third, whereas thresholding is claimed to mitigate the uncertainty of weak links and enhance the interpretability of network-analysis results [6,27], thresholding discards potentially important information contained in the values of the correlation coefficient [28]. Last, even if we do not carry out thresholding and treat a correlation matrix as a weighted network, the problem of type-1 errors remains, and it is unclear how to deal with negatively weighted edges.
We consider that these shortcomings inherent in correlation network analysis stem from the paucity of network-analysis tools tailored to correlation matrices. Beyond being symmetric, correlation matrices are a special type of matrix in that they are positive semidefinite (i.e., all eigenvalues are nonnegative), they are dense and the range of their entries is confined between −1 and 1 [29]. Furthermore, node i's weighted degree in a correlation matrix represents the correlation between node i and the average of the signal over all nodes, which is somewhat non-intuitive and affects the analysis of correlation networks [30]. We do have algorithms, tailored to correlation matrix input, to partition correlation networks into communities [30], calculate their clustering coefficients [31] and detect change points in time-varying correlation networks [32]. However, these analysis tools were proposed only recently, and analysis tools for correlation matrices as networks still seem to be in their infancy.
In the present paper, we propose a configuration model for correlation matrices and showcase its use as the null model in measuring the clustering coefficient and community structure. In general, a null model of networks generates randomised networks that preserve some but not all features of the given network. Then, one compares a property in question calculated for the given network with that calculated for sample networks generated by the null model to assess whether the property of the given network is significant relative to that of the null model [33,34]. Null models available for correlation matrices include the identity matrix [30], the Laguerre ensemble [1] or Gaussian orthogonal ensemble [2] of random matrix theory, the Hirschberger-Qu-Steuer (H-Q-S) algorithm [35] and correlation matrices reconstructed from noise eigenmodes corresponding to small eigenvalues of the correlation matrix (and the largest eigenmode in a different variant) [30].
For conventional networks (i.e., networks not derived from correlation matrices), heterogeneity in the degree distribution is a common feature in empirical data [15,16]. Configuration models are probably the most often used class of null models and generate random networks under the constraint that the generated networks conserve the (expected) degree of each node in the original network [34, 36-38]. Heterogeneous degree distributions have also been observed for correlation networks of the brain [17,18], financial data [19,21] and gene coexpression [22,23] (also see Fig. 2). However, none of the aforementioned null models for correlation matrices is intended to preserve the degree or strength (i.e., weighted degree) of the nodes in correlation matrices, which motivates the present paper.

II. MAXIMUM ENTROPY MODEL FOR CORRELATION MATRICES PRESERVING THE EXPECTED STRENGTH OF EACH NODE
We propose a configuration model for covariance matrices. We work with covariance matrices rather than correlation matrices due to the analytical tractability of the former. However, in practice, we usually analyse correlation matrices rather than covariance matrices because the latter are un-normalised. Therefore, we need a configuration model for correlation matrices. Given this situation, we will first explain applications of our algorithm to correlation matrices in section II A. This discussion gives two conditions that constrain the configuration model for covariance matrices, which will be developed in section II B. MATLAB codes for estimating the configuration model are available on GitHub [39].

A. Conservation of the node strength in correlation matrices
Denote by Σ^org the N × N covariance matrix given as input and by Σ^con the N × N covariance matrix obtained from the configuration model. We will explain how to calculate Σ^con from Σ^org in section II B. When the input is a correlation matrix, denoted by ρ^org, our aim is to ensure that the expected strength of each node of the correlation matrix generated by the configuration model, denoted by ρ^con, is similar to that of ρ^org.
To this end, we start by discussing the relationship between the entries of the covariance matrix and the node strength in the corresponding correlation matrix. The Pearson correlation coefficient between nodes i and j is given by

ρ_ij = Σ_ij / √(Σ_ii Σ_jj),    (1)

where ρ = (ρ_ij) is the correlation matrix corresponding to a covariance matrix Σ. A direct equivalent of the strength of node i in the correlation matrix, denoted by s_i, is given by

s_i = Σ_{j=1; j≠i}^N ρ_ij = Σ_{j=1; j≠i}^N Σ_ij / √(Σ_ii Σ_jj).    (2)

Equation (2) indicates that, if each diagonal element of Σ^con and the row sum of the off-diagonal elements of Σ^con for each row are equal to those for Σ^org, the configuration model, which will be formulated in section II B, roughly conserves s_i (1 ≤ i ≤ N) of the input correlation matrix. Therefore, in our configuration model, we impose that the expectations of Σ^con_ii and Σ_{j=1; j≠i}^N Σ^con_ij are equal to Σ^org_ii and Σ_{j=1; j≠i}^N Σ^org_ij, respectively, for each i (1 ≤ i ≤ N).
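As a concrete illustration of Eqs. (1) and (2), the conversion from a covariance matrix to a correlation matrix and the computation of the node strength can be sketched as follows in Python with NumPy (the matrix values are made up for illustration; this is not the released MATLAB implementation):

```python
import numpy as np

def cov_to_corr(Sigma):
    """Correlation matrix from a covariance matrix: rho_ij = Sigma_ij / sqrt(Sigma_ii Sigma_jj), Eq. (1)."""
    d = np.sqrt(np.diag(Sigma))
    return Sigma / np.outer(d, d)

def node_strength(rho):
    """Node strength s_i: sum of the off-diagonal elements of row i, Eq. (2)."""
    return rho.sum(axis=1) - np.diag(rho)

# A toy covariance matrix with non-uniform diagonal elements.
Sigma = np.array([[2.0, 0.4, 0.2],
                  [0.4, 1.0, 0.3],
                  [0.2, 0.3, 0.5]])
rho = cov_to_corr(Sigma)
s = node_strength(rho)
```

The non-uniform diagonal of this toy Σ is exactly the situation discussed below, in which conserving the diagonal and the off-diagonal row sums of the covariance matrix does not exactly conserve the node strength of the correlation matrix.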
In fact, Eq. (2) implies that, even under these constraints, the expected node strength for ρ^con is not generally equal to the node strength for ρ^org. The discrepancy would be large if the auto-covariance, Σ^org_jj, which appears in the denominator in Eq. (2), heavily depends on j. In contrast, uniformity of the diagonal elements of Σ^org across the nodes guarantees that the configuration model conserves s_i of the correlation matrix for each i.
A correlation matrix is a covariance matrix (and is therefore allowed as input to our algorithm developed in section II B) whose diagonal elements are independent of the node (i.e., equal to 1 for each node). Therefore, when a correlation matrix is fed to our algorithm, we expect that the output conserves the node strength to a high accuracy.
The flow of the algorithm is shown in Fig. 1. If the original data are a covariance matrix, we first transform it to a correlation matrix, ρ^org, using Eq. (1). Then, we submit ρ^org, which is itself a covariance matrix, to our algorithm. Because the input covariance matrix (i.e., ρ^org) has uniform diagonal elements (i.e., all equal to 1), we expect that the algorithm approximately conserves the node strength of the correlation matrix. The output of the algorithm is a covariance matrix for which the expectation of each diagonal element is equal to 1. Note that each diagonal element of the output covariance matrix is not generally equal to 1. Finally, we transform the output covariance matrix to the correlation matrix, denoted by ρ^con, using Eq. (1).

B. Maximum entropy formalism and the gradient descent algorithm
Assume a covariance matrix Σ^org as input. We generate random covariance matrices that conserve the expectation of the row sum of the off-diagonal elements of Σ^org in each row and the expectation of each diagonal element, i.e., the auto-covariance of each node, of Σ^org. We achieve this goal by invoking the maximum entropy principle, with which one generates the distribution with the largest entropy under given constraints [40]. For conventional networks, the maximum entropy principle has been used for generating unweighted [41-45] and weighted [42-48] networks (also see [49-51]). However, networks generated by these algorithms are not correlation or covariance matrices in general.
Denote by N the number of elements, which we refer to as nodes according to the terminology of networks. We generate N × N covariance matrices of the form

Σ^con = X X^⊤ / L,    (3)

where X = (x_ij) is an N × L real matrix and ⊤ denotes the transposition. Because a covariance matrix is positive semidefinite, its eigendecomposition implies that any given covariance matrix can be written in the form of Eq. (3) when L is larger than or equal to the number of positive eigenvalues of Σ^con. Matrix Σ^con is interpreted as the sample covariance matrix when the ith data vector (e.g., a time series in discrete time or a feature vector) is given by (x_i1, . . ., x_iL).
We will determine a distribution of matrix X, which we denote by p(X). Under the maximum entropy principle, we maximise

H(X) = −∫ p(X) ln p(X) dX + Σ_{i=1}^N α_i [ ∫ p(X) Σ^con_ii dX − Σ^org_ii ] + Σ_{i=1}^N β_i [ ∫ p(X) Σ_{j=1; j≠i}^N Σ^con_ij dX − Σ_{j=1; j≠i}^N Σ^org_ij ],    (5)

where α_i and β_i are Lagrange multipliers. By taking the functional derivative of Eq. (5) with respect to p(X) and setting it to zero, we obtain

p(X) ∝ Π_{ℓ=1}^L exp( x_ℓ^⊤ K x_ℓ / L ),    (6)

where x_ℓ = (x_1ℓ, . . ., x_Nℓ)^⊤ and K is the N × N matrix whose elements are given by K_ii = α_i and K_ij = (β_i + β_j)/2 (i ≠ j). Therefore, p(X) is given by a multivariate normal distribution, i.e.,

p(X) ∝ Π_{ℓ=1}^L exp( −x_ℓ^⊤ Σ^{-1} x_ℓ / 2 ),  with Σ^{-1} = −2K/L,    (7)

with which one draws x_ℓ for each ℓ from the N-variate multivariate normal distribution with mean zero and precision matrix Σ^{-1}, independently for each ℓ. Note that Σ is the covariance matrix for the estimated multivariate normal distribution.
To numerically determine the precision matrix, we reparametrise Eq. (7) as

(Σ^{-1})_ii = α_i,  (Σ^{-1})_ij = (β_i + β_j)/2  (i ≠ j)    (8)

without loss of generality (the constant factor −2/L is absorbed into α_i and β_i). We infer Σ^{-1} by running the following gradient descent algorithm.
Equation (8) leads to the gradient descent learning rule for α_i and β_i, which increases the precision-matrix parameters when the model covariance exceeds the target and vice versa:

α_i ← α_i + ε ( ⟨Σ^con_ii⟩ − Σ^org_ii ),    (9)
β_i ← β_i + ε ( Σ_{j=1; j≠i}^N ⟨Σ^con_ij⟩ − Σ_{j=1; j≠i}^N Σ^org_ij ),    (10)

where 1 ≤ i ≤ N, ε is the learning rate, ⟨·⟩ denotes the expectation with respect to p(X) and ⟨Σ^con⟩ = Σ is obtained by inverting the current precision matrix. We refer to Eq. (8) with the optimised α_i and β_i values as the configuration model for correlation matrices. In the numerical simulations in Section III, we set ε = 10^{-4}. We remark that the gradient descent algorithm, and hence the obtained precision matrix, does not depend on our choice of L.
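The iteration can be sketched numerically as follows. This Python sketch is not the released MATLAB code [39]; the parametrisation of the precision matrix and the signs of the updates follow the reconstruction above and should be treated as assumptions, and the toy input matrix is made up for illustration.

```python
import numpy as np

def fit_config_model(S_org, eps=0.05, max_iter=20000, tol=1e-6):
    """Fit alpha_i, beta_i so that Sigma = K^{-1} matches the diagonal elements
    and the off-diagonal row sums of S_org. The precision matrix is
    parametrised as K_ii = alpha_i, K_ij = (beta_i + beta_j)/2 (an assumption
    following the reconstruction in the text)."""
    N = S_org.shape[0]
    diag_target = np.diag(S_org).copy()
    row_target = S_org.sum(axis=1) - diag_target  # off-diagonal row sums
    alpha = np.ones(N)
    beta = np.zeros(N)
    Sigma = np.eye(N)
    for _ in range(max_iter):
        K = (beta[:, None] + beta[None, :]) / 2.0
        np.fill_diagonal(K, alpha)
        Sigma = np.linalg.inv(K)                  # <Sigma_con> for the current parameters
        diag_err = np.diag(Sigma) - diag_target
        row_err = (Sigma.sum(axis=1) - np.diag(Sigma)) - row_target
        if max(np.abs(diag_err).max(), np.abs(row_err).max()) < tol:
            break
        alpha += eps * diag_err                   # raise precision if variance too large
        beta += eps * row_err                     # analogous update for row sums
    return Sigma

# Toy input: a small correlation matrix (unit diagonal).
S_org = np.array([[1.0, 0.3, 0.2],
                  [0.3, 1.0, 0.4],
                  [0.2, 0.4, 1.0]])
Sigma_con = fit_config_model(S_org)
```

After convergence, the diagonal elements and off-diagonal row sums of the fitted Σ match those of the input to within the tolerance, as the constraints require.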

C. Choice of L
A covariance matrix Σ^con obtained from our configuration model obeys a Wishart distribution with L degrees of freedom, denoted by W_N(L, Σ). The mean of each element of Σ^con is given by the corresponding element of Σ, and the variance of Σ^con_ij is given by (Σ_ij² + Σ_ii Σ_jj)/L [52,53]. Therefore, L controls the amount of fluctuation in the covariance matrices generated by the algorithm. In the limit L → ∞, the configuration model always produces covariance matrix Σ, in which the strength of each node and each diagonal element agree with those of the input covariance matrix, Σ^org. If L is finite, the configuration model produces covariance matrices that differ from sample to sample.
We set L to the length of the original data from which the covariance or correlation matrix is calculated (e.g., the length of the time series, the number of participants in an experiment or the dimension of the feature vector). If the length of the original data is unknown, we propose to set L to the number of positive eigenvalues of Σ^org because it is the smallest value of L with which the configuration model may preserve the rank of the input covariance matrix in addition to the node strength. We remark that our gradient descent algorithm often fails when Σ^org is not of full rank (hence L < N). In the following sections, we use empirical data whose L value is known and for which L > N.
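The role of L can be illustrated by direct sampling. The following sketch (assuming NumPy; sizes and seed are arbitrary) draws Σ^con = XX^⊤/L with L independent columns x_ℓ from the N-variate normal distribution and shows that the fluctuation around Σ shrinks as L grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_cov(Sigma, L):
    """Draw one covariance matrix from the configuration model:
    L i.i.d. columns x_l ~ N(0, Sigma), then Sigma_con = X X^T / L."""
    N = Sigma.shape[0]
    X = rng.multivariate_normal(np.zeros(N), Sigma, size=L).T  # N x L data matrix
    return X @ X.T / L

Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
small_L = sample_cov(Sigma, 50)       # noticeably fluctuates around Sigma
large_L = sample_cov(Sigma, 50000)    # close to Sigma, std ~ L^{-1/2}
```

By the Wishart variance formula, the standard deviation of each entry scales as L^{-1/2}, so the large-L sample is close to Σ while the small-L sample is not.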

D. Uniformity of samples
By maximising the entropy in terms of p(X), our configuration model does not maximise the entropy in terms of the distribution over all possible positive semidefinite matrices, which qualify as covariance matrices. Therefore, our model is biased in the space of all possible positive semidefinite matrices. However, we consider it rather realistic to formulate the maximum entropy principle in terms of p(X) because empirical covariance matrices are usually calculated from Eq. (3), where X is the raw data.

III. RESULTS

A. Data
We use five empirical correlation matrices to compare different methods.In all cases, the empirical correlation matrix, ρ org , is calculated as the Pearson correlation coefficient between pairs of multidimensional measurements or time series.
The first correlation matrix is based on psychological questionnaires with N = 30 question items. We refer to this data set as the motivation data. The questionnaires consist of three scales (i.e., inventories) of academic motivation at school. The first scale is the so-called Achievement Goal Questionnaire (18 items) [54], which assesses students' mastery goals (i.e., goals to master a task), performance-approach goals (i.e., goals to outperform others) and performance-avoidance goals (i.e., goals not to be outperformed by others) in a class.

The second scale is a shortened version (six items) of an intrinsic motivation scale used in Ref. [54], which assesses students' intrinsic motivation or enjoyment in a class. The last one is an academic self-concept scale (six items) [55], which assesses students' competence belief about a class. School children responded to these questionnaire items on a five-point Likert scale (1, strongly disagree - 5, strongly agree). The Pearson correlation coefficient between each pair of items is calculated from the responses of L = 686 persons. The correlation matrix is available as Supplementary Material.
Two correlation matrices are obtained from multivariate time series of functional magnetic resonance imaging (fMRI) signals in the brain. Each correlation matrix is derived from a human participant. The data are collected from the Human Connectome Project [56]. For each of the two participants, we extract time series at N = 264 locations whose coordinates were determined in a previous study [57]. The pairwise correlation is calculated based on fMRI time series of length L = 4,760. We refer to the data from the two participants as fMRI1 and fMRI2. Details of the preprocessing procedures are explained in Appendix A.
We also use two correlation matrices obtained from time series of the logarithmic return of daily closing prices in the Japanese and US stock markets. For the Japanese data, we use the N = 264 stocks belonging to the first section of the Tokyo Stock Exchange provided by Nikkei NEEDS [58]. We limit ourselves to the stocks that have transactions on every trading day between 12 March 1996 and 29 February 2016, yielding 4,904 trading days in total. For the US data, we obtain the N = 325 stocks from the list of the Standard & Poor's 500 index using Mathematica's FinancialData package [59]. We limit ourselves to the stocks that have transactions on every trading day between 3 January 1996 and 24 February 2017, yielding 5,324 trading days in total. For each stock, we convert the time series of the stock price into that of the logarithmic return by x^org_{i,t} = log(y^org_{i,t+1} / y^org_{i,t}), where y^org_{i,t} is the closing price of the ith stock on the tth day and x^org_{i,t} is the corresponding logarithmic return. The length of {x^org_{i,t}} is equal to L = 4,903 and L = 5,323 for the Japanese and US data sets, respectively.
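The conversion from prices to logarithmic returns can be sketched as follows (toy prices for illustration; note that the return series is one element shorter than the price series, which is why L is one less than the number of trading days):

```python
import numpy as np

def log_returns(prices):
    """Logarithmic return x_t = log(y_{t+1} / y_t); the output is one element shorter."""
    prices = np.asarray(prices, dtype=float)
    return np.log(prices[1:] / prices[:-1])

# Four toy closing prices yield three returns.
x = log_returns([100.0, 102.0, 101.0, 103.0])
```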

B. Degree and strength distributions for the empirical correlation matrices and networks
The motivation behind our configuration model is that the node strength value depends on the node. Otherwise, the previously proposed models to generate random correlation or covariance matrices [1,2,30,35] would probably suffice. Therefore, in this section we measure the distribution of the node's degree and strength in the empirical networks. To calculate the degree of each node i, denoted by k_i, we binarise the correlation matrix to create an unweighted network. For this purpose, we threshold on the pairwise correlation value to make the edge density equal to 0.15, which is an arbitrary choice. To calculate the strength of each node i, we consider weighted networks obtained without thresholding on the pairwise correlation value. For the weighted networks, we define the node strength by either (i) the sum of the off-diagonal elements of the correlation matrix, denoted by s_i; (ii) the same sum but using the absolute value of the correlation, denoted by s^abs_i; or (iii) the same sum but discarding negative correlation values, denoted by s^+_i. The survival probability (i.e., the probability that a quantity is larger than or equal to the specified value) of the degree and the three types of node strength are shown in Fig. 2 for each empirical network. As briefly mentioned in Section I, the degree and strength are to some extent heterogeneous across nodes, although the distributions are not long-tailed.

C. Distribution of eigenvalues
Random matrix theory is a useful tool to formulate null models of correlation matrices [1,2]. MacMahon and Garlaschelli proposed a null model of a correlation matrix, which we denote by ρ^MG2 [30]. Matrix ρ^MG2 preserves the eigenmodes of the input correlation matrix, ρ^org, that correspond to small eigenvalues, i.e., those contained in the spectrum of a correlation matrix constructed from N completely random time series of length L. Their other null model, which we denote by ρ^MG3, preserves the eigenmode corresponding to the largest eigenvalue of ρ^org in addition to the noisy eigenmodes used in ρ^MG2. See Appendix C for the definitions of ρ^MG2 and ρ^MG3. To relate the present configuration model to random matrix theory, we investigate the eigenvalue distribution for our configuration model in this section.
We first generate a correlation matrix, ρ^org, from N completely independent normally distributed time series of length L. With this random correlation matrix as input, we estimate the configuration model. Then, we generate a sample correlation matrix, denoted by ρ^con, from the estimated configuration model. With N = 100 and L = 200, the distribution of the eigenvalues of the original correlation matrix and that of a sample correlation matrix generated by the configuration model are shown in sky blue and red in Fig. 3(a), respectively.
The figure suggests that the two distributions are similar. Furthermore, both distributions are similar to the theoretical distribution for the completely random correlation matrix, called the Marcenko-Pastur (also called Sengupta-Mitra) distribution, given by

p(λ) = (L / 2πN) √( (λ_+ − λ)(λ − λ_−) ) / λ,  λ_− ≤ λ ≤ λ_+,    (13)

where λ_± = (1 ± √(N/L))² [1, 2, 30] (shown by the black lines in Fig. 3). The results are qualitatively the same for a larger random correlation matrix with N = 500 and L = 1,000 (Fig. 3(b)). Therefore, when random correlation matrices are input, the present configuration model behaves similarly to the existing null models ρ^MG2 and ρ^MG3.
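The Marcenko-Pastur prediction can be checked numerically. The following sketch (seed arbitrary; N and L match the smaller example above) generates a correlation matrix from independent white-noise time series and compares its eigenvalues with the support [λ_−, λ_+]:

```python
import numpy as np

rng = np.random.default_rng(2)
N, L = 100, 200

# Correlation matrix of N independent white-noise time series of length L.
X = rng.standard_normal((N, L))
rho = np.corrcoef(X)
eigs = np.linalg.eigvalsh(rho)

# Support of the Marcenko-Pastur distribution for aspect ratio N/L.
lam_minus = (1 - np.sqrt(N / L)) ** 2
lam_plus = (1 + np.sqrt(N / L)) ** 2
```

Up to finite-size fluctuations at the edges, the empirical eigenvalues fall inside [λ_−, λ_+]; the trace constraint of a correlation matrix forces the eigenvalues to sum to N.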
Then, we turn to a random correlation matrix with community structure. By adapting the benchmark models used in Ref. [30], we construct a random correlation matrix with four non-overlapping communities as follows. We set N = 500 and L = 1,000. We assume that the signal on the ith node (1 ≤ i ≤ N) at time t is a weighted sum of a global signal α(t), a node-specific noise β_i(t) and a community-specific signal γ_c(t), where c is the index of the community to which the ith node belongs; α(t), β_i(t) and γ_c(t) for each i and c are independent white-noise processes. The mode with the largest eigenvalue, which we call the dominant mode (also called the market mode in the literature [30]), corresponds to the conservation of the node's strength, as we will examine in the next section. Although the largest eigenvalue of ρ^con is different from that of the original correlation matrix, ρ^org, due to the randomness of ρ^con and possibly for other reasons, the eigenvalue distribution for ρ^con is similar to that for ρ^MG3.
Because the present configuration model is a Wishart distribution of covariance matrices, we have access to its expectation with respect to p(X), which is equal to Σ for any L. We convert Σ to the correlation matrix and denote it by ⟨ρ^con⟩, where ⟨·⟩ represents the expectation. Correlation matrix ⟨ρ^con⟩ is approximately the expectation of the sample correlation matrix, ρ^con. Note that ⟨ρ^con⟩ is equal to any sample correlation matrix, ρ^con, in the limit L → ∞. The eigenvalue distribution for ⟨ρ^con⟩ is shown by the magenta lines in Fig. 4. If the distribution followed the combination of a single dominant eigenvalue and the Marcenko-Pastur distribution, Eq. (13) suggests that the bulk part would follow the delta function located at λ = 1 because ⟨ρ^con⟩ corresponds to the limit L → ∞. However, the figure suggests that this is not the case. The eigenvalue distribution of ⟨ρ^con⟩ is composed of a noisy part with a finite width and a dominant mode.
For the motivation data (Fig. 4(a)) and the financial data (Figs. 4(d) and 4(e)), the bulk part of the eigenvalue distribution for ⟨ρ^con⟩ deviates from the Marcenko-Pastur distribution. However, it is closer to the Marcenko-Pastur distribution than the bulk part of the eigenvalue distribution for the original correlation matrix is. In addition, ⟨ρ^con⟩ has a single dominant eigenvalue that is much larger than the other eigenvalues. These observations also apply to the sample correlation matrices, ρ^con. Therefore, for the motivation and financial data, the present configuration model filters the input correlation matrix to produce a correlation matrix that is qualitatively, although not quantitatively, similar to ρ^MG3.

D. Strength of each node
In this section, we compare the strength of each node between the empirical correlation matrices and those generated by different models.
The strength of each node, defined by s_i = Σ_{j=1; j≠i}^N ρ_ij, is compared between each of the empirical correlation matrices, ρ^org, and the corresponding configuration model in Fig. 5. For all the empirical correlation matrices, ⟨ρ^con⟩ almost perfectly reproduces the strength of each node in ρ^org, corroborating the validity of our gradient descent algorithm (shown by the circles in Fig. 5). A sample correlation matrix, ρ^con, generated by the configuration model produces node strengths that fluctuate around the correct values (squares in Fig. 5). Because the standard deviation of each entry of ρ^con is proportional to L^{-1/2} (section II), the fluctuation is generally small for data with a large L value.
The eigenvalue distribution for the configuration model is characterised by a dominant mode and N − 1 eigenvalues that constitute a bulk resembling the Marcenko-Pastur distribution to different extents depending on the data (Fig. 4). To examine the relationship between the largest eigenvalue and the conservation of the node's strength, we filter the expected correlation matrix generated by the present configuration model, ⟨ρ^con⟩, by only keeping the dominant eigenmode. In other words, we calculate matrix λ_1 u^(1) u^(1)⊤, where λ_1 is the largest eigenvalue of ⟨ρ^con⟩ and u^(1) is the corresponding normalised column eigenvector. Then, we compute the node's strength for λ_1 u^(1) u^(1)⊤. It should be noted that, although λ_1 u^(1) u^(1)⊤ is not a correlation matrix because its diagonal elements are not equal to unity in general, the diagonal elements are not used in the calculation of the node's strength, such that the node's strength is well defined [30].

The node strength for λ_1 u^(1) u^(1)⊤ is plotted against that for the original correlation matrix, ρ^org, by the diamonds in Fig. 5. Despite a slight overestimation, λ_1 u^(1) u^(1)⊤ reproduces the node's strength for the original correlation matrix with high accuracy. Therefore, our configuration model roughly retains the dominant mode of the original correlation matrix to conserve the node's strength and produces the other N − 1 random modes whose eigenvalue distribution approximates the Marcenko-Pastur distribution to different extents. However, differently from a previous null model, ρ^MG3, which exactly preserves the dominant mode of the input correlation matrix, the dominant mode of the present configuration model is not the same as that of the original correlation matrix. This fact is evinced by the difference in position between the rightmost sky-blue and magenta bars in each panel of Fig. 4.
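The rank-one filtering by the dominant mode can be sketched as follows (synthetic data with a strong common factor, standing in for the market mode; sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
N, L = 50, 5000

def dominant_mode(rho):
    """Rank-one approximation lambda_1 u1 u1^T keeping only the largest eigenmode."""
    w, V = np.linalg.eigh(rho)   # eigenvalues in ascending order
    return w[-1] * np.outer(V[:, -1], V[:, -1])

def node_strength(M):
    """Row sums excluding the diagonal, so a unit diagonal is not required."""
    return M.sum(axis=1) - np.diag(M)

# Synthetic time series with a strong common factor, so that the
# largest eigenvalue is far from the bulk.
common = rng.standard_normal(L)
X = 0.8 * common + 0.6 * rng.standard_normal((N, L))
rho = np.corrcoef(X)
s_org = node_strength(rho)
s_dom = node_strength(dominant_mode(rho))
```

When the largest eigenvalue is far from the bulk, as here, the node strengths of the rank-one matrix closely track those of the full correlation matrix, which illustrates the relationship between the dominant mode and the node strength discussed above.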
To examine the relationship between the node strength and the dominant mode of the empirical correlation matrices, we calculated matrix λ_1 u^(1) u^(1)⊤ from the original correlation matrix and plotted its node strength against that of the original correlation matrix, ρ^org, by the triangles in Fig. 5. Note that this particular analysis does not involve any null model, including the present configuration model. For the financial data, the dominant mode of ρ^org explains the strength of each node with high accuracy (Figs. 5(d) and 5(e)). This is presumably because the dominant eigenvalue is much larger than the other N − 1 eigenvalues for these correlation matrices, which is a robust observation for financial time series data [1,2,30]. For the motivation and fMRI data, for which the dominant eigenvalue is not as large relative to the others as in the case of the financial data, we also find a similar agreement between the dominant mode and the node strength, albeit with lower accuracy (Figs. 5(a)-(c)).
We conclude that the dominant mode represents the sequence of node strengths if the largest eigenvalue is far from the other eigenvalues of the correlation matrix. In our numerical experiments, this condition holds true for some empirical correlation matrices and for all correlation matrices obtained from the configuration model.
Next, we examine the same relationship between the empirical correlation matrices and three other models of correlation matrices. The first is a covariance matrix generated by the H-Q-S algorithm (Appendix B), which is then converted to a correlation matrix. We denote this correlation matrix by ρ^HQS. The other two correlation matrices are derived from random matrix theory, i.e., ρ^MG2 and ρ^MG3. The strength of each node is compared between the empirical correlation matrices, ρ^org, and the three models in Fig. 6. Correlation matrix ρ^MG3 reproduces the node strength with high accuracy for the financial data (diamonds in Figs. 6(d) and 6(e)). This result is consistent with the observation that the dominant mode reproduces the node strength (Figs. 5(d) and 5(e)). We obtain qualitatively the same results for the other data sets, although the association between the empirical correlation matrix and ρ^MG3 in terms of the node strength is weaker (diamonds in Figs. 6(a)-(c)).
Correlation matrices ρ^HQS and ρ^MG2 do not produce heterogeneous distributions of the strength across different nodes (circles and squares in Fig. 6). In particular, the node strength for ρ^MG2 is close to zero for all nodes. These two models can be regarded as correlation-matrix counterparts of the Erdős-Rényi random graph for conventional networks, which does not conserve each node's degree.

E. Distribution of off-diagonal elements
The survival probability of the off-diagonal elements of the correlation matrix (i.e., the Pearson correlation values between pairs of nodes) is compared between the empirical data and the models in Fig. 7 for each data set. The expectation of the configuration model, ⟨ρ^con⟩, produces distributions of the off-diagonal elements moderately close to the empirical distributions. As expected, the sample correlation matrices, ρ^con, produce somewhat noisier distributions. Correlation matrix ρ^MG3 beats our configuration model (i.e., ⟨ρ^con⟩ and ρ^con) in approximating the empirical distribution. The H-Q-S model, ρ^HQS, also produces distributions roughly close to the empirical ones, which is consistent with previous results [35,60,61]. The distributions derived from ρ^MG2 are far from the empirical distributions.
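The survival probability of the off-diagonal elements can be computed as follows (toy matrix for illustration):

```python
import numpy as np

def offdiag(rho):
    """Off-diagonal elements of a symmetric matrix, each unordered pair counted once."""
    return rho[np.triu_indices(rho.shape[0], k=1)]

def survival_probability(values, x):
    """Fraction of values larger than or equal to x."""
    return float(np.mean(np.asarray(values) >= x))

vals = offdiag(np.array([[1.0, 0.3, -0.2],
                         [0.3, 1.0, 0.1],
                         [-0.2, 0.1, 1.0]]))
```

Evaluating `survival_probability` over a grid of x values yields the curves plotted in Fig. 7.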

F. Clustering coefficient
Clustering coefficients measure the abundance of connected triangles in networks. For conventional networks, the clustering coefficients of empirical networks are in many cases much larger than those of the configuration model [15]. For correlation matrix data, one can construct a conventional weighted network by using the Pearson correlation value as the edge weight or a conventional unweighted network by thresholding on the edge weight. In both cases, the clustering coefficient tends to be inflated by indirect paths (correlation between nodes v_1 and v_2 and that between v_1 and v_3 implies correlation between v_2 and v_3) [31,60]. The H-Q-S model was shown to mitigate the effect of indirect paths on the statistical measurement of clustering coefficients [60,61]. In this section, we compare the impact of different null models on the statistical significance of clustering coefficients in empirical correlation matrices. We use three null models: an algorithm that generates correlation matrices by assuming relatively long white-noise signals independent across different nodes, which we call the white-noise model (Appendix D); the H-Q-S model; and our configuration model. We do not use ρ^MG2 or ρ^MG3 because they are not designed to produce random samples of correlation matrices, which are necessary for calculating the statistical significance of the clustering coefficients or other indices.
Because various measurements on unweighted correlation networks depend on the threshold value [60,62,63], we use two types of clustering coefficients that do not require thresholding. The first is a weighted clustering coefficient [64], denoted by C_wei,O (Appendix E). The second, denoted by C_cor,M, is based on partial mutual information and was recently proposed by us [31] (Appendix F).
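For reference, a widely used weighted clustering coefficient is that of Onnela et al., which we assume is the definition behind C_wei,O (the exact formula used in the paper is in Appendix E). A minimal Python sketch of that definition, assuming nonnegative weights (e.g., absolute correlations):

```python
import numpy as np

def weighted_clustering_onnela(W):
    """Network-average weighted clustering coefficient in the style of
    Onnela et al.: triangle intensity (w_ij w_ih w_jh)^(1/3) with weights
    normalised by the maximum weight. Assumes nonnegative weights and a
    zero diagonal; this may differ in detail from the C_wei,O of the paper."""
    W = np.asarray(W, dtype=float)
    What = W / W.max()                    # normalise by the largest weight
    cube = np.cbrt(What)
    num = np.diag(cube @ cube @ cube)     # sum over triangles through each node
    k = (W > 0).sum(axis=1)               # binary degree
    denom = k * (k - 1)
    C = np.zeros(len(W))
    mask = denom > 0
    C[mask] = num[mask] / denom[mask]
    return C.mean()

# Sanity check: a complete triangle with unit weights has clustering coefficient 1.
c_triangle = weighted_clustering_onnela(np.ones((3, 3)) - np.eye(3))
```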
For each empirical correlation matrix and each null model, we generate 10^3 correlation matrices, calculate the clustering coefficient (i.e., C wei,O or C cor,M) for each generated matrix, and compute the sample mean and standard deviation of the clustering coefficient, denoted by μ and σ, respectively. The Z score is given by (C org − μ)/σ, where C org is the clustering coefficient of the original correlation matrix. By assuming that the clustering coefficient for the null model obeys a normal distribution, we translate the Z score into a P value based on the two-tailed test.
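The Z score and two-tailed P value described above can be computed as in the following sketch (the toy numbers are ours):

```python
import numpy as np
from math import erf, sqrt

def z_and_p(C_org, null_samples):
    """Z score of an observed statistic against null-model samples, and the
    two-tailed P value under a normal approximation."""
    mu = np.mean(null_samples)
    sigma = np.std(null_samples, ddof=1)     # sample standard deviation
    z = (C_org - mu) / sigma
    p = 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))  # two-tailed normal P value
    return z, p

# toy: observed value 0.6 against a null ensemble centred at 0.5
rng = np.random.default_rng(0)
null = 0.5 + 0.01 * rng.standard_normal(1000)
z, p = z_and_p(0.6, null)
```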
For the five empirical correlation matrices, the values of the clustering coefficients and the statistical results are shown in Table I. Both C wei,O and C cor,M for all the empirical networks are significantly larger than the values for the white-noise null model. This result is consistent with the common knowledge that many empirical networks have high clustering [15], including weighted networks [65]. However, the same does not hold true for the other two null models. Relative to the present configuration model, ρ con, clustering coefficient C wei,O is significantly smaller for all five empirical correlation matrices. In contrast, C cor,M for all the empirical correlation matrices is larger than that for ρ con, although the difference is insignificant in one case (the Japanese stock market data). With the H-Q-S null model, the results vary across both the empirical correlation matrices and the two types of clustering coefficient.
In sum, the empirical correlation matrices do not necessarily show high clustering coefficients when the H-Q-S model or the present configuration model is used as the reference.
In addition, the selection of the null model (i.e., the H-Q-S versus configuration model) may even qualitatively change the statistical results.

G. Community detection
Various conventional networks are organised into communities, i.e., sets of nodes such that edges are dense within a community and relatively sparse across different communities [66]. In this section, we apply our configuration model to community detection in correlation matrices. A naive application of community detection algorithms designed for conventional weighted networks to correlation matrix data would yield biased results. This observation led to the development of community-detection algorithms tailored to correlation matrices with appropriate null models [30]. We compare community detection when the null model is either ρ con, the expectation of ρ HQS, denoted by ⟨ρ HQS⟩, ρ MG2, ρ MG3, or the identity matrix, denoted by ρ MG1. All the off-diagonal values of ⟨ρ HQS⟩ are equal (Appendix B).
Correlation matrix ρ MG1 assumes the absence of correlation between any pair of nodes.
Note that ρ MG1 , ρ MG2 and ρ MG3 have been used for community detection in correlation matrices [30].
We maximise the modularity given by [30]

Q = (1/C norm) Σ_{i,j=1}^{N} (ρ_ij − ρ̃_ij) δ(g_i, g_j),    (14)

where C norm = Σ_{i,j=1}^{N} ρ_ij is a normalisation constant, ρ̃ = (ρ̃_ij) is the null model of the correlation matrix relative to which community structure is detected, δ is the Kronecker delta, and g_i is the community to which node i belongs. We use the Louvain algorithm [67] to maximise Q.
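The modularity of a given partition, Q = (1/C norm) Σ_{i,j} (ρ_ij − ρ̃_ij) δ(g_i, g_j), can be evaluated as in this sketch; the toy matrices and labels are ours, and maximisation over partitions (e.g., by the Louvain algorithm) is not shown.

```python
import numpy as np

def correlation_modularity(rho, rho_null, g):
    """Modularity of partition g for correlation matrix rho against a null
    matrix rho_null: Q = (1/C_norm) * sum_{ij} (rho_ij - rho_null_ij) * [g_i == g_j]."""
    C_norm = rho.sum()
    same = (g[:, None] == g[None, :])        # Kronecker delta on community labels
    return ((rho - rho_null) * same).sum() / C_norm

rho = np.array([[1.0, 0.6, 0.0],
                [0.6, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
rho_null = np.full((3, 3), 0.2)              # a flat toy null model
g = np.array([0, 0, 1])                      # nodes 0 and 1 in one community
Q = correlation_modularity(rho, rho_null, g)
```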
To assess the statistical significance of the detected community structure, we maximise Q for randomised correlation matrices as well as for the given correlation matrix. When the null model is our configuration model, we generate random samples ρ con to calculate the Z score and the P value. When the null model is ⟨ρ HQS⟩, we generate random samples ρ HQS from the H-Q-S model. Because ρ MG1, ρ MG2 and ρ MG3 are null models that do not generate sample correlation matrices, we also use random samples from the H-Q-S model (i.e., ρ HQS) for these null models. In each case, we generate 10^3 random correlation matrices to calculate the Z score and the P value.
We start by using ρ con as the input correlation matrix rather than as the null model.
Correlation matrix ρ con is considered to lack community structure because it is maximally random in terms of entropy under the constraint on the strength of each node. Because the modularity value would be trivially insignificant if ρ con were used as the null model, we maximised the modularity with the other four null models, i.e., ⟨ρ HQS⟩, ρ MG1, ρ MG2 and ρ MG3. The optimised Q values and the statistical results for the different empirical networks are shown in Table II. The modularity with the H-Q-S null model detects significant community structure in the configuration-model correlation matrix (i.e., ρ con) for all the data sets. Similarly, the modularity with the ρ MG1 null model yields significant community structure in the two cases with N = 264. Therefore, these two null models are not suitable for community detection. In contrast, modularity with the ρ MG2 or ρ MG3 null model does not find significant community structure, except for ρ MG2 with the motivation data, which is a small data set (N = 30). Therefore, ρ MG2 and ρ MG3 seem to be reasonable null models for community detection [30].
Given these results, we focus on community structure of the empirical correlation matrices obtained by maximising Q combined with the ρ con, ρ MG2 or ρ MG3 null model. The maximised modularity values and statistical results for the empirical data are shown in Table III. The maximised modularity is insignificant for all five empirical correlation matrices when the null model is ρ MG3. The modularity is significant for all but the fMRI1 data when the null model is ρ MG2. It should be noted that, with the combination of ρ MG2 and either the Japanese or US stock data, the modularity value is almost equal to 1 for any randomised correlation matrix. This is because the magnitudes of the eigenvalues whose corresponding eigenmodes are preserved in ρ MG2 are much smaller than the dominant eigenvalue. Consequently, ρ MG2 is approximately a zero matrix, which makes the second term on the right-hand side of Eq. (14) negligible. Furthermore, modularity maximisation detects only a single community (i.e., no partition into different communities), which makes the summation on the right-hand side of Eq. (14) almost equal to C norm, yielding Q ≈ 1.
With the configuration null model, the modularity is significant in all cases, presumably because the value of L is large, such that fluctuations of the modularity for samples generated by the estimated configuration model are small. Because the motivation data set is small and the modularity values for the original correlation matrices are small for the stock market data, we focus on the fMRI data in the remainder of this section. In the present fMRI data, each node is assigned a biologically determined label representing the estimated functions of the node [57]. The relationship between the detected community structure and the biological label of each node is shown in Fig. 8.
To assess the extent to which the detected communities are consistent with the biological labels of the nodes, we compute the probability that two nodes with the same label belong to the same community. We denote this probability by P emp. Because P emp would be large when there are only a small number of communities, we normalise P emp by the probability in the case of a completely random assignment of the nodes to communities, which we denote by P rand. We obtain

P rand = Σ_{c=1}^{n comm} N_c (N_c − 1) / [N (N − 1)],

where n comm is the number of communities and N_c is the number of nodes in the cth community. The values of P emp − P rand and P emp/P rand for the two fMRI data sets and the three null models are shown in the top half of Table IV. For both normalised measures of the consistency between the nodal labels and community structure, i.e., P emp − P rand and P emp/P rand, the configuration null model realises a larger value than the ρ MG2 and ρ MG3 null models do. The results remain the same when the nodes having the label "Uncertain" are removed before P emp and P rand are calculated (the bottom half of Table IV). We conclude that, for the present data set, our configuration model produces community structure that is more consistent with the biological labels than the ρ MG2 and ρ MG3 null models do. However, a visual inspection of Fig. 8 suggests that the ρ MG2 null model realises community structure that is more consistent between the two participants than the other two null models do (Figs. 8(b) and 8(e) as compared to Figs. 8(a), 8(c), 8(d) and 8(f)). To examine this point, we measure the Jaccard index between the community structure detected for fMRI1 and that for fMRI2. The Jaccard index is defined by

J = Σ_{i<j} δ(g_i^(1), g_j^(1)) δ(g_i^(2), g_j^(2)) / Σ_{i<j} [δ(g_i^(1), g_j^(1)) + δ(g_i^(2), g_j^(2)) − δ(g_i^(1), g_j^(1)) δ(g_i^(2), g_j^(2))],

where g_i^(1) and g_i^(2) are the communities to which node i belongs in the fMRI1 and fMRI2 data, respectively. We have found that the Jaccard index is larger (therefore, the two community structures are more similar) for the ρ MG2 null model (= 0.567) than for the ρ con (= 0.313) and ρ MG3 (= 0.418) null models.
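The label-consistency probability and the pair-counting Jaccard index between two partitions can be sketched as follows (the toy labels and partitions are ours):

```python
import numpy as np

def same_pair_prob(labels, communities):
    """Probability that two nodes sharing a label fall in the same community."""
    n = len(labels)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)
             if labels[i] == labels[j]]
    return np.mean([communities[i] == communities[j] for i, j in pairs])

def jaccard_partitions(g1, g2):
    """Pair-counting Jaccard index: pairs together in both partitions over
    pairs together in at least one partition."""
    n = len(g1)
    a = b = 0
    for i in range(n):
        for j in range(i + 1, n):
            s1 = g1[i] == g1[j]
            s2 = g2[i] == g2[j]
            a += s1 and s2      # same community in both partitions
            b += s1 or s2       # same community in either partition
    return a / b

g1 = np.array([0, 0, 1, 1])
g2 = np.array([0, 0, 0, 1])
labels = np.array([0, 0, 0, 1])
P_emp = same_pair_prob(labels, g1)
J = jaccard_partitions(g1, g2)
```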

IV. DISCUSSION
We proposed a configuration model for correlation matrices that preserves the expected strength of each node. We illustrated applications of the present model with clustering coefficients and community detection. Being a configuration model, the present model will find applications in measurements and algorithms for correlation and covariance matrices in which comparison between the original matrix and reference matrices (i.e., null models) is important. Judging from similar situations for conventional networks, we expect applications of the present model in, for example, other algorithms for community detection [66], network motifs [68] and detection of core-periphery structure [69].
A correlation matrix can be regarded as a weighted network. Several configuration models, including those based on the maximum entropy principle, have been proposed for weighted networks [42-44, 46-48, 70, 71]. However, differently from the present configuration model, these previous models do not conserve positive semidefiniteness, which any correlation or covariance matrix must satisfy. In addition, our configuration model allows negative entries, whereas the previous models exclude negative edge weights; correlation and covariance matrices generally have negative entries. Therefore, the maximum entropy models for conventional weighted networks [42][43][44][46][47][48] and our model are different although they share the maximum entropy principle.
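Positive semidefiniteness can be verified numerically. The following sketch (ours) flags a matrix whose entries all look like valid pairwise correlations but which is nevertheless not a valid correlation matrix:

```python
import numpy as np

def is_positive_semidefinite(M, tol=1e-10):
    """Check symmetry and positive semidefiniteness via the eigenvalue spectrum."""
    if not np.allclose(M, M.T):
        return False
    return np.linalg.eigvalsh(M).min() >= -tol

rho_ok = np.array([[1.0, -0.3],
                   [-0.3, 1.0]])             # valid: eigenvalues 0.7 and 1.3
rho_bad = np.array([[1.0, 0.9, -0.9],
                    [0.9, 1.0, 0.9],
                    [-0.9, 0.9, 1.0]])       # entries in [-1, 1] but not PSD
```

For `rho_bad`, the quadratic form with the vector (1, −1, 1) is negative, so no set of random variables can realise these pairwise correlations simultaneously.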
As a separate issue, constructing a weighted configuration model for conventional weighted networks is inherently difficult because of structural constraints imposed by the topology of the corresponding unweighted network [72]. With our configuration model, we avoid this difficulty by not imposing an unweighted network topology on the estimated correlation or covariance matrix.
Correlation matrix ρ MG3, derived from random matrix theory [30], is similar to the present configuration model in the sense that ρ MG3 fairly accurately reproduced the node strength for the motivation data and the financial data (Figs. 6(a), 6(d) and 6(e)). It also explained the node strength for the fMRI data, albeit to a lesser extent (Figs. 6(b) and 6(c)). This is because ρ MG3 preserves the dominant eigenmodes of the original matrix by definition, and the dominant eigenmode is strongly correlated with the node strength (Fig. 5). In conventional unweighted networks, the eigenvector corresponding to the largest eigenvalue, called the eigenvector centrality [73], is often correlated with the node's degree [15,74]. Also from this point of view, it is natural that the dominant mode approximately conserves the node strength in various correlation matrices. Another similarity between the present configuration model and ρ MG3 is in the eigenvalue distribution (Figs. 3 and 4).
The configuration model also roughly preserves the dominant eigenmode. Unlike ρ MG3, the configuration model does not perfectly preserve the bulk of the eigenvalue distribution corresponding to that of random correlation matrices (i.e., eigenvalues falling in the support of the Marcenko-Pastur distribution). Nevertheless, the configuration model shifts the bulk part of the eigenvalue distribution closer to the Marcenko-Pastur distribution. A difference between the present configuration model and ρ MG3 is that the former generally preserves the rank of the given correlation matrix, whereas the latter has a smaller rank owing to the elimination of some eigenmodes. Another difference is that ρ MG3 requires the length of the data (e.g., time series), L, based on which the correlation matrix is calculated, whereas the configuration model does not. The configuration model does need L to produce sample correlation matrices (i.e., ρ con). However, it can also be used in another mode, i.e., as the expectation of the produced correlation matrices (i.e., ⟨ρ con⟩). In fact, we used ⟨ρ con⟩ for community detection (Section III G). This usage does not require the L value.
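The support of the Marcenko-Pastur distribution used in these comparisons is [λ−, λ+] = [(1 − sqrt(N/L))², (1 + sqrt(N/L))²] for N ≤ L. A sketch (ours) of computing the bounds and checking how much of a pure-noise spectrum falls inside the bulk:

```python
import numpy as np

def marchenko_pastur_bounds(N, L):
    """Support [lambda_minus, lambda_plus] of the Marcenko-Pastur law for an
    N-variable correlation matrix estimated from L samples (N <= L)."""
    q = N / L
    return (1 - np.sqrt(q)) ** 2, (1 + np.sqrt(q)) ** 2

rng = np.random.default_rng(0)
N, L = 100, 1000
rho = np.corrcoef(rng.standard_normal((N, L)))   # pure-noise correlation matrix
lam_minus, lam_plus = marchenko_pastur_bounds(N, L)
evals = np.linalg.eigvalsh(rho)
frac_in_bulk = np.mean((evals >= lam_minus) & (evals <= lam_plus))
```

Eigenvalues of an empirical matrix lying well above `lam_plus` are the candidates for genuine (non-noise) modes.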
In sum, both ρ MG3 and the present configuration model can be regarded as configuration models for correlation matrices.They provide different methods to filter noise in correlation matrices.In contrast, the H-Q-S algorithm and ρ MG2 disregard the node's heterogeneity.Therefore, they are regarded as counterparts of the Erdős-Rényi random graph for correlation matrices.
An important limitation of the proposed algorithm is scalability. The gradient descent algorithm used in the present paper is slow because, in our experience, we have to make the learning rate (i.e., ε in Eqs. (11) and (12)) small for the algorithm to converge. Therefore, the largest correlation matrix that we used in the present paper was of size N = 500.
Alternatively, one can formulate a multidimensional root-finding problem with unknowns α_i and β_i (1 ≤ i ≤ N) (Appendix G). However, we could not find the roots, which may be because of strong nonlinearity inherent in the set of equations. The corresponding optimisation problem does not seem to be convex. The entropy probably has a rough landscape as a function of α_i and β_i. Despite our numerical efforts, we found that the landscape was even rougher when we fed covariance matrices rather than correlation matrices to our algorithm. Understanding this issue and devising more efficient algorithms are left for future work.
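For reference, a generic multidimensional Newton-Raphson iteration of the kind used above looks as follows. The toy two-variable system is ours and is far better behaved than the actual equations of Appendix G.

```python
import numpy as np

def newton_raphson(F, J, x0, tol=1e-12, max_iter=100):
    """Multidimensional Newton-Raphson: solve F(x) = 0 given the Jacobian J."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        fx = F(x)
        if np.linalg.norm(fx) < tol:
            break
        x = x - np.linalg.solve(J(x), fx)    # Newton step: solve J(x) dx = F(x)
    return x

# toy system: x^2 + y^2 = 4 and x*y = 1
F = lambda v: np.array([v[0] ** 2 + v[1] ** 2 - 4.0, v[0] * v[1] - 1.0])
J = lambda v: np.array([[2 * v[0], 2 * v[1]],
                        [v[1], v[0]]])
root = newton_raphson(F, J, x0=[2.0, 0.3])
```

When the landscape is rugged, as reported above, such iterations stall or diverge unless started very close to a root, which motivates the gradient descent approach of the main text.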
simplified to [77][78][79]

h = (d/2) ln(2πe) + (1/2) ln det Σ′.    (F2)

In Eq. (F2), d is the number of random variables and Σ′ = (Σ′_ij) is the d × d covariance matrix derived from Xα_1, ..., Xα_d, i.e., Σ′_ij = ⟨Xα_i Xα_j⟩, where we recall that ⟨•⟩ represents the expectation. By substituting Eq. (F2) in Eq. (F1) and feeding the correlation matrix as a covariance matrix to Eq. (F2), one obtains the mutual information in terms of the correlation matrix. We define the local clustering coefficient at node i using these mutual information values; the denominator ensures that C cor,M_i ranges between zero and one. The global clustering coefficient, denoted by C cor,M, is given by the average of C cor,M_i over the N nodes.

Appendix G: Parameter estimation by root finding

We present a procedure to calculate the precision matrix that maximises the entropy of p(X) while respecting the constraints of Eq. (G1), where 1 ≤ i ≤ N.
Equation (G1) yields Eqs. (G6) and (G7). The (i, i) cofactor of Σ−1 is given by Eq. (G8). By combining Eqs. (G6), (G7) and (G8), one obtains the functions f_i and g_i, where v_i ≡ Σ org_ii and 1 ≤ i ≤ N. Given u_i and v_i (1 ≤ i ≤ N), the functions f_i and g_i (1 ≤ i ≤ N) define a system of 2N nonlinear equations for 2N unknowns, α_i and β_i (1 ≤ i ≤ N). We attempted to solve it using a MATLAB built-in function for root finding and an in-house implementation of the Newton-Raphson method. However, neither method could find the root, presumably because of the rugged landscape of f_i and g_i as functions of α_i and β_i.

To generate a correlation matrix with planted community structure, the signal of node i in community c combines α(t), β_i(t) and γ_c(t), which are independent normal variables with mean zero and standard deviation 1. Signal α(t) represents the global signal, β_i(t) represents local noise, γ_c(t) corresponds to the signal for each community, µ represents the strength of the global signal, and ν represents the strength of the local noise. We set µ = 0.4 and ν = 0.8 and assume that c = 1 for 1 ≤ i ≤ 50, c = 2 for 51 ≤ i ≤ 150, c = 3 for 151 ≤ i ≤ 300 and c = 4 for 301 ≤ i ≤ 500, thus generating four communities of size 50, 100, 150 and 200. The distribution of eigenvalues for a sample correlation matrix with four communities is shown in sky blue in Fig. 3(c). The distribution is composed of a bulk of eigenvalues and four large eigenvalues that do not belong to the bulk. The bulk part of the distribution does not resemble the Marcenko-Pastur distribution, whereas the bulk eigenvalues are not considerably larger than λ+, i.e., the largest value in the support of the Marcenko-Pastur distribution. The four largest eigenvalues correspond to the four planted communities. The eigenvalue distribution for a sample correlation matrix generated by the estimated configuration model is shown by the red lines in Fig. 3(c). It consists of a bulk part and a single large eigenvalue. The bulk part deviates from the Marcenko-Pastur distribution. However, it is closer to the Marcenko-Pastur distribution than the bulk part of the eigenvalue distribution for the original
correlation matrix with four communities (shown in sky blue in Fig. 3(c)) is. Note that the three additional eigenmodes corresponding to the communities are filtered out by the configuration model (shown in red in Fig. 3(c)). The present configuration model is expected to be suitable as a null model for community detection (Section III G) because the model filters out the singular eigenmodes encoding the community structure. Next, for the five empirical correlation matrices, we compared the distribution of eigenvalues between the original correlation matrix and a sample correlation matrix generated by the estimated configuration model. The results are shown in Fig. 4. The figures suggest that, for the fMRI data, the configuration model produces a distribution of eigenvalues that is almost the same as the Marcenko-Pastur distribution except for one eigenmode whose eigenvalue is much larger than λ+ (red lines in Figs. 4(b) and 4(c)).
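A sketch of generating such a correlation matrix with planted communities, assuming (our reading of the description above) that the signal of node i in community c is x_i(t) = µ α(t) + ν β_i(t) + γ_c(t):

```python
import numpy as np

def community_signals(sizes, L, mu=0.4, nu=0.8, rng=None):
    """Signals with planted communities: node i in community c receives
    x_i(t) = mu*alpha(t) + nu*beta_i(t) + gamma_c(t)  (assumed signal model)."""
    rng = np.random.default_rng(rng)
    N = sum(sizes)
    alpha = rng.standard_normal(L)                    # global signal
    gammas = [rng.standard_normal(L) for _ in sizes]  # one signal per community
    X = np.empty((N, L))
    i = 0
    for gamma, n in zip(gammas, sizes):
        for _ in range(n):
            X[i] = mu * alpha + nu * rng.standard_normal(L) + gamma
            i += 1
    return X

X = community_signals([50, 100, 150, 200], L=2000, rng=0)
rho = np.corrcoef(X)
evals = np.sort(np.linalg.eigvalsh(rho))[::-1]        # descending eigenvalues
```

With these parameters the four largest eigenvalues separate clearly from the bulk, one per planted community, matching the qualitative picture of Fig. 3(c).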

FIG. 1. Flow of the algorithm for generating random correlation matrices. If the input is a covariance matrix, we first transform it into the correlation matrix and feed it to the configuration model. If the input is a correlation matrix, we directly feed it to the configuration model. Because the output of the configuration model is a random covariance matrix (or samples generated from it), we transform it into the correlation matrix, which is the final output.
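The covariance-to-correlation transformation used in this flow is ρ_ij = Σ_ij / sqrt(Σ_ii Σ_jj); a minimal sketch with a toy matrix of ours:

```python
import numpy as np

def cov_to_corr(Sigma):
    """Transform a covariance matrix into the corresponding correlation matrix:
    rho_ij = Sigma_ij / sqrt(Sigma_ii * Sigma_jj)."""
    d = np.sqrt(np.diag(Sigma))
    return Sigma / np.outer(d, d)

Sigma = np.array([[4.0, 1.2],
                  [1.2, 9.0]])
rho = cov_to_corr(Sigma)
```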

FIG. 5. Comparison of the node strength between the original correlation matrix, the configuration model and the correlation matrices constructed from the dominant mode. (a) Motivation. (b) fMRI1. (c) fMRI2. (d) Japanese stocks. (e) US stocks.

TABLE I.
Clustering coefficients. The clustering coefficient is denoted by C. For each type of randomised correlation matrix, the average and standard deviation based on 10^3 samples are shown.

TABLE II.
Community detection when the configuration null model, ρ con, is given as input. For each null model, the mean and standard deviation of the modularity values based on 10^3 samples of randomised networks are shown.

TABLE III.
Community detection for empirical correlation matrices. For each null model, the mean and standard deviation of the modularity values based on 10^3 samples of randomised networks are shown.

TABLE IV.
Consistency between the biological label of the nodes and the detected communities for the fMRI data.