Weakly-correlated synapses promote dimension reduction in deep neural networks

By controlling synaptic and neural correlations, deep learning has achieved empirical success in improving classification performance. How synaptic correlations affect neural correlations to produce disentangled hidden representations remains elusive. Here we propose a simplified model of dimension reduction, taking into account pairwise correlations among synapses, to reveal the mechanism by which synaptic correlations affect dimension reduction. Our theory determines the scaling form of the synaptic correlation, for both binary and continuous synapses, requiring only mathematical self-consistency. The theory also predicts that weakly-correlated synapses encourage dimension reduction compared to their orthogonal counterparts. In addition, these synapses slow down the decorrelation process along the network depth. These two computational roles are explained by the proposed mean-field equation. The theoretical predictions are in excellent agreement with numerical simulations, and the key features are also captured by deep networks trained with Hebbian learning rules.

Introduction.-Neural correlation is a common characteristic of most neural computations [1], playing vital roles in stimulus coding [2,3], information storage [4] and various cognitive tasks that can be implemented by recurrent neural networks [5,6]. Neural correlation was recently shown by a mean-field theory [7] to be able to manipulate the dimensionality of layered representations in deep computations, which was empirically revealed to be a fundamental process in deep artificial neural networks [8]. This theory demonstrates that a compact neural representation with weak neural correlations is an emergent behavior of layered neural networks whose synaptic weights are independently and identically distributed. However, in real cortical circuits, synaptic weights among neurons, even within the same layer, may not be ideally independent of each other [4,9-11], perhaps mainly due to biological synaptic plasticity [4,11]. On the other hand, a recent theoretical study of unsupervised feature learning predicts that weakly-correlated synapses promote unsupervised concept formation by reducing the number of sensory data samples required [12,13]. Therefore, exactly how a weak correlation among synapses affects the emergent behavior of layered neural networks remains unknown.
In particular, suppression of unwanted variability in sensory inputs [14], compact representation of invariance (generalization to novel contexts) [15], and feature selectivity (the discrimination ability of a neural representation) [16] may be closely related to an appropriate neural dimensionality, which keeps only the relevant informative components spanning a robust neural subspace [6]. Therefore, clarifying how synaptic correlations affect neural correlations, and thereby the neural process underlying the transformation of representation dimensionality, becomes fundamentally important.
To reveal the mechanism by which synaptic correlations affect compact neural representations, we consider deep neural networks that realize a layer-wise transformation of sensory inputs. All in-coming synapses to a hidden neuron form the receptive field (RF) of that hidden neuron. The correlation among synapses is modeled by the inter-RF correlation (Fig. 1). We do not need prior knowledge of the synaptic correlation strength. In fact, our mean-field theory yields different scaling behaviors of the synaptic correlation with respect to the number of neurons at each layer, for both binary and continuous synaptic weights. These scaling behaviors follow solely from the requirement of a mathematically well-defined dimensionality.
Compared to the previous work of orthogonal weights [7], our current model reveals richer ways of controlling dimensionality of neural representations, i.e., the dimensionality can be tuned layer by layer in both additive and multiplicative manners. Each manner can be analytically understood in the mean-field limit. The theory predicts that, according to the scaling, the weak correlation among inter-RF synapses is able to promote dimension reduction in deep neural networks. In addition, these synapses slow down the neural decorrelation process along the network hierarchy. Therefore, our theory provides deep insights towards understanding how synaptic correlations shape compact neural representations in deep neural networks.
Model.-A deep neural network is composed of multiple layers of non-linear transformations of sensory inputs (e.g., natural images). The depth of the network is defined as the number of hidden layers ($d$), and the width of each layer is defined by the number of neurons at that layer. For simplicity, we assume an equal width ($N$) in this study. To specify the weights between the $(l-1)$-th and $l$-th layers, we define a weight matrix $w^l$ whose $i$-th row collects the incoming connections to neuron $i$ at the higher layer (the so-called receptive field of neuron $i$). Firing biases of neurons at the $l$-th layer are denoted by $b^l$. The input data are transformed through consecutive hidden representations denoted by $h^l$ ($l = 1, \cdots, d$), in which each entry $h^l_i$ defines a non-linear transformation of its pre-activation, $h^l_i = \phi\bigl(\sum_j w^l_{ij} h^{l-1}_j + b^l_i\bigr)$. Without loss of generality, we use the non-linear transfer function $\phi(x) = \tanh(x)$.

FIG. 1. During the transformation, a cascade of internal representations ($\{h^l\}$) is generated by the correlated synapses, with the covariance structure specified by the matrix above each layer. $g$ characterizes the variance of synaptic weights, the diagonal blocks characterize the inter-receptive-field correlation among corresponding synapses (different line colors), and $q$ specifies the synaptic correlation strength. We do not know a priori the exact scaling form of $q$, which is self-consistently determined by our theory.
To take into account inter-RF correlations, we specify the covariance structure as $\langle w^l_{ik} w^l_{jk}\rangle = q$ for $i \neq j$, and $\langle (w^l_{ik})^2\rangle = g^2/N$ for continuous weights, while $(w^l_{ik})^2 = 1$ for binary weights ($\pm 1$) (see Fig. 1). The weights have zero mean. The bias follows $\mathcal{N}(0, \sigma_b)$. The orthogonal-RF case was studied in a recent work [7] that demonstrates mechanisms of dimension reduction in deep neural networks of continuous weights. Here, we extend the analysis to the non-orthogonal case, and do not enforce any scaling constraint a priori on the correlation level $q$. The role of $q$ in dimension reduction is determined in a self-consistent way. Note that, for simplicity, we do not assume any prescribed correlation among intra-RF weights. In the current setting, the weight matrix $w^l$ is more highly structured than in the orthogonal case, qualitatively resembling what occurs in a biological neural circuit [10,11].
To define the computational task, we consider a random input ensemble from which each input sample is drawn. This ensemble is characterized by zero mean and the covariance matrix $\Lambda = \frac{1}{N}\xi\xi^{T}$, where $\xi$ is an $N \times P$ matrix whose components follow a normal distribution of zero mean and variance $\sigma^2$ ($\sigma = 0.5$ throughout the paper). The ratio $\alpha = P/N$ controls the spectral density of the covariance matrix [17].
We first analyze the continuous-weight case. Considering an average over the input ensemble, we define the mean-subtracted weighted-sum as $a^l_i = \sum_j w^l_{ij}\bigl(h^{l-1}_j - \langle h^{l-1}_j\rangle\bigr)$, and thus $a^l_i$ has zero mean. It follows that the covariance of $a^l$ can be written as $\Delta^l_{ij} = \langle a^l_i a^l_j\rangle = \bigl[w^l C^{l-1} (w^l)^T\bigr]_{ij}$, where $C^{l-1}$ defines the covariance (two-point correlation) matrix of the hidden representation at the $(l-1)$-th layer. The deep network defined in Fig. 1 implies that each neuron at an intermediate layer receives a large number of nearly independent input contributions. Therefore, the central limit theorem suggests that the mean of the hidden neural activity $m^l$ and the covariance $C^l$ are given respectively by
\[
m^l_i = \int Dt\,\phi\bigl(\sqrt{\Delta^l_{ii}}\,t + z^0_i\bigr),\qquad
C^l_{ij} = \int Dx\,Dy\,\phi\bigl(\sqrt{\Delta^l_{ii}}\,x + z^0_i\bigr)\,\phi\Bigl(\sqrt{\Delta^l_{jj}}\bigl[\psi x + \sqrt{1-\psi^2}\,y\bigr] + z^0_j\Bigr) - m^l_i m^l_j, \tag{1}
\]
where $z^0_i = b^l_i + [w^l m^{l-1}]_i$, $Dt = e^{-t^2/2}dt/\sqrt{2\pi}$, and $\psi = \Delta^l_{ij}/\sqrt{\Delta^l_{ii}\Delta^l_{jj}}$. In Eq. (1), we re-parametrize $a^l_i$ and $a^l_j$ by independent standard normal random variables, such that the covariance structure of the mean-subtracted pre-activations is satisfied [17]. In physics, Eq. (1) constructs an iterative mean-field equation across layers to describe the transformation of the activity statistics in the deep neural hierarchy.
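As a rough numerical check of the moment propagation in Eq. (1), the Gaussian integrals can be estimated by sampling the pre-activations directly. The sketch below is a minimal illustration; the function name `propagate_layer`, the sample size, and the jitter added before the Cholesky factorization are our own choices rather than part of the theory.

```python
import numpy as np

def propagate_layer(m_prev, C_prev, w, b, n_samples=20000, rng=None):
    """One step of the mean-field iteration (Eq. (1)), estimated by sampling.

    m_prev : (N,) mean activity of layer l-1
    C_prev : (N, N) covariance of layer l-1
    w, b   : weights (N, N) and biases (N,) of layer l
    Returns the mean m_l and covariance C_l of layer l.
    """
    rng = np.random.default_rng() if rng is None else rng
    z0 = w @ m_prev + b                       # mean pre-activation z^0
    Delta = w @ C_prev @ w.T                  # covariance of the mean-subtracted pre-activation
    # sample Gaussian fluctuations a ~ N(0, Delta); small jitter keeps Cholesky stable
    L = np.linalg.cholesky(Delta + 1e-10 * np.eye(len(z0)))
    a = rng.standard_normal((n_samples, len(z0))) @ L.T
    h = np.tanh(a + z0)                       # phi(x) = tanh(x)
    m_l = h.mean(axis=0)
    C_l = np.cov(h, rowvar=False)
    return m_l, C_l
```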
In statistical physics, the macroscopic behavior of a complex system of many degrees of freedom can be described by a few order parameters. Following the same spirit, we define a linear dimensionality of the hidden representation at each layer as
\[
D(C^l) = \frac{\bigl(\sum_i \lambda_i\bigr)^2}{\sum_i \lambda_i^2}, \tag{2}
\]
where $\{\lambda_i\}$ is the eigen-spectrum of the covariance matrix $C^l$. In statistics, this measure is called the participation ratio [3], which identifies the number of significantly non-zero eigenvalues. These eigenvalues capture the dominant dimensions that explain the variability of the neural representation. Using the mathematical identities $\mathrm{tr}(C) = \sum_i \lambda_i$ and $\mathrm{tr}(C^2) = \sum_i \lambda_i^2$, one can derive that the normalized dimensionality $\tilde{D}^l \equiv D(C^l)/N = (\mathrm{tr}\,C^l)^2/[N\,\mathrm{tr}(C^l)^2]$ is expressed through the order parameters $K^l_1 = \frac{1}{N}\sum_i C^l_{ii}$, $K^l_2 = \frac{1}{N}\sum_i (C^l_{ii})^2$, and the mean-square off-diagonal correlation $\Sigma^l$ [17]. In a mean-field approximation [18], $C^l_{ij} \sim O(1/\sqrt{N})$ for $i \neq j$, which implies that $\Delta^l_{ij}$ is of the same order $O(1/\sqrt{N})$. Therefore, $C^l_{ij}$ can be expanded in terms of $\Delta^l_{ij}$, resulting in $C^l_{ij} \simeq K^l_{ij}\Delta^l_{ij}$ to leading order [17]. Here, $K^l_{ij} = \int Dx\,\phi'\bigl(\sqrt{\Delta^l_{ii}}\,x + z^0_i\bigr)\int Dy\,\phi'\bigl(\sqrt{\Delta^l_{jj}}\,y + z^0_j\bigr)$, where the means $z^0_{i,j} = b^l_{i,j} + [w^l m^{l-1}]_{i,j}$. For the binary-weight case, the pre-activation should be multiplied by a pre-factor $g/\sqrt{N}$ to ensure that the pre-activation is of order one. By using the above expansion, one gets the relationship between $\Sigma^{l+1}$ and $\Sigma^l$ as
\[
\Sigma^{l+1} = \kappa\, g^4\Bigl[\Sigma^l + K^l_2/N + q^2 (K^l_1)^2\Bigr], \tag{3}
\]
where the overline in $\kappa \equiv \overline{(K^{l+1}_{ij})^2}$ means the disorder average over the quenched network parameters [17]. Clearly, $q$ cannot be of order one, otherwise Eq. (3) is not self-consistent in mathematics, in that $N\Sigma \sim O(1)$ because of the magnitude of $C_{ij}$. A unique scaling for $q$ must then be $q = r/\sqrt{N}$, resulting in $q^2 N = r^2$ where $r \sim O(1)$, and thus Eq. (3) is self-consistent in physics as well. We thus call this type of synapses the weakly-correlated synapses. Inserting $\Sigma^{l+1}$ into the dimensionality definition, together with the linear approximation $C^{l+1}_{ii} \simeq K^{l+1}_{ii}\Delta^{l+1}_{ii}$ with $\Delta^{l+1}_{ii} = g^2 K^l_1$ [17], one immediately obtains the dimensionality of the hidden representation at the $(l+1)$-th layer, in terms of the activity statistics from the previous layer,
\[
\tilde{D}^{l+1} = \frac{(K^l_1)^2}{\gamma_1\bigl(K^l_2 + N\Sigma^l\bigr) + \gamma_2 (K^l_1)^2 + \gamma_1 r^2 (K^l_1)^2}, \tag{4}
\]
where $\gamma_1 = \overline{K_{ij}^2}/\overline{K_{ii}}^2$ and $\gamma_2 = \overline{K_{ii}^2}/\overline{K_{ii}}^2$. Using the Cauchy-Schwarz inequality, one can prove that $\gamma_1 \leq \gamma_2$ [17]. It is also clear that $\gamma_2 \geq 1$.
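For concreteness, the participation ratio of Eq. (2) and its normalized version can be evaluated directly from the eigen-spectrum of any covariance matrix; a minimal sketch (function names are ours):

```python
import numpy as np

def participation_ratio(C):
    """Linear dimensionality D(C) = (sum_i lambda_i)^2 / sum_i lambda_i^2, Eq. (2)."""
    lam = np.linalg.eigvalsh(C)
    return lam.sum() ** 2 / (lam ** 2).sum()

def normalized_dimensionality(C):
    """Normalized dimensionality D_tilde = (Tr C)^2 / (N Tr C^2), independent of the width N."""
    N = C.shape[0]
    return np.trace(C) ** 2 / (N * np.trace(C @ C))
```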
For continuous weights, the same expansion yields
\[
\Sigma^{l+1} = \kappa\Bigl[g^4\bigl(\Sigma^l + K^l_2/N\bigr) + q^2 N^2 (K^l_1)^2\Bigr]. \tag{5}
\]
The self-consistency of the above equation for the neural correlation strength requires that $q = r/N^{3/2}$ where $r \sim O(1)$. Therefore, $q^2 N^2$ vanishes in the large-network-width limit. It then follows that the output dimensionality given the input activity statistics takes a form analogous to Eq. (4) [Eq. (6)], with $r = qN^{3/2}$ and $\gamma_1$, $\gamma_2$ evaluated with the continuous-weight statistics [17]. When $q = 0$, $\gamma_1 = 1$, and the dimensionality formula for the orthogonal case is recovered [7]. Note that the dimensionality of the previous hidden representation for both binary- and continuous-weight cases is given by $\tilde{D}^l = (K^l_1)^2/(K^l_2 + N\Sigma^l)$. Therefore, the output dimensionality is tuned by a multiplicative factor $\gamma_1$ and an additive term [the last term in the denominator of Eq. (4) or Eq. (6)]. The tuning mechanism of the hidden-representation dimensionality in a deep hierarchy is thus much richer than that in the orthogonal scenario [7].
The linear relationship between $\Sigma^{l+1}$ and $\Sigma^l$ in Eq. (3) and Eq. (5) defines an operating point where $\Sigma^{l+1} = \Sigma^l$. When the input neural correlation strength is below this point, the non-linear transformation further strengthens the correlation; conversely, the output neural correlation is attenuated once the input one is above the operating point. The synaptic correlation increases not only the operating point, but also the overall neural correlation level, as indicated by a boost of the intercept of the linear relationship while its slope remains the same as in the orthogonal case [17].
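The operating point discussed above is simply the fixed point of a linear map; a tiny sketch of this arithmetic (the slope and intercept are placeholders to be supplied by the theory, e.g., from Eq. (3) or Eq. (5)):

```python
def sigma_fixed_point(slope, intercept):
    """Fixed point Sigma* of the linear map Sigma_{l+1} = slope * Sigma_l + intercept.

    The synaptic correlation raises the intercept (through the r^2 (K_1)^2 term)
    while leaving the slope unchanged, which raises Sigma* and slows decorrelation.
    """
    assert slope < 1.0, "a stable operating point requires slope < 1"
    return intercept / (1.0 - slope)
```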
In the large-width (mean-field) limit, key parameters γ 1 , γ 2 , K l 1 , and K l 2 can be iteratively constructed from their values at the input layer. This iteration determines the mean-field solution of the dimension reduction, which captures typical properties of the system under different realizations of the network parameters (quenched disorder). Technical details to derive the above dimensionality evolution for both binary-and continuous-weight cases are given in the supplemental material [17].
Results and Discussion.-We first study the typical behavior of dimension reduction in networks of binary weights.
Surprisingly, the weak correlation among synapses is able to further reduce the hidden-representation dimensionality across layers compared to the case of orthogonal weights [Fig. 2(a)]. Moreover, the synaptic correlation $r$ can also boost the correlation strength $\Sigma$ [Fig. 2(b)]. The boost is larger at earlier layers of deep networks. In a practical learning process, synaptic plasticity can introduce a certain level of correlation among synapses, reflecting sensory uncertainty. Our theoretical model, despite using a random ensemble of correlated weights, reveals quantitatively the computational role of the synaptic correlations, which impacts both the representation manifold and the neural decorrelation process. In other words, the weak synaptic correlation accelerates the dimension reduction, while reducing the decay speed of the neural correlation strength.
The exact mechanism underlying the computational roles of synaptic correlation can be revealed by a large-$N$ expansion of the two-point correlation function. In the thermodynamic limit, our theory predicts that the dimensionality can be tuned by two factors. One is multiplicative, captured by $\gamma_1$, which is observed to grow until reaching unity [Fig. 2(d)], and equals unity at all layers only when $q = 0$. This multiplicative factor [see Eq. (4) or Eq. (6)] gives rise to a further reduction of dimensionality, especially at deeper layers. However, its value can be less than one at earlier layers, thereby allowing the possibility of increasing the dimensionality at the $(l+1)$-th layer, provided that $\gamma_1 < 1 - A$, where $A > 0$ denotes the additive term in Eq. (4) or Eq. (6). This predicts that a transient dimensionality expansion is possible given strong neural correlations or redundant coding at earlier layers, which was empirically observed during training in both feed-forward and recurrent neural networks [5,7]. The other, additive, factor is directly related to $r^2$. This extra term is clearly positive, thereby contributing an additional reduction of dimensionality. Both factors compete with each other; the multiplicative factor saturates at unity [the left inset of Fig. 2(d)], while the additive term decreases with the network depth, which overall makes the dimension reduction slower in deeper layers, consistent with practical learning of image classification tasks where the final low-dimensional manifold must be stable for reading out object identities [15,19-23]. The elegant way in which the overall neural correlation level is tuned can also be explained by our theory. The weak synaptic correlation contributes an additional additive term to the linear relationship between the input and output neural correlations. This additive term, namely $r^2 (K^l_1)^2$ [the same for both binary and continuous weights, see Eq. (3) and Eq. (5)], raises the intercept of the linear relationship while maintaining the same slope as the $r = 0$ case (Fig. S1 in [17]). The decorrelation process thus proceeds more slowly when going deeper into the network. This prediction of our theory may explain the empirical success in improving classification performance by a decorrelation regularization of weights [24].

By changing the other statistics of network parameters, say $g$ and $\sigma_b$, while fixing $r$, one can observe rich effects of these parameters [Fig. 2(c)]. First, the dimension reduction seems to be unaffected, or changes by a negligible margin. However, the correlation strength of the hidden representation displays rich behavior. The weight strength elevates the correlation level, as also expected from previous studies [7], playing a similar role to the synaptic correlation [Fig. 2(b)]. In contrast, increasing the firing bias further decorrelates the hidden representation. The mechanisms are encoded in Eq. (3) and Eq. (5), in that both the weight strength and the firing bias affect the slope and intercept of the linear relationship in a highly non-trivial way, via a recursive iteration from the initial values of the network activity statistics [17]. These rich effects of tuning the neural correlation of hidden representations may thus shed light on understanding the empirical success of reducing overfitting by optimizing decorrelated representations [25]. The decorrelated representation also coincides with the efficient coding hypothesis in systems neuroscience [26].
In this hypothesis, a useful representation should be maximally disentangled, rather than being highly redundant. Our analysis provides a theoretical support for this hypothesis, emphasizing the role of weakly-correlated synapses.
We finally remark that the above analysis also carries over to networks with continuous weights (Fig. 3). Our simulations on deep networks trained with Hebbian learning rules [27,28] also show key features of the theoretical predictions of our simple model. Details are given in the supplemental material [17].
Summary.-Benefits of weakly-correlated synapses are widely claimed in both systems neuroscience [4,9-11] and machine learning [24]. The benefits are also realized in a theoretical model of unsupervised learning in shallow networks [12], a fundamental process in the cerebral cortex [29]. However, the theoretical basis of how weakly-correlated synapses promote computation in deep neural networks remains unexplored, due to the theoretical complexity of handling the covariance matrix of neural responses. This work tackles the challenge by deriving a mean-field theory for a block-diagonally organized correlation matrix of synapses, inspired by our recent work on unsupervised learning [12].
Our work first identifies the unique scaling form of the synaptic correlation level, for both binary and continuous weight values, based solely on mathematical self-consistency. This scaling form may be experimentally testable, e.g., in the intact brain [30]. The theory then reveals in what exact way the weak synaptic correlation tunes both the representation dimensionality and the associated decorrelation process. More precisely, the synaptic correlation accelerates the dimension reduction, while slowing down the decorrelation process. Both computational roles can be explained by a large-$N$ expansion of our mean-field equations. These predictions coincide with empirical successes of decorrelation regularization at either the synaptic [24] or the neural [25] level. Our theory thus promises to encourage further understanding of how synaptic and neural correlations interact with each other to yield a disentangled representation supporting the success of deep learning, and even hierarchical information processing in different pathways of neural circuits, a challenging problem in interdisciplinary fields across physics, machine learning and neuroscience [31,32].
We thank Chan Li and Wenxuan Zou for a careful reading of the manuscript. This research was supported by the start-up budget 74130-18831109 of the 100-talent program of Sun Yat-sen University, and the NSFC (Grant No. 11805284).

Supplemental Material
Theoretical analysis in the large-N limit

In this paper, we consider deep neural networks of either binary or continuous weights. In the case of binary weights, the distribution of $w^l_{ij}$ is parametrized by zero mean and the covariance given by
\[
\langle w^l_{ij} w^l_{ks}\rangle = \delta_{js} q + \delta_{ik}\delta_{js}(1-q) = q\delta_{js}(1-\delta_{ik}) + \delta_{ik}\delta_{js}, \tag{S1}
\]
where $\langle\cdot\rangle$ denotes the disorder average over the network parameter statistics, and $\delta_{js}$ is the Kronecker delta function. An illustration of this weight covariance is already given in Fig. 1 in the main text. For simplicity, we consider only inter-RF correlations, neglecting intra-RF ones. The bias parameter $b^l_i$ follows independently a Gaussian distribution with zero mean and variance $\sigma_b$. The activation $h^l_i$ is then given by $h^l_i = \phi\bigl(\frac{g}{\sqrt{N}}\sum_j w^l_{ij} h^{l-1}_j + b^l_i\bigr)$.

For the continuous-weight case, the parameter $w^l_{ij}$ follows the Gaussian distribution with zero mean and the covariance specified by $\langle w^l_{ij} w^l_{ks}\rangle = q\delta_{js}(1-\delta_{ik}) + \frac{g^2}{N}\delta_{ik}\delta_{js}$. The bias parameter $b^l_i$ follows independently a Gaussian distribution with zero mean and variance $\sigma_b$. The activation $h^l_i$ is then given by $h^l_i = \phi\bigl(\sum_j w^l_{ij} h^{l-1}_j + b^l_i\bigr)$.

Mean-field iteration of activity moments

In this section, we derive the mean-field iteration of activity moments. Note that, for both binary and continuous weights, the form of the iteration looks very similar; we thus derive in detail the iteration for the continuous case. We first write the mean activity as $m^l_i = \langle h^l_i\rangle = \langle\phi(a^l_i + z^0_i)\rangle$, where the average $\langle\cdot\rangle$ is defined over the activity statistics throughout the paper, and we define the mean-subtracted weighted-sum (or pre-activation) $a^l_i = \sum_j w^l_{ij}\bigl(h^{l-1}_j - \langle h^{l-1}_j\rangle\bigr)$ together with $z^0_i = \sum_j w^l_{ij} m^{l-1}_j + b^l_i$; then the expectation of $a^l_i$ is zero, and its covariance is given by $\Delta^l_{ij} = \langle a^l_i a^l_j\rangle = \bigl[w^l C^{l-1}(w^l)^T\bigr]_{ij}$, where $C$ denotes the covariance matrix of the neural activity. Because $a^l_i$ is a sum of $N$ nearly-independent random terms, as $N \to \infty$ we apply the central limit theorem and obtain
\[
m^l_i = \int Dt\,\phi\bigl(\sqrt{\Delta^l_{ii}}\,t + z^0_i\bigr), \tag{S6}
\]
where $Dt = e^{-t^2/2}dt/\sqrt{2\pi}$. Then, we consider the covariance of activities. Note that the Gaussian random variable $a^l_i$ has variance $\Delta^l_{ii}$. The activity covariance is then given by
\[
C^l_{ij} = \int Dx\,Dy\,\phi\bigl(\sqrt{\Delta^l_{ii}}\,x + z^0_i\bigr)\,\phi\Bigl(\sqrt{\Delta^l_{jj}}\bigl[\psi x + \sqrt{1-\psi^2}\,y\bigr] + z^0_j\Bigr) - m^l_i m^l_j, \tag{S7}
\]
where $Dx = e^{-x^2/2}dx/\sqrt{2\pi}$, and $\psi = \Delta^l_{ij}/\sqrt{\Delta^l_{ii}\Delta^l_{jj}}$. Here $a^l_i$ and $a^l_j$ have been parametrized by two independent standard Gaussian random variables, say $x$ and $y$, respectively. The pre-activation correlation has been captured by the correlation coefficient $\psi$ ($|\psi| \leq 1$).
For binary weights, the above Eqs. (S6), (S7) take the same form, with the pre-activation now carrying the pre-factor $g/\sqrt{N}$, i.e., $a^l_i = \frac{g}{\sqrt{N}}\sum_j w^l_{ij}\bigl(h^{l-1}_j - \langle h^{l-1}_j\rangle\bigr)$ and $z^0_i = \frac{g}{\sqrt{N}}\sum_j w^l_{ij} m^{l-1}_j + b^l_i$, and the covariance $\Delta^l$ rescaled accordingly. With the activity moments, we can then evaluate the dimensionality of the $l$-th layer by
\[
D(C^l) = \frac{\bigl(\sum_i \lambda_i\bigr)^2}{\sum_i \lambda_i^2},
\]
where $\{\lambda_i\}$ is the eigen-spectrum of the covariance matrix $C^l$. Then we can define the normalized dimensionality as $\tilde{D}^l = (\mathrm{Tr}\,C^l)^2/[N\,\mathrm{Tr}(C^l)^2]$, which is then independent of the network width $N$. To derive the recursion of the dimensionality across layers, we additionally define, for a large value of $N$, the order parameters $K^l_1 = \frac{1}{N}\sum_i C^l_{ii}$, $K^l_2 = \frac{1}{N}\sum_i (C^l_{ii})^2$, and $\Sigma^l = \frac{2}{N(N-1)}\sum_{i<j}(C^l_{ij})^2$. The normalized dimensionality of the $l$-th layer is thus expressed as
\[
\tilde{D}^l = \frac{(K^l_1)^2}{K^l_2 + N\Sigma^l}, \tag{S12}
\]
which is useful for the following theoretical analysis.
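Assuming the definitions of $K^l_1$, $K^l_2$, and $\Sigma^l$ used above, these order parameters and the normalized dimensionality can be read off from an empirical covariance matrix as follows (a minimal sketch; the function name is ours):

```python
import numpy as np

def order_parameters(C):
    """K1 = mean diagonal, K2 = mean squared diagonal, Sigma = mean squared off-diagonal of C."""
    N = C.shape[0]
    diag = np.diag(C)
    K1 = diag.mean()
    K2 = (diag ** 2).mean()
    off = C[~np.eye(N, dtype=bool)]          # all off-diagonal elements
    Sigma = (off ** 2).mean()
    # normalized dimensionality through the order parameters; equals Eq. (S12) at large N
    D_tilde = K1 ** 2 / (K2 + (N - 1) * Sigma)
    return K1, K2, Sigma, D_tilde
```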

Expansion of two-point correlations
In the mean-field limit, we can assume $C^l_{ij} \sim O(1/\sqrt{N})$ for $i \neq j$ [18]. In the following part, we assume the weights take continuous values. Binary weights can be similarly analyzed, by noting that the pre-activations should be multiplied by a pre-factor $g/\sqrt{N}$. We first analyze the off-diagonal part of the covariance matrix. First, we notice that $\Delta_{ij} \sim O(1/\sqrt{N})$; that is, when $N$ is sufficiently large, $\Delta_{ij}$ is very small. Then we can carry out a Taylor expansion with respect to the small $\Delta_{ij}$ (the layer index is restored below):
\[
C^l_{ij} \simeq K^l_{ij}\,\Delta^l_{ij},\qquad K^l_{ij} = \int Dx\,\phi'\bigl(\sqrt{\Delta^l_{ii}}\,x + z^0_i\bigr)\int Dy\,\phi'\bigl(\sqrt{\Delta^l_{jj}}\,y + z^0_j\bigr), \tag{S14}
\]
where we define $z^0_i = b^l_i + [w^l m^{l-1}]_i$; the linear coefficient is an average over standard normal variables and is called hereafter $K^l_{ij}$ for the following analysis. We next remark on the diagonal variance $\Delta_{ii}$. In the small-$g$ limit, we can carry out an expansion in $\sqrt{\Delta_{ii}}$ (the layer index is restored below) and obtain the leading-order single-site statistics [Eq. (S15)]; this result was reported in our previous work [7]. We then analyze the diagonal part of the covariance matrix, $C^l_{ii} = \langle\phi^2(a^l_i + z^0_i)\rangle - (m^l_i)^2$. We expand the above formula in the small $\Delta^l_{ii}$ [Eq. (S17)]; therefore, we can write $C^l_{ii} \simeq K^l_{ii}\Delta^l_{ii}$, where $K^l_{ii}$ is the shorthand for the linear coefficient. To improve the prediction accuracy, one needs to include higher-order terms in this approximation. We observe that if we use Eq. (S14) by setting $i = j$, the theoretical prediction in the main text can match the numerical simulation results even at a relatively large value of $g$. This may be due to the fact that the contribution of $\Delta^l_{ii}$ is taken into account when computing $K^l_{ii}$. For binary weights, the expanded covariance takes the same form as Eqs. (S14) and (S17), with the pre-activations rescaled by the pre-factor $g/\sqrt{N}$.
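The linear coefficient $K^l_{ij}$ is a product of two one-dimensional Gaussian averages of $\phi'$, which can be evaluated, e.g., with Gauss-Hermite quadrature. A minimal sketch, assuming the factorized form written in Eq. (S14) above (function names are ours):

```python
import numpy as np

def gauss_average(f, n=80):
    """Average of f(t) over a standard normal t, via Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite_e.hermegauss(n)   # nodes/weights for weight exp(-t^2/2)
    return (w * f(x)).sum() / np.sqrt(2 * np.pi)

def dphi(x):
    return 1.0 - np.tanh(x) ** 2                   # derivative of phi = tanh

def K_ij(delta_ii, delta_jj, z0_i, z0_j):
    """K_ij = <phi'(sqrt(Delta_ii) x + z0_i)> * <phi'(sqrt(Delta_jj) y + z0_j)>."""
    fi = gauss_average(lambda t: dphi(np.sqrt(delta_ii) * t + z0_i))
    fj = gauss_average(lambda t: dphi(np.sqrt(delta_jj) * t + z0_j))
    return fi * fj
```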

Binary weights
First, we calculate $K^l_1$, and in the large-$N$ and small-$g$ limits we obtain $K^l_1 = g^2 K^{l-1}_1\,\overline{K^l_{ii}}$, where $\overline{\cdot}$ means an average over the distribution of network parameters, and $\Delta^l_{ii}$ is approximated by $\Delta^l_{ii} \simeq \frac{g^2}{N}\sum_j C^{l-1}_{jj} = g^2 K^{l-1}_1$. Note that the argument of $\phi'(\cdot)$, namely $z^0_i = \frac{g}{\sqrt{N}}\sum_j w^l_{ij} m^{l-1}_j + b^l_i$, is a sum of a large number of nearly-independent random variables. It is then easy to write that $\langle z^0_i\rangle = 0$ and $\langle (z^0_i)^2\rangle = g^2 Q^{l-1} + \sigma_b$. According to the central limit theorem, we obtain $z^0_i \sim \mathcal{N}(0, g^2 Q^{l-1} + \sigma_b)$, where we have defined $Q^{l-1} \stackrel{\mathrm{def}}{=} \frac{1}{N}\sum_i \bigl(m^{l-1}_i\bigr)^2$. The recursion of $Q^l$ follows from Eq. (S6), where we have used $\Delta^l_{ii} = g^2 K^{l-1}_1$. Finally, we obtain the recursion for $K^l_1$,
\[
K^l_1 = g^2 K^{l-1}_1 \int Dz\,\bigl[\phi'\bigl(\sqrt{g^2 Q^{l-1} + \sigma_b}\,z\bigr)\bigr]^2. \tag{S22}
\]
Note that $K^l_1$ can also be calculated recursively without the small-$g$ assumption, by evaluating Eqs. (S6) and (S7) at $i = j$ and averaging over the Gaussian statistics of $z^0$. Next, the recursion of $K^l_2$ can be calculated by definition, $K^l_2 = \frac{1}{N}\sum_i (K^l_{ii})^2(\Delta^l_{ii})^2$, and thus
\[
K^l_2 = g^4\bigl(K^{l-1}_1\bigr)^2\,\overline{\bigl(K^l_{ii}\bigr)^2}, \tag{S24}
\]
where we have assumed that $K^l_{ii}$ in the large-$N$ limit does not depend on the specific site index. Note that $K^l_2$ can also be evaluated recursively without the small-$g$ assumption, through a nested Gaussian integral over standard normal variables $z$, $u$, and $t$.

We finally derive the recursion of $\Sigma^l$. First, for the binary weights, to compute $\overline{\Delta^2_{ij}}$ (where the layer index can be added later), we expand $\Delta_{ij} = \frac{g^2}{N}\sum_{k,m} w_{ik} w_{jm} C_{km}$ and average over the weight statistics; the cross-term vanishes in statistics to derive the second equality, due to the vanishing intra-RF correlation for one hidden neuron, and the third equality is derived by considering the inter-RF correlation in our current setting. Finally, we arrive at
\[
\Sigma^{l+1} = \overline{\bigl(K^{l+1}_{ij}\bigr)^2}\,g^4\Bigl[\Sigma^l + \frac{K^l_2 + r^2 \bigl(K^l_1\bigr)^2}{N}\Bigr], \tag{S28}
\]
where we have to assume $q = r/\sqrt{N}$ [$r \sim O(1)$], and $\overline{(K^{l+1}_{ij})^2}$ is used to replace $(K^{l+1}_{ij})^2$ in the mean-field approximation. It can be computed recursively as a nested Gaussian integral over the standard Gaussian random variables $x$, $y$, $z_1$, $z_2$, $u_1$, $u_2$, capturing both the thermal and the disorder averages (inner and outer ones, respectively). The correlation coefficient entering this disorder average is $\rho = q$. Therefore, $q$ plays an important role in our current theory, in contrast to the orthogonal case [7].

Continuous weights
First, we calculate $K^l_1$, and in the large-$N$ limit we obtain $K^l_1 = g^2 K^{l-1}_1\,\overline{K^l_{ii}}$, where $\overline{\cdot}$ means an average over the distribution of network parameters, and $\Delta^l_{ii}$ can be approximated by $\Delta^l_{ii} \simeq \frac{g^2}{N}\sum_j C^{l-1}_{jj} = g^2 K^{l-1}_1$. The argument of $\phi'(\cdot)$ is a sum of a large number of nearly-independent random variables; as a result, it is easy to write that $\langle\sum_j w^l_{ij} m^{l-1}_j + b^l_i\rangle = 0$ and $\langle(\sum_j w^l_{ij} m^{l-1}_j + b^l_i)^2\rangle = g^2 Q^{l-1} + \sigma_b$. According to the central limit theorem, we obtain $z^0_i \sim \mathcal{N}(0, g^2 Q^{l-1} + \sigma_b)$, where we have defined $Q^{l-1} \stackrel{\mathrm{def}}{=} \frac{1}{N}\sum_i \bigl(m^{l-1}_i\bigr)^2$. The recursion of $Q^l$ follows from Eq. (S6), where we have used $\Delta^l_{ii} = g^2 K^{l-1}_1$. Finally, we obtain the recursion for $K^l_1$ in the small-$g$ limit as
\[
K^l_1 = g^2 K^{l-1}_1 \int Dz\,\bigl[\phi'\bigl(\sqrt{g^2 Q^{l-1} + \sigma_b}\,z\bigr)\bigr]^2. \tag{S35}
\]
Note that $K^l_1$ can also be calculated recursively without the small-$g$ assumption, by evaluating Eqs. (S6) and (S7) at $i = j$. We then derive $K^l_2$ as
\[
K^l_2 = g^4\bigl(K^{l-1}_1\bigr)^2\,\overline{\bigl(K^l_{ii}\bigr)^2}, \tag{S37}
\]
where $\overline{(K^l_{ii})^2}$ is approximated by its large-$N$ value, independent of the site index. Note that $K^l_2$ can also be evaluated recursively without the small-$g$ assumption, through a nested Gaussian integral over standard normal variables $z$, $u$, and $t$.

To finally obtain the recursion of $\Sigma^l$, we first analyze $\overline{\Delta^2_{ij}}$ (the layer index will be added later), where $\rho = qN/g^2$ due to the fact that $\langle w_{ik} w_{jk}\rangle = q$ as well as $\langle w^2_{ik}\rangle = g^2/N$; to arrive at the final equality, we have used the Wick factorization of the Gaussian weight moments. We finally arrive at the recursion for $\Sigma^l$ as
\[
\Sigma^{l+1} = \overline{\bigl(K^{l+1}_{ij}\bigr)^2}\Bigl[g^4\Bigl(\Sigma^l + \frac{K^l_2}{N}\Bigr) + \frac{r^2\bigl(K^l_1\bigr)^2}{N}\Bigr],
\]
where we have to assume $q = r/N^{3/2}$ [$r \sim O(1)$], and $\overline{(K^{l+1}_{ij})^2}$ is used to approximate $(K^{l+1}_{ij})^2$; it can be computed through a nested Gaussian integral over the standard Gaussian random variables $x$, $y$, $z_1$, $z_2$, $u_1$, $u_2$, capturing both the thermal and the disorder averages (inner and outer ones, respectively). The correlation coefficient in the continuous-weight case is given by $\rho = qN/g^2$.

We finally remark that the synaptic correlation is able to boost the neural correlation level when transmitting signals via hidden representations. From the linear relationship between $\Sigma^{l+1}$ and $\Sigma^l$ [see Eq. (S28)], one derives for the binary weights that the operating point is given by
\[
N\Sigma^{l*} = \frac{\Upsilon\bigl[K^l_2 + r^2\bigl(K^l_1\bigr)^2\bigr]}{1 - \Upsilon}, \tag{S45}
\]
where $\Sigma^{l*}$ has been multiplied by $N$, and $\Upsilon \stackrel{\mathrm{def}}{=} g^4\,\overline{(K^{l+1}_{ij})^2}$. Eq. (S45) implies that the operating point is increased by the synaptic correlations (the last term in the numerator). The intercept of the linear relationship is also increased by the positive amount $\Upsilon r^2 (K^l_1)^2$. Note that the slopes of the linear relationship in the orthogonal-weight and correlated-weight cases are the same. These effects are qualitatively the same for both continuous and binary weights, as shown in Fig. S1.

Binary weights
According to the definition, with the help of Eqs. (S22) and (S24) and the recursion equation for $\Sigma^l$, we obtain
\[
\tilde{D}^{l+1} = \frac{(K^l_1)^2}{\gamma_1\bigl(K^l_2 + N\Sigma^l\bigr) + \gamma_2 (K^l_1)^2 + \gamma_1 r^2 (K^l_1)^2},
\]
where $\gamma_1 = \overline{K_{ij}^2}/\overline{K_{ii}}^2$, $\gamma_2 = \overline{K_{ii}^2}/\overline{K_{ii}}^2$ and $r = qN^{1/2}$. When the layer-index superscripts of $K_{ij}$ and $K_{ii}$ are clear, they are omitted. Here, we use the activity statistics of the previous layer to estimate the dimensionality of the current layer, rather than the original formula [Eq. (S12)]; thus the mechanism of the dimensionality change can be revealed.
Note that to evaluate $\gamma_1$ and $\gamma_2$, we need to compute the disorder averages $\overline{K_{ij}^2}$, $\overline{K_{ii}^2}$, and $\overline{K_{ii}}$ [Eq. (S47)], where $\rho = q$. $Q^l$, $K^l_1$, and $K^l_2$ can also be computed recursively by following the iterative equations in the preceding section on binary weights.

Continuous weights
According to the definition, with the help of Eqs. (S35) and (S37) and the recursion equation for $\Sigma^l$, we obtain the analogous expression for $\tilde{D}^{l+1}$, where $\gamma_1 = \overline{K_{ij}^2}/\overline{K_{ii}}^2$, $\gamma_2 = \overline{K_{ii}^2}/\overline{K_{ii}}^2$ and $r = qN^{3/2}$. When the layer-index superscripts of $K_{ij}$ and $K_{ii}$ are clear, they are omitted. The iterative equations for computing $\gamma_1$ and $\gamma_2$ are the same as in Eq. (S47), yet with $\rho = qN/g^2$ for the continuous-weight case. $Q^l$, $K^l_1$, and $K^l_2$ can also be computed recursively by following the iterative equations in the preceding section on continuous weights.

Closed-form mean-field iterations for estimating the dimensionality
For continuous weights, the mean-field iteration of the covariance matrix of activations is given by
\[
m^l_i = \int Dt\,\phi\bigl(\sqrt{\Delta^l_{ii}}\,t + z^0_i\bigr)
\]
and
\[
C^l_{ij} = \int Dx\,Dy\,\phi\bigl(\sqrt{\Delta^l_{ii}}\,x + z^0_i\bigr)\,\phi\Bigl(\sqrt{\Delta^l_{jj}}\bigl[\psi x + \sqrt{1-\psi^2}\,y\bigr] + z^0_j\Bigr) - m^l_i m^l_j,
\]
where $Dx = e^{-x^2/2}dx/\sqrt{2\pi}$, $\psi = \Delta^l_{ij}/\sqrt{\Delta^l_{ii}\Delta^l_{jj}}$, $\Delta^l = w^l C^{l-1}(w^l)^T$, and $z^0_i = [w^l m^{l-1}]_i + b^l_i$. By taking the covariance matrix $\Lambda$ and the zero mean of the data as initial values, we can get the covariance matrix of the activations of each subsequent layer by iterating the above equations. For binary weights, the equations of the mean-field iteration are just a bit different: the pre-activation carries the pre-factor $g/\sqrt{N}$, i.e., $\Delta^l = \frac{g^2}{N} w^l C^{l-1}(w^l)^T$ and $z^0_i = \frac{g}{\sqrt{N}}[w^l m^{l-1}]_i + b^l_i$, while the same Gaussian integrals apply. The procedure for estimating the linear dimensionality is similar to that of the continuous-weight case.
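A minimal sketch of this layer-wise iteration for continuous weights, replacing the exact Gaussian integrals by Monte Carlo sampling of the pre-activations (all names, sample sizes, and the numerical jitter are our own choices):

```python
import numpy as np

def mean_field_dimension_curve(Lambda, weights, biases, n_samples=20000, seed=0):
    """Iterate the mean-field equations layer by layer and record the
    normalized dimensionality of each hidden covariance matrix.

    Lambda  : (N, N) input covariance; the input mean is taken to be zero.
    weights : list of (N, N) weight matrices, one per hidden layer.
    biases  : list of (N,) bias vectors.
    """
    rng = np.random.default_rng(seed)
    N = Lambda.shape[0]
    m, C = np.zeros(N), Lambda
    dims = []
    for w, b in zip(weights, biases):
        z0 = w @ m + b
        Delta = w @ C @ w.T
        L = np.linalg.cholesky(Delta + 1e-10 * np.eye(N))
        a = rng.standard_normal((n_samples, N)) @ L.T   # Gaussian pre-activation fluctuations
        h = np.tanh(a + z0)                             # phi = tanh
        m, C = h.mean(axis=0), np.cov(h, rowvar=False)
        dims.append(np.trace(C) ** 2 / (N * np.trace(C @ C)))
    return dims
```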

Numerical generation of weight-correlated neural networks
We consider a five-layer fully connected neural network with one input layer and four hidden layers. The number of neurons in each layer is specified by $N$. The parameters of the network are generated by following the procedure below; after the initialization, all parameters remain unchanged during the simulation of the dimension estimation, and the result is then averaged over many independent realizations with the same statistics of network parameters.

Binary weights
The binary weight ($w_{ij} = \pm 1$) follows statistics of zero mean and the covariance specified by Eq. (S1). Diagonalization of the full covariance matrix of binary weights is challenging. However, no correlation occurs within each RF. We can therefore generate the network weights for each diagonal block in Fig. 1 (see the main text) independently by a dichotomized Gaussian (DG) process [33]. In the DG process, the binary weights are generated by $w^l_{ij} = \mathrm{sign}(x^l_{ij})$, where $x^l_{ij}$ is sampled from a multivariate Gaussian distribution of zero mean (due to $\langle w^l_{ij}\rangle = 0$) and a covariance with unit diagonal elements and off-diagonal elements $\Sigma_{ij}$ within each block, as also shown in the schematic illustration in Fig. S2. The relation between $q$ and $\Sigma$ can be established by matching the covariance of the DG process with our prescribed correlation level $q$, i.e., $q = \frac{2}{\pi}\arcsin(\Sigma_{ij})$, or equivalently $\Sigma_{ij} = \sin(\pi q/2)$.
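A minimal sketch of the DG generation of one correlated column of binary weights, assuming the standard relation $q = \frac{2}{\pi}\arcsin(\Sigma_{ij})$ used above (names are ours):

```python
import numpy as np

def sample_binary_column(N, q, rng=None):
    """Sample (w_1k, ..., w_Nk) = sign(x), x ~ N(0, S), with pairwise correlation q.

    The Gaussian covariance S has unit diagonal and off-diagonal c = sin(pi*q/2),
    so that <sign(x_i) sign(x_j)> = (2/pi) * arcsin(c) = q.
    """
    rng = np.random.default_rng() if rng is None else rng
    c = np.sin(np.pi * q / 2.0)
    S = np.full((N, N), c) + (1.0 - c) * np.eye(N)
    x = rng.multivariate_normal(np.zeros(N), S)
    return np.sign(x)

# the full weight matrix is built column by column (one DG draw per presynaptic index)
```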

Continuous weights
For the continuous case, the parameter $w^l_{ij}$ follows a Gaussian distribution with zero mean and the covariance specified by $\langle w^l_{ij} w^l_{ks}\rangle = q\delta_{js}(1-\delta_{ik}) + \frac{g^2}{N}\delta_{ik}\delta_{js}$. When generating the weights, we divide $w^l$ into $N$ vectors, each of which is defined by $[w^l_{1j}, w^l_{2j}, \cdots, w^l_{Nj}]$ ($j = 1, 2, \cdots, N$). All these vectors are then independently sampled from a multivariate Gaussian distribution with zero mean and covariance matrix $\Sigma \in \mathbb{R}^{N\times N}$, where $\Sigma$ is a symmetric matrix with diagonal elements $g^2/N$ and off-diagonal elements $q$. The parameter $b^l_i$ follows $\mathcal{N}(0, \sigma_b)$ independently.
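A minimal sketch of this column-wise generation for continuous weights (names are ours; the prescribed $q$ must satisfy $q \leq g^2/N$ for the covariance to be valid, which holds for the scaling $q = r/N^{3/2}$ at large $N$):

```python
import numpy as np

def sample_continuous_weights(N, g, q, sigma_b, rng=None):
    """Generate w (N x N) with <w_ik w_jk> = q for i != j, <w_ik^2> = g^2/N,
    and biases b ~ N(0, sigma_b) (sigma_b is the variance, as in the text)."""
    rng = np.random.default_rng() if rng is None else rng
    # covariance of one column: diagonal g^2/N, off-diagonal q (requires q <= g^2/N)
    S = np.full((N, N), q) + (g ** 2 / N - q) * np.eye(N)
    # each column [w_{1k}, ..., w_{Nk}] is an independent multivariate Gaussian draw
    w = rng.multivariate_normal(np.zeros(N), S, size=N).T
    b = rng.normal(0.0, np.sqrt(sigma_b), size=N)
    return w, b
```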
Numerical generation of input data samples following the pre-defined covariance

The input dataset for our deep transformation includes 100 000 data samples, which are independently sampled from a multivariate Gaussian distribution with zero mean and the covariance matrix specified by $\Lambda = \frac{1}{N}\xi\xi^T$, where $\xi \in \mathbb{R}^{N\times P}$ and its elements follow a Gaussian distribution $\xi_{ij} \sim \mathcal{N}(0, \sigma^2)$. We define $\alpha = P/N$, and the relation between $\alpha$ and $\tilde{D}$ can be proved to be $\tilde{D} = \alpha/(1+\alpha)$, as we shall show below.
First, we need to calculate the eigenvalue spectrum of $\Lambda$, namely $\{\lambda^\Lambda_i\}$, by the replica trick [34]. The Edwards-Jones formula expresses the spectral density through the disorder-averaged logarithm of a partition function $Z(\lambda)$, where the Sokhotski-Plemelj identity is used to derive the second equality, and a Gaussian integral representation of the determinant leads to the partition-function form, with $\lambda_\epsilon = \lambda - i\epsilon$ and the Gaussian integration variables $y$. Hereafter, for simplicity, we consider only the annealed average $\langle Z(\lambda)\rangle_\Lambda$, instead of the more complex quenched one analyzed in Ref. [34]. The result will be cross-checked by numerical experiments. In some cases, the annealed average agrees with the quenched one obtained via the replica trick, due to the fact that different replicas of the Gaussian variables $y$ are decoupled [35]. The computation then proceeds by introducing auxiliary variables $m$ with the measure $dm \stackrel{\mathrm{def}}{=} \prod_i \frac{dm_i}{\sqrt{\pi}}$, using the Gaussian integral identity for $m$, calculating the expectation with respect to $\xi$, and applying a further Gaussian identity. We can then write down $\langle Z(\lambda)\rangle$ as an integral over order parameters $(q, r)$ and their conjugates $(\hat{q}, \hat{r})$ of $e^{N f_\lambda}$, with
\[
f_\lambda(q, r, \hat{q}, \hat{r}) = \frac{1}{2}\lambda q + \alpha r - \alpha\sigma^2 r\hat{q} - q\hat{q} + \frac{1}{2}\ln\hat{q} - \alpha r\hat{r} + \frac{\alpha}{2}\ln\hat{r}.
\]
When $N \to \infty$, we can use Laplace's approximation, from which the integral is dominated by the stationary point $(q^*, r^*, \hat{q}^*, \hat{r}^*)$, determined by the saddle-point equations [Eq. (S71)]. Applying the Edwards-Jones formula in the annealed version, and noting that $\partial f_\lambda/\partial\lambda = \frac{1}{2}q^*$, where $q^*$ can be obtained by solving the saddle-point equations [Eq. (S71)], we finally arrive at the Marchenko-Pastur spectral density
\[
\rho(\lambda) = \frac{\sqrt{(\sigma^2\lambda_+ - \lambda)(\lambda - \sigma^2\lambda_-)}}{2\pi\sigma^2\lambda},
\]
where $\lambda_\pm = (\sqrt{\alpha}\pm 1)^2$ for $\alpha > 1$. The comparison between theory and simulation is shown in Fig. S3. In the large-$N$ limit, the normalized dimensionality of the matrix $\Lambda$ can be written as $\tilde{D} = \langle\lambda\rangle^2/\langle\lambda^2\rangle$. The numerator is given by the first moment of the eigen-spectrum, $\langle\lambda\rangle = \int d\lambda\,\rho(\lambda)\,\lambda = \alpha\sigma^2$, where we calculate the integral by computing the area of the semicircle shown in Fig. S4(a). The denominator is given by the second moment of the spectrum, computed as $\langle\lambda^2\rangle = \int d\lambda\,\rho(\lambda)\,\lambda^2 = \alpha(1+\alpha)\sigma^4$, where the integral can be transformed to half of the volume of the cylinder shown in Fig. S4(b) [36]. Finally, we conclude that
\[
\tilde{D} = \frac{\alpha}{1+\alpha},
\]
from which the normalized input dimensionality is independent of the input-pattern ($\xi$) variance $\sigma^2$. This analytic result is confirmed in numerical simulations in the main text.
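The analytic result $\tilde{D} = \alpha/(1+\alpha)$ can be cross-checked by building $\Lambda = \frac{1}{N}\xi\xi^T$ directly; a minimal sketch:

```python
import numpy as np

def input_dimensionality(N=1000, alpha=2.0, sigma=0.5, seed=0):
    """Empirical normalized dimensionality of Lambda = xi xi^T / N versus alpha/(1+alpha)."""
    rng = np.random.default_rng(seed)
    P = int(alpha * N)
    xi = rng.normal(0.0, sigma, size=(N, P))
    Lam = xi @ xi.T / N
    lam = np.linalg.eigvalsh(Lam)
    D_tilde = lam.sum() ** 2 / (N * (lam ** 2).sum())
    return D_tilde, alpha / (1.0 + alpha)   # empirical value vs analytic prediction
```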

Deep neural networks trained with Hebbian learning rules
To verify the principles revealed in this paper, we perform on-line training of a layered neural network with the same transfer function, by applying the well-known Hebbian rule. We use the synthetic dataset generated as in the preceding section, containing 10 000 input samples ($\alpha = 2$). These data samples are sequentially shown to the input layer of the deep neural network, and then learned layer by layer. We use synaptic rescaling to control the total synaptic strength of each receptive field, and $w_{ij}(t)$ is updated by the regularized Hebbian rule of Eq. (S80), where $t$ denotes the learning step, $N$ denotes the receptive-field size of the hidden neuron at the $l$-th layer, $\eta$ is the learning rate, and $\kappa_c$ enforces the synaptic-correlation constraint, inspired by our theory. The last term in Eq. (S80) can be derived as the gradient descent of the correlation-constraint objective of Eq. (S81), where the synaptic-correlation scaling derived in our paper is used for learning the continuous weights. The synaptic rescaling operation, together with the synaptic-correlation constraint, encourages synapses to compete with each other to encode input features during learning [27,28]. As shown in Fig. S5, the regularized Hebbian learning rule is able to reduce the dimensionality of hidden representations, while decorrelating the representations as well. The qualitative behavior of the on-line trained systems coincides with the theoretical predictions of our model about the roles of synaptic correlations. Therefore, it is promising to design unsupervised/supervised learning algorithms that can control the synaptic correlations in future works, e.g., in sensory perception of real-world datasets.
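The exact update rule (S80) and objective (S81) are not reproduced above; the sketch below only illustrates the ingredients described in the text: a Hebbian term, a penalty that pushes the inter-RF weight overlaps toward a prescribed correlation level, and synaptic rescaling of each receptive field after every update. The specific squared-deviation penalty and all names are our assumptions, not the paper's exact rule.

```python
import numpy as np

def hebbian_step(w, x, eta=0.01, kappa_c=0.1, q_target=0.0, g=1.0):
    """One illustrative update of one layer's weights w (N x N) on an input x (N,).

    Hebbian term: post * pre outer product.  Correlation penalty: gradient of
    (kappa_c/4) * sum_{i != j} (w_i . w_j - q_target)^2, pushing inter-RF overlaps
    toward q_target.  Synaptic rescaling: each receptive field (row) is rescaled
    to norm g, so synapses within a row compete for a limited total strength.
    """
    h = np.tanh(w @ x)                        # post-synaptic activity
    overlaps = w @ w.T                        # pairwise receptive-field overlaps
    np.fill_diagonal(overlaps, q_target)      # exclude i = j from the penalty
    grad_corr = (overlaps - q_target) @ w     # gradient of the correlation penalty
    w = w + eta * (np.outer(h, x) - kappa_c * grad_corr)
    w = g * w / np.linalg.norm(w, axis=1, keepdims=True)   # synaptic rescaling per row
    return w
```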