Data-driven effective model shows a liquid-like deep learning

The geometric structure of an optimization landscape is argued to be fundamentally important to support the success of deep neural network learning. A direct computation of the landscape beyond two layers is hard. Therefore, to capture the global view of the landscape, an interpretable model of the network-parameter (or weight) space must be established. However, the model is lacking so far. Furthermore, it remains unknown what the landscape looks like for deep networks of binary synapses, which plays a key role in robust and energy efficient neuromorphic computation. Here, we propose a statistical mechanics framework by directly building a least structured model of the high-dimensional weight space, considering realistic structured data, stochastic gradient descent training, and the computational depth of neural networks. We also consider whether the number of network parameters outnumbers the number of supplied training data, namely, over- or under-parametrization. Our least structured model reveals that the weight spaces of the under-parametrization and over-parameterization cases belong to the same class, in the sense that these weight spaces are well-connected without any hierarchical clustering structure. In contrast, the shallow-network has a broken weight space, characterized by a discontinuous phase transition, thereby clarifying the benefit of depth in deep learning from the angle of high dimensional geometry. Our effective model also reveals that inside a deep network, there exists a liquid-like central part of the architecture in the sense that the weights in this part behave as randomly as possible, providing algorithmic implications. Our data-driven model thus provides a statistical mechanics insight about why deep learning is unreasonably effective in terms of the high-dimensional weight space, and how deep networks are different from shallow ones.


I. INTRODUCTION
Artificial deep neural networks have achieved state-of-the-art performance in many industrial and academic domains ranging from pattern recognition and natural language processing [1] to many-body quantum physics and classical statistical physics [2]. However, it remains challenging to reveal the mechanisms underlying the success of deep learning. One key obstacle is building the causal relationship between the high-dimensional non-convex optimization landscape and the state-of-the-art performance of deep networks [3]. In the past few years, many theoretical and empirical efforts contributed to understanding this fundamental relationship in deep networks of real-valued weights. High-loss local minima of the optimization landscape are rare in the over-parametrization regime where the number of trainable parameters (weights) in a typical deep network is much larger than the number of training data [4]. All minima in the landscape were shown to be global minima, given special requirements for network architectures and neural activation functions [5]. In particular, no substantial barriers between minima were observed in deep learning [6]. These minima can be even connected via hyper-paths within a unique global valley [7][8][9]. The geometric structure of local minima was also extensively studied [10][11][12][13], in terms of a controversial description of the geometric flatness of the minima [14][15][16][17][18]. The flat minima imply that the weights around them can be perturbed without significantly changing the training cost function. The flatness of such minima was empirically supported by the eigen-spectrum of the Hessian matrix evaluated at those minima [11]. The curvature of the landscape was also claimed to correlate with the generalization ability of deep networks for unseen data [18,19]. Therefore, studying deep networks from the perspective of the high-dimensional weight space is fundamentally important, yet the previous studies did not give a global view about the collective behavior of interacting weights.
Physics methods have been applied to investigate the weight space structure of non-convex high-dimensional problems, e.g., the spherical or binary spin model approximation of deep neural networks [20][21][22][23], and the weight space structure of binary perceptron problems [24][25][26]. However, one drawback of these studies concerns their strong assumptions to build toy models. These models assume unstructured data as well as shallow networks, and moreover the learning behavior is far from the practical training procedure used in a typical deep learning. Therefore, it is computationally hard to directly compute the weight space even beyond two-layer architectures, and moreover it remains elusive from an interpretable-model perspective what the geometric structure of the deep-learning landscape looks like and which key factor affects the landscape. In this paper, we restrict the theoretical analysis to deep networks of binary synapses, which is particularly important for robust and energy-efficient neuromorphic computation [27,28].
Here, from a conceptually different viewpoint, we propose a data-driven effective model, taking into account interactions among structured data, deep architectures, and training.
We consider deep networks of binary weights that are amenable for theoretical analysis and moreover can be efficiently trained with mean-field methods [29]. The weight configurations realizing a high classification accuracy of the structured data are collected as a large ensemble, whose statistics is described by first-order and second-order moments of the weight distribution. To construct an effective model of the weight space, the maximum-entropy principle [30] is applied. This principle was previously used to analyze neural population at the collective network-activity level [31,32]. This effective model of the practical deep learning is then analyzed from an entropy landscape angle, in which the geometric structure of the entire weight space can be characterized in different contexts: over-parametrization, under-parametrization and shallow networks.
Our study establishes an interpretable model of the weight space focusing on the global view of the landscape, and this model has three important predictions: (i) The deep-learning weight-space is smooth and dominated by a single well-connected component, while the shallow network is characterized by a broken weight space. (ii) For the deep network, the under-parameterization and over-parameterization (the number of training data is more and less than the number of parameters, respectively) belong to the same universal class with a common smooth landscape, which exactly coincides with recent empirical findings of no substantial barriers between minima in the deep-learning loss landscape [6][7][8], although we focus on the deep networks of binary weights. We remark that the universal class may be related to the absence of the double descent phenomenon for the deep networks of binary weights, at least in the explored range of training parameters. (iii) A most surprising prediction is that a special interior part of a largest entropy in the weight space emerges after learning, showing a liquid-like property, compared to two more constrained boundaries of the deep network. This part is conjectured to play a key role in fast learning dynamics. m i = σ i and pairwise correlations (second-order moments) C ij = σ i σ j will be computed directly from the collected weight configurations. Based on the weight statistics, we establish an effective data-driven Ising model that is able to reflect the weight space characteristics, and then apply the entropy landscape analysis by introducing a distance-dependent extra term into the Boltzmann distribution of the effective Ising model. The entropy landscape analysis can help to explore the internal structure of the weight parameter space by counting how dense weight parameters σ are at a typical distance d away from the reference one σ * .

II. DEEP LEARNING SETTING
In order to collect weight configurations in the neural network parameter space, we design a neural network architecture to solve the classification task of the handwritten digits (or the MNIST dataset [33]) first, and then an extensive number of weight solutions that lead to high test accuracy are collected. The test accuracy refers to the network's ability to classify unseen dataset. More precisely, we consider a four-layer fully-connected feedforward network. Each layer has N l (l = 0, 1, 2, 3) neurons, where N 0 equals to the dimension of the input vector, and N 3 equals to the number of total classes. The weight matrix W l indicates connections between layer l and layer l − 1 (Fig. 1). For simplicity, we do not consider biases for neurons. The neural activity (including the input) is transfered by a non-linear function, i.e., the leaky ReLU (L-ReLU) function defined by max(0, z)+0.1 min(0, z), where z denotes the pre-activation or the weighted-sum of input activities. We finally add a softmax layer after the output layer and use the cross entropy as the loss function. Because of the second demand, the network with binary weights is difficult to train by the standard gradient decent algorithm, due to the fact that the cost function is not differentiable with respect to the discrete synaptic weight. To tackle this challenge, we apply a recentlyproposed mean-field training algorithm [29]. More precisely, we consider the situation that each weight w l ij follows a binomial distribution P (w l ij ) parametrized by an external field θ l ij as follows: where σ(·) is a sigmoid function. The external fields are clearly continuous. Therefore, the standard backpropagation can be applied to the external fields {θ l ij }, instead of weights (ill-defined in gradients).
In the feedforward transformation, the pre-activation can be written as z l j = i w l ij a l−1 i , where a l−1 i is the activation at the layer l−1. The weight w l ij is subject to the aforementioned binomial distribution, and thus its mean and variance, i.e., µ l ij and (σ l ij ) 2 , are given by: To avoid a direct sampling process, we use the local re-parametrization trick [34]. According (green dots) and 20 000 learned weight configurations (blue dots). The low dimensional projection is carried out by the t-SNE technique [35]. The random initialization denotes the weight configuration (or initial state) sampled by the Gaussian random initial fields θ 0 . The collected weight configurations (or solution states) are obtained at the end of the training starting from the initial fields θ 0 . For the solution states, the deeper their colors are, the higher the corresponding test accuracies are, while the random initial states are all of the same green color of chance-level accuracy.
to the central limit theorem, z l j can then be approximated to follow a Gaussian distribution Hereafter, we define the mean and variance of z l j to be m l j and (v l j ) 2 , respectively. Taken together, the feedforward transformation can be finally re-parametrized as: where l j is a standard Gaussian random variable, and 1 √ N l−1 is a scaling factor to eliminate the layer-width dependence of the activation a l j . Detailed training procedures are given to Appendix A.
In practice, we use two hidden layers with 15 neurons for each. The widths of input layer and output layer are determined by the data, which are 20 and 10, respectively. Thus, our network architecture is specified by 20-15-15-10, resulting in the number of weight parameters N = 675. We highlight that the total number of possible weights 2 N ≈ 10 1213 is already very huge for a statistical analysis. As expected, the size of hidden layer makes learning improve over the shallow counterpart (see Appendix G). Unless stated otherwise, we use the entire MNIST dataset (i.e., 60 000 digits examples) for training and the other 10 000 examples for testing. In the following analysis, we denote P as the number of training examples.
By fine tuning hyper-parameters of the training procedure (see Appendix D), the neural network can reach a test accuracy of 85% [ Fig. 2(a)], i.e., the trained network can classify correctly 85% of the unseen test dataset. To collect distinct weight configurations, we train the neural network for 50 epochs starting from a random Gaussian initialization of external fields θ during each sampling trial. Then, we use the trained external fields to generate 10 weight configurations and select the one with the optimal test accuracy (≥ 80%). In this way, we obtain two million weight configurations (the corresponding number is specified by M below) with test accuracies ranged from 80% to 83%, for constructing an effective model.
To give a brief picture of the weight parameter space, we apply the t-SNE dimension reduction technique [35] to 20 000 random initialization points and the corresponding learned weight configurations with the same number [ Fig. 2 (b)]. Here, random initialization points represent the weight parameters sampled from the Gaussian-initialized external fields, from which the external fields are trained to reach the test accuracy threshold. The collected weight configurations are sampled from the final binomial distribution. Note that the dominant part of the weight configuration space (valid weights realizing the classification task) seems to be organized into a large connected component [ Fig. 2 (b)], which also indicates that compared to the entire parameter space, the weight space realizing the classification task occupies a relatively small region. Next, we design a semi-rigorous statistical mechanics framework to explore a detailed high-dimensional geometric structure of the weight space for the current deep-learning model.

III. DATA-DRIVEN ISING MODEL
We build a data-driven Ising model to describe these M weight configurations, as we are interested in the statistical properties of the deep-learning weight space. More precisely, one weight parameter solution W = {w 1 , w 2 , . . . , w N } is transformed to a spin configuration σ = {σ 1 , . . . , σ N } in the following analysis. Given the constraints of weight statistics including magnetization (first-order moments) m i = σ i and pairwise correlation (secondorder moments) C ij = σ i σ j , the least-structured-model probability distribution P (σ) to fit the weight statistics follows the maximum entropy principle [30] (see also Fig. 1 for an illustration). According to this principle, P (σ) is recast into the form of the Boltzmann distribution: where Z is the partition function in statistical physics, β is the inverse temperature (β = 1 throughout the paper as it can be absorbed into the model parameters), and E (σ) = To find the model parameters Ω = {J, h}, we use the gradient accent algorithm corresponding to the maximum likelihood learning principle. By maximizing the log-likelihood ln (P Ω (σ)) weight-data with respect to the model parameters Ω = {J, h}, we can obtain the following iterative learning rules [36]: where t and η denote the learning step and learning rate, respectively. In the above learning equations, the data expectation terms can be directly computed from the sampled parameter solutions. However, the model expectation terms are commonly difficult to compute due to the O(2 N ) computation complexity. Fortunately, the cavity method in spin glass theory (Appendix C) can be used to approximate the model expectations. At each learning step, to evaluate how well the Ising model fits the weight statistics, we compute the root-mean squared error (or deviation) between the data expectations and model expectations: to a zero deviation. We stop the learning when ∆ < 0.01 or after 300 learning steps are reached.
In Fig. 3(b), we verify that the reconstructed magnetizations and correlations from the data-driven Ising model are in good agreement with the measured ones, which ensures that the data-driven Ising model is an effective model whose model parameters Ω = {J, h} capture statistical properties of the deep-learning weight space. Moreover, we also find that the data-driven Ising model can reproduce a significant fraction of three-weight correlations (Appendix B). The histogram of the model parameters is shown in Fig. 3(a). Note that most of the couplings are concentrated around zero value for the full model, despite a relatively less frequent tail. In contrast, the bias is broadly distributed. To get more insights about the model parameters, we plot the histogram for different sub-parts of the deep network [ Fig. 3(d)]. Compared to the boundaries, i.e., LB and RB, the interior's couplings are more concentrated around zero value, and moreover the corresponding biases are all concentrated around −0.0065 corresponding to the highest peak in the histogram of Fig. 3 (a), which clearly implies that weights in the interior of the deep network have weak correlations, and moreover the magnetization gets close to zero, behaving like in a high temperature phase, as we shall prove semi-rigorously using the concept of entropy landscape. Interestingly, the number of peaks in the LB's (resp. RB's) bias distribution approximately corresponds to the number of input (resp. output) neurons. This observation demonstrates that the LB's and RB's weights perform redundant encoding and robust decoding, respectively.
Given the model parameters of the constructed Ising model, we can compute the model energy of every weight configuration directly. Fig. 3(c) shows the model energy density versus the test accuracy. Though the fluctuation is large, we find that the energies of sampled weight-configurations distribute around the typical energy of the Ising model and they have high accuracies, which helps to build an intuitive relationship between the model energy landscape and the deep-learning landscape.
To cross-check the qualitative properties of the constructed Ising model, we also apply the pseudolikelihood maximization method to construct the Ising model, relying on the entire configuration data, rather than only the first two moments of the statistics. This method allows us to control the sparsity of the effective Ising model, i.e., the number of null couplings. Introducing the null couplings in the model is physically intuitive, since weights in different layers may not strongly interact with each other. Therefore, we construct sparse Ising models only for deep networks. Technical details are given in Appendix E. As shown in Fig. 4, the aforementioned properties are conserved. In fact, the null couplings explain the weakly interacting weights across layers, corresponding to negligible values of couplings in the dense counterpart. Thus it can be expected that the qualitative behavior does not change for both methods.

IV. ENTROPY LANDSCAPE ANALYSIS
In the previous section, we construct an effective Ising model, whose model parameters preserve the information up to second order correlations of the weight space. In order to explore the internal structure of the weight space of the deep learning, we introduce a distance-dependent term x i σ * i σ i in the original Boltzmann distribution [Eq. (4)] as follows [32]: where σ * is the reference weight configuration taken from the collected weight data, its energy is computed from the Ising Hamiltonian, and x is the coupling field constraining the distance between the configuration σ and the reference one σ * . Intuitively, x > 0 implies that the weight configurations closer to the reference are preferred, while x < 0 implies that the weight configuration far away from the reference are preferred (see Fig. 1 for an illustration).
Then the geometric structure of the high-dimensional weight space can be summarized into the constrained free energy function given below.
where E(σ) is the Hamiltonian of the effective model. We further introduce the energy density = E(σ)/N and the overlap q = i σ i σ * i /N . It then follows that where βf ( , q) ≡ β −xq−s( , q) is the constrained free energy as a function of energy density and overlap q. This constrained free energy captures the competition between the model energy and entropy. The entropy is defined by s( , q) = 1/N ln measuring the log-number of the weight configurations with energy N and overlap q given a reference. In the following analysis, we omit the energy density dependence of s( , q) and write it as the function of d, i.e., s(d). We also assume β = 1, as the inverse temperature has been absorbed into the inferred couplings and biases of the effective model (see Sec. III).
In the case of sparse Ising models learned by the pseudolikelihood maximization method, the cavity approximation may break. However, to estimate the entropy, one can use the heat-capacity integration technique (see Appendix H), despite its demanding computational cost.
To have a complete understanding of the deep-learning landscape, we analyze three representative scenarios. The first one is the under-parametrization case, in which the num- Let us then look at the entropy landscape. According to Eq. (6), the entropy of the system at x = 0 gives a rough estimate of the size of the original weight space, which is one advantage of our statistical mechanics analysis (see a similar study of constraint satisfaction problems [39]). By varying the value of x, we found that the entropy landscape is also smooth. For a system with discrete degrees of freedom, one can easily compute the upper work of learning credit assignments [40]. In sum, the under-parametrization does not lead to a broken weight space, with a liquid-like interior next to two more constrained boundaries.
The aforementioned properties of the deep learning weight space are qualitatively conserved, when the effective model is constructed by applying the pseudolikelihood maximization method, where the sparsity of the model is controlled (Fig. 6). When the cavity approximation breaks, the heat-capacity integration method can be used, yielding the smooth landscape as well (data not shown).
The entropy landscape analysis also allows to compute the iso-energy curve (technical details are given in Appendix C). Figure 7 shows the iso-energy entropy curve, where every point on the curve shares the same energy with the reference configuration. This curve is suppressed when d ∈ [0.1, 0.9], compared to the original entropy landscape. In the regime close to the reference, the entropy is nearly the same as that without the energy constraint.
In the regime around x = 0, there appears a negligible hysteresis loop, indicating that the roughness of the landscape is weak even if the iso-energy constraint is enforced. Note that there exist two local minima in the energy-distance curve, whose asymmetry is caused by the external bias of weights. This shows that by applying the current physics tool, one can access the structure of the specified landscape around any reference point in the high-dimensional deep-learning space.

B. Over-parametrization scenario
We then ask whether the nice property of the under-parametrization case transfers to the over-parametrization case, which is the popular setting in the current deep learning era. We thus investigate the over-parametrization case, where the number of training examples is less than the model complexity (Appendix D). The standard over-parameterization requires that α 1, while our analysis is limited to a light version (α > 1) of over-parametrization.
We surprisingly find that the qualitative properties of the under-parametrization case also hold in the over-parametrization case [ Fig. 5 (b)], in an excellent agreement with recent empirical studies of deep learning at a large-scale architecture [6][7][8]. All these previous studies confirmed that the stochastic gradient descent dynamics does not suffer from high loss barriers [41], and the loss function landscape is characterized by many minima yet separated by low barriers [42]. Our finding adds further evidences by claiming a smooth landscape for deep networks of binary synapses, in both under-and over-parametrization scenarios, which may be connected to the absence of double descent phenomena (Appendix F).
However, we notice a salient difference that the extent the interior entropy gets close to the upper bound is weaker than that of the under-parametrization case, despite the fact that the interior entropy is still highest among three parts of the deep network. Even in the overparametrization case, a liquid-like interior inside the network exists, coinciding exactly with a recent study of deep networks in both random inputs/outputs setting and teacher-student generalization setting [43]. This recent study claimed a glass-liquid-glass/solid-liquid-solid pattern inside the deep-learning architecture. Despite a relatively small scale for the sake of the data-driven analytical study, our work claims the same phenomenon in the central part of the deep network by constructing an effective data-driven model for a practical deep learning. Although we do not observe a glass/solid like boundaries (unlike in the recent study [43]), these two boundaries are clearly more constrained than the central part. We thus expect that these boundaries may enter a glass/solid phase in the thermodynamic limit, which is a very interesting future direction.

C. Shallow networks
To explore the benefit of depth, an important characteristic of the deep learning, we study a shallow-network architecture without any hidden layers. Other conditions are the same with the other two scenarios. Note that, in terms of test performance, adding hidden layers of increasing width improves over the shallow counterpart (Appendix G). Using the same statistical mechanics framework, we surprisingly find a first-order phase transition in the distance-coupling-field profile [ Fig. 5 (c)]. The discontinuous transition is characterized by a hysteresis loop and associated entropy gaps, irrespective of references selected from high-, medium-and low-energy regions. For the shallow-network case, the weight space is broken into clusters separated by entropy gaps. In particular, a double-discontinuous transition is even observed for the low-energy references, implying that the weight space around the lowenergy references becomes much more complex, as also revealed in a biological data analysis of retina coding [32].
Compared to the deep network, the dynamics in the shallow network can be strongly affected by the discontinuous transition because of metastability in thermodynamic states.
When adding more hidden layers, the broken weight space is replaced by a connected smooth sub-space, demonstrating the benefit of depth. Our effective model thus distinguishes a shallow network from its deep counterpart, as also supported by the Monte-Carlo dynamics in the next section.

D. Distributions of TAP free energy and Monte-Carlo dynamics of the effective model
To verify the geometric picture gained from the entropy landscape, we study the TAP free energy distribution of the effective Ising model (see Appendix I for technical details).
As shown in Fig. 8 for the under-parametrization setting, the distribution shows a single peak, consistent with the smooth landscape. The over-parameterization setting has similar results. The single peak profile is further corroborated by the Monte-Carlo dynamics (like that in Appendix B) starting from a reference solution state. As shown in Fig. 9, the local dynamics is apt to drift towards chance-level configurations, without difficulty of crossing high energy-barriers. In contrast, the Monte-Carlo dynamics of the Ising model for the and even inspire algorithmic designs (shown below). Therefore, the constructed Ising model could be an effective model. It would be very interesting to explore in future works whether higher-than-second order correlations in the data enhance our current understanding.

E. Effects of node-permutation symmetry on the deep learning landscape
There exists node-permutation symmetry in deep learning, which is able to relate one weight configuration to another one via a permutation of hidden nodes (see Appendix J). To explore effects of node-permutation symmetry, we design a procedure of ranking the pre-activations of hidden layers, and then control the fraction of permutation symmetry in the collected weight configurations. As shown in Fig. 10, increasing the permutation symmetry fraction leads to an enlarged entropy profile, for four different network settings.
Two key points must be remarked. First, adding the permutation symmetric configurations for learning the effective model does not affect the qualitative behavior of the entropy profile.
Second, the close-by regime of any reference is not affected by the fraction of the permutation symmetric configurations. According to this observation, we conjecture that the underlying picture is that the deep learning weight space under investigation is smooth as a whole. The other parts of the network are normally trained, as we did before. We surprisingly find that this scheme works in practice (Fig. 11 A recent work revealed that as the model complexity of continuous weights increases, the generalization error displays a cusp around a critical value of the complexity; on both sides of this cusp, the generalization error decays [44]. It is thus interesting in future works to see whether there exists a qualitative change in the geometric structure of the weight space around the cusp. We further remark that in our current study, the model complexity is fixed for simplicity. In fact, the model complexity can also be tuned to study the existence of the double descent phenomenon (see Appendix F). We confirm that in the explored range of model complexity, the double descent phenomenon is absent for deep learning of binary synapses. In other words, the generalization capability of the network continuously improves with increasing model complexity.
Third, a special interior part of a largest entropy in the weight space exists, showing a liquid-like property. The liquid phase is defined with a vanishing glass-order-parameter [43], or the absence of ergodicity breaking. In this sense, the learning dynamics in the interior part must be fast, because both directions of each weight are not highly biased, or close to a random initialization. In the particular scenario of under-parametrization, this giant interior entropy even gets very close to the upper bound of entropy set by a completely random-weight space.
The existence of the liquid-like central part coincides exactly with two recent studies, despite the current focus of binary weights. One is the toy random models of deep learning [43], as discussed in Sec. IV B; the other is the ensemble backpropagation applied to practical deep learning [40], To train a deep supervised network with binary weights, we apply the mean-field method first introduced in the previous work [29]. In our current setting, each weight w l ij is sampled from a binomial distribution P (w l ij ) parametrized by an external field θ l ij as follows, with mean µ l ij = 2σ(θ l ij ) − 1 and variance (σ l ij ) 2 = −4σ 2 (θ l ij ) + 4σ(θ l ij ). According to the central-limit theorem, the feedforward transformation can be re-parametrized as: where m l j = i µ l ij a l−1 i , and v l j = i (σ l ij ) 2 (a l−1 i ) 2 . During the error backpropagation phase, we need to compute the gradient of the loss function L [e.g., L CE in Fig. 2 (a)] with respect to the external field θ, which proceeds as We then define ∆ l j ≡ ∂L ∂z l j . On the top layer, ∆ l j = y L j −ŷ L j , where y L j = e z L j i e z L i is the softmax output, andŷ L j is the (one-hot) label of the input. On the lower layers, given ∆ l+1 k , we can iteratively compute ∆ l j : (A5b) Note that is sampled and quenched for both forward and backward computations in a single mini-batch gradient descent. During training, we add an L 2 regularization to controlling the magnitude of external fields. After the learning is terminated, an effective network with binary weights can be constructed by sampling the binomial distribution parametrized by external fields.

Appendix B: The Ising model predicts three-weight correlations
To see whether the data-driven Ising model can reproduce the three-weight correlations, we first run a Monte-Carlo experiment on the Ising Hamiltonian, and then carry out a sampling of the steady state space of the Monte-Carlo dynamics after a sufficient relaxation.
The three-weight correlations are estimated from these samples, and are then compared with those estimated directly from the data (weight ensemble). The result is shown in Fig. 12, which suggests that the Ising model can predict a significant fraction of three-weight correlations (e.g., about 82% provided that the criterion is set to ε = 0.015).

Appendix C: Cavity method for landscape analysis
To analyze the statistical properties of the least structured model, we apply the cavity method in the spin glass theory [45]. Note that the weight configuration follows the Boltzmann distribution P (σ) = exp (−βE(σ)) /Z, where the energy E(σ) = − i<j J ij σ i σ j − i h i σ i , and Z is the partition function. An exact computation of the free energy (−T ln Z) is an NP-hard problem. To get an approximate free energy, we first compute the free energy change ∆F i when adding a weight (e.g., σ i ) and its associated interactions. We then calculate the free energy change ∆F a when adding an interaction (e.g., J a σ i σ j , and a ≡ (ij)).
In all these derivations, we make an assumption that when an interaction is removed, its neighboring weights become independent such that the joint distribution of them becomes factorized, which is exact if the factor graph of the model is locally tree-like. The factor graph is a bipartite graph where two kinds of nodes (weight nodes and interaction nodes) are present. For detailed derivations, interested readers are recommended to go through the standard textbook [46], or the appendix of a recent paper [47]. In the cavity approximation, the free energy can be constructed as below, where |∂a| denotes the number of weights connected to the interaction a, i ∈ ∂a denotes the neighboring weights i of the interaction a, and a ∈ ∂i denotes the neighboring interactions a of the weight i. m j→b is the cavity magnetization of weight j in the absence of the interaction a, andm b→i = tanh βJ b j∈∂b\i m j→b is the conjugate magnetization.
According to the variational principle, the free energy should be stationary with respect to {m i→a }. We can then obtain a set of self-consistent equations called message passing equations as follows, where \i means that the weight i should be excluded from the set. m i→a is interpreted as the message passing from weight i to interaction b, andm b→i is interpreted as the message passing from interaction b to weight i. By iterating these equations from some random initialization, the iteration will converge to a fixed point {m * i→a }, which can be used to compute thermodynamic moments for learning (introduced below) and entropy (used for landscape analysis). We remark that the fixed point corresponds to a local minimum of the free energy function. The algorithm may not converge when the model becomes glassy.
We can directly compute the magnetization m i = σ i , and correlation C a = σ i σ j from the fixed point {m * i→a } as follows, Given the free energy, we can estimate the energy from the standard thermodynamics relation, E = ∂(βF ) ∂β , which is given by: ∆E a =βJ a tanh βJ a + i∈∂a m i→a 1 + tanh βJ a i∈∂a m i→a , The entropy of the model is defined as S = − σ P (σ) ln P (σ), which can be also estimated from the standard thermodynamics relation S = −βF + βE.
In the entropy landscape analysis, we introduce a perturbed probability distribution to explore the internal structure of the network parameter space, which is given by where β is the inverse temperature, and x is the couping field tuning the distance between the configuration σ and the reference σ * . With the Laplace method, the free energy density f can be approximated as f ≈ min ,q f ( , q), which is given by βf = β −xq −s( , d), where is the energy density (E/N ), q is the overlap i σ i σ * i /N , s( , d) is the entropy density S/N , and d is the Hamming distance per weight related to the overlap q by d = (1 − q)/2. At the same time, (x, q) should obey the following equations: ∂s( , d)/∂d = 2x and ∂s( , d)/∂ = β, because of the Laplace method. According to the double Legendre transform, the entropy density s( , d) is given by s( , q) = −βf + β − xq. Note that, by fixing d or , one can construct iso-distance or iso-energy curves just through finding compatible x or β for the following equations: due to the Legendre transform.
Under the perturbed probability measure, the cavity method is still applicable, with an only change for the bias h i as βh i → βh i + xσ * i . By iterating the same message passing equations to a fixed point, we can finally obtain the free energy density, energy density and entropy density, etc. Note that q is computed via i m i σ * i /N , where m i is the fixed-point magnetization.

Appendix D: Simulation details
The MNIST handwritten digits dataset is a classification benchmark dataset which contains 60 000 training images and 10 000 test images. The image is a 28 × 28 gray-scale handwriting digit with a label taken from one of 0 to 9. For the sake of simplicity, the dataset we use for training is re-generated by applying the PCA to the original dataset.
More precisely, we apply the PCA on training images to compute the eigenvectors first, and then project training and test images with these eigenvectors to finally obtain the low dimensional (20 pixels) images. Adam [48] and the initial learning rate is set to 0.3. During the training, we also introduce an L 2 regularization to the loss function to penalize large external fields. The regularization strength is set to 10 −5 , which is optimal for our experiments.
The stimulation details presented above apply to the case of under-parametrization. We also design two other kinds of training scenarios. The first case is the over-parametrization the shallow-network scenario, where we delete all hidden layers, and thus the network architecture changes to 20-10. Other hyper-parameters remain unchanged as well. Following the same weight-collection procedure as the under-parametrization case except for setting the accuracy threshold to 70%, we collect one million and half a million weight configurations for the over-parameterization and shallow-network scenarios, respectively, in order to construct effective models.
Finally, we show additional landscape analysis when the reference weight configuration is selected from the high and medium energy regions. A qualitative same behavior is identified ( Fig. 13 and Fig. 14). In addition, the histograms of model parameters inferred from the weight data are also shown in Fig. 15 for the shallow-network setting. Our framework can also be applied to other more challenging types of dataset, e.g., Fashion-MNIST consisting of fashion items images belonging to ten classes [49], as displayed in Fig. 16 for the overparameterization case. Unlike the case of the MNIST dataset, the entropy landscape of the full network gets close to that of the central part of the network, while other qualitative properties remain. All codes to obtain the results of this paper are available at the link: https://github.com/ZouWenXuan/Liquid-like-deep-learning. functions. The objective function to be maximized is written as follows [50], where the conditional probability of the spin σ i given the rest of the system P (σ µ i |σ µ \i ) is given below, (E2) Therefore, the pseudolikelihood method simplifies the original N -body problem into N independent one-body problems (but each with N −1 quenched spin variables). The optimization can then be implemented through maximizing each of the N local likelihood functions L i separately, and finally the symmetric coupling J ij = 1 2 (J i ij + J j ij ), where J x ij is the coupling value for the maximum of L x . The maximization can be done by simple gradient ascent procedures [51]. To learn a sparse Ising model, one can rewrite the local likelihood with 1 regularization asL i = L i + λ j =i |J ij |, where λ is a freely-tuned parameter controlling the sparsity of the model. The numerical maximization can be efficiently done by the interior point method [52], where λ is the pre-factor of λ max . This regularized optimization leads to a sparse effective model where non-zero and null couplings coexist, thereby distinguishing strongly interacting weights from non-interacting ones. To save the computational cost, we fix M = 100 000 in the pseudolikelihood maximization learning.

Appendix F: Double descent phenomenon
In the traditional statistical learning theory, a classical U-shaped test error curve is predicted when the training is performed with increasing model complexity. However, deep neural networks are able to achieve the state-of-the-art performance using over-parametrized architectures. More precisely, when the number of network parameters are sufficiently large to perfectly learn the training data (i.e., at the so-called interpolation threshold), further increasing the model complexity will cause a second drop of the test error. Therefore, the learning behavior in the over-parametrization regime is called the double descent phenomenon [37,44,53,54]. Note that for ridge regression problems (implemented via a two-layer neural network for an asymptotic study, where only the second layer weights are trained while the first layer weights are random for the sake of theoretical analysis [37,[53][54][55]), the test error may diverge as the interpolation boundary is approached [53,54], because of the divergence of network weights [56]. However, these studies are restricted to deep neural networks with real-valued weights and standard stochastic gradient descent training. In contrast, our study focuses on bounded synapses (e.g., ±1), showing intrinsic differences. In this section, we provide details to check if the double descent phenomenon remains in our current setting, i.e., training deep networks of binary weights.
We consider the architecture of K-h-h-x, where K means the dimension of the input image (the image can be preprocessed by PCA), h means the size of hidden layers (i.e., N 1 = N 2 = h in the main text), and x = 2 or 10 means binary classification (parity of the digits, see also Ref. [44]) or ten-class classification respectively. For the parity classification Deep learning of binary weights. The double descent phenomenon is absent. We actually train the network of binary weights for a total number of 10 5 epochs, therefore the abscissa should be multiplied by a factor of 100.
task, we use the hinge loss, while we use the cross-entropy loss for the ten-class classification task. The value of h is used to tune the model complexity. The value of K takes from 10 to 50 (preservation of 82% original information). In all these settings, we found double descent phenomenon for networks of real-valued weights [see Fig. 17 (a,b)]. In contrast, as shown in Fig. 17 (c,d), for networks of binary weights, the test error keeps decreasing with the model complexity, while the training error never reaches zero (or decreases extremely slowly). Compared to the real-valued-weight network, it is very difficult to train the network to perfectly fit the target function using the low-precision binary weights, which is evident in our simulation results [ Fig. 17 (c,d)]. We thus conclude that no interpolation threshold exists in our current setting, and correspondingly the peak of test error is not observed.
This conclusion is robust for binary classification and other values of input dimensionality. Appendix J: Node-permutation symmetry It is clear that there exists node-permutation symmetry for hidden layers, e.g, exchange of an arbitrary pair of nodes (including their associated incoming and outgoing weights) does not cause the change of the network output. This raises an interesting question-how much this permutation affects the weight configuration data and further the landscape of the effective model. To address this question, we first select all digits of label 1 in the test set as the input, and then each sample of weight configuration produces the hidden pre-activations, whose mean values over the input ensemble are subsequently ranked in the descending order for each hidden layer (called H D ). The ranked pre-activation result thus corresponds to an element of the node-permutation group, which transforms the original weight configuration σ to a permuted one σ gr by a permutation symmetry transformation T : σ gr = T σ (see also Fig. 19). Note that each element in the transformation matrix T takes either zero or one.
We thus call σ gr as a generating weight-configuration, from which permutation-dependent configurations can be generated by carrying out one of the permutation transformations.
The total number of these transformations is given by h! × h!, where h is the width of hidden layer (N 1 = N 2 = h). We remark that the result does not depend on the selected input ensemble, e.g., other digits.
According to this rule, if two weight configurations lead to the same σ gr (identical {H D } as well), we say they are permutation-dependent; otherwise, they are permutation-independent.
In fact, our original training data of weight configurations are completely permutationindependent, probably due to the limited-size weight-data yet the extremely large size of the deep learning weight space. To introduce the permutation symmetry among weight configurations, we use M s of σ gr (by a random selection) to generate as many as 2M s permutation-dependent configurations (one σ gr is paired with the corresponding generated configuration). These generated configurations are combined with the other M −M s ones (M is the number of the collected weight configurations). Therefore, the permutation symmetry ratio is calculated as r ps = 2Ms M −Ms+2Ms . The effects of r ps on the landscape are discussed in the main text.