Subdominant Dense Clusters Allow for Simple Learning and High Computational Performance in Neural Networks with Discrete Synapses

We show that discrete synaptic weights can be efficiently used for learning in large scale neural systems, and lead to unanticipated computational performance. We focus on the representative case of learning random patterns with binary synapses in single layer networks. The standard statistical analysis shows that this problem is exponentially dominated by isolated solutions that are extremely hard to find algorithmically. Here, we introduce a novel method that allows us to find analytical evidence for the existence of subdominant and extremely dense regions of solutions. Numerical experiments confirm these findings. We also show that the dense regions are surprisingly accessible by simple learning protocols, and that these synaptic configurations are robust to perturbations and generalize better than typical solutions. These outcomes extend to synapses with multiple states and to deeper neural architectures. The large deviation measure also suggests how to design novel algorithmic schemes for optimization based on local entropy maximization.

In the past decades, various methods borrowed from statistical physics have been quite successful in studying the basic properties of neural-like systems [1]. A well known general result is that, as is the case for other optimization problems, training neural networks is qualitatively different if the variables (the synaptic weights) are constrained to take discrete values. Yet, the standard equilibrium analysis, which suggests that the solutions to the constrained problem would be inaccessible, is at odds with some recent heuristic algorithmic advances [2][3][4][5], which demonstrate that simple effective protocols may be devised at least in some simple scenarios. Therefore, it is conceptually important to understand the underlying reason for this discrepancy, which may be relevant for larger classes of problems.
Furthermore, the modulation of synaptic efficacy is the elementary computational step for biological information storage and for large scale machine learning architectures and neuromorphic devices. While most models assume that synapses are continuous, it is extremely important to understand the practical implications of constraining the synaptic states. Biological considerations and recent experimental evidence [6,7] suggest that synaptic efficacies store a few bits each (between 1 and 5). Machine learning applications (especially hardware implementations) could benefit from using simpler synaptic models and update protocols, a research direction that is currently hampered by the difficulty of devising effective learning protocols.
The learning problem in neural networks with constrained synapses, even in its simplest formulation (the perceptron with N binary synapses), is known to be intractable in the worst case [8]. In the typical case, its equilibrium description is dominated in the large N limit by an exponential number (in N) of local minima [9][10][11][12], which easily trap standard search strategies based on free energy minimization, e.g. Monte Carlo algorithms [13,14] (a situation typical of spin glass phases, which is common to many hard random optimization problems [15][16][17]); moreover, the optimal synaptic configurations are typically geometrically isolated (i.e., they have mutual Hamming distances of order N), and thus are even harder to find for local search strategies [14].
Here, however, we show that the standard analysis does not capture the properties that are relevant for effective learning strategies. We introduce a novel large deviation analysis that reveals the existence of a different class of solutions, clustered in dense regions. These solutions have radically different properties from the dominating ones, and they are accessible to simple learning protocols. Numerical experiments support the results of this analysis, and show that the same picture also holds for complex neural architectures trained on real-world benchmarks. The analysis generalizes to the case of multilevel synapses.
In a more general sense, these findings highlight the key role that subdominant states have in understanding the practically relevant properties of a prototypical complex system. We have no reason to believe that this scenario is specific to this particular family of problems; in fact, we also show that an optimization strategy inspired by our analysis is also effective on the random K-satisfiability (K-SAT) problem.
The model.-The single layer binary neural network (perceptron) maps vectors of N inputs ξ ∈ {−1, 1}^N to binary outputs as τ(W, ξ) = sgn(W · ξ), where W ∈ {−1, 1}^N is the vector of synaptic weights. Given αN input patterns ξ^μ with μ ∈ {1, …, αN} and their corresponding desired outputs σ^μ ∈ {−1, 1}, and defining X_ξ(W) = ∏_{μ=1}^{αN} Θ(σ^μ τ(W, ξ^μ)), where Θ(x) is the Heaviside step function, the learning problem is that of finding W such that τ(W, ξ^μ) = σ^μ for all μ, i.e., such that X_ξ(W) = 1. The entries ξ_i^μ are independent and identically distributed (i.i.d.) unbiased random variables. There are two main scenarios of interest for the distribution of the desired outputs σ^μ: (1) the classification case, in which they are i.i.d. random variables, and (2) the generalization (or teacher-student) scenario, in which they are provided by a "teacher" device, i.e., another perceptron with synaptic weights W_T. In the classification scenario, the typical problem has a solution with probability 1 in the limit of large N up to α_c = 0.833 [9], after which the probability of finding a solution drops to zero. α_c is called the capacity; we also use this term for the maximum value of α for which a solution can be found by a specific algorithm. In the teacher-student scenario, the problem has exponentially many solutions up to α_TS = 1.245, after which there is a first-order transition and only one solution is possible: the teacher itself [1,10]. One additional quantity of interest in this scenario is the generalization error rate p_e = (1/π) arccos[(1/N) W · W_T], which is the probability that τ(W, ξ*) ≠ τ(W_T, ξ*) when ξ* is a previously unseen input.
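As a concrete illustration of these definitions, here is a minimal numerical sketch; the sizes N and α, and the random seed, are arbitrary choices for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

N, alpha = 21, 0.3                     # illustrative sizes only
P = int(alpha * N)                     # number of patterns, alpha*N
xi = rng.choice([-1, 1], size=(P, N))  # i.i.d. unbiased inputs xi^mu
sigma = rng.choice([-1, 1], size=P)    # random desired outputs (classification case)

def tau(W, xi):
    """Perceptron output: sign of the synaptic field W . xi."""
    return np.sign(xi @ W)

def X(W, xi, sigma):
    """Indicator X_xi(W): 1 iff W classifies every pattern correctly."""
    return int(np.all(sigma * tau(W, xi) > 0))

# Teacher-student scenario: the labels are produced by a teacher perceptron.
W_T = rng.choice([-1, 1], size=N)
sigma_TS = tau(W_T, xi)

def p_e(W, W_T):
    """Generalization error rate (1/pi) * arccos(W . W_T / N)."""
    return np.arccos(np.clip(W @ W_T / len(W), -1.0, 1.0)) / np.pi

print(X(W_T, xi, sigma_TS))   # the teacher solves its own problem: prints 1
```

With N odd the synaptic field is never zero, so the sign is always well defined; p_e vanishes when the student coincides with the teacher.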
The standard zero-temperature equilibrium analysis of this model is based on a probability measure defined by the partition function Z_eq = Σ_{W} X_ξ(W); the typical case is described by taking the quenched average ⟨log(Z_eq)⟩_ξ over the realizations of the patterns.
Effective learning algorithms.-Only a handful of heuristic algorithms are currently believed, based on numerical evidence, to be able to solve the classification problem and achieve a nonzero capacity in the limit of large N in a subexponential running time: reinforced Belief Propagation (BP) [2], reinforced Max-Sum [3], SBPI [4], and CP+R [5] (a brief description of each is provided in the Supplemental Material [18]). In the classification case, they achieve capacities between α ≃ 0.69 and α ≃ 0.75. They all share the property of being local and distributed, and have typical solving times that scale almost linearly with the size of the input. SBPI and CP+R additionally have extremely simple requirements (only employing finite discrete quantities and simple, local, and online update schemes), making them appealing for practical purposes and reasonably plausible candidates for biological implementations. A qualitatively similar scenario holds in the generalization case, where all these algorithms perform well except in a finite window 1 ≲ α ≲ 1.5 around α_TS.
These results are not captured by the standard spin glass theory; in particular, the effectiveness of the utterly simplified algorithms SBPI and CP+R is in striking contrast with a glassy energy landscape in which solutions are isolated.
Numerical experiments.-We investigated this issue numerically, and found evidence that, in fact, the solutions found by the algorithms are typically not isolated; rather, they belong (with high probability at large N) to large connected clusters of solutions. More precisely: (1) from a given solution W̃, a random walk process over neighboring configurations in the space of solutions can reach distances of order N from the starting point; (2) the number of solutions at a distance of order N from W̃ grows exponentially with N (this can be estimated from the analysis of the recurrence relations on the average growth factor of the number of solutions at varying distances, and using the random walk processes for sampling the local properties relevant to those relations).
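A minimal sketch of such a zero-energy random walk follows; the instance sizes and step count are illustrative, and for self-containedness the walk starts from a teacher configuration rather than from a solution found by a heuristic solver:

```python
import numpy as np

rng = np.random.default_rng(1)

def solution_random_walk(W, xi, sigma, steps=2000):
    """Random walk over the single-flip-connected cluster of solutions:
    propose a random spin flip and accept it only if the flipped W still
    satisfies every pattern (i.e., stays at zero energy)."""
    W = W.copy()
    W0 = W.copy()
    fields = sigma * (xi @ W)                  # per-pattern stabilities, all > 0 at a solution
    max_dist = 0
    for _ in range(steps):
        i = rng.integers(len(W))
        delta = -2 * W[i] * sigma * xi[:, i]   # change of the stabilities if W[i] is flipped
        if np.all(fields + delta > 0):         # the flip keeps all patterns satisfied
            W[i] *= -1
            fields += delta
            max_dist = max(max_dist, int(np.sum(W != W0)))
    return W, max_dist

# Demo on a small teacher-student instance (sizes are illustrative).
N, P = 101, 20
xi = rng.choice([-1, 1], size=(P, N))
W_T = rng.choice([-1, 1], size=N)
sigma = np.sign(xi @ W_T).astype(int)
W_end, max_dist = solution_random_walk(W_T, xi, sigma)
```

Tracking the per-pattern stabilities incrementally makes each proposal O(αN) instead of O(αN²).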
Furthermore, we used the standard BP method on single problem instances to estimate the entropy of the solutions at varying distance (controlled via a Franz-Parisi potential [21]) from a reference solution W̃ obtained from a heuristic solver, and found that the results do not match the predictions of the equilibrium analysis [14]; see Fig. 1.
Teacher-student case.-We also extended the equilibrium analysis [14] to the teacher-student scenario, and found that: (1) typical solutions are isolated for all values of α even when adding a nonzero stability constraint, as in the classification case; (2) the teacher device is isolated and indistinguishable from all other typical solutions except for the generalization error; and (3) the results of estimates obtained from BP are consistent with the analytical calculation when using the teacher as a reference point, but not when using a solution provided by a heuristic solver (see the inset in Fig. 1). Finally, the generalization error for solutions found algorithmically is lower than what would be expected for a typical solution (see Fig. 3).
Large deviation analysis.-These results indicate that calculations performed at thermodynamic equilibrium are effectively blind to the solutions found by the heuristic algorithms. Traditionally, in the context of replica theory, similar situations have been addressed by looking for subdominant states [22]. However, this is insufficient in the present case.
A different analytical tool is thus needed for obtaining a description of this regime, which, according to the numerical evidence, is characterized by regions with a high density of solutions. Clearly, the statistical weight of the individual solutions must be modified, by favoring the ones that are surrounded by a large number of other solutions. Therefore, we studied the following large-deviation free energy density function:

F(d, y) = −(1/(Ny)) ⟨ log Σ_{W̃} X_ξ(W̃) N(W̃, d)^y ⟩_ξ ,   (1)

where N(W̃, d) = Σ_{W} X_ξ(W) δ(W · W̃, N(1 − 2d)) counts the number of solutions W at normalized Hamming distance d from a reference solution W̃ (δ is the Kronecker delta symbol) and y has the role of an inverse temperature. This free energy describes a system in which each configuration W̃ is constrained to be a solution, and has a formal energy density E(W̃) = −(1/N) log N(W̃, d) that favors configurations surrounded by an exponential number of other solutions, with y controlling the amount of reweighting. The regions of highest local density are then described in the regime of large y and small d.
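For intuition, the quantity N(W̃, d) can be computed exhaustively on a toy instance; this brute-force enumeration is only feasible for very small N, whereas the analysis in the text evaluates the corresponding entropy analytically:

```python
import itertools
import numpy as np

def solution_counts_by_distance(W_ref, xi, sigma):
    """Exhaustive computation of N(W_ref, d): count the solutions W at each
    Hamming distance from W_ref. Only feasible for tiny N; the paper
    evaluates the corresponding entropy via the replica method."""
    N = len(W_ref)
    counts = np.zeros(N + 1, dtype=int)
    for bits in itertools.product([-1, 1], repeat=N):
        W = np.array(bits)
        if np.all(sigma * (xi @ W) > 0):           # W is a solution: X_xi(W) = 1
            counts[int(np.sum(W != W_ref))] += 1   # Hamming distance = N*d
    return counts

# Tiny demo instance with teacher-generated labels (sizes are illustrative).
rng = np.random.default_rng(2)
N, P = 13, 5
xi = rng.choice([-1, 1], size=(P, N))
W_T = rng.choice([-1, 1], size=N)
sigma = np.sign(xi @ W_T).astype(int)
counts = solution_counts_by_distance(W_T, xi, sigma)
```

Here `counts[k]` plays the role of N(W̃, d) at d = k/N, with the teacher as the reference configuration.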
The relevant quantities are computed through the usual statistical physics tools; of particular importance is the entropy density of the surrounding solutions, the local entropy, which is simply given by S_I(d, y) = −∂_y(yF(d, y)). The signature for the existence of a dense and exponentially large cluster of solutions is that S_I(d, y) > 0 in a neighborhood of d = 0. Another important quantity is the external entropy, i.e., the entropy of the reference solutions, S_E(d, y) = −y[F(d, y) + S_I(d, y)], which must also be non-negative. The special case y = 1 is essentially equivalent to the computation of Ref. [11]; S_I(d, y) reduces to the computation à la Franz-Parisi of [14] in the limit y → 0.
We computed Eq. (1) by the replica method in the replica-symmetric (RS) ansatz, resulting in an expression involving 13 order parameters to be determined by the saddle point method. The analytical expressions and the details of the computation are reported in the Supplemental Material [18]. It turns out that, for all values of α and d, there is a value of y beyond which S_E(d, y) < 0, which is unphysical and signals a problem with the RS assumption. Therefore, we sought the value y* = y*(α, d) at which S_E(d, y*) = 0, i.e., the highest value of y for which the RS analytical results are consistent. In the following, we thus drop the y dependency.
The solution to the system of equations stemming from the RS saddle point produces qualitatively very similar results for both the classification (with α < α_c) and the generalization (with α < α_TS) case. It displays a number of noteworthy properties (Fig. 2):

(1) S_I(d) is positive in a whole region around d = 0, signaling the existence of exponentially dense clusters of solutions. Furthermore, for all α, the curves for S_I(d) are all approximately equal around d = 0; in particular, they all approximate the case for α = 0 where all points are solutions. This implies that the clusters of solutions are extremely dense at their core. This is our chief result. The size of this dense region shrinks with α and vanishes at α_c.

FIG. 1 (color online; partial caption). (red) estimate from belief propagation using a solution from SBPI, with N = 10001; (green) theoretical curve for the optimal W̃ as computed from Eq. (1); and (dotted black) upper bound (α = 0 case, all configurations are solutions). The random-walk points underestimate the number of solutions since they only consider single-flip-connected clusters; the BP curve is lower than the optimal curve because in the latter W̃ is optimized as a function of the distance, while in the former it is fixed. Inset: comparison between a typical solution and one found with SBPI, in the teacher-student case at α = 0.5 with N = 1001. Larger potentials correspond to smaller distances. Top points (red): SBPI reference solution, with the entropy computed by BP; bottom curve (magenta): theoretical prediction for a typical solution; bottom points (purple): BP results using the teacher as reference.

FIG. 2 (color online). Large deviation analysis. Local entropy curves at varying distance d from the reference solution W̃ for various α (classification case). Black dotted curve, α = 0 case (upper bound). Red solid curves, RS results from Eq. (1) (optimal W̃). Up to α = 0.77, the curves are monotonic. At α = 0.78, a region incorrectly described within the RS ansatz appears (dotted; geometric bounds are violated at the boundaries of the part of the curve with negative derivative). At α = 0.79, the solution is discontinuous (a gap appears in the curve), and parts of the curve have negative entropy (dotted). Blue dashed curves, equilibrium analysis (typical W̃) [14] (dotted parts are unphysical): the curves are never positive in a neighborhood of d = 0. Inset: enlargement of the region around d = 0 (notice the solution for α = 0.79, followed by a gap).
(2) For large distances, as expected, S I ðdÞ collapses with a second-order transition onto the equilibrium entropy; i.e., this regime is dominated by the typical solutions.
(3) Up to a certain α U (where α U ≃ 0.77 in the classification case and α U ≃ 1.1 in the generalization case), the S I ðdÞ curves are monotonic in d. Beyond α U , there is a transition in which there appear regions of d (dotted in Fig. 2) that are not correctly described by the RS ansatz (since geometric bounds are violated; see the discussion in the Supplemental Material for details [18]), and must be described at a higher level of replica symmetry breaking (RSB). We speculate that this transition signals a change in the structure of the space of solutions: for α < α U , the densest cores of solutions are immersed in a huge connected structure; for α > α U , this structure fractures and the dense cores become isolated and hard to find.
(4) In the teacher-student scenario, the generalization properties of the optimal reference solutions W̃ are generally much better than those of typical solutions. This is clearly shown in Fig. 3, where we also show that the curve for small d is in striking agreement with that produced using solutions obtained from the SBPI algorithm. The generalization error decreases monotonically when increasing d, and it saturates to a plateau when S_I(d) becomes equal to the entropy of the typical solutions [see point (2) above].
We expect this qualitative and quantitative picture, especially for α ≲ α U , to be quite robust. First, these results are convincingly supported by our numerical findings, where available. Furthermore, a slightly simplified model analyzed at a higher level of RSB and at y → ∞ [see Eq. (3) below] yields almost indistinguishable results.
The analytical computations are straightforwardly generalized to the case of multilevel synapses and sparse patterns, and the results are qualitatively identical [23].
Multilayer network.-These theoretical results seem to extend to more complex architectures and nonrandom learning problems. We observed this by heuristically extending the CP+R algorithm to multilayer classifiers with L possible output labels, and training these networks on the MNIST database benchmark [24], which consists of 7 × 10^4 grayscale images of hand-written digits (L = 10). A description of the architecture and of the learning algorithm is provided in the Supplemental Material [18].
We observed that it is indeed very easy to achieve perfect learning on the whole training data set, and that very good generalization errors can be reached (e.g., 1.25% with order 10^7 synapses) despite the binary nature of the synapses and the fact that we did not specialize the architecture for this particular data set. Moreover, we did not observe any overfitting: the generalization error does not degrade by reaching zero training error, or by using larger networks.
As for the perceptron, we performed a random-walk process in the space of solutions, with similar results: the simplified algorithm reaches a solution that is part of a dense, large connected cluster, and the generalization properties of the starting solution are better than those of solutions found in later stages of the random walk (see Fig. 1B in the Supplemental Material [18]).
Optimization.-We also studied a variant of the free energy (1) without the constraint on W̃:

F_U(d, y) = −(1/(Ny)) ⟨ log Σ_{W̃} N(W̃, d)^y ⟩_ξ .

The analysis in this case requires at least an additional step of RSB, and will be presented in detail in a follow-up work [25]. Still, the results are very close to those reported for the constrained scenario; furthermore, the probability that the W̃ in this system are a solution tends exponentially to 1 as d → 0, despite the removal of the explicit constraint. This suggests that we can algorithmically exploit F_U(d, y) to efficiently sample ground states of the system, and that such a strategy could be applied to different optimization problems as well.
As the most straightforward proof of concept in this direction we have developed a Monte Carlo Markov Chain algorithm, using the formal energy E(W̃) = −(1/N) log N(W̃, d), i.e., minus the local entropy, as an objective function. We call this procedure Entropy-driven Monte Carlo (EdMC) [25]. W̃ is initialized at random; at each step E(W̃) is computed efficiently by the BP algorithm, which is expected to give good results at small d in dense regions; random local updates of W̃ are accepted or rejected using a standard Metropolis rule at fixed temperature y^{−1}. In addition to the binary perceptron problem, we applied the algorithm to another (radically different) problem in which standard simulated annealing (SA) methods are known to fail, namely, the famous random K-SAT problem. As expected from our analysis, for the perceptron learning problem EdMC does not suffer from trapping in local minima even at y = ∞, up to at least α ≃ 0.65, with a running time that scales almost linearly with the size of the problem (whereas the SA solving time diverges exponentially). For the random K-SAT problem we explored different regimes, and in particular one (in 4-SAT) where SA is known to fail: in all cases, EdMC succeeds in an almost linear number of steps, outperforming SA. Preliminary quantitative data are given in the Supplemental Material [18].
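A minimal sketch of the EdMC idea follows, with the BP estimate of the local entropy replaced by brute-force counting within a small radius; this surrogate is feasible only for tiny N and small radius, and all parameters below are illustrative:

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)

def local_count(W, xi, sigma, d_max=1):
    """Brute-force surrogate for N(W, d): the number of solutions within
    Hamming distance d_max of W. (EdMC as described in the text estimates
    the local entropy with BP instead, which scales to large N.)"""
    count = 0
    for k in range(d_max + 1):
        for flips in itertools.combinations(range(len(W)), k):
            V = W.copy()
            V[list(flips)] *= -1
            count += int(np.all(sigma * (xi @ V) > 0))
    return count

def edmc(xi, sigma, N, y=10.0, steps=3000, d_max=1):
    """Entropy-driven Monte Carlo sketch: Metropolis at inverse temperature
    y on the objective E(W) = -log N(W, d), instead of the training error."""
    W = rng.choice([-1, 1], size=N)
    c = local_count(W, xi, sigma, d_max)
    for _ in range(steps):
        if np.all(sigma * (xi @ W) > 0):
            return W                              # reached a zero-energy configuration
        i = rng.integers(N)
        W[i] *= -1
        c_new = local_count(W, xi, sigma, d_max)
        # accept with probability min(1, (c_new / c)^y); a zero count acts as -infinity
        if c_new >= c or (c > 0 and rng.random() < (c_new / c) ** y):
            c = c_new
        else:
            W[i] *= -1                            # reject: undo the flip
    return W

# Tiny demo instance with teacher-generated labels (sizes are illustrative).
N, P = 15, 3
xi = rng.choice([-1, 1], size=(P, N))
W_T = rng.choice([-1, 1], size=N)
sigma = np.sign(xi @ W_T).astype(int)
W_sol = edmc(xi, sigma, N)
```

The chain moves uphill in local solution density rather than downhill in energy, which is what lets it avoid the isolated glassy minima described above.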

C. B. and R. Z. acknowledge the European Research Council for Grant No. 267915.
A. Brief description of the heuristic algorithms As mentioned in the main text, there are four algorithms currently believed to be able to solve the classification problem for large N in a sub-exponential running time: reinforced Belief Propagation (R-BP) [1], reinforced Max-Sum (R-MS) [2], SBPI [3] and CP+R [4]. Here, we provide a brief summary of their characteristics.
The R-BP algorithm is a variant of the standard Belief Propagation (BP) algorithm. BP is a cavity method which can be used to compute the equilibrium properties (e.g. marginal probability distributions of single variables, entropy, etc.) at a given temperature for a particular instance of a problem described in terms of a factor graph. It is based on the Bethe-Peierls approximation, and it is known to give exact results under some circumstances in the limit of large N; in particular, for the case of the binary perceptron with random inputs, it is believed that this is the case below α_c. The BP algorithm can be turned into a heuristic solver by adding a reinforcement term: this is a time-dependent external field which tends to progressively polarize the probability distributions on a particular configuration, based on the approximate marginals computed at preceding steps of the iteration of the BP equations. The reinforcement can thus be seen as a "soft decimation" process, in which the variables are progressively and collectively fixed until they collapse onto a single configuration. This method seems to have an algorithmic capacity of at least α ≃ 0.74.
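The soft-decimation mechanism can be illustrated at toy scale by replacing the BP marginals with exact magnetizations computed by enumeration; the linear field schedule and all sizes below are illustrative assumptions, not the actual R-BP equations of [1]:

```python
import itertools
import numpy as np

def reinforced_marginals(xi, sigma, gamma, m_prev):
    """Magnetizations of the reinforced measure
        P(W) proportional to X_xi(W) * exp(gamma * m_prev . W),
    computed exactly by enumeration over all 2^N configurations. In R-BP
    these marginals are instead approximated by BP, which scales to large N."""
    N = xi.shape[1]
    num, Z = np.zeros(N), 0.0
    for bits in itertools.product([-1, 1], repeat=N):
        W = np.array(bits)
        if np.all(sigma * (xi @ W) > 0):         # restrict to solutions
            w = np.exp(gamma * (m_prev @ W))
            Z += w
            num += w * W
    return num / Z

def soft_decimation(xi, sigma, steps=25, gamma0=0.3):
    """Reinforcement as soft decimation: a growing external field built from
    the previous marginals progressively polarizes the measure onto a single
    configuration. The schedule gamma_t = gamma0 * t is an illustrative choice."""
    m = np.zeros(xi.shape[1])
    for t in range(steps):
        m = reinforced_marginals(xi, sigma, gamma0 * t, m)
    return np.where(m >= 0, 1, -1)

# Tiny demo instance with teacher-generated labels (sizes are illustrative).
rng = np.random.default_rng(5)
N, P = 11, 4
xi = rng.choice([-1, 1], size=(P, N))
W_T = rng.choice([-1, 1], size=N)
sigma = np.sign(xi @ W_T).astype(int)
W_out = soft_decimation(xi, sigma)
```

As the field grows, the measure concentrates on a single solution and the componentwise sign of the magnetizations converges onto it.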
The R-MS algorithm is analogous to the R-BP algorithm, using Max-Sum (MS) as the underlying algorithm rather than BP. The MS algorithm can be derived as a particular zero-temperature limit of the BP equations, or it can be seen as a heuristic extension of the dynamic programming approach to loopy graphs. The reinforcement term acts in the same way as previously described for R-BP. The resulting characteristics of R-MS are very similar to those of R-BP; extensive numerical tests give a capacity of about α ≃ 0.75.
The SBPI algorithm was derived as a crude simplification of the R-BP algorithm, the underlying idea being that of stripping R-BP of all features which would be completely unrealistic in a biological context. This resulted in an on-line algorithm, in which patterns are presented one at a time, and in which only information locally available to the synapses is used in the synaptic update rule. Furthermore, the algorithm only uses a finite number of discrete internal states in each synapse, and is remarkably robust to noise and degradation. Rather surprisingly, despite the drastic simplifications, the critical capacity of this algorithm is only slightly reduced with respect to the original R-BP algorithm, and was measured at about α ≃ 0.69.
The CP+R algorithm was derived as a further simplification of the SBPI algorithm. It is equivalent to the former in the context of the on-line generalization task, but requires some minor modifications in the classification context. Its main difference from SBPI is that it replaces the update rule that was triggered by near-threshold events in SBPI with a generalized, stochastic, unsupervised synaptic reinforcement process (the rate of application of this mechanism needs to be calibrated for optimal results). Note that the kind of reinforcement mentioned here is rather different from the reinforcement term of R-BP or R-MS. The capacity of the CP+R algorithm can be made equal to that of SBPI, α ≃ 0.69.
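A rough sketch in the spirit of CP+R follows: bounded integer hidden states, clipped visible weights, error-driven updates plus a stochastic reinforcement process. The state bound, reinforcement rate, and instance sizes below are illustrative guesses, not the calibrated values of [4]:

```python
import numpy as np

rng = np.random.default_rng(4)

def cp_r(xi, sigma, n_states=5, p_r=0.1, epochs=500):
    """Clipped perceptron with stochastic reinforcement (sketch). Each
    synapse keeps a hidden odd-integer state h_i in [-n_states, n_states];
    the visible binary weight is W_i = sign(h_i). On an error the states
    move toward the presented pattern; otherwise each state is stochastically
    pushed deeper in its current direction (the reinforcement)."""
    P, N = xi.shape
    h = rng.choice([-1, 1], size=N)            # hidden states start at +-1
    for _ in range(epochs):
        for mu in rng.permutation(P):          # online presentation of patterns
            if sigma[mu] * (xi[mu] @ np.sign(h)) <= 0:            # error: supervised update
                h = np.clip(h + 2 * sigma[mu] * xi[mu], -n_states, n_states)
            else:                                                  # unsupervised reinforcement
                mask = rng.random(N) < p_r
                h = np.clip(h + 2 * mask * np.sign(h), -n_states, n_states)
        W = np.sign(h)
        if np.all(sigma * (xi @ W) > 0):
            return W
    return np.sign(h)

# Tiny demo instance with teacher-generated labels (sizes are illustrative).
N, P = 51, 10
xi = rng.choice([-1, 1], size=(P, N))
W_T = rng.choice([-1, 1], size=N)
sigma = np.sign(xi @ W_T).astype(int)
W = cp_r(xi, sigma)
```

Only the sign of the hidden state is ever visible, which is what makes the rule compatible with strictly binary synapses.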

B. Large deviation analysis
We computed the action φ = −yF corresponding to the free energy F of eq. (1) of the main text by the replica method, in the so-called replica-symmetric (RS) Ansatz. The resulting expression, in the generalization case, is a function of the order parameters q̃, q_1, q_0, S̃, R, R̃ and their conjugates (q̂, q̂_1, etc., and Ŝ), where we used the overlap S = 1 − 2d as a control parameter instead of d. These parameters must be determined from the saddle point equations, i.e. by setting to zero the derivative of φ(S, y) with respect to each parameter. This yields a system of 13 coupled equations, with α, y and S as control parameters. We solved these equations iteratively.
The physical interpretation of the order parameters is as follows (here, the overlap between two configurations X and Y is defined as (1/N)(X · Y)):
q̃: overlap between two different reference solutions W̃
q_1: overlap between two solutions W referred to the same W̃
q_0: overlap between two solutions W referred to two different W̃
S: overlap between a solution W and its reference solution W̃
S̃: overlap between a solution W and an unrelated reference solution W̃
R: overlap between a solution W and the teacher W_T
R̃: overlap between a reference solution W̃ and the teacher W_T
Therefore, R̃ can be used to compute the typical generalization error of the reference solutions W̃, as (1/π) arccos R̃. An analogous relation yields the generalization error of the solutions W as a function of R.
It is also worth noting that q̃ < 1 implies that the number of reference solutions W̃ is larger than 1.
By setting to zero the order parameters R,R and their conjugates, and thus reducing the system of equations to the remaining 9 saddle point conditions, we obtain the classification scenario.
It can be noted that, although we call this solution replica-symmetric, the structure is highly reminiscent of a 1-RSB solution. Indeed, it can be shown that, if we remove the constraints on the configurations W̃, and solve for Ŝ = 0 rather than fixing S, we obtain exactly the standard 1-RSB equations for the perceptron of [5] at zero temperature, with y taking the role of the Parisi parameter m. However, the 1-RSB solution of the standard equations shows no hint of the dense regions which we find in the present work, even if we relax the requirement 0 ≤ m ≤ 1 of [5]. This shows that the constraint on the distance is crucial to explore these sub-dominant regions.
From eq. (1) we can compute the internal and external entropies, as:

S_I(S, y) = ∂φ/∂y (S, y) ,
S_E(S, y) = φ(S, y) − y ∂φ/∂y (S, y) .

From the last equation, we define y* by S_E(S, y*) = 0. We sought this value numerically for each α and S. Therefore, in all our results, the typical number of reference solutions W̃ was sub-exponential in N; however, we found that in all cases q̃ < 1, which implies that the solutions W̃ are not unique. A closer inspection of the order parameters reveals that q_1 ≥ S for S ∈ [S_L, S_R], the endpoints S_L and S_R being the points at which q_1 = S. These regions are manifestly unphysical, because in that case any of the solutions W (which are exponential in number, since S_I > 0) could play the role of the reference solution W̃, and yet the number of W̃ should be sub-exponential, because S_E = 0. This is a contradiction. We conclude that those regions are inadequately described within the RS Ansatz as well.
As for the parts of the curves which are outside these problematic regions, the results obtained under the RS assumption are reasonable, and in very good agreement with the numerical evidence. In order to assess whether the RS equations are stable, further steps of RSB would be needed; unfortunately, this would multiply the number of order parameters (and thus enlarge the system of equations) and the number of nested integrals required for each of these equations, which is computationally too heavy at the present time. Also, we should observe that the true extremal cases are described only in the limit of y → ∞, for which the RS solution is inadequate, and thus that our reported values of S_I are probably a lower bound. Note, however, that y* → ∞ both when S → 1, i.e. in the limit of small distances where the solutions exhibit the highest density, and at small S, i.e. where the saddle point solution encompasses the typical equilibrium solutions of the standard analysis and S_I becomes equal to the standard entropy of the equilibrium ground states.
In conclusion, our results suggest that the general picture is well described by the RS assumption with the zero external entropy requirement, and that quantitative adjustments due to further levels of RSB would likely be small, and limited to the intermediate regions of S.

C. Multi-layer network with binary synapses
We heuristically extended the CP+R algorithm to multi-layer classifiers with L possible output labels. The architecture we used (Fig. 1A) consists of an array of K_2 committee machines, each comprising K_1 hidden units, whose outputs are sent to L summation nodes, and from these to a readout node which performs an argmax operation. This network therefore realizes the following map:

ψ(ξ) = argmax_{l ∈ {1,…,L}} Σ_{k_2=1}^{K_2} Y_{k_2 l} sign( Σ_{k_1=1}^{K_1} τ(W^{k_1 k_2}, ξ) ) ,

where Y_{k_2 l} ∈ {−1, 1} are random quenched binary weights, and W^{k_1 k_2} ∈ {−1, 1}^N are the synaptic weights. The single-layer CP+R rule consists of two independent processes, a supervised one and a generalized, unsupervised one (see [4] for details). For the multi-layer case, we kept the unsupervised process unaltered, and used a simple scheme to back-propagate the error signals to the individual perceptron units, as follows: upon presentation of a pattern ξ whose required output is σ, in case of error (ψ(ξ) ≠ σ), a signal is sent back to all committee machines which contributed to the error, i.e. all those for which sign(Σ_{k_1=1}^{K_1} τ(W^{k_1 k_2}, ξ)) ≠ Y_{k_2 σ}. Each of these in turn propagates back a signal to the hidden unit, among those which provided the wrong output (i.e. for which Y_{k_2 σ} Σ_{i=1}^{N} W_i^{k_1 k_2} ξ_i < 0), which is the easiest to fix, i.e. for which −Y_{k_2 σ} Σ_{i=1}^{N} W_i^{k_1 k_2} ξ_i is minimum. Finally, the hidden units receiving the error signal update their internal state according to the CP+R supervised rule. We also added a "robustness" setting, such that an error signal is emitted also when ψ(ξ) = σ, but the difference between the maximum and the second maximum in the penultimate layer is smaller than some threshold r.
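The forward map above can be sketched as follows; the sizes are illustrative, and odd K_1 and N are chosen to avoid ties inside the sign operations:

```python
import numpy as np

rng = np.random.default_rng(6)

def psi(xi_in, W, Y):
    """Forward map of the committee-machine array described above:
    K2 committees of K1 binary perceptrons each; committee outputs are
    combined through the quenched random weights Y into L summation nodes,
    and the readout is an argmax over the labels."""
    committee = np.sign(np.sign(W @ xi_in).sum(axis=1))  # (K2,) committee outputs
    scores = committee @ Y                               # (L,) summation-node values
    return int(np.argmax(scores))

# Illustrative sizes only.
N, K1, K2, L = 21, 3, 7, 10
W = rng.choice([-1, 1], size=(K2, K1, N))   # trainable binary synapses W^{k1 k2}
Y = rng.choice([-1, 1], size=(K2, L))       # quenched random second-layer weights
x0 = rng.choice([-1, 1], size=N)
label = psi(x0, W, Y)
```

The stacked product `W @ xi_in` evaluates all K_1 × K_2 perceptron fields in one step, so the whole forward pass is a handful of array operations.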
We tested this network on the MNIST database benchmark [6], which consists of 7 × 10^4 grayscale images of hand-written digits (L = 10); of these, 10^4 are reserved for assessing the generalization performance. The images were subject to standard unsupervised preprocessing by a Restricted Boltzmann Machine (N = 501 output nodes) [7,8], but this is not essential for training: the inputs could be used directly, or simply be pre-processed by random projections, with only minor effects on the performance. The smallest network which was able to perfectly learn the whole training dataset had K_1 = 11 and K_2 = 30, with r = 0; its generalization error was about 2.4%. Larger networks achieve better generalization error rates, e.g. 1.25% with K_1 = 81, K_2 = 200, r = 120.

Perceptron
Entropy-driven Monte Carlo (EdMC) was applied and compared with Simulated Annealing (SA) at increasing N for different values of α: the proposed strategy was able to reach a solution, i.e. a configuration at zero energy, in a time which scales almost linearly with N, while, as expected, SA often gets stuck in local minima even at low loading and with an extremely slow cooling rate.
As an example, at α = 0.3 and N ∈ {201, 401, 801, 1601}, we studied the EdMC behavior over 100 random instances of the classification problem, and found that the number of required iterations scales approximately as N^1.2.

Random K-satisfiability
For the random K-satisfiability problem we explored two regions of parameters: 3-SAT in its RS phase, where both EdMC and Simulated Annealing are expected to succeed, and random 4-SAT in the RSB regime, where SA is known to fail. As for the perceptron problem, we observed much faster running times for EdMC in both cases.
For the 4-SAT case, in order to bypass the convergence problems of BP, it is possible to use temporal averages to approximate the local entropy. This technical problem should, however, be overcome by computing the entropy at the 1-RSB level, which is beyond the scope of this preliminary study.
As an example, for values of N ranging in {100, 500, 1000, 5000, 10000} and a number of samples between 20 and 1000, for 3-SAT at α = 3.0 we report a scaling of N^1.23, whereas for 4-SAT at α = 8.0 we found a scaling of N^1.18.