Fluctuation Distributions of Energy Minima in Complex Landscapes

We discuss the properties of the distributions of energies of minima obtained by gradient descent in complex energy landscapes. We find strikingly similar phenomenology across several prototypical models. We particularly focus on the distribution of energies of minima in the analytically well-understood p-spin-interaction spin glass model. We numerically find non-Gaussian distributions that resemble the Tracy-Widom distributions often found in problems of random correlated variables, and non-trivial finite-size scaling. Based on this, we propose a picture of gradient descent dynamics that highlights the importance of a first-passage process in the eigenvalues of the Hessian. This picture provides a concrete link to problems in which the Tracy-Widom distribution is established. Aspects of this first-passage view of gradient-descent dynamics are generic for non-convex complex landscapes, rationalizing the commonality that we find across models.


INTRODUCTION
The notion of an underlying complex energy landscape in glassy, disordered systems is useful [1][2][3][4][5][6][7][8] to the extent that the landscape can be reduced to relatively few properties that are relevant to observed phenomena. The complexity, which counts stationary points in the landscape (minima, saddles, maxima) is an example of such a property. An energy landscape is complex if the number of stationary points depends exponentially on the system size.
An intuitive approach to probing complexity is to do a naive search for minima using gradient descent. [9] One follows an initial configuration along the (negative) gradient flow of the energy until a stationary point (vanishing gradient) is found. Because a numerical descent almost certainly ends in a minimum, gradient descent does not only constitute the simplest form of physical dynamics in a complex landscape, a quench to zero temperature, but also the most intuitive and simplest form of optimization. If one starts with flatly sampled random initial positions (corresponding to infinite-temperature T = ∞ configurations), gradient descent has the added advantage of sampling local minima with a probability that can be calculated because it is proportional to the volumes of their basins of attraction [10,11]. Finally, in addition to being a local optimization strategy, gradient descent is also the archetypal greedy algorithm, particularly if one considers a discretized version as one does with any numerical implementation: in every time step the displacement with the largest expected loss in energy is chosen. Within the field of glassy systems, gradient descent is used to obtain "inherent structures" [12][13][14][15], i.e. the minima at the bottom of the local basin of attraction around which the system thermally fluctuates, while in machine learning, gradient descent is the original go-to learning strategy [16]. Gradient descent is also used to obtain jammed packings of repulsive soft spheres, which are the least stable packings that are mechanically rigid [17,18].
Here we look at the shape of the distribution of minima obtained by gradient descent for several different models, with particular focus on the spherical p-spin-interaction spin glass. Such distributions, for example for jamming, have been assumed to be Gaussian [17]. Our central finding is that for all of these models, the distributions are non-Gaussian with non-trivial tail exponents on one side that are consistent with the Tracy-Widom distribution. We rationalize this finding with a novel perspective that might be the starting point for an eventual analytical approach.
In Sec. II we introduce the models studied. We then present our numerical results in Sec. III and use established results for the p-spin model in Sec. IV to formulate a toy process that allows us to understand these numerical results. We close in Sec. V with some final remarks on the applicability of these ideas to other contexts.

MODELS & COMPLEXITY
We study various models with complex landscapes. A unifying perspective is provided by all of them being random constraint-satisfaction problems, i.e. assemblies of equations or inequalities. Generically, the question of interest is whether a specific realization allows for an assignment of the variables that satisfies all constraints or whether there are frustrations (which are easily introduced in randomized problems) that prevent satisfaction of all of the constraints. Generically there is a (SAT/UNSAT) transition between a phase where a satisfying assignment is possible (SAT) and a phase where this is not possible (UNSAT) upon tweaking the hardness of the satisfaction problem, e.g. by changing a control parameter such as the ratio of (in-)equalities and variables. Versions with discrete (particularly Boolean) variables are of fundamental importance to computer science [19], whereas SAT/UNSAT transitions in continuous constraint-satisfaction problems are conjectured to form an important universality class [20] in statistical physics. The focus of our attention is the spherical pspin model which we therefore introduce first, before the k-SAT, perceptron and jamming models.
The p-spin model. Specifically we consider the spherical p-spin model [21,22]: i.e., we have N spins S i whose combined length is constrained to S 2 i = N (leaving effectively N − 1 degrees of freedom) with an energy functional containing random Gaussian couplings J with mean zero and variance J 2 c = N/#J. Here, #J ∼ N p is the number of terms appearing in the energy functional (while adhering to the constraint of ascending indices). We use this convention to account for finite-size effects from lowerorder terms, but ultimately only the scaling with N is important. Note that particularly in the older physics literature a different convention is used that introduces an additional factor of two here. This energy is an extensive quantity scaling with system size and we therefore also introduce the corresponding intensive quantity ε = E/N . As the qualitative nature of the energy landscape defined by this functional is independent of p for p > 2 (p = 2 corresponds to a convex eigenvalue problem and therefore only has a single, trivially global minimum), we choose to limit ourselves to the numerically most accessible case of p = 3. Still, the cost of a simple evaluation of the energy inevitably scales as N p .
The energy scale ε th = −2 (p − 1)/p = − 8/3 is called the threshold energy as it constitutes the upper energy boundary below which an exponentially large number of stationary points exist. This is quantified by looking at the (cumulative) complexities. If we define N k (ε) to be the number of stationary points of index k with an (intensive) energy not larger than ε, the corresponding complexity Σ(ε) is given by The complexity was studied earlier within the TAP approach [23] and has been the subject of rigorous mathematical analysis in the limit of large N [24]. Remarkably, a qualitatively similar structure has been found for rather small system sizes by numerical enumeration of the critical points [25].
In this paper, we focus on the shape of the distribution of energies of minima, as obtained by gradient descent for finite systems. This corresponds to the shape of the normalized distribution corresponding to N k≡0 (ε). The distribution of final energies found as a result of gradient descent for the p-spin model is shown in fig. 1 (a).
For a suitable choice of couplings, the p-spin model provides a natural energy landscape for the optimization problem corresponding to a k-SAT decision problem [26]. The model also provides insight into structural glasses [27]. It is also a valuable model in its own right. The overall gestalt of the energy landscape, as captured by the complexities, is the relevant property that drives interest in the p-spin model as a prototypical complex energy landscape. Physical systems usually have a well-defined notion of a ground-state energy which sets a lower bound to an extensive number of minima. Additionally, the existence of an upper bound reflects that "over-frustration" of a complex system-it is exponentially hard to construct a state with an energy less favorable than some native scale.
The k-SAT model. The prototypical satisfiability problem is that of Boolean (or propositional) satisfaction, see for example ref. 28 for an introduction. Given a number N of literals (Boolean variables s i with s i ∈ {TRUE, FALSE}) and a number M of clauses (combinations of the literals and the fundamental logical operators OR (∨), AND (∧) and NOT (¬)) which can always be brought into conjunctive normal form, which means that we consider conjunctions (AND-connected sub-clauses) of disjunctions (OR-connected (possibly negated) literals), the goal is to find a choice of the literals that satisfies the clauses (evaluates to a true statement). Particularly, we focus on the k-SAT version of this problem consisting only of random clauses that are disjunctions of exactly k (possibly negated) literals. Interestingly, it turns out that there is a sharp change in the difficulty of the problem with k. For k ≤ 2, the solution (or the existence of a solution) can be found easily in a time that depends polynomially on the problem size (see, for example, [29]), whereas the problem is NP-hard [30] for larger values of k (for efficiency reasons we limit ourselves to k = 3), meaning that the question of the existence of such an algorithm with polynomial runtime is an important outstanding problem [31]. Here, we are not interested in designing a particularly good algorithm. Instead, in analogy to gradient descent, we use a local greedy optimization strategy: pick an unsatisfied clause and an undetermined literal and set the literal to the value satisfying the clause. If at some point there are no undetermined literals left to satisfy an unsatisfied clause, the system is considered unsatisfied; if every clause is satisfied, the system is considered satisfied. The variable controlling the fraction of unsatisfied systems is M/N , the ratio of the number of clauses to the number of literals (the solution is obviously trivial if every literal appears in at most one clause). Because the literals are Boolean the distribution of results is not continuous, but the relevant combination M/N becomes continuous in the thermodynamical limit and we will, thus, treat the data as if they were binned continuous data. We perform runs of the greedy algorithm for k = 3 (ensuring that every literal is used in any clause at most once) with N = 128 literals for M = 1 . . . 10 3 and measure the fraction of unsatisfied systems. This data is shown in fig. 1 (a).
Reduction from k-SAT is usually used to prove that other Boolean satisfaction or decision problems are NPhard; for example, there is a direct connection between the 3-Sat and the 3-coloring of a graph (by means of factor graphs). The optimization problem associated with the k-SAT problem is the p-spin model discussed above.
The perceptron model. Generalizing from Boolean to continuous variables, there are two types of constraints: equality constraints (f (S) = 0) and inequality constraints (f (S) ≥ 0). Every independent equality constraint reduces the dimension of the solution space by one, meaning that the set solutions to problems with equality constraints is always one of zero measure. This is not only peculiar, but it is also inappropriate for many, if not most actual systems one might want to model: a very simplified descriptions of neurons, for example, is that they give an output if the input is exceeding some threshold (see jamming below for another example). This model [32] of neurons is the origin of the perceptron model [33]. Following the notation of Ref. 34, we consider continuous variables S i (i = 1...N ) subject to linear inequality con- We limit the variable space to a sphere, S 2 = N , use normally distributed ξ and set σ µ ≡ σ. For a given ratio α = M/N there is a critical value σ c (α) < 0 that marks the satisfiability transition (lower σ corresponding to the phase in which all constraints are satisfied). Tuning σ allows for further control of the topology (convex/nonconvex) of the energy landscape that is constructed by considering E(S) = µ h 2 µ Θ(−h µ ), transforming the decision into an optimization problem. At low values of σ the system is convex and at sufficiently high values (but for σ > σ c ) the system is non-convex. We perform gradient descents on that energy landscape to obtain the distributions of the final energies in both the non-convex ( Fig. 1(c)) and convex regimes ( Fig. 1(d)).
jamming We consider the packing [17,35] of spheres with harmonic repulsions in low (d = 2, 3) dimensions. Starting from randomly placed spheres, the energy is lowered by reducing the overlap of the spheres; the constraints to be satisfied are, thus, of the form |r ij | ≥ σ ij where r ij denote the pairwise displacement vectors between particles and σ ij are the added particle radii. As a result of the inequality structure of these constraints, the perceptron model provides the appropriate mean-field framework [34,36]. The relevant control parameter that sets whether or not an unjammed configuration (which we choose to have E = 0) is found is the packing fraction, i.e. the ratio of combined volume of the spheres to volume available in the simulation box. For finite dimensions and particle sizes, there is not a sharp satisfiability threshold, but as one increases the packing fraction the fraction of systems for which the descent ends in a jammed configuration increases. The derivative of the satisfaction curve can be interpreted as the distribution of jamming thresholds ϕ c and is shown (as inferred via numerical derivation from the data of ref. 37) in Fig. 1 (f). Typically, these distributions are used to infer finite-size scaling properties [17,37], such as the scaling of the width of the distribution and of the jamming threshold value in the thermodynamic (large system size) limit. In contrast, we focus on the shape of these satisfiability distributions.
Additionally, we consider the distribution of energies of packings (using the same model as e.g. ref. 17 with α = 2) prepared [38] at a fixed pressure above the jamming transition, where not all of the constraints are satisfied. These curves are shown in Fig. 1(e).

NUMERICAL RESULTS
We present the results of gradient descent simulations for all the models studied in Fig. 1. Because shifting and global rescaling of the energy landscape do not qualitatively affect gradient descent, we only present histograms of normalized variables (mean zero, unit variance). The bulk of these simulations was done employing the FIRE algorithm [40] instead of a naive, direct integration of the equation of motion. This algorithm converges significantly faster, allowing for better statistics and analysis of large deviations. The additional inertial degree of freedom within the FIRE scheme can in some individual cases change the basin of attraction such that the relaxation from a specific initial condition with it leads to a different final minimum than would application of a direct gradient descent (this is also true for gradient descents with different time steps). However in smaller runs, we find no indication that this changes the statistics significantly. This is in line with previous applications of FIRE in similar quenches to zero temperature in jamming.
As a visual aid and for comparison, we show the respective numerical probability functions alongside two distributions: (1) the Gaussian (normal) distribution, which is the least biased estimator for the distribution having fixed the mean and variance, and (2) the (normalized) Tracy-Widom distribution, which is characterized by tails that decays more slowly than a Gaussian on one side (x 0) and more rapidly than a Gaussian on the other side (x 0): There is a striking qualitative similarity across models and system sizes. The distribution functions are trivially similar to the Gaussian around zero, but the large deviations are asymmetric with a soft tail (in our presentation . c) Distribution of jamming thresholds ϕc found in the jamming of soft spheres according to ref. 37. The binning size is determined following ref. [39].
for x < 0) that decays more slowly than the Gaussian and a hard tail (for x > 0) that decay more rapidly than the Gaussian. The soft tail seems to be well-described by the Tracy-Widom distribution. The strong commonality across systems and the Tracy-Widom form of the soft tail constitute the main results of this paper. Although there is some additional N -dependence not eliminated by normalizing (which partially is to be expected for small systems due to corrections to scaling [41]), the soft tail appears robust in the thermodynamic limit, as we will elaborate below. Interestingly, the notable exception to this is the perceptron in the convex regime that has a trivial basin of attraction. This indicates that the important similarity between the analogous systems is indeed the FIG. 2. a) Mean and variance of the final energies as found via steepest descent in the p-spin as a function of system size N (double logarithmic plot). We find that the energy clearly descents towards the threshold energy from above with a very clear power-law dependence (see inset). b) Collapsed distributions for N = 32, 64, 128. Here, we perform collapsing using a tentative non-normalized scaling form that is inspired by the Tracy-Widom distribution. The labels I,II and III indicate the three regions used for the spectral densities in Fig. 4. Inset: Large deviations, power-laws are found at the far tails of the energy distribution.
quench from a flat measure in a complex energy landscape and the exploration of at least partially concave (some eigenvalues of the energy landscape are negative) regions.
We now look carefully at finite size effects to see whether the soft tail survives in the limit N → ∞. We present results for the spherical p-spin model, for which we have the best statistics and which is also expected to have very small corrections to scaling due to its structure. This is highlighted by the extremely clear power-laws found for the first two moments of the final energies (see Fig. 2(A)). Using overlines to denote averages over gradient descent samples (so as not be confused with unbiased averaging over disorder for which we use angular brackets), we find that the finite-size deviations to the energy can be characterized via Studying the large deviation tails, we find that the "soft" tail (corresponding to low energies) has the same asymptotics as in the Tracy-Widom law, but the "hard" tail decays more like a Gaussian and therefore decays considerably more slowly than the Tracy-Widom law, with again the rescaled variable x = (ε − µ ε (N ))/σ ε (N ). This is shown in Fig. 2(B). From Fig. 1 (a) it is hard to tell whether this holds for large N as the shape of the distribution appears to cross over from something close to the Tracy-Widom distribution towards the Gaussian distribution. To address this question, we construct an estimate for the converged shape for very large N using the following procedure. We sample the inverse function Q N (c) = C −1 N (c) to the empirical cumulative distribution functions (CDFs) C N (x) for every N (using the N -specific normalization) at a number of selected values c = 10 −5 , . . . , 1. At any given value of c, this gives a set of pairs (1/N, x c = Q N (c)) which we use to extrapolate to x c (0). We find that the finite size effects in the shape are well described by x(N ) − x c (0) ∝ N −1/3 (corresponding to 1/N corrections in non-normalized variables). As the extrapolation is done at each value independently, this method does not constrain the moments of the final distribution and we therefore normalize it as a final step. The result of this procedure is shown in Fig. 3, which suggests that although the final distribution is extremely close to a Gaussian, the soft tail does prevail for large N .

RATIONALIZATION OF RESULTS
To gain insight into the finding that the distribution of minima energies is non-Gaussian, we focus on the pspin model. In this model, gradient descent is given by integration of [42] where z is a Lagrange parameter ensuring the the spherical constraint, S i S i = N , and is fixed to be by demanding S iṠi = 0 or, equivalently, S i S i = const. The descent terminates onceṠ i = 0 and this will be not only a stationary point, but a stable minimum. To understand the implications of this, it is helpful to also consider the dynamical matrix M associated with the descent, which is given by the Hessian of the energy function (technically the Lagrange function, but we refrain from reflecting the special nature of the spherical constraint in our wording in the following) where the omitted terms correspond to all index permutations of i, j, k. We can identify two contributions to the matrix M: a Gaussian random part (J and S are practically independent) and a deterministic shift that only depends on the energy. From this observation, it is straightforward to infer that the spectral density of eigenvalues in the limit of large system sizes is given by a shifted Wigner semicircle law [43] ρ(λ) dλ = 1 with σ = −3ε th and µ = −6ε, see left panel of Fig. 4. If we ignore the shift (the second term of the Hessian in Eq. 8) for now, then we know that finite system size causes the edges of the Wigner semicircle to develop fluctuations. It is rather intuitive that these fluctuations have to be asymmetric: finding a lowest eigenvalue that is smaller than the lower boundary of the support of the semi-circle should (for sufficiently large system sizes) be entropically less costly then finding a fluctuation where the lowest eigenvalue is located somewhere within the bulk of the semi-circle; this implies that an extensive number of eigenvalues must lie at atypically large values. This intuition has been made rigorous by establishing that the edge fluctuations are described by the acclaimed Tracy-Widom distribution introduced earlier in Eq. 3 [44][45][46]. This distribution has gathered considerable interest in recent years as it has been found to appear in many systems of correlated variables that are beyond its original scope within random matrix theory. Interestingly, the charge-like repulsion between the eigenvalues that is that the core of the Tracy-Widom distribution is of purely topological origin and can be understood solely from imposing a non-crossing bias onto random walkers [47]. As of now, there is no simple closed form representation of the Tracy-Widom distribution, but the distribution of the lowest eigenvalue has a characteristic system-size dependence [48], with W (x) being the Tracy-Widom distribution with tails described by Eq. 3 [49]. Note that this eigenvalue spectrum has two tails with different N -scalings: in the soft tail the argument of the exponential scales like N whereas it scales like N 2 in the hard tail. This corresponds to the fact that deviations of the lowest eigenvalue to smaller values are entropically suppressed by the definition of the matrix ensemble, but fluctuations to higher values require a displacement of extensively many eigenvalues.
The termination of the descent is subject to the gradient and, thus, cannot be understood by the dynamics of the eigenvalues alone, but we can identify a necessary contribution that will get us close to understanding the full dynamics. Once the lowest eigenvalue crosses zero to become positive, the descent is in its final valley and the energy will only change slightly. Neglecting this final part, gradient descent becomes a first-passage problem in the lowest eigenvalue. As time progresses and the system lowers its energy, the eigenvalues will move (with fluctuations) towards higher values while never crossing each other, until the lowest eigenvalue crosses zero so that all eigenvalues are positive. A direct empirical corroboration of the importance of eigenvalue fluctuations from the data, and a connection to problems usually connected to Tracy-Widom-laws, is that the power-laws seen in the finite size effects, see eq. (4), are consistent with the scaling seen in the Tracy-Widom law, cp. eq. (10).
To start examining the descent from this spectral perspective, we calculate the spectrum of eigenvalues of the Hessian in three different ranges of energies of the minima in Fig. 2(B). In each of these ranges, the distribution is close to the semicircle expected at the threshold energy in unbiased sampling independent of the energy, with a shift that increases with energy. This is somewhat consistent with the finding that all the states found are close to threshold. However, they are always above the large N threshold, ε th = − 8/3, which means that the naively expected value for the lowest eigenvalue is negative and a large deviation is needed to constitute a mechanically stable state. Intuitively, the entropically least expensive way to do this is to aggregate all these eigenvalues closely above zero. This intuition has been made rigorous by an analysis by Dean and Majumdar (DM) [50]. An important physical consequence of this aggregation around zero (forming an integrable singularity in the spectrum) would be an excess of very soft modes which is not only unphysical, but also completely contrary to the empirical findings in physical realizations of disordered systems in general or our data for the p-spin in particular. An immediate explanation for why such an aggregation of soft modes is not observed can be given by the sampling bias due to the gradient descent. The measure with which the minima are sampled is the relative size Ω of their basins of attraction. Both numerically (employing the Einstein method explained in [10]) and analytically (from a naive reading of the Kac-Rice formula, see for example [24]), we find that Ω ∼ det M, i.e. fictitious minima with an abundance of very soft modes would very likely have very small basins of attraction and, thus, not contribute significantly to the empirical distributions.
This observation highlights an important and wellknown aspect of the gradient descent: it is an inherently out-of-equilibrium process that should be looked at dynamically. Thus, we are not to consider the DMensemble with a permanently non-negative spectrum, but a transition from the initial equilibrium spectrum to a nonnegative spectrum under the descent dynamics. Given that the average of the spectrum is set by the Lagrange multiplier z, i.e. by the energy, the constantly decreasing energy corresponds to an overall drift in the eigenvalues, shifting them towards higher values. We quantify this by expanding S i = S i + dS i and z = z + dz in the Hessian (8) to first order, which results in Our strategy to make gainful progress from this spectral perspective is to simplify the matrix dynamics of eq. (11) by only considering two important factors that must be there: a source of noise and an entropic confinement establishing a well-defined ensemble. The conceptual background of this approach is the seminal insight by Dyson [51] that the equilibrium sampling in random matrix theory [52] can be done by deriving the associated Langevin equation Here, the first term is a Gaussian noise term (with dW ∼ β −1 dt being a Wiener process) and the second term is an entropic spring that ensures that the matrix stays in the correct ensemble. This overall structure corresponds to the effective form of that the actual p-spin eigenvalue dynamics must have, aside from the energy-dependent drift (z-term) of eq. (11). The rationale for this is to note that the dynamics of the Hessian without the energy-shift are effectively uncorrelated with the change in energy (omitting index permutations) dE = J ijk (S i S j dS k + . . .). Thus, we can reduce the first term in eq. (11) to a centered Gaussian increment. The second term in eq. 12 reflects the unavoidable correlations in these updates: there is a well-defined entropic ensemble (the spectrum is a semi-circle at all energies in distribution) and, thus, a restoring drift to this ensemble. Using these increments of the matrix elements, it is straightforward to use matrix perturbation theory to determine the resulting dynamics of the eigenvalues This is Dyson Brownian motion. Here, σ 2 J denotes the O(N −2 ) variance of the couplings. The repulsive third term with its divergence is the technical equivalent of the statement that two eigenvalues do not cross. Thus, the eigenvalues stay in one representation of the permutation group for all times.
Comparing the two expressions of eqs (11) and (12), it is tempting on first glance to regard them as qualitatively identical, since the descent dynamics will also lead to some random noise (which ultimately will be Gaussian due to the Gaussian distribution of J), but with a Hessian that remains in the same well-defined ensemble. The influence of the additional drift term zδ ij on the eigenvalues might at first seem trivial. However, there is an important difference related to time-translational invariance. In the Dysonian case, everything is in equilibrium and thus invariant in statistics under time-translation and timereversal. For descent dynamics, these symmetries are trivially broken by the drift. This can easily be handled, but the symmetries are also broken in the noise term. This can be seen if one goes back to earlier analytical approaches [27] to the dynamics of the p-spin model, where the correlation function C(t + ∆t, t) = S i (t + ∆t)S i (t) decays exponentially for t ∆t with a rate that is inversely proportional to t, i.e. we expect Thus, even a temporally coarse-grained version of eq. (11) that would get rid of correlations in time without drift would not be Markovian, because the strength of noise would be time-dependent. Since we are only interested in first-passage statistics, this is easily mitigated as we can switch to dynamics in a reparametrized time [27] τ with dτ ∝ dt/t. This logarithmic reparametrization makes the process time-translational invariant.
The final missing piece is the energy, which decreases with time during gradient descent. We had to separate this drift from the matrix dynamics to bring the latter into a treatable form, so the relevant process is no longer a first crossing of zero, but the first crossing of a curve given by the energy ε(τ ). As the fluctuations of the energy (given by the gradient, the first derivative of a Gaussian field) are independent from the fluctuations in eigenvalues (aside from the mean they are given by the random part of the Hessian (8), the second derivative of a Gaussian field) and of lower order (O(N −2/3 in the eigenvalues, but O(N −1 ) in the intensive energy) we can replace the actual energy by the asymptotic trajectory ε ∞ (τ ) for large N . This of course not only neglects the dynamical fluctuations along the trajectory, but also the deterministic noise in trajectories from the initial conditions. However, we are interested in the behavior at large times τ , where the effect of the initial configurations is negligible.
We can deduce from the structure of the analytical equations [27,53] that the intensive energy in real-time ε ∞ (t) ∼ ε th + constt −γ asymptotically is a power-law, thus in rescaled time the energy decays exponentially ε ∞ (τ ) ε th + conste −τ /τc . A more pedestrian way to look at this is by directly writing down an equation of motion for the energy Again changing variables to establish time-translation invariance and performing an average over the disorder we see that ∂ τ ε = const+ε 2 , from which one gathers (using the known asymptotic value) that ε ∞ ≈ ε th tanh(const τ ) with the aforementioned asymptotically exponential decay. Putting everything together, we get a model version of the gradient descent as a first-passage process in the standard Dyson dynamics with a time dependent boundary ε model = ε th tanh(τ ). From this model definition, one can (somewhat a posteriori admittedly) rationalize the following numerical findings described earlier.
(1) It is inherently plausible for the finite-size scaling to be of the same form as the Tracy-Widom distribution because the effective process is indeed one dominated by edge eigenvalue fluctuations; (2) The shape of the distribution (Gaussian on one side and Tracy-Widom on the other) is plausible. The fluctuations of the lowest eigenvalue are of order N −2/3 with a hard border to the right, thus there is some time τ 0 where the typical distance is of the same order, so that there is a very high probability for the boundary to not have been crossed. Thus, we know that at τ 0 the fluctuation distribution of the lowest eigenvalue is given by the Tracy-Widom law. However, τ 0 will be close to the actual final time and at small times (up to eigenvalue distances of O(N −1/2 )) the diffusive part dominates. Thus, it seems within reason that we would see a distribution that effectively looks like a convolution of a Gaussian (the propagator on short times) and the Tracy-Widom law (the fictitious initial condition at τ 0 ), which would bear the hallmarks we find in the original numerics. Numerical exploration of this effective description is straightforward with various options for sampling this process. One way would be to go back to the initial idea of the Dyson Brownian motion and diagonalize a matrix subject to small noise in time. The non-crossing is manifest in this approach, but diagonalization is a rather costly operation. Alternatively, one might consider event-based Monte-Carlo of the thermal ensemble whose Langevin equation is given by the Dyson Brownian motion. Finally, there is the option to do straightforward integration of the equation of motion with adaptive time-steps that ensure that the trajectories of eigenvalues never cross. Opting for the latter, we find that we can indeed get to satisfying agreement of the numerics by tuning the strength of the white noise inflicted upon the eigenvalues which one can characterize by an effective temperature T . A first-principle determination of the specific value of T to be used is beyond the scope of our arguments. A numerical determination is possible, but simulations of spectral trajectories are slow for two reasons: the cost of the diagonalization itself and the need to switch from FIRE to the direct integration to see the dynamics in physical time. We opt for a more pragmatic procedure and simulate the process for a few values (large values of T are slow as increasing the noise induces more collisions); see Fig. 5 for N = 64. We see that indeed there is a satisfying agreement between the distributions found from this simple first-passage problem and the real ones in fig. 1.
We close with a final remark on the contributions to the final energies that we ignored. These are the depth of the final valley after the system has become mechanically stable and then contributions corresponding to energy fluctuations in the first-passage problem. All evidence seems to corroborate that these are of higher order in 1/N and almost Gaussian distributed. In this case, they do not contribute to the finite-size scaling and also not to the fluctuation distribution of normalized variables as they would only change the first two cumulants. We are therefore positive that the eigenvalue process outlined here indeed captures the essential mechanism underlying the distribution for the p-spin model and, also, the other constraint satisfaction problems discussed earlier.

CONCLUSIONS
We performed gradient descent simulations in several prototypical constraint-satisfaction problems with com- plex landscapes and found similar asymmetric distributions in the normalized distributions of final energies ( fig. 1). These feature a soft tail corresponding to betterthan-typical solutions and a hard tail for worse-thantypical solutions. Inspecting in more detail for the spherical p-spin spin-glass model, we found that both the finitesize scaling as well as the functional form of the soft tail ( fig. 2) are reminiscent of the Tracy-Widom distribution, which is usually associated with the fluctuations of extremal eigenvalues in random matrix problems.
We made this connection manifest by proposing a novel interpretation of gradient-descent problems as a firstpassage process into mechanical stability, i.e. we argue that the energy at which the lowest eigenvalue becomes non-negative is a good proxy for the actual final energy at which the gradient descent terminates with respect to the fluctuation distribution. This is a purely dynamical picture of the out-of-equilibrium gradient-descent process in which typical landscape features such as the basins of attraction are emergent from the random matrix ensemble associated with the dynamical matrix. The very simple nature of the ersatz-process found by reducing the spectral dynamics to their core ingredients could allow for an exact analytical treatment in the future.
An open question remains concerning the extent to which the observed phenomenology survives with increasing system size. At least for the p-spin model, extrapolation to the large-N limit does lead to a non-Gaussian distribution with the same tail behavior as seen in finite systems. However, this is less clear for the other models studied here, for which it was difficult to obtain comparable statistics. Nevertheless, the perspective of gradient descent as a first-passage process suggests that the highly similar non-Gaussian features seen in the distributions for the other models are not a finite-size effect, and should persist in the thermodynamic limit.
The view of the gradient descent process as a firstpassage problem could be a rather broadly fruitful one. Most aspects of the (matrix) dynamics of the p-spin model are believed to be somewhat general for many complex systems. Additionally, the topological feature that fluctuations towards lower energies (corresponding to minima with atypically soft modes) are substantially easier to find than those towards higher energies (hard modes) should prevail in a vast variety of systems with complex landscapes. This way of thinking should be helpful in understanding phenomenology in experiments such as Ref. 54 that prominently feature asymmetrical distributions of the fluctuations within the inherent structure landscape. Our reasoning should be applicable to results from finite temperature quenches as long as the initial temperature is sufficiently high that the system is ergodic and the final temperature sufficiently low that the system is confined to a single basin after the quench. Finally, we note that a good understanding of the first-passage into mechanical stability might inspire new ways of tweaking interactions to convert complex landscapes into less rough ones (similar to the methods proposed in ref. 55) in order to find better (lower energy) solutions.
We end with a caveat: In finite-dimensional models and data, a simple Dysonian random-matrix view as proposed here will necessarily face some issues, one very important one being the existence of sum rules constraining the Hessian, particularly the ones corresponding to mechanical equilibrium. The details of the coordination structure have been argued [56][57][58] to be crucial in understanding essential features of low-dimensional jammed packings such as the scaling of the vibrational density of states. As this is directly linked to the statistics of the extremal eigenvalues, it is very intriguing for future work to study the effect of these constraints (which develop as the system descends in the landscape) on the distributions studied here. Even when such constraints exist, however, the notion is still valid that there is one contributing process to the statistics of descents in disordered landscapes, related to the passage into mechanical stability that we have isolated for the p-spin model.
We thank S. Teitel for providing us with the data used in Ref. 37 and S. Ridout for additional isobaric jammed configurations and S. R. Nagel for discussions that inspired this investigation. This work was supported by the "Cracking the glass problem" collaboration grant (348126 to HHB, 454943 to JK and 454945 to AJL), as well as an Investigator grant (327939 to AJL) from the Simons Foundation.