Marvels and Pitfalls of the Langevin Algorithm in Noisy High-dimensional Inference

Gradient-descent-based algorithms and their stochastic versions have widespread applications in machine learning and statistical inference. In this work we perform an analytic study of the performance of one of them, the Langevin algorithm, in the context of noisy high-dimensional inference. We employ the Langevin algorithm to sample the posterior probability measure of the spiked matrix-tensor model. The typical behaviour of this algorithm is described by a system of integro-differential equations that we call the Langevin state evolution, whose solution is compared with the one of the state evolution of approximate message passing (AMP). Our results show that, remarkably, the algorithmic threshold of the Langevin algorithm is sub-optimal with respect to the one given by AMP. We conjecture that this phenomenon is due to the residual glassiness present in that region of parameters. Finally, we show how a landscape-annealing protocol, which uses the Langevin algorithm but violates the Bayes-optimality condition, can approach the performance of AMP.


I. MOTIVATION
Algorithms based on noisy variants of gradient descent [1,2] stand at the roots of many modern applications of data science, and are being used in a wide range of high-dimensional non-convex optimization problems. The widespread use of stochastic gradient descent in deep learning [3] is certainly one of the most prominent examples. For such algorithms, the existing theoretical analyses mostly concentrate on convex functions, convex relaxations, or on regimes where spurious local minima become irrelevant. For problems with complicated landscapes where, instead, useful convex relaxations are not known and spurious local minima cannot be ruled out, the theoretical understanding of the behaviour of gradient-descent-based algorithms remains poor and represents a major avenue of research.
The goal of this paper is to contribute to such an understanding in the context of statistical learning, and to transfer ideas and techniques developed for glassy dynamics [4] to the analysis of non-convex high-dimensional inference. In statistical learning, the minimization of a cost function is not the goal per se, but rather a way to uncover an unknown structure in the data. One common way to model and analyze this situation is to generate data with a hidden structure, and to see if the structure can be recovered. This is easily set up as a teacher-student scenario [5,6]: First a teacher generates latent variables and uses them as input of a prescribed model to generate a synthetic dataset. Then, the student observes the dataset and tries to infer the values of the latent variables. The analysis of this setting has been carried out rigorously in a wide range of teacher-student models for high-dimensional inference and learning tasks as diverse as the planted clique [7], generalized linear models such as compressed sensing or phase retrieval [8], factorization of matrices and tensors [9,10], or simple models of neural networks [11]. In these works, the information-theoretically optimal performance (the one obtained by an ideal Bayes-optimal estimator, not limited in time and memory) has been computed.
The main question is, of course, how practical algorithms (operating in polynomial time with respect to the problem size) compare to these ideal performances. The last decade brought remarkable progress in our understanding of the performance achievable computationally. In particular, many algorithms based on message passing [6,12], spectral methods [13], and semidefinite programs (SDP) [14] were analyzed. Depending on the signal-to-noise ratio, these algorithms were shown to be very efficient in many of these tasks. Interestingly, all these algorithms fail to reach good performance in the same region of the parameter space, and this striking observation has led to the identification of a well-defined hard phase. This is a regime of parameters in which the underlying statistical problem can be solved information-theoretically, but no efficient algorithms are known, rendering the problem essentially unsolvable for large instances. This stream of ideas is currently gaining momentum and impacting research in statistics, probability, and computer science.
The performance of noisy-gradient-descent algorithms remains an entirely open question. Do they reach the same performance as message passing and SDPs? Can they enter the hard phase, do they cease to be efficient at the same point as the other approaches, or are they worse? The ambition of the present paper is to address these questions by analyzing the performance of the Langevin algorithm in the high-dimensional limit of a particular spiked mixed matrix-tensor model, defined in detail in the next section.
Similar models have played a fundamental role in statistics and random matrix theory [15,16]. Tensor factorization is also an important topic in machine learning and is widely used in data analysis [17-22]. At variance with the pure spiked tensor case [18], this mixed matrix-tensor model has the advantage that the algorithmic threshold appears at the same scale as the information-theoretic one, similarly to what is observed in simple models of neural networks [8,11]. We view the spiked mixed matrix-tensor model as a prototype for non-convex high-dimensional landscapes. The key virtue of the model is its tractability.
We focus on the Langevin algorithm for two main reasons. Firstly, it is the gradient-based algorithm that is most widely studied in physics. Secondly, at large times (possibly growing exponentially with the system size) it is known to sample the associated Boltzmann measure, thus evaluating the Bayes-optimal estimator for the inference problem. We evaluate the performance of the algorithm at times that are large but not growing with the system size. We explicitly compare the performance obtained this way to the one of the Bayes-optimal estimator and to the best efficient algorithm known so far, the approximate message passing algorithm [6,12]. In particular, contrary to what has been anticipated in [23,24], but as surmised in [25], we observe that the performance of the Langevin algorithm is hampered by the many spurious metastable states still present in the AMP-easy phase.
In showing that, we shed light on a number of properties of the Langevin algorithm that may seem counter-intuitive at first sight (e.g. the performance getting worse as the noise decreases).
The possibility to describe analytically the behavior of the Langevin algorithm in this model is enabled by the existence of the Crisanti-Horner-Sommers-Cugliandolo-Kurchan (CHSCK) equations in spin glass theory, describing the behavior of the Langevin dynamics in the so-called spherical p-spin model [26,27], where the method can be rigorously justified [28]. These equations were a key development in the statistical physics of disordered systems that led to a detailed understanding of, and predictions about, the slow dynamics of glasses [4]. In this paper, we bring these powerful methods and ideas into the realm of statistical learning.

II. THE SPIKED MATRIX-TENSOR MODEL
We now detail the spiked mixed matrix-tensor problem: a teacher generates an N-dimensional vector x* by choosing each of its components independently from a Gaussian distribution of zero mean and unit variance. In the large N limit this is equivalent to having a flat distribution over the N-dimensional hypersphere S^{N−1} defined by |x*|² = N. In the paper we will use either of these two, as convenient. The teacher then generates a symmetric matrix Y_ij and a symmetric order-p tensor T_{i1,...,ip} as

   Y_ij = x*_i x*_j / √N + ξ_ij ,    T_{i1,...,ip} = √((p−1)!) x*_{i1} ⋯ x*_{ip} / N^{(p−1)/2} + ξ_{i1,...,ip} ,   (1)

where ξ_ij and ξ_{i1,...,ip} are the iid Gaussian components of a symmetric random matrix and tensor of zero mean and variance ∆2 and ∆p, respectively; ξ_ij and ξ_{i1,...,ip} correspond to noises corrupting the signal of the teacher. In the limits ∆p → ∞ and ∆2 → ∞ the above model reduces to the canonical spiked Wigner model [29] and to the spiked tensor model [18], respectively. The goal of the student is to infer the vector x* from the knowledge of the matrix Y, of the tensor T, of the values ∆2 and ∆p, and of the spherical prior. The scaling with N specified in Eq. (1) is chosen in such a way that the information-theoretically best achievable error varies between a perfectly reconstructed spike x* and a random guess from the flat measure on S^{N−1}. Here, and in the rest of the paper, we denote x ∈ S^{N−1} the N-dimensional vector, and x_i, with i = 1, ..., N, its components.

This model belongs to the generic direction of study of Gaussian functions on the N-dimensional sphere, known as p-spin spherical spin glass models in the physics literature, and as isotropic models in the Gaussian process literature [30-34]. In statistics and machine learning, these models have appeared following the studies of spiked matrix and tensor models [16,18,29]. Analogous mixed matrix-tensor models, where next to an order-p tensor one observes a matrix created from the same spike, are studied e.g. in [17] in the context of topic modeling, or in [18]. From the optimization-theory point of view, this model is highly non-trivial, being high-dimensional and non-convex. For the purpose of the present paper this model is chosen with the hypothesis that its energy landscape presents properties that will generalize to other non-convex high-dimensional problems.

The following three ingredients are key to the analysis: (a) It is in the class of models for which the Langevin algorithm can be analyzed exactly in the large N limit. (b) The different phase transitions, both algorithmic and information-theoretic, discussed hereafter all happen at ∆2 = O(1), ∆p = O(1). This means that when the problem becomes algorithmically tractable it is still in the noisy regime, where the optimal mean squared error is bounded away from zero. (c) The AMP algorithm is conjectured to be optimal among polynomial algorithms for this model. It is the second and third ingredients that are not present in the pure spiked tensor model [18], making it unsuitable for our present study. We note that the Langevin algorithm was recently analyzed for the pure spiked tensor model in [21] in a regime where the noise variance is very small, ∆ ∼ N^{−p/2}, but we also note that in that model algorithms such as tensor unfolding, semidefinite programming, homotopy methods, or improved message passing schemes work better, roughly up to ∆ ∼ N^{−p/4} [17-19, 35, 36].
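For concreteness, the teacher's generative process of Eq. (1) can be sketched in a few lines of NumPy for p = 3. This is only an illustration: the function name and symmetrization conventions are ours, not from the paper.

import numpy as np

def generate_instance(N, delta2, delta3, seed=0):
    """Sample a spike x* and noisy observations Y and T following Eq. (1)
    for p = 3, where sqrt((p-1)!) = sqrt(2) and N^{(p-1)/2} = N."""
    rng = np.random.default_rng(seed)
    x_star = rng.standard_normal(N)            # spike: iid N(0,1) components

    # Symmetric Gaussian noise for the matrix channel, variance delta2.
    xi2 = rng.normal(0.0, np.sqrt(delta2), (N, N))
    xi2 = (xi2 + xi2.T) / np.sqrt(2)
    Y = np.outer(x_star, x_star) / np.sqrt(N) + xi2

    # Symmetric Gaussian noise for the tensor channel, variance delta3.
    xi3 = rng.normal(0.0, np.sqrt(delta3), (N, N, N))
    xi3 = sum(xi3.transpose(s) for s in
              [(0,1,2),(0,2,1),(1,0,2),(1,2,0),(2,0,1),(2,1,0)]) / np.sqrt(6)
    T = np.sqrt(2.0) * np.einsum('i,j,k->ijk', x_star, x_star, x_star) / N + xi3
    return x_star, Y, T

x_star, Y, T = generate_instance(N=50, delta2=0.7, delta3=1.0)
print(Y.shape, T.shape)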

III. BAYES-OPTIMAL ESTIMATION AND MESSAGE-PASSING
In this section we present the performance of the Bayes-optimal estimator and of the approximate message passing algorithm. This theory is based on a straightforward adaptation of analogous results known for the pure spiked matrix model [9,29,37] and for the pure spiked tensor model [10,18].
The Bayes-optimal estimator x̂ is defined as the one that, among all estimators, minimizes the mean-squared error (MSE) with the spike x*. Starting from the posterior probability distribution

   P(x|Y,T) = P_X(x) P(Y|x) P(T|x) / P(Y,T) ,   (2)

the Bayes-optimal estimator reads

   x̂_i = ∫ dx x_i P(x|Y,T) .   (3)

To simplify notation, and to make contact with the energy landscape and the statistical physics notation, it is convenient to introduce the energy cost function, or Hamiltonian,

   H(x) = − (1/∆2) Σ_{i<j} Y_ij x_i x_j / √N − (√((p−1)!)/∆p) Σ_{i1<...<ip} T_{i1,...,ip} x_{i1} ⋯ x_{ip} / N^{(p−1)/2} ,   (4)

so that, keeping in mind that for N → ∞ the spherical constraint |x|² = N is satisfied, the posterior is written as P(x|Y,T) = exp[−H(x)]/Z(Y,T), where Z is the normalizing partition function.
With the use of the replica theory and its recent proofs from [9,10,38] one can establish rigorously that the mean squared error achieved by the Bayes-optimal estimator (3) is given by MMSE = 1 − m*, where m* ∈ R is the global maximizer of the so-called free entropy of the problem,

   Φ(m) = (1/2) [ m + log(1−m) + m²/(2∆2) + m^p/(p∆p) ] .   (5)

This expression is derived, and proven, in the Appendix Sec. B 2. We note that the proof applies to the posterior distribution (2) with the Gaussian prior.
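Numerically, obtaining the MMSE from Eq. (5) is a one-dimensional maximization. A minimal sketch, assuming the Gaussian-prior form of Φ(m) written above (function names and example noise values are ours):

import numpy as np

def free_entropy(m, delta2, deltap, p=3):
    # RS free entropy of Eq. (5).
    return 0.5 * (m + np.log1p(-m) + m**2 / (2 * delta2) + m**p / (p * deltap))

def mmse(delta2, deltap, p=3, grid=200000):
    # Global maximizer m* on [0, 1) gives MMSE = 1 - m*.
    m = np.linspace(0.0, 1.0 - 1e-9, grid)
    m_star = m[np.argmax(free_entropy(m, delta2, deltap, p))]
    return 1.0 - m_star

print(mmse(0.7, 1.0))   # easy phase: MMSE strictly below 1
print(mmse(1.5, 5.0))   # high-noise example: m* = 0, MMSE = 1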
We now turn to approximate message passing (AMP) [10,18], which is the best algorithm known so far for this problem. AMP is an iterative algorithm inspired by the work of Thouless, Anderson, and Palmer in statistical physics [39]. We give its explicit form in the Appendix Sec. B 1. Most remarkably, the performance of AMP can be evaluated by tracking its evolution with the iteration time, and it is given in terms of the (possibly local) maximum of the above free entropy that is reached as a fixed point of the iterative process

   m^{t+1} = (1 − m^t) [ m^t/∆2 + (m^t)^{p−1}/∆p ] ,   (6)

with initial condition m^{t=0} = ε, with 0 < ε ≪ 1. Eq. (6) is called the state evolution of AMP and its validity is proven for closely related models in [40]. We denote the corresponding fixed point m_AMP and the corresponding estimation error MSE_AMP = 1 − m_AMP.
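The state evolution (6) is a scalar iteration and is straightforward to run. A minimal sketch (our own naming, illustrative parameter values):

def se_step(m, delta2, deltap, p=3):
    # One step of the AMP state evolution, Eq. (6).
    return (1.0 - m) * (m / delta2 + m**(p - 1) / deltap)

def m_amp(delta2, deltap, p=3, eps=1e-8, iters=10000):
    # Iterate from an infinitesimally informative initialization m_0 = eps.
    m = eps
    for _ in range(iters):
        m_new = se_step(m, delta2, deltap, p)
        if abs(m_new - m) < 1e-12:
            break
        m = m_new
    return m

print(m_amp(0.7, 1.0))   # easy: converges to m_AMP = m* (about 0.47 here)
print(m_amp(1.5, 0.1))   # low-Delta_p example: m_AMP stays at 0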
The phase diagram presented in Fig. 1 summarizes this theory for the spiked 2+3-spin model. It is deduced by investigating the local maxima of the scalar function (5). Notably, we observe that the phase diagram in terms of ∆2 and ∆p splits into three phases (a minimal sketch classifying points of the (∆2, ∆p) plane follows the list):
• Easy, in green, for ∆2 < 1 and any ∆p: The fixed point of the state evolution (6) is the global maximizer of the free entropy (5), and m* = m_AMP > 0.
• Hard, in orange, for ∆2 > 1 and low ∆p < ∆p^IT(∆2): The fixed point of the state evolution (6) is not the global maximizer of the free entropy (5), and m* > m_AMP = 0.
• Impossible, in red, for ∆2 > 1 and high ∆p > ∆p^IT(∆2): The fixed point of the state evolution (6) is the global maximizer of the free entropy (5), and m* = m_AMP = 0.
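The classification sketch announced above simply compares the global maximizer of Eq. (5) with the fixed point of Eq. (6) reached from m = ε (tolerances and test points are ours, for illustration):

import numpy as np

def classify(delta2, deltap, p=3, eps=1e-8):
    """Classify a point of the (delta2, deltap) plane as easy/hard/impossible
    by comparing m* from Eq. (5) with m_AMP from Eq. (6)."""
    m = np.linspace(0.0, 1.0 - 1e-9, 100000)
    phi = 0.5 * (m + np.log1p(-m) + m**2 / (2 * delta2) + m**p / (p * deltap))
    m_star = m[np.argmax(phi)]                 # information-theoretic overlap
    m_amp = eps
    for _ in range(5000):                      # state evolution from eps
        m_amp = (1.0 - m_amp) * (m_amp / delta2 + m_amp**(p - 1) / deltap)
    if m_star < 1e-3:
        return "impossible"
    return "easy" if abs(m_star - m_amp) < 1e-3 else "hard"

for point in [(0.7, 1.0), (1.5, 0.1), (1.5, 5.0)]:
    print(point, classify(*point))             # expected: easy / hard / impossible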
For the 2+p-spin model with p > 3 the phase diagram is slightly richer and is presented in the Appendix Sec. D 4.

IV. LANGEVIN ALGORITHM AND ITS ANALYSIS
We now turn to the core of the paper and the analysis of the Langevin algorithm. In statistics, the most commonly used way to compute the Bayes-optimal estimator (3) is to attempt to sample the posterior distribution (2) and use several independent samples to compute the expectation in (3). In order to do that, one needs to set up a stochastic dynamics on x whose stationary measure at long times is given by the posterior measure (2). The Langevin algorithm is one of the possibilities (others include notably Markov chain Monte Carlo). The common bottleneck is that the time needed to achieve stationarity can in general be exponential in the system size, in which case the algorithm is practically useless. However, this is not always the case, and there are regions in parameter space where one can expect that the relaxation to the posterior measure happens on tractable timescales. Therefore it is crucial to understand where this happens and what the associated relaxation timescales are.

FIG. 1. Phase diagram of the spiked matrix-tensor model for p = 3. In the easy (green) region AMP achieves the optimal error, smaller than a random pick from the prior. In the impossible region (red) the optimal error is as bad as a random pick from the prior, and AMP achieves it as well. In the hard region (orange) the optimal error is low, but AMP does not find an estimator better than a random pick from the prior. For the Langevin algorithm the performance is strictly worse than for AMP, in the sense that the hard region increases up to the line 1/∆2* = max(1, √(2/∆3)), depicted in green dots. The green circles are obtained by numerical extrapolation of the Langevin state evolution equations.
The Langevin algorithm on the hypersphere with the Hamiltonian given by Eq. (4) reads

   ẋ_i(t) = −μ(t) x_i(t) − ∂H/∂x_i(t) + η_i(t) ,   (7)

where η_i(t) is a zero-mean noise term with ⟨η_i(t)η_j(t')⟩ = 2δ_ij δ(t−t'), and the average ⟨•⟩ is taken with respect to the realizations of the noise. The Lagrange multiplier μ(t) is chosen in such a way that the dynamics remains on the hypersphere. In the large N limit one finds

   μ(t) = 1 − [ 2H_2(t) + pH_p(t) ] / N ,   (8)

where H_2(t) is the first term of (4) evaluated at x(t), and H_p(t) is the value of the second term of (4). The presented spiked matrix-tensor model falls into the particular class of spherical 2+p-spin glasses [41,42] for which the performance of the Langevin algorithm can be tracked exactly in the large N limit via a set of integro-differential equations [26,27], dubbed CHSCK above. We call this generalised version of the CHSCK equations the Langevin state evolution (LSE) equations, in analogy with the state evolution of AMP.
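A direct finite-N simulation of Eq. (7) is straightforward, and useful as a sanity check of the LSE solutions discussed below. The following minimal sketch (p = 3, toy sizes; the Euler discretization and the rescaling used to enforce the spherical constraint are our own choices, not the paper's):

import numpy as np

def langevin(N=100, delta2=0.5, delta3=1.0, dt=0.01, steps=2000, seed=1):
    """Euler discretization of the Langevin algorithm, Eq. (7), for p = 3.
    The spherical constraint |x|^2 = N is enforced by rescaling x at each
    step, which plays the role of the Lagrange multiplier mu(t)."""
    rng = np.random.default_rng(seed)
    x_star = rng.standard_normal(N); x_star *= np.sqrt(N) / np.linalg.norm(x_star)
    xi2 = rng.normal(0, np.sqrt(delta2), (N, N)); xi2 = (xi2 + xi2.T) / np.sqrt(2)
    Y = np.outer(x_star, x_star) / np.sqrt(N) + xi2
    xi3 = rng.normal(0, np.sqrt(delta3), (N, N, N))
    xi3 = sum(xi3.transpose(s) for s in
              [(0,1,2),(0,2,1),(1,0,2),(1,2,0),(2,0,1),(2,1,0)]) / np.sqrt(6)
    T = np.sqrt(2) * np.einsum('i,j,k->ijk', x_star, x_star, x_star) / N + xi3

    def grad_H(x):
        # Gradient of Eq. (4), with ordered sums rewritten as unrestricted ones.
        g2 = -Y @ x / (delta2 * np.sqrt(N))
        g3 = -np.sqrt(2) * np.einsum('ijk,j,k->i', T, x, x) / (2 * delta3 * N)
        return g2 + g3

    x = rng.standard_normal(N); x *= np.sqrt(N) / np.linalg.norm(x)
    for t in range(steps):
        noise = rng.standard_normal(N) * np.sqrt(2 * dt)   # <eta eta> = 2 delta(t-t')
        x = x - dt * grad_H(x) + noise
        x *= np.sqrt(N) / np.linalg.norm(x)                # spherical constraint
        if t % 500 == 0:
            print(t, x @ x_star / N)                        # overlap with the spike
    return x

langevin()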
In order to write the LSE equations, we define three dynamical correlation functions,

   C_N(t,t') = (1/N) Σ_i x_i(t) x_i(t') ,   R_N(t,t') = (1/N) Σ_i δx_i(t)/δh_i(t') ,   C_N(t) = (1/N) Σ_i x_i(t) x*_i ,   (9)

where h_i is a pointwise external field applied at time t' to the Hamiltonian as H → H − Σ_i h_i x_i. We note that the correlation functions defined above depend on the realization of the thermal history (i.e. of the noise η(t)) and on the disorder (here the matrix Y and the tensor T). However, in the large N limit they all concentrate around their averages. We thus define C(t,t') = lim_{N→∞} E_{Y,T} ⟨C_N(t,t')⟩_η, and analogously for C(t) and R(t,t'). Defining

   Q(x) = x²/(2∆2) + x^p/(p∆p) ,   Q'(x) = x/∆2 + x^{p−1}/∆p ,   Q''(x) = 1/∆2 + (p−1) x^{p−2}/∆p ,   (10)

standard field-theoretical methods [43] or dynamical cavity method arguments [44] can then be used to obtain a closed set of integro-differential equations for the averaged dynamical correlation functions, describing the average global evolution of the system under the Langevin algorithm. The resulting LSE equations are (see the Appendix for a complete derivation)

   ∂_t C(t,t') = −μ(t) C(t,t') + Q'(C(t)) C(t') + ∫_0^{t'} ds Q'(C(t,s)) R(t',s) + ∫_0^t ds Q''(C(t,s)) R(t,s) C(s,t') ,

   ∂_t R(t,t') = −μ(t) R(t,t') + ∫_{t'}^t ds Q''(C(t,s)) R(t,s) R(s,t') ,   (11)

   ∂_t C(t) = −μ(t) C(t) + Q'(C(t)) + ∫_0^t ds Q''(C(t,s)) R(t,s) C(s) ,

with μ(t) = 1 + Q'(C(t)) C(t) + ∫_0^t ds [ Q''(C(t,s)) C(t,s) + Q'(C(t,s)) ] R(t,s). The Lagrange multiplier μ(t) is fixed by the spherical constraint through the condition C(t,t) = 1 ∀t. Furthermore, causality implies that R(t,t') = 0 if t < t'. Finally, the Ito convention in the stochastic equation (7) gives, for all t, lim_{t'→t⁻} R(t,t') = 1.

V. BEHAVIOR OF THE LANGEVIN ALGORITHM
In order to assess the performance of the Langevin algorithm and compare it with AMP, we notice that the correlation function C(t) is directly related to the accuracy of the algorithm. We solve the integro-differential equations (11) numerically along the lines of [45,46]; for a detailed procedure see the Appendix Sec. C 1, with codes available online at [47]. In Fig. 2 we plot the correlation with the spike C(t) as a function of the running time t for p = 3, fixed ∆2 = 0.7, and several values of ∆p; we use as initial condition C(t=0) = 10^{−4}. In the inset of the plot we compare it to the same quantity obtained from the state evolution of the AMP algorithm, with the same initial condition.
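For illustration, a crude fixed-grid Euler integrator of Eqs. (11) can be written as follows. This is only a sketch under our own discretization choices; the results of the paper rely on the adaptive-grid solver of [45,46], available at [47], to reach large times.

import numpy as np

def lse(delta2=0.7, deltap=2.0, p=3, dt=0.05, steps=400, cbar0=1e-4):
    """Fixed-grid Euler integration of the LSE equations (11), storing the
    two-time functions on a triangular grid: C[i, j] ~ C(t_i, t_j), j <= i."""
    Qp  = lambda c: c / delta2 + c**(p - 1) / deltap              # Q'(c)
    Qpp = lambda c: 1.0 / delta2 + (p - 1) * c**(p - 2) / deltap  # Q''(c)
    n = steps
    C = np.zeros((n, n)); R = np.zeros((n, n)); Cb = np.zeros(n)
    C[0, 0] = 1.0; R[0, 0] = 1.0; Cb[0] = cbar0
    for i in range(n - 1):
        s = np.arange(i + 1)
        mu = 1.0 + Qp(Cb[i]) * Cb[i] + dt * np.sum(
            (Qpp(C[i, s]) * C[i, s] + Qp(C[i, s])) * R[i, s])
        Cb[i+1] = Cb[i] + dt * (-mu * Cb[i] + Qp(Cb[i])
                  + dt * np.sum(Qpp(C[i, s]) * R[i, s] * Cb[s]))
        for j in range(i + 1):
            # C(s, t_j) read off from the stored triangle by symmetry.
            Csym = np.where(s <= j, C[j, np.minimum(s, j)], C[np.maximum(s, j), j])
            dC = (-mu * C[i, j] + Qp(Cb[i]) * Cb[j]
                  + dt * np.sum(Qp(C[i, :j+1]) * R[j, :j+1])
                  + dt * np.sum(Qpp(C[i, s]) * R[i, s] * Csym))
            dR = -mu * R[i, j] + dt * np.sum(
                Qpp(C[i, j:i+1]) * R[i, j:i+1] * R[j:i+1, j])
            C[i+1, j] = C[i, j] + dt * dC
            R[i+1, j] = R[i, j] + dt * dR
        C[i+1, i+1] = 1.0     # spherical constraint C(t, t) = 1
        R[i+1, i+1] = 1.0     # encodes lim_{t'->t^-} R(t, t') = 1 on the grid
    return Cb

Cb = lse()
print(Cb[::50])   # growth of the correlation (slow close to the Langevin threshold)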
For the Langevin algorithm in Fig. 2 we see a striking pattern. One would expect that as the noise ∆p decreases the inference problem gets easier, the correlation with the signal becomes larger, and it is reached sooner in the iteration. This is, after all, exactly what we observe for the AMP algorithm in the inset of Fig. 2. Also for the Langevin algorithm the plateau reached at large times t becomes higher (better accuracy) as the noise ∆p is reduced. Furthermore, the height of the plateau coincides with that reached by AMP, thus testifying that the algorithm reached equilibrium. However, contrary to AMP, the relaxation time of the Langevin algorithm increases dramatically when diminishing ∆p (notice the log scale on the x-axis of Fig. 2, as compared to the linear scale of the inset).

We define τ as the time it takes for the correlation to reach the value C_plateau/2. We then plot the value of this equilibration time in the insets of Fig. 3 as a function of the noise ∆2, having fixed ∆p. The data are consistent with a divergence of τ at a certain finite value ∆2*. We found that the divergence points are affected by the initial condition of the dynamics C(t=0); this aspect is discussed in the Appendix Sec. D 5. In the analysis of the phase diagram we initialize the dynamics with C(t=0) = 10^{−40} (smaller values have not led to noticeable changes in ∆2*). We calculate the divergence time and fit the data with a power law τ ∼ (∆2 − ∆2*)^{−γ}, obtaining, in the particular case of fixed ∆p = 1.0, γ = 2.24 and ∆2* = 0.72.

FIG. 3. Equilibration time of the Langevin algorithm at fixed ∆p = 1.0. The main panel presents a fit using a power law, consistent with a divergence at ∆2* ≈ 0.72. The circles are obtained with the numerical solution of the LSE that uses the dynamical grid, while crosses are obtained using a fixed grid; the initial condition was C(t=0) = 10^{−40} (details in the Appendix).

We are not able to strictly prove that the divergence of the relaxation time truly occurs, but at the very least our results imply that for ∆2 > ∆2* the Langevin algorithm (7) is not a practical solver for the spiked mixed matrix-tensor problem. We will call the region ∆2* < ∆2 < 1, where the AMP algorithm works optimally without problems yet the Langevin algorithm does not, the Langevin-hard region. ∆2* is then plotted in Fig. 1 with green points and delimits the Langevin-hard region, which extends considerably into the region where the AMP algorithm works optimally in a small number of iterations. Our main conclusion is thus that the Langevin algorithm designed to sample the posterior measure works efficiently in a considerably smaller region of parameters than AMP, as quantified in Fig. 1.
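The extraction of γ and ∆2* amounts to a standard nonlinear fit. The sketch below generates synthetic relaxation times from the fitted law itself, with the values γ = 2.24 and ∆2* = 0.72 quoted above (these are not the paper's data points), and verifies that the fitting procedure recovers the parameters:

import numpy as np
from scipy.optimize import curve_fit

law = lambda d2, A, d2s, gamma: A * (d2 - d2s) ** (-gamma)

rng = np.random.default_rng(0)
d2 = np.linspace(0.75, 0.95, 15)
# Synthetic tau values drawn from the law, with multiplicative noise.
tau = law(d2, 1.0, 0.72, 2.24) * rng.lognormal(0.0, 0.05, d2.size)

popt, _ = curve_fit(law, d2, tau, p0=(1.0, 0.70, 2.0),
                    bounds=([0.1, 0.5, 0.5], [10.0, 0.745, 5.0]))
print("A = %.2f, delta2* = %.3f, gamma = %.2f" % tuple(popt))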
Fig. 4 presents another way to depict the observed data: the correlation C(t) reached after time t is plotted as a function of the tensor noise variance ∆p. The results of AMP are depicted with dotted lines and, as one would expect, decrease monotonically as the noise ∆p increases. The equilibrium value (black dashed) is reached within a few dozen iterations. On the contrary, the correlation reached by the Langevin algorithm after time t is non-monotonic and close to zero for small values of the noise ∆p, signaling again a rapidly growing relaxation time as ∆p is decreased.

VI. GLASSY NATURE OF THE LANGEVIN-HARD PHASE
The behaviour of the Langevin dynamics presented in the last section might seem counter-intuitive at first sight, because one would expect any problem to get simpler when the noise ∆p is decreased. In the present model, instead, as ∆p is decreased the tensor part of the cost function (4) becomes more important. As a consequence the landscape becomes rougher, and this causes the failure of the Langevin algorithm.
In the presence of the hard (for AMP) phase, it was recently argued in [25] that sampling-based algorithms are indeed expected to be worse than approximate message passing. This is due to residual glassiness that extends beyond the hard phase. We repeated the analysis of [25] in the present model (details in the Appendix Sec. E) and conclude that while this explanation provides the correct physical picture, the transition line obtained in this way does not agree quantitatively with the extrapolation of the relaxation times obtained numerically in the previous section, at least on the timescales on which we were able to solve the LSE equations. The reasons behind this remain open.
In order to obtain a theoretical estimate that quantitatively agrees with the observed behaviour of the LSE, we found the following argument. We first notice that the Langevin dynamics initialized at very small overlap C(t=0) remains for a long time at small values of the correlation with the signal. We assume that during this time the dynamics behaves as it would in the mixed 2+p-spin model without the spike. The model without the spike has been studied extensively in the physics literature, precisely with the aim of understanding the dynamical properties of glasses [4,27,48]. One of the important results of those studies is that the randomly initialized dynamics converges asymptotically to the so-called threshold states. Indeed, in Fig. 5 this aspect can be observed in the evolution of the energy: it soon approaches a value that can be evaluated analytically [4,27,48] and corresponds to the threshold-state energy E_th, Eq. (12), shown as horizontal lines in Fig. 5; its explicit expression is derived in Appendix E 1. In Eq. (12), q_th represents the correlation, a.k.a. overlap, of two configurations randomly picked from the same threshold state, which can also be evaluated as the solution of the marginality condition

   (1 − q_th)² [ 1/∆2 + (p−1) q_th^{p−2}/∆p ] = 1 .   (13)

The derivation of these expressions can be found in Appendix E 1 and in Appendix F. Supported by the numerical results of Fig. 5, we make the approximation that already on the observed timescales the algorithm converges to the threshold states. The presence of the signal then decides whether the algorithm develops a correlation with the signal. To understand how this occurs, one has to study the statistical properties of the Thouless-Anderson-Palmer (TAP) free-energy landscape, which is the finite-temperature counterpart of the energy landscape, as shown in early results of spin-glass theory [44,49] and in recent ones from the mathematical community [50]. The generic picture that comes out of several years of studies of spin-glass models is that threshold states correspond to marginal local minima of the TAP free energy. Critical points of the TAP free-energy functional that are below the threshold states are typically local minima, while those above are saddles with extensively many negative directions. The threshold states lie in between and have just a few very flat directions. In order to obtain an analytical prediction for the Langevin dynamics threshold, one has to find out how the presence of the spike destabilises the threshold states. This can be achieved by studying the free-energy Hessian at a threshold state. As shown in App. F, such a Hessian reads

   Hess_ij = f(q_th) δ_ij + G_ij − x*_i x*_j / (∆2 N) + g(q_th) m_i m_j / N ,

where f(q_th) is positive and its expression can be found in App. F (the positive coefficient of the last rank-one term, denoted g(q_th) here, is also given there), m_i is the average magnetization of site i in the given threshold state, and G_ij can be shown to be statistically equivalent to a random matrix having elements which are i.i.d. Gaussian random variables with mean zero and variance

   E[G_ij²] = (1/N) [ 1/∆2 + (p−1) q_th^{p−2}/∆p ] = Q''(q_th)/N .   (14)

The free-energy Hessian evaluated at a typical threshold state is therefore a random matrix belonging to the Gaussian orthogonal ensemble plus two rank-one perturbations; one is negative and in the direction of the signal, whereas the other is positive and in the direction of the threshold state.
Results from random matrix theory allow us to completely characterise the spectral properties of the Hessian. Its bulk density of eigenvalues is a shifted semicircle whose left edge touches zero, hence leading to the marginality of the threshold states. For small signal-to-noise ratio the minimal eigenvalue is zero, whereas when the signal-to-noise ratio exceeds a certain critical value, the rank-one perturbation in the direction of the signal induces a BBP (Baik, Ben Arous, Péché) transition [15,51]: a negative eigenvalue pops out of the shifted Wigner semicircle, and correspondingly a descent direction toward the spike emerges and makes the threshold states unstable. Note that the last term of the Hessian has no effect on the development of an unstable direction, as it is positive and uncorrelated with the signal. By adapting the known formulas for the BBP transition to our case, see App. F, we find a landscape-based conjecture for the algorithmic threshold: ∆2* is given by the root of

   ∆2 + (p−1) ∆2² (1−∆2)^{p−2} / ∆p = 1   (15)

when this root is smaller than one, and by ∆2* = 1 otherwise. This is the threshold depicted in Fig. 1 for p = 3 with the green dotted line. We note a very good agreement with the data points obtained by extrapolating the relaxation time from the numerical solution of the LSE equations.
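Both the marginality condition (13) and the threshold line (15) are scalar root-finding problems. A minimal sketch (our own function names; note that for p = 3 the interior root of Eq. (15) is ∆2* = √(∆p/2)):

import numpy as np
from scipy.optimize import brentq

def q_threshold(delta2, deltap, p=3):
    # Marginality condition (13): (1-q)^2 [1/delta2 + (p-1) q^(p-2)/deltap] = 1.
    g = lambda q: (1 - q)**2 * (1/delta2 + (p - 1) * q**(p - 2) / deltap) - 1
    return brentq(g, 1e-12, 1 - 1e-12)

def delta2_star(deltap, p=3):
    # Langevin threshold, Eq. (15).  delta2 = 1 is always a root; the
    # algorithmic threshold is the interior root when it exists.
    h = lambda d2: d2 + (p - 1) * d2**2 * (1 - d2)**(p - 2) / deltap - 1
    try:
        return brentq(h, 1e-9, 1 - 1e-6)
    except ValueError:        # no interior root: threshold stays at delta2* = 1
        return 1.0

for dp in [0.5, 1.0, 1.9, 4.0]:
    d2s = delta2_star(dp)
    if d2s < 1:
        print(dp, d2s, 1 - q_threshold(d2s, dp))   # second and third columns agree
    else:
        print(dp, d2s)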
In the following, we present a complementary argument that, interestingly, also makes a direct link with the AMP state evolution. We again assume that the Langevin dynamics approaches the threshold states. This time we use AMP to determine whether it will remain there or not. If the initial correlation is 0 < m^{t=0} ≪ 1, its evolution follows

   m^{t+1} = (1 − q_th) [ m^t/∆2 + (m^t)^{p−1}/∆p ] .   (16)

This equation is obtained from the state evolution of the AMP algorithm where the overlap is fixed to q_th, as detailed in the Appendix B 3. The stability condition that decides whether an infinitesimal correlation grows or decreases under (16) is (1−q_th)/∆2 ≷ 1, marginal at q_th = 1 − ∆2. Using (13), this then leads to Eq. (15).
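A few lines suffice to check this stability criterion by iterating Eq. (16) directly (reusing q_threshold from the previous sketch; the test points are ours):

def grows(delta2, deltap, p=3, m0=1e-8, iters=200):
    # Iterate Eq. (16) at fixed q_th and test whether an infinitesimal
    # correlation with the signal is amplified.
    q_th = q_threshold(delta2, deltap, p)
    m = m0
    for _ in range(iters):
        m = (1 - q_th) * (m / delta2 + m**(p - 1) / deltap)
    return m > m0

# For delta_p = 1 the threshold sits at delta2* = sqrt(1/2) ~ 0.707.
print(grows(0.65, 1.0))   # True:  below delta2*, Langevin-easy
print(grows(0.75, 1.0))   # False: above delta2*, Langevin-hard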

VII. DISCUSSION AND PERSPECTIVES
FIG. 5. The system first approaches the threshold energy, Eq. (12) (horizontal lines), and then, for sufficiently small ∆2, finds the direction toward the signal.

Motivated by the general aim of shedding light on the behaviour and performance of the noisy-gradient-descent algorithms that are widely used in machine learning, we investigate analytically the performance of the Langevin algorithm in the noisy high-dimensional limit of a spiked matrix-tensor model. We compare it to the performance of the approximate message passing algorithm. While both these algorithms are designed with the aim of sampling the posterior measure of the model, we show that the Langevin algorithm fails to find correlation with the signal in a considerable part of the AMP-easy region.
Neither of the two algorithms enters the so-called hard phase. Our analysis is based on the Langevin state evolution equations, a generalization of the dynamical theory of mean-field spin glasses, which describe the evolution of the algorithm in the large size limit. The Langevin algorithm performs worse than AMP due to the underlying glass transition in the corresponding region of parameters. Relying on results from spin glass theory, we present a simple heuristic expression for the Langevin-threshold line, Eq. (15), which appears to be in agreement with the value obtained from the numerical solution of the LSE equations.
We note that so far, in our study of the spiked matrix-tensor model with Langevin dynamics, we only accessed the cost function (4) and its derivatives. We did not allow ourselves to split the cost function into the tensor-related part H_p and the matrix-related part H_2. If we did, there would be a simple way to overcome the Langevin-hard regime: first consider only the matrix measurements, and then slowly turn on the tensor, in a similar way as the temperature is tuned in simulated annealing. We study this procedure in the Appendix D 6 (a minimal sketch is given below). It is interesting to underline that from the point of view of Bayesian inference this finding remains somewhat paradoxical. In the setting of this paper we know perfectly the model that generated the data and all its parameters, yet we see that for the Langevin algorithm it is computationally advantageous to mismatch the parameters and perform an annealing in the tensor part in order to reach faster convergence to equilibrium. This is particularly striking given that for AMP it has been proven in [7] that mismatching the parameters can never improve the performance. In fact, from a physics point of view, the principle thanks to which AMP does not share the hurdles of the Langevin algorithm remains an interesting open question.
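A minimal sketch of this landscape-annealing protocol is obtained from the Langevin sketch of Sec. IV by multiplying the tensor part of the gradient by a ramp λ(t). The linear ramp, toy sizes, and parameter values are our own illustrative choices:

import numpy as np

def annealed_langevin(N=100, delta2=0.9, delta3=0.5, dt=0.01, steps=3000,
                      ramp=2000, seed=2):
    """Langevin dynamics with the tensor term switched on gradually:
    lambda(t) ramps from 0 to 1 over `ramp` steps.  With lambda = 0 the
    matrix part alone magnetizes the system (here delta2 < 1), even though
    delta2 > sqrt(delta3/2) would put direct Langevin in the hard regime."""
    rng = np.random.default_rng(seed)
    x_star = rng.standard_normal(N); x_star *= np.sqrt(N) / np.linalg.norm(x_star)
    xi2 = rng.normal(0, np.sqrt(delta2), (N, N)); xi2 = (xi2 + xi2.T) / np.sqrt(2)
    Y = np.outer(x_star, x_star) / np.sqrt(N) + xi2
    xi3 = rng.normal(0, np.sqrt(delta3), (N, N, N))
    xi3 = sum(xi3.transpose(s) for s in
              [(0,1,2),(0,2,1),(1,0,2),(1,2,0),(2,0,1),(2,1,0)]) / np.sqrt(6)
    T = np.sqrt(2) * np.einsum('i,j,k->ijk', x_star, x_star, x_star) / N + xi3

    x = rng.standard_normal(N); x *= np.sqrt(N) / np.linalg.norm(x)
    for t in range(steps):
        lam = min(1.0, t / ramp)                      # annealing schedule
        g2 = -Y @ x / (delta2 * np.sqrt(N))
        g3 = -np.sqrt(2) * np.einsum('ijk,j,k->i', T, x, x) / (2 * delta3 * N)
        x = x - dt * (g2 + lam * g3) + rng.standard_normal(N) * np.sqrt(2 * dt)
        x *= np.sqrt(N) / np.linalg.norm(x)           # spherical constraint
        if t % 1000 == 0:
            print(t, lam, x @ x_star / N)             # overlap with the spike
    return x

annealed_langevin()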
We stress that the above annealing procedure is a particularity of the present model and will not generalize to a broad range of inference problems, because it is not clear in general how to split the cost function into a simple-to-optimize yet informative part and the rest. A formidably interesting direction for future work consists instead in investigating whether the performance of the Langevin algorithm can be improved in a manner that only accesses the cost function or its derivatives.
While we studied here the spiked matrix-tensor model, we expect that our findings, based on the existence of an underlying glass transition, will hold more universally. We expect them to apply to other local sampling dynamics, e.g. Markov chain Monte Carlo, and to a broader range of models, e.g. simple models of neural networks. An interesting extension of this work would be to investigate algorithms closer to stochastic gradient descent and models closer to current neural network architectures.

Appendix A: The spiked matrix-tensor model

We consider a teacher-student setting in which the teacher constructs a matrix and a tensor from a randomly sampled signal, and the student is asked to recover the signal from the observation of the matrix and the tensor provided by the teacher [6].
The signal x* is an N-dimensional vector whose entries are real i.i.d. random variables sampled from the normal distribution (i.e. the prior is P_X ∼ N(0,1)). The teacher generates from the signal a symmetric matrix and a symmetric tensor of order p. These two objects are then transmitted through two noisy channels with variances ∆2 and ∆p, so that at the end one has the two noisy observations

   Y_ij = x*_i x*_j / √N + ξ_ij ,   (A1)

   T_{i1,...,ip} = √((p−1)!) x*_{i1} ⋯ x*_{ip} / N^{(p−1)/2} + ξ_{i1,...,ip} ,   (A2)

where, for i < j and i1 < ⋯ < ip, ξ_ij and ξ_{i1,...,ip} are i.i.d. random variables distributed according to ξ_ij ∼ N(0,∆2) and ξ_{i1,...,ip} ∼ N(0,∆p); the ξ_ij and ξ_{i1,...,ip} form a symmetric random matrix and tensor, respectively. Given Y_ij and T_{i1,...,ip}, the inference task is to reconstruct the signal x*.
In order to solve this problem we consider the Bayesian approach. This starts from the assumption that both the matrix and the tensor have been produced by a process of the same kind as the one described by Eqs. (A1)-(A2). Furthermore, we assume to know the statistical properties of the channels, namely the two variances ∆2 and ∆p, and the prior on x. Given this, the posterior probability distribution over the signal is obtained through the Bayes formula

   P(x|Y,T) = P_X(x) P(Y,T|x) / P(Y,T) ,   (A3)

where

   P(Y,T|x) = ∏_{i<j} (2π∆2)^{−1/2} exp[ −(Y_ij − x_i x_j/√N)²/(2∆2) ] ∏_{i1<...<ip} (2π∆p)^{−1/2} exp[ −(T_{i1,...,ip} − √((p−1)!) x_{i1}⋯x_{ip}/N^{(p−1)/2})²/(2∆p) ] .   (A4)

Therefore we have

   P(x|Y,T) = exp[−H(x)] / Z(Y,T) ,   (A5)

where Z(Y,T) is a normalization constant. Plugging Eqs. (A1)-(A2) into Eq. (A5) allows to rewrite the posterior measure in the form of a Boltzmann distribution of the mixed 2+p-spin Hamiltonian [41,42,52],

   H(x) = − (1/∆2) Σ_{i<j} Y_ij x_i x_j / √N − (√((p−1)!)/∆p) Σ_{i1<...<ip} T_{i1,...,ip} x_{i1} ⋯ x_{ip} / N^{(p−1)/2} + (1/2) Σ_i x_i² + const ,   (A6)

where the terms depending on x only through its norm have been collected in the last contributions. In the following we will refer to Z(Y,T) as the partition function. We note here that in the large N limit, using a Gaussian prior on the variables x_i is equivalent to considering a flat measure over the N-dimensional hypersphere Σ_{i=1}^N x_i² = N. This choice will be used when we describe the Langevin algorithm, and in this case the last term in the Hamiltonian becomes an irrelevant constant.
Appendix B: Approximate Message Passing, state evolution and phase diagrams

Approximate message passing (AMP) is a powerful iterative algorithm to compute the local magnetizations ⟨x_i⟩ given the observed matrix and tensor. It is rooted in the cavity method of the statistical physics of disordered systems [39,44] and has been developed recently in the context of statistical inference [12], where in the Bayes-optimal case it has been conjectured to be optimal among all local iterative algorithms. Among the properties that make AMP extremely useful is the fact that its performance can be analyzed in the thermodynamic limit. Indeed, in such a limit its dynamical evolution is described by the so-called state evolution (SE) equations [12]. In this section we derive the AMP equations and their SE description for the spiked matrix-tensor model, and solve them to obtain the phase diagram of the model as a function of the variances ∆2 and ∆p of the two noisy channels.

1. Approximate Message Passing and Bethe free entropy
AMP can be obtained as a relaxed Gaussian closure of the belief propagation (BP) algorithm. The derivation that we present follows the same lines as [10,37]. The posterior probability can be represented as a factor graph where all the variables are represented by circles and are linked to squares representing the interactions [53].
FIG. 6. The factor graph representation of the posterior measure of the matrix-tensor factorization model. The variable nodes, represented with white circles, are the components of the signal, while black squares are factor nodes that denote interactions between the variable nodes appearing in the interaction terms of the Boltzmann distribution in Eqs. (A6)-(A7). There are three types of factor nodes: P_X is the prior, which depends on a single variable; P_Y is the probability of observing a matrix element Y_ij given the values of the variables x_i and x_j; and finally P_T is the probability of observing a tensor element T_{i1,...,ip}. The posterior, apart from the normalization factor, is simply given by the product of all the factor nodes.
This representation is very convenient to write down the BP equations. In the BP algorithm we iteratively update until convergence a set of variables, which are beliefs on the (cavity) magnetization of the nodes. The intuitive reasoning behind how BP works is the following. Given the current state of the variable nodes, take a factor node and exclude one node among its neighbors. The remaining neighbors, through the factor node, express a belief on the state of the excluded node. This belief is mathematically described by a probability distribution called a message, m̂_{ij→i}(x_i) or t̂_{i i2...ip→i}(x_i), depending on which factor node is selected. At the same time, another belief on the state of the excluded node is given by the rest of the network except the factor node previously taken into account, m_{i→ij}(x_i) and t_{i→i i2...ip}(x_i) respectively. All these messages travel on the factor graph carrying partial information on the real magnetization of the single nodes, and they are iterated until convergence. The iterative scheme is described by the following equations,

   m̂_{ij→i}(x_i) ∝ ∫ dx_j P_Y(Y_ij | x_i x_j) m_{j→ij}(x_j) ,   (B1)

   t̂_{i i2...ip→i}(x_i) ∝ ∫ ∏_{a=2}^p dx_{ia} P_T(T_{i i2...ip} | x_i x_{i2} ⋯ x_{ip}) ∏_{a=2}^p t_{ia→i i2...ip}(x_{ia}) ,   (B2)

   m_{i→ij}(x_i) ∝ P_X(x_i) ∏_{k≠i,j} m̂_{ki→i}(x_i) ∏_{(i,k2,...,kp)} t̂_{i k2...kp→i}(x_i) ,   (B3)

   t_{i→i i2...ip}(x_i) ∝ P_X(x_i) ∏_{k≠i} m̂_{ki→i}(x_i) ∏_{(i,k2,...,kp)≠(i,i2,...,ip)} t̂_{i k2...kp→i}(x_i) ,   (B4)

where we have omitted the normalization constants that guarantee that the messages are probability distributions.
When the messages have converged to a fixed point, the estimate of the local magnetizations can be obtained through the computation of the real marginal probability distributions of the variables, given by

   m_i(x_i) ∝ P_X(x_i) ∏_{k≠i} m̂_{ki→i}(x_i) ∏_{(i,k2,...,kp)} t̂_{i k2...kp→i}(x_i) .   (B5)

We note that the computational cost of one iteration of BP scales as O(N^p). Furthermore, Eqs. (B1)-(B4) are iterative equations for continuous functions and are therefore extremely hard to solve when dealing with continuous variables. The advantage of AMP is to reduce drastically the computational complexity of the algorithm by closing the equations on a Gaussian ansatz for the messages. This is justified in the present context since the factor graph is fully connected, and therefore each iteration step of the algorithm involves sums of a large number of independent random variables that give rise to Gaussian distributions. Gaussian random variables are characterized by their mean and covariance, which are readily obtained for N ≫ 1 by expanding the factor nodes for small x_i x_j/√N and x_{i1}⋯x_{ip}/N^{(p−1)/2}. Once the BP equations are relaxed on Gaussian messages, the final step to obtain the AMP algorithm is the so-called TAPyfication procedure [37,39], which exploits the fact that removing one node or one factor produces only a weak perturbation of the real marginals, and can therefore be described in terms of the real marginals of the variable nodes themselves. By applying this scheme we obtain the AMP equations, Eqs. (B6)-(B9), which are written in terms of a set of auxiliary variables A^{(k)} and B^{(k)}_i, with k ∈ {2,p}, and of the mean x̂_i and variance σ_i of the marginals of the variable nodes: the B^{(k)}_i collect the matrix and tensor estimates of the local fields, corrected by Onsager reaction terms, and the A^{(k)} the corresponding variances.

It can be shown that these equations can be obtained as saddle-point equations from the so-called Bethe free entropy, defined as Φ_Bethe = log Z_Bethe(Y,T)/N, where Z_Bethe is the Bethe approximation to the partition function, defined as the normalization of the posterior measure. The expression of the Bethe free entropy per variable can be computed in a standard way (see [53]) in terms of the messages and of a set of normalization factors. Using the Gaussian approximation for the messages, and employing the same TAPyfication procedure used to get the AMP equations, we obtain the Bethe free entropy density, Eq. (B14), written for the sake of compactness in terms of the variables defined in Eqs. (B6)-(B9) and of the function Z(A,B), defined as

   Z(A,B) = ∫ dx P_X(x) e^{Bx − Ax²/2} .   (B15)

2. Averaged free entropy and its proof

Eq. (B14) represents the Bethe free entropy for a single realization of the factor nodes in the large size limit. Here we wish to discuss the actual, exact value of this free entropy, that is,

   f_N = (1/N) log Z(Y,T) ,   (B16)

where the partition function Z(Y,T) is defined as the normalization of the posterior probability distribution, Eq. (2).
The free entropy is a random variable, since it depends a priori on the planted signal and on the noise in the tensor and the matrix. However, since the free entropy is an intensive quantity, one expects from statistical physics intuition that it is self-averaging and concentrates around its mean value in the large N limit [44]. In fact, this is easily proven. First, since the spherical model has a rotational symmetry, the planted assignment could be any vector on the hypersphere, and we might as well suppose it is the uniform one, x*_i = 1 ∀i: the true source of fluctuations comes from the noises Y and T. These can be controlled by noticing that the free entropy is a Lipschitz function of the Gaussian random variables Y and T. Indeed,

   ∂f_N/∂Y_ij = ⟨x_i x_j⟩ / (∆2 N^{3/2}) ,

so that the free entropy f_N is Lipschitz with respect to Y with a constant L satisfying

   L² ≤ Σ_{i<j} ⟨x_i x_j⟩² / (∆2² N³) ≤ E⟨q²⟩ / (2∆2² N) ,

where x̃ represents a copy (or replica) of the system and q = Σ_i x_i x̃_i / N is the overlap between the two replicas x and x̃, which is bounded by one on the sphere, so that L ≤ 1/(∆2√N). Therefore, by Gaussian concentration of Lipschitz functions (the Tsirelson-Ibragimov-Sudakov inequality [54]), we have for some constant K

   P[ |f_N − E f_N| ≥ t ] ≤ 2 exp( −K N ∆2² t² ) ,

and in particular any fluctuation larger than O(1/√N) is (exponentially) rare. A similar computation shows that f_N also concentrates with respect to the tensor T. This shows that in the large size limit we can consider the averaged free entropy

   f̄_N = (1/N) E_{Y,T} log Z(Y,T) .
With our (non-rigorous) statistical physics tools, this can be obtained by averaging Eq. (B14) over the disorder, see for instance [37], and this yields an expression for the free energy called the replica-symmetric (RS) formula,

   lim_{N→∞} f̄_N = sup_{m≥0} Φ̃_RS(m) .   (B17)

We now state precisely the form of Φ̃_RS and prove the validity of Eq. (B17). The RS free entropy for any prior distribution P_X reads

   Φ̃_RS(m) = E_{W,x*} log Z( A(m), A(m) x* + √(A(m)) W ) − m²/(4∆2) − (p−1) m^p/(2p∆p) ,   (B18)

where

   A(m) = m/∆2 + m^{p−1}/∆p ,

W is a Gaussian random variable of zero mean and unit variance, and x* is a random variable taken from the prior P_X. We remind that the function Z(A,B) is defined via Eq. (B15).
For the Gaussian prior P_X, which is the one of interest here, we obtain

   Φ̃_RS(m) = (1/2) [ A(m) − log(1 + A(m)) ] − m²/(4∆2) − (p−1) m^p/(2p∆p) .   (B19)

The expression given in the main text is slightly different but can be obtained as follows. First notice that the extremization condition for Φ̃_RS(m) reads

   m = A(m)/(1 + A(m)) = ( m/∆2 + m^{p−1}/∆p ) / ( 1 + m/∆2 + m^{p−1}/∆p ) ,   (B20)

and by plugging this expression into Eq. (B19) we recover the more compact expression Φ_RS(m) shown in the main text,

   Φ_RS(m) = (1/2) [ m + log(1−m) + m²/(2∆2) + m^p/(p∆p) ] .   (B21)

The two expressions Φ_RS(m) and Φ̃_RS(m) are thus equal for each value of m that satisfies Eq. (B20). The parameter m can be interpreted as the average correlation between the true and the estimated signal,

   m = lim_{N→∞} (1/N) E ⟨x · x*⟩ .   (B22)

The average minimal mean squared error (MMSE) can be obtained from the maximizer m* of the averaged Bethe free entropy as

   MMSE = 1 − m* ,   (B23)

where the average is over the signal x* and the noise of the two Gaussian channels. The validity of Eq. (B18) can be proven rigorously for every prior having a bounded second moment. The proof we shall present is a straightforward generalization of the one presented in [10] for the pure tensor case and in [38] for the matrix case, and it is based on two main ingredients. The first one is the Guerra interpolation method applied on the Nishimori line [38,55,56], in which we construct an interpolating Hamiltonian depending on a parameter t ∈ [0,1] that is used to move from the original Hamiltonian of Eq. (A6) to the one corresponding to a scalar denoising problem, whose free entropy is given by the first term in Eq. (B18). The second ingredient is the Aizenman-Sims-Starr method [57], which is the mathematical version of the cavity method (note that other techniques could also be employed to obtain the same results, see [9,58-60]). The theorem we want to prove is:

Theorem 1 (Replica-symmetric formula for the free energy). Let P_X be a probability distribution over R with finite second moment Σ_X. Then, for all ∆2 > 0 and ∆p > 0,

   lim_{N→∞} f̄_N = sup_{m≥0} Φ̃_RS(m) .

For almost every ∆2 > 0 and ∆p > 0, Φ̃_RS admits a unique maximizer m* over R+, and

   lim_{N→∞} T-MMSE_N = 1 − (m*)^p ,   lim_{N→∞} M-MMSE_N = 1 − (m*)² .

Here, we have defined the tensor-MMSE T-MMSE_N by the (suitably normalized) error in reconstructing the tensor,

   T-MMSE_N = inf_θ Σ_{i1<...<ip} E[ ( x*_{i1} ⋯ x*_{ip} − θ_{i1,...,ip}(Y,T) )² ] / Σ_{i1<...<ip} E[ ( x*_{i1} ⋯ x*_{ip} )² ] ,

and the matrix-MMSE M-MMSE_N analogously by the normalized error in reconstructing the matrix x*_i x*_j, where in both cases the infimum is taken over all measurable functions θ of the observations. The result concerning the MMSE is a simple application of the I-MMSE theorem [61], which relates the derivative of the free energy with respect to the noise variances to the MMSE. The details of the arguments are the same as in the matrix (p = 2) case ([38], Corollary 17) and in the tensor one ([10], Theorem 2). Indeed, as discussed in [10,38], these M-MMSE and T-MMSE results imply the vector-MMSE result of Eq. (B23) when p is odd, and thus in particular for the p = 3 case discussed in the main text.
a. Sketch of proof. In this section we give a detailed sketch of the proof of Theorem 1. Following the techniques used in many recent works [8-10, 38, 55, 56, 58, 59], we shall make a few technical remarks:

• We will consider only priors with bounded support, supp(P_X) = S ⊂ [−K; K]. This allows us to switch integrals and derivatives without worries. This condition can then be relaxed to unbounded distributions with bounded second moment using the same techniques as the ones that we are going to present, and the proof is therefore valid in this case as well. This is detailed for instance in [38], Sec. 6.2.2.
• Another key ingredient is the introduction of a small perturbation of the model that takes the form of a small amount of side information. This kind of technique is frequently used in statistical physics, where a small "magnetic field" forces the Gibbs measure into a single pure state [62]. It has also been used in the context of coding theory [63] for the same reason. In the context of Bayesian inference, we follow the generic scheme proposed by Montanari in [64] (see also [65]) and add a small additional source of information that allows the system to be in a single pure state, so that the overlap concentrates on a single value. This source depends on i.i.d. Bernoulli random variables of parameter ε: when such a variable equals one, the side channel, call it A, transmits the correct information x*_i. We can then consider the posterior of this new problem, P(X|A,Y,T), and focus on the associated free energy density F_{N,ε}, defined as the expected value of the logarithm of the normalization constant divided by the number of spins. Then we can immediately prove that for all N ≥ 1 and ε, ε' ∈ [0,1],

   |F_{N,ε} − F_{N,ε'}| ≤ K |ε − ε'| ,

for some constant K. This allows (see for instance [10]) to obtain the concentration of the posterior distribution around the replica parameter,

   lim_{N→∞} E ⟨ ( x^{(1)} · x^{(2)} / N − E⟨q⟩ )² ⟩ = 0 ,

where x, x^{(1)}, x^{(2)} are sampled from the posterior distribution, and the averages ⟨•⟩ and E[•] are respectively the average over the posterior measure and over the remaining random variables.
• Finally, a fundamental property of inference problems, which is a direct consequence of the Bayes theorem and of the fact that we are in the Bayes-optimal setting where we know the statistical properties of the signal (namely the prior) and of the channels (namely ∆2 and ∆p), is the so-called Nishimori symmetry [6,66]: Let (X,Y) be a couple of random variables on a Polish space. Let k ≥ 1 and let X^{(1)}, ..., X^{(k)} be k i.i.d. samples (given Y) from the distribution P(X = • | Y), independently of every other random variable. Let us denote ⟨•⟩ the expectation with respect to P(X = • | Y) and E the expectation with respect to (X,Y). Then, for all continuous bounded functions f,

   E⟨ f(Y, X^{(1)}, ..., X^{(k)}) ⟩ = E⟨ f(Y, X^{(1)}, ..., X^{(k−1)}, X) ⟩ .
While the consequences of this identity are important, the proof is rather simple: it is equivalent to sample the couple (X,Y) according to its joint distribution, or to sample first Y according to its marginal distribution and then to sample X conditionally on Y from its conditional distribution P(X = • | Y). Thus the (k+1)-tuple (Y, X^{(1)}, ..., X^{(k)}) is equal in law to (Y, X^{(1)}, ..., X^{(k−1)}, X).
The proof of Theorem 1 is obtained by using the Guerra interpolation technique to prove a lower bound for the free entropy, and then by applying the Aizenman-Sims-Starr scheme to get a matching upper bound.

b. Lower bound: Guerra interpolation. We now move to the core of the proof. The first part combines the Guerra interpolation method [67], developed for matrices in [56] and for tensors in [10].
Consider an interpolating Hamiltonian H_t(x) depending on t ∈ [0,1], built so that at t = 1 one recovers the regular Hamiltonian of Eq. (A6), while at t = 0 one recovers the scalar denoising problem corresponding to the first term of Eq. (B18), namely the one-body Hamiltonian with effective signal-to-noise ratio A(m),

   H_0(x) = − Σ_j [ √(A(m)) W_j x_j + A(m) x*_j x_j − A(m) x_j²/2 ] ,

where the W_j are i.i.d. canonical Gaussian variables; at intermediate t, the matrix and tensor channels carry a fraction t of their signal-to-noise ratio, and the scalar channel a fraction 1−t. More importantly, for all t ∈ [0,1] the interpolating Hamiltonian can be seen as the one emerging from an appropriate inference problem, so that the Nishimori property remains valid for generic t ∈ [0,1] [56].
Given the interpolating Hamiltonian, we can write the corresponding Gibbs measure and the interpolating free entropy ψ_N(t) = (1/N) E log ∫ dP_X(x) e^{−H_t(x)}, whose boundary values are ψ_N(1) = f̄_N (our target) and ψ_N(0) = Φ̃_RS(m) + m²/(4∆2) + (p−1)m^p/(2p∆p). We then use the fundamental theorem of calculus to write

   f̄_N = ψ_N(0) + ∫_0^1 dt ψ'_N(t) .

We work on the derivative term and use Stein's lemma, which, for a well-behaved function g and a canonical Gaussian variable Z, provides the useful relation E[Z g(Z)] = E[g'(Z)]. Denoting Q = x·x*/N the overlap with the signal, this yields

   ψ'_N(t) = (1/(4∆2)) E⟨ (Q − m)² ⟩ + (1/(2p∆p)) E⟨ Q^p − m^p − p m^{p−1}(Q − m) ⟩ + o_N(1) ,

where we have used the Nishimori property to replace terms such as ⟨x⟩² by ⟨x x*⟩. At this point, the first term is clearly non-negative. The second one, however, seems harder to estimate. We may use a simple convexity argument on the function f(x) = x^p. Indeed, observe that for all a, b ≥ 0 and p ≥ 1,

   a^p − b^p − p b^{p−1}(a − b) ≥ 0 .

We would like to use this property, but there is the subtlety that we need x·x* to be non-negative. To bypass this problem we can add again a small perturbation that forces x·x* to concentrate around a non-negative value, without affecting the interpolating free entropy ψ_N(t) in the N → ∞ limit. This is, again, the argument used in [10] and originally in [55]. In this way we obtain ψ'_N(t) ≥ o_N(1), which concludes this part and yields the lower bound

   lim inf_{N→∞} f̄_N ≥ Φ̃_RS(m)  for all m ≥ 0 ,  hence  lim inf_{N→∞} f̄_N ≥ sup_{m≥0} Φ̃_RS(m) .

c. Upper bound: Aizenman-Sims-Starr scheme. The matching upper bound is obtained using the Aizenman-Sims-Starr scheme [57]. This is a particularly effective tool that has already been used for these problems, see for example [10,38,65]. The method goes as follows. Consider the original system with N variables and Hamiltonian H_N, and add a new variable x_0, so that we get a Hamiltonian H_{N+1}. Define the Gibbs measures of the two systems, the first with N variables and the second with N+1 variables, consider the two corresponding free entropies, and call A_N = E log(Z_{N+1}/Z_N) their difference. First, we notice that lim sup_N F_N ≤ lim sup_N A_N, because N F_N can be written as a telescoping sum of the A_n. Moreover, we can separate the contribution of the additional variable in the Hamiltonian H_{N+1}, so that H_{N+1}(x, x_0) = H̃_N(x) + x_0 z(x) + x_0² s(x), with x = (x_1, ..., x_N), where H̃_N is the same expression as Eq. (A6) with the N in the denominators replaced by N+1, z(x) collects the Gaussian fields acting linearly on x_0, and s(x) the lower-order terms acting quadratically. We also rewrite H_N(x) as a perturbation of H̃_N: H_N(x) = H̃_N(x) + y(x) + O(1), where y(x) is again a Gaussian field; the covariances of these fields are, up to o(1) terms,

   E[ z(x) z(x') ] = Q'(q_{x,x'}) ,   E[ y(x) y(x') ] = q_{x,x'} Q'(q_{x,x'}) − Q(q_{x,x'}) ,

with q_{x,x'} = x·x'/N and Q as in the main text. Finally, we can observe that the partition functions Z_N can be interpreted as ensemble averages with respect to H̃_N. Thus

   A_N = E log ⟨ ∫ dP_X(x_0) e^{x_0 z(x) + x_0² s(x)} ⟩_{H̃_N} − E log ⟨ e^{y(x)} ⟩_{H̃_N} .

Now, using the Nishimori property and the concentration of the overlap around a non-negative value, which we denote m(Y,T) since it depends explicitly on the disorder, one finds (see [38], Section 4.3, for details) that A_N approaches Φ̃_RS evaluated at m(Y,T) in the thermodynamic limit. From this we can now obtain the upper bound that concludes the proof,

   lim sup_{N→∞} f̄_N ≤ sup_{m≥0} Φ̃_RS(m) .

FIG. 7. Phase diagram of the spiked matrix-tensor model for p = 3. The lines correspond to different phase transitions, namely the stability threshold (dashed black), the information-theoretic threshold (solid red), the algorithmic threshold (solid cyan), and the dynamical threshold (dotted orange). The vertical cuts represent the sections along which the magnetization is plotted in Fig. 9.

FIG. 8. Phase diagram of the spiked matrix-tensor model for p = 4. The main difference with respect to the case p = 3 is that the algorithmic spinodal (solid cyan) is strictly above the stability threshold (dashed black). The hybrid-hard phase appears between these two lines (combined green and orange color). The vertical cuts represent the sections along which the magnetization is plotted in Fig. 10.
3. State evolution

The dynamical evolution of the AMP algorithm in the large N limit is described by the so-called state evolution (SE) equations. The derivation of these equations can be done straightforwardly using the same techniques as developed in [37]. They can be written in terms of two dynamical order parameters, namely m^t = Σ_i x̂_i^t x*_i / N, which encodes the alignment of the current estimate x̂_i^t of the components of the signal with the signal itself at iteration t, and q^t = Σ_i x̂_i^t x̂_i^t / N. Keeping the spherical constraint in mind, we obtain the SE equation

   m^{t+1} = (1 − q^t) [ m^t/∆2 + (m^t)^{p−1}/∆p ] ,   (B35)

together with an analogous equation for q^t. Note that Eq. (B35) at fixed value of q describes the evolution of the parameter m; this is why we use it in the main text to derive the Langevin threshold, Eq. (15). Finally, using the Nishimori symmetry it can be shown that m^t = q^t at all times, see e.g. [6], and therefore the evolution of the algorithm is characterized by a single order parameter m^t whose dynamical evolution is given by

   m^{t+1} = (1 − m^t) [ m^t/∆2 + (m^t)^{p−1}/∆p ] .   (B37)

Note that AMP satisfies the Nishimori property at all times, while this condition is violated on the run by the Langevin dynamics. In that case the Nishimori symmetry is recovered only when equilibrium is reached, and it is therefore violated when the Langevin algorithm gets trapped in the glass phase, see below. If we initialize the configuration of the estimator x̂ at random, the initial value of m will be equal to zero on average. However, finite-size fluctuations produce by chance a small bias towards the signal, and therefore we consider the initialization m^{t=0} = ε, with ε an arbitrarily small positive number. We call m_AMP the fixed point of Eq. (B37) reached from this infinitesimal initialization. The mean squared error reached by AMP after convergence is then given by MSE_AMP = 1 − m_AMP. We underline that Eq. (B37) can be proven rigorously following [18,40]. Finally, we note that the fixed points of the SE satisfy the very same Eq. (B20) that gives the replica free entropy. In the rest of this section we study the fixed points of Eq. (B37); this allows to determine the phase diagram of the spiked matrix-tensor model.
We start by observing that m = 0 is a fixed point of Eq. (B37). However, in order to understand whether it is a possible attractor of the AMP dynamics we need to study its local stability. This can be obtained perturbatively by expanding Eq. (B37) around m = 0,

   m^{t+1} = m^t/∆2 + O((m^t)²) .   (B38)

It is clear that the non-informative fixed point m = 0 is stable as long as ∆2 > 1. We will call ∆2 = 1 the stability threshold.
When p = 3 the SE equations are particularly simple and the fixed points can be written explicitly: besides m_0 = 0, the nonzero fixed points solve ∆2 m² − (∆2 − ∆3) m + ∆3(∆2 − 1) = 0, i.e.

   m_± = [ (∆2 − ∆3) ± √( (∆2 − ∆3)² − 4∆2∆3(∆2 − 1) ) ] / (2∆2) .   (B39)

In the regime where ∆2 > 1, m_0 and m_+ are stable while m_− is unstable. When ∆2 becomes smaller than one, m_+ becomes the only non-negative stable solution, and therefore ∆2 = 1 is also known as the algorithmic spinodal, since it corresponds to the point where the AMP algorithm converges to the informative fixed point. The informative solution m_+ exists as long as the discriminant in Eq. (B39) is non-negative; the dynamical spinodal ∆2^dyn is defined by its vanishing,

   (∆2 − ∆3)² = 4∆2∆3(∆2 − 1) .   (B40)

For generic p we cannot determine the informative fixed points explicitly, but we can easily study Eq. (B37) numerically to get the full phase diagram.
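The explicit p = 3 fixed points are easy to check numerically against the iteration of Eq. (B37) (a minimal sketch with our own example values):

import numpy as np

def fixed_points_p3(delta2, delta3):
    # Nonzero fixed points of Eq. (B37) for p = 3, roots of
    # delta2 m^2 - (delta2 - delta3) m + delta3 (delta2 - 1) = 0, Eq. (B39).
    roots = np.roots([delta2, -(delta2 - delta3), delta3 * (delta2 - 1)])
    return sorted(r.real for r in roots if abs(r.imag) < 1e-12)

print(fixed_points_p3(0.7, 1.0))   # informative branch m_+ ~ 0.47

# Consistency check against direct iteration of the state evolution (B37):
m = 0.9
for _ in range(2000):
    m = (1 - m) * (m / 0.7 + m**2 / 1.0)
print(m)                            # converges to the same m_+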
Furthermore, we can obtain the spinodal transition lines as follows. The key observation is that the two spinodals are stationary points of the function ∆p(m; ∆2), where ∆2 is fixed, or analogously of ∆2(m; ∆p), where ∆p is fixed (for a pictorial representation of the idea see Fig. 10). Writing the fixed-point condition (B20) as 1 = (1−m)(1/∆2 + m^{p−2}/∆p) and solving for ∆p yields

   ∆p(m; ∆2) = ∆2 (1−m) m^{p−2} / (∆2 − 1 + m) .   (B41)

The stationary points m_±(∆2) are then implicitly defined by

   ∂∆p(m; ∆2)/∂m = 0 ,   (B42)

and ∆p(m_±(∆2); ∆2) describes the two spinodals. We can also derive the tricritical point, where the two spinodals meet, from the zero-discriminant condition on Eq. (B42).

4. Phase diagrams of the spiked matrix-tensor model
In this section we present the phase diagrams of the spiked matrix-tensor model as a function of the two noise levels ∆2 and ∆p, for several values of p. These phase diagrams are plotted in Figs. 7 and 8. Generically we can have four regions:
• Easy phase (green), where the MSE obtained through AMP coincides with the MMSE, which is better than random sampling from the prior.
• Impossible phase (red), where the MMSE and the MSE of AMP coincide and are equal to 1 (meaning that m* = m_AMP = 0).
• Hard phase (orange), where the MMSE is smaller than the MSE obtained from AMP and m* > m_AMP ≥ 0.
• Hybrid-hard phase [68] (mix of green and orange), a part of the hard phase where the AMP performance is strictly better than random sampling from the prior, but the MSE obtained this way still does not match the MMSE, i.e. m* > m_AMP > 0. The hybrid-hard phase can be found for p ≥ 4.
All these phases are separated by the following transition lines:
• The stability threshold (dashed black line) at ∆2 = 1 for all p. This corresponds to the point where the uninformative fixed point m = 0 loses its local stability.
• The information-theoretic threshold (solid red line). Here m* > 0 and the MMSE jumps to a value strictly smaller than one.
• The algorithmic threshold (solid cyan line). This is where the fixed point of AMP jumps to the MMSE < 1. For p = 3 this line coincides with a segment of the stability threshold, while for p ≥ 4 it is strictly above it.
• The dynamical threshold (dotted orange line). Here the most informative fixed point (the one with largest m_AMP) disappears.
In Figs. 9 and 10 we plot the evolution of the magnetization m, as found through the fixed points of the SE equation, for several fixed values of ∆p and for p = 3 and p = 4, respectively. The values of ∆p are identified by the vertical cuts in the phase diagrams of Figs. 7 and 8. The situation for p = 4 is qualitatively similar to that of Fig. 9, the difference being only the presence of the hybrid-hard phase. We can observe that when the transition is discontinuous, panels (a) to (e), for 1/∆2 > 1.0 the uninformative solution becomes unstable and goes over continuously to a stable informative solution, which is not the optimal one.
The typical behaviour of the Langevin dynamics is characterized by the correlation function C(t, t′) = lim_{N→∞} Σ_i x_i(t)x_i(t′)/N, the response function R(t, t′) = lim_{N→∞} Σ_i ⟨δx_i(t)/δη_i(t′)⟩/N, and the magnetization C(t) = lim_{N→∞} Σ_i x_i(t)x*_i/N. To obtain the equations for these order parameters we use the techniques developed in the context of mean-field spin glass systems [44,69]. We call η_i(t) a time-dependent noise and we indicate with ⟨•⟩ the average with respect to it. The noise is Gaussian and characterized by ⟨η_i(t)⟩ = 0 for all t and i = 1, . . . , N, and ⟨η_i(t)η_j(t′)⟩ = 2δ_ij δ(t − t′). As before we will denote by E[. . .] the average with respect to the realization of the disorder, which in this case amounts to the specific realization of the signal.
Before proceeding, it is useful to introduce a set of auxiliary variables that will be of help in the following. For k ∈ {2, p} we define a time-dependent factor T_k(t) that multiplies the signal part of the corresponding channel, built from products of the x*_i, and the random variables ξ_{i_1...i_k} that collect the noise part of the observations.
The time dependence in T_k will be used in the tensor-annealing protocol that allows to avoid part of the Langevin hard phase. We introduce a time-dependent Hamiltonian and the associated Langevin dynamics, with µ(t) a Lagrange multiplier that enforces the spherical constraint. When T_k(t) ≡ 1 for all k = 2, p, the stationary equilibrium distribution of the Langevin dynamics is given by the posterior measure. Using Itô's lemma one finds an evolution equation for |x|²; since the spherical constraint imposes the left-hand side to be zero, one obtains a condition on the right-hand side. By plugging the expression (C1) into it, one gets that in the large-N limit µ(t) is expressed in terms of the parts of the Hamiltonian defined in Eq. (C1) relative to the matrix (k = 2) and to the tensor (k = p). Note that we have not specified any initial condition for the variables x_i(t = 0). Therefore, since we always employ the spherical constraint, the initial condition for the dynamics is a point on the N-dimensional hypersphere |x|² = N extracted with the flat measure.
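For concreteness, the following minimal finite-N sketch (Python) implements the Langevin algorithm for the 2 + 3 model. The normalization of the planted observations Y and T used here is a common convention chosen for illustration only; the precise prefactors of Eq. (C1) should be substituted for quantitative comparison with the LSE.

```python
import numpy as np

rng = np.random.default_rng(0)
N, delta2, deltap = 50, 0.7, 1.0
dt, steps = 1e-2, 5_000

# Planted signal on the sphere |x|^2 = N.
xs = rng.standard_normal(N)
xs *= np.sqrt(N) / np.linalg.norm(xs)

# Spiked observations (illustrative normalization, not the appendix's exact one).
Y = np.outer(xs, xs) / np.sqrt(N) + np.sqrt(delta2) * rng.standard_normal((N, N))
Y = (Y + Y.T) / 2.0
T = (np.sqrt(2.0) / N) * np.einsum('i,j,k->ijk', xs, xs, xs) \
    + np.sqrt(deltap) * rng.standard_normal((N, N, N))

# Langevin dynamics on the log-posterior with a spherical Lagrange multiplier.
x = rng.standard_normal(N)
x *= np.sqrt(N) / np.linalg.norm(x)
for step in range(steps):
    force = Y @ x / (np.sqrt(N) * delta2) \
          + (np.sqrt(2.0) / N) * np.einsum('ijk,j,k->i', T, x, x) / deltap
    mu = x @ force / N                       # keeps d|x|^2/dt = 0 on average
    x += dt * (force - mu * x) + np.sqrt(2.0 * dt) * rng.standard_normal(N)
    x *= np.sqrt(N) / np.linalg.norm(x)      # hard re-projection on the sphere
    if step % 500 == 0:
        print(f"t = {step * dt:6.1f}   C(t) = {x @ xs / N:+.3f}")
```

At these small sizes the overlap is dominated by finite-N fluctuations; the LSE equations describe the N → ∞ limit of exactly this kind of trajectory.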
In order to analyze the Langevin dynamics in the large-N limit we use the dynamical cavity method [44,70,71]. We consider a system of N variables, with N ≫ 1, and add a new one. This new variable is treated as a small perturbation to the original system but, at the same time, is treated self-consistently.

Dynamical Mean-Field Equations
In the following we will drop the time dependence for simplicity, restoring it only when needed. Given the system with N variables i = 1, . . . , N, we add a new one, say i = 0 (henceforth we use the symbol ≃ to denote two quantities that are equal up to terms that vanish in the large-N limit). The Langevin equation associated to the new variable follows from the Hamiltonian, where we used that N ≃ N + 1 for N ≫ 1. We will consider the contribution of the new variable on the others in perturbation theory. In the dynamical equations for the variables i = 1, . . . , N we can isolate the contribution of the variable i = 0. Consider the unperturbed variables x⁰_i = x_i|_{H_0=0}, obtained by switching off the coupling to the new variable; at leading order in N we can expand x_i around x⁰_i. In the dynamical equation for the variable 0 we can then identify a piece associated to the unperturbed variables x⁰_i. This term can be thought of collectively as a stochastic term Ξ(t): indeed Ξ(t) encodes the effect of a kind of bath, made of the unperturbed variables i = 1, . . . , N, on the new one. We can show that at leading order in N, Ξ(t) is a Gaussian noise with zero mean; its variance contains a second term that can be simplified using Σ_{(i_1,...,i_k)} = (1/k!) Σ_{1≤i_1,...,i_k≤N}, neglecting terms sub-leading in N, and using the definition of the dynamical correlation function ⟨x_i(t)x_i(t′)⟩.
This fixes the statistics of Ξ(t). We can now focus on the deterministic term coming from the first-order perturbation in (C8). Consider just the integral for the p-body term (the 2-body one is obtained by setting p = 2); using the definition of the response function and plugging (C11) into (C8), we obtain an effective dynamical equation for the new variable in terms of the correlation and response functions of the system with N variables. In order to close Eq. (C12) we need a recipe to compute the correlation and response functions.
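Although the explicit displays are given in Eqs. (C8)-(C12), it may help to record the expected structure of the resulting effective process. The following is a hedged sketch based on standard dynamical mean-field theory for mixed p-spin models, written with f_k(q) = q^k/k as recalled later in this appendix; the precise prefactors and the placement of the T_k factors should be checked against Eq. (C12).

```latex
% Sketch of the effective (cavity) process for x_0(t); prefactors indicative only.
\begin{align}
  \dot{x}_0(t) &\simeq -\mu(t)\,x_0(t)
  + \sum_{k=2,p}\frac{T_k(t)}{\Delta_k}\,f'_k\big(C(t)\big)\,x^*_0
  + \int_0^t dt''\,\sum_{k=2,p}\frac{f''_k\big(C(t,t'')\big)}{\Delta_k}\,R(t,t'')\,x_0(t'')
  + \Xi(t),\\
  \langle\Xi(t)\rangle &= 0,\qquad
  \langle\Xi(t)\,\Xi(t')\rangle = 2\,\delta(t-t')
  + \sum_{k=2,p}\frac{f'_k\big(C(t,t')\big)}{\Delta_k}.
\end{align}
```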

Integro-differential equations
In order to obtain the final equations for the dynamical order parameters we assume that the new variable x_0 is a typical one, namely that it has the same statistical nature as all the others. Therefore Eqs. (C13) give a way to obtain the equations for all the correlation functions. Indeed we can consider Eq. (C12), multiply it by x_0(t′), or differentiate it with respect to an external field h_0(t′), or multiply it by x*_0, and average the results over the disorder and the thermal noise. Using the identity ⟨x_0(t)η_0(t′)⟩ = 2R(t, t′), which follows from the Gaussian statistics of the noise, we get the Langevin State Evolution (LSE) equations. Note that the last equation, for µ(t), is obtained by imposing the spherical constraint C(t, t) = 1 ∀t, using the fact that 0 = dC(t, t)/dt = [∂_t C(t, t′) + ∂_{t′} C(t, t′)]_{t′→t}. The boundary conditions of these equations are: C(t, t) = 1 (the spherical constraint), R(t, t) = 0 (which comes from causality in the Itô convention), and R(t, t′ → t⁻) = 1. The initial condition C(0) = C_0 is the overlap of the initial configuration with the true signal. If the initial configuration is random, C_0 = 0 but with finite-size fluctuations, as in the case of AMP. Therefore we can take C_0 = ε, with ε an arbitrarily small positive number.
The set of equations derived above presents six types of integrals. These are discretized with trapezoidal averages of the form ½[A(t_i, t_l) + A(t_i, t_l − δt)] combined with discrete increments [B(t_i, t_l) − B(t_i, t_l − δt)], and analogous expressions are used for computing the update of the response function, Eq. (C16). Analogously we define the other integrators. A simple causal integration scheme, being careful with the Itô prescription, gives the pseudo-code below.

FIG. 11. Representation of the initialization and the first two iterations for the evaluation of a two-time observable using the dynamic-grid algorithm. The empty circles represent slots allocated in memory but not yet associated to any specific value, while the full circles are memory slots already assigned. For any two-time function, the algorithm first allocates the memory (a), then it fills half of the grid by linear propagation (b). Still using linear propagation it fills the slots with t − t′ ≪ 1 (c), and it sets the other values by imposing self-consistency (d). Finally it halves the grid (e), doubles the time step and allocates the memory (f). Then the algorithm loops following the same scheme as in (b-c-d-e).
The numerical scheme we are going to discuss is presented in the Bayes-optimal case where T_2(t) ≡ T_p(t) ≡ 1. The derivation we propose can be easily generalized to the case where the T_k assume different, but constant, values. Therefore we do not employ this algorithm to solve the LSE equations in the annealing protocol, for which we use instead the fixed time-grid algorithm. It is convenient to manipulate the equations to obtain an equivalent set of equations for the functions C(t, t′) and Q(t, t′), where Q(t, t′) represents the deviation from the fluctuation-dissipation theorem (FDT) at time t starting from time t′. Indeed, when FDT holds, R(t, t′) = −∂_t C(t, t′).
We briefly anticipate the strategy that the algorithm uses to solve the equations. The algorithm discretizes the times into N_t intervals. First, starting from the boundary conditions C(t, t) = 1, Q(t, t) = 0 and C(0) = C_0 ∈ [0, 1], it fills the grid for small times (or small time differences τ = t − t′ ≪ 1) using linear propagation. Given a time t and the initial guess for the Lagrange multiplier obtained by the linear propagator, the integrals are discretized and evaluated, and the result is used to update the value of the Lagrange multiplier. This procedure is repeated iteratively until convergence. Once the first grid is filled, a coarse-graining procedure follows, in which the size of the time intervals is doubled and only half of the information is retained. This procedure is repeated for a fixed number of doublings of the original grid. The doubling scheme allows one to explore exponentially long times at the cost of losing part of the information; the direct consequence is a loss of stability at very large times (especially when the functions C(t, t′), R(t, t′), C(t) undergo fast changes at large times).
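To fix ideas, here is a toy sketch (Python) of the coarse-graining bookkeeping alone; grid sizes and names are illustrative, and the linear-propagation and self-consistency steps that re-fill the freed half of the grid before the next doubling are omitted.

```python
import numpy as np

def decimate(grids, dt):
    """One doubling step of the dynamic grid: retain every second time slice
    in both directions (half of the information) and double the time step."""
    out = {}
    for name, g in grids.items():
        coarse = g[::2, ::2]
        fresh = np.full_like(g, np.nan)   # freed slots, to be re-filled later
        n = coarse.shape[0]
        fresh[:n, :n] = coarse
        out[name] = fresh
    return out, 2.0 * dt

# Two two-time functions on an Nt x Nt grid with their boundary conditions.
Nt, dt = 8, 0.05
grids = {"C": np.eye(Nt), "Q": np.zeros((Nt, Nt))}
for _ in range(3):                        # fixed number of doublings
    grids, dt = decimate(grids, dt)
    filled = np.count_nonzero(~np.isnan(grids["C"]))
    print(f"dt = {dt:.2f}, retained slots = {filled}/{Nt * Nt}")
```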
a. Dynamical equations in the algorithm. We recall the function f_k(x) = x^k/k and its derivatives f′_k(x) = x^(k−1) and f″_k(x) = (k − 1)x^(k−2). Following the lines of [72], we introduce the FDT-violation function Q(t, t′) and, after some manipulation, the system can be rewritten in terms of C(t, t′) and Q(t, t′); further simplifications are obtained by introducing the auxiliary kernels defined in Eq. (D8). b. First-order expansion coefficients. In the numerics we initialize the grid by a linear propagation of the initial conditions. To determine the coefficients to use, we expand the functions up to second order for small values of τ (and, in the last equation, of t); this gives the coefficients of Eq. (D9). In particular, the six integrals, Eqs. (D10)-(D12), carry superscripts (v) and (h) that denote the vertical (t′) and horizontal (t) derivatives on the discretized time grid; see Fig. 11 for an intuitive picture. We also discretize the derivatives using the last two time steps. Given the time indices i and j, we define and evaluate these discretized quantities, together with their respective vertical and horizontal derivatives, and the original dynamical equations are then integrated according to the causal scheme described above.
a. Cross-checking using the fixed-grid algorithm. For short times the dynamical equations were solved using the fixed-grid algorithm and compared with the outcome of the dynamic-grid algorithm, obtaining the same results; see Fig. 12. In the figure we used the fixed grid with t_max = 100 and ∆t = 6.25 × 10⁻³.
b. Same magnetization in the easy region.
In the impossible and easy regions, the overlap with the signal of both AMP and the dynamic-grid integration converges to the same value. In Fig. 13 we show the overlap obtained with AMP (black dashed line) and the overlap achieved by the integration scheme at a given time. We can see that the overlap with the signal obtained by solving the LSE equations converges to the same value as the fixed point of AMP. At fixed ∆2, we observe that the time to convergence increases very rapidly as we decrease ∆p. We fitted this increase of the relaxation time to get the boundary of the Langevin hard region. c. Dynamical transition. The dynamical transition, where the finite-magnetization fixed point disappears, can be regarded as a clustering or dynamical glass transition. Indeed, coming from the impossible phase towards the hard phase, at the dynamical transition the free energy landscape changes and the unique ergodic paramagnetic minimum of the impossible phase gets clustered into an exponential number of metastable glassy states (see Sec. E). Correspondingly, the relaxation time of the Langevin algorithm diverges. Fitting this divergence with a power law we obtain an alternative estimate of the dynamical line. In the right panel of Fig. 14 we plot with yellow points the dynamical transition line as extracted from the fit of the relaxation time of the Langevin algorithm, coming from the impossible phase and entering the hard phase.

Extrapolation procedure
In order to determine the Langevin hard region, given a fixed value of ∆p (or ∆2), we measure the time it takes to relax to equilibrium. On approaching the Langevin hard region this relaxation time increases, and we extrapolate the growth to obtain the critical ∆*p (respectively ∆*2) where the relaxation time appears to diverge. The extrapolation is done starting from C_0 = 10⁻⁴⁰ and assuming a power-law divergence; a minimal fitting sketch is given below. Fig. 14 shows the results of this procedure for the cases 2 + 3 and 2 + 4. We remark that the divergence times increase as p increases. Therefore, owing to the instability of the code at large times, it becomes difficult to extrapolate the threshold accurately. In particular, in the right panel of Fig. 14, when estimating the threshold for 2 + 4 we consider horizontal sections, and the points extrapolated for ∆2 close to the threshold ∆2 = 1.0 are very hard to estimate because of these instabilities.
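A minimal version of the fit (Python), with hypothetical relaxation-time data standing in for the actual LSE measurements:

```python
import numpy as np
from scipy.optimize import curve_fit

def tau_law(delta_p, delta_star, A, gamma):
    """Assumed power-law divergence of the relaxation time at delta_star."""
    return A * (delta_p - delta_star) ** (-gamma)

# Hypothetical relaxation times along a vertical cut (fixed Delta_2); in
# practice these come from the LSE solution started at C_0 = 1e-40.
delta_p = np.array([1.30, 1.20, 1.12, 1.06, 1.02])
tau = np.array([18.0, 31.0, 55.0, 120.0, 310.0])

(d_star, A, gamma), _ = curve_fit(
    tau_law, delta_p, tau, p0=(0.95, 10.0, 1.0),
    bounds=([0.0, 0.0, 0.1], [1.01, np.inf, 5.0]),  # keep delta_p - delta_star > 0
)
print(f"extrapolated threshold Delta_p* = {d_star:.3f}, exponent gamma = {gamma:.2f}")
```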
a. Numerical checks on the extrapolation procedure. To test the quality of the fits we use a similar numerical procedure to locate the spinodal of the informative solution, i.e. the points where the informative solution ceases to exist. This spinodal must be the same for both the AMP and the Langevin algorithm [6].
Since we aim at studying the spinodal of the informative solution, we initialize the LSE with C_0 = 1 and let it relax, measuring the time it takes to equilibrate at the value of C given by the informative fixed point of AMP. We do this fixing ∆2 and changing ∆p. As we approach the critical ∆_{p,dyn} the relaxation time diverges, and we fit this divergence with a power law. The dynamic threshold extracted in this way is finally compared with the one obtained from AMP. In Fig. 17 we show how this scheme has been applied for ∆2 ∈ {1.01, 1.05, 1.10, 2.00}. As we get closer to the critical line ∆2 = 1 the relaxation time increases (and the height of the plateau decreases), making the fit harder. All in all, we observe a very good agreement between the points found with these extrapolation procedures and the prediction obtained with AMP, as shown in Fig. 14.

Initial conditions
The LSE equations show a rather strong dependence on the initial condition C_0. A low initial magnetization gives a low initial momentum in the direction of the signal, as can be observed in the linear expansion of C in Eq. (D9), and consequently the system will not be able to cross even very small barriers. The direct consequence is that, on reducing the initial magnetization, the estimated threshold gets worse, in the sense that a larger signal-to-noise ratio is required to find the solution. This finding can be observed in Fig. 15, where different initial conditions are compared on the section ∆p = 0.9. Finally we remark how the different initial conditions affect the phase diagram: in Fig. 16 we compare the Langevin hard-easy threshold evaluated starting from different initial conditions, showing that the hard region gets larger as the initial condition decreases, up to convergence.

Annealing protocol
In this section we show that using specific protocols that treat separately the matrix and the tensor part of the cost function (something we do not allow in the main part of this paper, for the purpose of having a model as generic as possible) we are able to enter the Langevin hard region. A generic annealing scheme would lower the noises of both channels simultaneously, which would not avoid the Langevin hard region. Instead we use the protocol ∆_{p,eff}(t) = ∆p + C e^{−t/τ_ann}. The constant C selects at the initial time the desired effective ∆_{p,eff}, far from (and much larger than) the original one, while τ_ann sets the speed of the annealing. Fig. 18 shows that using this protocol we are able to enter the Langevin hard region even with ∆2 close to the AMP threshold. To this purpose we initialized the effective ∆_{p,eff} close to 100 (i.e. C = 100), very far from the Langevin hard region, and we used different speeds for the annealing of ∆p (different colors in the figures). In the figures we can observe that on approaching ∆2 = 1 we need slower and slower protocols (larger and larger τ_ann). The reason for this behavior is that, approaching ∆2 = 1 with ∆p = 100, a longer time is required to gain a non-trivial overlap with the solution. Evidence of this growing timescale at ∆p = 100 is given in Fig. 19, where we show the relaxation time for magnetizing towards the solution as ∆2 is varied. In particular, we observe that at ∆2 = 0.70 the relaxation time is of the order of 100 time units.
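The schedule itself is elementary; a short sketch with the values quoted above (the value of τ_ann is illustrative):

```python
import numpy as np

def delta_p_eff(t, delta_p, C=100.0, tau_ann=250.0):
    """Effective tensor noise of the annealing protocol: it starts at
    delta_p + C and decays exponentially towards the true delta_p."""
    return delta_p + C * np.exp(-t / tau_ann)

for t in (0.0, 250.0, 500.0, 1000.0):
    print(f"t = {t:6.0f}   Delta_p,eff = {delta_p_eff(t, delta_p=1.0):8.3f}")
```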
For the protocol to be successful it is therefore crucial that the annealing time τ_ann is large enough to give the system the possibility of magnetizing towards the solution before ∆_{p,eff} has significantly decreased towards ∆p. According to this analysis, it is not surprising that in Fig. 18 the proposed protocol appears not to be successful for ∆2 = 0.90: for this value of ∆2 the time to find the solution, even with ∆p = 100, should be larger than 1000 time units, which is much larger than the τ_ann used and in any case outside the time window of our numerical solution. However, with a large enough annealing time it would in principle be possible to recover exactly the boundaries of the AMP easy region.

Replica analysis of the landscape

The computation of the moments of the partition function is much less involved than that of the averaged log-partition function, due to the fact that for integer n the average E[Z^n] can sometimes be performed analytically. Indeed, in this case the replicated partition function Z^n can be regarded as the partition function of n identical uncoupled systems, or replicas. Averaging over the disorder we obtain a clean system of interacting replicas. The Hamiltonian of this system displays an emerging replica symmetry, since it is left unchanged by a permutation of the replicas. This symmetry can be spontaneously broken in certain disordered models where frustration is sufficiently strong [44].
In mean-field models characterized by fully connected factor graphs, the resulting Hamiltonian of interacting replicas depends on the configuration of the system only through a simple order parameter, the overlap Q between them, which is an n × n matrix describing the similarities of the configurations of the different replicas in phase space. Furthermore, the Hamiltonian is proportional to N, which means that in the thermodynamic limit N → ∞ the model can be solved using the saddle point method. In this case one needs to consider a simple ansatz for the saddle point structure of the matrix Q that allows one to take the analytic continuation n → 0. The solution to this problem comes from spin glass theory and general details can be found in [44]. The saddle point solutions for Q can be classified according to the level of replica symmetry breaking, going from the replica symmetric solution, where replica symmetry is not spontaneously broken, to various degrees of spontaneous replica symmetry breaking (including full replica symmetry breaking). Here we will not review this subject; the interested reader can find details in [44]. The model we are analyzing can be studied in full generality at any degree of RSB (see for example [41,42,52], where the same models have been studied in the absence of a signal). However, here we limit ourselves to saddle point solutions up to the 1RSB level.
The complexity of the landscape can be directly related to replica symmetry breaking. A replica symmetric solution implies an ergodic free energy landscape characterized by a single pure state. When replica symmetry is broken, instead, a large number of pure states arises and the phase space gets clustered in a hierarchical way [44]. Making a 1RSB approximation means looking for a situation in which the hierarchical organization contains just one level: the phase space gets clustered into an exponential number of pure states with no further internal structure.
If we assume a 1RSB glassy landscape, we can compute the complexity of metastable states using a recipe due to Monasson [73] (see also [74] for a pedagogical introduction).The argument goes as follows.
Let us consider a system of x real replicas infinitesimally coupled. If the free energy landscape is clustered into an exponential number of metastable states, the replicated partition function, namely the partition function of the system of x real replicas, can be written as Z_x ≃ ∫ df e^{N[Σ(f) − βxf]}, where the internal free energy f* of the dominant metastable states is determined by the saddle point condition dΣ/df(f*) = βx, and β is the inverse temperature. Note that since we are interested in the Bayes-optimal case, this corresponds to setting β = 1; in the analysis we keep a generic β before taking the limits, in order to derive the averaged energy by taking a derivative. The function Σ(f) is the complexity of metastable states having internal free energy f. Therefore, using the free parameter x, we can reconstruct the form of Σ(f) from the replicated free energy. In order to compute the replicated free energy we apply the replica trick to the replicated system, log Z_x = lim_{n→0} ∂_n (Z_x)^n. Calling Φ = −(1/N) log Z_x the replicated free energy, we get the complexity as Σ = x ∂Φ/∂x − Φ. We can now specialize the computation to our case, where the partition function is the normalization of the posterior measure. With simple manipulations of the equations [70], the partition function can be expressed as an integral over the overlap matrix,

(Z_x)^n = Z_{nx} ∝ ∫ ∏_{ab} dQ_{ab} e^{βNnxS(Q)} ≃ sup_Q e^{βNnxS(Q)},   (E1)

where the overlap Q is a (nx + 1) × (nx + 1) matrix containing a special row and column that encode the overlaps of the replicas with the signal, i.e. the magnetization m. The 1RSB structure for the matrix Q is obtained by defining the following nx × nx matrices: the identity I_{ij} = δ_{ij}, the matrix J^(0)_{nx} full of ones, and the block-diagonal matrix J^(1)_{nx} whose diagonal blocks of size x × x are full of ones. The 1RSB ansatz for Q then reads Q = (1 − q_M) I_{nx} + (q_M − q_m) J^(1)_{nx} + q_m J^(0)_{nx}. Using this ansatz we can compute S(Q), which is given in Eq. (E2), and from Eq. (E2) we obtain the saddle point equations, Eqs. (E3). The above 1RSB fixed-point equations can be used to derive the de Almeida-Thouless instability of the RS solution towards 1RSB. This stability condition, sometimes called the replicon condition, also determines the overlap of the marginal threshold states. The stability analysis is done by expanding Eqs. (E3) in the small parameter q_M − q_m = ε ≪ 1 and investigating whether, under iteration, such a small difference grows or decreases. This leads directly to the threshold condition on the overlap,

1/[β²(1 − q_th)²] = (p − 1) q_th^(p−2)/∆p + 1/∆2,   (E4)

which is then used in the derivation of the Langevin threshold (15) in the main text. From Eq. (E2) we also obtain the averaged energy. In particular, the threshold states are characterized by q_M = q_th, fixed by Eq. (E4), q_m = 0 and m = 0. Imposing these values, we can use the saddle point equation for q_M, Eq. (E3), to fix the Parisi parameter x,

x(q_th) = 1/[(1 − q_th)(q_th^(p−1)/∆p + q_th/∆2)] − 1/q_th + 1.

These pieces together give Eq. (12) shown in the main text.
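For completeness, the Legendre structure behind the last relations can be spelled out; the implication in the second line uses the saddle point (envelope) condition, so that only the explicit x-dependence survives in ∂Φ/∂x:

```latex
\begin{align}
  Z_x &\simeq \int df\; e^{N[\Sigma(f)-\beta x f]}
       \simeq e^{N[\Sigma(f^*)-\beta x f^*]},
  \qquad
  \left.\frac{d\Sigma}{df}\right|_{f^*} = \beta x,\\
  \Phi(x) &= -\frac{1}{N}\log Z_x = \beta x f^* - \Sigma(f^*),
  \qquad
  \frac{\partial\Phi}{\partial x} = \beta f^*
  \;\;\Rightarrow\;\;
  \Sigma = x\,\frac{\partial\Phi}{\partial x} - \Phi.
\end{align}
```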
Having obtained the energy, we can consider β = 1 fixed for the rest of the analysis. We observe that starting from this expression we can recover the RS free energy (B18) by setting q_M = q_m or, equivalently, by taking the limit x → 1. The low-magnetization solution of these equations gives the complexity of the metastable branch of the posterior measure, Eq. (E7). The free parameter x allows us to tune the free energy of the states of which we compute the complexity. Thus we can characterize the part of the phase diagram where an exponential number of states is present.
To complete the 1RSB analysis we compute the stability of the 1RSB saddle point solution for Q. This is done analogously to the derivation of the replicon condition (E4), by analyzing the stability of the 1RSB solution towards further replica symmetry breaking. Following [52,75] we obtain two replicon eigenvalues,

λ_I = 1 − [1 − q_M + x(q_M − q_m)]² [(p − 1) q_m^(p−2)/∆p + 1/∆2],
λ_II = 1 − (1 − q_M)² [(p − 1) q_M^(p−2)/∆p + 1/∆2].

We can now analyze what happens to the landscape when we fix ∆p < 1, start from a large value of ∆2 < ∆_{2,dyn}(∆p), and decrease ∆2. In this case, for sufficiently high ∆2 and large enough ∆p, the system is in a paramagnetic phase and no glassy states are present. At the dynamical transition line we instead find a positive complexity, as plotted in Fig. 20. At this point the equilibrium states that dominate the posterior measure are the so-called threshold states, for which the complexity is maximal. For those states the eigenvalue λ_II = 0, which confirms that these states are marginally stable [27]. Decreasing ∆2 one crosses the information theoretic phase transition, where the relevant metastable states that dominate the posterior measure have zero complexity. This corresponds to a freezing/condensation/Kauzmann transition. Below the information theoretic phase transition the thermodynamics of the posterior measure is dominated by the state containing the signal. However, one can neglect the high-magnetization solution of the 1RSB equations to get the properties of the metastable branch, and compute the complexity of states that have zero overlap with the signal. The complexity curves as a function of the Parisi parameter x are plotted in Fig. 20 for fixed ∆p = 0.5 and several values of ∆2. The curves contain a stable 1RSB part and an unstable one where λ_II is negative. The 1RSB line shown in Fig. 14 is obtained by looking at when the states with positive complexity and λ_II = 0 disappear. This means that it gives the point where the 1RSB marginally stable states disappear, and it is therefore expected to be a lower bound for the disappearance of glassiness in the phase diagram. The important outcome of this analysis is that for ∆2 < 1 but not sufficiently small, namely in part of the AMP-easy phase, the replica analysis predicts the existence of 1RSB marginally stable glassy states that may prevent the Langevin algorithm from relaxing towards the signal [25], and therefore supports the existence of the Langevin hard phase. This approach, however, does not quantitatively predict the extent of the Langevin-hard phase, for reasons that remain obscure and should be investigated further.
Finally in Fig. 21 we plot the complexity as a function of the internal free energy of the states for some values of ∆ 2 and ∆ p .

Breakdown of the fluctuation-dissipation theorem in the Langevin hard phase
When the Langevin algorithm is able to reach equilibrium, be it on the signal or in the paramagnetic state, it satisfies the fluctuation-dissipation theorem (FDT), according to which the response function is related to the correlation function through R(t, t′) = −∂C(t, t′)/∂t. Furthermore, time-translational invariance (TTI) should arise, implying that both correlation and response functions depend only on the time difference, R(t, t′) = R(t − t′) and C(t, t′) = C(t − t′); note that all one-time quantities are constant in equilibrium. When the dynamics is run in the glass phase, metastable states may forbid equilibration. In this case time-translational invariance is never reached at long times (it is expected to be recovered only on timescales exponential in the system size) and the dynamics displays aging, violating at the same time the FDT relation. The analysis of the asymptotic aging dynamics was cracked by Cugliandolo and Kurchan [27,48] (see also [69] for a pedagogical review) in the simplest spin glass model where no signal is present (see also [76] for a much more complex situation). The outcome of this work is that when the dynamics, started from random initial conditions, is run in the glass phase, it drives the system to surf on the threshold states. In the model analyzed in [27] these states correspond to the 1RSB marginally stable glassy states that maximize the complexity. In this section we analyze the Cugliandolo-Kurchan scenario by contrasting the numerical solution of the LSE equations with this asymptotic picture.
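A convenient numerical diagnostic for this comparison is the parametric plot of the integrated response χ(t, t′) = ∫_{t′}^{t} R(t, s) ds against C(t, t′): with β = 1, FDT gives the straight line χ = 1 − C, and aging shows up as a departure from it. The sketch below (Python) runs the check on toy equilibrium data where FDT holds exactly by construction; array layout and names are illustrative.

```python
import numpy as np

def fdt_check(C, R, dt, it):
    """At fixed t = it*dt, return C(t,t') and the integrated response
    chi(t,t') = int_{t'}^{t} R(t,s) ds, for t' running from t down to 0.
    Under FDT (beta = 1) the parametric curve is chi = 1 - C."""
    chi = np.cumsum(R[it, it::-1]) * dt
    corr = C[it, it::-1]
    return corr, chi

# Toy equilibrium data: C(t,t') = exp(-|t-t'|), R(t,t') = -dC/dt for t > t'.
Nt, dt = 400, 0.01
t = np.arange(Nt) * dt
C = np.exp(-np.abs(t[:, None] - t[None, :]))
R = np.where(t[:, None] >= t[None, :], C, 0.0)   # equals exp(-(t-t')) causally
corr, chi = fdt_check(C, R, dt, it=Nt - 1)
print("max |chi - (1 - C)| =", np.abs(chi - (1.0 - corr)).max())  # O(dt) small
```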

FIG. 1. Phase diagram of the spiked 2 + 3-spin model (matrix plus order-3 tensor are observed). In the easy (green) region AMP achieves the optimal error, smaller than a random pick from the prior. In the impossible region (red) the optimal error is as bad as a random pick from the prior, and AMP achieves it as well. In the hard region (orange) the optimal error is low, but AMP does not find an estimator better than a random pick from the prior. For the Langevin algorithm the performance is strictly worse than that of AMP, in the sense that the hard region extends up to the line 1/∆*2 = max(1, √(2/∆3)), depicted in green dots. The green circles are obtained by numerical extrapolation of the Langevin state evolution equations.

FIG. 2. Evolution of the correlation with the signal C(t) starting from C(t = 0) = 10⁻⁴ in the Langevin algorithm at fixed noise on the matrix (∆2 = 0.7) and different noises on the tensor (∆p). As we decrease ∆p the time required to jump to the solution appears to diverge. Inset: the behavior of C(t) as a function of the iteration time for the AMP algorithm, for the same values of ∆p and the same initialization.

FIG. 3. Extrapolation of the Langevin relaxation time. The inset presents the relaxation time for fixed ∆p = 1. The main panel presents a fit using a power law consistent with a divergence at ∆*2 ≈ 0.72. The circles are obtained with the numerical solution of the LSE that uses the dynamic grid, while the crosses are obtained using a fixed grid; the initial condition was C(t = 0) = 10⁻⁴⁰ (details in the Appendix).

FIG. 4. Correlation with the signal of AMP and Langevin at the kth iteration (at time t) for fixed ∆2 = 0.7, where both evolutions start with initial overlap 10⁻⁴.

FIG. 5. Dynamical evolution of the energy starting from a configuration with overlap 10⁻⁴⁰ at ∆p = 1.0 and for various ∆2. The system first tends to the threshold energy, Eq. (12) (horizontal lines), and then for sufficiently small ∆2 finds the direction toward the signal.

FIG. 7. On the left: Phase diagram of the spiked matrix-tensor model for p = 3. The phase diagram identifies three regions: easy (green), impossible (red), and hard (orange). The lines correspond to the different phase transitions, namely the stability threshold (dashed black), the information theoretic threshold (solid red), the algorithmic threshold (solid cyan), and the dynamical threshold (dotted orange). The vertical cuts represent the sections along which the magnetization is plotted in Fig. 9. On the right: Phase diagram of the spiked matrix-tensor model for p = 4. The main difference with respect to the case p = 3 is that the algorithmic spinodal (solid cyan) is strictly above the stability threshold (dashed black). The hybrid-hard phase appears between these two lines (combined green and orange color). The vertical cuts represent the sections along which the magnetization is plotted in Fig. 10.

FIG. 8. On the left: Phase diagram of the spiked matrix-tensor model for p = 5. On the right: Phase diagram of the spiked matrix-tensor model for p = 10. In both cases we observe qualitatively the same scenario found in the right panel of Fig. 7.

FIG. 9. Fixed points of Eq. (B37) as a function of ∆2 for p = 3 and several fixed values of ∆p. The values of ∆p correspond to the vertical cuts in the left panel of Fig. 7. Solid lines are stable fixed points, dashed lines are unstable fixed points. The blue lines represent informative fixed points with positive overlap with the signal, while the orange lines represent uninformative fixed points with no overlap with the signal. Starting from high ∆2, an informative fixed point appears at the dynamical threshold (vertical dashed line) but is energetically disfavored until the information theoretic threshold (vertical dotted line); finally it becomes the only stable solution crossing the algorithmic threshold (vertical dotted-dashed line). When the transition is continuous the three vertical threshold lines merge and we have a single second-order phase transition, which here occurs for ∆p ≥ 1.

FIG. 12. Evolution of the correlation with the signal starting from the solution, C0 = 1.0, at fixed ∆2 for different ∆p. The dotted red line, overlapping with the other lines, is the same quantity evaluated using the fixed-grid algorithm up to time 100. We have started the LSE from an informative initial condition.

FIG. 13. Correlation with the signal of AMP (dotted lines) and Langevin (solid lines), at the kth iteration and at time t respectively, starting in both cases with an initial overlap of 10⁻⁴. The black dashed line is the asymptotic value predicted by AMP. In the easy region, provided enough running time, the Langevin dynamics finds the same alignment as AMP. The figures show qualitatively the same behaviour for different values of ∆2.

FIG. 15. Estimated divergence point ∆*2 at fixed ∆p = 0.9 with p = 3 as a function of the initial condition C0, ranging from 10⁻⁴⁰ to 10⁻⁴. The vertical axis is in linear scale while the horizontal axis is in log scale. We observe that the dependence of the estimated divergence points on the initial condition is consistent with the asymptotic value 1/∆*2 = √(2/∆p) ≈ 1.491 following from Eq. (15), depicted by the dashed line. The dotted line represents the AMP threshold for comparison.

FIG. 17. Relaxation time obtained from the LSE starting from an informative initial condition C0 = 1. The four cases refer to the 2 + 3 model and are fitted with a power law; the relaxation time appears to diverge very close to the point predicted by AMP (the AMP prediction is given in the captions).

FIG. 20. Complexity as a function of the Parisi parameter x for p = 3 on the line ∆p = 0.5. The solid line characterizes the stable part of the complexity while the dashed line the unstable one.
FIG. 21. The stable part of the 1RSB complexity as a function of the free energy for p = 3 and ∆p = 0.5.
[78] A. Crisanti and H.-J. Sommers, Thouless-Anderson-Palmer approach to the spherical p-spin spin glass model, Journal de Physique I 5, 805 (1995).
[79] A. Bray and M. Moore, Evidence for massless modes in the 'solvable model' of a spin glass, Journal of Physics C: Solid State Physics 12, L441 (1979).