A Deterministic and Generalized Framework for Unsupervised Learning with Restricted Boltzmann Machines

,


I. INTRODUCTION
The past decade has witnessed a groundswell of research in machine learning, bolstered by the deep learning revolution and the resurgence of neural networks [1].Since their inception, researchers have identified the deep connection between neural networks and statistical mechanics.Perhaps the most well-known unsupervised neural models studied through the lens of statistical physics have been the the Hopfield model [2,3] and the Boltzmann machine [4].These models were proposed from a connectionist perspective of cognitive science and were studied in the context emergent representation in unsupervised machine learning.
We can look to the Hopfield model to directly observe some of the contributions of physics to both machine learning and cognitive sciences.For example, by applying techniques from the study of spin-glasses, Amit et al. [3] were famously able to derive the memory capacity of the Hopfield model and provide a concrete understanding of the dynamics of the model via the study of its phase transitions.This fundamental understanding of the behavior of the Hopfield model has provided insight into the complexities of associative memory.
The closely related Boltzmann Machine is an undirected stochastic neural network which finds its physics parallel in Ising spin glass models [5].Specifically, for this model, one is interested in the inverse problem: learning the couplings between spins in order to generate a particular set of configurations at equilibrium.The process of learning couplings, or training, is often referred to as the inverse Ising problem in the physics literature [6][7][8].However, because couplings only exist between pairs of spins for the fully-visible Ising spin-glass, such models have limited practical application as they cannot successfully capture higher-order correlations which might exist in a set of training configurations.
For this reason, the general Boltzmann machine introduces a set of unobserved latent spins.The effect of these latent spins is to abstract high-order correlations within the set of observed spins.While an optimal training of the couplings would potentially lead to a very effective general model of high-dimensional joint distributions, the intractability of this joint latent model confounds the practical application of general Boltzmann machines.
A restricted Boltzmann Machine (RBM) is a special case of the general Boltzmann machine, where couplings only exist between latent and observed spins.This bipartite structure is key to the efficient and effective training of RBMs [9].RBMs have found many applications in machine learning problems as diverse as dimensionality reduction [10], classification [11], collaborative filtering [12], feature learning [13], and topic modeling [14].Additionally, RBMs can be stacked into multi-layer neural networks, which have played a historically fundamental role in pre-training deep network architectures [15,16].These constructions, known as deep belief networks, were the first truly deep neural architectures, leading to the current explosion of activity in deep learning [17].While access to vast training datasets has made such pre-training dispensable for certain tasks, RBMs remain a fundamental tool in the theory of unsupervised learning.As such, a better understanding of RBMs can be key to future developments in emergent machine intelligence.
To date, the most popular and effective approaches to training RBMs have centered on differing flavors of short-chain Monte Carlo sampling [9,18], which we cover in detail in the sequel.While such techniques can yield trained RBMs which produce sampled configurations very similar to the target dataset, and can be used in a number of applications as detailed previously, they do not bridge the gap in understanding what the RBM has learned.Furthermore, understanding the modes, or internal representations, of the RBM with sampling-based frameworks have mostly consisted of subjective comparisons of sampled configurations as well as a subjective analysis of the couplings themselves, often referred to as receptive fields in the machine learning literature.
Additionally, comparing two trained models, or even monitoring the training of one model, becomes problematic when using sampling-based investigative tools.For example, annealed techniques [19] can provide estimates of the log-likelihood of a model, but only at a large computational cost [20,21].At a much lower computational cost, pseudo-likelihoods can be used to monitor training, but the estimates produced in this manner are inaccurate, as compared to annealed importance sampling (AIS) [19], and even AIS can fail to detect model divergence in practice [22].
In the present work, we seek to address these concerns by developing a deterministic framework to train, compare, and analyze RBMs, as well as to leverage their modeling power for inference tasks.We accomplish this via statistical physics techniques through the use of the Thouless-Anderson-Palmer (TAP) formalism of spin-glass theory [5,[23][24][25].In this manner, we produce a model which no longer refers to a stochastic model possessing an intractable Gibbs measure, but to a TAP machine: an entirely self-consistent mean-field model which operates as a classical RBM, but which admits deeper introspection via deterministic inference.TAP machines also naturally handle non-binary variables as well as deep architectures.While Deep Boltzmann Machines' (DBMs) [26] state-of-the-art training algorithms mix both Monte Carlo sampling and "naïve" mean-field approximation, a deep TAP machine relies entirely on the TAP mean-field approximation.
Under this interpretation, a TAP machine is not a generative probabilistic model, but a deterministic model defining a set of representational magnetizations for a given training dataset.Advantageously, this learning output can be computed exactly in finite time by converging a fixed-point iteration, in contrast to the indeterminate stopping criterion of Markov-chain Monte Carlo sampling.This is a major distinction between the TAP machine and the classical RBM, for which the true probability density function is intractable.At its core, the TAP machine training consists of arranging the minima, solutions, in the proposed TAP-approximated free energy so as to maximize the correlation between these solutions and the dataset.In our experiments, we demonstrate how to track the growth and geometry of these solutions as a novel way to investigate the progress of unsupervised learning.We also show how to use a trained TAP machine as a prior for inference tasks.
The paper is organized as follows.In Sec.II we formally describe the classical binary RBM and review the literature on RBM training and analysis.Subsequently, in Sec.III, we describe our proposed modification of the binary RBM to a model with arbitrary real-valued distributions with bounded support.Next, in Sec.IV, we briefly describe how to apply belief-propagation to perform inference in the setting of real-valued spins.The details of this approach are pedagogically described in Appendices A & B. In Sec.V we derive the TAP approximation of the real-valued RBM via a high-temperature expansion of a two-moment Gibbs free energy.Then, in Sec.VI, we detail how to convert this approximation to a practical training algorithm.In Sec.VII, we conduct a series of experiments on real datasets, demonstrating how to use the properties of the TAP machine interpretation to provide insight into the unsupervised learning process.We additionally show how to use a trained model for bit-flip correction as a simple example of leveraging a TAP machine for inference tasks.Lastly, in Appendix C, we detail the derivations of necessary distribution-specific functions.

II. RESTRICTED BOLTZMANN MACHINES
Restricted Boltzmann Machines (RBMs) [27] are latentvariable generative models often used in the context of unsupervised learning.A set of weights and biases, the model parameters of the RBM, which correspond to the couplings and local fields present in the system, constructs an energy as a function of the data points from which follows a Gibbs-Boltzmann probability density function.In the well-known binary RBM, for which all visible and latent variables are in {0, 1}, the RBM distribution is (1) where θ = {b, c} is the set of local potentials, i.e. the set of values which define the biases acting on each variable, and Here, we use the notation x and h to refer to sums over the entire space of possible configurations of visible and latent variables, respectively.When taken with respect to the parameters of the model, Z [W, θ] is known as the partition function.We give a factor-graph representation of the RBM distribution in Fig. 1.
As evidenced by (2), an exact computation of the normalizing partition function, and thus the probability of a given high-dimensional data point, is inaccessible in practice.Sophisticated Monte Carlo (MC) schemes relying on importance sampling [20,21]  and bounds of the partition, but at the cost of substantial computation, running on the time scale of days or even weeks.
Thankfully, a precise estimate of the normalization is unnecessary for many RBM applications.Additionally, the bipartite structure of the RBM, which only admits couplings between the hidden and visible variables, can be leveraged to construct efficient sampling schemes.This approach was demonstrated in the contrastive divergence (CD) of [9], where very short-chain block-Gibbs sampling was shown to be sufficient for adequate RBM training.The CD approach consists of a sampling chain alternating between samples drawn from the conditional probabilities of each layer, which are dependent on the conditional expectations at the previously sampled layer.Specifically, the conditional probabilities for the hidden and visible units factorize as where sigm(y) = (1 + e −y ) −1 is the logistic sigmoid function.
In order to learn the parameters of the RBM for a given training dataset, one looks to maximize the following loglikelihood, via gradient ascent on the parameters W and θ.Commonly, one does not calculate these gradients for each data-point from the training set, but instead calculates the gradients in average across M data-points, often referred to as a mini-batch.At each mini-batch, the gradients of (5) are given as, where • Sampled refers to averages over particles sampled from the model, and • X refers to the so-called clamped expectations, where the values of x are fixed to the training data samples in the mini-batch.In the case of the expectations involving hidden units, which are unobserved and therefore have no training data, [9] originally proposed the use of configurations sampled from P (h|x; W, b, c).However, one could also use the exact conditional expectations directly to calculate these clamped averages; especially in cases where sampling from these conditionals may be problematic.Since [9], there have been a number of proposed modifications to the core sampling-based training scheme described above.The persistent trick [18] takes neatly advantage of the iterative gradient ascent over mini-batches to quickly obtain thermalized Markov chains through Gibbs sampling at no extra computational cost over one step CD (CD-1).Nevertheless, the probability density function of a trained RBM is typically highly multimodal, thus making this sampling inexact.Indeed, in such glassy landscapes mixing becomes very slow as Markov chains become stuck in metastable states, leading to over-and under-represented, as well as missed, modes of the highdimensional distribution.This in turn produces high variance estimates of means and correlations.A more accurate sampling can be achieved using parallel tempering [28,29], where particles are swapped between multiple Markov chains running at differing temperatures.This approach, however, requires not only the additional computational burden of running more chains, but also requires further tuning of hyper-parameters, such as the number of chains and at which temperatures to run them.
As accurate sampling-based inference on RBMs can be costly, it would seem that their usefulness is limited.As learning of the RBM via gradient ascent is dependent upon this inference, the difficulty of training a generative model with a high-degree of accuracy is compounded.However, RBMs have proven to be very useful in many applications where sampling from a full-fledged generative model is unneeded.For instance, RBMs can be used as an unsupervised "feature extraction" pre-training for feedforward networks [10,15,30].RBMs have also been used for data-imputation tasks, e.g.image in-painting, label recovery [11], or collaborative filtering [12] by reconstructing missing data with a single visible-hidden-visible step.In truth, the CD-k training algorithm which popularized RBMs, first with binary units [9], then Gaussian units [10,31] and finally with arbitrary units [32], does not use thermalized samples to evaluate means and correlations.Instead, it focuses on the region of the configuration space nearest to the training dataset [28] by using short block-Gibbs Markov chains, starting from training data points, to get fast and low variance estimates of moments.However, CD-k is prone to learn spurious minima in configuration space far from the data as it does not explore this region during training [28].It also does not systematically increase the true likelihood of training data [33].However, this training strategy has been found to be very efficient in the applications mentioned above, which consistently remain close to the dataset in configuration space.One finds that CD falls short for applications which require long-chain MCMC sampling from the trained RBM, as this represents a fundamental mismatch between the training and application of the RBM.In order to address some of theses shortcomings of sampling-based approaches, we now turn our attention to deterministic mean-field approximations of the RBM.
The TAP approximation [23] for disordered systems relies on the deterministic inference of approximated magnetizations, from which one can obtain estimators of all kinds of observables, starting from the log-partition or free energy.TAP is derived from a small weight expansion of the variational approach and can be considered as an extension of the naïve mean-field (NMF) method [34,35].Previous works which have attempted to make use of the NMF approximation of the RBM have shown negative results [18,36].
The TAP approximation was first considered for Boltzmann machines in the context of small random models without hidden units in [37].In the recent work of [38], this approximation was extended to a practical training algorithm for full-scale binary RBMs which was shown to be competitive with Persistent Contrastive Divergence (PCD) [18] when applied to real-world datasets.In parallel, other works have used TAP, and the related Bethe approximation, to perform inference on binary Boltzmann machines [39][40][41][42].
In the next sections, we detail how to re-write the RBM model in the non-binary case, for generalized distributions on the visible and hidden units, similar in spirit to [32].However, unlike earlier techniques, we will approach the problem of estimating the normalization of the RBM model via the tools of statistical mechanics, resulting in a fully-deterministic framework for RBM inference, training, and application.

III. GENERAL DISTRIBUTIONS FOR RBMS
We now turn our attention to the case of the general RBM (GRBM), where the distributions of the hidden and visible units are not fixed.We define the distribution of interest in the following manner, where the sum over i, j indicates a sum over the all visible and hidden units in the model, and θ = {θ v 1 , . . ., θ v Nv , θ h 1 , . . ., θ h N h } are the parameters defining the local distributions, P v i and P h j , on respectively the variables of x and the variables of h.In the case that x ∈ {±1} Nv and h ∈ {±1} N h , we can see that the distribution above reduces to a bipartite spin glass model with θ representing the local fields acting on the system spins; the fields b and c for binary spins as described in Eq. ( 1).This specific case is simply a binary RBM, as described in the previous section, and which we have already considered within an extended mean-field framework in [38].The important distinction with the model we evaluate here is that we do not assume a binary discrete distribution on the variables, but instead allow for a formulation where the variables in the system can possess distributions on discrete-or real-valued bounded support.By considering this more general class of models, one can include a wide range of different models, including the Hopfield model and spike-and-slab RBMs [43], and data sets, such as images or genomic data, by varying the distributions of the hidden and visible units.The distribution of visible variables is obtained by marginalizing out the latent variables, giving the log-probability If we take the gradients of (11) with respect to the model parameters, in the case of distribution terms θ we find, and which are generalizations of ( 7) and (8).However, in the case of the gradient with respect to the couplings, we find where the function computes the conditional expectation of h l knowing the value of the visible units.It is important to note that the data-dependent term is tractable thanks to the bipartite structure of the one-hidden layer RBM.
In contrast to the data-dependent terms, the second terms of Eqs. ( 12)-( 14) require knowledge of the partials of the log normalization w.r.t. the parameter of interest.However, this term cannot be written exactly as the explicit calculation of the normalization is intractable.Rather than resorting to sampling, we will attempt to approximate the free energy F = − ln Z [W, θ] in a parametric and deterministic way, as in [38].In the next section, we discuss how belief propagation (BP) can be used to estimate F and to conduct inference on RBMs.

IV. APPROXIMATION VIA BELIEF PROPAGATION
One method by which we might estimate the partition of F is via belief-propagation (BP) [44], which we review in Appendix A for pairwise models such as the RBM.Essentially, given a factor graph for some joint statistical model, such as that of our RBM in Fig. 1, the BP algorithm attempts to estimate a set of marginal distributions at each variable.In the case of tree-like graphs, BP provides an exact calculation of these marginals.The application of BP to factor graphs containing cycles, loopy BP, is not guaranteed to provide accurate estimates of the marginals.However, often these estimated marginals have significant overlap with the true ones [45].Additionally, it is known that the solutions of loopy BP are the fixed-points of the Bethe free energy [45], which allows the construction of an approximation of F. Applying this to the inverse learning problem, one can compute the gradients of this Bethe free energy in terms of the parameters of the model, allowing a gradient ascent on Bethe-approximated log-likelihood of the training data.
One significant hurdle in the application of loopy BP to RBM learning for real-valued variables is that the messages propagated on the edges of the factor graph are continuous PDFs.In the case of discrete variables, such as Ising or Potts spins, BP messages can be written using magnetizations or the full discrete PMF, respectively.For binary variables, both BP and mean-field approximations of fully-connected Boltzmann machines were considered in [39] in the context of inference with fixed parameters.A similar study of binary RBMs was conducted with loopy BP in [46].It is important to note that both of these studies investigated the properties of Boltzmann machines with i.i.d.random weights.While such studies permit many analytical tools for studying the behavior of the RBM, one cannot directly map these observations to RBM inference in practice, where trained weights may exhibit strong correlations both within and between receptive fields.
In order to construct a BP algorithm for PDFs over realvalued support, one requires a finite memory description of the messages.Some examples of such descriptions are given in non-parametric BP [47], moment matching [48], and relaxed BP (r-BP) [49].In Appendix B, following the example of r-BP, we show how to arrive at a two-moment approximation of the continuous BP messages via a smallweight expansion on the RBM coupling parameters W . There, we also show the r-BP approximated free energy of pairwise models, as well as demonstrating the need for distributions with bounded support in order to preserve bounded messages.
In the next section, building upon this derivation, we consider mean-field approximations of the RBM via hightemperature Plefka expansion.

V. TAP APPROXIMATION FOR PAIRWISE MODELS
While one could utilize the r-BP approach in order to estimate the free energy of a generalized real-valued spin model, as detailed in the earlier section, such an approach might not be desirable in practice.Specifically, if one wishes to solve the inverse learning problem, estimating model parameters from a given dataset, it is necessary to estimate the gradients of the model parameters w.r.t. the model likelihood for each parameter update.Using a steepest-ascent approach, as detailed in Sec.III, requires one to estimate these gradients many thousands of times.For systems of large size N , the r-BP scales quite poorly.Estimating a gradient requires the iteration of the r-BP equations on O(N 2 ) messages.Additionally, one must distinguish between cavity terms (i → j) and marginal terms (→ i).If the final O(N ) gradients we desire can be estimated using the marginal terms alone, then requiring an iteration on the O(N 2 ) set of messages is an extremely costly operation.
Instead, one can turn to a mean-field approach, writing the free energy, and its stationary points, in terms of the marginals alone.This can be done by including certain correction terms, up to a specified degree in the weights.In the context of RBMs, such approaches have been proposed at both the 1st order, the naive mean-field [36], and the 2nd order, using the so-called Thouless-Anderson-Palmer (TAP) [23] equations for introducing an additional correction term [38,39,42,50,51].In the case of a GRBM with arbitrary distributions on each unit, however, we must re-derive the TAP approximation in terms of parameters of these distributions as well as the approximate marginalized distribution at each site, up to their first two moments.This task turns out to be closely related to the TAP approach to low-rank matrix factorization [50][51][52][53][54].
While it is possible to derive the stationarity conditions for the inferred marginals from the r-BP messages directly by Taylor expansion, we rather focus on the free energy directly that will provide the gradients we require for training the GRBM parameters via a high-temperature expansion we present below.
Lastly, we point out that the TAP free energy secondorder (TAP) term depends on the statistical properties of the weight distribution.The derivation presented below assumes independent identically distributed weights, scaling as O(1/ √ N ).This assumption is a simplification, as in practice the weight distribution cannot be known a priori.The distribution depends on the training data and changes throughout the learning process according to the training hyper-parameters.The adaptive TAP (adaTAP) formalism [55] attempts to correct this assumption by allowing one to directly compute the "correct" second-order correction term for a realization of the W matrix without any hypothesis on how its entries are distributed.Although this algorithm is the most principled approach, its computational complexity almost rules out its implementation.Moreover, practical learning experiments indicate that training using adaTAP does not differ significantly from TAP assuming i.i.d.weights.A more detailed discussion of the computational complexity and learning performance is described in Appendix D.

V.1. Derivation of the TAP Free Energy
In the limit N → ∞, if we assume that the entries of W scale as O(1/ √ N ) and that all sites are widely connected, on the order of the size of the system, then we can apply the TAP approximation -a high temperature expansion of the Gibbs free energy up to second-order [34,35].In the case of a Boltzmann distribution, the global minima of the Gibbs free energy, its value at equilibrium, matches the Helmholtz free energy F [56].We will derive a twovariable parameterization of the Gibbs free energy derived via the Legendre transform [57].Additionally, we will show that this two-variable Gibbs free energy is both variational and attains the Helmholtz free energy at its minima.For clarity of notation, we make our derivation in terms of a pairwise interacting Hamiltonian without enforcing any specific structure on the couplings; the bipartite structure of the RBM is reintroduced in Section VI.
We will first introduce the inverse temperature term β to facilitate our expansion as β → 0, P (x; β, W, θ) = e −β(H−F β ) , where We wish to derive our two-variable Gibbs free energy for this system in terms of the first two moments of the marginal distributions at each site.To accomplish this, we proceed as in [35,55] by first defining an augmented system under the effect of two auxiliary fields, where we see that as the fields disappear, F β (0, 0) = F β , and we recover the true Helmholtz free energy.We additionally note the following identities for the augmented system, namely, where • λ,ξ is the average over the augmented system for the given auxiliary fields.Since the partial derivatives of the field-augmented Helmholtz free energy generate the cumulants of the Boltzmann distribution, it can be shown that the Hessian of −βF β (λ, ξ) is simply a covariance matrix and, subsequently, positive semi-definite.Hence, −βF β (λ, ξ) is a convex function in terms of λ × ξ.This convexity is shown to be true for all log partitions of exponential family distributions in [57].
We now take the Legendre transform of −βF β (λ, ξ), introducing the conjugate variables a and c, where we define the solution of the auxiliary fields at which G β (a, c) is defined as λ * λ * (β, a, c), ξ * ξ * (β, a, c), where we make explicit the dependence of the auxiliary field solutions on the values of the conjugate variables.
Looking at the stationary points of these auxiliary fields, we find that and The implication of these identities is that G β (a, c) cannot be valid unless it meets the self-consistency constraints that a and c are the first and second moments of the marginal distributions of the augmented system.Now, we wish to show the correspondence of G β (a, c) to the Helmholtz free energy at its unique minimum.First, let us look at the stationary points of −βG β (a, c) with respect to its parameters.We take the derivative with careful application of the chain rule to find, Carrying through a very similar computation for the derivative with respect to i .This shows that, at its solution, the Gibbs free energy must satisfy which can only be true in the event that the solutions of the auxiliary fields are truly λ * = 0 and ξ * = 0. Looking at the inverse Legendre transform of the Gibbs free energy for λ = 0, ξ = 0, we find that inf a,c G β (a, c) = F β (0, 0) = F β , which implies that the minimum of the Gibbs free energy is equivalent to the Helmholtz free energy.This holds since −βG β (a, c) is convex, as the Legendre transform of a convex function is itself convex.
Since the Gibbs free energy can therefore only possess a single solution, then its minimum must satisfy (26), and therefore, must be F β (0, 0) = F β .Finally, we can now rewrite the Gibbs free energy defined in Eq. ( 20) as a function of the moments a, c and parameterized by β and the GRBM parameters W and θ, where where the Lagrange multipliers are given as functions of the temperature in order to make clear the order in which we will apply β → 0 later.As this exact form of the Gibbs free energy is just as intractable as the original free energy, we will apply a Taylor expansion in order to generate an approximate Gibbs free energy [34].We make this expansion at β = 0, as in the limit of infinite temperature, all interactions between sites vanish and the system can be described only in terms of individual sites and their relationship to the system average and their local potentials, allowing the Gibbs free energy to be decomposed into a sum of independent terms.Specifically, if we take the expansion up to s terms, where Z β is the normalization of the Boltzmann distribution defined by H at temperature β.At β = 0 we can find the first term of the expansion directly where we recognize that the last term is simply the normalization of the Gaussian-product distribution whose moments were defined in (B9), (B10).We define the TAP free energy by writing the remainder of the expansions terms in the specific case of s = 2 [34,35], where Z i (B, A; θ) dx P i (x; θ)e Bx−Ax 2 .Now, we still need to define the extremal values of λ * (0) and ξ * (0) which define the Gibbs free energy.By taking the stationarity of the expanded Gibbs free energy with respect to a and c, we find where we make the definitions of A and B for convenience and as a direct allusion to the definitions of the cavity sums for BP inference, given in Eqs.(B5), (B6).Conversely, by deriving the stationarity conditions of the auxiliary fields, we obtain the self-consistency equations for a, c, which show us that the TAP free energy is only valid when the following self-consistencies hold, Substituting these values closes the free energy on the marginal distribution moments a, c and completes our derivation of a free energy approximation which is defined by O(N ) elements versus the O(N 2 ) values required by r-BP.

V.2. Solutions of the TAP Free Energy
As given, the TAP free energy is only valid when the self-consistency equations are met at its stationary points.Thus, only a certain set of a, c can have any physical meaning.Additionally, we know that only at the minima of the exact Gibbs free energy will we have a correspondence with the original exact Helmholtz free energy.
While the exact Gibbs free energy in terms of the moments a, c, is convex for exponential family P 0 (x; θ) such that e −β H ≥ 0, the TAP free energy can possess multiple stationary points whose number increases rapidly as β grows [5].Later in Sec.VII, we show that as GRBM training progresses, so does the number of identified TAP solutions.This can be explained due to the variance of the weights W growing with training.For fixed β = 1, as we use in our practical GRBM implementation, the variance of the weights serves as an effective inverse temperature, and its increasing magnitude has an identical effect to the system cooling as β increases.
Additionally, while the Gibbs free energy has a correspondence with the Helmholtz free energy at its minimum, this is not necessarily true for the TAP free energy.The approximate nature of the second-order expansion removes this correspondence.Thus, it may not be possible to ascertain an accurate estimate of the Helmholtz free energy from a single set of inferred a, c, as shown in Fig. 2. In the case of the naïve mean-field estimate of the Gibbs free energy, it is true that β .This implies that one should attempt to find the minima of G (1) β in order to find a more accurate estimate of F β , a foundational principle in variational approaches.However, while the extra expansion term in the TAP free energy should improve its accuracy in modeling G β over a, c, it does not provide a lower bound, and so an estimate of F β from the TAP free energy could be an under-or over-estimate.
Instead, one might attempt to obtain an estimate of the Helmholtz free energy by utilizing either all or a subset of the equilibrium solutions of the TAP free energy.Since there is no manner by which we might distinguish the equilibrium moments by their proximity to the unknown F β , averaging the TAP free energy across its solutions, denoted as , can serve as a simple estimator of F β [5].In [58] a weighting was introduced to the average, correcting the Helmholtz free energy estimate at low temperature by removing the over-influence of the exponential number of high-energy solutions.The weights in this approach are proportional to the exponents of each solution's TAP free energy, placing much stronger emphasis on low-energy solutions.However, such an approach is not well-justified in the our general setting of P i , where we expect large deviations from the expectations derived for the SK model.Additionally, while this weighting scheme is shown across the entire set of solutions for a particular random SK model, in our case, we are interested in the solution space centered on the particular dataset we wish to model.Since {a, c} Cartoon description for estimating the Helmholtz free energy (dotted ) via the Gibbs (blue dash) and TAP (red ) free energies.For this example of a convex Gibbs free energy, there exists one unique minimum over the moments a, c, and the Gibbs free energy here matches F β .The range of TAP free energies (gray box ) gives a boundary on the location of F β .Averaging the TAP free energies (dash-dot) provides an estimate of F β .
the solutions are computed by iterating the TAP selfconsistency equations, we can easily probe this region by initializing the iteration according to the training data.Subsequently, we do not encounter a band of high-energy solutions that we must weight against.Instead, we obtain a set of solutions over a small region of the support of the TAP free energy.Due to the uniformity of these solutions, un-weighted averaging across the solutions seems the best approach in terms of efficiency.In the subsequent section, we explore some of these properties numerically for trained RBMs.

VI. RBMS AS TAP MACHINES
To utilize the TAP inference of Sec.V, we need to write the TAP free energy in terms of the variables of the RBM.To clarify the bipartite structure of the GRBM, we rewrite the TAP free energy in terms of the hidden and visible variables at fixed temperature β = 1, where {a v , c v } and a h , c h are the means and variances of the visible and hidden variables, respectively.As in Sec.V, solutions of the TAP GRBM free energy can be found by a fixed-point iteration, as shown in Alg.
1, which bears much resemblance to the AMP iteration derived in the context of compressed sensing [59,60] and matrix factorization [51,53,54].We note that rather than updating over the entire system at each time step, fixing one side at a time has the effect of stabilizing the fixedpoint iteration.For clarity, Alg. 1 is written for a single initialization of the visible marginals.However, as noted in Sec.V.2, there exist a large number of initializationdependent solutions to the TAP free energy.Thus, in order to capture the plurality of modes present in the TAP free energy landscape, one should run this inference independently for many different initializations.
If the use case of the GRBM requires that we only train the GRBM tightly to the data space (e.g.data imputation), it makes sense to fix the initializations of the inference to points drawn from the dataset, In order to train the GRBM more holistically, structured random initializations can help probe modes outside of the data space.For a set of TAP solutions {a k , c k , B k , A k } for k ∈ {1, . . ., K} at fixed GRBM parameters {W, θ}, the TAPapproximated log-likelihood of can be written as where Z h j (B, 0; θ) is the normalization of the conditional expectation of Eq. ( 15), since f a (B, 0; θ) = f (B; θ).
After re-introducing an averaging of the log-likelihood over the samples in the mini-batch, the gradients of the TAP-approximated GRBM log-likelihood w.r.t. the model parameters are given by Algorithm 1 TAP Inference for GRBMs In the presented gradients, we make the point that the set of data samples and the set of TAP solutions can have different cardinality.For example, one might employ a mini-batch strategy to training, where the set of data samples used in the gradient calculation might be on the order 10 2 .However, depending on the application of the GRBM, one might desire to probe a very large number of TAP solutions in order to have a more accurate picture of the representations learned by the GRBM.In this case, one might start with a very large number of initializations, resulting in a very large number, K M , of unique TAP solutions.Or, contrary, while one might start with a number of initializations equal to M , the number of unique solutions might be K M , especially early in training or when the number of hidden units is small.Using these gradients, a simple gradient ascent with a fixed or monotonically decreasing step-size γ can be used to update these GRBM parameters.We present the final GRBM training algorithm in Alg. 2.
Besides considering non-binary units, another natural extension of traditional RBMs is to consider additional hidden layers, as in Deep Boltzmann Machines (DBMs).It is possible to define and train deep TAP machines, as well.Probabilistic DBMs are substantially harder to train than RBMs as the data-dependent (or clamped) terms of the gradient updates (39-41) become intractable with depth.Interestingly, state-of-the-art training algorithms retain a Monte Carlo evaluation of other intractable terms, while introducing a naıve mean-field approximation of these data-dependent terms.For deep TAP machines, we consistently utilize the TAP equations.The explicit definition and training algorithm are fully described in Appendix E.

MNIST -
The MNIST handwritten digit dataset [61] consists of both a training and testing set, each with 60,0000 and 10,000 samples, respectively.The data samples are real-valued 28 × 28 pixel 8-bit grayscale images which we normalize to the dynamic range of [0, 1].The images themselves are centered crops of the digits '0' through '9' in roughly balanced proportion.We construct two separate versions of the MNIST dataset.The first, which we refer to as binary-MNIST, applies a thresholding such that pixel values > 0.5 are set to 1 and all others to 0. The second, real-MNIST, simply refers to the normalized dataset introduced above.
For our experiments, we utilize only the face images.The database contains 2,429 training and 472 testing samples of face images.For our experiments, we normalize the samples to the dynamic range of [0, 1].FIG. 3. Training performance over 100 epochs for the tested datasets over varying numbers of hidden units, N h .Performance is measured in terms of the normalized (per-unit) TAP log-likelihood estimate computed for 10,000 training data samples.The TAP free energy is estimated using the unique TAP solutions thermalized from initial conditions drawn from the data samples, as in ( 36)- (37).Thermalization is determined by the convergence of the magnetizations up to a difference of 10 −8 in MSE between iterations.Solid lines indicate the average normalized TAP log-likelihood over the tested training samples and shaded regions indicate standard error.

VII.2. Learning Dynamics
We now investigate the behavior of the GRBM over the course of the learning procedure, looking at a few metrics of interest: the TAP-approximated log-likelihood of the training dataset, the TAP free energy, and the number of discovered TAP solutions.We note that each of these metrics is unique to the TAP-based model of the GRBM.While it was shown in [38] that CD does indeed maximize the TAP log-likelihood in the case of binary RBMs, the specific construction of CD is entirely independent from the TAP model of the GRBM.Thus, it is hard to say that a CD or TAP-trained GRBM is "better" in a general case.At present, we present comparisons between TAP GRBMs of varying complexity trained under fixed hyper-parameters settings, as indicated in Table I.
In Fig. 3 we see a comparison of the TAP log-likelihood as a function of training epochs for binary-MNIST for binary RBMs consisting of differing numbers of hidden units.As the gradient-ascent on the log-likelihood is performed batch-by-batch over the training data, we define one epoch to be a single pass over the training data: every example has been presented to the gradient ascent once.The specifics of this particular experiment are given in the caption.We note that for equal comparison across varying model complexity, this log-likelihood is normalized over the number of visible and hidden units present in the model.In this way, we observe a "per-unit" TAP log-likelihood, which gives us a measure of the concentration of representational power encapsulated in each unit of the model.Increasing values of the normalized TAP log-likelihood indicate that the evaluated training samples are becoming more likely given the state of the GRBM model parameters.
It can be observed that at each level of complexity, the TAP log-likelihood of the data rapidly increases as the values of W quickly adjust from random initializations to receptive fields correlated with the training data.However, across each of the tested models, by about the 20 th epoch the rate of increase of the TAP log-likelihood tapers off to a constant rate of improvement.
For reference, we also show a subset of the trained receptive fields, i.e. the rows of W , for each of the tested experiments.Since the full set of receptive fields would be too large to display, we attempt to show some representative samples in Fig. 4 by looking at the extreme samples in terms of spatial spread/localization and activity over the training set.We observe that the trained GRBMs, in the case of both binary-MNIST and real-MNIST, are able to learn both the localized and stroke features commonly observed in the literature for binary RBMs trained on the MNIST dataset [9,63].It is interesting to note that even in the case of real-MNIST, where we are using the novel implementation of truncated Gauss-Bernoulli visible units, we are able to observe similar learned features as in the case of binary-MNIST.We take this as an empirical indication that the proposed framework of GRBM learning is truly learning correlations present in the dataset as intended.Finally, we see feature localization increase with the number of hidden units.
To date, understanding "what" an RBM learns from the unlabelled data has mostly been a purely subjective exercise in studying the receptive fields, as shown Fig. 4.However, the interpretation of the GRBM as a TAP machine can provide us a novel insight in the nature and dynamics of GRBM learning via the stationary points of the TAP free energy, which we detail in the next section.

VII.3. Probing the GRBM
Given the deterministic nature of the TAP framework, it is possible to investigate the structure of the modes which a given set of GRBM parameters produces in the free energy landscape.Understanding the nature and concentration of these modes gives us an intuition on the representational power of the GRBM.To date, observing the modes of a given GRBM model could only be approached via long-chain sampling.Given enough sampling chains from a diverse set of initial conditions, thermalizing these chains produces a set of samples from which one could attempt to derive statistics, such as concentrations of the samples in their high-dimensional space, to attempt to pinpoint the likely modes in the model.However, the number of required chains to resolve these features increases with the dimensionality of the space and the number of potential modes which might exist in the space.Because of this, the numerical evaluation we carry out here would be impractical with sampling techniques.
The r-BP and mean-field models of the RBM allow us to directly obtain the modes of the model by running inference to solve the direct problem.Given a diverse set of initial conditions, such as a given training dataset, running r-BP or TAP provides a deterministic mapping between the initial conditions drawn from the data, as in ( 36)-( 37), and the "nearest" solution of the TAP free energy.If the initial point was drawn from the dataset, then this solution can be interpreted as the RBM's bestmatching internal representation for the data point.
If a large number of structurally diverse data points map to a single solution, then this may be an indicator that the GRBM parameters are not sufficient to model the diverse nature of the data, and perhaps further changes to the model parameters or hyper-parameters are required.Conversely, if the number of solutions explodes, being roughly equivalent to the number of initial data points, then this indicates a potential spin-glass phase, that the specific RBM is over-trained, perhaps memorizing the original data samples during training.Additionally, when in such a phase, the large set TAP solutions may be replete with spurious solutions which convey very little structural information about the dataset.In this case, hyper-parameters of the model may need to be tuned in order to ensure that the model possess a meaningful generalization over the data space.
To observe these effects, we obtain a subset of the TAP solutions by initializing the TAP iteration with initial conditions drawn from the data set, running the iteration until convergence, and then counting the unique TAP solutions.We present some measures on these solutions in Fig. 2. Here, we both count the number of unique TAP solutions, as well as the distribution of the TAP free energy over these solutions, across training epochs.There are a few common features across the tested datasets.First, the early phase of training shows a marked increase of the TAP free energy, which then gradually declines as training continues.Comparing the point of inflection in the TAP free energy against the normalized TAP loglikelihood shown in Fig. 3 shows that the early phase of GRBM training is dominated by the reinforcement of the empirical moments of the training data, with the GRBM model correlations playing a small role in the the gradient of ( 14).This makes sense, as the random initialization of W ∼ N (0, σ) for σ ≈ 10 −3 implies that the hidden units are almost independent of the training data.Thus, the TAP solutions at the early stage of learning are driven by, and correlated with, the local potentials on the hidden and visible variables.
The effect of this influence is that the TAP free energy landscape possesses very few modes in the data space.Fig. 5 shows this very clearly, as the number of TAP solutions starts at 1 and then steadily increases with train-ing.Because the positive data-term of ( 14) is dominant, the GRBM parameters do not appear to minimize the TAP free energy, as we would expect.However, as more TAP solutions appear, the data and model terms of the gradient become balanced, and the TAP free energy is minimized.It is at this point of inflection that we see leveling off of the normalized TAP log-likelihood.
Second, we observe free-energy bands in the TAP solutions.This feature is especially pronounced in the case of the binary-MNIST experiment.Here, at all training epochs, there exist two significant modes in the free energy distribution over the TAP solutions.We see this effect more clearly in the training-slice histograms shown in the bottom row of Fig. 5(a).In the case of the real-MNIST experiment, we see that the free energy distributions do not exhibit such tight banding, but they do show the presence of some high-and low-energy solutions which persist across training.The main feature across experiments is the multi-modal structure of the free energy distribution.Finally, we note that for both real-MNIST and binary-MNIST, in the case of N h = 100, we don't empirically observe an explosion of TAP solutions, a potential indicator a spin-glass phase, since the proportion of unique TAP solutions to the initial data points remains less than 10%.
In order to investigate whether the modes in the TAP free energy distributions are randomly assigned over configuration space, or exist in separate continuous partitions of the configuration space, we need to look at the proximity of the solutions in the configuration space.Because this space cannot be observed in its ambient dimensionality, we project the configuration space into a two-dimensional embedding in Fig. 6.Here, we utilize the well known Isomap [64] algorithm for calculating a two-dimensional manifold which approximately preserves local neighborhoods present in the original space.Using this visualization we observe that as training progresses the assignment of high and low free energy to TAP solutions does not appear random in nature, but seems to be inherent to the structure of the solutions themselves, that is, their location in the configuration space.Additionally, in Fig 6 we can see the progression from few TAP solutions to many, and how they spread across the configuration space.It is interesting to note how the solutions start from a highly correlated state and then proceed to diversify.
We can also observe the TAP solutions with respect to the initializations which produced them, as shown in Fig. 7.In these charts, we use a similar approach as Fig. 6, mapping all high-dimensional data points, as well as TAP magnetizations, into a 2D embedding using Isomap.This allows us to see, in an approximate way, how the TAP solutions distribute themselves over the data space.We also show how the number of TAP solutions grows from few to many over training, and how they maintain a spread distribution over the data space.This demonstrates how the training procedure is altering the parameters of the model so as to place TAP solutions within dense regions of the data space.For the sake of clarity, we have not included lines indicating the attribution of an initial data point to its resultant TAP solution.However, as training progresses, one sees that the TAP solutions act as attractors over the data space, clustering together data points which the TAP machine recognizes as similar.

VII.4. Inference for Denoising
Serving as a prior for inference is one particular use case for the TAP machine interpretation of the GRBM.As a simple demonstration, we turn to the common signal processing task of denoising.Specifically, given a planted signal, one observes a set of noisy observations which are measures of the true signal corrupted by some stochastic process.Denoising tasks are ubiquitous in signal processing, both at an analog level (e.g.additive and shot noise), and at the level of digital communications (e.g.binary symmetric and erasure channels).The goal of this task is to produce the most accurate estimate of the unknown signal.In the analog case, this may be a measure of mean-square-error (MSE) between the estimate and the true signal.In the binary case, this may be a measure of accuracy, counting the number of incorrect estimates, or some other function of the binary confusion matrix, such as the F1-score or Matthews correlation coefficient (MCC).
For a fixed set of observations and channel parameters, if we assume that the original signal was drawn from some unknown and intractable generating distribution, then as we construct more and more accurate tractable approximate priors, the more accurately we can construct an estimate of the original signal.
In other words, the more we know about the structure and content of the unknown signal a priori, the closer our estimate can be.Often, as in the case of wavelet-based image denoising, statistics are gathered on the transform coefficients of particular images classes and heuristic denoising approaches are designed by-hand accordingly [65].By-hand derivation of denoising algorithms works well in practice owing to its generality.Specific a priori information about the original signal is not required, beyond its signal class (e.g.natural images, human speech, radar return timings).However, meaningful features must be assumed or investigated by practitioners before successful inference can take place.For binary denoising problems, we assume a binary symmetric channel (BSC) defined in the following manner.Given some binary signal x ∈ {0, 1} N , we observe the signal y ∈ {0, 1} N as x with independent bit flips occurring with probability p.This gives the following likelihood at each observation, which can be shown to have the equivalent representation as a Boltzmann distribution, where D i ln p 1−p (2y i − 1).For a given prior distribution P (x), the posterior distribution is given by Bayes' rule, P (x|y) = e i Dixi P (x) x e i Dixi P (x) By assuming a factorized , where m i might be per-site empirical averages obtained from available training data, the posterior factorizes and we can construct the Bayes-optimal pointwise estimator (OPE) as the average x i P (xi|yi) which is just P (x i = 1|y i ) for our binary problem.Thus, the OPE at each site x i is given as For a given dataset, the OPE gives us the best-case performance using only pointwise statistics from the dataset, namely, empirical estimates of the magnetizations m i .We can see from Eq. ( 45) that the OPE either returns the observations, in the case of p = 0, or the prior magnetizations m i , in the case p = 0.5.In this case of complete information loss, the worst case performance is bounded according to the deviation of the dataset from its mean.We present the performance of the OPE in Fig. 8 for the binary-MNIST dataset.This makes the OPE a valuable baseline comparison and sanity check for the GRBM approximation of P (x).As the GRBM model takes into account both pointwise and pairwise relationships in the data, a properly trained GRBM should provide estimates at least as good as the OPE.
The k-Nearest-Neighbor (k-NN) algorithm represents a different heuristic approach to the same problem [66].In this case, the noisy measurements are compared to a set of exemplars: the training dataset.Then, according to some distance metric such as MSE or correlation, one finds the k exemplars with minimal distance to the noisy observations to serve as a basis for recovering the original binary signal.One can use some arbitrary approach for fusing these exemplars together into the final estimate, but the simplest case would be a simple average.In the case that k → ∞, the performance when using averaging is again bounded by the empirical magnetizations.In the other limit of k = 1, the estimate is simply the nearest exemplar.It is hard to show the limiting performance of this approach, as it is dependent on the distances, in the chosen metric, between the exemplars and the observations, as well as the interplay between the noise channel and the distance metric.
However, it can be seen directly that this approach is non-optimal, as this approach will not yield the true signal at p = 0 unless the true signal is itself contained within the training data.We show the performance for k = 1 in Fig. 8.The advantage of this approach is that it successfully regularizes against noise as p → 0.5, as the nearest exemplar is always noise-free and at least marginally correlated with the original signal, up to the distance metric.Additionally, we see that it performs better than the OPE in the regime p > 0.2.This can be explained since we can think of the k-NN approach as implicitly, though indirectly, taking into account higherorder correlations in the dataset by naïvely returning data exemplars; all the estimates trivially posses the same arbitrarily complex structure as the unknown signal.
Using the GRBM, we can hope to capture the best points of both approaches.First, we hope to perfectly estimate the original signal in the case p = 0. Second, we hope to leverage the pairwise correlations present in the dataset, returning estimates which retain the structure of the data even as p → 0.5.For GRBM denoising of the BSC, we no longer have a factorized posterior.Instead, we have the GRBM likelihood given in Eq. ( 9) summed over the hidden units.Using the definition of the binary prior given in Appendix 3, Since both the GRBM and the BSC channel likelihood are written as exponential family distributions, P (y|x)P (x; W, U ) ∝ h e ij xiWij hj + i (Di+Ui)xi+ j Uj hj .Finding the averages x i for this model simply consists in running the TAP-based inference of Alg. 1 for the modified visible binary prior B(x i ; U i + D i ).One heuristic caveat of this approach is that we must take into account the multi-modal nature of the TAP free energy.Since we must initialize somewhere, and the resulting inference estimate is dependent upon this initialization, we initialize the inference with the OPE result.
We can see that as p → 0, D i → ±∞ and the highest probability configuration becomes observations.So, in the limit, we are able to obtain the true signal, just as the OPE, especially since we initialize within the well of this potential.This is shown for binary RBMs trained with varying numbers of hidden units in Fig. 8.In every case, for p = 0, the true signal is recovered.In the case of N h = {25, 50, 100}, we see that the TAP inference on the binary RBM always outperforms the OPE.Additionally, we see that in each of these cases, the performance closely mirrors that of the 1-NN as p → 0.5.In the limit p = 0.5, we see that the result of the TAP inference is, essentially, uncorrelated with the original signal, as in this case, there is no extra potential present to bias the inference, and the resulting estimate is simply an arbitrary solution of the TAP free energy.As this closely mirrors the exemplar selection in 1-NN, the MCC curves for the two approaches are similar.
In the case of N h = 500, we can see that an over-training effect occurs.Essentially, at low values of p, the TAP inference over the binary RBM is able to more accurately identify the original signal.However, at a certain point, owing to the increased number of solutions in the TAP free energy, there exist many undesirable minima around the noisy solutions, leading to poor denoising estimates.One can observe this subjectively in Fig. 9, where in the case of N h = 500, the TAP inference results in either nearly zero-modes, or in very localized ones.This would seem to indicate that landscape of the TAP free energy around the initializations is becoming more unstable as the density of solutions increases around it.Additionally, since the TAP free energy landscape was only probed using data points during training, the clustering of solutions around noisy samples remains ambiguous.Augmenting the initializations used when calculating the TAP solutions for the gradient estimate with noisy data samples could help alleviate this problem and regularize the TAP free energy landscape in the space of noisy data samples.For the OPE and RBM approaches, the inferred posterior averages xi are shown, rather than the final configurations, where white and black represent the values 0 and 1, respectively.At each tested value of p, the same noise realization is used for each method.As p increases, for N h = {25, 50, 100}, the TAP inference for the RBM provides estimates which still possess digit structure.In the case N h = 500, the TAP inference gets caught in spurious and undesirable minima as p increases.

VIII. DISCUSSION
In this paper, we have proposed a novel interpretation of the RBM within a fully tractable and deterministic framework of learning and inference via TAP approximation.This deterministic construction allows novel tools for scoring unsupervised models, investigation of the memory of trained models, as well as allowing their efficient use as structured joint priors for inverse problems.While deterministic methods based on NMF for RBM training were shown to be inferior to CD-k in [36], the level of approximation accuracy afforded by TAP finally makes the deterministic approach to RBMs effective, as shown in the case of binary RBMs in [38].
Additionally, our construction is generalized over the distribution of both the hidden an visible units.This is unique to our work, as other works propose unique training methods and models when changing the distribution of the visible units.For example, this can be seen in the modified Hamiltonians used for real-valued data [67,68].This construction allows us to consider binary, real-valued, and sparse real-valued datasets within the same framework.Additionally, one can also consider other architectures by changing the distributions imposed on the hidden unit.Here, we present experiments using only binary hidden units, but one could also use our proposed framework for Gaussian-distributed hidden units, thus mimicking a Hopfield network [2].Or, also, sparse Gauss-Bernoulli distributed hidden units could mimic the same functionality as that proposed by the spike-and-slab RBM [43].We have left these investigations to further works on this topic.
Our proposed framework also offers a possibility to explore the statistical mechanics of these latent variable models at the level of TAP approximation.Specifically, for a given statistical model of the weights W , both the cavity method and replica can begin to make predictions about these unsupervised models.Analytical understanding of the complexity of the free energy landscape, and its transitions as a function of model hyper-parameters, can allow for a richer understanding of statistically optimal network construction for learning tasks.In the case of random networks, there has already been some progress in this area, as shown in [69] and [70].However, similar comprehensive studies conducted on learning in a realistic setting are still yet to be realized.Finally, as our framework can be applied to deep Boltzmann machines with minimal alteration, it can also potentially lead to a richer understanding of deep networks and the role of hierarchy in regularizing the learning problem in high dimensionality.
(A3) Here, the notation i → j represents a message from variable index i to variable index j, and ∂ i /j refers to all neighbors of variable index i except variable index j.We denote neighboring variables as those which share a pairwise factor.Finally, the super-scripts on the messages refer to the time-index of the BP iteration, which implies the successive application of (A3) until convergence on the set of messages ν ¯= {ν i→j (x i ) : (i, j) ∈ E}, where E is the set of all pairs of neighboring variables.We also note the inclusion of the message normalization term Z i→j which ensures that all messages are valid PDFs.Additionally, it is possible to write the marginal beliefs at each variable by collecting the messages from all their neighbors, j→i (x j ).
(A4) Subsequently, the Bethe free energy can be written for a converged set of messages, ν ¯ * , according to [59] as where Z * →i refers to the normalization of the set of the marginal belief at site i derived from ν ¯ * .
Unfortunately, the message-passing of (A3) cannot be written as a computable algorithm due to the continuous nature of the PDFs.Instead, we must find some manner by which to parameterize the messages.In the case of binary variables, as in [38], each message PDF can be exactly parameterized by its expectation.How-ever, for this general case formulation, we cannot make the same assumption.Instead, we turn to relaxed BP (r-BP) [49], described in the next section, which assumes a two-moment parameterization of the messages.We will now consider one possible parametric approximation of the message set via r-BP [49].This approach has also gone by a number of different names in parallel re-discoveries of the approach, e.g.moment matching [48] and non-parametric BP [47].In essence, we will be assuming that all messages ν ¯can be well-approximated by their mean and variance, a Gaussian assumption.This approximation arises from a second-order expansion assuming small weights W ij .By making this assumption, we will ultimately be able to close an approximation of the messages on their two first moments, a i→j x i νi→j and c i→j x 2 i νi→j − x i 2 νi→j .

Derivation via Small Weight Expansion
Considering the marginalization taking place in (A3), we will perform a second order expansion assuming that W il → 0. We start by taking the Taylor series of the incoming message marginal for negligible weights, Now, we approximate the series by dropping the terms less than O(W 3 il ).This approximation can be justified in the event that all weight values satisfy |W il | < 1. Identifying the integrals from the expansion as moments, we see the following approximation However, we would like to write this approximation in terms of the central second moment.Through a second approximation that neglects O(W 3 il ) terms we arrive at our desired parameterization of the incoming message marginalization in terms of the message's two first central moments, We now substitute this approximation back into (A3) to get where From here, we can see that we have now a set of closed equations due to the dependence of A (t) and B (t) on the moments a (t) and c (t) , and vice versa.The values of these moments can be written as a function of A (t) and B (t) which is dependent upon the form of the local potentials φ(x i ; θ i ), i.e. the prior distribution we assign to the variables themselves, where and Z is simply the normalization dx φ(x; θ) e xB− 1 2 x 2 A .The inferred marginal distributions at each site can be calculated via the same functions, but instead using all of the incoming messages, i.e. a →i ; θ i .In Appendix 1 we give the closed-forms of these moment calculations for a few different choices of φ(x; θ).
If one wants to obtain an estimate of the free energy for a given set of parameters θ, it is possible to iterate between (B6), (B5) and (B7), (B8) until, ideally, convergence.It is important to note, however, due to both the potentially loopy nature of the network as well as smallweight expansion, that the BP iteration is not guaranteed to converge [39,59].Additionally, while we retain the time-indices in our derivation, it is not clear whether one should attempt to iterate these message in fully sequential or parallel fashion, or if some clustering and partitioning of the variables should be applied to determine the update order dynamically.

r-BP Approximate Bethe Free Energy
Additionally, we can write the specific form of the Bethe free energy under the r-BP two-moment parameterization of the messages.In this case, we can simply apply the small weight expansion of (B3) to the Bethe free energy for pairwise models given in (A5), Subsequently, using the final message definition given in (B4), we can see that which, correcting for double counting, can also be written as where d i is the degree at site i, |∂ i |.

Enforcing Bounded Messages
While we write the r-BP messages (B4) as though they are Gaussian distributions, this is a slight, since l→i ≥ 0 ∀i, l.The implication of the expansion is that, in general, the messages are in fact unbounded.This unboundedness is a direct result of the form of the conventional RBM pairwise factor, e xiWij xj .
There are a few avenues available to us to address these unbounded messages and produce a meaningful messagepassing for generalized RBMs.Let us consider the cases for which the messages are unbounded given a specific variable distribution.Assume that site x i is assigned a Gaussian prior, φ(x i ; θ i = {V i , U i }) ∝ e xiUi− 1 2 x 2 i Vi .In this case the r-BP message reads In this case, the message is unbounded in the event that weighted sum of all incoming neighbor variances at i exceeds the inverse variance of Gaussian prior on x i , where σ 2 i is the variance of the Gaussian prior.Said another way, this condition is telling us that when the message-passing starts to tell us that if the variance at x i is smaller than that of its prior, the messages become unbounded and fail to be meaningful probability distributions, and our expansion fails.The implication is that the r-BP message passing should be utilized in contexts where there exists some, preferably strong, evidence at each site, or the weights in W should be sufficiently small.The stronger this local potential, or the smaller the weights, the more favorable the model is to the r-BP inference.This observation mirrors those made in [39], however, here the authors make the observation that in this setting, BP based on small-weight expansion for binary variables fails to converge.In our case, without taking some form of regularization, the inference fails entirely.Thus, large magnitude couplings W il must be backed with a high degree of evidence at site i and l.This property could be utilized for the inverse learning problem, where one must learn the couplings W given a dataset, in order to constrain the learning to parameters which are amenable to the r-BP inference.
One direct manner to create probability distributions from otherwise unbounded continuous functions is via truncation.Specifically, we enforce a non-infinite normalization factor by restricting the support of the distribution to some subset of R. In this case, just slightly violating the bounded condition above will induce a uniform message distribution over the distribution support, while a strong violation will cause the distribution to concentrate on the boundaries of the support.Another approach might simply be to fix a hard boundary constraint on A i→j , thus never permitting unbounded messages to occur.and writing the distribution for the parameters θ = {U, V, [α, ω]} as , (C3) where Erf[•] is the error function and the last step follows from the identity Φ Here, the subscript + is used to indicate V > 0.
In the case that V < 0, we will write T G in terms of the imaginary error function, Erfi[x] −iErf[ix], and noting that for .
(C4) We use this negative variance version of the truncated Gaussian to handle the special case of A + V < 0 first detailed in Sec. 3.
We now detail the computation of the partition and first two moments of the Gaussian-product distribution of T G, Ax 2 +Bx .The calculation of the moments as a function of A and B will provide the definitions of f a and f c , while the calculation of the normalization Z will provide terms necessary for both the computation of the TAP free energy as well as the gradients necessary for learning θ during RBM training.
First, we will calculate the normalization of Q(x; θ, A, B) in terms of all free parameters.To do this, consider the truncated normalization of the following product of Gaussians, where x .We will need to make note of the special case of A + V < 0, thus, and x − B+U A+V .Since Q(x; θ, A, B) is simply a truncated Gaussian with updated parameters, the first moment is given according to the well known truncated Gaussian expectation.While this expectation is usually written in terms of a mean and variance of the un-truncated Gaussian distribution, and for the case of positive mean, we will instead write the expectation in terms of the exponential polynomial coefficients and note the special case A + V < 0, Next, we will write the variance of Q(x; θ, A, B) as a function of A and B. As earlier, since Q has the specific form of a truncated Gaussian distribution we can utilize the well-known variance formula for such a distribution.As in the case of f a , we modify this function for the special case of A + V < 0. Specifically,

Gradients of the Log-likelihood
To determine the gradients of the log-likelihood w.r.t. the model parameters, it is necessary to calculate the gradients of both ln T G + (x; θ) and ln Z Q in terms of the distribution parameters U and V .We assume that the boundary terms α and ω remain fixed.Since both of these distributions are truncated Gaussians, we can treat them both in terms of the derivatives of the log normalization of a general-case truncated Gaussian, For Boltzmann measures of the given quadratic form, we know that From these relations, we can write the necessary derivatives.For the case of ln T G + (x; θ) we have which we can see as just the difference between the data and the moments of the truncated Gaussian distribution.
Next, the derivatives of Z Q , for fixed A and B, can be written in terms of f a and f c , the moments of Q, Finally, we are ready to write the gradients of the loglikelihood to be used for both hidden and visible updates with the truncated Gaussian distribution.For a given set of mini-batch data indexed by m ∈ {1, . . ., M }, and a number of TAP solutions indexed by p ∈ {1, . . ., P }, the gradients of a visible variable are given by and where • M and • P are averages over the mini-batch and TAP solutions, respectively.For updates of hidden side variables using the truncated Gaussian distribution we have the following gradients for updating their parameters, and where the moments a j (m) , 0; θ j ) are defined for convenience.Using these gradients, we can now update the local biasing distributions for truncated Gaussian variables.

b. Numerical Considerations
Using the truncated Gaussian prior comes with a few numerical issues which must be carefully considered.While we have already addressed the cases when A + V < 0, we have not addressed the case where One can see how this complicates matters by observing the term 1 d +/− which occurs in both the first and second moment computations.When the magnitude of the limits ω and α become vanishingly small in comparison to the scaled joint mean term, then we see that d +/− → 0. However, the numerators of both f a and f c also go to zero, which implies that we may be able to find some method of approximation to find estimates of these moments without, up to numerical precision, dividing by zero.
In order to handle this eventuality in our implementation, we make a Taylor series expansion of the 1 d +/− term in the following way.First, we note that d +/− has the form Erf[η(ω − µ)] − Erf[η(α − µ)], which can be rewritten as Erf[z]−Erf[z − ] for z η(ω −µ) and η(α−ω) is a multiple of the difference between the two boundaries of the truncated Gaussian distribution.Since we wish to consider the case that µ → ∞, we will take the Taylor expansion centered at z = ∞, since the value of µ dominates.From here, we find that the following approximation works well in practice, where S (n) (•) is the n-term power series representation of the error function.In our experiments, we use n = 11.
A similar approximation can be used for the variances in the same situation.While this approximation could potentially be computationally costly for large n, we note that it is only used for updates on variables for which a small value of d +/− has been detected.

Truncated Gauss-Bernoulli Units
For truncated Gauss-Bernoulli distributed units, we form the distribution as a mixture between a delta function and T G, with an extra term (1 − ρ) which controls the density at x = 0, thus, for θ = {ρ, U, V, [α, ω]} we have where Z T GB (1 − ρ) + ρZ T G and δ(x) is the Dirac delta function such that δ(0) = 1 and 0 everywhere else.By using a construction such that the truncation is done on the Gaussian mode alone, and not across the entire distribution, we can easily write the necessary functions of this distribution in terms of the values we have already calculated in Appendix 1 as long as 0 ∈ [α, ω].Additionally, it will be useful for our calculations to define the probability of x to be non-zero according to T GB, We now continue as in the previous appendices and write the normalization and first two moments of the Gaussian product distribution Q( 2 Ax 2 +Bx .First, the normalization can be written simply as a function of the truncated Gaussian normalization modified by A and B, as in (C6), Next, the first moment of Q can be found by recalling the relation where the non-zero probability is calculated according to A and B. For the second moment, we note that ∂ Next, we turn our attention to the log-likelihood gradients necessary for updating the parameters U , V , and ρ during training.First, we will look at the derivatives of ln T GB required for updates on visible units.In order to calculate these derivatives, we will split the log probability into two cases, Consequently, the derivatives of the log probability can are written as the following, and The derivative w.r.t.ρ is a bit more complicated, as we cannot use the same identities.Additionally, we must also consider the two cases of x = 0 and x = 0 separately.Thus, which can be rewritten in the more concise form by noting the complement of the support probability P [x = 0] = 1 − P [x = 0] = 1−ρ Z T GB = 1 − ρ and making the appropriate substitution.
We now write the derivatives of the ln Z Q in terms of U , V , and ρ.These take the same form as those written in Appendix 1. Subsequently, the equations for ∆U i , ∆V i , ∆U j , and ∆V j all remain consistent, just under the modification of all moments being taken w.r.t. the T GB.We need only to write the gradients ∆ρ i , ∆ρ j .Starting from the log partition and applying the same identity used to write Eq. (C30) which gives us the final gradients for visible units, and C33) for hidden units, where P Q is calculated in the naive-meanfield manner described in Appendix 1, where

Binary Units
We define the distribution for binary units to be the Bernoulli distribution such that x ∈ {0, 1} where m Prob[x = 1] = x B .We can also write B as the Boltzmann distribution where U ln m 1−m and Z B 1 + e U .Next, we calculate the normalization and moments of the distribution Q( For the normalization we have, Subsequently, for the moments of Q we have where sigm is the logistic sigmoid function.Subsequently, the variance for the binary unit can be calculated directly as f c (B, A; U ) = f a (B, A; U ) − f a (B, A; U ) 2 .(C38) P (x, h (1) , .., h (L) ; W (1) , .., W (L) , θ) = 1 Z W (1) , .., W (L) , θ e i,j xiWij h (1)   j + l i,j h Similarly to RBMs, the distribution of visible variables is obtained by marginalizing out the latent variables, P (x; W (1) , .., W (L) , θ) =   l j dh (l) j   P (x, h (1) , .., h (L) ; W (1) , .., W (L) , θ), (E2) yielding the following log-likelihood ln P (x; W (1) , .., W (L) , θ) = − ln Z W (1) , .., W (L) , θ + The major difference between RBMs and DBMs lies in the complexity of evaluating the above expression.Whereas for RBMs only the log-partition features a problematic multidimensional integral Eq. ( 11), here both the logpartition and the last term are intractable.This additional complication carries through to the computation of the gradients necessary for training, since the datadependent term deriving from the last term of (E3) is no longer tractable.
This intractability follows from the fact that hidden units in neighboring layers are now connected to each other, and are thus no longer conditionally independent.Interestingly, the first proposal to deal with the datadependent terms of DBMs consisted in using a naive mean-field approximation [26], while keeping a Monte Carlo based strategy to compute gradients deriving from the log-partition.In this work, we propose instead to use the TAP approximation for both of them, hence improving on the NMF approximation and avoiding any sampling of the rather complicated RBMs.
The TAP equations related to the log-partition Z W (1) , .., W (L) , θ follow directly from the general derivation of Sec.V for fully connected models, with however a different weight matrix being used.For in-stance, the effective weight matrix of a DBM with 2 hidden layers is defined by blocks as Thus, implementing the GRBM inference algorithm Alg. 1 with the proper weights outputs TAP solutions {a, c, B, A}, each of them a vector with components corresponding to the different units in the DBM.
For the last term of (E3), we recognize the log-partition of a model closely related to the considered DBM, where visible units are not anymore variable but fixed, or clamped, at values x i , and the original interaction between visible and first hidden layer units is replaced by an additional local field on each h (1) j equal to i W (1) ij x i .Finally, under this simple modification of the Hamiltonian, TAP equations follow again from the general derivation in Sec.V.The resultant TAP solutions depending on data points x are said to be clamped and denoted as {ā(x), c(x), B(x), Ā(x)}.

Training Algorithm and Experiments
The gradients of the log-likelihood with respect to the model parameters θ are similar to the RBM ones, given by ( 12), ( 13) and (14).However, the first data-dependent term cannot be analytically computed anymore, and we use the clamped TAP solutions to approximate it.The second term is evaluated using the data-independent TAP solutions, similarly to our strategy for RBMs.The corresponding expressions of the gradients are ln Z h j ( Bh j (x (m) ), Āh j (x (m) ); θ h j (l) ) ) (E8) These expressions can be plugged to a gradient ascent algorithm, as in the RBM training algorithm Alg. 2. Nevertheless, this simple strategy of simultaneous training of all the parameters of the model (joint training) usually fails as the magnitude of weights of deep layers typically remains very small and the model eventually resembles a mere RBM.Several regularizations have been proposed to tackle this well-known problem of DBM training [16,26,[72][73][74].In our experiments, we used a greedy layerwise pre-training [26], which consists in computing a meaningful initialization of the weights by training the RBMs layer-by-layer, before performing the joint training.The complete algorithm is described in Alg. 4.
Fig. 11 shows the evolution of the TAP log-likelihood for a 2-hidden layer and a 3-hidden layer DBMs, trained with the above described algorithm.

Algorithm 4 GDBM Training
Input: X, Tpretrain,Tjoint train, M , K, R(•) Pretraining W (1) , θ v , θ h (1)  ← Alg.2(X, Tpretrain, M , K, R(•)) for All hidden layers l ≥ 2 do H (l−1) ∼ P h (l) |H (l−1) , ..., H 1 , X W (l) , θ h (l) ← Alg.2(H (l−1) , Tpretrain, M , K, R(•)) end for Joint training Initialize: t = 0 repeat for All mini-batches X B of size M do a (t+1) , c (t+1) , B (t+1) , A (t+1) ← Alg.1(W (t) , θ (t) ) × K Training was performed using the 5000 first images of binarized MNIST, with a learning rate of 0.001 and batches of size 100.The different curves correspond to different strategies of estimation of the likelihood gradients, either with TAP or adaTAP.Both algorithms were iterated for a fixed number of times (3, 10 and 100).In all cases, a damping of 0.5 was used.All methods yield comparable results in terms of training performance, except for adaTAP with only 3 iterations, which shows poorer performance.Right: Computation time for one iteration of the inference algorithm, as a function of batch size.Time is reported in seconds for identical experimental settings.The need for a matrix inversion for each batch element makes VAMP 3 orders of magnitude slower than TAP.

FIG. 4 .
FIG. 4. Subsets of the final receptive fields, i.e. the columns of W , obtained by TAP training of GRBM models with varying numbers of hidden units, N h .For the receptive fields, dark blue and yellow are mapped to −1 and +1, respectively, and green indicates a value of 0. Receptive fields are ranked according to two criteria.First, spread, and conversely localization, as measured by the p-norm of the receptive field, for p = 0.1.Second, by activation level, as measured by the mean activation of each receptive field's corresponding hidden unit averaged across the training dataset.
FIG. 5. Distribution of free energy estimates of TAP solutions as a function of training epochs for the three different datasets.In the case of the two MNIST experiments, the number of hidden units is the same N h = 100 and 10, 000 samples drawn from the training data are used as initial conditions.For CBCL, 2, 400 training samples are used.Top Row: TAP Free energy for all unique TAP solutions (transparent blue dots), and the Helmholtz free energy estimate via uniform averaging (red line).The number of unique TAP solutions is also given (green line).Bottom Row: Detail of TAP free energy distributions for slices of training.Histograms are given as bars, while kernel density estimates of the TAP free energy distribution are given as curves.

FIG. 6 .FIG. 7 .
FIG. 6. Isomap visualization of TAP solutions for binary-MNIST over training epochs for N h = 100.All TAP solutions are mapped to the same two-dimensional embedding via an Isomap transform fitted to the TAP solutions of Epoch 100.The embedding is performed on both the hidden and visible inferred expectations, a v and a h .The color mapping corresponds to the TAP free energy values of each solution, with the range of colors normalized between the minimum and maximum free energies of the solution at each training epoch.

FIG. 8 .
FIG.8.Average denoising performance for reconstruction from bit-flip errors on binary-MNIST over the probability of a bit to be flipped.Denoising via inference on the binary-binary RBM is denoised by the varying numbers of hidden units (N h = {25, 50, 100, 500}).Also shown as baseline comparisons are the OPE given the empirical factorized magnetizations at each site (dashed black ), and a 1-NN matching from the training set (solid black ).All experiments were run over the same 1,000 data samples drawn from the held-out test set and compared using the MCC.Binary estimates are obtained for the OPE and TAP inferred estimates by rounding the resulting magnetizations.
FIG.9.Subjective comparison of denoising estimates for a single digit image for p = [0, 0.5].For the OPE and RBM approaches, the inferred posterior averages xi are shown, rather than the final configurations, where white and black represent the values 0 and 1, respectively.At each tested value of p, the same noise realization is used for each method.As p increases, for N h = {25, 50, 100}, the TAP inference for the RBM provides estimates which still possess digit structure.In the case N h = 500, the TAP inference gets caught in spurious and undesirable minima as p increases.
Appendix B: r-BP for Pairwise Models

i
ln P i (x i ; θ v i )

āFIG. 10 .
FIG.10.Left: Evolution of the pseudo-likelihood along the training of an RBM with 784 binary visible units and 500 binary hidden units.Training was performed using the 5000 first images of binarized MNIST, with a learning rate of 0.001 and batches of size 100.The different curves correspond to different strategies of estimation of the likelihood gradients, either with TAP or adaTAP.Both algorithms were iterated for a fixed number of times (3, 10 and 100).In all cases, a damping of 0.5 was used.All methods yield comparable results in terms of training performance, except for adaTAP with only 3 iterations, which shows poorer performance.Right: Computation time for one iteration of the inference algorithm, as a function of batch size.Time is reported in seconds for identical experimental settings.The need for a matrix inversion for each batch element makes VAMP 3 orders of magnitude slower than TAP.
FIG. 11.Training performances over 3000 training epochs for a 2-hidden layer (left) and 3-hidden layer (right) deep Boltzmann machines (DBMs) on the binarized MNIST datasets.Both models were pretrained for 50 epochs with the same learning rate of 0.001.The training performance is measured as the normalized (per unit) TAP log-likelihood for the test images (blue) and the train images (orange).
can produce estimates FIG. 1. Factor graph representation of the RBM distribution.Variables are indicated by circles, with latent variables denoted by shaded circles.The shaded rectangles indicated layer partitions within the RBM structure.Factors are represented by squares, with the right hand side factors representing the pairwise relationships between variables and the left hand side factors representing the influence of the localized prior distributions on the variables.