Scalar field Restricted Boltzmann Machine as an ultraviolet regulator

Restricted Boltzmann machines (RBMs) are well-known machine learning tools used to learn probability distributions from data. We analyse RBMs with scalar fields on the nodes from the perspective of lattice field theory. Starting with the simplest case of Gaussian fields, we show that the RBM acts as an ultraviolet regulator, with the cutoff determined by either the number of hidden nodes or a model mass parameter. We verify these ideas in the scalar field case, where the target distribution is known, and explore the implications for cases where it is not known using the MNIST dataset. We also demonstrate that infrared modes are learnt quickest.


I. INTRODUCTION
In recent years machine learning (ML) has gained tremendous popularity in the physical sciences [1]. In theoretical nuclear and high-energy physics, ML is applied to a wide range of problems, see e.g. the reviews [2,3]. In lattice field theory (LFT), there are applications to all aspects of LFT computations [4], with the development of normalising flows to generate field configurations a particularly active area of research [5,6]. From a theoretical perspective, it is of interest to explore synergies between ML on the one hand and statistical physics and LFT on the other, as many ML problems can be studied using the tools of the latter, see e.g. Ref. [7]. The connection between neural networks, Markov random fields and (Euclidean) lattice field theory has indeed not gone unnoticed, leading to the notions of quantum field-theoretic machine learning (QFT/ML) [8] and the neural network/QFT correspondence [9,10]. Further exploration of this connection may be fruitful in both directions, providing potential insights relevant to both the ML and the LFT/QFT communities.
In this paper, we take a step in this direction by considering one of the simplest generative ML models, the restricted Boltzmann machine (RBM) [11,12]. We analyse the RBM with continuous fields as degrees of freedom from the perspective of a Euclidean LFT and give a complete understanding in the case of Gaussian fields. We verify our analytical insights using simple scalar field theories in one and two dimensions, for which the target distribution is known, and also the MNIST dataset, to demonstrate that our findings are indeed relevant for typical ML datasets without known target distributions. We are in particular interested in the choice of "architecture" (admittedly quite straightforward for an RBM), namely the number of hidden nodes, as well as the choice of certain hyperparameters. Our main conclusion is that the scalar field RBM acts as an ultraviolet regulator, with the cutoff determined by either the number of hidden nodes or a model mass parameter. We will make clear what this implies for the MNIST dataset, but note here already that in QFT language the MNIST dataset is ultraviolet divergent and infrared safe.
The paper is organised as follows. In Sec. II we introduce scalar field RBMs from the perspective of LFT and give some exact solutions for the Gaussian case. The standard equations to train an RBM are summarised in Sec. III. In Sec. IV we analyse these equations analytically and work out some simple examples in detail. The findings of this section will be further explored in the two following sections. First, we consider as target theories free scalar fields in one and two dimensions in Sec. V, for which the target distribution is known. In Sec. VI we validate our findings for a dataset with an unknown distribution, namely the MNIST dataset. Options to add interactions are discussed in Sec. VII. A summary is given in the final section. Appendix A contains some more details on the algorithm employed, while in Appendix B the Kullback-Leibler divergence is evaluated in the Gaussian case.

II. SCALAR FIELDS ON A BIPARTITE GRAPH
RBMs are defined on a bipartite graph, consisting of one visible layer (with N_v nodes) and one hidden layer (with N_h nodes), see Fig. 1. Importantly, there are no connections within each layer, only between the two layers. The degrees of freedom living on the nodes can be discrete, as in an Ising model, continuous or mixed; Ref. [13] is a useful review.
In this section, we consider an RBM from the viewpoint of lattice field theory. We consider continuous fields and denote these as ϕ_i (i = 1, ..., N_v) for the visible layer and h_a (a = 1, ..., N_h) for the hidden layer. The layers are coupled via bilinear terms involving the N_v × N_h weight matrix W, of the form

    −∑_{i,a} ϕ_i W_{ia} h_a.

The aim is to describe a probability distribution p(ϕ) on the visible layer, constructed by integrating over the hidden nodes in the joint probability distribution p(ϕ, h), as follows,

    p(ϕ) = ∫ Dh p(ϕ, h),    p(ϕ, h) = exp(−S(ϕ, h))/Z,

where we have denoted the "energy" in the exponential as an action (following LFT notation) and the partition function reads Z = ∫ DϕDh exp(−S(ϕ, h)).
The integrals are over all nodes, Dϕ = ∏_i dϕ_i and Dh = ∏_a dh_a. Due to the absence of intralayer connections, the action takes a simple form in general,

    S(ϕ, h) = ∑_i V_v(ϕ_i) + ∑_a V_h(h_a) − ∑_{i,a} ϕ_i W_{ia} h_a,

where the two potentials can be any function (as long as the integrals are well defined) and can be node dependent, i.e. V_v(ϕ_i) → V_i(ϕ_i) and V_h(h_a) → V_a(h_a). Since there is no coupling between nodes within a layer, there is no "kinetic" or nearest-neighbour term; these are only generated via the coupling to the other layer.
To proceed, a natural starting point is to consider quadratic potentials, i.e. free fields (we discuss interactions in Sec. VII). We hence consider as action

    S(ϕ, h) = (µ^2/2) ∑_i ϕ_i^2 + 1/(2σ_h^2) ∑_a h_a^2 − ∑_{i,a} ϕ_i W_{ia} h_a − ∑_a η_a h_a.

A few comments are in order. We have denoted the prefactor as a mass term (µ^2) in the case of ϕ and as a variance (1/σ_h^2) in the case of h; this is inessential, but emphasises that the model on the visible layer is ultimately the one we are interested in. Both µ^2 and σ_h^2 are independent of the node; this is sufficient, as node dependence can be introduced via the weight matrix W, as we will see shortly. Finally, a source (or bias) η_a is introduced in the hidden layer but not in the visible layer; again this is sufficient, as a nonzero bias breaks both symmetries, h → −h and ϕ → −ϕ.
Integrating out the hidden nodes then leads to the following distribution on the visible layer,

    p(ϕ) = (1/Z) exp(−½ ∑_{ij} ϕ_i K_{ij} ϕ_j + ∑_i J_i ϕ_i),

with

    K = µ^2 1 − σ_h^2 W W^T,    J = σ_h^2 W η,

and where Z now reads Z = ∫ Dϕ exp(−½ ϕ^T K ϕ + J^T ϕ). We note therefore that the distribution on the visible layer resembles a generating function for a scalar field theory, with the possibility of all-to-all bilinear interactions between the fields via the nonlocal kernel K, and the bias resulting in a source term J coupled to ϕ. The connected two-point function or propagator is given by ⟨ϕ_i ϕ_j⟩_c = (K^{−1})_{ij}. The hidden layer has provided auxiliary degrees of freedom to establish correlations between the visible nodes.
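As a concrete numerical check of this structure, the induced kernel and propagator can be computed directly from a weight matrix. This is a minimal sketch; the sizes and parameter values are illustrative and not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
Nv, Nh = 6, 6
mu2, sigma2_h = 9.0, 1.0   # illustrative values

# Random weight matrix, small enough that the model exists
W = 0.3 * rng.standard_normal((Nv, Nh))

# Kernel induced on the visible layer after integrating out the hidden nodes:
# K = mu^2 * 1 - sigma_h^2 * W W^T
K = mu2 * np.eye(Nv) - sigma2_h * W @ W.T

# Connected two-point function (propagator) is the inverse kernel
G = np.linalg.inv(K)

# The Gaussian model exists only if all eigenvalues of K are positive
assert np.linalg.eigvalsh(K).min() > 0
```

Note that K is dense in general: the hidden layer induces all-to-all couplings between visible nodes, even though the graph itself has no intralayer connections.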
To continue the discussion we now assume the target probability distribution p_target(ϕ) is known and Gaussian, such that we can solve the RBM explicitly, i.e. we can give explicit expressions for the weight matrix W and the bias η. We denote the target kernel as K_ϕ and consider the symmetric case (ϕ → −ϕ, η = J = 0) for simplicity. Since K_ϕ is a real and symmetric matrix, it can be diagonalised; for the theory to exist, all its eigenvalues are assumed to be positive. The RBM is then solved by equating the two kernels, K_ϕ = K, i.e.,

    K_ϕ = µ^2 1 − σ_h^2 W W^T,

which implies

    σ_h^2 W W^T = µ^2 1 − K_ϕ.    (13)

Since W W^T is positive semidefinite, we find a condition on the parameter µ^2, namely

    µ^2 ≥ κ_max,    (14)

with κ_max the largest eigenvalue of K_ϕ. Consider now the case that N_h = N_v. It is then easy to find some solutions for W, given that the rhs of Eq. (13) is symmetric and positive semidefinite:

• The rhs of Eq. (13) can be decomposed via a Cholesky decomposition, µ^2 1 − K_ϕ = σ_h^2 L L^T, where L is a lower triangular matrix with real and positive diagonal entries. The solution is then simply W = L. The triangular structure means that hidden node a connects to visible nodes with a ≤ i only.

• The rhs of Eq. (13) can be diagonalised via an orthogonal transformation, yielding the symmetric solution

    W = O_ϕ [(µ^2 1 − D_ϕ)/σ_h^2]^{1/2} O_ϕ^T,

where O_ϕ diagonalises the target kernel, K_ϕ = O_ϕ D_ϕ O_ϕ^T.

Hence we have found two explicit solutions. Additional solutions are found from either of the above by right multiplication of W with an orthogonal transformation O_R, rotating the hidden nodes, since O_R drops out of the combination W W^T. We conclude therefore that an infinite number of solutions exists. These can be constrained by imposing further conditions on W, as in the two cases above. We will discuss this degeneracy further below.
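Both explicit solutions, and the orthogonal degeneracy, are easy to verify numerically. In the sketch below the target kernel is a randomly generated positive definite matrix, rescaled so that its largest eigenvalue lies below the (illustrative) value of µ^2:

```python
import numpy as np

rng = np.random.default_rng(1)
Nv = 5
mu2, sigma2_h = 10.0, 1.0   # illustrative values

# A symmetric positive definite target kernel with kappa_max < mu^2
A = rng.standard_normal((Nv, Nv))
K_phi = A @ A.T / Nv + np.eye(Nv)
K_phi *= 8.0 / np.linalg.eigvalsh(K_phi).max()

M = (mu2 * np.eye(Nv) - K_phi) / sigma2_h     # must equal W W^T

# Solution 1: Cholesky decomposition, W = L (lower triangular)
W_chol = np.linalg.cholesky(M)

# Solution 2: symmetric square root via diagonalisation
lam, O = np.linalg.eigh(M)
W_sym = O @ np.diag(np.sqrt(lam)) @ O.T

# Both reproduce the target kernel
for W in (W_chol, W_sym):
    assert np.allclose(mu2 * np.eye(Nv) - sigma2_h * W @ W.T, K_phi)

# Degeneracy: right multiplication by any orthogonal O_R leaves W W^T invariant
O_R, _ = np.linalg.qr(rng.standard_normal((Nv, Nv)))
W_rot = W_chol @ O_R
assert np.allclose(W_rot @ W_rot.T, W_chol @ W_chol.T)
```

The last check makes the infinite degeneracy concrete: every orthogonal rotation of the hidden nodes yields another exact solution.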
Next, we may consider the case where N_h < N_v. From Eq. (13) it is clear that the accuracy with which the target distribution can be reproduced depends on the ranks of the matrices involved. We find

    rank(W W^T) ≤ min(N_v, N_h) = N_h.

Only when rank(W W^T) = rank(µ^2 1 − K_ϕ) will the target distribution be reproducible; this is particularly relevant when choosing N_h ≪ N_v. Below we will consider in detail what happens when either of the two conditions found so far, i.e. Eq. (14) and the rank condition, is violated.

III. TRAINING RBM PARAMETERS
The exact solutions above are only useful when the target model is a known Gaussian model and N_h = N_v. In general, the target distribution is not known and one has to learn from a finite dataset. The training of the model is then done by maximising the log-likelihood function L(θ|ϕ). The learnable parameters are collectively indicated as θ = {W, η, µ^2}. Note that we will consider the case of unbroken symmetry and hence the bias is taken to be zero throughout, η_a = 0. We are hence concerned with determining the weight matrix W and the mass parameter µ^2.
The model distribution is given by Eq. (8), with J = 0. Given data consisting of N_conf configurations, labelled as ϕ^(d), d = 1, ..., N_conf, the log-likelihood function of the model is written as

    L(θ) = (1/N_conf) ∑_d log p(ϕ^(d)).

This log-likelihood function can be optimised with gradient ascent algorithms, where the gradient is taken with respect to the coupling matrix W and the mass parameter µ^2. Explicitly,

    ∂L/∂W_ia = σ_h^2 [(C^data − C^model) W]_ia,

where the two-point correlation matrices for the data (i.e. the target) and the model are given, respectively, by

    C^data_ij = (1/N_conf) ∑_d ϕ_i^(d) ϕ_j^(d),    C^model_ij = ⟨ϕ_i ϕ_j⟩ = (K^{−1})_ij.

Similarly, for µ^2 one finds

    ∂L/∂µ^2 = ½ ∑_i (C^model_ii − C^data_ii).

When all the data are available, one is able to evaluate the two-point function by summing over configurations before training the RBM. This yields the target two-point function, computed via the data. In the numerical implementations below, we will analyse the properties of this two-point function further, since the matrix sizes are such that this is feasible. Alternatively, we may consider the case where the target distribution p_target(ϕ) is known and the correlation matrix C^target_ij of the target theory is obtainable. In that case, there is no need to use data and one can use the correlation function directly. It should be noted that in general the correlation matrix C^target_ij is not directly accessible due to computational complexity, even if the analytical form of the target distribution is known.
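The gradient ascent described here can be sketched for a small Gaussian target. In this simplified version the model moment C^model is evaluated analytically as K^{−1} (which is possible because the model is Gaussian), in place of the persistent contrastive divergence sampling used in the actual training; the target kernel and parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
Nv = Nh = 4
mu2, sigma2_h, lr = 8.0, 1.0, 0.05   # illustrative values

# A random Gaussian target with kappa_max < mu^2
A = rng.standard_normal((Nv, Nv))
K_phi = A @ A.T / Nv + 2.0 * np.eye(Nv)
K_phi *= 6.0 / np.linalg.eigvalsh(K_phi).max()
C_target = np.linalg.inv(K_phi)

# Gradient ascent on the log-likelihood: dW ~ sigma_h^2 (C_data - C_model) W,
# with the data moment replaced by the exact target correlation matrix
W = 0.1 * rng.standard_normal((Nv, Nh))
for _ in range(20_000):
    C_model = np.linalg.inv(mu2 * np.eye(Nv) - sigma2_h * W @ W.T)
    W += lr * sigma2_h * (C_target - C_model) @ W

K_model = mu2 * np.eye(Nv) - sigma2_h * W @ W.T
assert np.allclose(K_model, K_phi, atol=1e-3)
```

With N_h = N_v and µ^2 above the largest target eigenvalue, the learnt kernel converges to the target kernel, as expected from the exact solutions of Sec. II.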
If the target distribution is known, then the same equations can also be derived by extremising the Kullback-Leibler (KL) divergence KL(p_target||p), keeping in mind that only the model distribution depends on the learnable parameters θ. With the distribution given by Eq. (8) and the θ dependence contained in the kernel K only (recall that J = 0), extremising with respect to θ yields

    ∂KL(p_target||p)/∂θ = ½ ∑_{ij} (∂K_ij/∂θ) (C^target_ij − C^model_ij),

which gives the same equations for W and µ^2 as above, but with the opposite sign, as the KL divergence is minimised.
In actual applications, the gradients are used in a discretised update of the form

    θ_{n+1} = θ_n + η_n ∂L/∂θ|_{θ_n},

where η_n is the, possibly time-dependent, learning rate. Details of the commonly used persistent contrastive divergence algorithm and the time-dependent learning rate can be found in Appendix A.

IV. SEMIANALYTICAL SOLUTION A. Singular value decomposition
Before solving the RBM numerically, we aim to gain analytical insight into the update equations, using a singular value decomposition for the weight matrix in the continuous time limit [13].
The update equations for the weight matrix W and the mass term µ^2, in the continuous time limit, read [see Sec. III]

    Ẇ = σ_h^2 (K_ϕ^{−1} − K^{−1}) W,    (26)

    µ̇^2 = ½ tr(K^{−1} − K_ϕ^{−1}),

with the two-point functions (or propagators) K_ϕ^{−1} and K^{−1} of the target and the model, respectively. Recall that ⟨ϕ_i⟩ = 0. The dot denotes the time derivative. We remind the reader that both K and K_ϕ are symmetric N_v × N_v matrices and that the weight matrix enters the model kernel as K = µ^2 1 − σ_h^2 W W^T, where σ_h^2 is the variance of the hidden nodes. We use the singular value decomposition to write W as

    W = U Ξ V^T,

where U is an orthogonal N_v × N_v matrix, V is an orthogonal N_h × N_h matrix, and Ξ is the rectangular N_v × N_h matrix with the (ordered) singular values ξ_a (a = 1, ..., N_h) on the diagonal. The RBM kernel then takes the form

    K = U D_K U^T,    (33)

with the diagonal matrix

    D_K = µ^2 1 − σ_h^2 Ξ Ξ^T = diag(µ^2 − σ_h^2 ξ_1^2, ..., µ^2 − σ_h^2 ξ_{N_h}^2, µ^2, ..., µ^2).

Note that the existence of the model requires that µ^2 > σ_h^2 ξ_1^2, with ξ_1 the largest singular value of W. Equation (33) demonstrates that only the first N_h eigenvalues can potentially be learnt, with the remaining N_v − N_h eigenvalues frozen at the higher scale µ^2.
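The relation between the singular values of W and the RBM eigenvalues, including the N_v − N_h frozen modes, can be checked in a few lines (sizes and parameters illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
Nv, Nh = 6, 4
mu2, sigma2_h = 9.0, 1.0   # illustrative values

W = 0.5 * rng.standard_normal((Nv, Nh))
xi = np.linalg.svd(W, compute_uv=False)   # the Nh singular values

# RBM eigenvalues: lambda_a = mu^2 - sigma_h^2 xi_a^2 for the first Nh modes;
# the remaining Nv - Nh modes stay frozen at mu^2
lam = np.concatenate([mu2 - sigma2_h * xi**2, np.full(Nv - Nh, mu2)])

# Cross-check against the spectrum of the full kernel K = mu^2 1 - sigma_h^2 W W^T
K = mu2 * np.eye(Nv) - sigma2_h * W @ W.T
assert np.allclose(np.sort(lam), np.linalg.eigvalsh(K))
```

The frozen eigenvalues at µ^2 are exactly the modes that a too-small hidden layer cannot touch, which is the mechanism behind the ultraviolet regularisation discussed later.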
The symmetric target kernel K_ϕ and the corresponding two-point function K_ϕ^{−1} can be diagonalised via an orthogonal transformation as

    K_ϕ = O_ϕ D_ϕ O_ϕ^T,    K_ϕ^{−1} = O_ϕ D_ϕ^{−1} O_ϕ^T,

where the eigenvalues of K_ϕ are again assumed to be positive.
The rhs of Eq. (26) can now be written as

    σ_h^2 (O_ϕ D_ϕ^{−1} O_ϕ^T − U D_K^{−1} U^T) W.

The term within the brackets will be encountered frequently below and hence we honour it with a new symbol,

    Λ = O_ϕ D_ϕ^{−1} O_ϕ^T − U D_K^{−1} U^T = K_ϕ^{−1} − K^{−1}.    (36)

The evolution equation for W can then be compactly written as

    Ẇ = σ_h^2 Λ W.    (37)

We note that Λ drives the evolution in the learning process: it vanishes when the basis on the visible layer is aligned with the basis for the data (U → O_ϕ) and the eigenvalues, or widths of the Gaussians, are correctly determined (D_K → D_ϕ). One may note that Λ does not depend on V, which acts on the hidden nodes, resulting in the degeneracy discussed in Sec. II: any rotation of the hidden nodes leaves the solution on the visible layer invariant and the learning stops when Λ → 0, irrespective of what V is.

B. Learning dynamics
Having defined the needed quantities, we are now in a position to determine the learning dynamics of W in detail, i.e. the evolution of U, V, and the singular values Ξ. We consider separately the combinations W W^T and W^T W. Taking the time derivative of the first product and comparing with Eq. (37) yields two expressions for the same quantity. Conjugating both with U^T and U then gives Eq. (41), in which U^T U̇ is antisymmetric and hence has zeroes on the diagonal. Equation (41) then decomposes into one equation for the diagonal elements, determining the singular values, and one for the off-diagonal ones, determining U, namely Eqs. (43) and (44). Repeating the same analysis for W^T W gives nearly identical equations in the N_h × N_h subspace. The equations for Ξ Ξ^T and Ξ^T Ξ yield identical equations for the N_h singular values.
The equation for µ^2 can finally be written in the same notation. Keeping µ^2 fixed, it is easy to see that σ_h^2 can be absorbed in the time parameter (t̃ = σ_h^2 t) and the singular values, see Eq. (33); hence it does not add any freedom to the model. When µ^2 is learnt as well, its time evolution will depend on σ_h^2, after rescaling time as t → t̃. As noted, V does not appear in the driving term Λ. Hence V simply follows the evolution, until Λ − Λ_d → 0, see Eq. (46). For square matrices, N_h = N_v, this redundancy can be removed by choosing W to be symmetric (V = U) or by enforcing W to be of lower (or upper) triangular form (Cholesky decomposition of W W^T), see Sec. II.

C. Simple examples
The simple example of two visible and two hidden nodes can be worked out in detail.We will note a number of characteristics which remain relevant also for larger systems.
First we note that U, V, and O_ϕ are all 2 × 2 rotation matrices; we denote the corresponding angles as θ_U, θ_V, and θ_0, respectively. One then notes that U^T U̇ is antisymmetric with off-diagonal elements ±θ̇_U, and the same holds for V^T V̇ with θ̇_V. Finally, the combination O_ϕ^T U is a rotation over the angle ∆θ = θ_U − θ_0. We denote the two eigenvalues of the target kernel K_ϕ with κ_{1,2} and those of the RBM kernel with λ_{1,2} = µ^2 − σ_h^2 ξ_{1,2}^2. This yields the driving term Λ, and putting everything together then gives coupled evolution equations for ξ_{1,2}, ∆θ, and θ_V [Eqs. (53)-(55)], with ρ denoting the driving term for θ_V. These equations have several fixed points. The difference of angles is given by ∆θ = 0, π/2; which of these is selected depends on which eigenvalue κ_{1,2} is smaller. Note that the SVD orders the singular values, ξ_1 > ξ_2. The equations have fixed points at σ_h^2 ξ_{1,2}^2 = µ^2 − κ_{1,2} and at ξ_{1,2}^2 = 0. We consider first the case of fixed µ^2. The actual realisation depends on the ordering of κ_{1,2} and µ^2. When µ^2 is larger than both eigenvalues, the stable fixed points are σ_h^2 ξ_{1,2}^2 = µ^2 − κ_{1,2} (the fixed points at ξ_{1,2}^2 = 0 are unstable). This is illustrated in Fig. 2 (top row). In this case, both eigenvalues are learnt correctly. If µ^2 is smaller than an eigenvalue, then that eigenvalue cannot be reproduced and is replaced by µ^2, see Fig. 2 (middle row). In this case, only the smallest eigenvalue is learnt, while the other one evolves to µ^2 (see also Eq. (33)).
In case µ^2 is smaller than both eigenvalues, µ^2 < κ_{1,2}, the eigenmodes cannot be reproduced and are both replaced by µ^2, with ξ_1 = ξ_2 = 0. Finally, we remark again that θ_V simply evolves until ρ → 0, but it does not influence the learning of the other parameters.
The actual eigenvalues may not be known, and one may choose µ^2 too low, as in the second example above. This can be evaded by learning µ^2 itself, using Eq. (57). The system is now overparameterised, with ξ_{1,2} and µ^2 being learnt to reproduce κ_{1,2}. In this case one finds that the eigenvalues are reproduced irrespective of the initial value of µ^2, see Fig. 2 (bottom row). Note that one of the singular values decreases, since µ^2 itself increases towards the largest eigenvalue.

2. Approach to the fixed point
To understand the evolution towards the fixed point, a simple linearisation suffices. We consider the case of fixed µ^2. Taking concretely case (59) above, we expand about the fixed point, writing σ_h^2 ξ_i^2 = µ^2 − κ_i + x_i and ∆θ = y. Expanding Eqs. (53), (54), and (55) in x_i and y and absorbing σ_h^2 in the time parameter (denoting the derivative with respect to t̃ = σ_h^2 t with a prime) then yields linearised equations with exponential convergence, controlled by the relaxation rates γ_i = (µ^2 − κ_i)/κ_i^2 and γ_∆θ. The angle ∆θ(t̃) relaxes with γ_∆θ, whereas the singular values ξ_i(t̃) decay with min(γ_i, γ_∆θ). Interestingly, the relaxation rates are set by the difference between the RBM mass parameter µ^2 and the eigenvalues κ_i in the spectrum. Irrespective of the actual values of µ^2 and the eigenvalues κ_i, the mode corresponding to the higher eigenvalue relaxes the slowest. We hence conclude the following:

• infrared modes, i.e. those corresponding to the smallest eigenvalues, converge faster; this can indeed be observed in Fig. 2 (top row);

• increasing the value of µ^2 leads to more rapid convergence for all modes. This will be explored below in more realistic cases.

3. N_v = 2, N_h = 1

The case of one hidden node serves to demonstrate what happens when N_h < N_v. It is particularly simple, as V is replaced by v = 1 and we only need to consider one angle and one singular value. The equation for the angle is now decoupled and can be solved explicitly. Using this in Eq. (68) confirms that the smallest eigenvalue of K_ϕ is reproduced (for constant µ^2). If µ^2 is learnt as well, then Eq. (70) ensures it becomes equal to the largest of the two eigenvalues.
To summarise, we note the following observations: if either the number of hidden nodes or the mass parameter µ^2 is chosen too small, the infrared part of the spectrum (lowest eigenvalues) is reproduced, while the ultraviolet part (highest eigenvalues) evolves to µ^2; making µ^2 a learnable parameter yields one more degree of freedom to correctly reproduce the next eigenvalue; infrared modes are learnt quicker than ultraviolet modes. These observations for the simple case considered here remain relevant for more interesting systems, as we will now demonstrate.

V. LEARNING GAUSSIAN DISTRIBUTIONS
We continue with the case for which the target distribution is known and Gaussian, namely free scalar fields discretised on a lattice in one and two dimensions. The continuum action reads, in n Euclidean dimensions,

    S = ∫ d^n x [ ½ (∂ϕ)^2 + ½ m^2 ϕ^2 ].

The simplest lattice-discretised equivalent is, on a one-dimensional lattice with N_v nodes and with periodic boundary conditions,

    S = ½ ∑_{ij} ϕ_i K^ϕ_{ij} ϕ_j,    where    K^ϕ_{ij} = (m^2 + 2) δ_{ij} − δ_{i+1,j} − δ_{i−1,j}.

We use "lattice units", a = 1, throughout. The spectrum of the target kernel K_ϕ is easy to compute analytically and reads

    κ_k = m^2 + 4 sin^2(πk/N_v),    k = 0, 1, ..., N_v − 1.

Each eigenvalue is doubly degenerate, except the minimal (k = 0, κ_min = m^2) and the maximal (k = N_v/2, κ_max = m^2 + 4) ones. Referring back to Sec. II, the exact spectrum can only be learnt when N_h = N_v and when the RBM mass parameter satisfies µ^2 ≥ κ_max. Since the target theory is known, we can train the model directly from the correlation matrix of the target theory without the need for pregenerated training data. Each term of the gradient is then estimated by persistent contrastive divergence (PCD) to train the RBM, see Appendix A for details. The scalar field mass parameter is chosen as m^2 = 4 and the variance on the hidden layer equals σ_h^2 = 1 throughout.
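The lattice kernel and its analytic spectrum can be verified directly (N_v and m^2 as in the text):

```python
import numpy as np

Nv, m2 = 10, 4.0

# 1D lattice kernel with periodic boundaries:
# K_phi = (m^2 + 2) on the diagonal, -1 on the (periodic) off-diagonals
K_phi = (m2 + 2.0) * np.eye(Nv)
for i in range(Nv):
    K_phi[i, (i + 1) % Nv] -= 1.0
    K_phi[i, (i - 1) % Nv] -= 1.0

# Analytic spectrum: kappa_k = m^2 + 4 sin^2(pi k / Nv)
k = np.arange(Nv)
kappa = m2 + 4.0 * np.sin(np.pi * k / Nv) ** 2

assert np.allclose(np.sort(kappa), np.linalg.eigvalsh(K_phi))
```

For m^2 = 4 this gives κ_min = 4 and κ_max = 8, with the double degeneracy under k ↔ N_v − k visible in the sorted spectrum.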

A. Initialisation with an exact solution
We start with the case of a constant RBM mass parameter µ^2 = 9 > κ_max = 8, with N_v = N_h = 10. To test the numerical code, we may initialise the weight matrix W according to one of the exact solutions found in Sec. II: the Cholesky (lower triangular) solution or the symmetric solution. The results are shown in Figs. 3 and 4.
Here and throughout we denote the exact eigenvalues of the target distribution with κ_α (α = 1, ..., N_v) and the eigenvalues of the model kernel K with λ_α = µ^2 − σ_h^2 ξ_α^2. We will refer to the latter as the RBM eigenvalues. These depend on the training stage, indicated by epochs, see Appendix A. As can be seen in Figs. 3 and 4 (left), the RBM eigenvalues are correctly initialised for both choices and fluctuate around the correct values during training.
To indicate the size of the fluctuations, we do the following. In the Cholesky case, we consider separately the L_2 norm of the lower triangular elements, of the upper triangular elements (which are initialised at zero), and of the elements on the diagonal. We then standard normalise these to compare the amplitudes of the fluctuations, see Fig. 3 (right). We observe that each part fluctuates around its average value during training, whose size is set by the initial value, demonstrating the stability of the PCD updates. For the symmetric initialisation, we show the L_2 norms of the symmetric and antisymmetric parts, W_sym = (W + W^T)/2 and W_asym = (W − W^T)/2. Since the initial coupling matrix W is symmetric, we expect the norm of the antisymmetric part to remain significantly smaller during training. This can indeed be seen in Fig. 4 (right), where we show the evolution after standard normalisation. The norm of the symmetric part of the coupling matrix is six orders of magnitude larger than that of the antisymmetric part. As with the Cholesky initialisation, we observe that the overall structure of the coupling matrix is approximately preserved. Note there is no reason for it to be exactly preserved, as this is neither imposed nor required.

B. Initialisation with a random coupling matrix
In practical applications, the coupling matrix W is not initialised at an exact solution, but with random entries, drawn e.g. from a Gaussian distribution. In Fig. 5 we show the results obtained with the elements of W sampled from a normal distribution N(0, 0.1). The other parameters are as above, with in particular N_h = N_v and µ^2 > κ_max; hence there are no obstructions to learning the target distribution exactly. This can indeed be seen in Fig. 5, where both the eigenvalues (left) and the action density (right) are seen to match. For the latter, configurations are generated using the trained RBM; the same number of Monte Carlo (Metropolis) generated configurations is shown, using binning to remove autocorrelations. The analytical result follows from equipartition. We note that possible instabilities, due to λ_α turning negative either initially or during the learning stage, are not encountered with this initialisation. If they are encountered, they can be avoided by tuning the width of the initial coupling matrix and the learning rate.
Since the elements of W are initially relatively small, the corresponding singular values ξ_α are small as well and the RBM eigenvalues λ_α = µ^2 − σ_h^2 ξ_α^2 are initially close to µ^2. They quickly evolve to the target values κ_α. The order in which the modes are learnt (or thermalised) can be understood easily. Referring back to Sec. IV, we consider Eq. (43) for the singular values and Eq. (36) for the driving term. Assuming we are on the correct eigenbasis, the latter reduces to

    Λ_α = 1/κ_α − 1/λ_α,

where λ_α = µ^2 − σ_h^2 ξ_α^2. Equation (43) then becomes [13]

    ξ̇_α = σ_h^2 ξ_α (1/κ_α − 1/λ_α).    (81)

Note that this equation was encountered before (in a general basis) in Eqs. (53) and (54). During the initial stages, the term within the brackets is largest in magnitude for the smallest eigenvalues; hence the corresponding singular values evolve quickest. At late times, one may linearise around the fixed point. In Sec. IV we demonstrated for N_v = 2 nodes that the convergence in the linearised regime is exponentially fast and that the rate of convergence is set by γ_α = (µ^2 − κ_α)/κ_α^2. Hence the most infrared modes equilibrate fastest and the ultraviolet modes slowest. These aspects are demonstrated in Fig. 6, where we show the evolution of both the singular values (left) and the eigenvalues (right) during the initial stages of the training (the largest singular values correspond to the smallest eigenvalues). We note the similarity with the case of N_v = 2 modes in Sec. IV, see in particular Fig. 2 (top row).

So far we have kept the RBM mass parameter µ^2 fixed. However, it can also be treated as a learnable parameter using Eq. (22). This is particularly useful if details of the target spectrum are not known, as it provides an additional degree of freedom. In Fig. 7, the RBM mass parameter is initialised below κ_max. It subsequently increases to match the largest eigenvalue, see Fig. 7 (left). Since the system is overparameterised, one of the singular values remains at its initial value, see Fig. 7 (right). Note the different timescale for equilibration compared to the case with constant µ^2, as it takes time for µ^2 to find the correct value.
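The ordering of thermalisation can be reproduced by integrating a mode-by-mode flow of the form ξ̇_α = σ_h^2 ξ_α (1/κ_α − 1/λ_α); this is a sketch that assumes the diagonal (aligned) basis throughout and ignores the rotation of U, so it only illustrates the fixed-point structure σ_h^2 ξ_α^2 = µ^2 − κ_α:

```python
import numpy as np

Nv, m2, mu2, sigma2_h = 10, 4.0, 9.0, 1.0
dt, steps = 1e-2, 100_000

# 1D lattice spectrum, sorted from infrared (small kappa) to ultraviolet
kappa = np.sort(m2 + 4.0 * np.sin(np.pi * np.arange(Nv) / Nv) ** 2)

# Integrate the diagonal-basis flow from small initial singular values
xi = np.full(Nv, 0.05)
history = []
for _ in range(steps):
    lam = mu2 - sigma2_h * xi**2                       # RBM eigenvalues
    xi = xi + dt * sigma2_h * xi * (1.0 / kappa - 1.0 / lam)
    history.append(mu2 - sigma2_h * xi**2)

# All RBM eigenvalues eventually reach their targets ...
assert np.allclose(history[-1], kappa, atol=1e-3)

# ... but the most infrared mode gets there first: fraction of the
# distance mu^2 -> kappa covered at an early stage of the flow
progress = (mu2 - history[5000]) / (mu2 - kappa)
assert progress[0] > 0.9 and progress[-1] < 0.1
```

The final assertions make the two observations of the text concrete: the full spectrum is learnt at late times, while at early times the infrared mode has essentially converged before the ultraviolet mode has moved appreciably.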
Up to now, we considered a scalar field in one dimension only. The generalisation to higher dimensions is interesting, since the RBM does not know about the dimensionality a priori, with the N_v visible nodes only connecting to the hidden nodes. We consider here two dimensions, using an N_x × N_y lattice. The eigenvalues of the target kernel are

    κ_{k_x,k_y} = m^2 + 4 sin^2(πk_x/N_x) + 4 sin^2(πk_y/N_y),

with −N_{x,y}/2 < k_{x,y} ≤ N_{x,y}/2. In this case, there is a larger degeneracy of eigenvalues. The RBM has N_v = N_x × N_y visible nodes. The dimensionality has to be learnt and encoded in the coupling matrix W. The (target) kernel and two-point functions are (N_x × N_y) × (N_x × N_y)-dimensional tensors. This two-dimensional structure can be flattened in a one-dimensional representation, where the kinetic term is decomposed into a tensor product of two one-dimensional Laplacian operators,

    −∆^(2) = (−∆^(1)) ⊗ 1_{N_y} + 1_{N_x} ⊗ (−∆^(1)),    (83)

where in the last expression ⊗ is the Kronecker product and the sizes of the identity matrices are given explicitly. Figure 9 shows an example of a flattened scalar field kernel for the two-dimensional case with N_x = N_y = 4. Importantly, the spectra of the flattened kernel and the original kernel are identical, since the boundary conditions are encoded correctly. The tensor product decomposition (83) allows one to see this explicitly.
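The flattening via Kronecker products can be checked explicitly for the 4 × 4 lattice mentioned above: the spectrum of the flattened kernel agrees with the analytic two-dimensional dispersion relation.

```python
import numpy as np

Nx = Ny = 4
m2 = 4.0

def laplacian_1d(n):
    # Periodic 1D lattice Laplacian: (L phi)_i = phi_{i+1} - 2 phi_i + phi_{i-1}
    L = -2.0 * np.eye(n)
    for i in range(n):
        L[i, (i + 1) % n] += 1.0
        L[i, (i - 1) % n] += 1.0
    return L

# Flattened 2D kernel: K = m^2 1 - Lap_x (x) 1_Ny - 1_Nx (x) Lap_y
K2 = (m2 * np.eye(Nx * Ny)
      - np.kron(laplacian_1d(Nx), np.eye(Ny))
      - np.kron(np.eye(Nx), laplacian_1d(Ny)))

# Analytic spectrum: kappa = m^2 + 4 sin^2(pi kx / Nx) + 4 sin^2(pi ky / Ny)
kx, ky = np.meshgrid(np.arange(Nx), np.arange(Ny), indexing="ij")
kappa = (m2 + 4.0 * np.sin(np.pi * kx / Nx) ** 2
             + 4.0 * np.sin(np.pi * ky / Ny) ** 2).ravel()

assert np.allclose(np.sort(kappa), np.linalg.eigvalsh(K2))
```

For m^2 = 4 this reproduces κ_max = m^2 + 8 = 12, and the larger degeneracies of the two-dimensional spectrum are visible in the sorted eigenvalues.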
In Fig. 8 (left), we show the evolution of the RBM eigenvalues. The RBM mass parameter is µ^2 = 16 > κ_max = 12. There should be four degenerate eigenvalues at 6 and 10, and six degenerate ones at 8. Yet the eigenvalues appear to only lie within a band close to the expected values. This is due to the fact that these results were obtained with a fixed learning rate (time step), which prevents the system from reaching high precision. This can be remedied by introducing an epoch-dependent learning rate, explored in Appendix A: we multiply the learning rate by a factor close to one, r = 0.99, after a given number of epochs, N_epoch^rate = 128. The virtue of a diminished learning rate in the later stages is that it allows the model to be finely trained, with smaller statistical fluctuations. The result is shown in Fig. 8 (right), where we indeed observe precise agreement with the target spectrum.

C. Ultraviolet regularization by the RBM mass parameter
Up to now, we have considered the ideal "architecture," namely N_h = N_v and µ^2 > κ_max, for which Gaussian distributions can be learnt exactly, as we have demonstrated. In practice, one often chooses N_h < N_v and the maximum eigenvalue may not be known. Here we determine what this implies.
We start with the case where N_h = N_v, but with µ^2 fixed and less than κ_max. We refer to Eq. (81) for the evolution of the singular values in the eigenbasis. Take µ^2 < κ_α. In this case, the term inside the brackets is negative and the only solution is ξ_α = 0. The corresponding eigenvalue is then λ_α = µ^2. When µ^2 > κ_α, the solution is given by the fixed point, σ_h^2 ξ_α^2 = µ^2 − κ_α, with λ_α = µ^2 − σ_h^2 ξ_α^2 = κ_α. We hence conclude that the infrared part of the spectrum, with κ_α < µ^2, can be learnt, whereas the ultraviolet part, with κ_α > µ^2, cannot. Instead, the RBM eigenvalues take the value of the cutoff, µ^2 [14]. This is demonstrated in Fig. 10 for a one-dimensional scalar field theory with N_v = N_h = 10 nodes. As the condition for exact training is violated, the RBM can no longer faithfully reproduce the target data and distribution. The impact of this depends on the importance of the ultraviolet modes, as we will see below for the MNIST dataset.

D. Ultraviolet regularisation by the number of hidden nodes
Next, we consider the case with N_h < N_v. This is straightforward, as there are only N_h singular values, leading to the RBM eigenvalues

    λ_α = µ^2 − σ_h^2 ξ_α^2 (α = 1, ..., N_h),    λ_α = µ^2 (α = N_h + 1, ..., N_v),

see e.g. Eq. (33). Again we note that the infrared part of the spectrum can be reproduced, whereas the ultraviolet part is fixed at µ^2, irrespective of the actual values of the target eigenvalues. This is shown in Fig. 11 for the one-dimensional case with N_v = 10 and N_h = 9, 8, 7, 6. Note that all eigenvalues, except the minimal and maximal ones, are doubly degenerate. Hence in the cases N_h = 8 and 6, one of the degenerate eigenvalues remains and one is removed.
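The regularisation by the number of hidden nodes can be sketched end to end for the one-dimensional lattice theory. As before, the model moment is evaluated analytically rather than by PCD sampling (an assumption made for this illustration), with N_v = 10, N_h = 6, m^2 = 4 and µ^2 = 9 > κ_max:

```python
import numpy as np

rng = np.random.default_rng(4)
Nv, Nh = 10, 6
m2, mu2, sigma2_h, lr = 4.0, 9.0, 1.0, 0.05

# 1D lattice target kernel (periodic) and its correlation matrix
K_phi = (m2 + 2.0) * np.eye(Nv)
for i in range(Nv):
    K_phi[i, (i + 1) % Nv] -= 1.0
    K_phi[i, (i - 1) % Nv] -= 1.0
C_target = np.linalg.inv(K_phi)
kappa = np.linalg.eigvalsh(K_phi)          # ascending: infrared first

# Gradient ascent for a rectangular W (exact model moments for illustration)
W = 0.1 * rng.standard_normal((Nv, Nh))
for _ in range(60_000):
    C_model = np.linalg.inv(mu2 * np.eye(Nv) - sigma2_h * W @ W.T)
    W += lr * sigma2_h * (C_target - C_model) @ W

lam = np.linalg.eigvalsh(mu2 * np.eye(Nv) - sigma2_h * W @ W.T)

# The Nh infrared eigenvalues are learnt; the Nv - Nh ultraviolet
# eigenvalues remain frozen at the cutoff mu^2
expected = np.concatenate([kappa[:Nh], np.full(Nv - Nh, mu2)])
assert np.allclose(lam, expected, atol=1e-2)
```

The trained spectrum reproduces the N_h smallest target eigenvalues, while the remaining modes sit exactly at µ^2, which is the ultraviolet cutoff behaviour shown in Fig. 11.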
Finally, in Fig. 12 we give two examples in the two-dimensional scalar theory, using µ^2 = 9 < κ_max on the left and N_h = 8 < N_v = 16 on the right.

VI. MNIST DATASET
It is important to ask whether the considerations above are also relevant for realistic datasets commonly used in ML. We consider the MNIST dataset [15], consisting of 60,000 28 × 28 images of digits. Hence N_v = 784, substantially larger than what we have considered so far. Unlike in the case of scalar fields, the probability distribution function is not known. However, we may still obtain the correlation matrix ⟨ϕ_i ϕ_j⟩ by summing over the 60,000 realisations. The MNIST (target) kernel is then given by its inverse. The eigenvalues of the correlation matrix are the inverses of the eigenvalues of the kernels discussed so far and we hence denote them as 1/κ_α. The 784 eigenvalues are shown in Fig. 13; a large number of them are very close to zero. In the language of the previous sections, these correspond to the ultraviolet part of the spectrum of the quadratic kernel and hence the MNIST dataset can be said to be ultraviolet divergent. The values of the ten largest eigenvalues of the correlation matrix are listed explicitly on the right. These correspond to the infrared part of the spectrum of the quadratic kernel; since they are finite, the MNIST dataset is infrared safe. This terminology reflects the ordering of the eigenvalues κ_α encoding the quadratic correlations in the MNIST data, from small (infrared) to large (ultraviolet). As in the two-dimensional scalar case, the eigenvalues do not depend on the flattening of the indices.
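The construction of the target kernel from data can be sketched as follows. A small synthetic array stands in for the flattened MNIST images (which would give N_conf = 60000 and N_v = 784); the procedure is the same:

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic stand-in for flattened images: N_conf configurations of Nv "pixels"
N_conf, Nv = 5000, 16
phi = rng.standard_normal((N_conf, Nv)) @ rng.standard_normal((Nv, Nv))

# Correlation matrix <phi_i phi_j> from the data; the target kernel is its inverse
C = phi.T @ phi / N_conf
K_target = np.linalg.inv(C)

# Eigenvalues of C are 1/kappa: large values of 1/kappa (small kappa) are
# infrared modes, small values of 1/kappa (large kappa) are ultraviolet modes
inv_kappa = np.linalg.eigvalsh(C)[::-1]    # descending: infrared first
kappa = 1.0 / inv_kappa                    # ascending in kappa
assert np.all(np.diff(kappa) >= 0)
```

For real image data the spectrum of C typically has many near-zero eigenvalues, which is precisely what makes the corresponding κ_α diverge and the dataset "ultraviolet divergent" in the terminology used here.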
We now train the scalar field RBM on the MNIST dataset, starting with N_h = N_v and a fixed RBM mass parameter µ^2 = 100. The result is shown in Fig. 14 (left). As before, the horizontal dashed black lines are the target eigenvalues, obtained from the MNIST correlation matrix, and the blue lines are the RBM eigenvalues. The initial values are close to µ^2 and hence they become smaller during the learning stage. As above, the infrared part of the spectrum is learnt quickest. This is further illustrated in Fig. 14 (right), where the evolution during the final 60,000 epochs (out of one million) is shown. The smallest eigenvalues agree with the target values, but the larger ones have essentially stopped learning before reaching the correct value, due to the reduced learning rate. We note that the RBM mass parameter µ^2 = 100 regulates the number of modes here. In fact, there are 289 modes below the cutoff set by µ^2. Hence the number of hidden nodes, N_h = 784, can be reduced without a loss of quality. We come back to this below.
Without knowledge of the target spectrum, the (constant) value for μ² may be chosen to be on the small side; as is obvious in Fig. 14 (left), there are many eigenvalues above μ² = 100. This can be remedied by promoting μ² to a learnable parameter. This is demonstrated in Fig. 15, where μ² increases such that the target spectrum can be better learnt. The learning dynamics employs a diminishing learning rate, see Appendix A, which slows down the increase of μ² but also hinders the learning of the spectrum beyond the infrared modes. As stated, larger eigenvalues give smaller contributions to the update equations, leading to slower learning.
As in the scalar field case, we can regularise the MNIST data by choosing N_h < N_v. In this case, the number of modes that can be learnt depends on the number of modes in the spectrum with an eigenvalue less than μ², and on the number of hidden nodes. Figure 16 shows the quality of regenerated images after a forward and backward pass through the trained RBM. Using the fixed RBM mass parameter μ² = 100 limits the maximal number of modes to be included to N_modes^max = 289, the number of modes with an eigenvalue less than μ². We observe that at least around 64 hidden nodes are needed for an acceptable generation, which is considerably smaller than the maximal possible number. This illustrates that the ultraviolet part of the spectrum can be ignored.
To give a more quantitative measure of the quality of regeneration, we have computed the data-averaged KL divergence for the trained model,

\mathrm{KL}(p_{\rm data}\,\|\,p_{\rm model}) = -\frac{1}{N_{\rm data}} \sum_{d=1}^{N_{\rm data}} \log p_{\rm model}\big(\phi^{(d)}\big) + \mathrm{cst},
where the constant "cst" term is independent of the model distribution. The result is shown in Fig. 17. We indeed observe that the KL divergence between the target distribution and the model distribution starts to increase as more modes are excluded. Adding modes beyond the cutoff imposed by the choice of μ² does not increase the quality, as expected. As concluded "by eye" above, around 64-100 hidden nodes are required for a reasonable quality of regeneration.

VII. INTERACTIONS
Strictly speaking, the Gaussian-Gaussian RBM can only be exact if the target distribution is Gaussian as well. To go beyond this, one needs to introduce interactions. There are (at least) two ways of doing so. Motivated by the notion of QFT-ML [8], one may add local interactions via potentials defined on nodes, see Eqs. (5), (6). A simple choice would be to add a λϕ⁴ term to each node on the visible layer, since such systems are well understood and allow one to study e.g. spontaneous symmetry breaking in the context of the learning process. Of course, sampling the hidden layer then requires a more costly sampling method than in the Gaussian case.

FIG. 17. Data-averaged KL divergence for the trained RBM, as a function of the number of hidden nodes N_h (on a logarithmic scale) at fixed μ² = 100. For this value of μ², the maximal number of modes included is N_modes^max = 289 and hence increasing N_h above this value does not lead to additional improvement.
One may also change the nature of the hidden nodes from continuous to discrete, taking e.g. h_a = ±1. This leads to the Gaussian-Bernoulli RBM (see e.g. Ref. [13]), with the distribution and action

p(\phi, h) = \frac{1}{Z} e^{-S(\phi, h)}, \qquad S(\phi, h) = \frac{\mu^2}{2} \sum_i \phi_i^2 - \sum_{i,a} \phi_i W_{ia} h_a + \sum_a \eta_a h_a.

Summing over h_a = ±1 gives the following induced distribution on the visible layer,

p(\phi) = \frac{1}{Z} e^{-S_{\rm eff}(\phi)},

with the effective action

S_{\rm eff}(\phi) = \frac{\mu^2}{2} \sum_i \phi_i^2 - \sum_a \log\cosh\psi_a,

where ψ_a = Σ_i ϕ_i W_{ia} − η_a. A formal expansion in ψ_a then yields, up to a constant, the effective action on the visible layer,

S_{\rm eff}(\phi) = \frac{\mu^2}{2} \sum_i \phi_i^2 - \sum_a \sum_{n\geq 1} (-1)^{n+1} c_n \frac{\psi_a^{2n}}{(2n)!},

with easily determined coefficients c_n. This is a highly nonlocal action.
To make the connection with the previous sections, it is straightforward to see that the quadratic (n = 1, c_1 = 1) term yields the same structure as above, which, when combined with the RBM mass parameter μ² on the visible layer, gives the same kernel, K = μ² 𝟙 − W W^T, and source, J = W η. Quartic interactions are generated at the next order. Taking for simplicity η_a = 0, such that only even terms in ϕ_i are present, one finds the n = 2 term (c_2 = 2), with the coupling

\lambda_{ijkl} = \frac{1}{12} \sum_a W_{ia} W_{ja} W_{ka} W_{la}

multiplying ϕ_i ϕ_j ϕ_k ϕ_l in the effective action. This is indeed a quartic term, but with an a priori highly nonlocal coupling, generated by the all-to-all coupling to the hidden layer. From a QFT perspective, it would be of interest to study such a theory, which we postpone to the future.
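The coefficients quoted here, c_1 = 1 (quadratic) and c_2 = 2 (quartic), follow from the Taylor series log cosh ψ = ψ²/2 − ψ⁴/12 + .... They can be checked numerically by fitting a polynomial to log cosh on a small interval; the fit range and degree below are arbitrary choices:

```python
import math
import numpy as np

# Fit a degree-8 polynomial to log cosh(x) near x = 0 and read off
# the coefficients of x^2 and x^4. With the convention
# log cosh(psi) = sum_n (-1)^(n+1) c_n psi^(2n) / (2n)!,
# the fit should recover c_1 = 1 and c_2 = 2.
x = np.linspace(-0.4, 0.4, 401)
coeffs = np.polynomial.polynomial.polyfit(x, np.log(np.cosh(x)), 8)

c1 = coeffs[2] * math.factorial(2)      # x^2 coefficient times 2!
c2 = -coeffs[4] * math.factorial(4)     # x^4 coefficient times -4!
```

The signs confirm that the quartic term enters the effective action with a positive (stabilising) coupling, as used above.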

VIII. CONCLUSION
In this paper, we have studied scalar field restricted Boltzmann machines from the perspective of lattice field theory. The Gaussian-Gaussian case can be understood completely. We have demonstrated, using analytical work and numerical experiments, that the scalar field RBM is an ultraviolet regulator, regulating the ultraviolet part of the spectrum of the quadratic operator of the target theory. This is also the case when the target probability distribution is not known, such as for MNIST, where the spectrum can instead be extracted from the data-averaged correlation matrix. The cutoff is determined by the choice of the RBM mass parameter or the number of hidden nodes. This provides a clear answer to generally difficult questions on the choice of ML "architecture", namely what the consequences of choosing a particular setup are. At least in this simple case the answer is straightforward and concerns the (in)sensitivity of the generative power of the RBM to the ultraviolet modes, compared to the infrared modes.
We have also shown that infrared modes are learnt the quickest. This is of interest for models which suffer from critical slowing down, for which infrared modes are usually hard to handle. Indeed, many ML (inspired) generative approaches have surprisingly short autocorrelation times, which is worth exploring further.
As an outlook, we note that in the final section we have indicated two ways to go beyond the Gaussian-Gaussian case. The QFT-ML approach, in which local potentials are added to nodes on e.g. the visible layer, is convenient for LFT practitioners, since the resulting models are well understood. Replacing the continuous hidden degrees of freedom with binary ones (Gaussian-Bernoulli RBM) yields models of a very different character, involving highly nonlocal interaction terms to all orders. It would be of interest to understand these constructions further using QFT methods.
The data generated for this manuscript and the simulation code are available from Ref. [16].

The values of r and N_epoch^rate should be chosen with care. If the decay rate r is too large, the learning rate decreases too fast and the model freezes before it reaches the target. For example, Fig. 19 shows the training flow of the scalar field RBM with the trainable mass parameter and r = 0.99, N_epoch^rate = 128 (compare with Fig. 7). The model does not suffer while it is learning the infrared modes, which are learnt quickest, but it fails to learn the highest mode of the target kernel: the model parameter freezes out before it reaches the target. The learning rate decay parameters can be chosen by inspecting the regenerated samples and measuring their quality. Since the ultraviolet modes are less relevant than the infrared ones, a truncation of the higher modes can be accepted, provided a target quality is achieved. One can also employ an adaptive learning rate decay.
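The decay schedule is controlled by r and N_epoch^rate. One plausible form, sketched below, is exponential decay in which a larger decay rate r shrinks the learning rate faster, consistent with the discussion above; the exact functional form used in the code of Ref. [16] is an assumption here.

```python
import math

def decayed_learning_rate(eta0, r, n_rate_epoch, epoch):
    """Exponential learning-rate decay: eta0 * exp(-r * epoch / n_rate_epoch).

    A sketch of the schedule discussed above, not necessarily the exact
    form used in the paper's simulation code. A larger decay rate r makes
    the learning rate fall faster, so the model can freeze before the
    ultraviolet modes are learnt.
    """
    return eta0 * math.exp(-r * epoch / n_rate_epoch)
```

With r = 0.99 and N_epoch^rate = 128 (the values quoted above), the learning rate has dropped by a factor e^(-0.99) ≈ 0.37 after 128 epochs, which over a long training run suppresses the updates of the slowly-learnt ultraviolet modes.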
We have also looked at employing momentum optimisation and L 2 regularisation of the coupling matrix but have found no need for these.

Appendix B: Kullback-Leibler divergence
For completeness, we evaluate here the KL divergence in the case that both the target theory and the model are Gaussian, without linear terms. This allows us to compare it with the log-likelihood in the main text. We consider the KL divergence,

\mathrm{KL}(p\|q) = \int D\phi\, p(\phi) \log \frac{p(\phi)}{q(\phi, \theta^*)},   (B1)

with p(ϕ) the target distribution and q(ϕ, θ*) the trained distribution (hence the asterisk on θ). We assume the learning process has found the correct eigenbasis, such that the distributions factorise as

p(\phi) = \prod_i \sqrt{\frac{a_i}{2\pi}}\, e^{-a_i \phi_i^2/2}, \qquad q(\phi, \theta^*) = \prod_i \sqrt{\frac{b_i}{2\pi}}\, e^{-b_i \phi_i^2/2},

where all eigenvalues a_i, b_i > 0. To make the connection with the scalar theory with N_h < N_v in Sec. V, we note that i = 1, ..., N_v and that after training

b_i = a_i \quad (i = 1, \ldots, N_h), \qquad b_i = \mu^2 \quad (i = N_h + 1, \ldots, N_v).

It is then straightforward to evaluate the KL divergence.
In particular,

\log \frac{p(\phi)}{q(\phi, \theta^*)} = -\frac{1}{2} \sum_i \left[ \log\frac{b_i}{a_i} + (a_i - b_i)\phi_i^2 \right].

Putting everything together, one finds

\mathrm{KL}(p\|q) = \frac{1}{2} \sum_i \left[ \frac{b_i}{a_i} - 1 - \log\frac{b_i}{a_i} \right].

Each term is non-negative, and KL(p||q) ≥ 0, as it should be. The equality is achieved only when each eigenvalue is correctly determined. For the scalar theory in Sec. V, with a_i = κ_i, this becomes

\mathrm{KL}(p\|q) = \frac{1}{2} \sum_{i=N_h+1}^{N_v} \left[ \frac{\mu^2}{\kappa_i} - 1 - \log\frac{\mu^2}{\kappa_i} \right].
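The per-mode closed form KL = (1/2)[b/a − 1 − log(b/a)] for zero-mean Gaussians with precisions a and b can be checked against direct numerical integration of (B1). A minimal sketch for a single mode, with arbitrary illustrative values of a and b:

```python
import numpy as np

# One-dimensional Gaussians p ~ N(0, 1/a) and q ~ N(0, 1/b).
a, b = 2.0, 5.0
x = np.linspace(-10.0, 10.0, 100001)
p = np.sqrt(a / (2 * np.pi)) * np.exp(-a * x**2 / 2)
q = np.sqrt(b / (2 * np.pi)) * np.exp(-b * x**2 / 2)

# Trapezoidal integration of the KL integrand p * log(p/q).
f = p * np.log(p / q)
kl_numeric = np.sum((f[:-1] + f[1:]) / 2 * np.diff(x))

# Closed form derived above.
kl_closed = 0.5 * (b / a - 1 - np.log(b / a))
```

Both values agree to high precision, and the closed form vanishes only when b = a, i.e. when the eigenvalue is learnt exactly.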

FIG. 3. Cholesky initialisation. Left: evolution of the RBM eigenvalues λ_α during training. Note that adjacent eigenvalues are coloured alternately. Exact eigenvalues κ_α are shown with horizontal dashed lines and the RBM mass parameter μ² with the horizontal full line. After the Cholesky initialisation, the RBM eigenvalues fluctuate around the correct values. Right: the standardised L₂ norm of each part of the coupling matrix: diagonal (D), upper triangular (U), and lower triangular (L). Values are standardised, with x̄ (σ) the mean value (standard deviation) along the training interval. Each part fluctuates around its average value.

FIG. 4. Left: as in Fig. 3, for the symmetric initialisation. Right: standardised L₂ norm of the symmetric and asymmetric parts of the coupling matrix. The latter remains small during updates.

FIG. 6. Convergence of the singular values ξ_α (left) and the eigenvalues λ_α (right) for the system of Fig. 5. Infrared modes are learnt the quickest.

FIG. 8. Evolution of the eigenvalues in the two-dimensional case during training, with constant μ² = 16 and N_v = N_h = 16, using a fixed (left) and a diminishing (right) learning rate.

FIG. 10. Regularisation by the RBM mass parameter μ²: evolution of the eigenvalues in the one-dimensional scalar field theory. Only the infrared part of the spectrum is reproduced.

FIG. 11. As above, but with regularisation by the number of hidden nodes N_h.

FIG. 14. Left: evolution of the eigenvalues for the MNIST dataset with fixed RBM mass parameter μ² = 100 and N_v = N_h = 784. Right: evolution during the last few training epochs. The lowest eigenvalues have already matched their target values but the higher modes are still being trained, albeit at a very small learning rate.

FIG. 16. Quality of regenerated images with different numbers of hidden nodes. As the number of hidden nodes decreases, the regeneration quality decreases.
FIG. 13. Eigenvalues of the correlation matrix ⟨ϕ_i ϕ_j⟩ of the MNIST dataset. Note that many eigenvalues are close to zero. The values of the ten largest eigenvalues are indicated.