Quantum field-theoretic machine learning

We derive machine learning algorithms from discretized Euclidean field theories, making inference and learning possible within dynamics described by quantum field theory. Specifically, we demonstrate that the $\phi^{4}$ scalar field theory satisfies the Hammersley-Clifford theorem, therefore recasting it as a machine learning algorithm within the mathematically rigorous framework of Markov random fields. We illustrate the concepts by minimizing an asymmetric distance between the probability distribution of the $\phi^{4}$ theory and that of target distributions, by quantifying the overlap of statistical ensembles between probability distributions, and through reweighting to complex-valued actions with longer-range interactions. Neural network architectures are additionally derived from the $\phi^{4}$ theory, which can be viewed as generalizations of conventional neural networks, and applications are presented. We conclude by discussing how the proposal opens up a new research avenue, that of developing a mathematical and computational framework of machine learning within quantum field theory.


I. INTRODUCTION
Relativistic quantum fields [1] are formulated on Minkowski space, where intricate mathematical problems related to the hyperbolic geometry emerge. By recasting Minkowski space as Euclidean, significant simplifications can be obtained in certain cases: the hyperbolic problems become elliptic, the Poincaré group is replaced by the Euclidean group, a positive-definite scalar product emerges, noncommuting operators are expressed as random variables, and causality is formulated as a Markov property.
Of high importance is the reverse direction: that of arriving at a quantum field in Minkowski space by constructing it from one in Euclidean space. To make such prospects attainable a rigorous mathematical framework for quantum fields had to be established, and a series of relevant contributions led to advances known as constructive quantum field theory [2][3][4]. A connection between probability theory and quantum field theory was then established when quantum fields were constructed from Euclidean fields that satisfy Markov properties [5,6].
A notable class of machine learning algorithms is that of Markov random fields [39], which introduce Markov properties on a graph-based representation to encode probability distributions over high-dimensional spaces. As quantum field theory and probability theory are evidently connected analytically [6], and computational investigations of quantum fields are feasible through the framework of lattice field theory [40], a new challenge is anticipated to emerge: namely, that of investigating machine learning from the perspective of quantum fields.
In this manuscript, we derive machine learning algorithms from discretized Euclidean field theories, making inference and learning possible within dynamics described by quantum field theory. From the mathematical point of view, we explore whether the φ⁴ scalar field theory on a square lattice satisfies the Hammersley-Clifford theorem, thereby recasting it as a Markov random field which can complete machine learning tasks. From the equivalent perspective of physics, we treat the φ⁴ scalar field theory as a system with inhomogeneous coupling constants and we search, based on its dynamics, which comprise local interactions, for the optimal values of the coupling constants that are able to complete a machine learning task. Specifically, we consider the minimization of an asymmetric distance between the probability distribution of the φ⁴ theory and that of target distributions. We also quantify the overlap of statistical ensembles between probability distributions and investigate whether reweighting to the parameter space of complex-valued actions with longer-range interactions is possible by utilizing instead the probability distribution of the approximating local inhomogeneous action.
We then proceed to derive neural network architectures from the φ⁴ scalar field theory which can progressively extract features of increasing abstraction in data. We explore the implications of including a local symmetry-breaking term in the φ⁴ Markov random field, and we rearrange the lattice topology to derive a φ⁴ neural network which can be viewed as a generalization of conventional neural network architectures. Based on the equivalence between the φ⁴ scalar field theory and the Ising model in a certain limit, we discuss how the φ⁴ neural network can provide novel physical insights into the interpretability of a notable class of machine learning algorithms. Finally, we conclude by discussing how the introduction of φ⁴ machine learning algorithms opens up a new research avenue, that of developing, computationally and analytically, a framework of machine learning within quantum field theory.

II. THE φ⁴ SCALAR FIELD THEORY AS A MARKOV RANDOM FIELD
Let Λ be a finite set whose points represent the sites of a physical model, and let Λ carry additional structure; for instance, the spacing between the sites might be known and the sites might be connected. We now consider that the points of Λ lie on the vertices of a finite graph G = (Λ, e), where e is the set of edges of G. If i, j ∈ Λ and there exists an edge between i and j, then i and j are called neighbors, and the set of all neighbors of a considered point i will be denoted by N_i. A clique is a subset of Λ whose points are pairwise connected, and a clique is called maximal if no additional point can be included such that the resulting set is still a clique. We will denote a maximal clique by c and the set of all maximal cliques by C. For an illustration of the concepts see Fig. 1, and for rigorous results see Refs. [39,41].
In addition, we associate to each point i ∈ Λ a random variable φ_i and we call φ = {φ_i} a state or configuration of the system. Given a graph G = (Λ, e), the set of random variables defines a Markov random field if the associated probability distribution p fulfills the local Markov property with respect to G, namely that a variable φ_i is conditionally independent of all other variables given its neighbors N_i:

p(φ_i | φ_{Λ∖{i}}) = p(φ_i | φ_{N_i}).  (1)

A probability distribution is then related to the events generated by a Markov random field through the Hammersley-Clifford theorem [39]:

Theorem 1 (Hammersley-Clifford). A strictly positive distribution p satisfies the local Markov property of an undirected graph G if and only if p can be represented as a product of nonnegative potential functions ψ_c over G, one per maximal clique c ∈ C:

p(φ) = (1/Z) ∏_{c∈C} ψ_c(φ),  (2)

where Z = ∫ ∏_{c∈C} ψ_c(φ) dφ is the partition function and the integration runs over all possible states φ of the system.

We will demonstrate that the φ⁴ scalar field theory satisfies the Hammersley-Clifford theorem and is therefore a Markov random field. The two-dimensional φ⁴ theory is described by the Euclidean Lagrangian:

L = (1/2)(∂_μ φ)² + (1/2) µ² φ² + (1/4) λ φ⁴,  (3)

and the action that regularizes the continuum theory on a square lattice is:

S = −κ_L Σ_{⟨ij⟩} φ_i φ_j + ((µ²_L + 4κ_L)/2) Σ_i φ_i² + (λ_L/4) Σ_i φ_i⁴.  (4)

The quantities κ_L, µ²_L, λ_L are dimensionless parameters, one of which is redundant and can be absorbed by rescaling the fields [42]. Nevertheless, consider the set of variables w = κ_L, a = (µ²_L + 4κ_L)/2, b = λ_L/4 as inhomogeneous and the resulting action as:

S = −Σ_{⟨ij⟩} w_ij φ_i φ_j + Σ_i a_i φ_i² + Σ_i b_i φ_i⁴,  (5)

where the set of coupling constants is θ = {w_ij, a_i, b_i}, and the associated Boltzmann probability distribution is:

p(φ; θ) = exp[−S(φ; θ)] / Z.  (6)

The φ⁴ scalar field theory is formulated on a graph G = (Λ, e) where Λ is the set of lattice sites and e the set of edges or pairwise interactions. For a square lattice only nearest neighbors define a maximal clique (see Fig. 1).
Since we search for arbitrary, strictly positive potential functions ψ_c per maximal clique c ∈ C, we can multiply ψ_c with nonnegative functions of subsets of c [43], i.e. with functions of one-site cliques. After considering the imposed boundary conditions, under which each site of the square lattice belongs to four maximal cliques, we arrive at a nonunique choice of potential function:

ψ_c(φ_i, φ_j) = exp[ w_ij φ_i φ_j − (1/4)(a_i φ_i² + b_i φ_i⁴ + a_j φ_j² + b_j φ_j⁴) ],  (7)

where i, j are nearest neighbors. As the potential functions ψ_c are nonnegative the quantity ln ψ_c can be defined, and the probability distribution p(φ; θ) can be factorized as:

p(φ; θ) = (1/Z) ∏_{c∈C} ψ_c(φ) = (1/Z) exp[ Σ_{c∈C} ln ψ_c(φ) ] = exp[−S(φ; θ)]/Z.  (8)

To summarize, the discretized φ⁴ scalar field theory satisfies the Hammersley-Clifford theorem and the local Markov property and is therefore a Markov random field. To understand intuitively the meaning of the local Markov property, consider the more familiar case satisfied by a Markov chain, P(φ_{k+1}|φ_k, …, φ_0) = P(φ_{k+1}|φ_k). This property declares that given a certain state φ_k a future state φ_{k+1} depends only on the current state φ_k, and not on states that preceded it, such as φ_{k−1}. The local Markov property of Eq. 1 extends this concept to higher dimensions by giving it a spatial representation via a Markov random field. For the case of the φ⁴ scalar field theory the variational parameters θ are the coupling constants θ = {w_ij, a_i, b_i}. By considering that the probability p(φ; θ) of the Markov random field depends on the parameters θ, a variety of machine learning tasks can then be completed.
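As a concrete illustration of the inhomogeneous action and its Boltzmann distribution (Eqs. 5 and 6), the sketch below samples the φ⁴ Markov random field on a small periodic square lattice with a Metropolis update. The array layout (one edge-coupling array per lattice direction) and all numerical values are illustrative assumptions, not taken from the manuscript.

```python
import numpy as np

rng = np.random.default_rng(0)

def action(phi, w, a, b):
    # S = -sum_<ij> w_ij phi_i phi_j + sum_i a_i phi_i^2 + sum_i b_i phi_i^4
    # w[0][i,j] couples site (i,j) to (i+1,j); w[1][i,j] couples (i,j) to (i,j+1)
    nn = w[0] * phi * np.roll(phi, -1, axis=0) + w[1] * phi * np.roll(phi, -1, axis=1)
    return float(np.sum(-nn + a * phi**2 + b * phi**4))

def metropolis_sweep(phi, w, a, b, step=0.5):
    # One Metropolis sweep over the periodic lattice, targeting p ∝ exp(-S)
    L = phi.shape[0]
    for i in range(L):
        for j in range(L):
            old = phi[i, j]
            new = old + rng.normal(0.0, step)
            # sum of coupling-weighted neighbours of site (i,j); negative numpy
            # indices wrap around, implementing the periodic boundary conditions
            nb = (w[0][i - 1, j] * phi[i - 1, j] + w[0][i, j] * phi[(i + 1) % L, j]
                  + w[1][i, j - 1] * phi[i, j - 1] + w[1][i, j] * phi[i, (j + 1) % L])
            dS = (-(new - old) * nb + a[i, j] * (new**2 - old**2)
                  + b[i, j] * (new**4 - old**4))
            if dS <= 0.0 or rng.random() < np.exp(-dS):
                phi[i, j] = new
    return phi

# Small demonstration: homogeneous couplings recover the ordinary phi^4 theory
L = 4
phi = rng.normal(size=(L, L))
w = np.full((2, L, L), 0.25)   # w_ij: one value per lattice edge
a = np.full((L, L), 0.5)       # a_i
b = np.full((L, L), 0.25)      # b_i
for _ in range(20):
    phi = metropolis_sweep(phi, w, a, b)
```

Making the arrays w, a, b site- and edge-dependent, as above, is what distinguishes the inhomogeneous Markov random field from the homogeneous theory of Eq. 4.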

III. MACHINE LEARNING WITH THE φ⁴ SCALAR FIELD THEORY
A. Learning without predefined data

Consider a target probability distribution q(φ) of an arbitrary statistical system. An asymmetric measure of the distance between two probability distributions p(φ; θ) and q(φ) can be defined, called the Kullback-Leibler divergence [39]:

KL(p||q) = ∫ p(φ; θ) ln [ p(φ; θ) / q(φ) ] dφ ≥ 0.  (9)

The Kullback-Leibler divergence is nonnegative and equal to zero only when the two probability distributions exactly match. We emphasize that the Kullback-Leibler divergence is not symmetric and does not satisfy the triangle inequality; it therefore cannot be classified as a proper distance. The symmetrized quantity KL(p||q) + KL(q||p) removes the asymmetry, though it still fails the triangle inequality. The Kullback-Leibler divergence will nonetheless be called an asymmetric distance, to retain the intuitive picture that it establishes a measure of the difference between two probability distributions. By searching for an optimal set of coupling constants θ = {w_ij, a_i, b_i} we can minimize the Kullback-Leibler divergence so that the probability distribution of the φ⁴ scalar field theory p(φ; θ) converges to the target probability distribution q(φ). Once the minimization is completed, a Markov chain Monte Carlo simulation can be initiated for p(φ; θ) to draw samples representative of the target distribution q(φ).

Let us consider the case where the target probability distribution q(φ) is that of an arbitrary statistical system with partition function Z_A and Boltzmann form q(φ) = exp[−A]/Z_A. Any additional parameter, such as the inverse temperature, is absorbed within the Hamiltonian or lattice action A. By substituting q(φ) and p(φ; θ) in Eq. 9 we arrive at:

KL(p||q) = F − F_A + ⟨A − S⟩_{p(φ;θ)} ≥ 0,  (10)

where F_A = −ln Z_A and F = −ln Z are the free energies of the target system and of the φ⁴ Markov random field, respectively. The above equation can be equivalently expressed as:

F_A ≤ F + ⟨A − S⟩_{p(φ;θ)} ≡ F̃,  (11)

where F̃ is the variational free energy. As a result, Eq. 11 sets a rigorous upper bound on the free energy F_A of the target system, and this bound F̃ depends on calculations conducted entirely with the distribution p(φ; θ) of the φ⁴ Markov random field. This indicates that one can map an arbitrary system to a φ⁴ scalar field theory by minimizing an asymmetric distance between the probability distributions of the two systems. A gradient-based approach can then be implemented to minimize the variational free energy F̃ via its derivatives with respect to the parameters θ:

∂F̃/∂θ_i = ⟨A − S⟩ ⟨∂S/∂θ_i⟩ − ⟨(A − S) ∂S/∂θ_i⟩,  (12)

where all expectation values are calculated under the probability distribution p(φ; θ) of the φ⁴ scalar field theory. Derivations can be found in Appendix A. The variational parameters are then updated at each epoch t of the minimization process through:

θ^{(t+1)} = θ^{(t)} − η L,  (13)

where η is the learning rate and L = ∂F̃/∂θ^{(t)}. After the minimization process we anticipate that F̃ ≈ F_A and as a result p(φ; θ) ≈ q(φ).
To illustrate the approach we consider as a target system a φ⁴ lattice action A with longer-range interactions and complex-valued coupling constants, defined as:

A = Σ_{k=1}^{5} g_k A^{(k)},  (14)

with A^{(1)} = Σ_{⟨ij⟩_nn} φ_i φ_j, A^{(2)} = Σ_i φ_i², A^{(3)} = Σ_i φ_i⁴, A^{(4)} = Σ_{⟨ij⟩_nnn} φ_i φ_j, and A^{(5)} = i Σ_i φ_i². The subscripts nn and nnn denote nearest-neighbor and next-nearest-neighbor interactions; the lattice action is complex due to the g_5 A^{(5)} term, and the combination of the g_2 and g_5 parameters introduces a complex coupling constant in the mass term. The coupling constants have values g_1 = g_4 = −1, g_2 = 1.52425, g_3 = 0.175 and g_5 = 0.15. The values of g_1, g_2 and g_3 have been chosen near the critical point of the second-order phase transition of the system with a local homogeneous action, for which g_4 = g_5 = 0. We will present three applications for lattices of size L = 4 in each dimension. First, a proof-of-principle demonstration will be conducted to verify that the inhomogeneous action S (see Eq. 5) can learn the local lattice action A^{{3}} = Σ_{k=1}^{3} g_k A^{(k)}. Second, we will discuss that by considering the local lattice action A^{{3}} it is impossible to reweight to the full action A due to insufficient overlap of statistical ensembles, but that there exists an inhomogeneous representation of A^{{3}}, equal to S, for which this is possible. Finally, we will demonstrate that S can approximate A sufficiently to simultaneously extrapolate observables in the parameter space of the complex action A along the trajectory of a considered coupling constant, and we will discuss how to define the allowed reweighting range.

Fig. 2: Variational parameters θ = {w_ij, a_i, b_i} versus epochs t on a logarithmic scale. The figures depict the evolution of the parameters θ towards the expected values of the coupling constants in the target homogeneous action.
We now initialize the φ⁴ Markov random field with inhomogeneous coupling constants θ randomly drawn from a Gaussian distribution and consider as the target system in Eq. 10 the local lattice action A^{{3}}. We anticipate that the optimal solution is the one where the inhomogeneous coupling constants θ of the φ⁴ Markov random field converge to the homogeneous constants g_1, g_2 and g_3 of the target φ⁴ scalar field theory. The evolution of the parameters θ is depicted in Fig. 2, and details about the simulations and the training process can be found in Appendix B. After training is conducted, the parameters θ have converged to the homogeneous constants of the target system with a precision of order 10⁻⁸ in all cases. It then becomes clear that given sufficient training time the two systems become identical.
The overlap of statistical ensembles can be quantified through the Kullback-Leibler divergence. We consider the probability distribution p(φ; θ), described by the local inhomogeneous action S, and we minimize the Kullback-Leibler divergence to approximate the target distribution of action A^{{4}} = Σ_{k=1}^{4} g_k A^{(k)}, which is denoted as q(φ). In addition, we simultaneously estimate the Kullback-Leibler divergence between the distributions of A^{{3}} and A^{{4}} to quantify their overlap of statistical ensembles. The results are depicted in Fig. 3, where it is evident that the local inhomogeneous action S produces a probability distribution which approximates that of A^{{4}} far more accurately than the probability distribution of A^{{3}} does. This tentatively indicates that while S and A^{{3}} have the same form of lattice action, the inhomogeneity present in the former allows for the construction of richer representations of probability distributions. As a result, histogram reweighting [44] from local inhomogeneous actions to regions of parameter space that are inaccessible to the local homogeneous action might be possible.
We proceed to discuss the precise implications of the equivalence between the approximating distribution p(φ; θ) of action S and the target distribution q(φ) of action A^{{4}}. The expectation value ⟨O⟩_P of an arbitrary observable O in a system with equilibrium occupation probabilities P is defined as:

⟨O⟩_P = Σ_φ O(φ) P(φ),  (16)

where the sum is over all possible states φ of the system. After the Kullback-Leibler divergence between the distributions p(φ; θ) and q(φ) is minimized we anticipate that:

p(φ; θ) ≈ q(φ),  (17)

which immediately implies, based on Eq. 16, that:

⟨O⟩_{p(φ;θ)} ≈ ⟨O⟩_{q(φ)}.  (18)

To clarify further, observables such as the lattice action A^{{4}} should yield approximately equal values when calculated from samples drawn from either distribution p(φ; θ) or q(φ), even though the two distributions have different actions, S and A^{{4}} respectively. To express these ideas more formally, we now consider the expectation value of an arbitrary observable as obtained during a Monte Carlo simulation (e.g. see Refs. [20,21]):

⟨O⟩ = Σ_{l=1}^{N} O_l p̃_l⁻¹ P(φ_l) / Σ_{m=1}^{N} p̃_m⁻¹ P(φ_m),  (19)

where p̃ are the probabilities used to sample from the equilibrium distribution and N is the number of samples obtained during the Monte Carlo simulation. There are two fundamentally different ways to proceed in calculating the expectation value of the above equation by relying instead on the approximating probability distribution p(φ; θ). The first is to draw a subset of samples from p(φ; θ) and then conjecture, based on Eq. 17, that these N samples have been produced instead by the distribution q(φ). This would be equivalent to considering p̃ = q(φ) in Eq. 19, but a systematic error would be introduced that depends on the accuracy with which the probability distribution p(φ; θ) approximates q(φ). The second approach again relies on drawing a subset of samples from the distribution p(φ; θ), but this time we consider that p̃ = p(φ; θ) and that the samples have been produced directly from p(φ; θ) of Eq. 6 with action S.
This is equivalent to conducting a reweighting step to arrive at the distribution q(φ) from the distribution p(φ; θ), under a sufficient overlap of ensembles. We anticipate that this reweighting step is achievable due to the minimization of the Kullback-Leibler divergence between the two distributions p(φ; θ) and q(φ) and their resulting approximate equivalence.
We will follow the second approach and implement a reweighting technique, details of which can be found in Appendix C, to simultaneously extrapolate observables in the parameter space of the full action A, which includes complex couplings and longer-range interactions:

⟨O⟩_A = ⟨O exp[S − A]⟩_{p(φ;θ)} / ⟨exp[S − A]⟩_{p(φ;θ)},  (20)

where in A the considered coupling g_j is additionally shifted to a new value. The equation above can be interpreted as two distinct simultaneous reweighting steps. First, the probability distribution p(φ; θ) of the φ⁴ Markov random field with action S is reweighted to the distribution q(φ) with action A^{{4}} but with a shifted coupling constant g_j. This acts as a correction step to ensure that the proper distribution is reached from p(φ; θ), and it additionally allows an extrapolation along the direction of parameter space described by the coupling g_j. Second, there is a reweighting step to reach the distribution described by the complex lattice action A, which includes the imaginary part g_5 A^{(5)}. Any arbitrary observable can be reweighted in parameter space, such as machine-learning-derived observables [21], and Hamiltonian-agnostic reweighting [20] could additionally be explored.
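A one-variable sketch of the reweighting step described above, under simplifying assumptions: the approximating distribution is taken to be an exactly samplable Gaussian p(φ) ∝ exp(−aφ²), and the target is a complex action A(φ) = (a′ + ig)φ², so the reweighted ⟨φ²⟩ can be checked against the closed-form result 1/(2(a′ + ig)). All coupling values are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Samples from p(phi) ∝ exp(-a phi^2), standing in for the trained Markov random field
a = 1.0
phi = rng.normal(0.0, np.sqrt(1.0 / (2.0 * a)), size=200_000)

# Target complex action A(phi) = (ap + i*g) phi^2: shifted real coupling plus
# an imaginary part, analogous to the shifted g_j and the i g_5 term in the text
ap, g = 1.2, 0.3
logw = (a - ap - 1j * g) * phi**2        # S - A evaluated on each sample
w = np.exp(logw)
obs = np.sum(phi**2 * w) / np.sum(w)     # reweighted <phi^2> (complex-valued)

exact = 1.0 / (2.0 * (ap + 1j * g))      # closed form for the Gaussian target
```

The reweighted estimate agrees with the exact complex expectation value as long as the sampled ensemble overlaps sufficiently with the target, which is the condition examined next.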
We consider j = 4 and extrapolate observables along the trajectory of the g_4 coupling constant for a continuous range of values g_4 ∈ [−1.15, −0.85]. We recall that the φ⁴ Markov random field was trained to approximate the action A^{{4}} with g_4 = −1. Results for the magnetization and the internal energy, obtained through reweighting from the probability distribution p(φ; θ) to the full action A, are depicted in Figs. 4 and 5. The results are compared with Monte Carlo simulations conducted on action A^{{4}}, combined with reweighting to the full complex distribution to allow for a comparison with the results from p(φ; θ). The depicted results agree within statistical errors with the Monte Carlo extrapolations. Details about the statistical error analysis can be found in Appendix D.
When reweighting is implemented to extrapolate to the probability distribution of a complex action, or as a correction step in the case of an approximating distribution, the question emerges of how to strictly define the reweighting range. This can be achieved, formally, through the calculation of weight functions which are dependent on the underlying histograms. Specifically, we consider as an example in Eq. 20 the expectation value of the action S. Instead of expressing Eq. 20 as a sum over the action S_l calculated on each configuration φ_l, we reformulate it in terms of each uniquely sampled value of the action S in the Monte Carlo dataset after the construction of histograms. The expectation value is then:

⟨S⟩ = Σ_S S W(S),  (21)

where the sum is over uniquely sampled actions S and W(S) is a weight function equal to:

W(S) = Σ_{A^{(4)}, A^{(5)}} h(S, A^{(4)}, A^{(5)}) exp[S − A] / Σ_{S'} Σ_{A^{(4)}, A^{(5)}} h(S', A^{(4)}, A^{(5)}) exp[S' − A],  (22)

where h(S, A^{(4)}, A^{(5)}) is a multi-dimensional histogram of the inhomogeneous action S as well as of each action term towards which we are interested in extrapolating during reweighting. Reweighting can be achieved either by including novel terms in the action or by shifting the corresponding coupling constant if the term already exists. Of particular interest is also the quantity W(S) in which the exponentials are set equal to one, and which is then proportional to the actual histograms of the action in the corresponding Monte Carlo dataset. This quantity can additionally serve as an indication of the reweighting range.
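The construction above can be sketched in one dimension: a histogram h of sampled action values, multiplied by the reweighting exponential for a small coupling shift and normalized, yields a weight function whose support indicates how far an extrapolation may be trusted. The Gaussian stand-in for the sampled actions and the shift value are illustrative assumptions; the manuscript's histogram is multi-dimensional.

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in for action values sampled in a Monte Carlo run
S = rng.normal(100.0, 5.0, size=50_000)

counts, edges = np.histogram(S, bins=60)
centers = 0.5 * (edges[:-1] + edges[1:])

def weight_function(dg):
    # W(S) ∝ h(S) exp(dg * S), normalized; dg = 0 gives the raw histogram shape
    logw = np.log(np.maximum(counts, 1e-300)) + dg * centers  # log-space for stability
    logw -= logw.max()
    w = np.exp(logw)
    return w / w.sum()

W0 = weight_function(0.0)    # proportional to the sampled histogram itself
W1 = weight_function(0.05)   # small coupling shift, still inside the sampled support
```

When the shift pushes the bulk of W(S) into a region where the histogram has no entries, the extrapolation can no longer be trusted; this is the diagnostic applied to g_4 = −0.8 below.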
We proceed to calculate the weight functions W(S) for each uniquely sampled action S in a considered extrapolation range. The results are depicted in Fig. 6, where an overlap is observed between distinct weight functions that are adjacent in parameter space to the coupling constant g_4 = −1. We recall that reweighting extrapolations are accurate only when the method successfully predicts the form of the histograms at the extrapolated point in parameter space based on the histograms present in the initial dataset. When the coupling constant is g_4 = −0.8, major inconsistencies can be noticed. This indicates that reweighting extrapolations to g_4 = −0.8 would be inaccurate, as the form of the weight functions cannot be successfully predicted.
We emphasize that reweighting from the local homogeneous action A^{{3}} to the full action A is not possible: the inclusion of an imaginary term and a longer-range interaction does not produce a sufficient overlap of ensembles. Results are depicted in Fig. 7. We recall that the local homogeneous action A^{{3}} has coupling constant g_4 = 0, while the target distribution of action A^{{4}} includes a term with coupling constant g_4 = −1.0. It is clear that the values of the lattice action lie at an entirely different scale, and inconsistencies begin to emerge already at g_4 = −0.2. Reweighting to the full action is therefore impossible from the probability distribution of action A^{{3}}. However, the local inhomogeneous action S is able to achieve reweighting to the full distribution of the action A. Consequently, the opportunity to map improved lattice actions, which include longer-range interactions, to local inhomogeneous actions is a prospect that is open to explore. This can be achieved by minimizing the asymmetric distance between their associated probability distributions.

B. Learning with predefined data
The preceding results did not require any predefined data as input to the training process, since configurations were generated during the gradient-based approach itself. However, there exist cases where one has already obtained a set of available data, which could comprise configurations of a system, experimental data, or a set of images, and whose probability distribution is of unknown form. The obtained dataset then explicitly encodes an empirical probability distribution q(φ) that is a representation of the complete probability distribution of the system. The empirical distribution q(φ) can still be learned by minimizing instead the opposite divergence:

KL(q||p) = ∫ q(φ) ln [ q(φ) / p(φ; θ) ] dφ.  (23)

By expanding the above equation we arrive at:

KL(q||p) = ⟨ln q(φ)⟩_{q(φ)} − ⟨ln p(φ; θ)⟩_{q(φ)}.  (24)

The first right-hand term is constant, and the minimization of KL(q||p) is therefore equivalent to the maximization of the second right-hand term under the training data:

max_θ ⟨ln p(φ; θ)⟩_{q(φ)} ≈ max_θ (1/N) Σ_{l=1}^{N} ln p(φ_l; θ).  (25)

The variational parameters are now updated according to Eq. 13, where L = −∂ ln p(φ; θ^{(t)})/∂θ^{(t)}, averaged over the training data.
To illustrate the concepts we now create a dataset from a Gaussian distribution with µ = −0.5 and σ = 0.05 which encodes an empirical distribution q(φ). The information about the form of q(φ) will not be introduced in Eq. 23 because the training will instead be conducted on the obtained data. To clarify further, the same approach can be established for any obtained dataset, without the need to even infer the underlying form of the distribution. After successful training, Markov chain Monte Carlo simulations can be implemented based on the distribution p(φ; θ) of the φ 4 Markov random field to draw samples that would be representative of the unknown target distribution q(φ). Additional details can be found in Appendix A.
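The data-driven update can be sketched with a toy exponential-family model containing only the symmetry-breaking and mass terms, S(φ) = rφ + aφ², fitted to Gaussian data with µ = −0.5 and σ = 0.05 as in the text. The gradient of the average log-likelihood is the difference between model and data moments; here the model moments are available in closed form because the quartic coupling is switched off, whereas in general they would be estimated by Markov chain Monte Carlo. The learning rate and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

data = rng.normal(-0.5, 0.05, size=10_000)   # empirical distribution q(phi)
m1_d, m2_d = data.mean(), (data**2).mean()   # data moments <phi>, <phi^2>

# Model p(phi) ∝ exp(-r phi - a phi^2): a Gaussian with
# mean mu = -r/(2a) and variance v = 1/(2a)
r, a = 0.0, 1.0
eta = 5.0
for t in range(20_000):
    mu, v = -r / (2.0 * a), 1.0 / (2.0 * a)
    # gradient ascent on <ln p>_data: each parameter moves by the difference
    # between the model moment and the corresponding data moment
    r += eta * (mu - m1_d)
    a += eta * (v + mu**2 - m2_d)

mu, v = -r / (2.0 * a), 1.0 / (2.0 * a)      # fitted model mean and variance
```

At the maximum-likelihood solution the model moments match the data moments, so the fitted mean approaches −0.5 and the variance shrinks toward σ². The bias r plays exactly the role of the local symmetry-breaking term discussed next.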
We anticipate, due to the invariance of the lattice action S under the Z_2 symmetry, that the symmetric distribution with µ = 0.5 might additionally be reproduced.

Fig. 9: Original image and equilibration of the Markov random field after 1, 10 and 50 steps.
If this feature is not desirable, then a local symmetry-breaking term of the form Σ_i r_i φ_i can be included in the action S to favor configurations that explicitly reproduce q(φ). The Hammersley-Clifford theorem is still satisfied, and results for the symmetric action S and the action S_b, which includes a symmetry-breaking term, are depicted in Fig. 8. We observe for the symmetric case that while the algorithm has been trained on one of the probable solutions it is able to produce additional solutions that are invariant under the inherent symmetry, whereas this feature has been eliminated in the broken-symmetry case, where the probability distribution q(φ) is explicitly reproduced.
Markov random fields are widely applied to problems in computer vision, image segmentation and compression, as well as image analysis [45]. Any problem that can be formulated as the minimization of an energy or lattice action can be addressed by implementing Markov random fields. Since the φ⁴ scalar field theory satisfies the Hammersley-Clifford theorem and is therefore a Markov random field, it can be implemented to complete such tasks. We therefore consider as q(φ) in Eq. 24 the configuration of an image from the CIFAR-10 dataset [46], which we map to the action of the inhomogeneous φ⁴ theory of Eq. 5. In essence, we search for the optimal values of the coupling constants, which describe the local interactions in the φ⁴ scalar field theory, that reproduce the considered image as a configuration in the equilibrium distribution of the system. In Fig. 9, results are depicted after training the φ⁴ theory. We observe that by initializing a Markov chain the configurations of the equilibrium distribution converge to an accurate representation of the original image.

IV. φ⁴ NEURAL NETWORKS
When the aim of the machine learning task is to study intricate probability distributions, deep learning algorithms that include multiple layers in the neural network architecture can be implemented. These layers progressively transform data to arrive at increasingly abstract representations, allowing for increased expressivity and representational capacity in the model. Such deep learning algorithms can be constructed from the dynamics of the φ⁴ scalar field theory.
We consider that part of the random variables φ_i on the lattice sites are visible and correspond to a set of observations, and that the remaining are hidden variables h_j, which capture dependencies in a set of training data given as input to the φ_i. In addition, to make the connection with the computer science literature, we consider a bipartite graph which imposes the restriction that interactions occur exclusively between the φ and the h variables (see Fig. 1). We thereby recast the φ⁴ neural network as a variant of a restricted Boltzmann machine (RBM) [47][48][49][50], which is able to model continuous data. Alternative parametrizations of the graph structure are open to explore. A joint probability distribution p(φ, h; θ) = exp[−S(φ, h; θ)]/Z is then defined, based on a lattice action S(φ, h; θ):

S(φ, h; θ) = −Σ_{ij} w_ij φ_i h_j − Σ_i r_i φ_i + Σ_i a_i φ_i² + Σ_i b_i φ_i⁴ − Σ_j s_j h_j + Σ_j m_j h_j² + Σ_j n_j h_j⁴,  (26)

which also gives rise to a new expression, based on Eq. 23, for the derivative of the log-likelihood ln p(φ; θ):

∂ ln p(φ; θ)/∂θ = −⟨∂S/∂θ⟩_{p(h|φ;θ)} + ⟨∂S/∂θ⟩_{p(φ,h;θ)},  (27)

where the set of variational parameters is now θ = {w_ij, r_i, a_i, b_i, s_j, m_j, n_j}. The conditional distributions of the visible and the hidden variables factorize as p(φ|h; θ) = ∏_i p(φ_i|h) and p(h|φ; θ) = ∏_j p(h_j|φ). Derivations can be found in Appendix A.
By considering certain values of the parameters in the φ⁴ neural network of Eq. 26 one can arrive at other neural network architectures, all of which are special cases of a φ⁴ Markov random field. For instance, by choosing b_i = n_j = 0 one obtains a Gaussian-Gaussian RBM [47,49]. If b_i = n_j = m_j = 0 and h_j ∈ {−1, 1}, then the architecture is a Gaussian-Bernoulli RBM [47,49]. Of particular interest could be the choice m_j = n_j = 0 and h_j ∈ {−1, 1}, which reduces the architecture to a φ⁴-Bernoulli RBM, a case with a nonlinear sigmoid function that, to our knowledge, has not been studied before. We emphasize that the φ⁴-Bernoulli RBM is anticipated to have substantial representational capacity due to the presence of the nonlinear sigmoid function in the hidden layer [51].
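For the φ⁴-Bernoulli case singled out above (m_j = n_j = 0, h_j ∈ {−1, 1}), the hidden conditional reduces to a sigmoid, p(h_j = +1|φ) = σ(2(s_j + Σ_i w_ij φ_i)), while the visible conditional keeps the quartic term and is therefore sampled here with a few Metropolis steps. The block below is a small alternating-sampling sketch; the layer sizes, initial couplings and step sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(phi, W, s):
    # phi^4-Bernoulli case (m_j = n_j = 0, h_j in {-1, +1}):
    # p(h_j = +1 | phi) = sigmoid(2 (s_j + sum_i w_ij phi_i))
    x = s + phi @ W
    return np.where(rng.random(x.shape) < sigmoid(2.0 * x), 1.0, -1.0)

def sample_visible(h, W, r, a, b, n_mh=10, step=0.5):
    # p(phi_i | h) ∝ exp(phi_i (r_i + sum_j w_ij h_j) - a_i phi_i^2 - b_i phi_i^4):
    # non-Gaussian due to the quartic term, so take a few Metropolis steps per site
    drive = r + W @ h
    phi = np.zeros_like(r)
    for _ in range(n_mh):
        prop = phi + rng.normal(0.0, step, size=phi.shape)
        dS = (a * (prop**2 - phi**2) + b * (prop**4 - phi**4)
              - drive * (prop - phi))
        accept = rng.random(phi.shape) < np.exp(-np.maximum(dS, 0.0))
        phi = np.where(accept, prop, phi)
    return phi

nv, nh = 8, 3                                # visible and hidden layer sizes
W = rng.normal(0.0, 0.1, size=(nv, nh))      # couplings w_ij on the bipartite graph
r, s = np.zeros(nv), np.zeros(nh)            # biases of visible and hidden sites
a, b = np.full(nv, 0.5), np.full(nv, 0.1)    # quadratic and quartic couplings

phi = rng.normal(size=nv)
for _ in range(5):                           # alternating (block Gibbs) sampling
    h = sample_hidden(phi, W, s)
    phi = sample_visible(h, W, r, a, b)
```

Setting b to zero in this sketch recovers exactly Gaussian visible conditionals, i.e. the Gaussian-Bernoulli RBM limit mentioned above.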
The φ⁴ scalar field theory of Eq. 4, a model with continuous degrees of freedom, is known to reduce to an Ising model in the limit where κ_L is fixed, λ_L → ∞ and µ²_L → −∞ [42]. The φ⁴-Bernoulli RBM can then be interpreted as a φ⁴ neural network in which certain lattice sites have reached the Ising limit, allowing for novel physical insights. It is important to recall that, with the inclusion of two hidden layers, deep variants of restricted Boltzmann machines are universal approximators of probability distributions [52].
To demonstrate the applicability of the φ⁴ neural network of Eq. 26, we train it on the first forty examples of the Olivetti faces dataset [53], using 4096 visible units and 32 hidden units, to observe if meaningful features are learned. A subset of the learned features, i.e. the coupling constants w_ij for fixed j, is depicted in Fig. 10. We observe that the neural network has learned hidden features which comprise abstract face shapes and characteristics. The hidden units can then serve as input to a new φ⁴ neural network to progressively extract increasingly abstract features in the data [54].

V. CONCLUSIONS
In this manuscript we derived machine learning algorithms from discretized Euclidean field theories. Specifically, we demonstrated that the φ⁴ scalar field theory on a square lattice satisfies the Hammersley-Clifford theorem and is therefore a Markov random field that can be used for inference and learning. By recasting the φ⁴ theory within a mathematically rigorous framework, a variety of theorems, as well as training algorithms, become available; an overview can be found in Ref. [39]. As the resulting algorithm has inhomogeneous coupling constants, it can additionally be investigated from the perspective of spin glasses [33], and enhanced sampling can be obtained based on computational techniques from statistical mechanics [55,56] or model-specific algorithms [57,58].
The Kullback-Leibler divergence can be utilized to quantify the overlap of statistical ensembles between probability distributions. Specifically, we demonstrated that the φ⁴ scalar field theory with inhomogeneous coupling constants is able to absorb longer-range interactions, and observables can be reweighted to the parameter space of complex actions using the approximating probability distribution. The prospect of constructing improved lattice actions [59,60] based on local inhomogeneous representations is open to explore.
In principle any arbitrary system can be mapped to a φ⁴ scalar field theory with inhomogeneous coupling constants by minimizing an asymmetric distance between their probability distributions based on Eq. 10. The concepts are therefore anticipated to be generally applicable to systems within condensed matter physics, lattice field theories and statistical mechanics. To enhance the accuracy, a variant of a neural network architecture can be implemented which is proven to be a universal approximator of a probability distribution [52]. In this manuscript such variants have been presented as special cases of a φ⁴ neural network.
The resulting φ 4 machine learning algorithm of Sections II and III retains the topology of the lattice structure and the boundary conditions, but differs from the conventional φ 4 scalar field theory due to the inhomogeneous coupling constants. To employ the tools of quantum field theory a framework involving the replica method is required, but the theories can still be formulated in terms of the functional integral with an additional averaging over the space of couplings [61]. It is noted that in our formulation the couplings are inhomogeneous but not random as they are determined during the minimization process.
We emphasize that the prior arguments concerning the Hammersley-Clifford theorem hold for arbitrary dimensions, and one could therefore construct a d-dimensional Markov random field to initiate analytical or computational investigations. The factorization of a lattice action in terms of products of potential functions, a step that is required to recast a system as a Markov random field, depends on the topology of the graph structure, and different topologies yield different maximal cliques. An equivalence between local, pairwise and global Markov properties of a graph structure can also be rigorously proven [39]. Through the construction of quantum fields in Minkowski space from Markov fields in Euclidean space [6], a new research avenue is envisaged, namely that of developing a computational and mathematical framework of machine learning within quantum field theory.

ACKNOWLEDGMENTS
This work was supported by … Award WM170010 and by the Leverhulme Foundation Research Fellowship RF-2020-461\9. Numerical simulations have been performed on the Swansea SUNBIRD system. This system is part of the Supercomputing Wales project, which is part-funded by the European Regional Development Fund (ERDF) via Welsh Government. We thank COST Action CA15213 THOR for support.
Appendix A: Derivation of the Learning Algorithms

The Kullback-Leibler divergence, which is repeated here for convenience, defines an asymmetric measure of the distance between the distribution of the machine learning algorithm p(φ; θ) and an unknown target distribution q(φ):

KL(p || q) = Σ_φ p(φ; θ) ln [ p(φ; θ) / q(φ) ] ≥ 0.

By expanding the above equation we arrive at:

KL(p || q) = ⟨ln p(φ; θ)⟩_{p(φ;θ)} − ⟨ln q(φ)⟩_{p(φ;θ)},

where ⟨·⟩_{p(φ;θ)} denotes the expectation value under the probability distribution p(φ; θ). If the two probability distributions are substituted to be of Boltzmann form,

p(φ; θ) = e^{−S(φ;θ)} / Z_{p(φ;θ)},   q(φ) = e^{−A(φ)} / Z_{q(φ)},

the terms ln Z are constant in terms of expectation values and we therefore obtain:

−ln Z_{q(φ)} ≤ −ln Z_{p(φ;θ)} + ⟨A(φ) − S(φ; θ)⟩_{p(φ;θ)}.

By denoting the right-hand side as F, the derivative in terms of a variational parameter θ_i is equal to:

∂F/∂θ_i = ∂⟨A − S⟩_{p(φ;θ)}/∂θ_i − ∂ ln Z_{p(φ;θ)}/∂θ_i,

where each term is calculated as:

∂⟨A − S⟩_{p(φ;θ)}/∂θ_i = −⟨∂S/∂θ_i⟩ − ⟨(A − S) ∂S/∂θ_i⟩ + ⟨A − S⟩⟨∂S/∂θ_i⟩,
∂ ln Z_{p(φ;θ)}/∂θ_i = −⟨∂S/∂θ_i⟩.

By substituting we arrive at:

∂F/∂θ_i = ⟨A − S⟩⟨∂S/∂θ_i⟩ − ⟨(A − S) ∂S/∂θ_i⟩.

A gradient-based approach can be implemented based on the above equation to learn a known target probability distribution.
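The gradient above can be estimated entirely from Monte Carlo samples of p(φ; θ). The following minimal Python sketch illustrates this; the function names (`grad_F`, `dS_dtheta`) and the toy setting are illustrative assumptions, not the original implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_F(samples, S, A, dS_dtheta):
    """Estimate dF/dtheta_i = <A - S><dS/dtheta_i> - <(A - S) dS/dtheta_i>
    from Monte Carlo samples drawn from p(phi; theta).

    S, A map a configuration to its action value; dS_dtheta maps a
    configuration to the vector of derivatives dS/dtheta_i (hypothetical
    helper supplied by the user)."""
    s = np.array([S(phi) for phi in samples])
    a = np.array([A(phi) for phi in samples])
    ds = np.array([dS_dtheta(phi) for phi in samples])  # shape (n_samples, n_params)
    diff = a - s                                        # per-sample (A - S)
    return diff.mean() * ds.mean(axis=0) - (diff[:, None] * ds).mean(axis=0)
```

A gradient-descent step would then be θ ← θ − η · grad_F(...), with new samples drawn after every update; when A = S on all samples the estimated gradient vanishes, as expected at the minimum of the divergence.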
In the opposite direction, if a set of data is available for which the probability distribution is unknown, the alternative Kullback-Leibler divergence can be considered:

KL(q || p) = Σ_φ q(φ) ln [ q(φ) / p(φ; θ) ].

By expanding the right-hand side we arrive at the expression:

KL(q || p) = ⟨ln q(φ)⟩_{q(φ)} − ⟨ln p(φ; θ)⟩_{q(φ)}.

Minimizing the Kullback-Leibler divergence is equivalent to the maximization of the term ⟨ln p(φ; θ)⟩_{q(φ)}, which is:

⟨ln p(φ; θ)⟩_{q(φ)} ≈ (1/N) Σ_{x=1}^{N} ln p(φ^{(x)}; θ),

where φ^{(x)} is a training example and N the number of training data. For the case of the Markov random field the derivative of the log-likelihood is:

∂ ln p(φ^{(x)}; θ)/∂θ_i = −∂S(φ^{(x)}; θ)/∂θ_i + ⟨∂S/∂θ_i⟩_{p(φ;θ)},

where the term outside the expectation values is calculated on the training examples.
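The log-likelihood gradient splits into a data-dependent ("positive") term and a model-expectation ("negative") term. A minimal sketch, assuming the user supplies `dS_dtheta` and a set of model samples obtained by Monte Carlo:

```python
import numpy as np

def loglik_grad(data, model_samples, dS_dtheta):
    """Gradient of the average log-likelihood for a Boltzmann-form
    Markov random field:
        -<dS/dtheta>_data + <dS/dtheta>_model.
    The first term is evaluated on training examples, the second on
    samples drawn from p(phi; theta)."""
    pos = np.mean([dS_dtheta(x) for x in data], axis=0)
    neg = np.mean([dS_dtheta(phi) for phi in model_samples], axis=0)
    return -pos + neg
```

Gradient ascent on this quantity (θ ← θ + η · loglik_grad(...)) maximizes the log-likelihood, equivalently minimizing KL(q || p).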

φ⁴ Neural Network
The case of the quantum field-theoretic neural network is more complicated due to the joint probability distribution of the visible units φ and the hidden units h:

p(φ, h; θ) = e^{−S(φ,h;θ)} / Z.

From the joint probability distribution we can define the marginal probability distribution of the visible units via:

p(φ; θ) = ∫ dh p(φ, h; θ).

Similarly, p(h; θ) = ∫ dφ p(φ, h; θ). The gradient of the log-likelihood for the case of the quantum field-theoretic neural network is:

∂ ln p(φ^{(x)}; θ)/∂θ_i = −⟨∂S/∂θ_i⟩_{p(h|φ^{(x)})} + ⟨∂S/∂θ_i⟩_{p(φ,h;θ)}. (A24)
We approximate the last expectation value in the above equation for each parameter θ_i using contrastive divergence. Specifically, the visible units φ are set equal to a specific training example φ^{(x)}, and based on the conditional distribution p(h|φ^{(x)}) a set of hidden units h^{(x)} is sampled. The hidden units h^{(x)} are then utilized to sample a new set of visible units φ^{(x+1)}, and the approach is repeated for k steps:

⟨∂S/∂θ_i⟩_{p(φ,h;θ)} ≈ ∂S(φ^{(x+k)}, h^{(x+k)}; θ)/∂θ_i,

where for the considered cases we use k = 1.
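The contrastive-divergence procedure can be sketched as follows; the conditional samplers `sample_h` and `sample_phi` are assumed to be supplied by the user (e.g. via Monte Carlo on the conditional distributions), and the names are our own:

```python
import numpy as np

def cd_k_grad(phi_data, sample_h, sample_phi, dS_dtheta, k=1):
    """Contrastive-divergence (CD-k) estimate of the log-likelihood gradient.

    sample_h(phi)  draws h ~ p(h|phi);  sample_phi(h) draws phi ~ p(phi|h);
    dS_dtheta(phi, h) evaluates dS/dtheta_i on a configuration."""
    h = sample_h(phi_data)                # positive phase: h ~ p(h|phi_data)
    pos = dS_dtheta(phi_data, h)
    phi, hk = phi_data, h
    for _ in range(k):                    # k alternating sampling steps (k=1 here)
        phi = sample_phi(hk)
        hk = sample_h(phi)
    neg = dS_dtheta(phi, hk)              # negative phase on the chain sample
    return -pos + neg
```

With k = 1, as used in the manuscript, a single visible-hidden resampling step approximates the expectation under the joint distribution.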

Appendix B: Simulation details and Hyper-Parameters
The φ⁴ scalar field theory is a system with continuous degrees of freedom −∞ < φ < +∞. To sample the system we implement Markov chain Monte Carlo sampling with the Metropolis algorithm, where we consider one step as equivalent to updating a number of lattice sites equal to the volume of the system. The question of how to properly choose a new state additionally arises. When the training data have values which lie within a specific interval, the aim of the machine learning algorithm is to learn a probability distribution which reproduces them. The new state can then be chosen by sampling uniformly between the minimum and maximum value, therefore guaranteeing that every state is reachable under an arbitrarily large number of Monte Carlo steps. For the case of the hidden units in the φ⁴ neural network we impose the same restriction, even though the hidden units could, in principle, remain unconstrained, i.e. −∞ < h < +∞. We also emphasize that during the gradient-based training of the Markov random field we retain one Markov chain to conduct the necessary calculations.
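A minimal sketch of one such Metropolis sweep (one Monte Carlo step = one proposed update per lattice site), with uniform proposals in the data interval as described above; the function name and the global-action evaluation per site are simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def metropolis_sweep(phi, action, lo=-1.0, hi=1.0):
    """One Metropolis sweep over a lattice of continuous fields.
    Proposals are drawn uniformly in [lo, hi], so every state in the
    data interval remains reachable."""
    for i in range(phi.size):
        old = phi.flat[i]
        s_old = action(phi)
        phi.flat[i] = rng.uniform(lo, hi)     # uniform proposal for site i
        s_new = action(phi)
        # accept with probability min(1, exp(-(S_new - S_old)))
        if rng.random() >= np.exp(min(0.0, s_old - s_new)):
            phi.flat[i] = old                 # reject: restore the old value
    return phi
```

In practice one would evaluate only the local change of the action at site i rather than the full action, since the φ⁴ action couples only nearest neighbours; the global evaluation here keeps the sketch short.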
The learning rates that produced Figs. 2 and 3 are 10^{−3} and 10^{−2}, respectively. The sample size is chosen equal to 50 before updating the variational parameters θ. The image in Fig. 9 has size 32 × 32 and its continuous values lie in the interval [−1, 1]. The Markov random field was trained with learning rate 0.1 for 4 × 10^4 epochs. The parameters that produced Fig. 8 are a learning rate of 0.1, 400 epochs and a batch size of 4. For the results depicted in Fig. 10 the φ⁴ neural network has 4096 visible units, 32 hidden units, learning rate 0.1, a batch size of 5, and was trained for 10^4 epochs on the first 40 examples of the Olivetti faces dataset.

Appendix C: Histogram Reweighting
We consider the numerical estimator for an arbitrary observable O in the full complex action A which we aim to sample:

⟨O⟩ = Σ_{φ∈N} O_φ p̃_φ^{−1} e^{−A(φ; g_j′)} / Σ_{φ∈N} p̃_φ^{−1} e^{−A(φ; g_j′)}, (C1)

where N is the subset of Monte Carlo samples and p̃ are the probabilities used to sample from the equilibrium distribution. We have expressed the numerical estimator in a form that simultaneously allows extrapolation along the trajectory of a coupling constant g_j, from its simulated value to a target value g_j′. We will now substitute the probabilities p̃ for the probabilities of the inhomogeneous φ⁴ Markov random field,

p̃_φ = e^{−S(φ)} / Σ_{φ′} e^{−S(φ′)},

where the sum is over all possible states φ′ of the system, and, since the partition function cancels between numerator and denominator, we arrive at the reweighting equation:

⟨O⟩ = Σ_{φ∈N} O_φ e^{S(φ) − A(φ; g_j′)} / Σ_{φ∈N} e^{S(φ) − A(φ; g_j′)}.

Given a subset of samples drawn from the equilibrium distribution of the φ⁴ Markov random field, which is described by the action S, one can extrapolate observables to the full distribution of the action A, which includes longer-range interactions and complex-valued terms, along the trajectory of a coupling constant g_j.
To compare the reweighting extrapolations from the φ⁴ Markov random field to the full action, we additionally implement reweighting from the simulated action A^{4}. In this form of reweighting we consider again Eq. C1 and we substitute p̃ for:

p̃_φ = e^{−A^{4}(φ)} / Σ_{φ′} e^{−A^{4}(φ′)},

where we consider for this specific case that g_j′ = g_j, arriving at the equation:

⟨O⟩ = Σ_{φ∈N} O_φ e^{A^{4}(φ) − A(φ; g_j′)} / Σ_{φ∈N} e^{A^{4}(φ) − A(φ; g_j′)}.

One observable of interest is the magnetization, which is defined as:

m = (1/V) Σ_i φ_i,

where V = L × L is the volume of the system.
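Both reweighting equations share the same structure: a ratio of weighted sums in which the partition functions cancel. A minimal sketch, assuming per-sample values of the observable and of the two actions are already measured (the function name is our own):

```python
import numpy as np

def reweight(obs, S_vals, A_vals):
    """Reweight an observable measured on samples from exp(-S)/Z_S to the
    target distribution exp(-A)/Z_A:
        <O> = sum(O * exp(S - A)) / sum(exp(S - A)).
    The partition functions cancel in the ratio. A_vals may be
    complex-valued, in which case the ratio of complex sums is returned."""
    w = np.exp(np.asarray(S_vals) - np.asarray(A_vals))  # per-sample weights
    return np.sum(np.asarray(obs) * w) / np.sum(w)
```

When S equals A on every sample, all weights are 1 and the estimator reduces to the plain sample mean; the reliability of an extrapolation degrades as the overlap between the two distributions shrinks.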

Appendix D: Binning Analysis
Statistical errors are calculated with the binning method on the obtained Monte Carlo datasets. Each dataset with 10^4 minimally correlated configurations is split into n = 10 datasets where calculations of observables O are conducted. The standard deviation for an observable O is then obtained through:

σ_O = sqrt[ (1/(n(n−1))) Σ_{b=1}^{n} (O_b − ⟨O⟩)² ],

where O_b is the average of the observable within bin b and ⟨O⟩ is the mean over all bins.
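The binning analysis can be sketched in a few lines; the function name is our own:

```python
import numpy as np

def binning_error(data, n_bins=10):
    """Split a Monte Carlo time series into n_bins blocks, average the
    observable within each block, and return the standard error of the
    block means:  sqrt( sum((O_b - mean)^2) / (n (n-1)) )."""
    blocks = np.array_split(np.asarray(data), n_bins)
    means = np.array([b.mean() for b in blocks])
    return np.sqrt(np.sum((means - means.mean()) ** 2) / (n_bins * (n_bins - 1)))
```

For this estimate to be reliable the configurations within each bin should be minimally correlated, i.e. the bin length should exceed the autocorrelation time of the Markov chain.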