Generalization capabilities of translationally equivariant neural networks

The rising adoption of machine learning in high energy physics and lattice field theory necessitates the re-evaluation of common methods that are widely used in computer vision, which, when applied to problems in physics, can lead to significant drawbacks in terms of performance and generalizability. One particular example for this is the use of neural network architectures that do not reflect the underlying symmetries of the given physical problem. In this work, we focus on complex scalar field theory on a two-dimensional lattice and investigate the benefits of using group equivariant convolutional neural network architectures based on the translation group. For a meaningful comparison, we conduct a systematic search for equivariant and non-equivariant neural network architectures and apply them to various regression and classification tasks. We demonstrate that in most of these tasks our best equivariant architectures can perform and generalize significantly better than their non-equivariant counterparts, which applies not only to physical parameters beyond those represented in the training set, but also to different lattice sizes.


I. INTRODUCTION
Machine learning has become an increasingly popular tool for a diverse range of applications in physics. Particularly suitable for the analysis of spatially arranged data are Convolutional Neural Networks (CNNs). Modern CNN architectures are based on the idea that a network's prediction should not change when the input is shifted. They rely on two key ingredients that have already been introduced by the Neocognitron [1] over 40 years ago: convolutional layers (S-cells) and pooling (subsampling, downsampling) layers (C-cells). This incorporation of a translational symmetry was an essential advantage over its predecessor, the Cognitron [2]. However, equivariance under translations is not guaranteed in a generic CNN, even though it is the idea behind weight sharing in the convolutional layers.
In the past decade, the computer vision community has created many different machine learning algorithms and continues to refine them. During the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [3], which was a popular competition that was held annually from 2010 until 2017, the performance of CNNs steadily increased, and in 2012, AlexNet [4] was the first CNN to win the classification task. However, its first convolutional layer already breaks translational equivariance by using a stride of four, as do three max pooling layers with a stride of two that are part of the network. Additionally, the output of the last convolutional layer is flattened before it is passed to the dense layers of the network. LeNet-5 [5], a very early CNN, uses a stride of one in the convolutional layers, but the average pooling layers with a stride of two break translational symmetry.
An important step towards a translationally equivariant network architecture has been made by the introduction of global pooling layers. Global Average Pooling (GAP) was first introduced in [6], and the first winning network of the ILSVRC's classification task that makes use of it is ResNet [7] from 2015's competition.
The grand success of machine learning in many different tasks has also garnered attention within other research communities. Although many ingredients can be carried over from computer vision, differences in the tasks may require a different treatment. A lot of effort has been made to incorporate global [8][9][10][11][12][13][14][15][16][17][18] and gauge [19][20][21] symmetries in the network architecture, since they play a central role in modern physics, among other fields. Nevertheless, the most basic one, translational symmetry, is often not strictly enforced despite the fact that the task would allow for it. Oftentimes, the data are flattened somewhere in the network, as e.g. in [22][23][24][25][26][27], and sometimes a convolutional or pooling operation with a stride greater than one spoils symmetry under translations, even though a global pooling layer constitutes the transition from the convolutional part of the network to its dense part, e.g. in [28]. There are examples that make explicit use of this symmetry though, such as [29], and we want to raise awareness that one should take it into account when choosing a network architecture.
In this paper, we focus on translational symmetry in CNNs. In addition to providing theoretical reasons for choosing a translationally equivariant architecture, we conduct experiments with three types of architecture on three different machine learning tasks. We try to find well-performing architectures for each of these types by doing an extensive search with the optimization framework optuna [30]. We furthermore investigate the generalization capabilities of the three network types in two different ways. First, we examine how models generalize to different sets of physical parameters at a fixed lattice size, and, second, we inquire how well they generalize to other lattice sizes. The latter is not possible for networks including a flattening step, since they require a fixed in-put size. Therefore, these types of model would have to be retrained for each new lattice size. This paper is structured as follows: we first discuss translational symmetry in section II and show under which circumstances CNNs are indeed respecting equivariance under translations, as well as certain pitfalls that break the symmetry, in section III. The next three sections are devoted to three different machine learning tasks pertaining to a complex scalar field on a lattice: a regression task with the aim of predicting two observables of said scalar field (section IV), a classification task in which the algorithm should judge whether or not the flux of a given lattice configuration is conserved (section V) and another regression task, in which the network is supposed to figure out how many flux violations are present on the lattice (section VI). Section VII contains our conclusions and possible future research avenues. The appendices encompass information about the complex scalar field (appendix A), our datasets (appendix B), some supplemental proofs for section III (appendix C) and some additional analysis pertaining to the regression task in section IV (appendix D).
The code and data sets used in this work are published in a separate repository 1 .

II. TRANSLATIONAL SYMMETRY
In this section, we exemplify symmetry aspects on a complex scalar field and explain how they may impact the choice of machine learning models. The action of a complex scalar field φ in an external potential V in D dimensions can be written as with the Lagrangian density The latter is covariant under translations x µ → x µ = x µ + a µ , with a constant vector a µ . This can be seen by noting that the fields transform via φ (x µ ) = φ(x µ − a µ ) and that the partial derivative is not influenced by translations ∂ µ = ∂ µ . The action given by Eq. (1) is then invariant under translations. This has important implications for the resulting physical theory because finite continuous symmetries of the action lead to conserved quantities, according to Noether's first theorem [31]. The invariance under temporal translations entails energy conservation; the invariance under spatial translations leads to momentum conservation.
Another important symmetry of the action in Eq. (1) is a global U (1) symmetry, given by φ → e iα φ. It implies Figure 1. The three different architecture types used in this study. The checkmark () or cross () indicate spatial operations in which translational symmetry is respected or violated, respectively. Translational symmetry can be violated by convolutional or pooling layers with a stride greater than one or by a flattening layer. A global pooling layer allows for the application of the same network to different lattice sizes. Each of the layers can have a number of channels (not depicted) without affecting the translational symmetry properties.
the existence of a conserved four-current j µ and allows the definition of a chemical potential µ. The action can be modified to directly include the chemical potential via with D 0 = ∂ 0 − iµ.
In the following, we consider a complex scalar field in 1 + 1 dimensions in a quartic potential V (|φ|) = m 2 |φ| 2 + λ|φ| 4 (4) on the lattice with periodic boundary conditions. The parameters of the potential are the mass m and the coupling constant λ. A discretized version of the action in Eq. (3) retains its invariance under discrete translations.
We then switch to a dual representation, called flux representation. It describes the same physical content as the original representation, but the variables are four integer fields (link variables) k x,ν and l x,ν , with ν = 1, 2, instead of the complex scalar field. The corresponding partition function reads where the outer sum is a shorthand for The function W (f x ) is given by and the integer field f x is defined as [|k x,ν | + |k x−ν,ν | + 2(l x,ν + l x−ν,ν )]. (8) The dual formulation incorporates the same symmetry properties as the original formulation. A more detailed explanation of this procedure is given in appendix A and in [32]. To ensure the flux conservation demanded by the Kronecker δ symbol in Eq. (5), the worm algorithm [33] has been employed to update the link variables k x,ν . It is a local algorithm that updates contiguous field values on the lattice in successive steps. The resulting structures are known as worms. When the head of a worm meets its tail, a worm is closed; otherwise, it is open. Details about the generation of our datasets can be found in appendix B. This dual representation in two dimensions allows for a strong analogy with two-dimensional images. While every pixel of an image is described by one (grayscale) or three (color) numbers, every position of the dual lattice is described by four values. An important difference between them is their boundary conditions. Typically, image applications employ fixed boundary conditions, which break translational equivariance at the boundaries. In contrast, using periodic boundary conditions on the lattice, translational equivariance can be preserved.
The aforementioned four integer fields of the flux representation are used as the input for the upcoming machine learning tasks. In these tasks, the intensive or extensive nature of an observable are important for the choice of a global pooling layer because the networks should be able to generalize to other lattice sizes apart from the one they have been trained on. In a regression task, an intensive quantity requires a global average pooling layer, while an extensive quantity calls for a global sum pooling layer. In a classification task, it is not the physical observable itself that is predicted but a decision boundary, so the choice of global pooling layer is more subtle.
We are interested in the generalization capability to larger lattices because usually studies on the lattice are intended as an approximation of the real case of an infinite spacetime background. Therefore, a compromise has to be found between lattice size and computational effort such that the simulation produces results satisfactorily close to the physical ones in a reasonable amount of time. The approach adopted throughout this paper is to train models on small lattices and examine how well they generalize to larger lattices.
The three machine learning tasks that are described at the end of section I are tackled with three different types of CNN architecture, which are depicted in Fig. 1: • a translationally equivariant architecture (EQ) that uses only layers of stride one and a global pooling layer and is applicable to different lattice sizes, • a strided architecture (ST) that breaks translational equivariance due to spatial pooling layers with a stride greater than one but is still suited to give predictions on different lattice sizes due to a global pooling layer, and • a flattening architecture (FL) that represents a "traditional" architecture that breaks translational equivariance due to spatial pooling layers with a stride greater than one and a flattening layer, which restricts its usage to one particular lattice size.
After the global pooling or flattening step, a dense feedforward network, which is also known as multi-layer perceptron (MLP), can be attached without altering the translational equivariance properties of the network. A discussion about which network layers are translationally equivariant and which ones are not, as well as what exactly breaks said equivariance, can be found in section III. Dense neural networks can be seen as universal function approximators [34,35] and as such, they do not respect any particular symmetries. Such symmetries can be implemented in (or "hard-baked" into, as other authors [10] call it) the network architecture, though, so that the function that is learned is restricted to respect a certain symmetry by design, independent of any training. We expect that such a restriction, as incorporated by EQ architectures, is beneficial and that CNNs of that kind therefore outperform CNNs of the other two kinds.
Alternatively, symmetries can be learned, which can be encouraged by augmenting the training data according to the desired symmetry transformation. Thus, we expect data augmentation to improve the performance of networks of ST and FL architectures. Note, however, that when using data augmentation, it is not guaranteed that the network respects the symmetry even on the training set, let alone on the test set. In addition, data augmentation only establishes a relation between the input layer and the output layer; it does not require the hidden layers in between to respect the symmetry.
From a more theoretical standpoint, a CNN can be seen as a special case of an MLP, where the latter learns to set its weights so that the receptive fields are local (by setting the other weights to zero) and to "share" the appropriate weights (by setting them to the same value). The idea behind the CNN was, though, that this does not have to be learned but can be implemented in the architecture.
Note that if the observable to be studied violates the symmetry that is implemented in the network, it cannot be properly approximated by design. Therefore, it is important to understand the symmetry properties of the task before selecting a particular architecture.

III. SYMMETRY PROPERTIES OF MACHINE LEARNING LAYERS
If a network layer's output is invariant under a symmetry transformation of the input, the outputs of all subsequent layers are invariant under the symmetry transformation of the former layer's input as well. However, invariance of the network's prediction does not require every individual layer to be invariant under this symmetry. The more general concept of equivariance not only is a sufficient requirement, but also allows for more expressive networks. Group Equivariant Convolutional Neural Networks (G-CNNs) [8] exploit symmetry transformations that are described by a group G. Conventional CNNs can be seen as a special case of G-CNNs, with the translation group T as their symmetry group, i.e. G = T.
The following discussion about equivariance is based on [8]. The condition for equivariance of a network layer Φ under a group transformation L g by the element g ∈ G is given by Note that L g = L g , in general, and that invariance under L g is the special case L g = 1.

A. Convolutional layers
On a two-dimensional rectangular lattice, a convolution 2 is defined as (10) where the feature map f and the kernel (or filter) ψ are real-valued functions The kernel ψ is assumed to have finite support Ψ ⊂ Z 2 , i.e. there is only a finite number of points on Z 2 where ψ is non-zero. In principle, this allows us to restrict the sum on the RHS of Eq. (10) to Ψ, but for simplicity we keep the sum running over all of Z 2 . For our purposes, the feature map f is understood to be defined on a finite, rectangular proper subset F ⊂ Z 2 with periodic boundary conditions. We can avoid explicitly dealing with periodic boundaries by assuming that the feature map periodically repeats outside F . In machine learning frameworks such as PyTorch [36], periodic boundary conditions are enforced through the use of circular padding. As a result of periodicity, the output of the convolution Eq. (10) has the same size as the feature map f . We define a translation of the feature map via where t ∈ T is an element of the translation group, which can be identified with an element of Z 2 . The convolution is equivariant under translations due to Equation (10) assumes the convolution to have a stride s of one, i.e. the number of points that the kernel is shifted when the convolution is performed is one. More generally, convolutions with strides s ≥ 1 can be written as 2 The cross-correlation in signal processing is often referred to as convolution in the machine learning community. For this paper we adopt this nomenclature. Additionally, we disregard a possible bias term b to be added to Eq. (10) without loss of generality.
This definition reduces to the original convolution if s = 1. For s ≥ 2, the output size of the convolution is smaller than the input size of the feature map f . Strided convolutions with s ≥ 2 generally break translational equivariance. This can be demonstrated by considering a translation t ∈ T with |t| < s. For example, we can choose t = (1, 0). Performing this translation on the input feature map f yields In order for the above expression to be equivariant, we would need to be able to rewrite it in terms of a shifted position x = x − t/s ∈ Z 2 . However, this is not possible because t = (1, 0) is not divisible by s ≥ 2. On the other hand, the strided convolutions are equivariant if we consider only the subgroup T s ⊂ T consisting of translations by multiples of s lattice points. In that case, any element t ∈ T s is divisible by s and therefore where t = t/s ∈ T. This means that a convolutional layer with a given stride is equivariant only under translations that are a multiple of that stride. Equivariance under all possible translations is given only for s = 1. The generalization to more than one feature map, i.e. multiple channels, is straightforward. Note that a convolution with s ≥ 2 is equivalent to a convolution with s = 1 combined with a subsequent subsampling step.

B. Spatial pooling layers
Spatial pooling layers are usually used to subsample, i.e. s ≥ 2 in pooling layers. For this discussion, let us split this layer up into a pooling step and a subsampling step. Since average pooling is equivalent to a special case of a convolution, where all weights of ψ are identical and given by 1/|Ψ|, with |Ψ| denoting the cardinality of Ψ, the average pooling step is equivariant under translations. The subsequent subsampling, however, breaks this equivariance, which again leads to equivariance only under translations that are a multiple of the spatial average pooling layer's stride.
This holds not only for average pooling though, but for spatial pooling in general: we take again the pooling step by itself, or equivalently, with s = 1. It acts on the feature map f by performing the same operation on subsets U x of F P f (x) = P y∈Ux f (y). (18) These subsets correspond to the kernel of the pooling operation. Its dependence on x depicts the "sliding" of the kernel over the feature map. A spatial pooling step respects Eq. (9), as can be seen by Thus, also in a spatial pooling layer it is the stride that restricts the equivariance of the layer to translations by multiples of said stride. We want to stress that spatial pooling layers with s = 1 respect translational equivariance and can therefore be included if one desires an architecture that incorporates such a symmetry, albeit in a different role than usual because it does not subsample.

C. Global pooling
If we wanted to use a traditional CNN architecture on a two-dimensional lattice with periodic boundary conditions, we would have another problem as well: the last convolutional or pooling layer is often flattened and densely connected to the linear layers at the end of the network. Since different positions in one feature map are connected to different weights without a "sliding" kernel, this is another point where translational equivariance is broken. A possible solution to this problem is a global pooling layer between the last convolution and the first dense layer. The GAP layer was first introduced in [6]. There, the authors proposed to create one feature map for each class and to feed the average of each feature map directly to a softmax layer. This approach would respect translational symmetry, although, in general, dense layers could be used between the global pooling and the softmax operation.

D. Equivariant architectures
On the aforementioned two-dimensional lattice with periodic boundary conditions and for similar problems we propose the following network architecture for classification and regression tasks: the input is fed to a convolutional layer with a stride of one and circular padding so that the output of the convolution has the same size as its input. The kernel size can be odd or even. Translational equivariance is retained by applying consecutive convolutional layers, all with s = 1, with non-linear activation functions in between. Activation functions do not influence the symmetries of an individual layer, since they are applied pointwise. If information from different scales is required, dilated convolutions [37] can be used with a stride of one. Since dilated convolutions are equivalent to convolutions with a larger kernel and the appropriate weights set to zero, they are also equivariant under translations if their stride is one. Spatial pooling layers for subsampling, which use s > 1, break translational equivariance, but it is still possible to use them with s = 1 between convolutional layers. A way of subsampling that respects translational equivariance is rendered possible by coset pooling [8]. However, since this is a nonlocal operation, we do not expect it to be suitable for the machine learning tasks discussed in this paper, which focus on local quantities and predictions. In the special case of translationally invariant functions, we suggest to utilize a global pooling layer after the last convolution. The output of the global pooling layer is translationally invariant, and therefore the rest of the network can be a general MLP without breaking the symmetry.
There is still one important point to be made: every layer before the GAP respecting translational equivariance is sufficient to guarantee invariance under translations after the GAP, but it is not necessary. If a spatial average pooling layer that breaks translational equivariance and a subsequent convolutional layer are inserted just before the GAP, the output of the GAP can still be invariant under translations, depending on their strides (theorem 1 in appendix C). If there is an activation function after the convolutional layer, as is usually the case, the GAP's output is, in general, no longer invariant under translations. The activation function is also necessary for the convolution not to lead to a single multiplicative and additive factor of the GAP, as is shown in lemma 2 in appendix C. We thus stick to the aforementioned sufficient conditions for translational equivariance and apply an activation function after the convolutional layer right before the GAP.

IV. REGRESSION: PREDICTING OBSERVABLES ON THE LATTICE
This section revisits a regression task that has previously been performed in [22]: given a lattice configuration as input, the network shall predict two physical observables, namely the particle density n and the lattice averaged squared absolute value of the field |φ| 2 . The former is given by where the summation of one of the four integer fields k x,2 runs over all N x N t lattice sites. The latter is given by which contains the highly non-linear function W (f x ). It is given in Eq. (7) and depends on all four integer fields.
The function W (f x ) also depends on the physical parameters λ, η and µ, which are set to the same values as in [22]. Concretely, the values of the coupling constant λ and the mass m will be kept fixed in this task (λ = 1, η = 4 + m 2 = 4.01), and the chemical potential µ lies in the interval µ ∈ [0.91, 1.05], with steps of ∆µ = 0.005.
In [22], the networks have been trained on lattice configurations and observables that have been generated with two values of µ, specifically the outermost values µ = {0.91, 1.05}, but tested on data that have been created on the whole given interval of the chemical potential. This allows for an analysis of the architectures' generalization capability to lattice configurations that correspond to chemical potentials that are not represented in the training set. We will follow this procedure, with the exception that we will use only a single µ to generate training data, namely the uppermost one µ = 1.05. Since we would test on only smaller values of the chemical potential than the one that has been used for training in our approach, we deviate from [22] in that we create additional test data that contain higher values of the chemical potential. They lie in the interval µ ∈ [1.1, 1.5], with steps of ∆µ = 0.1. This renders possible an analysis of the architectures' generalization capability to lattice configurations that correspond to values of µ that are greater than the one used to create the training set. We will come back to these test data only at the end of this section. In addition to the generalization ability to different values of the chemical potential, we will investigate the generalization ability to lattice sizes that the models have not been trained on. This highlights a key advantage of architectures that employ a global pooling layer between their convolutional and their dense layers over architectures that simply flatten the data, because the latter are restricted to a given input size.

A. Architecture choice
The datasets stem from a physical system, whose properties should be taken into account when choosing a network architecture for a model that should learn from said dataset. Let us assume for the following discussion that we have no knowledge of the exact form of Eqs. (20) and (21).
First, the observables are invariant under arbitrary translations of the lattice configuration. This leads to the restriction of preferred architectures that has been proposed at the end of section III: the input is passed to a convolutional layer with a stride s = 1 and circular padding that causes its output to have the same size as its input. Such layers are used consecutively, with non-linear activation functions in between. Optionally, spatial pooling layers with s = 1 can be inserted. The output of such convolutional and pooling layers is equivariant under translations of the input. The output of the last convolution is fed to a global pooling layer, which makes it invariant under translations of the input. Then, the data are passed through an MLP with two output nodes, one for each observable.
Second, the observables are derivatives of the logarithm of the partition function on the lattice. The partition function is a product over quantities at each lattice site. The observables can therefore be written as a sum over the lattice. Consequently, we want to use a global pooling layer that respects this fact, which excludes global max pooling. Since the observables are intensive quantities and the network shall be able to generalize to different lattice sizes, global average pooling is the natural choice. The MLP at the end does not modify the intensive nature of the prediction.
To check how the above theoretical considerations perform in practice, we want to compare the three types of architecture that are depicted in Fig. 1 from section II. A fair comparison among these network architecture types is quite difficult. One could take a translationally equivariant architecture and break equivariance by inserting at least one spatial pooling layer with s > 1. This would keep the number of parameters the same. However, having found a decent EQ architecture, it is not guaranteed that the corresponding ST architecture is a good one compared to other ST architectures and vice versa. Also, keeping the weights constant may not lead to a fair comparison with FL architectures.
Therefore, we define a space of possible architectures for each of the three types separately, which are illustrated in tables I to III, and use an optimization procedure to find an adequate representative for each architecture type individually. Table I depicts the search spaces of EQ, table II of ST  and table III of FL architectures. The possible parameter values of the first run are inspired by manual trials, which also included different activation functions (ReLU, tanh, PReLU and LeakyReLU ). Its results lead to modifications of the parameter space of the second run and the choice of LeakyReLU for a suitable activation function. Both of them try 50 different combinations of parameters in their optimization procedure on each training set, which will be specified in the next subsection. The extended search explores an enlarged parameter space with 100 trials, also with unique combinations of parameter values, in order to check if a better architecture was missed during the first two runs due to the choice of a too small search space. This search involves only the largest training set. After every convolutional layer and after every linear layer but the last one, a LeakyReLU activation function [38] is applied. Its advantage over the ReLU activation function is the avoidance of so-called dead or dying neurons, which never activate initially or become Table I. Search spaces for EQ architectures. It lists the possible number of convolutional (conv, s = 1) and linear layers (lin), kernel sizes, the number of channels of the convolutional layers and the number of nodes in the linear layers. Spatial pooling layers with s = 1 seem to worsen the predictions and have therefore not been included in these search spaces. inactive during the training process.
ST architectures can be thought of as EQ architectures with at least one spatial pooling layer with s = 2 in the convolutional part of the network. This is either an average pooling or a max pooling layer, both with a 2 × 2 kernel. A spatial pooling layer is neither directly applied to the input, nor inserted just before the global pooling layer. The position of the spatial pooling layer(s) is part of the search space, but it is restricted by the choice of the number of convolutional layers, as is the number of spatial pooling layers. If, e.g., two convolutional layers are chosen, there can only be one spatial pooling layer at only one specific position, that is between the convolutional layers.
FL architectures are inspired by how we think one would construct a CNN traditionally for this machine learning problem. At its core are two 2 × 2 convolutions, followed by a spatial pooling layer with a 2 × 2 kernel and a stride of 2. Optionally, there can be a 1 × 1 convolution before each of the 2 × 2 convolutions and between each of them and the respective following spatial pooling layer, leading to a possible total count of six convolutional layers.
Our optimization procedure of choice has been optuna. The performance metric is the validation loss averaged over three different parameter initializations. This averaging process is applied to counteract the statistical fluctuations introduced by the random initializations of the trainable network parameters. It is important because optuna changes its search space dynamically, so early search results influence the probability distributions that serve as the basis to select later parameter values. This optimization process is done for different sized training sets individually, since on smaller training sets different architectures might perform better than on larger ones.
After the optimization procedure by optuna, models of the best architectures are retrained ten times from scratch and evaluated on the validation set to verify their performance while further minimizing statistical fluctuations due to the random parameter initializations. Our results show that the same architectures that perform well on small training sets also perform well on larger training sets and that many architectures perform similarly. Thus, we select the best-performing architecture of each type according to the mean validation loss as a representative and compare only them. These best- Table II. Search spaces for ST architectures. It shows the possible number of convolutional (s = 1) and linear layers (abbreviated as in table I), kernel sizes, the number of channels of the convolutional layers, the number of nodes in the linear layers, the number of spatial pooling layers (SPL, s = 2) and the spatial pooling mode (SPM).  Table III. Search spaces for FL architectures. It shows the possible number of convolutional (s = 1) and linear layers, kernel sizes, the number of channels of the convolutional layers, the number of nodes in the linear layers, the number of spatial pooling layers (s = 2) and the spatial pooling mode. The number of convolutional layers is not chosen directly but follows from the number of 1 × 1 convolutions that are selected. The asterisk next to "kernel size" signifies that the kernel size of the convolution depends on its position. Two 2 × 2 convolutions with a respective subsequent spatial pooling layer are mandatory. Additional 1 × 1 convolutions are possible, namely before each of the 2 × 2 convolutions and between each of them and their corresponding subsequent spatial pooling layer. The abbreviations are the same as in is the number of input (output) channels. Before every convolutional operation we use circular padding to enforce periodic boundary conditions. Additionally, we use a stride of one for each convolution. Average pooling layers with kernel size K and stride s are written as AvgPool(K × K, s). Dense layers are denoted by Linear(N in , N out ) with N in (N out ) input (output) nodes.

B. Training and testing
The training is performed for every model of each of these three architectures and for each training set analogously: mean squared error (MSE) is selected as a loss function; the total loss is the arithmetic mean of the individual losses, each of which corresponds to one physical observable. It is optimized with the AMSGrad [39] variant of the AdamW optimizer [40] with a vanishing weight decay. Training models on different sized training sets gives us information about the sample efficiency. Limiting the size of training sets is motivated by machine learning tasks where the generation of training samples is costly, for example in medical applications or in large-scale simulations on supercomputers. The number of training samples in a training set ranges from 100 to 20000, with steps ∆ = 50 from 100 to 250, ∆ = 250 from 250 to 1000, ∆ = 500 from 1000 to 3000 and ∆ = 1000 up to 20000 training samples. The corresponding validation sets contain 10% of the amount of the training set's data. The batch size during training was chosen to be 100 for training sets with at least 500 training samples and 50 otherwise. The reason behind this choice is that the algorithm shall be trained with mini-batches. To avoid that this approaches batch training for smaller training sets, a smaller batch size is chosen for them. The training lasts between 100 and 1000 epochs; the exact number is determined by early stopping based on validation loss with a patience value of 25. The model is taken at the time it has had the lowest validation loss. An overview of the chosen parameters is given in table V.
The training takes place on a 60 × 4 lattice; the first number refers to the temporal dimension and the second to the spatial one. All data in the training set and the validation set have been generated with µ = 1.05.
Both translationally non-equivariant architectures (ST and FL) are trained with and without data augmentation. The training data are augmented by randomly shifting the input data by a number of pixels that is determined by the symmetry properties of the respective architecture. ST architectures contain at most two spatial pooling layers with a 2 × 2 kernel and a stride of 2, as is shown in table II. Therefore, they still incorporate translational equivariance under shifts of multiples of 4 (see section III); and the data can be augmented by shifts of [0, 3] in both directions. FL architectures, however, do not incorporate translational equivariance under any shifts of the input; thus, the data have to be augmented by shifts determined by the lattice size, i.e. by [0, 59] in the time direction and by [0,3] in the space direction.
The testing can be divided into two parts. As a first step, each architecture is evaluated on the same lattice size as it has been trained on, for various values of µ. This checks whether networks of a given architecture are able to generalize to values of µ that are not represented in the training set. Then, the generalization ability to other lattice sizes is investigated. This second step can be done only with architecture types EQ and ST because FL architectures require a fixed input size.
The test set on the 60 × 4 lattice contains samples that have been generated with various values of µ, most of which have not been used for the training and validation sets. This test set contains 4000 lattice configurations pertaining to each µ ∈ [0.91, 1.05], with steps of ∆µ = 0.005, where only the last value µ = 1.05 has been used for training and validation. This amounts to 1.16 × 10 5 testing samples in total.
For testing on different lattice sizes, we generated a test set analogous to the one on the 60 × 4 lattice on a 50 × 2, a 100 × 5, a 125 × 8 and a 200 × 10 lattice. For each of these lattice sizes, we created again 1.16 × 10 5 test samples, 4000 pertaining to each µ ∈ [0.91, 1.05], with steps of ∆µ = 0.005. Note that the winning ST architecture (see table IV) can be evaluated on the 50 × 2 lattice because it contains only one spatial pooling layer with a 2 × 2 kernel and a stride s = 2. Further details on the dataset generation can be found in appendix B.

C. Results
In this subsection, we will discuss the test results on the 60 × 4 lattice, which is the lattice size on which the training took place, in detail before analyzing the generalization ability to other lattice sizes of the different network types. Then, we will investigate the Silver Blaze phenomenon on the larger lattice sizes with our trained models. Finally, we will discuss the results on our second set of test sets, which contains data generated with a chemical potential greater than the one used to create the training set.

Results on the same lattice size as training
The loss over the whole test set is a metric for how well the network performs. It is displayed in Fig. 2 for different training sets with a varying number of training samples. Essentially all of our models are trained until convergence, since we choose a very high number of maximum epochs and employ early stopping based on validation loss. Therefore, the comparison in Fig. 2 shows how the different architectures perform under limited information for smaller training set sizes. The plot at the top shows that the performance of the EQ architecture improves with the size of the training set, as can be expected. The other two architectures do not seem to benefit from increasing the number of samples in the training set, which is quite surprising. Another remarkable result is that data augmentation does not seem to lead to an increase in performance either, as can be seen in the plot in the middle and at the bottom. At first sight, one may draw the conclusion that the ST and the FL architectures do not allow for approximations that are as precise as the one of the EQ architecture. If the model has already converged to an optimal solution, adding more training samples, irrespective of them being newly created or coming from data augmentation, will not improve its performance. However, the blue downward spikes in the loss of the ST model show that some models succeed in finding a good approximation of the observables. Therefore, we draw the conclusion that although possible, it is more unlikely for the ST and FL models than for the EQ models to learn a good approximation of n and |φ| 2 .
The predictions of the individual observable's ensemble averages per µ made by the best EQ model, according to the test loss, are displayed in Fig. 3. It shows that the model, although trained only on samples generated with µ = 1.05, can generalize to all other values of the chemical potential in the investigated interval. This seemingly astonishing generalization ability can be understood by recognizing that the network does not need to generalize from one µ to all others but from the training samples to other samples, each consisting of a lattice configuration and two observables. Even though the training set contains only lattice configurations that have been gen- We exemplify this point using ST models that have been trained on 18000 samples: Figure 4 shows the predicted versus the true values of both observables of the best (top) and the worst (middle) performing ST model (according to the test loss) evaluated on the test set. The performance of the worst ST model on the training data is shown at the bottom in Fig. 4. Note that in this scat-  ter plot we do not average over the ensemble but show the predictions of the network for each individual example. Both networks are able to predict the larger values of both observables, but the worst one fails to predict the smallest values, since they are missing from the training set. The difference between the better and the worse ST models is the ability to generalize to lattice configurations and ranges of observable values that is has not seen during training. The bad performance overall with some better performing outliers, which is shown in Fig. 2, suggests that ST networks succeed only sometimes with this generalization. FL models show a similar behavior to ST models, but the predictions are less precise throughout. A more detailed discussion of the input value distributions is given in appendix B.

Results on different lattice sizes
One big disadvantage of FL architectures impedes them from predicting on other lattice sizes than the one it was trained on: it requires a fixed input size. For this reason, we can compare only the performance of the EQ and the ST architecture. Since the results for the latter with and without data augmentation are very similar, we will show only the results without data augmentation. Also, we will fix the size of the training set for this comparison to 20000 training samples. Figure 5 displays the overall test loss (top) and the individual losses of the observables (middle and bottom). Even though the ST architecture keeps its worse performance from the 60 × 4 lattice, the generalization ability to the different lattice sizes is comparable for the EQ and the ST architecture, with the exception of the 100 × 5 lattice for the latter. This kink in the blue curve shows up in the prediction of both observables, whereas this particular lattice size does not seem to be extraordinary to   . Predicted versus true observables for the best and the worst ST network that have been trained on 18000 samples. It shows that the ST architecture's best instance is able to accurately estimate the whole ranges of observable values (top) and that its worst instance is failing to do so for smaller values of n and |φ| 2 (middle). The reason for this is that the training set includes only larger values of the observables (bottom) and that the worst model is not able to generalize beyond that. The top and the middle plot show 1% of the test data; the bottom plot shows 4% of the training data.
the EQ architecture. The problem is the odd number in the lattice dimension. This behavior can be explained by a closer inspection of the ST architecture: the first three convolutions leave the input size unchanged because they employ circular padding. Then, these 100 × 5 data are passed to a spatial pooling layer with a 2 × 2 kernel and a stride of 2. This layer disregards 20% of the data and outputs data with a shape of 50 × 2. Consequently, the ST networks cannot use all of the data to come to a prediction, which is therefore less precise. This is far less severe on the 125 × 8 lattice, because there the spatial pooling layer disregards only 1/125 of the data, which is not enough to be visible in Fig. 5. A more detailed analysis of said kink in the blue curve can be found in appendix D. lattice size test MSE: |φ| 2 Figure 5. Overall test loss (top) and its two parts (middle and bottom) that come from each observable, on various lattice sizes. The training has taken place on the 60 × 4 lattice. Both architectures generalize well to lattice sizes different from the one they were trained on, but the ST architecture (blue) performs visibly worse on the 100 × 5 lattice. The reason for this is the spatial pooling layer within the architecture, which drops 20% of the data, leading to a less accurate prediction for both observables.

Silver Blaze phase transition
The Silver Blaze [41] phenomenon refers to a secondorder phase transition at vanishing temperature T , where thermodynamical observables are independent of the chemical potential µ below a critical value µ c [32]. This means that the observables n and |φ| 2 are constant for µ < µ c , whereas they start rising if the chemical potential surpasses its critical value. The particle density n is an order parameter of the Silver Blaze phase transition. As a result of the finiteness of our lattices, the temperature is non-zero (T ∝ 1/N t , where N t is the number of lattice sites in the time direction), and thus the transition is not necessarily sharp. Because of the networks being trained to approximate the particle density and the lattice averaged squared absolute value of the field, this phase transition should also be visible in their predictions. Best EQ (100 × 5) Best EQ (100 × 5) Best EQ (100 × 5) Best EQ (100 × 5) Best EQ (100 × 5) Best EQ (100 × 5) Best EQ (100 × 5) Best EQ (100 × 5) Best EQ (100 × 5) Best EQ (100 × 5) Best EQ (100 × 5) Best EQ (100 × 5) Best EQ (100 × 5) Best EQ (100 × 5) Best EQ (100 × 5) Best EQ (100 × 5) Best EQ (100 × 5) Best EQ (100 × 5) Best EQ (100 × 5) Best EQ (100 × 5) Best EQ (100 × 5) Best EQ (100 × 5) Best EQ (100 × 5) Best EQ (100 × 5) Best EQ (100 × 5) Best EQ (100 × 5) Best EQ (100 × 5) Best EQ (100 × 5) Best EQ (100 × 5) Best EQ (100 × 5) Best EQ (100 × 5) Best EQ (100 × 5)  the smaller lattices show no phase transition in the range of µ that we analyzed. This is because µ c decreases for increasing temperature. The Silver Blaze phase transition is also predicted correctly by the ST models that accurately generalize to the smaller values of the observables, e.g. by the model that is shown at the top in Fig. 4, but not all ST models generalize well.

Extrapolation to larger chemical potentials
After inspecting the remarkable results that the EQ architecture and some models of the ST architecture achieved on the interval µ ∈ [0.91, 1.05], the question remains as to how the different architectures perform on data corresponding to chemical potentials greater than  Fig. 7. The individual rows correspond to the respective best model of each architecture, according to the validation loss. Although the extrapolation to higher |φ| 2 seems to be more difficult than to higher n, the predictions of the EQ architecture's best model remain close to the identity line, and they are visually better than the predictions of the other two architectures' best models, the FL model performing the worst. This leads to a visible deviation in the ensemble averages of the observables only for µ = 1.5 and is comparable on all lattice sizes under consideration, with the exception of the FL architecture, which allows only for predictions on the 60 × 4 lattice without adapting the architecture and retraining. Note that "best" refers to the validation loss and that there are models of each architecture that extrapolate better than the respective ones depicted in Fig. 7. However, since we are analyzing the generalization capabilities of the networks, we are restricted to metrics that take into account only the training and the validation data, and we chose the validation loss. Figure 8 shows the total and individual test losses over µ on the 60 × 4 lattice. While the large difference between the different architectures in performance on chemical potentials smaller than µ = 1.05 is quite substantial, the performance on larger values of µ differs by less. At µ = 1.5, for example, the mean and median losses of the EQ architecture are lower than their respective counterparts belonging to the other architectures, but there the ST architecture's best model leads to the lowest. An analogous comparison between the EQ and the ST architecture on other lattice sizes leads to similar results, with the exception of the 100 × 5 lattice, on which the latter fails. Note that Fig. 7 depicts the model with the lowest validation loss pertaining to each individual architecture. For the EQ architecture, it is a model of the ensemble that has been trained on 20000 training samples, whereas for the ST and the FL architecture, it is a model that has been trained on 18000 training samples. Figure 8, however, shows the ensemble of models trained on 20000 training examples for each individual architecture.

Results summary
In summary, the best translationally equivariant architecture performs better than the respective best model of the other two types on the lattice size that they have been trained on. Only some ST networks are able to generalize beyond values of observables that they have encountered during training, while EQ networks show no such problem. The FL architecture shows similar behavior to the ST architecture, but its predictions are less precise overall. Models of FL architectures cannot be applied to different lattice sizes. The EQ and the ST architecture are both capable of generalizing to different lattice sizes, although the latter retains the higher average test loss from the 60 × 4 lattice due to the bad generalization to observable values that were not in the training set. Furthermore, ST architectures are not suited to make predictions on every arbitrary lattice size. Each lattice dimension has to regard the behavior of the spatial pooling layers in the network in order to use all the data for the prediction. EQ architectures have the advantage to impose no such restriction. Even though all the models have been trained only on µ = 1.05 on the 60 × 4 lattice, many of them are able to predict the Silver Blaze phase transition on a different lattice size, where µ c 1.05. The EQ architectures do this especially well. We found that data augmentation does not help in the training of ST and FL architectures, which is why we refrain from using it in the next two tasks.
Lastly, we want to make a comparison to the results of [22] where the same regression task was performed. Our best model needs much fewer trainable parameters than the one in [22], i.e. approximately 3 × 10 4 compared to over 10 7 as extracted from their network architecture. We also found well-performing models that contain by an order of magnitude fewer parameters than our best one. Furthermore, their network architecture would fall in our FL category, which means that it can be employed on only one specific lattice size.

V. CLASSIFICATION: DETECTING FLUX VIOLATIONS
In the previous section, we have found that rather simple CNN models can approximate the functions n and

t t t t t t t t
field configuration

t t t t t t t t flux violation (a) Example field configuration
Best EQ model

t t t t t t t t
channel 1 channel 1 channel 1 channel 1 channel 1 channel 1 channel 1 channel 1 channel 1 channel 1 channel 1 channel 1 channel 1 channel 1 channel 1 channel 1 channel 1 channel 1 channel 1 channel 1 channel 1 channel 1 channel 1 channel 1 channel 1 channel 1 channel 1 channel 1 channel 1 channel 1 channel 1 channel 1 channel 1 channel 2 channel 2 channel 2 channel 2 channel 2 channel 2 channel 2 channel 2 channel 2 channel 2 channel 2 channel 2 channel 2 channel 2 channel 2 channel 2 channel 2 channel 2 channel 2 channel 2 channel 2 channel 2 channel 2 channel 2 channel 2 channel 2 channel 2 channel 2 channel 2 channel 2 channel 2 channel 2 channel 2 3  channel 3  channel 3  channel 3  channel 3  channel 3  channel 3  channel 3  channel 3  channel 3  channel 3  channel 3  channel 3  channel 3  channel 3  channel 3  channel 3  channel 3  channel 3  channel 3  channel 3  channel 3  channel 3  channel 3  channel 3  channel 3  channel 3  channel 3  channel 3  channel 3  channel 3   |φ| 2 sufficiently well. In fact, the function n can be exactly represented by a linear, equivariant model with a single 1 × 1 convolution. Similarly, while |φ| 2 does not admit an exact representation in terms of 1 × 1 convolutions, it is easy to see that the lattice averaged quantity x |φ x | 2 /(N t N x ) can be written as a sum over a function that receives dominant contributions from k x,µ and l x,µ at the same lattice site x.
In order to study models that require larger kernel sizes, we need to shift our focus to quantities that cannot be computed by taking into account only field values at a single lattice site. One example for such a quantity is the local flux violation Evaluated at some lattice site x, it specifically requires information from nearest neighbors surrounding x.
We therefore propose to solve the following classification task: an arbitrary field configuration X = {k x,µ , l x,ν } is mapped to the label y(X): Since the worm algorithm generates only physical field configurations which by design satisfy the flux constraint F x = 0, ∀x, we adapt it to generate configurations including open worms. The field configurations generated this way exhibit flux violations at each end of the open worm (see Fig. 9). While we will be using such open worm configurations only for the purpose of classification and regression tasks, they are typically utilized in the calculation of n-point functions of φ [42,43]. For this task (and the following counting task in section VI), we have generated field configurations on square lattices of various sizes given by (N t × N x ) ∈ {8 × 8, 16 × 16, 32 × 32, 64 × 64}. The value of the coupling constant is fixed to λ = 1, the mass m takes values given by η = 4 + m 2 ∈ {4.01, 4.04, 4.25}, and possible values of the chemical potential µ are given by µ ∈ {1, 1.25, 1.5}. Training is performed only on the smallest lattice size (8 × 8) and two specific choices for the pair (η, µ): (η 1 , µ 1 ) = (4.25, 1) and (η 2 , µ 2 ) = (4.01, 1.5). We use a fixed number of training examples, N train = 4000, distributed equally between the two classes: on half of the generated field configurations, we generate an open worm on top of a flux-constraining configuration. Other combinations of parameters and lattice sizes are used only during testing. Further details regarding the datasets can be found in appendix B.

A. Architecture search, training and testing
We aim to make a comparison between the three different architecture types that have been presented in Fig. 1. As discussed previously, both EQ and ST architectures can be applied to field configurations of varying lattice size, while FL architectures are compatible only with a fixed lattice size. As we are dealing with a binary classification problem, a sigmoid activation function is applied to the output of our models.
To facilitate a fair comparison among architecture types, we use optuna to perform a search for wellperforming architectures using validation loss (binary cross entropy loss) as the metric to optimize for. In all three cases, we allow for up to N conv,max = 3 convolutional layers with circular padding and a maximum kernel size of K = 3 and N ch ∈ {4, 8, 16, 32} possible channels. Every convolution is followed by applying a LeakyReLU activation function. In addition, after every convolution except the last we allow for the insertion of a pooling layer (either average or max pooling) with stride s = 1 in the case of EQ networks and s = 2 in the case of ST and FL networks. For non-equivariant architectures we require at least one pooling layer with s = 2 to break translational equivariance. Following this convolutional part of the network, we either apply a global max pooling layer (EQ and ST) or flatten the remaining lattice structure (FL). Although other global pooling layers are possible (e.g. average pooling or sum pooling), global max pooling seems to be the most fitting choice when the task is to detect point-like defects in the field configuration. We note that, as an additional search parameter, we allow for explicitly setting bias terms to zero in every convolutional layer. The resulting feature map is then fed to a dense network with up to N dense,max = 2 layers with N nodes ∈ {4, 8, 16, 32} nodes. Every linear layer is followed by the application of LeakyReLU. A final linear layer is used to map the activation values to a single output node, which is followed by a sigmoid activation function. As before, we use binary search parameters for setting bias terms to zero in each linear layer.
For each architecture type, we perform two optuna search runs with 400 trials each. Each model candidate (i.e. a set of hyperparameters) is trained five times with randomly initialized weights to reduce random fluctuations from the stochastic optimization algorithm. Among the two searches, the best-performing architecture (according to validation loss) is chosen and retrained 50 times to build an ensemble of models for each architecture type.
Training proceeds similar to the regression task of section IV. We use the AMSGrad variant of the AdamW optimizer without weight decay, a learning rate of λ lr = 10   lattice size test accuracy Figure 11. Top: test loss for best equivariant (EQ, green) and non-equivariant strided (ST, blue) classification architectures as a function of lattice size. Bottom: test accuracy as a function of lattice size. The networks have been trained on the 8 × 8 lattice only. We observe that both types of architecture lead to good generalization across lattice sizes with slightly less variation in the performance of the EQ architecture.

B. Results
Our main results are presented in Figs. 10 and 11. Figure 10 shows a comparison of all three architecture types evaluated on 8 × 8 lattices as a function of µ. Both EQ and ST exhibit very good classification accuracy and test loss, while our ensemble of FL models contains a few outliers which increase the average test loss. Figure 11, which shows an average (loss and accuracy) of all available test datasets, demonstrates that both EQ and ST architectures generalize well on larger lattices. FL architectures are not included, since they can be used for only one specific lattice size (in this case, 8 × 8). It is evident that FL models perform worse on average compared to EQ and ST networks, but, in contrast to the previous regression task, the non-equivariant architecture without flattening (ST) exhibits similar performance to the equivariant architecture (EQ). The loss of spatial information due to pooling operations with stride s > 1 does not seem to affect the ability of ST models to correctly classify flux violations.
In light of the results in Figs. 10 and 11, the question arises how EQ and ST models are able to make predictions with such high accuracy and if the computation that is performed by the networks can be easily understood and interpreted. To answer this, we "dissect" fully trained EQ and ST models by examining the feature maps that are generated by the convolutional part of the network, i.e. we modify models by removing the global pooling operation and the dense network. Examples of these feature maps are shown in Fig. 9 (b). We see that some of the channels of the output of the convolutional part of the network highlight flux violations in the vicinity of one of the open ends of the worm (see Fig. 9 (a)). At first, it seems surprising that only one of the two open ends is detected. However, one has to keep in mind that the models were not directly trained on the local flux violation as in Eq. (22) but instead were only given global information about whether or not a field con-figuration contains a violation. Detecting a single defect is sufficient to make the correct prediction.
It is further noteworthy that, compared to typical deep learning models, the models found in our architecture search are rather small, with ∼ 2700 parameters in the case of EQ models and ∼ 1000 parameters for our best ST architecture on this task.

VI. REGRESSION: COUNTING FLUX VIOLATIONS
A natural extension of last section is the study of lattice configurations with more than one open worm, meaning a regression task where the inputs are lattice configurations that are labeled by the number of open worms N worms that they contain. The addition of an open worm implies the emergence of a flux violation at its head and tail, meaning that the quantity defined in Eq. (22) respects F x = ±1 at an open worm endpoint. As discussed in more detail in Appendix B, we explicitly forbid endpoints of different worms to lie on top of each other; therefore, a configuration with N worms open worms is characterized by 2N worms points where F x = ±1. With this clarification, the task we are going to tackle in this section can be formally expressed as the approximation of the function where X is a lattice configuration {k µ , l ν }. We note that this task resembles a simplified version of other counting problems, such as crowd counting [44]. The physical parameters are the ones mentioned in the previous section, with the addition of a number of open worms ranging from 0 to 10, yielding a total of 36 × 11 = 396 combinations of parameters. While the test set includes data coming from all these combinations, the training set consists of data created at only a small subset of such combinations to inspect the generalization capabilities of the architecture under consideration. We use a training set with N train = 20000 samples distributed equally between two different numbers of open worms N worms ∈ {0, 5} and physical parameters (η, µ) ∈ {(4.01, 1.5), (4.25, 1)}. The validation set contains N val = 2000 samples. For more details regarding the datasets, see appendix B.

A. Architecture search, training and testing
A preliminary phase is carried out in order to explore trends with different hyperparameter choices. We also empirically confirm the relationship between the prediction of an extensive quantity and the necessity of a global sum after the convolutional part of the neural network, as discussed in section II. The information gathered in this initial stage is then used to determine the architecture search space for optuna. As in the two previous tasks, this is done for the three architecture types shown in Fig. 1. The search spaces are designed to be as similar as possible to eliminate favorable conditions for any of the three architecture types.
The EQ architecture search space is characterized by N conv ∈ {2, 3, 4} convolutional layers with a kernel size K ∈ {1, 2, 3}, followed by a global sum pooling layer which leads to a dense network, composed of N dense ∈ {0, 1, 2} layers. The ST architecture search space is structured in the same way with the additional insertion of N pool ∈ {1, 2} spatial pooling layers with stride s = 2. Since training is conducted on 8 × 8 lattices, three such pooling layers would reduce the lattice to only one site and render global sum pooling ineffective, which is why we limit the choice of N pool . The FL architecture search space features two mandatory convolutions with a 2 × 2 or a 3 × 3 kernel, each followed by a spatial pooling layer. A 1 × 1 convolution can be inserted before and after each mandatory convolution, leading to a total number of convolutions N conv ∈ {2, 3, 4, 5, 6}. This part is followed by the flattening layer and a dense network consisting of N dense ∈ {1, 2, 3} layers, where the maximum number of layers is increased with respect to the other two architecture types to compensate for the possible absence of 1×1 convolutions. All three types share the following features: circular padding is used in every convolution; the channels in the convolutions and the nodes in the dense layers are selected from the set N ch/nodes ∈ {4, 8, 16, 32}; a LeakyReLU activation function is used after every convolution and every linear layer not leading to the output; the bias in both the convolutions and the linear layers is turned off. We also mention that an independent search is run also for EQ architectures with the optional inclusion of spatial pooling layers with stride s = 1, in the same fashion described for ST models. However, none of the EQ models found in this run are better than the EQ models found in the previous search.
As in the previous section, two metrics are employed for performance analysis: the MSE loss and the accuracy, for which predictions are rounded to the closest integer. The quantity monitored during the optimization phase is validation loss. Since the hyperparameter search spaces are large, two optuna runs are executed to reduce the risk of overlooking promising regions. For each hyperparameter selection, three models are trained, in order to attenuate initialization influences, for 200 epochs with no early stopping. The other hyperparameters are defined prior to the optimization: we adopt a batch size of 16, a learning rate λ lr = 10 −3 and the AMSGrad variant of the AdamW optimizer with zero weight decay.
Out of 100 different architectures from the two optuna searches, the best three for each type are selected according to the validation loss averaged over their three initializations. These architectures become the starting point of the next step: training the most promising architectures from scratch. We keep all the same hyperparameters, except for the number of epochs which is increased to 500, and the same training and validation sets. For a fair comparison 20 instances of the same architectures are trained to mitigate the influence of random initializations, and for each of them the best model is saved. We sort the architectures according to the average over the 20 models of the validation loss. Table VII portrays the details of the feedforward networks.

B. Results
Since FL models cannot be evaluated on input sizes different from the ones they have been trained on, we make two kinds of comparison between architectures: one involves all three types tested only on 8 × 8 lattices, and the other focuses on EQ and ST tested on all lattice sizes available. The first analysis is featured in Fig. 12 and the second in Fig. 13, where the dashed lines indicate the mean values and the markers represent the medians.
A common takeaway of these plots is that for this task equivariance proves to be an important property to incorporate into the network. Interestingly, Fig. 12 suggests that ST and even more so FL have difficulties in recognizing certain numbers of open worms, with the lowest performance at N worms = 1, which is compatible with the fact that the training set consists only of N worms ∈ {0, 5}.
We observe that the podium ordering depends on the metric chosen; for example, in table VIII, the mean and the median of the validation loss lead to different winners for all architecture types. As the training and validation sets are characterized by the same physical parameters, which represent a very small subset of the whole set of parameters used for testing, metrics on the validation set may not be indicative of the generalization capabilities of the network, so we investigate the same metrics on the test set in the two manners described at the beginning of this subsection.
An in-depth analysis is shown in Figs. 14 and 15, depicting the relationship between the test and validation loss for all 20 models of each architecture. If an architecture is prone to generalization issues, its instances are scattered mostly vertically, which manifestly happens to most of the ST and FL architectures. EQ models are instead distributed closer to the black line, where test and validation loss are equal. These results suggest that for EQ architectures low validation loss correlates with low test loss, and therefore they tend to reliably generalize.
Remarkably, there is also a non-equivariant architecture featuring this behavior, specifically the 2nd ST, whose best two instances even outperform the best EQ models by almost an order of magnitude both in the validation and in the test loss. This is an illustration that the validation procedure does not guarantee generalization if the validation set is restricted to a small set of physical parameters. Indeed, the test loss on 8 × 8 lattices in the central column in table VIII already contains a generalization in terms of physical parameters, which implies that it would be sufficient to include at least some of those configurations in the validation set in order to select the best generalizing architecture on different lattice sizes.

VII. CONCLUSIONS AND OUTLOOK
In this work, we studied the effect of imposing global translational invariance on convolutional neural network architectures. We did so by comparing three different architecture types that are commonly used and which differ with regards to their equivariance and generalization properties. Network architectures that use only convolutions or pooling operations of stride one and a global pooling layer before a subsequent dense network preserve translational equivariance. Such networks are also able to generalize to different input sizes if the global pooling operation is compatible with the intensive or extensive property of the output quantity. Network architectures that contain pooling operations with a stride greater than one generally break translational equivariance. Using a flattening operation instead of global pooling further impairs translational equivariance of the network and restricts its usage to one particular input shape, preventing a straightforward generalization to other lattice sizes without retraining. This latter architecture type has been particularly popular in image classification tasks and has subsequently also been used in physics applications.
We chose three different tasks related to characterizing complex scalar field configurations on a two-dimensional lattice with periodic boundary conditions that are given in the flux representation. This representation contains integer-valued field configurations which have to obey a flux conservation law. Valid configurations are generated using the worm algorithm. The first task we performed was a regression task to predict the particle density n and the field average of |φ| 2 , given just the plain field configurations in flux quantities. The predicted observables also depend on various physical parameters, including the chemical potential µ, that are set during the generation of the configurations. We found that it is sufficient to train at only one value of µ in order to be able to generalize to other values of µ, in particular also to extrapolate beyond the Silver Blaze phase transition. While this result seems surprising at first, it can be explained by the fact that different input configurations at fixed training chemical potential µ train already cover a wide range of possible input values that are shared between physical parameters, i.e. other values of µ. Comparing the three architecture types by selecting their best-performing representatives from a network architecture search using optuna, we generally find that equivariant architectures perform best at this task. They excel, in particular, when increasing the size of the training set. We also explored whether data augmentation on the input side can compensate for missing equivariance, but this turned out to have a barely noticeable effect on the result. Furthermore, we investigated the generalization properties to smaller and larger lattice sizes, which is possible only for architectures that contain a global pooling layer. Again, across all lattice sizes, the equivariant architecture wins. Strided architectures can generalize only to lattice sizes that are multiples of the stride combinations used in the network, whereas flattening architectures are not able to generalize at all to other lattice sizes. These architectures would have to be retrained for each input size separately.
The next two tasks were related to detecting and counting flux violations from open worm configurations. Such configurations can appear in the calculation of npoint functions. They are particularly interesting as the result cannot be approximated by a purely local function, which would involve only 1 × 1 convolutions, but requires at least a 2 × 2 convolution. In the case of detecting flux violations, flattening models perform worst, while equivariant and strided architectures with global pooling layers are both able to predict the result with comparably high accuracy. Inspecting the feature maps of the trained models, we found that these models learn to detect only one end of the open worms, but this is sufficient to solve this task. For the third task to count the number of flux violations, we trained only on configurations containing either zero or five open worms and tested on configurations that contained any number of worms from zero to ten. In this setup, again the equivariant architecture wins compared to strided or flattening architectures. Interestingly, the networks have most problems to differentiate between zero and one worms, while a larger number of worms poses fewer problems. Another interesting observation is that the selection procedure of the network architecture search can lead to different optimal choices with very different generalization properties. Because of our particular choice of validation and test data, the validation loss alone is not sufficient to select the architecture that can generalize best to other physical parameters or network sizes. The optimal models we found were much smaller regarding the number of weight parameters than models used in comparable studies in the literature.
Based on our findings, we can clearly recommend using global pooling layers in future machine learning tasks that involve systems with global translational invariance. Global pooling layers allow to easily generalize results to different lattice sizes in regression and classification tasks. Whether the advantages of using pooling layers with a stride greater than one outweigh the possible disadvantages of breaking translational equivariance depends on the system being studied. An interesting aspect that warrants further study is the question of why some architectures seem to generalize better than others and whether there is a way to identify or characterize such architectures already before testing on an extended test set. Moreover, physical parameters may not be the best quantities for assessing generalization capabilities, but one should rather study the distribution of input and output values of the network. While in this work we concentrated on translational symmetry, it would be interesting to extend this study in the future to further symmetries on the lattice using, for example, G-CNNs. One could also examine coset pooling at intermediate layers that respect translational invariance. Finally, based on our findings, it seems worthwhile to investigate and study possible translationally equivariant versions of current architectures that explicitly break translational invariance, even though the underlying theory would respect this symmetry.

VIII. ACKNOWLEDGEMENT
We thank Kai Zhou for correspondence. DM thanks Jimmy Aronsson for valuable discussions regarding group equivariant neural networks. This work has been supported by the Austrian Science Fund FWF No. P32446-N27, No. P28352 and Doctoral program No. W1252-N27. The Titan V GPU used for this research was donated by the NVIDIA Corporation.
The Euclidean action, which is given by Eq. (A3), can be discretized, which makes it possible to analyze the complex scalar field on the lattice. The result reads (see e.g. [32]) where η = 2D + m 2 = 4 + m 2 , and δ ν,2 is the Kronecker delta. The first sum is over all lattice sites x, and the second one is over the two directions: space and imaginary time. The position x +ν is reached by moving one unit vectorν from x in the ν direction. Naturally, periodic boundary conditions are employed. In Eq. (A4) we have explicitly set the lattice spacing to unity. This implies that all dimensionful quantities such as m and µ are understood to be given in appropriate units of the lattice spacing. We limit the extension of the system to L in the spatial direction and to 1/T in the temporal one, where T denotes the temperature.
For non-zero chemical potential µ, the action in Eq. (A4) becomes complex. This is problematic because in this case the term e −S lat cannot be interpreted as a probability distribution, and therefore it is not possible to use standard Monte Carlo sampling to determine the partition function and its derivatives. To circumvent this so-called complex action problem, which is also known as the sign problem, one can work in a dual formulation, known as flux representation. The derivation of the partition function in the flux representation can be found in [32]. The result reads where the N lattice sites have been labeled with numbers x ∈ {1, 2, . . . , N }. The degrees of freedom are the four integer fields k x,ν and l x,ν , where ν = 1, 2. The former must obey the flux conservation law at all lattice sites x for the Kronecker delta not to vanish; the latter are non-negative. The function W (f x ) is given by and its integer valued argument reads Observables can be derived from the partition function and written in terms of the dual variables k x,ν and l x,ν . In this paper, two quantities are of special interest, namely the particle number density n and the lattice averaged squared absolute value of the field |φ| 2 . Their ensemble averages · · · are given by where N x (N t ) is the number of lattice sizes in the spatial (temporal) direction. For our machine learning tasks, we associate each individual configuration {k x,µ , l x,µ } with particular values of n and |φ| 2 in Eqs. (20) and (21), even though the dual formulation does not allow for a direct mapping between field configurations φ x and link configurations {k x,µ , l x,µ }.

Appendix B: Datasets
In this appendix, we discuss the Monte Carlo procedure we use to generate the datasets for our machine learning tasks.
The flux representation, which is given by Eq. (A6), is characterized by the positive field l and the field k constrained by Eq. (A8). Since they are different in nature, a suitable algorithm is composed of two distinct parts, each of which takes care of the modifications of the respective field. The link variables l are updated using a standard Monte Carlo algorithm, where the Metropolis acceptance probabilities are ratios of Boltzmann weights of the dual action. The links k are updated by means of the worm algorithm, originally proposed in [33], where the acceptance probabilities follow the prescriptions given in [32].
Using these algorithms, we generate all datasets in this work.
The initial configuration is set to zero at every lattice site for both k and l. Before reaching equilibrium, the system undergoes a thermalization phase, which we discard. Since autocorrelation in the dataset can affect the learning process, we monitor it and set an appropriate number of waiting sweeps between each measurement.

Regression: predicting observables on the lattice
The dataset contains lattice configurations and corresponding n and |φ| 2 values, the first ones being the input for the CNN and the latter being the quantities to predict. We create data with the following set of physical parameters: η = 4.01, λ = 1 and µ ∈ {0.91, . . . , 1.5}, where values in the range [0.91, 1.05] are separated by ∆µ = 0.005, while ∆µ = 0.1 in the range [1.1, 1.5]. We choose five different lattice sizes: 50 × 2, 60 × 4, 100 × 5, 125 × 8 and 200 × 10, where the first number is N t = 1/T and the second one N x = L. Different N t means different temperature T , which influences the properties of the phase transition, as shown in Fig. 4. The total amount of training data is N train = 20000, generated at µ = 1.05 on the 60 × 4 lattice, and the whole validation set consists of N val = 2000 at the same µ and lattice size. We define two distinct test sets, both containing 4000 data points per each µ at each lattice size. The first test set (test set A) is characterized by values of µ ∈ {0.91, . . . , 1.05}, which correspond to the ones used in [22], in order that a direct comparison with the results found there is possible. The second test set (test set B) is designed to examine the extrapolation abilities of the neural networks to chemical potentials higher than the one they have been trained on, specifically in the range µ ∈ {1.1, . . . , 1.5}. The total amount of test data is N test = 4000 × 5 × (29 + 5) = 680000. We discard the first 1000 sweeps to disregard thermalization and then save a configuration and the respective observables every five sweeps. For some combinations of chemical potential and lattice size, we observe a high autocorrelation. In these cases, the number of sweeps is increased to 50, which sufficiently reduces the autocorrelation.
We now closely inspect the distribution of the two fundamental quantities needed for the computation of the observables n and |φ| 2 , namely k x,t and f x . This is meant to give an additional insight into the dataset properties and the generalization capabilities of the architectures. We remind that k x,t can take any integer value, while f x is either 0 or a positive even number. In the following discussion, we omit the lattice index x and use k t , k x and f instead.
In Figs. 16 and 17, the first histogram corresponds to the distribution of the training set, while the second and third show the distributions of test sets A and B, respectively. The domains covered by the training set and test set A are approximately the same, meaning that the gen- . The test sets maintain a similar distribution along different lattice sizes. Even though training and test set A cover the same domains, their distributions are different, which is the origin of the generalization issues of some architectures. The distribution of test set B also reaches higher values of kt, which can make a generalization to data in test set B even more difficult than to data in test set A. Bars corresponding to weights smaller than 10 −4 in each plot are not shown. eralization we require does not involve an extrapolation. Given this, it is easy to see why models that perform well during training and validation are able to generalize to different physical parameters. Despite having the same domain, there is an evident discrepancy in the distributions of the training set and test set A, which is caused by the different values of µ that are used to generate the lattice configurations. This could be a possible source of generalization issues affecting some architectures. Note that while the distributions appear to be somewhat similar, there may still be additional differences in the correlations of these quantities, which could further impair generalization capabilities of networks. The domain that is covered by test set B, however, is larger than the domain of the training set, arguably making the generaliza- tion to these data more demanding.
While the relationship between n and k t is linear, as shown in Eq. (A11), |φ| 2 is highly non-linear in f , as indicated by Eqs. (A12) and (A9). One might find it surprising that a CNN is able to learn such a complicated function and even generalize to other physical parame-ters. Alongside the observations on domain and distributions, we have to consider the ratio W (f + 2)/W (f ) that enters in Eq. (A12). As shown in Fig. 17, it can be effectively approximated by a linear function in the range where most of the distribution of the training set and test set A is concentrated. This explains why even simple models can easily learn to predict |φ| 2 on these data. The larger values of f that are represented in test set B, however, lead to larger values of said ratio. At these values, the linear approximation that might be a good approximation on the training set and test set A worsens, and so do the predictions of |φ| 2 .

Classification: detecting flux violations
The algorithm presented in [32] is designed to generate only closed worm configurations, which respect the flux conservation and allow to compute the observables n and |φ| 2 . For the classification and regression tasks in sections V and VI, we want to create field configurations where the flux conservation (A8) is violated. In order to do this, we modify the algorithm of the previous subsection in the following manner: after equilibrium is reached via the original l and k alternate update, we start a new worm and save the configuration with one open worm. As the worm moves on the lattice, we replace the stored configuration with probability 1/L, where L is the current worm length, until the worm closes. One can easily check that this corresponds to selecting one of the open worm configurations with equal probability.
The dataset consists of closed worm configurations, labeled as class 0, and open worm configurations, labeled as class 1, each originating from two independent runs of the algorithm. Both classes are characterized by the same physical parameters, namely η ∈ {4.01, 4.04, 4.25}, λ = 1, µ ∈ {1, 1.25, 1.5} and N t = N x ∈ {8, 16, 32, 64}. The training set is generated on a particular subset, specifically the two combinations (η, µ) ∈ {(4.01, 1.5), (4.25, 1)} on the smallest lattice size, i.e. 8 × 8, with a total number of N train = 4000 samples equally distributed among each class and parameter combination. The validation set has the same structure, the only difference being the number of samples of N val = 400. The test set contains 100 instances per each class and parameter combination, summing up to 100×2×36 = 7200 samples. The number of skipped configurations to avoid picking samples while the thermalization process is still ongoing is chosen as 2000. We use 100 waiting sweeps between each measurement. The dataset created for this task and the next one share very similar characteristics. We address the analysis of only the third task in the following subsection, implying that the considerations we make there are also valid in this context.

Regression: counting flux violations
The algorithm designed in the previous task is extended to account for multiple worms. After the first configuration with an open worm is saved as described in the last section, it becomes the starting configuration for the next worm to be drawn. We explicitly prohibit that a worm can cross heads and tails of previous worms, i.e. lattice sites where the flux is violated. By doing this, we ensure the absence of mathematical ambiguity in the definition of the Metropolis acceptance probability. As a consequence, three values for the flux are possible: 0, +1 and −1. The procedure is repeated until the required number of open worms is reached and the configuration is saved. Then the last configuration without open worms is restored, the established waiting sweeps are performed and another set of open worms is drawn.
The same set of physical parameters of the previous task is used with the addition of the number of worms N worms ∈ {0, 1, . . . , 10}. The subset of parameters for the training set is chosen as the combinations (η, µ) ∈ {(4.01, 1.5), (4.25, 1)} and N worms ∈ {0, 5}, again on the smallest lattice size. The total amount of data used is N train = 20000 when training only on 0 and five worms. The validation set consists as usual of a number of configurations such that N val = N train /10 of the number of training samples. The test set contains again 100 samples per parameter combination, leading to a total of 100 × 11 × 36 = 39600 instances. Initially skipped configurations and waiting sweeps are the same as in the previous task.
The two quantities necessary for the computation of the flux are the two integer fields k t and k x , as suggested by Eq. (A8). Their distributions are depicted, respectively, in Figs. 18 and 19. We can draw the same conclusions as in the first task concerning the similarity of the domains and the difference in the distributions between training and test sets. We add that the choice of (η, µ) for the training set is made specifically to include the lower and higher values of k t , in such a way that the domain covered is the same in the training and in the test set. This explains the two peaks in the k t training distribution in the top histogram in Fig. 18. Such behavior does not emerge in the case of k x because, unlike k t , it is not coupled with the chemical potential.
The global average pooling over an N ×N feature map f is given by The spatial average pooling P can be interpreted as a special convolutional layer using the filter ψ(x) = 1/k 2 , where The resulting feature map f : F ⊂ Z 2 → R has the dimensions N/k × N /k. The validity of Eq. (C1) can be seen by = GAP(f (y)).
The first step uses Eqs. (C2) and (C4); note that the GAP is performed over the feature map f . The second equality holds for s = k, k|N and k|N , and the last one utilizes again Eq. (C2) but for the feature map f .
Remark. Even though the spatial average pooling layer with s > 1 breaks translational equivariance under arbitrary translations, the result after the GAP is still invariant under translations.
The following lemma shows that a convolutional layer that is directly followed by a global average pooling layer does not have the effect that one might expect a regular convolution to have. Loosely speaking, it does not effectively increase the network's depth because it collapses with the global pooling layer. This lemma should therefore also highlight the importance of the usage of an activation function between the last convolutional and the global average pooling layer.
The first equality combines the definitions of the GAP, given by Eq. (C2), and the convolution, given by Eq. (10). The second one utilizes the fact that the bias does not depend on the lattice site x. The third step takes advantage of the fact that ψ does not depend on x, and the fourth one makes use of the periodic boundary conditions and of Eq. (C2). The last equality holds because the result of the GAP does not depend on φ .
Remark. This is possible only without an activation function between the convolutional layer and the global average pooling. Also note the importance of periodic boundary conditions.
The following theorem combines both lemmas and shows that a spatial average pooling layer, followed by a convolutional and a global average pooling layer, still leads to an output that is invariant under translations of the input if the strides and kernel sizes are chosen appropriately. It emphasizes once again the importance of an activation function before the global average pooling.
Theorem 1. Given an N × N feature map, a k × k spatial average pooling layer with stride s = k, k|N , k|N and an l × l convolution ψ with a stride of one and a single output channel, applying the spatial average pooling layer, then the convolution and then the global average pooling layer is equivalent to applying the global average pooling layer, multiplying the result by the sum of the convolution's weights and adding the bias b.
Proof. Combining lemmas 1 and 2 with M = N/k and M = N /k leads to the desired result.
Remark. The generalization to more than one feature map and multiple output channels is straightforward.

Appendix D: Partially occluded input
In this appendix, we analyze the worse performance of the ST models on the 100 × 5 lattice from section IV, . Test loss (top) and its two parts (middle and bottom) that come from each observable, corresponding to discarding 20% of the input data, on various lattice sizes. The networks are trained on a 60 × 4 lattice. The dimension along which the input data are occluded is marked in red. For n, the predictions become worse if the data are concealed in the spatial direction. For |φ| 2 , the predictions become worse if any data are hidden, but they are slightly worse if data are suppressed in the spatial direction. The reason for this lies in the nature of the observables.
which is depicted in Fig. 5. The source of the problem is that a model ignores part of the data at a strided operation if kernel and stride are not compatible with the data's shape that the layer in question receives. In the case of the ST models, these operations are the strided spatial pooling layers. To examine this problem in more detail, we perform an experiment in which we hide a portion of the input data from the network for both architectures, ST and EQ: on every lattice size, the network is shown only a part of the input. This is done by discarding 20% of it either on the right or on the bottom of the lattice, so that the resulting restricted input still has a rectangular shape and thus a valid input size for the networks. This way, only 80% of the input data are shown to the network, but the value of the observable still corresponds to the full lattice configuration. The input size of the 50 × 2 lattice becomes 40 × 2, 60 × 4 becomes 48 × 4, 100 × 5 be-comes 100 × 4, 125 × 8 becomes 100 × 8, and 200 × 10 becomes 160 × 10 and 200 × 8, respectively. Note that the data are discarded four times in the temporal and twice in the spatial direction. The result of this experiment is shown in Fig. 20. The overall test loss (top) shows two kinks, namely for the lattices, for which the input was restricted in the spatial direction. These kinks are also seen in the loss curve corresponding to |φ| 2 (bottom) but are much more pronounced for n (middle). In fact, if the data are discarded in the temporal direction, the quality of the prediction of n barely changes at all, as can be seen by comparing the middle plot in Fig. 5 to the middle plot in Fig. 20. The reason for this is the way the configurations are generated, namely with the worm algorithm.
The nature of said algorithm is local, and each step involves adjacent points, giving rise to modifications of the field values that are contiguous on the lattice. This can be formally expressed by interpreting worms as paths on the torus-like space corresponding to the lattice with its periodic boundary conditions. For this task we deal with only closed worm configurations, so the paths are, in fact, loops. With this picture in mind, Eq. (20) represents an (averaged) winding number in the temporal dimension of the torus. Discarding data in this direction does not alter the winding number. On the other hand, if data are discarded along the other dimension, parts of the worm might be discarded as well, leading to a very high discrepancy from the true winding number.
The small kinks in |φ| 2 can be explained by means of some additional remarks: first of all, in the range of f x in our dataset, the ratio W (f x + 2)/W (f x ) is almost linear, so |φ| 2 can be viewed as the average of f x in first approximation. The functions W and f x are given by Eqs. (A9) and (A10), respectively. The link variables l t and l x are not modified by the worm but by a standard Monte Carlo process; hence, their distribution is not biased in any direction, and the average over the truncated lattice has small deviations from the one over the whole lattice. Note that these deviations decrease as the lattice increases in size. Unlike the integer field k t , k x is not coupled to the chemical potential; therefore, it is less likely for worms to wind around the spatial dimension than the temporal. This means that the average of k x is affected by a cut in the time dimension, but deviations from the true value do not occur often and decrease as the lattice size increases. Combining all these remarks finally yields the reason for those two kinks being at the same value of the restricted lattice size for both observables in Fig. 20 and the reason why they are less pronounced for |φ| 2 .