Equivariant Energy Flow Networks for Jet Tagging

Jet tagging techniques that make use of deep learning show great potential for improving physics analyses at colliders. One such method is the Energy Flow Network (EFN), a recently introduced neural network architecture that represents jets as permutation-invariant sets of particle momenta while maintaining infrared and collinear safety. We develop a variant of the Energy Flow Network architecture based on the Deep Sets formalism, incorporating permutation-equivariant layers. We derive conditions under which infrared and collinear safety can be maintained, and study the performance of these networks on the canonical example of W-boson tagging. We find that equivariant Energy Flow Networks perform similarly to Particle Flow Networks, which are superior to standard EFNs. However, equivariant Particle Flow Networks suffer from convergence and overfitting issues. Finally, we study how equivariant networks sculpt the jet mass and provide some initial results on decorrelation using planing.


Introduction
In the past half-decade, there has been a burst of activity surrounding the application of deep learning methods (particularly neural networks) to a variety of problems in collider physics. Much of this research has focused on tagging jets. Jets are collimated sprays of particles produced from energetic partons, and they contribute significantly to the complexity of a collider event. Optimal methods of jet tagging are therefore highly desirable in order to maximally exploit the physics capabilities of the LHC and future colliders. Examples include quark/gluon discrimination [26][27][28][29][30][31][32] and the discrimination of QCD jets from overlapping jets produced by the hadronic decay of a boosted heavy resonance [33][34][35][36][37][38], a topic of much interest at the LHC. Indeed, the "standard candle" for jet tagging studies is the identification of boosted W-bosons from their decay products, a process which we study in this work.
The new ML-based methods leverage the fine resolution of modern detectors through their ability to automatically construct powerful discrimination variables from low-level inputs. By choosing a suitable representation for the inputs, existing deep learning models can be directly applied to the task of classifying jets. For example, multilayer perceptrons (MLPs) can be trained on lists of jets' constituent particle momenta [12], and convolutional neural networks (CNNs) can be trained on jet images [7][8][9][10][11]. These methods have led to substantial improvements over the performance of traditional taggers. However, the jet representations they use have drawbacks. For example, jet images tend to be sparsely populated by particles, making a large number of the input pixels redundant. On the other hand, some methods based on the treatment of jets as lists (such as recurrent neural networks) fail to account for the fact that the particles in a jet have no intrinsic ordering. Accordingly, it is important to continue to thoroughly explore the landscape of machine learning-based taggers, and better understand their performance through analytic calculations [39].
A specific recent example is the Energy Flow Network (EFN), introduced in Ref. [4]. EFNs are based on the Deep Sets framework [40], which treats data as point clouds: sets of points in space. Point clouds do not have a fixed length or notion of ordering, much like particle jets at colliders. The Deep Sets framework demonstrates how to incorporate permutation invariance and equivariance of the data, jet constituents in our case. The permutation-invariant case was extensively studied in Ref. [4], which treats jets as sets of particle four-momenta and shows that the resulting observables can be made infrared and collinear (IRC) safe. More general, non-IRC-safe networks were also studied, and dubbed Particle Flow Networks (PFNs).
Along with permutation-invariant networks, the Deep Sets framework can incorporate permutation equivariance. A permutation-equivariant function is one which commutes with the permutation operation. That is, for input data x, a permutation π and a function f, equivariance requires f(π(x)) = π(f(x)). Clearly this requires the dimensionality of the output of f to be the same as that of the input. For binary classifiers the output is usually one-dimensional. Accordingly, it is straightforward to have an entire network whose output is invariant under permutations of the inputs, while equivariance is implemented on individual internal layers of the network. While equivariance under permutations of input particles is less obviously motivated than invariance, it is still interesting to investigate the performance of such networks. Another recent work that exploits equivariance is Ref. [41], which constructs networks that are equivariant under the action of the Lorentz group. Since the Lorentz group is a Lie group and hence continuous, this necessitates a different approach from the finite permutation groups relevant to this work.
In this paper, we continue the study of Energy Flow Networks by implementing permutation-equivariant network layers into the EFN architecture. We are able to find a variant which maintains infrared and collinear safety of the observables learned by the resulting network. We study the performance of equivariant Energy Flow and Particle Flow networks on boosted W-tagging, including some networks that use information about the jet constituent ID. While the equivariant EFNs slightly outperform the invariant EFNs, we find that this can change under decorrelation of the network output using planing. We also consider some possible future directions.

Energy Flow Networks
In principle, the output of any neural network that takes jets or jet constituents as input can be interpreted as a jet observable. This output could be sensitive to the ordering of the constituents, contrary to the notion that collider events can be viewed as unordered lists of particles. This has motivated the development and application of neural network architectures which are insensitive to the ordering of the jet constituents, where jets are viewed as point clouds: sets of data points in space. The representation of functions on sets, from which neural network architectures may be modeled, has been much studied recently in the ML community [40,[42][43][44][45]. Collider physics studies in this vein include the use of architectures based on Graph Neural Networks such as Ref. [6] and related Message Passing Neural Networks [46][47][48].
We will explore the Energy Flow Network (EFN), an example which is based closely on the Deep Sets model introduced in Ref. [40]. Among deep learning models for jet tagging EFNs have the important feature that they are engineered to enforce infrared and collinear (IRC) safety on the network output. IRC safety is the requirement that a jet observable be invariant under soft and collinear emissions. We will demonstrate IRC safety of equivariant EFNs in Section 3.

Deep Sets framework
Deep Sets [40] provides a framework for constructing neural networks that incorporate permutation invariance of the inputs. An Energy Flow Network is an implementation of this model which acts on jets (considering jets as sets of particles with associated momenta), with the additional requirement of IRC safety.
The Deep Sets framework represents permutation-invariant functions f in the following way. For a function acting on a set X, permutation invariance is ensured by first mapping the individual elements of X to an intermediate latent space using a function Φ. The results Φ(x_i) are then summed in the latent space. Finally, a different map F takes the result of the summation in the latent space to the desired range of f(X). Given that the sum is invariant under permutations of elements in X, the same property is inherited by the total function f. The sum also allows such networks to accept inputs of varying length.
The authors of Ref. [40] prove the following Deep Sets Theorem: a function f operating on a set X having elements from a countable universe is invariant to the permutation of instances in X if and only if it can be decomposed in the form

f(X) = F( Σ_{x∈X} Φ(x) ),   (1)

for suitable transformations Φ and F.
Ref. [4] applies this decomposition to jet observables, which we may consider as functions on the set of constituent particle features (four-momentum, charge, flavour etc.). Such observables can therefore be written in the form

O = F( Σ_{i=1}^{M} Φ(p_i) ),   (2)

where each p_i ∈ R^d is a vector containing d features of a particle in the jet, Φ : R^d → R^ℓ is a per-particle map to an ℓ-dimensional latent space, and F : R^ℓ → R is a continuous mapping. The index i runs over the M particles in the jet. In fact, even before applying the function F, each of the ℓ components of the summed per-particle maps is a jet observable. The final observable O is some combination of these, dictated by F. A number of familiar examples can be written in this way. For example, the jet mass corresponds to the choices Φ(p^μ) = p^μ and F(x) = √(x_μ x^μ).
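The jet-mass example above can be made concrete in a few lines. The sketch below (numpy, with a toy three-particle jet whose momenta are invented for illustration) writes the mass in the Deep Sets form O = F(Σ_i Φ(p_i)), with Φ the identity on four-momenta and F the Minkowski norm, and checks permutation invariance:

```python
import numpy as np

def jet_mass_deepsets(four_momenta):
    """Jet mass in the Deep Sets form O = F(sum_i Phi(p_i)).

    Phi is the identity on four-momenta (E, px, py, pz); F takes the
    Minkowski norm of the summed latent vector.
    """
    latent = np.sum(four_momenta, axis=0)                  # sum of Phi(p_i) = p_i
    E, px, py, pz = latent
    return np.sqrt(max(E**2 - px**2 - py**2 - pz**2, 0.0))  # F(x) = sqrt(x.x)

# Toy jet: three constituents, (E, px, py, pz) per row.
jet = np.array([[50.0, 10.0, 20.0, 40.0],
                [30.0,  5.0, 15.0, 25.0],
                [20.0,  2.0,  8.0, 17.0]])
m1 = jet_mass_deepsets(jet)
m2 = jet_mass_deepsets(jet[::-1])   # reversed constituent order
assert np.isclose(m1, m2)           # permutation invariance of the sum
```

Because the only interaction between constituents is the sum in the latent space, any reordering of the rows leaves the observable unchanged.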
Ref. [4] makes use of the Stone-Weierstrass Approximation Theorem [49] to derive the specialisation of Eq. 2 to the case where O is IRC safe. They arrive at

O = F( Σ_{i=1}^{M} z_i Φ(p̂_i) ),   (3)

where the per-particle mapping Φ : R^2 → R^ℓ now depends only on the angular coordinates of the jet constituent momenta through p̂ ≡ (y, φ), and the sum is weighted by the transverse momentum fractions z_i = p_{T,i} / Σ_j p_{T,j}. Finally, we comment on the dimensionality ℓ of the latent space. While any IRC safe observable can be represented this way, this may require a latent space of high dimensionality. In Ref. [40], the Deep Sets Theorem was only proven for the case where instances of the set X are elements of a countable domain. Ref. [50] argues that extending consideration to uncountable sets is necessary, since continuity on a countable set (such as Q) is a vastly different condition to continuity on the (uncountable) reals. They show that the Deep Sets Theorem can only be extended to finite subsets of uncountable domains, with the condition that the latent dimension should be at least the number of elements in the set. That is, in order to ensure universal function approximation of a network in the Deep Sets framework, one should have ℓ ≥ M. This is consistent with the observation in Ref. [4] that an EFN's performance saturates near ℓ = 64, since the multiplicity distributions of the jet samples peak around 40-50. We will present results below for ℓ = 128 and 256.

Architecture
To realise Eq. 3 as a neural network, Ref. [4] uses two multi-layer perceptrons (MLPs) to approximate the functions Φ and F. A schematic of this construction for an EFN with latent dimension ℓ is presented in Fig. 1. The network on the left represents the per-particle mapping, and the number of nodes in its output is equal to the dimension ℓ of the latent space. Each of these output nodes is called a filter and is a function of the jet constituents' angular coordinates y and φ. Each filter is transformed into a jet observable by performing the p_T-weighted sum O_a = Σ_i z_i Φ_a(y_i, φ_i), where the sum runs over particles in the input jet. This set of observables is then passed as input to the second network, which represents the map F. As is customary for the application of neural networks to classification tasks, the output layer of F has two nodes and uses the softmax activation, which correlates these nodes such that their outputs take values in the interval [0, 1] and sum to unity. The outputs of the network may then be interpreted as probabilities, F_S and F_B, that the input is a W (signal) or QCD (background) jet respectively. However, since the outputs are correlated, the network really only represents a single IRC safe jet observable for which W jets have a value close to 1, and QCD jets, 0 (or vice-versa).
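The correlation of the two output nodes can be seen directly from the softmax definition; a minimal sketch (the pre-activation values here are invented for illustration):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))   # shift by the max for numerical stability
    return e / e.sum()

# Two correlated outputs: probabilities for W (signal) and QCD (background).
F_S, F_B = softmax(np.array([1.3, -0.4]))
assert np.isclose(F_S + F_B, 1.0)   # outputs sum to unity by construction
```

Since F_B = 1 − F_S always, the pair carries the same information as F_S alone, which is why the network defines a single jet observable.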

Augmented Energy Flow Network architecture
In addition to presenting the permutation-invariant neural network architecture on which the EFN is based, Ref. [40] also defines permutation-equivariant neural network layers. Equivariance of a function means that it commutes with permutations, rather than being invariant to them. Here we detail the way in which an equivariant layer can be incorporated into the existing structure of an EFN, and seek an architecture that is able to maintain IRC safety of the network output.

Figure 2: An example implementation of two equivariant layers (blue) into the EFN architecture. The first equivariant layer has ℓ input channels and the second has ℓ′ output channels. A pooling operation is required to merge the vectors E_i into a single vector O containing jet observables.

Equivariant network layers
Permutation equivariance constrains the form of neural network layers. Consider a network layer E_Θ(x) = σ(Θx), having activation σ, weight matrix Θ ∈ R^{M×M}, and acting on a set x with elements x_i ∈ R. A result from Ref. [40] states that such a layer is permutation-equivariant if and only if Θ can be decomposed as

Θ = λ I + γ 1 1^T,   (4)

with λ, γ ∈ R, I = id ∈ R^{M×M} and 1 = (1, ..., 1)^T ∈ R^M. That is, the matrix Θ has all diagonal terms equal and all off-diagonal terms equal. The function E therefore depends only on a linear combination of the input x itself and 1 1^T x, a vector with each component being the sum of the components of x. The equivariance of the layer arises from the fact that this summation respects permutation symmetry. One can also consider functions E whose arguments are not of the linear form Θx. For example, the function

E(x) = σ( λ x + γ maxpool(x) 1 )   (5)

also defines an equivariant network layer. Eqs. 4 and 5 are valid when the elements of the set satisfy x_i ∈ R. However, the x_i could themselves be vectorial, x_i ∈ R^D, so that x ∈ R^{M×D}. In that case the parameters λ and γ can be promoted to matrices Λ, Γ ∈ R^{D×D}, with D being the dimension of each output node. This leads to

E(x) = σ( x Λ + 1 1^T x Γ ),   (6)

E(x) = σ( x Λ + 1 maxpool(x) Γ ),   (7)

where maxpool is an operation that merges the feature vectors in the set x ∈ R^{M×D} to a single summary vector in R^D whose components are the maxima over the M set elements. We will see that EFNs using these layers are IRC unsafe in general. In our studies below we will use layers of the form Eq. 7 as an example of such a network, and further discuss IRC safety below.
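Both layer types above can be sketched and checked numerically. The following (numpy, with randomly drawn Λ, Γ and a ReLU activation, all purely illustrative) implements Eqs. 6 and 7 and verifies the defining property E(π(x)) = π(E(x)):

```python
import numpy as np

rng = np.random.default_rng(0)
M, D = 5, 3                     # set size and per-element feature dimension
Lam = rng.normal(size=(D, D))   # Lambda
Gam = rng.normal(size=(D, D))   # Gamma

def equiv_sum(x):
    # Eq. (6)-style layer: E(x) = sigma(x Lam + 1 1^T x Gam), sigma = ReLU.
    return np.maximum(x @ Lam + np.ones((M, 1)) * (x.sum(axis=0) @ Gam), 0.0)

def equiv_max(x):
    # Eq. (7)-style layer: E(x) = sigma(x Lam + 1 maxpool(x) Gam).
    return np.maximum(x @ Lam + np.ones((M, 1)) * (x.max(axis=0) @ Gam), 0.0)

x = rng.normal(size=(M, D))
perm = rng.permutation(M)
for layer in (equiv_sum, equiv_max):
    # Permutation equivariance: layer(pi(x)) == pi(layer(x)).
    assert np.allclose(layer(x[perm]), layer(x)[perm])
```

The equivariance follows because the sum and max over set elements are themselves permutation invariant, so permuting the rows of x only permutes the per-element terms x Λ.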

Layer implementation
As well as introducing the equivariant layer itself, Ref. [40] also discusses how such layers can be implemented in a specific model. Two properties of the equivariant layer will be exploited. The first is that the composition of equivariant layers is also equivariant. This can be seen by considering equivariant functions f, g acting on a set x with n elements, and a permutation π ∈ S_n. Then

(f ∘ g)(π(x)) = f(g(π(x))) = f(π(g(x))) = π(f(g(x))) = π((f ∘ g)(x)),

so the composition of f and g is equivariant. This implies that equivariant layers can be stacked to build deep models. The second property is that following an equivariant layer with any commutative pooling operation (such as sum, mean or max) across set elements yields a permutation-invariant function. This allows us to replace the sum over per-particle maps (the permutation-invariant function in the standard EFN architecture) with a sequence of equivariant layers followed by a pooling operation. The network is then able to learn its own particular permutation-invariant function by which to combine the per-particle information.
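The two properties combine as follows: stacking equivariant layers and ending with a commutative pool gives an invariant function. A minimal numpy sketch (random weights, ReLU, max pooling, all illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
M, D = 6, 4
mats = [rng.normal(size=(D, D)) for _ in range(4)]  # Lambda/Gamma for two layers

def equiv_layer(x, Lam, Gam):
    # Eq. (6)-style equivariant layer, acting along the set (first) axis.
    return np.maximum(x @ Lam + np.ones((len(x), 1)) * (x.sum(axis=0) @ Gam), 0.0)

def invariant_net(x):
    # Two stacked equivariant layers followed by a commutative pooling:
    # the result is invariant under permutations of the set elements.
    h = equiv_layer(x, mats[0], mats[1])
    h = equiv_layer(h, mats[2], mats[3])
    return h.max(axis=0)

x = rng.normal(size=(M, D))
perm = rng.permutation(M)
assert np.allclose(invariant_net(x), invariant_net(x[perm]))
```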
A schematic of the architecture is presented in Fig. 2. Just as with the standard permutation-invariant EFN, on the left we have a dense network representing a per-particle map Φ : R^2 → R^ℓ that acts on the y-φ plane. In the centre of the diagram, the vector representation of each particle under this map is passed as input to a number of equivariant layers, shown with blue shaded nodes, the first of which has ℓ input channels. The number of input (output) channels in an equivariant layer is defined as the dimensionality of each input (output) node. Since we use fixed network layers to replace the permutation-invariant function, our architecture is not able to accept variable-length sets as the EFN does. This is addressed by fixing the layer size, M, to be suitably large and zero-padding the inputs.
The final equivariant layer, having ℓ′ output channels, is followed by a pooling operation that projects the M × ℓ′-dimensional output of the layer to a vector O of length ℓ′. There is freedom to choose this operation; for example, some of the models described in Sec. 4.2 use max-pooling to construct the components of the vector according to

O_a = max_i (E_i)_a.   (8)

IRC safety
The networks generated from the method above are distinct from the IRC safe EFN studied in Ref. [4], which corresponds to taking the pooling operation as a z-weighted sum and the function E as the identity. As such, IRC safety of the jet observables from the new networks is not guaranteed. In this section, we aim to find an architecture that maintains IRC safety. This allows for a closer comparison with the EFNs of Ref. [4].
There are a number of potential architectures of the type described in the previous section. We will consider an equivariant energy flow network, denoted EV-EFN, which for simplicity we assume has just a single equivariant layer E : R^{M×ℓ} → R^{M×ℓ}. We will consider the implications of multiple equivariant layers later in the section. We may express the action of the network as

O = F( P( E( Φ(p) ) ) ),   (9)

where P : R^{M×ℓ} → R^ℓ is the pooling operation, F : R^ℓ → R is the final function, and Φ(p) ∈ R^{M×ℓ} denotes the set of all per-particle latent vectors {Φ(p_1), ..., Φ(p_M)}. To specify a network, we must make choices for the pooling operation P and the form of the equivariant layer (for instance Eqs. 6 or 7 above). The maps F and Φ are given by fully-connected MLPs. We can write the vector of jet observables O by dropping F from Eq. 9. This gives

O = P( E( Φ(p) ) ).   (10)

Since we know that taking P as a z-weighted sum, P(x) = Σ_i z_i x_i, is an IRC safe choice in an EFN, we will only consider this operation in this section. This may not be the unique choice for ensuring IRC safety, but we were able to rule out some other possibilities. For example, a max function is not necessarily invariant to the collinear splitting of a particle.
In an EFN with equivariant layers, this choice of P alone is not enough to ensure that the observables vector O is IRC safe. The reason for this is that in the new architecture, the output nodes of the equivariant layer are not strictly per-particle maps. That is, the vector E i has dependence on the momentum of particles other than i. However, the fact that this mixing comes through an equivariant layer imposes helpful constraints.
Specifically, permutation equivariance demands that the i-th output of the layer depends on inputs other than i only through permutation-invariant functions of the inputs (in linear combination). For example, in Eq. 4 the first term of (Θx)_i depends only on x_i, and the second term is proportional to the permutation-invariant sum Σ_i x_i. Thus, despite the outputs E_i of the equivariant layer not being per-particle maps, they do have a unique dependence on the i-th particle.
In addition, the permutation-invariant functions that give the dependence on all particles can be identified as jet observables. We show below that the IRC safety of the z-weighted sum of such an output, which depends on the linear combination of single-particle information and a jet observable, is tied to the IRC safety of the jet observables in the equivariant layer. We therefore parameterise an equivariant layer as

E_i = σ( Φ(p̂_i) Λ + Q Γ ),   (11)

where Q ∈ R^ℓ holds jet observables that depend on the set of per-particle latent vectors, the matrices Λ and Γ are defined as in Eq. 6, and σ is the activation function.
Taking P as a z-weighted sum and the discussion above into account, Eq. 10 for a network with a single equivariant layer can be written out in more detail as

O = Σ_{i=1}^{M} z_i σ( Φ(p̂_i) Λ + Q Γ ).   (12)

We wish to show that the observable O is IRC safe if every component of Q is IRC safe. To study infrared safety, we add an arbitrarily soft particle, labeled by index s, that can have momentum in any direction:

O_IR = lim_{z_s → 0} [ z_s σ( Φ(p̂_s) Λ + Q_IR Γ ) + Σ_{i=1}^{M} z_i σ( Φ(p̂_i) Λ + Q_IR Γ ) ],   (13)

where Q_IR = Q(Φ(p̂_s), Φ(p̂_1), ..., Φ(p̂_M)). For collinear safety, a particle splits with momentum fractions λz and (1 − λ)z, with λ ∈ [0, 1]. Without loss of generality we take this to be the first particle. Under a collinear splitting the observable becomes

O_C = λ z_1 σ( Φ(p̂_1) Λ + Q_C Γ ) + (1 − λ) z_1 σ( Φ(p̂_1) Λ + Q_C Γ ) + Σ_{i=2}^{M} z_i σ( Φ(p̂_i) Λ + Q_C Γ ),   (14)

where Q_C = Q(Φ(p̂_1), Φ(p̂_1), Φ(p̂_2), ..., Φ(p̂_M)) is evaluated on the split configuration. If Q_IR = Q_C = Q, then Eq. 12 is recovered from both Eq. 13 and Eq. 14. Thus, O is IRC safe if Q is IRC safe. One obvious choice for the components Q_a is to again use the z-weighted sum over per-particle maps, Q_a = Σ_i z_i Φ_a(p̂_i). However, as it stands, the equivariant layer only takes directional information p̂ as input, through the filters Φ. Any IRC safe observable must have dependence on the p_T of the jet constituents, and so in order to proceed we need a way of incorporating the p_T fractions, z_i, as weights associated with each input into the equivariant layer.
To achieve this, we consider a simple extension of the layer E(x) = σ(x Λ + 1 1^T x Γ) from Eq. 6. This operation depends purely on a linear combination of the input x itself and a vector where each component is the sum of elements in x. If we allow the layer to accept an additional input representing weights for each element of the set x, then the sum over elements can be replaced by a weighted sum. Setting the weights to be the p_T fractions, z, yields an IRC safe equivariant layer

E(x, z) = σ( x Λ + 1 z^T x Γ ),   (15)

with Q_a = Σ_i z_i Φ_a(p̂_i).
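The IRC safety argument above can be tested numerically. The sketch below (numpy, with random Λ, Γ and toy per-particle latent vectors standing in for the Φ(p̂_i); all values are illustrative) implements the z-weighted layer of Eq. 15, pools with a z-weighted sum, and checks that the resulting observables are unchanged by a collinear splitting and by an arbitrarily soft emission:

```python
import numpy as np

rng = np.random.default_rng(2)
ell = 4                             # latent dimension
Lam = rng.normal(size=(ell, ell))
Gam = rng.normal(size=(ell, ell))

def ev_layer(phi, z):
    # Eq. (15): E_i = sigma(Phi(p_i) Lam + (sum_j z_j Phi(p_j)) Gam).
    Q = z @ phi                     # z-weighted sum over particles, in R^ell
    return np.maximum(phi @ Lam + np.outer(np.ones(len(phi)), Q @ Gam), 0.0)

def observables(phi, z):
    # z-weighted pooling of the equivariant layer output (Eq. 12).
    return z @ ev_layer(phi, z)

# Toy jet: per-particle latent vectors Phi(p_i) and pT fractions z_i.
phi = rng.normal(size=(3, ell))
z = np.array([0.5, 0.3, 0.2])
O = observables(phi, z)

# Collinear safety: split particle 0 into fractions (0.7, 0.3) of z_0,
# both daughters sharing the same angular coordinates (same Phi row).
phi_c = np.vstack([phi[0], phi[0], phi[1:]])
z_c = np.array([0.35, 0.15, 0.3, 0.2])
assert np.allclose(O, observables(phi_c, z_c))

# Infrared safety: an extra particle with z -> 0 does not change O.
phi_s = np.vstack([phi, rng.normal(size=ell)])
z_s = np.append(z, 0.0)
assert np.allclose(O, observables(phi_s, z_s))
```

Both checks pass because every sum over particles, in Q and in the pooling, carries a factor of z_i, exactly the structure exploited in Eqs. 13 and 14.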
The IRC safe jet observables are then obtained by composing with the z-weighted sum. In full, they read

O_a = Σ_{i=1}^{M} z_i σ( Σ_{b=1}^{ℓ} Φ_b(p̂_i) Λ_ba + Σ_{b=1}^{ℓ} Σ_{j=1}^{M} z_j Φ_b(p̂_j) Γ_ba ).   (16)

Interestingly, one finds that if the activation function, σ, is not applied, then the network is equivalent to a standard EFN:

O_a = Σ_{i=1}^{M} z_i Σ_{b=1}^{ℓ} Φ_b(p̂_i) ( Λ_ba + Γ_ba ) = Σ_{i=1}^{M} z_i Ψ_a(p̂_i),   (17)

where the first step follows from the fact that the p_T fractions sum to unity, and we have redefined the per-particle map Ψ_a(p̂_i) ≡ Σ_{b=1}^{ℓ} Φ_b(p̂_i) ( Λ_ba + Γ_ba ). This is a consequence of the fact that IRC safety imposes a linear structure on the transformations prior to σ. Removing the activation allows the action of the equivariant layer to be absorbed into the definition of the per-particle map.

Next, we must check that IRC safety is maintained in a network consisting of more than one such equivariant layer. Composing two layers leads to observables

O_a = Σ_{i=1}^{M} z_i σ( Σ_{b=1}^{ℓ} E′_{ib} Λ_ba + Σ_{b=1}^{ℓ} Σ_{j=1}^{M} z_j E′_{jb} Γ_ba ),   (18)

where E′ denotes the output of the first equivariant layer, which has matrix parameters Λ′ and Γ′. Thus the vector O is IRC safe. To see this, notice that the summand of each sum over particles is weighted by z, and that these weights constitute all of the z dependence in the expression. The same treatment can be applied for additional stacked layers. Although it would be impractical to present the equations here, one finds that the sums over particles are always z-weighted, leading to IRC safety. This is because adding an additional equivariant layer corresponds to replacing each per-particle latent vector Φ(p̂_i) by the corresponding layer output E_i. Since E_i contains only IRC safe observables and Φ(p̂_i), this replacement leaves the same z-weighted structure of the latent vectors.
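The collapse to a standard EFN when σ is removed can be verified directly. A short numpy sketch (random Λ, Γ and toy latent vectors, purely illustrative) compares the linear equivariant network with the z-weighted sum over the redefined map Ψ = Φ(Λ + Γ):

```python
import numpy as np

rng = np.random.default_rng(3)
ell = 5
Lam = rng.normal(size=(ell, ell))
Gam = rng.normal(size=(ell, ell))

phi = rng.normal(size=(4, ell))      # per-particle latent vectors Phi(p_i)
z = np.array([0.4, 0.3, 0.2, 0.1])   # pT fractions, summing to unity

# Linear (no activation) IRC safe equivariant layer, then z-weighted pooling.
E = phi @ Lam + np.outer(np.ones(4), (z @ phi) @ Gam)
O_equivariant = z @ E

# Standard EFN with the redefined per-particle map Psi = Phi (Lam + Gam).
O_efn = z @ (phi @ (Lam + Gam))

assert np.allclose(O_equivariant, O_efn)   # identical observables
```

The equality relies on Σ_i z_i = 1, which converts the global term (z Φ) Γ into a per-particle contribution, exactly as in the derivation above.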

Training and performance
Here we evaluate the performance of equivariant EFNs in discriminating jets from boosted W bosons against the QCD background. We will consider both IRC safe and unsafe equivariant networks. The IRC unsafe networks will be contrasted with the performance of Particle Flow Networks (PFNs) [4]. PFNs are similar to EFNs, but based around the general observable decomposition of Eq. 2 and are thus IRC unsafe. We begin by detailing the data generation and preparation steps before outlining the construction of each model.

Data generation and pre-processing
The W and QCD jets used for this work are obtained respectively from pp → Z + W and pp → Z + jet events generated and hadronised with PYTHIA 8.2 [51] at centre of mass energy √ s = 13 TeV. For the W events, we use the WeakDoubleBoson:ffbar2ZW channel with the W decaying to quarks. For the QCD events we use WeakBosonAndParton:qg2gmZq and qqbar2gmZg with the photon contribution switched off. The Feynman diagrams for the tree-level contributions to these processes are shown in Fig. 3. In both cases, the Z is required to decay invisibly to neutrinos. Multiple parton interactions as well as initial and final state radiation are left on.
To study network performance on jets of different transverse momenta we produce two datasets: the first containing jets with p T ∈ [250, 300] GeV and the second with p T ∈ [500, 550] GeV. The datasets are produced as follows: after each event is generated, particles with p T less than 500 MeV are removed with the intent of mitigating the influence of contamination from the underlying event. The event is then clustered by the anti-k t algorithm [52] using FastJet [53] with a jet radius of R = 1.0 for the p T ∈ [250, 300] GeV dataset and R = 0.6 for the p T ∈ [500, 550] GeV set.
If the hardest jet in the event has mass m ∈ [65, 105] GeV, rapidity |y| < 2.5, and transverse momentum within the relevant range, the set of its constituent particle feature vectors is added to the dataset, with each vector in the format: (p T , y, φ, m, pdg id). The field pdg id follows the Particle Data Group Monte Carlo particle numbering scheme [54]. We centre the selection window slightly beyond the mass of the W since the final jet will contain some additional radiation from the underlying event, translating the distribution. The final datasets contain 1 million W and QCD jets in equal proportion and are split into three subsets for training (75%), validation (15%), and testing (10%).
We employ four pre-processing steps intended to isolate the differences between the W and QCD jets. Each jet in the dataset is first re-clustered using the k_t algorithm [55] with a jet radius of R = 0.2 or R = 0.3 for the high- or low-p_T datasets respectively. The resulting subjets are ordered by p_T and the following steps applied:

1. Normalise the p_T of all particles in the jet such that their sum is unity.

2. Translate all particles in the y-φ plane such that the leading subjet lies at the origin.

3. Rotate all particles in the jet such that the secondary subjet lies directly below the origin.

4. Reflect all particles about the axis φ = 0 such that the third subjet has φ > 0. If there is no third subjet, reflect such that the p_T-weighted sum of φ over all particles in the jet is positive.
The normalisation step serves to partially reduce the dependence on the p T of the jet. Although the radial scale of the event still varies with the transverse momentum, we do not remove this dependence (through zooming [11] or a similar procedure) since the angular scale of the radiation pattern is correlated with the jet type whereas the jet p T itself is not. The remaining steps remove redundancies arising from spatial symmetries, namely the location and orientation of the event in the detector.
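The four steps above can be sketched as follows. This is a simplified illustration in numpy: the subjet positions are assumed to be supplied already (in the full pipeline they come from the k_t re-clustering), and the toy jet used to exercise it is invented:

```python
import numpy as np

def preprocess(pt, y, phi, subjets):
    """Simplified version of the four pre-processing steps.

    `subjets` is a list of (y, phi) positions of the pT-ordered subjets.
    """
    # 1. Normalise pT so the fractions sum to unity.
    z = pt / pt.sum()
    # 2. Translate so the leading subjet sits at the origin.
    y = y - subjets[0][0]
    phi = phi - subjets[0][1]
    sub = [(sy - subjets[0][0], sp - subjets[0][1]) for sy, sp in subjets]
    # 3. Rotate so the secondary subjet lies directly below the origin.
    if len(sub) > 1:
        angle = np.arctan2(sub[1][1], sub[1][0]) + np.pi / 2
        c, s = np.cos(angle), np.sin(angle)
        y, phi = c * y + s * phi, -s * y + c * phi
        sub = [(c * sy + s * sp, -s * sy + c * sp) for sy, sp in sub]
    # 4. Reflect about phi = 0 so the third subjet (or, if absent,
    #    the pT-weighted phi sum) is positive.
    ref = sub[2][1] if len(sub) > 2 else np.sum(z * phi)
    if ref < 0:
        phi = -phi
    return z, y, phi

# Toy jet with three particles sitting exactly at the three subjet positions.
z, y2, phi2 = preprocess(np.array([3.0, 2.0, 1.0]),
                         np.array([0.5, 0.8, 0.2]),
                         np.array([1.0, 1.5, 0.6]),
                         [(0.5, 1.0), (0.8, 1.5), (0.2, 0.6)])
assert np.isclose(z.sum(), 1.0)                     # step 1
assert abs(y2[0]) < 1e-6 and abs(phi2[0]) < 1e-6    # step 2: leading at origin
assert abs(y2[1]) < 1e-6 and phi2[1] < 0            # step 3: secondary below
```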

Model details and performance
To evaluate the performance of the equivariant architecture, we employ six neural network models, which we call EFN, PFN(-ID), EV-EFN and EV-PFN(-ID). The details of each model are given below. All nodes in the networks (except for those in an output layer) use the ReLU activation [56], and all weights and biases are initialised by He-uniform parameter initialisation [57]. The Adam optimisation algorithm [58] is used to minimise the binary cross-entropy loss function, with data split into batches of 250. Equivariant layers are fixed to a size of M = 140 (larger than the number of particles in any of the jets in the datasets). The EFN, PFN and PFN-ID models are implemented using the EnergyFlow package for Python [4]. The remaining networks, which involve equivariant layers, were trained through Keras [59] with a TensorFlow [60] backend. We decide on the number of epochs over which to train a model by monitoring the accuracy and loss on the validation set during training, stopping when the validation results begin to diverge from those on the training set. The numbers and sizes of layers in the EV models are chosen such that the networks contain the same number of parameters as their unaugmented partners (to within one percent). The networks are defined as follows.

Figure 4: Schematic of the network representing the per-particle map, Φ, in the IRC-unsafe architectures. In the ID variants, the input layer is also passed particle identification information. The standard PFN-ID model represents these with small floats, while the EV-PFN-ID model uses 14-dimensional basis vectors.
EFN: An ℓ = 256 energy flow network where the network representing the per-particle map has two hidden layers of 100 nodes, and the network representing F has three, also with 100 nodes each. To avoid overfitting, the model is trained for 30 epochs.
PFN and PFN-ID: These two networks have the same general construction as the EFN above; however, the constituent particles' p_T fractions are given as input to the per-particle map instead of weighting the sum over filters. These networks represent the observable decomposition in Eq. 2, which is not IRC safe in general. In the ID variant, the input feature vector contains both momentum and particle identification. The identities are represented by small floats beginning at 0 for an absent particle, then increasing by 0.05 for each category of constituent: γ, p, p̄, n, n̄, π+, π−, K+, K−, K0_L, e+, e−, μ+, μ−. A diagram showing the altered per-particle map for both PFNs is shown in Fig. 4. The models are trained for 25 epochs.
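The two identification encodings, small floats for the PFN-ID versus basis vectors for the EV-PFN-ID, can be sketched as follows (the category labels are short strings chosen here for illustration; only the 14 categories and the 0.05 spacing come from the text):

```python
import numpy as np

# The 14 constituent categories listed in the text, in order.
CATEGORIES = ["gamma", "p", "pbar", "n", "nbar", "pi+", "pi-",
              "K+", "K-", "K0L", "e+", "e-", "mu+", "mu-"]

def id_as_float(category):
    # PFN-ID scheme: small floats, with 0 reserved for an absent
    # (zero-padded) particle and steps of 0.05 per category.
    return 0.05 * (CATEGORIES.index(category) + 1)

def id_as_onehot(category):
    # EV-PFN-ID scheme: 14-dimensional basis vectors, which avoid
    # imposing a spurious closeness between neighbouring labels.
    return np.eye(len(CATEGORIES))[CATEGORIES.index(category)]

assert id_as_float("gamma") == 0.05
assert id_as_onehot("pi+").sum() == 1.0
```

The float scheme is compact but places, say, π+ and π− numerically adjacent, whereas the one-hot scheme makes all categories equidistant.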
EV-EFN: This is an ℓ = 128 EFN containing two IRC safe equivariant layers as defined in Eq. 15.
We use a smaller latent dimension of ℓ = 128 and reduce the depths of the MLPs representing Φ and F such that the total number of parameters in the model is comparable to that of the ℓ = 256 EFN.

Table 1: Performance metrics obtained by the models listed in the text on both p_T datasets. The background rejection metric R is defined as 1/ε_QCD at 50% signal efficiency. For both the AUC score and R, a greater value corresponds to better classification. The reported uncertainties are interquartile ranges of values from 10 separately trained models.

EV-PFN and EV-PFN-ID: These networks contain equivariant layers of the type in Eq. 7, which are not IRC safe. The per-particle map for this architecture is the same as for the PFN(-ID), shown in Fig. 4, except that in the ID variant we use 14-dimensional basis vectors to represent each category of constituent particle. This choice led to more stable training and avoids the network associating closely-labelled particles with one another. To produce the vector of jet observables, we use the max-pooling operation. Models using summation to pool the outputs of the equivariant layer were also trained; however, these networks frequently failed to learn any information, having AUC scores of 0.5 after training. In the instances where training did converge, they achieved similar results to the models using max-pooling. Due to strong overfitting, the models are trained for 20 epochs.
The AUC scores and background rejections, R ≡ 1/ε_QCD, for the models trained on the low- and high-p_T datasets are presented in Tab. 1. Comparing the IRC safe models, we see that the EV-EFN achieves slightly better classification than the EFN across both measures in both p_T ranges. We also show the ROC curves averaged over 10 networks for these models (as well as the PFN) in Fig. 5. The curves here again reflect the classification advantage that the EV-EFN achieves over the EFN. This suggests that the addition of equivariant layers aids the network's ability to learn from IRC safe information. In fact, the improvement obtained by the EV-EFN is greater than that seen from the PFN model, which is not IRC safe and uses a latent space of twice the dimensionality. However, the results in Tab. 1 show that the EV-PFN does not discriminate better than the PFN model. In fact, on both datasets it performs worse than even the IRC safe EFN across both metrics. The result for the EV-PFN-ID model is not as poor: although it does not match the classification power of the PFN-ID networks, it does at least outperform the PFN.
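For reference, both metrics in Tab. 1 can be computed directly from the network scores. A self-contained numpy sketch, using toy Gaussian score distributions in place of real network outputs:

```python
import numpy as np

def auc_and_rejection(scores_sig, scores_bkg, eff_sig=0.5):
    """AUC and background rejection R = 1/eps_QCD at fixed signal efficiency."""
    # Score threshold giving the requested signal efficiency.
    thresh = np.quantile(scores_sig, 1.0 - eff_sig)
    eps_bkg = np.mean(scores_bkg > thresh)
    R = 1.0 / eps_bkg if eps_bkg > 0 else np.inf
    # AUC = P(score_sig > score_bkg), estimated over all pairs.
    auc = np.mean(scores_sig[:, None] > scores_bkg[None, :])
    return auc, R

rng = np.random.default_rng(4)
sig = rng.normal(1.0, 1.0, 2000)    # toy network outputs for W jets
bkg = rng.normal(-1.0, 1.0, 2000)   # toy outputs for QCD jets
auc, R = auc_and_rejection(sig, bkg)
assert 0.9 < auc < 1.0 and R > 1.0
```

The pairwise AUC estimator is the Mann-Whitney form of the ROC area; in practice one would use a library routine, but this shows what the reported numbers measure.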
A major issue that the EV-PFN(-ID) networks face is overfitting. To illustrate this, in Fig. 6 we plot the classification accuracy of the PFN and EV-PFN models on the training and validation datasets over 50 epochs. The training accuracy of the EV-PFN diverges from its validation accuracy after just 15 epochs, whereas this behaviour is not seen to the same degree for the PFN, despite both models containing the same number of parameters. The plots reveal that the EV-PFN does indeed classify jets from the training set more accurately than the PFN; however, its poor generalisation means that this performance does not carry over to test results. The strong overfitting could possibly be addressed by regularisation techniques. One such option is dropout, a method wherein nodes in a layer are randomly switched off during training [61]. We investigate applying dropout to layers in the network for F, but find that for a variety of dropout rates only the performance on the training set is affected. So, while the overfitting is technically reduced, no increase in generalisation is achieved. The same result is observed for the ID variants. In the interest of avoiding potentially unfruitful trial and error, we leave regularisation of the EV-PFN(-ID) models to future work.
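For reference, the standard inverted-dropout operation can be sketched as follows; this is a minimal illustration of the technique, not the exact implementation used in our training code:

```python
import numpy as np

def dropout(h, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of activations during
    training and rescale the survivors so that the expected activation
    is unchanged, leaving nothing to rescale at test time."""
    if not training or rate == 0.0:
        return h
    mask = rng.random(h.shape) >= rate      # keep with probability 1 - rate
    return h * mask / (1.0 - rate)

rng = np.random.default_rng(0)
h = np.ones(1000)
out = dropout(h, 0.5, rng)                  # mean stays close to 1.0
```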

Dependence on jet mass
In this section, we compare the effect of the EV-EFN and EFN on the QCD jet mass distribution, following closely the approach taken in Ref. [62]. An unrestricted neural network will make use of all available information, jet mass included. Consequently, the output of such networks will be correlated with the jet mass, leading to sculpting of the jet mass distribution. This is not a desirable feature: background modelling is often done by extrapolating from a region containing no signal, relying on an approximately flat background, and a background distribution that is smooth and featureless carries less systematic uncertainty. There is thus a trade-off between the raw discrimination power a neural network offers and the degree to which it reshapes the background jet mass distribution. For this reason, jet observables that are strongly correlated with the jet mass are less valuable for experimental use than those that are uncorrelated. This problem has been studied extensively from both analytic [63,64] and machine learning perspectives [25,62,[65][66][67][68][69][70][71]].
Here we compare results first for the standard training procedure described above, and then including the application of a method for decorrelating the network output from the jet mass. To compare the two models, we train 10 EFNs and 10 EV-EFNs on the p_T ∈ [500, 550] GeV dataset, split into training (75%) and testing (25%) subsets. By increasing the size of the testing set, we reduce the impact of statistical fluctuations on the results. This is important since, at low signal efficiencies, very few QCD jets are tagged. In Fig. 7 we visualise the sculpting effect of the EFN and EV-EFN models on the training set by showing the jet mass distributions of QCD jets that are tagged as signal by successively tighter cuts on the network output. Both models strongly sculpt the QCD distribution towards the W mass peak. To quantify the degree of sculpting we use the Hellinger distance, defined as

H(p, q) = \frac{1}{\sqrt{2}} \sqrt{ \sum_i \left( \sqrt{p_i} - \sqrt{q_i} \right)^2 },

where p and q are discrete probability distributions. Hellinger distances are normalised between 0 and 1, with a distance of 0 corresponding to p and q being identical histograms. For each trained model, the Hellinger distances between the jet mass distributions of tagged QCD jets and the original distribution are calculated (using histograms of 25 bins) for a range of cuts on the network output. We then average the distances over network instances. A plot of these average distances as a function of the background rejection for the two models is shown on the left of Fig. 8. In these terms, a better classifier corresponds to a curve closer to the bottom-right corner, reaching greater background rejection at smaller Hellinger distance. The general behaviour of the two models is very similar; however, the EV-EFNs achieve smaller Hellinger distances across all cuts on the network output. The greatest difference is seen at a background rejection of 1/ε_QCD ∼ 90 (corresponding to a signal efficiency of ε_W = 0.5), where they improve upon the EFNs by approximately 5%.
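The Hellinger distance between two binned jet mass distributions can be computed directly from the normalised histogram contents; a minimal sketch:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete (histogram) distributions.

    Inputs may be raw bin counts; they are normalised to probabilities.
    Returns a value in [0, 1]: 0 for identical histograms, 1 for
    histograms with disjoint support.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
```

For example, identical bin contents give a distance of 0, while completely disjoint histograms give a distance of 1.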
While the EV-EFNs have a smaller effect on the background distribution than the EFNs, both models sculpt to such a degree that this improvement can be considered somewhat inconsequential. Having said this, a more meaningful result can be obtained by investigating the networks' sculpting after decorrelating the network output from the jet mass. Ref. [62] found that for neural network taggers, techniques that achieve this through adjustments to the input data provide decorrelation almost on par with those that alter the training procedure, while having significantly lower computational cost. For this reason, we compare the response of EFNs and EV-EFNs to mass-planing, a particular case of data planing [8,65]. A dataset can be planed in some observable, ρ, by assigning each instance x_i of the training data a weight w_i according to

w_i \propto \left( \frac{d\sigma_x}{d\rho} \bigg|_{\rho = \rho_i} \right)^{-1},

where dσ_x/dρ is the differential cross-section for jets in x. In practice, this simply corresponds to inverting the histogram for ρ at the bin in which x_i lies. After such a weighting, the data have a uniform distribution in the planed observable.
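In histogram form, the planing weights are obtained by inverting the bin contents, as described above. A minimal sketch, where the toy mass sample and bin count are illustrative:

```python
import numpy as np

def planing_weights(masses, bins=100):
    """Per-jet weights that flatten the mass distribution (mass-planing).

    Each jet is weighted by the inverse of the content of its mass
    histogram bin, so the reweighted sample is uniform in the planed
    observable. Weights are normalised to unit mean.
    """
    counts, edges = np.histogram(masses, bins=bins)
    # Map each mass to its bin index (clip handles the right edge)
    idx = np.clip(np.digitize(masses, edges) - 1, 0, bins - 1)
    w = 1.0 / counts[idx]
    return w / w.mean()

rng = np.random.default_rng(1)
m = rng.normal(80.0, 10.0, size=10000)      # toy "mass" sample
w = planing_weights(m, bins=25)
# After weighting, all non-empty bins of the weighted histogram are equal
h, _ = np.histogram(m, bins=25, weights=w)
```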
To implement mass-planing in the networks' training procedure, we calculate weights from the W and QCD mass distributions histogrammed into 100 bins. The weights give the relative importance of each jet's contribution to the loss function during training. The Hellinger distances are then calculated and averaged in the same fashion as in the un-planed case. On the right side of Fig. 8, we again plot the calculated distances against background rejection for the newly trained networks. Interestingly, the advantage that the EV-EFN displayed without planing is no longer apparent; in fact, the standard EFN exhibits less background sculpting across most cuts on the network output. One should therefore be cautious about interpreting the absolute performance of given taggers without considering decorrelation, as this may change their ranking. However, the EV-EFN still achieves greater background rejection at fixed signal efficiency, so in a bump-hunt scenario neither model has a clear advantage in terms of achieved significance.
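Using the planing weights in training amounts to a per-jet weighted loss. Sketched here for binary cross-entropy; the function name and interface are illustrative:

```python
import numpy as np

def weighted_bce(y_true, y_pred, weights):
    """Binary cross-entropy averaged with per-jet planing weights, so
    each jet contributes to the loss according to its reweighted
    importance rather than equally."""
    eps = 1e-7
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    loss = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return np.average(loss, weights=weights)

# Toy example: four jets with labels and network outputs
y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_pred = np.array([0.9, 0.1, 0.6, 0.4])
uniform = weighted_bce(y_true, y_pred, np.ones(4))
```

Note that rescaling all weights by a common factor leaves the loss unchanged, so only the relative weights matter.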

Conclusion
The development of new and increasingly accurate jet taggers remains a high priority for the collider physics community. Techniques based on machine-learning methods have provided a number of advances in this area in the past few years. Recent research has explored incorporating permutation invariance of the constituent particle momenta into the network architecture via the Deep Sets formalism, leading to Energy Flow Networks (EFNs). However, Deep Sets also allows for permutation equivariance, which is what we have explored in this paper. We have successfully implemented permutation-equivariant neural network layers into the Energy Flow Network architecture. Furthermore, we identified a specific construction, the EV-EFN, that maintains the infrared and collinear safety of the learned jet observable. When trained to discriminate W from QCD jets in simulated samples in two boosted p_T windows, these EV-EFNs exhibited a marginal improvement in performance over permutation-invariant EFNs, and outperformed other networks such as CNNs and DNNs. The performance of the EV-EFN was equal to that achieved by a Particle Flow Network, which is not restricted by IRC safety. Due to strong overfitting, the same improvement was not observed when extending PFNs with equivariant layers, although this could possibly be remedied by appropriate regularisation.
Finally, we compared the action of the networks on the QCD jet mass distribution before and after applying mass planing. The planing decorrelates the tagger performance from the jet mass, at some cost in performance. However, it also changes the relative performance of the EFN and EV-EFN networks: while the EV-EFN initially had slightly less impact on the background distribution than the EFN, this advantage was lost after mass planing the input jets. Consequently, it would be interesting to apply recent ideas such as Distance Correlation [69], adversarial networks [67], or Mass Unspecific [25] (or Agnostic [62]) Supervised Tagging to see whether networks with equivariant layers continue to outperform invariant ones. A number of future directions present themselves. Future work could investigate other types of equivariant layers and their impact on IRC safety, including layers equivariant to continuous symmetries of the Lorentz group, as in Ref. [41] and similar to Ref. [24]. It may also be interesting to translate the EV-EFN into a low-dimensional human-interpretable space using the method outlined in Ref. [72]. This could illuminate the effect that equivariant layers have on the representation of IRC safe jet observables compared to the original EFN.