Comparing Point Cloud Strategies for Collider Event Classification

In this paper, we compare several event classification architectures defined on the point cloud representation of collider events. These approaches, which are based on the frameworks of deep sets and edge convolutions, circumvent many of the difficulties associated with traditional feature engineering. To benchmark our architectures against more traditional event classification strategies, we perform a case study involving Higgs boson decays to tau leptons. We find a 2.5 times increase in background rejection at fixed signal efficiency compared to a baseline ATLAS analysis with engineered features. Our point cloud architectures can be viewed as simplified versions of graph neural networks, where each particle in the event corresponds to a graph node. In our case study, we find the best balance of performance and computational cost for simple pairwise architectures, which are based on learned edge features.

When analyzing data collected at the Large Hadron Collider (LHC), the ability to distinguish between specific production and decay channels is vital for picking out signal events among overwhelming backgrounds. In the context of Higgs boson studies, the ATLAS and CMS collaborations rely heavily on dense neural networks (dNNs) [1, 2] and boosted decision trees (BDTs) [3-10] for event classification. These classifiers are typically trained on Monte Carlo (MC) simulated events to separate signal events from expected background processes. Both dNNs and BDTs expect collider events to be represented by fixed-sized inputs. Creating a robust fixed-sized representation of a collider event is challenging, however, often requiring hand engineering of a fixed number of features to distill relevant information from a variable number of particles.
In this paper, we compare event classification architectures defined on point clouds, which are a natural variable-sized representation of collider events. Our architectures draw inspiration from two complementary approaches to point cloud processing. The first is Deep Sets [11,12], which compute global information about an event based on permutation-symmetric functions. The second is edge convolutions (EdgeConvs) [13,14], which compute local information associated with each particle and its neighbors. We propose architectures which center around three different strategies for event classification:
• emphasizing local information with multiple summations;
• improving latent representations of collider events via iterated convolutions; and
• emphasizing global information through nested concatenation of global features.
The nested concatenation structure provides an interesting alternative to the equivariant layers of Refs. [11,15], with the same degree of expressivity. To test our architectures, we perform a case study involving signal/background binary classification of Higgs boson decays to tau leptons. This channel has been studied by ATLAS [3-5], whose results we use as a baseline for comparison, and by CMS [16-19]. Based on our study, we recommend the pairwise architecture in Eq. (11) below, which can be interpreted as either:
• a Deep Set acting on pairs of particles; or
• a symmetric pooling over EdgeConvs.
Compared to more complex graph neural networks, this pairwise structure balances classification performance, computational efficiency, and conceptual simplicity. We study the performance of the pairwise architecture as a function of the latent dimension size, finding that even a single latent dimension outperforms the baseline ATLAS strategy. Interestingly, we find that the discriminatory features identified by the pairwise architecture have some correlations with the traditionally selected hand-engineered features.
The point cloud representation and corresponding architectures have appeared before in the literature for various classification tasks [12,14,15,20-24]. We refer to Refs. [25,26] for a more thorough review of the use of point clouds in particle physics. Using point clouds, collider events are represented as an unordered set of n-dimensional vectors, where each vector corresponds to a measured particle in that collision event. We need a set since there is a variable number of particles in each collision event. This set is unordered since there is no inherent ordering to the particles.¹ As discussed below, a key benefit of using architectures defined on point clouds is that they bypass the traditional feature engineering game and its associated combinatorial problems.

¹ It is sometimes convenient to sort particles according to some measure of energy. When comparing our point cloud architectures to fixed-sized input dNNs, we will sort over the particle transverse momenta (p_T).

The remainder of this paper is organized as follows. In Sec. II, we describe our architectures and review their motivation and inspirations. We make these architectures available along with example code on GitHub [27], and we describe the connection between nested concatenation and equivariant layers in App. A. In Sec. III, we perform a classification case study of the H → τ τ decay channel, comparing our proposed architectures with a baseline ATLAS strategy, and we show visualizations of the latent space for the pairwise architecture. Details of the neural network parameters are given in App. B. We conclude in Sec. IV with a summary of our recommendations and areas for future exploration.

II. POINT CLOUD ARCHITECTURES
In this section, we describe our point cloud architectures and their motivation. We start by describing why the point cloud representation is more natural for event classification compared to traditional fixed-sized inputs. We then review Deep Sets and EdgeConvs, which are the main inspirations for our architectures. Following this, we describe our proposed architectures, as summarized in Fig. 1.

A. Why Point Clouds for Event Classification
The point cloud representation of collider events avoids two of the key challenges that arise when trying to construct robust fixed-sized inputs for event classification:
• Combinatorial ambiguities. One way of creating a fixed-sized representation of an event is to define high-level kinematic variables, but this can lead to combinatorial challenges. For example, if we find a good discriminatory feature that is derived assuming the final state has two b-tagged jets, but mistagging leads to three measured b-tagged jets, then one has to decide which of the three pairs should be used to compute the high-level feature. Similarly, if there is only one measured b-tagged jet, due to mistagging or kinematic acceptance, then the high-level feature is ill-defined, even if in principle there is enough information available for event classification.
• Truncation ambiguities. Another way to create a fixed-sized representation of an event is to input the kinematics of a fixed number of particles. This, however, introduces a dependence on how the particles are ordered and which ones are truncated away. While some orderings, like taking the most energetic particles, have a physical motivation, they might not yield the best discrimination power, especially if particle correlations are relevant. There is also a question about how to properly pad events when there are fewer particles than the desired fixed-sized representation.
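The combinatorial ambiguity described above can be made concrete with a toy sketch: with three b-tagged jets (one from a mistag), a feature defined on "the" b-jet pair admits three candidate pairings. The jet labels here are purely illustrative.

```python
from itertools import combinations

# Three b-tagged jets (one is a mistag) yield three candidate pairs
# for any high-level feature defined on a single b-jet pair, such as
# a dijet invariant mass.
bjets = ['b1', 'b2', 'b3']
pairs = list(combinations(bjets, 2))
assert pairs == [('b1', 'b2'), ('b1', 'b3'), ('b2', 'b3')]
```

Each additional mistagged object multiplies the number of candidate pairings, which is exactly the bookkeeping the point cloud representation avoids.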
The point cloud representation avoids these combinatorial and truncation ambiguities. By enforcing permutation invariance, all possible particle combinations are automatically considered for event classification. By allowing for variable-sized inputs, there is no need for ordering or truncation. Of course, it is possible that cleverly engineered fixed-sized features could outperform generic point-cloud architectures, though this turns out not to hold for the case study in Sec. III. Graph neural networks are a popular approach to point cloud processing, but the simpler architectures studied in this paper also avoid these ambiguities with reduced computational complexity.
The architectures below take as input a set of M particles:

X = {x_1, x_2, ..., x_M},    x_i ∈ R^n,   (1)

where each particle x_i is described by n features. These features could include particle characteristics like p_T and b-tag score. Let the classification of an event X be called:

y = f(X) ∈ R^m,   (2)

where f represents an event classification architecture for m possible channels. For the binary classification studied in this paper, m = 2.
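As a minimal sketch of this input format, each event can be stored as an (M, n) array with a fixed per-particle feature dimension n but a variable particle count M; the specific shapes below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three toy events with M = 3, 7, 5 particles and n = 4 features each
# (the feature content is illustrative, e.g. pT, eta, phi, tag flag).
events = [rng.normal(size=(m, 4)) for m in (3, 7, 5)]

# A point cloud classifier f must accept every event directly: the
# feature dimension n is shared, but M varies event by event, with no
# ordering, truncation, or padding required.
assert all(e.shape[1] == 4 for e in events)
assert [e.shape[0] for e in events] == [3, 7, 5]
```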

B. Review of Deep Sets
Deep Sets are a way to parametrize permutation-symmetric functions with neural networks. They were defined in Ref. [11] and first introduced to the particle physics community in Ref. [12], under the name of particle flow networks. Deep Sets have achieved state-of-the-art performance on various collider physics tasks, such as quark/gluon jet discrimination and boosted top tagging [28].
As shown in Ref. [11], a function f(X) operating on a set X is permutation invariant if (and, under certain assumptions, only if) it can be decomposed into the form:

f(X) = F( (1/M) Σ_{i=1}^{M} Φ(x_i) ).   (3)

Each particle x_i ∈ X is transformed into a latent representation of dimension ℓ by a function:

Φ : R^n → R^ℓ.   (4)

The particle-wise outputs Φ(x_i) are averaged over and processed by an event-wise function:

F : R^ℓ → R^m.   (5)

To approximate the optimal event classifier, the functions Φ and F are parametrized by neural networks. The factor of 1/M in Eq. (3) differs from the presentation in Refs. [11,12], where sum pooling (instead of average pooling) was the default. Average pooling simplifies some of the later notation, and we use this pooling operation to define the action of Φ on the set X:

Φ(X) ≡ (1/M) Σ_{i=1}^{M} Φ(x_i).   (6)

This notation makes clearer that the function Φ effectively maps the entire set X into a latent representation of dimension ℓ.
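The Deep Sets decomposition can be sketched in a few lines of NumPy, with tiny random-weight maps standing in for the Φ and F networks; the weights, dimensions, and activation choices below are illustrative, not the paper's.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def phi(x):
    """Per-particle map Phi: R^n -> R^ell (a stand-in for a small MLP)."""
    return np.tanh(x @ W1)

def deep_set(X):
    """f(X) = F(mean_i Phi(x_i)): average pooling then event-wise F."""
    latent = phi(X).mean(axis=0)   # permutation-invariant pooling
    return softmax(latent @ W2)    # toy event-wise function F

rng = np.random.default_rng(0)
n, ell, m = 4, 8, 2
W1 = rng.normal(size=(n, ell))    # toy weights for Phi
W2 = rng.normal(size=(ell, m))    # toy weights for F

X = rng.normal(size=(5, n))       # an event with M = 5 particles
perm = rng.permutation(5)
assert np.allclose(deep_set(X), deep_set(X[perm]))  # permutation invariance
```

Because the pooling is an average over particles, reshuffling the rows of X cannot change the output, which is the defining property of Eq. (3).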

C. Review of Edge Convolutions
EdgeConvs are a way to incorporate local neighborhood information within point clouds for learning tasks. They were introduced in Ref. [13] and first used in particle physics by Ref. [14], under the name of ParticleNet. Architectures incorporating EdgeConvs have also achieved state-of-the-art performance for collider tasks [28].
The EdgeConv mechanism in Ref. [13] boils down to the following transformation of a particle x_i ∈ X:

x_i → Φ_2(x_i, X) ≡ (1/M) Σ_{j=1}^{M} Φ_2(x_i, x_j),   (7)

where the notation mirrors that of Eq. (6). Here, the particle x_i is transformed into a latent representation through a function Φ_2 based on pairwise information:

Φ_2 : R^n × R^n → R^ℓ.   (8)

Like before, ℓ is the latent dimension, and Φ_2 is parametrized by a neural network. For later purposes, we refer to the Deep Sets classifier in Eq. (3), f(X) = F(Φ(X)), as the Particlewise architecture.

D. Pairwise and Tripletwise Architectures

Combining the ingredients above yields the Pairwise architecture:

f(X) = F( (1/M^2) Σ_{i=1}^{M} Σ_{j=1}^{M} Φ_2(x_i, x_j) )
     = F( (1/M) Σ_{i=1}^{M} Φ_2(x_i, X) )
     ≡ F( Φ_2(X) ).   (11)

The first line corresponds to applying the Deep Sets formalism in Eq. (3) to the set of (ordered) particle pairs. The second line corresponds to applying the EdgeConvs in Eq. (7) to each particle, and then performing average pooling and postprocessing. The last line emphasizes that the role of Φ_2 is to map particle pairs to an ℓ-dimensional latent space, which makes this local pairwise information directly accessible when constructing a latent representation. If we choose Φ_2(x_i, x_j) = Φ(x_i), such that the x_j input is ignored, then the Pairwise architecture reduces to the Particlewise one.
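The double average over ordered particle pairs at the heart of the Pairwise architecture can be sketched as follows; the toy tanh map standing in for Φ_2 and the dimensions are illustrative only.

```python
import numpy as np

def phi2(xi, xj, W):
    """Pairwise map Phi_2: R^n x R^n -> R^ell (a stand-in for an MLP)."""
    return np.tanh(np.concatenate([xi, xj]) @ W)

def pairwise_latent(X, W):
    """Double average over ordered particle pairs:
    (1/M^2) sum_{i,j} Phi_2(x_i, x_j)."""
    M = len(X)
    return np.mean([phi2(X[i], X[j], W)
                    for i in range(M) for j in range(M)], axis=0)

rng = np.random.default_rng(1)
n, ell = 3, 6
W = rng.normal(size=(2 * n, ell))   # toy weights for Phi_2
X = rng.normal(size=(4, n))         # an event with M = 4 particles

# The double average sees every ordered pair, so the latent vector is
# permutation invariant; an event-wise network F would act on it next.
perm = rng.permutation(4)
assert np.allclose(pairwise_latent(X, W), pairwise_latent(X[perm], W))
```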
The natural generalization of the above constructions is the Tripletwise architecture, which involves a function that maps triplets of particles into a latent space:

Φ_3 : R^n × R^n × R^n → R^ℓ.   (12)

Mirroring the notation in Eq. (9), we can define this architecture in multiple ways:

f(X) = F( (1/M^3) Σ_{i=1}^{M} Σ_{j=1}^{M} Σ_{k=1}^{M} Φ_3(x_i, x_j, x_k) )
     ≡ F( Φ_3(X) ).   (13)

This tripletwise structure makes available even more local information when constructing a latent representation of an event. If we choose Φ_3(x_i, x_j, x_k) = Φ_2(x_i, x_j), such that the x_k input is ignored, then the Tripletwise architecture reduces to the Pairwise one. We found it impractical to use architectures with more nested summations due to their heavy computational cost.

E. Iterated Convolutions
One way to make neural networks more expressive is to introduce more nonlinearities. This is the motivation for our iterated convolution architectures, which can be viewed as a special case of DGCNNs [13].
A natural evolution of the Pairwise architecture in Eq. (11) involves inserting an additional non-linear function Π, implemented with a neural network, between the two sums, or equivalently after the EdgeConv layer. Iterating this structure L times, with X^(L−1) denoting the set of latent particle representations after L−1 such updates, gives:

f(X) = F( (1/M) Σ_{i=1}^{M} Π( Φ_2(x_i^(L−1), X^(L−1)) ) )
     ≡ F( Φ_2^{(L),Π}(X^(L−1)) ).

In the last line, we are using the same notation as Eq. (15), to emphasize that Φ_2^{(L),Π} transforms the point cloud into an ℓ-dimensional latent representation. When L = 1, this reduces to the Nonlinear Pairwise architecture from Eq. (14). A version of this architecture is also possible for the tripletwise case, but we shall not pursue it due to its heavy computational cost.
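The iterated structure can be sketched as repeated EdgeConv-style updates followed by a final pooling; the tanh non-linearity standing in for Π and the layer dimensions below are illustrative assumptions.

```python
import numpy as np

def edge_conv(X, W):
    """One EdgeConv-style update: x_i -> mean_j tanh(Phi_2(x_i, x_j)),
    with tanh playing the role of the toy non-linearity Pi."""
    M = len(X)
    return np.array([
        np.mean([np.tanh(np.concatenate([X[i], X[j]]) @ W)
                 for j in range(M)], axis=0)
        for i in range(M)])

def iterated_pairwise(X, weights):
    """Apply L EdgeConv layers, then average-pool into a latent vector
    that an event-wise network F would process."""
    for W in weights:           # each layer has its own Phi_2 weights
        X = edge_conv(X, W)
    return X.mean(axis=0)

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 3))                                   # M = 5, n = 3
weights = [rng.normal(size=(6, 4)), rng.normal(size=(8, 4))]  # L = 2 layers
out = iterated_pairwise(X, weights)
perm = rng.permutation(5)
assert out.shape == (4,)
assert np.allclose(out, iterated_pairwise(X[perm], weights))
```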

F. Nested Concatenation of Global Features
Our final class of architectures combines global information about the whole event with local information about particles. Let f̃(X) be a permutation-invariant function, which captures global information about the collider event X. For example, f̃(X) could simply be a Deep Set applied to the set of particles:

f̃(X) = F̃( (1/M) Σ_{i=1}^{M} Φ̃(x_i) ).   (22)

The global information from f̃(X) can then be concatenated (⊕) with the local features associated with each particle:

Φ_⊕(x_i) ≡ Φ( x_i ⊕ f̃(X) ).   (23)

This concatenation is a conceptually simple way to let individual particles see global information about the point cloud.
The Nested Concatenation architecture iterates the structure in Eq. (23) L times. Let the base case be a standard Deep Set:

f^(0)(X) = F^(0)( Φ^(0)(X) ).   (24)

At level d, we have:

f^(d)(X) = F^(d)( (1/M) Σ_{i=1}^{M} Φ^(d)( x_i ⊕ f^(d−1)(X) ) )
         ≡ F^(d)( Φ_⊕^(d)(X) ).   (25)

In the last line, we have introduced the notation

Φ_⊕^(d)(X) ≡ (1/M) Σ_{i=1}^{M} Φ^(d)( x_i ⊕ f^(d−1)(X) ),   (26)

which emphasizes that Φ_⊕^(d) transforms the point cloud into a latent representation of dimension ℓ, just as in the previous architectures. The final architecture at total level L is f(X) = f^(L)(X), such that Eq. (25) reduces to Eq. (3) when L = 0.
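The nested concatenation idea can be sketched as follows: at each level, a global Deep Set summary is concatenated onto every particle before the next per-particle map. The tanh stand-ins for the Φ networks and the dimensions are illustrative assumptions.

```python
import numpy as np

def deep_set_latent(X, W):
    """Toy global summary: a Deep Set latent vector for the event."""
    return np.tanh(X @ W).mean(axis=0)

def nested_concat(X, phi_weights, global_weights):
    """Each level broadcasts a global summary onto every particle
    (the concatenation in Eq. (23)), then re-embeds and pools."""
    for Wp, Wg in zip(phi_weights, global_weights):
        g = deep_set_latent(X, Wg)                      # global feature
        X = np.tanh(np.hstack([X, np.tile(g, (len(X), 1))]) @ Wp)
    return X.mean(axis=0)

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 3))           # M = 4 particles, n = 3 features
# One level: particles have 3 features, the global summary has 2,
# so the per-particle map sees an input of dimension 5.
phi_weights = [rng.normal(size=(5, 6))]
global_weights = [rng.normal(size=(3, 2))]
out = nested_concat(X, phi_weights, global_weights)
perm = rng.permutation(4)
assert np.allclose(out, nested_concat(X[perm], phi_weights, global_weights))
```

Since the global summary is itself permutation invariant and the final pooling is an average, the whole construction stays permutation invariant at every level.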
To allow for more dynamic manipulation of the point cloud information, we define the Nested Concatenation with Memory architecture. In this architecture, the intermediate latent representations of particles generated by the (d−1)-th nested layer are also utilized in the d-th nested layer. We define the same base case function as Eq. (24), with the base case set as X^(0) ≡ X. After d nested layers, the set takes the form

X^(d) = { x_1^(d), ..., x_M^(d) },

where the i-th element is determined via:

x_i^(d) = Φ_⊕^(d)( x_i^(d−1) ) ≡ Φ^(d)( x_i^(d−1) ⊕ f^(d−1)(X^(d−1)) ).

The function f^(d)(X^(d)) sums and processes the latent representation X^(d) with a function F^(d):

f^(d)(X^(d)) = F^(d)( (1/M) Σ_{i=1}^{M} x_i^(d) ).

The function Φ_⊕ is the same as in Eq. (26) but now applied to the set X^(d−1). Like before, we let f(X) = f^(L)(X^(L)) be the classification of collider event X at level L.
It is worth mentioning that Ref. [11] defined an alternative method to incorporate global information based on permutation equivariance; this structure was used for jet tagging in Ref. [15]. In App. A, we show that permutation equivariant Deep Sets are a special case of our Nested Concatenation architecture. We prefer to use the more general concatenation structure, though, due to its flexibility.

III. EVENT CLASSIFICATION CASE STUDY
In this section, we describe the setup and results of our event classification case study to benchmark our proposed architectures against more traditional methods. We start with a description of the signal and background processes that will be the context for our case study. We then describe how we generate synthetic data sets and preprocess the inputs for both traditional architectures and our proposed architectures. Following this, we present several performance metrics of the tested architectures and advocate for the Pairwise architecture as the best balance between computational cost and performance. We then perform a latent dimension study of the Pairwise architecture and visualize the separation of signal and background events in the latent space. Finally, we examine correlations between the features found to be useful to represent collider events by our Pairwise architecture and the hand-engineered features chosen by the ATLAS collaboration.

A. Signal and Background Processes
Our case study is based on a problem relevant to analyzing the H → τ τ decay channel [3-5]. Among leptonic Higgs boson decays, the H → τ⁺τ⁻ channel has the largest branching ratio, 6.3% [29,30], which makes this channel a prime candidate to study the Yukawa-Higgs mechanism for mass generation. The presence of neutrinos in the final state of this process, however, degrades the resolution of the measured Higgs boson four-momentum. This degraded resolution makes the signal process much more difficult to distinguish from background processes, thereby motivating a machine learning approach.
To tackle the challenge of identifying the H → τ⁺τ⁻ final state, one typically isolates different event topologies and studies them individually. One H → τ τ topology of interest, which will serve as the signal process in our classification case study, is the production of a Higgs boson in association with a pair of top quarks, where both top quarks and both τ leptons decay hadronically. We denote this signal process as tt(H → τ τ ), or ttH for short. A schematic of this process is shown in Fig. 2a. Ideally, the final state for this process would result in two τ-tagged jets, two b-tagged jets, and four additional jets from W± decays. This channel was considered by ATLAS in Ref. [3].
The main background process that mimics this ttH signature, and significantly hinders the analysis of this channel [3], is the production of a top-antitop quark pair where each top quark decays as t → τ νb and both τ leptons decay hadronically. We denote this background process as tt(→ τ νb), or tt for short.
A schematic of this process is shown in Fig. 2b. The tt channel can mimic the signature of the ideal ttH process if there are four additional jets from gluons radiated before the hard scattering. In our case study, we focus on distinguishing between ttH events and tt events; a full analysis, of course, would consider multiple Higgs production topologies. These two channels both exhibit high-multiplicity final states with a diverse range of final state objects. This leads to many potential combinatorial reconstructions and a high probability of detecting extra objects or losing relevant ones. These characteristics make manually constructing features difficult and motivate the need for flexibility in the number of input objects. Our case study is therefore representative of situations where we hope to make gains from using point-cloud-based architectures, which naturally account for combinatorial ambiguities and incomplete reconstruction. The particular processes we study are among the highest-multiplicity channels currently analyzed at the LHC.

B. Data Generation
Following the tt(H → τ τ ) analysis strategy in Ref. [3], we select events that satisfy the following properties:
• Two visible τ-tagged jets;
• (≥ 5 jets and ≥ 2 b-tags) or (≥ 6 jets and ≥ 1 b-tag), with kinematic conditions for the leading (non-τ) jet and, for the other jets:
  * p_T ≥ 20 GeV,
  * |η| ≤ 5;
• ≤ 15 jets total.
Here, the transverse momenta (p_T) and pseudorapidities (η) are defined with respect to the beamline, ∆R² = ∆η² + ∆φ² is the distance between objects in the pseudorapidity-azimuth (η-φ) plane, and x_1, x_2 are the momentum fractions carried away by visible τ decay products as computed by the collinear approximation [32-34]. For both the signal and background channels, we generate events with MadGraph 5 v3.1.1 [35] and Pythia 8.245 [36]. These events are passed through the Delphes 3.5.0 [37] detector simulation with the ATLAS card. Jets are then clustered with the R = 0.4 anti-k_T algorithm [38] using the FastJet 3.3.4 [39] package. From all of the generated events, we extract 80k events from each channel that satisfy the event selection criteria, such that we have balanced data sets. For the machine learning study, we split each data set into 70% for training and 30% for testing.

C. Input Features

Following Ref. [3], the baseline represents each event by eight hand-engineered ATLAS features. These ATLAS features are used as inputs for a BDT to mimic the ATLAS analysis. We also use these ATLAS features to train a dNN, which yields comparable performance to the BDT. For discrimination between ttH and other background processes, such as Z + jets, alternative features are chosen by ATLAS.
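The jet-multiplicity part of the selection above can be sketched as a simple filter. This is a toy illustration only: the τ-jet kinematic conditions and the leading-jet conditions from the text are not implemented, and the dict keys are assumptions.

```python
def passes_selection(jets):
    """Sketch of the jet-multiplicity selection quoted in the text.

    `jets` is a list of dicts with keys 'pt' (GeV), 'eta', 'btag',
    'tautag'. Tau-jet and leading-jet kinematic cuts are omitted.
    """
    taus = [j for j in jets if j['tautag']]
    others = [j for j in jets if not j['tautag']]
    sel = [j for j in others if j['pt'] >= 20.0 and abs(j['eta']) <= 5.0]
    n_b = sum(j['btag'] for j in sel)
    if len(taus) != 2 or len(sel) > 15:
        return False
    return (len(sel) >= 5 and n_b >= 2) or (len(sel) >= 6 and n_b >= 1)

def make_jet(pt, eta, btag=0, tautag=0):
    return {'pt': pt, 'eta': eta, 'btag': btag, 'tautag': tautag}

# Two tau-tagged jets plus five selected jets with two b-tags passes;
# dropping one tau-tagged jet fails the selection.
event = [make_jet(40, 0.0, tautag=1), make_jet(35, 0.5, tautag=1),
         make_jet(30, 1.0, btag=1), make_jet(25, -1.0, btag=1),
         make_jet(22, 2.0), make_jet(21, 0.3), make_jet(20, -0.2)]
assert passes_selection(event)
assert not passes_selection(event[1:])
```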
For the point cloud architectures, we perform the following preprocessing of the inputs. Let K_tot be the scalar sum of the kinematic quantity K for all particles in an event, and K̃_tot be the kinematic quantity K derived from the sum of the four-momenta of all particles in the event.
We also define K_i as the kinematic feature for object i in the event. Each event is represented as a set of particles, with each particle described by a list of kinematic features, including its energy E in the lab frame, its invariant mass M, a b_tag flag (1 if the particle is b-tagged, 0 if not), and a τ_tag flag (1 if the particle is τ-tagged, 0 if not).
We found that taking the logarithm of the dimensionful features improved the training across architectures. One additional hand-engineered feature we must consider is the di-tau invariant mass M_ττ. This feature is left out of standard ATLAS training but discoverable by the point cloud architectures. Because the di-tau mass would peak at the Higgs mass for signal events but not for background events, it is expected to be a good discriminant. In the context of the ATLAS analysis, this feature is intentionally left out so that it can be used as a sanity check on the classification. To get a more complete performance comparison for our case study, we concatenate M_ττ with the previously described ATLAS features to be used as input to a BDT. Specifically, we include:
9. M^Coll_ττ: the reconstructed di-tau invariant mass using the collinear approximation [32-34] to account for the energy carried off by neutrinos.
The ATLAS analysis in Ref. [3] uses the Missing Mass Calculator [33], but we chose to test against the collinear approximation instead for its relative simplicity.
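The collinear approximation itself can be sketched in a few lines: each neutrino is assumed collinear with its visible tau, so the missing transverse momentum fixes the visible momentum fractions x_1, x_2 via a 2×2 linear system, and M^Coll_ττ = M_vis / sqrt(x_1 x_2). The function names and the closure test below are illustrative.

```python
import numpy as np

def inv_mass(p):
    """Invariant mass of a four-vector (E, px, py, pz)."""
    E, px, py, pz = p
    return float(np.sqrt(max(E**2 - px**2 - py**2 - pz**2, 0.0)))

def collinear_mass(v1, v2, met_x, met_y):
    """Di-tau mass in the collinear approximation. v1, v2 are the
    visible (E, px, py, pz); solving MET = sum of collinear neutrino
    pT gives (1/x1, 1/x2), then m_tautau = m_vis / sqrt(x1 * x2)."""
    A = np.array([[v1[1], v2[1]],
                  [v1[2], v2[2]]])
    b = np.array([v1[1] + v2[1] + met_x,
                  v1[2] + v2[2] + met_y])
    inv_x1, inv_x2 = np.linalg.solve(A, b)
    m_vis = inv_mass([a + c for a, c in zip(v1, v2)])
    return m_vis * np.sqrt(inv_x1 * inv_x2)

# Closure test: two massless "taus" split into exactly collinear
# visible fractions x1, x2 plus neutrinos; the collinear mass should
# recover the full di-tau invariant mass.
p1 = np.array([60.0, 30.0, 40.0, np.sqrt(60.0**2 - 30.0**2 - 40.0**2)])
p2 = np.array([80.0, -50.0, 10.0, np.sqrt(80.0**2 - 50.0**2 - 10.0**2)])
x1, x2 = 0.7, 0.5
met = (1 - x1) * p1[1:3] + (1 - x2) * p2[1:3]
m_coll = collinear_mass(x1 * p1, x2 * p2, met[0], met[1])
assert abs(m_coll - inv_mass(p1 + p2)) < 1e-6
```

The approximation is exact in this constructed case; in real events, finite tau masses and MET resolution smear the reconstructed peak.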
As a cross check, we train a dNN that takes as input a p T -sorted and flattened version of the event representation in Eq. (34) with padding to make each event have 15 objects. We call this a flattened point cloud. By training a dNN on the flattened point cloud, we can assess whether the improvement in performance we find from the point cloud architectures is truly due to how we structure the architectures rather than just because of an increase in available information.
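The flattened point cloud described above can be sketched as follows; treating feature 0 as p_T and the exact shapes are assumptions for illustration.

```python
import numpy as np

def flatten_point_cloud(event, max_objects=15):
    """pT-sort, zero-pad to `max_objects`, and flatten an (M, n) event
    into the fixed-size vector a dense network expects."""
    order = np.argsort(-event[:, 0])                  # pT-descending
    padded = np.zeros((max_objects, event.shape[1]))  # zero padding
    padded[:len(event)] = event[order]
    return padded.ravel()

rng = np.random.default_rng(4)
event = rng.uniform(1.0, 100.0, size=(6, 5))  # M = 6 objects, n = 5 features
vec = flatten_point_cloud(event)
assert vec.shape == (75,)                 # 15 objects x 5 features
assert vec[0] == event[:, 0].max()        # hardest object comes first
```

This makes explicit the ordering and padding choices that the point cloud architectures avoid.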

D. Performance of Point Cloud Architectures
We now compare the performance of our proposed point cloud architectures on the ttH versus tt event classification problem. The point cloud architectures from Sec. II that we test are:
• Particlewise;
• Pairwise;
• Nonlinear Pairwise;
• Iterated Nonlinear Pairwise;
• Tripletwise;
• Nested Concatenation; and
• Nested Concatenation with Memory.
The parameters for all architectures are summarized in App. B. We chose hyperparameters such that each architecture has approximately 100k trainable parameters, to make sure we were comparing architectures based on their structure and not just on model size. For comparison, we test four more traditional architectures:
• BDT trained with the ATLAS features;
• BDT trained with the ATLAS features and M^Coll_ττ;
• dNN trained with the ATLAS features; and
• dNN trained with the flattened point cloud,
where again the parameters are specified in App. B.
To assess the classification performance of each architecture, we plot their receiver operating characteristic (ROC) curves in Fig. 3. These curves show the inverse background false-positive rate (1/ε_b) as a function of the signal efficiency (ε_s), as the cut on the architecture output is varied. The best performing architectures are based on tripletwise or pairwise information. The next best architectures use the nested concatenation structure from Sec. II F. The point cloud architecture with the weakest performance is the Particlewise architecture. The dNN acting on the flattened point cloud has significantly worse performance, which indicates the importance of linking the point cloud data inputs to a suitable architecture for processing.
In Table I, we tabulate the number of trainable parameters in each architecture, the area under the ROC curve (AUC), and various performance metrics at the operating points of ε_s ∈ {0.3, 0.7}. The choice of ε_s = 0.7 mimics the choice made in Ref. [3]. At this operating point, our best performing models (Nonlinear Pairwise, Iterated Nonlinear Pairwise, Tripletwise, and Pairwise) achieve a 2.5 times increase in ε_s/ε_b over the more traditional dNN and BDT models. This jump in performance would be even more pronounced at an operating point of ε_s = 0.3, which leads to a nearly fourfold increase in discrimination power; this operating point is also closer to the one that would maximize ε_s/√ε_b, i.e. the one that would yield the largest significance improvement [40]. Furthermore, the poor performance of the dNN on the flattened point cloud implies to us that the increase in performance in our architectures comes from their improved methods of processing information from a collider event, and not from simply increasing the amount of information given to an architecture.
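The working-point metrics quoted above can be sketched with toy classifier outputs; the Gaussian scores below are illustrative, not the paper's models.

```python
import numpy as np

def cut_for_efficiency(scores_sig, eps_s):
    """Threshold on the classifier output achieving signal efficiency eps_s."""
    return np.quantile(scores_sig, 1.0 - eps_s)

def roc_point(scores_sig, scores_bkg, threshold):
    """Signal efficiency eps_s and background rejection 1/eps_b at a cut."""
    eps_s = np.mean(scores_sig >= threshold)
    eps_b = np.mean(scores_bkg >= threshold)
    return eps_s, 1.0 / eps_b

rng = np.random.default_rng(5)
sig = rng.normal(1.0, 1.0, 10_000)    # toy classifier outputs for signal
bkg = rng.normal(-1.0, 1.0, 10_000)   # toy classifier outputs for background

cut = cut_for_efficiency(sig, 0.7)    # the eps_s = 0.7 working point
eps_s, rejection = roc_point(sig, bkg, cut)
assert abs(eps_s - 0.7) < 0.02
assert rejection > 1.0
```

Sweeping the threshold traces out the full ROC curve, and comparing `rejection` between two classifiers at fixed `eps_s` gives the ε_s/ε_b ratios reported in Table I.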
The processing of local information via tripletwise or pairwise features yields the most powerful classifiers, but this local information does not need to be processed in overly complex ways. Notably, the architecture which processes information in nearly the simplest manner, the Pairwise architecture, performs comparably to the architectures with (iterated) non-linear structures. This implies that more complex strategies to process local information do not necessarily lead to better performance, at least in the context of this event classification problem.
Finally, we note that machine learning architectures trained on Monte Carlo event samples are sensitive to simulation-specific behavior that may affect model performance on real data. For example, the analysis in Ref. [3] did not utilize some input features because they were imperfectly modeled in simulation. In our case study, though, some of these imperfectly modeled variables were used as inputs, so our architectures could be learning unphysical correlations. Thus, the calibrated performance of our architectures on real data will likely be inferior to the performance on simulated data, unless some method is used to minimize the dependence on simulation-specific behavior. Despite this caveat, we expect the relative performance of the different point cloud strategies to be similar.

FIG. 3. ROC curves for event classification between ttH and tt, comparing our proposed point cloud architectures to more traditional strategies. The signal efficiency ε_s is on the x-axis and the inverse background false-positive rate 1/ε_b is on the y-axis, such that better performance corresponds to curves that are more up and to the right. The Pairwise architecture, which is our recommended strategy, is one of the best performing methods for this task.

E. Computational Cost of Point Cloud Architectures
Of the best performing architectures, the Pairwise architecture is the most computationally efficient. To give an idea of the computational cost of our architectures, we tabulate the approximate time/epoch and total number of epochs to train each architecture in Table II. These training times are obtained on a server equipped with two NVIDIA Tesla K80s. The Pairwise architecture and Nonlinear Pairwise architecture are close in both performance and runtime efficiency. The other two architectures that achieve similar performance, the Iterated Nonlinear and Tripletwise architectures, are significantly more computationally expensive.
Since the Pairwise architecture is nearly the best performing architecture while still being one of the simplest conceptually and most efficient computationally, we recommend the use of the Pairwise architecture for event classification. For other point cloud tasks that have seen gains from using graph neural networks, we recommend their performance be benchmarked against the Pairwise architecture.

F. Latent Dimension Studies of the Pairwise Architecture
Having convinced ourselves that the Pairwise architecture is a prime candidate for event classification, we now study the ℓ-dimensional latent representation of events generated by this architecture. Specifically, we study what happens if we restrict the latent dimension of the Pairwise architecture. This gives us a way to study how powerful this architecture is at identifying useful discriminatory features from the event kinematics. Furthermore, we can visualize the latent representations to get some picture of what the architecture is physically learning.
The latent dimension ℓ corresponds to the number of discriminatory features the architecture can extract from the point cloud. Concretely, if we restrict the latent dimension of the Pairwise architecture to ℓ = 2, we are essentially asking the architecture to extract two features from the point cloud that, when processed by the F function, can robustly distinguish between signal and background events. In the Pairwise architecture from Eq. (11), the latent representation of an event X is the result of a double summation over pairs of particles:

Φ_2(X) = (1/M^2) Σ_{i=1}^{M} Σ_{j=1}^{M} Φ_2(x_i, x_j).

For Fig. 3 and Table I above, we used ℓ = 2^6, as described in App. B. We can think of the BDTs and dNNs trained on the ATLAS variables [3] as having ℓ = 2^3, because they take as input 8 hand-engineered features, as described in Sec. III C.
As shown in Table III, we can significantly restrict the size of the latent dimension of our Pairwise architecture and still outperform traditional methods of event classification. This table shows the performance of the Pairwise architecture as a function of the size of the latent dimension. The strong performance even for ℓ = 2^0 = 1 implies that deep-learning-driven feature engineering is extremely powerful for finding robust discriminatory features for event classification. These discriminatory features could in principle be found from the traditional feature engineering game, but we expect they would be extremely difficult to find in practice without the use of machine learning.

FIG. 4. Visualization of the latent representations of events in the ℓ = 2^1 Pairwise architecture, using KDE for density estimation. In (a), we plot the latent representations corresponding to the ttH and tt events. In (b), we apply t-SNE to disentangle the two distributions and see a clearer separation. Each plot also shows the marginalized distributions along each axis.
For = 2, we can directly visualize the latent space of the Pairwise architecture, as shown in Fig. 4. Here we plot two versions of the latent dimension, where densities are approximated with kernel density estimation EMD: 0.584 (t-SNE)

ATLAS Features
ttH Events tt Events Pairwise Architecture with = 2 6 ttH Events tt Events , which has the same information as Fig. 4b, (c) the Pairwise architecture with = 2 3 , and (d) the Pairwise architecture with = 2 6 . The t-SNE embeddings have been standardized such that the distributions have mean 0 and standard deviation 1 along both dimensions. The standardized embedding is then rotated such that the ttH events are centered on the right side of the figure. For each plot, we report the EMD between the distribution of ttH and tt events, which roughly measures the separation of the two distributions, with larger EMD corresponding to better separation. We also plot marginalized densities along each axis. (KDE) [41,42]. 7 In Fig. 4a, we plot the densities of the raw latent representations. We see that one of the latent space variables yields a clean separation between signal and background events, while the other one yields an approximately Gaussian distributed feature with different widths for signal and background. In Fig. 4b, we apply t-distributed stochastic neighbor embedding (t-SNE) [43][44][45][46] to the latent representation, which shows more clearly the separation between signal and background events. This t-SNE visualization serves as a reference for later plots with higher . As we increase the latent dimension , the trained architectures identify more discriminatory features, leading to better separation between signal and background events in the latent space. In Fig. 5, we plot the 2-dimensional t-SNE embeddings of the 8-dimensional AT-LAS feature space and compare it to our = {2 1 , 2 3 , 2 6 } Pairwise architectures. To quantify the separation between the two (t-SNE projected) distributions, we approximate the Earth Mover's Distance (EMD) [47][48][49] using the Euclidean distance as the ground metric between the ttH and tt distributions. 
Since the length scales within a t-SNE embedding are not physical and we wish for the EMD to be a meaningful metric of comparison, we standardize the whole distribution of events in each plot so that along each dimension we have zero mean and unit variance. Qualitatively, we see that for our proposed Pairwise architecture, the joint and marginalized distributions of ttH and tt are more clearly separated than for the ATLAS features. This observation is reinforced quantitatively by the fact that the EMD between the embedded distributions of signal and background events is smallest for the ATLAS features, implying the largest degree of overlap.

[Fig. 7 caption: M^Coll_ττ distributions for ttH signal events. Predicted labels are from the Pairwise architecture with ℓ = 2^6 at a fixed signal efficiency of ε_s = 0.7, where the purple curves correspond to ttH-labeled events and the pink curves correspond to tt-labeled events. Events classified as ttH have a sharper M^Coll_ττ feature, suggesting that the Pairwise architecture has learned features with some correlation to M^Coll_ττ.]
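To make the standardize-then-compare procedure concrete, the following sketch standardizes a two-dimensional embedding and approximates the EMD between two equal-size samples via optimal one-to-one assignment. The toy Gaussian samples and function names (`standardize`, `emd`) are our own illustration, not the analysis code used in the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def standardize(points):
    """Zero mean and unit variance along each dimension of the embedding."""
    return (points - points.mean(axis=0)) / points.std(axis=0)

def emd(a, b):
    """EMD between two equal-size, equal-weight 2D samples with Euclidean ground metric.

    For equal-weight empirical distributions of the same size, the optimal
    transport plan is a one-to-one matching, so an assignment solver suffices.
    """
    cost = cdist(a, b)                       # pairwise Euclidean distances
    row, col = linear_sum_assignment(cost)   # minimum-cost perfect matching
    return cost[row, col].mean()

rng = np.random.default_rng(0)
sig = rng.normal(loc=[+1.0, 0.0], scale=0.5, size=(200, 2))  # toy "signal" cluster
bkg = rng.normal(loc=[-1.0, 0.0], scale=0.5, size=(200, 2))  # toy "background" cluster

pts = standardize(np.vstack([sig, bkg]))     # standardize the whole distribution jointly
sep = emd(pts[:200], pts[200:])              # larger EMD = better separation
```

As in the paper, the standardization is applied to the combined distribution so that the EMD is comparable across plots.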

G. Correlations between Pairwise Architecture Features and Hand-Engineered Features
The features that our Pairwise architecture finds useful for representing collider events are correlated with several of the features chosen by ATLAS to classify events. In Fig. 6, we tabulate the Spearman's rank correlation coefficients (Spearman's ρ) between the ATLAS features used in Ref. [3] (see Sec. III C) and the learned latent features of our Pairwise architecture with ℓ = 2 and ℓ = 8. Spearman's ρ quantifies the degree to which two variables are monotonically related, with ρ = 0 meaning no correlation and |ρ| = 1 meaning a perfect monotonic relationship. Unlike the more common Pearson correlation coefficient, Spearman's ρ does not require the relationship to be linear, which makes it a more robust notion of correlation in the context of non-linear neural networks.
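The distinction between the two coefficients can be seen on toy data (our own illustration, unrelated to the paper's features): for a strictly monotonic but non-linear relationship, Spearman's ρ is exactly 1 while the Pearson coefficient is not.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 3.0, size=500)
y = np.exp(x)  # strictly monotonic in x, but highly non-linear

rho, _ = spearmanr(x, y)  # rank-based: sees only the monotonic ordering
r, _ = pearsonr(x, y)     # linear correlation: penalized by the curvature
```

Because `spearmanr` operates on ranks, any strictly increasing transformation of `y` leaves ρ unchanged, which is exactly the robustness property invoked in the text.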
Because Spearman's ρ quantifies monotonic relations, we have to be mindful of features whose classification performance is related to how close they are to the true W boson mass (m_W = 80.4 GeV), top quark mass (m_t = 172.5 GeV), and Higgs boson mass (m_H = 125 GeV). Thus, instead of tabulating correlations for M_Ŵ, M_t̂, and M^Coll_ττ, we look at |M_Ŵ − m_W|, |M_t̂ − m_t|, and |M^Coll_ττ − m_H|. We see that of all the ATLAS features, the scalar sum of all jet p_T is most closely related to the Pairwise architecture features. Other ATLAS features that have some correlation with the Pairwise architecture features are the smallest ∆R between two jets, the p_T of the ττ dijet, the ∆R between the two τ-tagged jets, and the invariant mass of the dijet/trijet with invariant mass closest to the W boson/top quark mass.
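The need for this folding can be demonstrated with a toy example of our own construction (the mass sample and classifier response below are synthetic, not the paper's data): a classifier output peaked at m_H has essentially no monotonic relation to the raw mass, but a perfect one to the absolute deviation from m_H.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(2)
m_H = 125.0
M = rng.uniform(50.0, 200.0, size=2000)          # toy reconstructed mass [GeV]
score = np.exp(-((M - m_H) / 15.0) ** 2)         # toy response peaked at m_H

rho_raw, _ = spearmanr(M, score)                 # non-monotonic: ~0
rho_folded, _ = spearmanr(np.abs(M - m_H), score)  # strictly decreasing: exactly -1
```

The folded feature |M − m_H| turns the peaked, non-monotonic dependence into a strictly monotonic one, which is what makes the tabulated Spearman's ρ meaningful.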
From Fig. 6, we can see that there exists a correlation between M^Coll_ττ and the Pairwise architecture features, as anticipated in Sec. III C. This correlation is subtle, though, and not captured by a single latent space feature. In Fig. 7, we consider the Pairwise architecture with ℓ = 2^6 and plot the M^Coll_ττ distribution for ttH events correctly and incorrectly classified by this architecture at a fixed signal efficiency of ε_s = 0.7. We see that correctly labeled events have a sharper peak in M^Coll_ττ, albeit shifted slightly to the right of m_H. This behavior suggests that the Pairwise architecture with ℓ = 2^6, which was used for our comparison in Sec. III D, has learned features with some correlation to M^Coll_ττ.
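Fixing a signal efficiency amounts to choosing a score threshold from the signal distribution alone. A minimal sketch, with a hypothetical helper name and synthetic scores rather than the paper's classifier outputs:

```python
import numpy as np

def split_at_signal_efficiency(scores_sig, eff=0.7):
    """Choose a threshold such that a fraction `eff` of signal events pass.

    Events with score >= threshold are labeled signal; the rest background.
    """
    thr = np.quantile(scores_sig, 1.0 - eff)
    passed = scores_sig >= thr
    return thr, passed

rng = np.random.default_rng(3)
scores = rng.uniform(size=1000)                      # stand-in classifier outputs
thr, passed = split_at_signal_efficiency(scores, 0.7)
```

Histogramming M^Coll_ττ separately for `passed` and `~passed` events then reproduces the kind of comparison shown in Fig. 7.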

IV. CONCLUSIONS
In this paper, we compared neural network architectures defined on the point cloud representation of collider events with traditional approaches for event classification. The point cloud representation allows us to circumvent many difficulties arising from trying to robustly represent an event with a fixed-size input. Our architectures explore three complementary strategies to process information within a point cloud: (1) using multiple summations to improve local information processing; (2) using iterated convolutions to increase an architecture's power to build latent representations; and (3) using nested concatenation of global features to improve global information processing. These can be viewed as simplified versions of the strategies used to build graph neural networks.
To benchmark our architectures, we performed a case study of event classification in the H → τ τ channel and compared the results to an ATLAS study using hand-engineered features [3]. At a comparable signal efficiency operating point to the one used by ATLAS, we found a 2.5 times increase in background rejection. This gain in performance was not simply due to the increased size of the input space. Indeed, when the flattened point cloud representation was processed with a dNN, we found worse performance than for the ATLAS baseline. We therefore recommend further explorations of point cloud architectures for event classification problems.
Among the tested architectures, the Pairwise architecture exhibited the best balance of classification performance, computational efficiency, and conceptual simplicity. The Particlewise architecture, based on a straightforward application of the Deep Sets formalism [11], yielded performance similar to the ATLAS baseline. By considering the set of pairs of particles, we found a boost in performance without requiring more complex nonlinear or iterated structures as in EdgeConv architectures [13]. The Pairwise architecture continues to have good discriminatory power as the latent space dimension is decreased, and by visualizing the learned latent representations, we conclude that the Pairwise architecture is able to identify discriminatory features that are well suited for event classification. We found that these learned features are correlated with traditionally chosen hand-crafted features. While more complex graph neural networks might provide better performance for certain event-wide tasks, we recommend that they be benchmarked against the simpler Pairwise architecture.
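The pairwise idea can be sketched in a few lines of NumPy: a per-pair map Φ_2 is applied to all ordered pairs of distinct particles and summed into a permutation-invariant latent vector. The linear-plus-ReLU form of Φ_2 and all names below are our own minimal stand-ins, not the trained networks from the paper.

```python
import numpy as np

rng = np.random.default_rng(4)

def phi2(xi, xj, W):
    """Toy per-pair map Phi_2: concatenate two particles, apply linear layer + ReLU."""
    z = np.concatenate([xi, xj])
    return np.maximum(W @ z, 0.0)

def pairwise_latent(event, W):
    """Sum Phi_2 over all ordered pairs of distinct particles.

    Summation makes the latent vector invariant under any reordering
    of the particles in the event.
    """
    n = len(event)
    return sum(phi2(event[i], event[j], W)
               for i in range(n) for j in range(n) if i != j)

latent_dim, feat_dim = 8, 4
W = rng.normal(size=(latent_dim, 2 * feat_dim))
event = rng.normal(size=(5, feat_dim))        # 5 particles, 4 features each

z = pairwise_latent(event, W)
perm = rng.permutation(5)
z_perm = pairwise_latent(event[perm], W)      # same latent, up to float rounding
```

A downstream classifier F then acts on `z`; only Φ_2 ever sees individual particles, which is what keeps the construction simpler than iterated EdgeConv blocks.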
A key open question regarding our proposed architectures is how they will perform as the final state multiplicity increases. In this paper, we considered a final state of O(10) objects, but final states of interest with O(100) or even O(1000) objects also appear in particle physics applications such as jet substructure studies. We found that for our case of O(10) final state objects, the Pairwise architecture achieved the best balance between performance and cost. As we scale up to more final state objects, however, it will be important to understand the performance/cost balance, as pairwise architectures scale quadratically with the number of objects being studied. To extend our architectures to more final state objects without suffering from quadratic scaling, one could follow the strategy of the original EdgeConv construction [13] and consider only k-nearest neighbors instead of all pairs. These trade-offs are an important area for future studies.
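The k-nearest-neighbor restriction reduces the edge count from O(n^2) to O(nk) and is cheap to compute with a spatial index. A sketch under our own assumptions (random 2D coordinates standing in for, e.g., (η, φ) positions of final-state objects):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(6)
n, k = 1000, 8
points = rng.normal(size=(n, 2))            # toy (eta, phi)-like coordinates

tree = cKDTree(points)
dists, idx = tree.query(points, k=k + 1)    # nearest neighbor of each point is itself
neighbors = idx[:, 1:]                      # drop self: k nearest neighbors per object

n_edges_knn = n * k                         # 8,000 edges
n_edges_all_pairs = n * (n - 1)             # 999,000 ordered pairs
```

Φ_2 would then be summed only over the `neighbors` lists rather than over all pairs, recovering the locality of the original EdgeConv construction.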
There are several more directions for further exploration. First, our case study focused on binary classification, but as described in Eq. (2), these point cloud architectures could be applied to multi-category classification or regression. Second, the Pairwise architecture is based on learning a generic function Φ_2, but it may be possible to improve performance and interpretability by restricting its functional form. Finally, there has been rising interest in incorporating physical symmetries into neural networks. While our point cloud architectures already exhibit manifest permutation invariance among the particles, event classification could benefit from directly incorporating Lorentz symmetry [50-55] or infrared and collinear safety [12,15,22,56-59].
We therefore conclude that a single permutation equivariant Deep Set layer is a special case of the concatenation approach. This in turn means that the composition of L-equivariant Deep Set layers from Eq. (A1) is equivalent to the resulting point cloud after L-nested layers of Eq. (29). Thus, the architecture in Ref. [11] based on iteratively applied equivariant Deep Set layers is a special case of our Nested Concatenation with Memory architecture.
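A permutation-equivariant Deep Sets layer of the form x_i → σ(Λ x_i + Γ Σ_j x_j) can be written and checked in a few lines; the ReLU nonlinearity and random weights below are our own illustrative choices, not the parameterization of Ref. [11].

```python
import numpy as np

rng = np.random.default_rng(5)

def equivariant_layer(X, Lam, Gam):
    """Permutation-equivariant Deep Sets layer: x_i -> relu(Lam x_i + Gam sum_j x_j).

    The per-particle term Lam x_i commutes with permutations, and the
    summed term Gam sum_j x_j is identical for every particle, so
    permuting the inputs permutes the outputs in the same way.
    """
    s = X.sum(axis=0)                       # global, permutation-invariant summary
    return np.maximum(X @ Lam.T + s @ Gam.T, 0.0)

d = 4
Lam = rng.normal(size=(d, d))
Gam = rng.normal(size=(d, d))
X = rng.normal(size=(6, d))                 # 6 particles, d features each

perm = rng.permutation(6)
out = equivariant_layer(X, Lam, Gam)
out_perm = equivariant_layer(X[perm], Lam, Gam)  # equals out[perm]
```

Stacking L such layers and reading off the per-particle outputs gives the iterated construction whose equivalence to nested concatenation is claimed above.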

Appendix B: Model Parameters
In this appendix, we specify the model parameters used for the event classification study in Sec. III D. All architectures were implemented with Keras [61] using the TensorFlow [62] back-end. We chose model parameters so that the total number of trainable parameters was roughly the same across all architectures.
• We implement dNNs with a batch normalization layer [63] followed by 3 hidden layers, each 256 nodes wide.
We use leaky ReLU activation functions between layers in all of our neural networks. A two-unit layer followed by a softmax activation function is used as the output layer in all models. To train, we minimize the categorical cross-entropy loss using the Adam optimization algorithm [64] with the AMSGrad enhancement [65]. During training, we reserve 30% of the training data as validation data and monitor the validation loss. To avoid overfitting, we stop training if the validation loss has not improved in 32 epochs and restore the model weights from the epoch with the lowest validation loss.
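The dNN recipe above translates directly into Keras; this is a sketch assuming the stated hyperparameters (input dimension, learning rate, and batch size are left to the user and were not all specified here).

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def build_dnn(input_dim):
    """dNN sketch: batch norm, 3 x 256 leaky-ReLU hidden layers, 2-unit softmax output."""
    model = keras.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.BatchNormalization(),
        layers.Dense(256), layers.LeakyReLU(),
        layers.Dense(256), layers.LeakyReLU(),
        layers.Dense(256), layers.LeakyReLU(),
        layers.Dense(2, activation="softmax"),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(amsgrad=True),  # Adam with AMSGrad enhancement
        loss="categorical_crossentropy",
    )
    return model

# Early stopping: halt after 32 epochs without validation-loss improvement,
# restoring the weights from the best epoch.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=32, restore_best_weights=True)
# model.fit(X, y_onehot, validation_split=0.3, callbacks=[early_stop], ...)
```

The `validation_split=0.3` argument in `fit` implements the 30% validation reservation described above.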
The following BDT parameters were used in our model implemented with XGBoost [66]. These parameters were chosen using the hyperparameter tuning library Hyperopt [67] with the Tree of Parzen Estimators algorithm [68].