Permutationless Many-Jet Event Reconstruction with Symmetry Preserving Attention Networks

Top quarks, produced in large numbers at the Large Hadron Collider, have a complex detector signature and require special reconstruction techniques. The most common decay mode, the "all-jet" channel, results in a 6-jet final state which is particularly difficult to reconstruct in $pp$ collisions due to the large number of permutations possible. We present a novel approach to this class of problem, based on neural networks using a generalized attention mechanism, that we call Symmetry Preserving Attention Networks (SPA-Net). We train one such network to identify the decay products of each top quark unambiguously and without combinatorial explosion as an example of the power of this technique. This approach significantly outperforms existing state-of-the-art methods, correctly assigning all jets in $93.0\%$ of $6$-jet, $87.8\%$ of $7$-jet, and $82.6\%$ of $\geq 8$-jet events respectively.


I. INTRODUCTION
At the Large Hadron Collider (LHC), protons are collided at the highest energy ever produced in the laboratory. Many of these collisions produce a high multiplicity of jets: collimated sprays of particles that originate from the strongly coupled quarks and gluons inside the proton. Final-states containing only jets occur through a number of physical processes, and the copious production of these "all-jet" topologies presents opportunities for precision measurements [1,2], and searches for rare Standard Model [3] or new physics [4,5] processes, but also raises particular challenges. Specifically, it is typically difficult to connect an observed jet with its quark origin, and the factorial dependence on the number of jets leads to the so-called "combinatorial explosion." For example, top quark pair production with subsequent hadronic W-boson decays $t \to Wb \to qqb$ has a ≥ 6 jet final-state in the so-called "resolved" regime, in which each of the three decay products of each top is reconstructed as a single jet. In some events this can be mitigated using "boosted" reconstruction [6], though this is limited to a small subset of events [7]. Thus, one of the biggest obstacles to extracting physical information from these events is correctly determining which jets originate from each of the parent top quarks.
The top quark, as the most massive fundamental particle in the Standard Model, is the only quark to decay before hadronization. This presents a unique opportunity to study an isolated quark, if its decay products can be correctly identified. Top quarks decay almost exclusively via $t \to Wb$ in the Standard Model, and events are categorized by the decay modes of the W bosons: dileptonic (9%), single-lepton (45%), or all-jets (46%) [8]. To date, the most precise measurements of top quark properties are typically performed in the single-lepton or dilepton channels [9]. The all-jet channel, held back by the ambiguous event reconstruction and large backgrounds, is comparatively underexplored.
In this paper, we propose a novel architecture for assignment of particle origin to jets, symmetry preserving attention networks (SPA-NET). Applying attention networks that naturally reflect the permutation symmetry of the task, SPA-NET significantly outperforms existing state-of-the-art techniques while avoiding combinatorial explosion. In the following, we define the nature of the jet assignment task, describe invariance and attention mechanisms in neural networks, describe the dataset and training, and demonstrate the performance of our technique relative to the state-of-the-art.

II. JET ASSIGNMENT
The jet assignment task is the identification of the original particle which leads to a reconstructed jet. In a collision which produces N jets, there are N! possible assignments. Fortunately, symmetries can reduce this number.
Top quarks decay via the chain $t \to Wb \to qqb$, suppressing charge labels. A pair of top quarks, $tt'$, therefore produces six quarks, $qqb\,q'q'b'$. This process is shown in Fig. 1. The task is to correctly identify six observed jets with six labels: $b$, $b'$, $2 \times q$, and $2 \times q'$. The symmetries between the two tops and between the decay products of the W-bosons reduce this to $6!/(2 \times 2 \times 2) = 90$ permutations. Further complicating things, ∼50% of $t\bar{t}$ events at the LHC are expected to contain at least one additional jet which is not the result of top quark decay, leading to $7!/(2 \times 2 \times 2) = 630$ or $8!/(2 \times 2 \times 2 \times 2) = 2520$ permutations in 7- or 8-jet events, respectively (the extra factor of 2 in the 8-jet case arises because the two additional jets are interchangeable). Higher jet multiplicity events are also likely; nonetheless, the current state-of-the-art techniques depend on enumerating and evaluating each permutation to identify the best candidate. The many incorrect assignments obscure the true assignment, diluting the scientific power of the data, and represent a significant computational penalty when all permutations are evaluated for every event. Sometimes, each permutation must be evaluated per systematic uncertainty per event, which is often intractable in the datasets typical in high-energy physics (HEP).
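As an illustration, the permutation counts quoted above can be reproduced with a few lines of Python. This is a minimal sketch rather than part of the analysis code: the function name `n_assignments` is hypothetical, and it assumes that exactly six of the N jets receive parton labels while any remaining jets are left unlabeled.

```python
from math import perm

def n_assignments(n_jets: int) -> int:
    """Distinct ways to give the labels (q, q, b, q', q', b') to six of n_jets jets.

    perm(n_jets, 6) counts ordered choices of six jets. Dividing by 2*2*2
    removes the q<->q, q'<->q', and t<->t' symmetries.
    """
    if n_jets < 6:
        raise ValueError("at least six jets are required")
    return perm(n_jets, 6) // 8

for n in (6, 7, 8):
    print(n, n_assignments(n))
```

For 8 jets, `perm(8, 6)` already divides out the interchange of the two unlabeled jets, which is why the result agrees with the four factors of 2 in the 8-jet expression.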
The most common technique is a $\chi^2$-minimization method, which scores a permutation based on the consistency of the reconstructed W-boson masses with known values and similarity of the two reconstructed top quark masses:

$$\chi^2 = \frac{(m_{bqq} - m_{b'q'q'})^2}{\sigma_{\Delta m_{bjj}}^2} + \frac{(m_{qq} - m_W)^2}{\sigma_W^2} + \frac{(m_{q'q'} - m_W)^2}{\sigma_W^2} \tag{1}$$

We set $m_W = 81.3$ GeV, and find for our dataset (described in Sec. IV) values of $\sigma_W = 12.3$ GeV and $\sigma_{\Delta m_{bjj}} = 26.3$ GeV following a Gaussian fit to the relevant distributions. $\chi^2$ is evaluated for every permutation, and the parton assignment with the minimum value is chosen. This method typically uses b-tagging to consider only permutations with b-jets in the place of b-quarks. This reduces the permutations but prevents the correct solution being found in the presence of mistagged jets. We apply this requirement in our studies for consistency with recent experimental results; this implementation has been the preferred reconstruction for ATLAS [1,10], while CMS has used a similar method [11].
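A score of this kind can be sketched in Python as follows. The sketch is illustrative only: the helper names (`inv_mass`, `chi2`, `best_assignment`) are hypothetical, jets are assumed to be given as `(E, px, py, pz)` tuples in GeV, and, unlike the benchmark described above, no b-tagging requirement is applied when scanning permutations.

```python
from itertools import permutations
from math import sqrt

M_W, SIGMA_W, SIGMA_DM = 81.3, 12.3, 26.3  # GeV, the values quoted in the text

def inv_mass(*jets):
    """Invariant mass of the summed (E, px, py, pz) four-vectors."""
    e, px, py, pz = (sum(j[i] for j in jets) for i in range(4))
    return sqrt(max(e * e - px * px - py * py - pz * pz, 0.0))

def chi2(b1, q1, q2, b2, q3, q4):
    """Two W-mass consistency terms plus a top-mass similarity term."""
    dm_top = inv_mass(b1, q1, q2) - inv_mass(b2, q3, q4)
    return ((inv_mass(q1, q2) - M_W) ** 2 / SIGMA_W ** 2
            + (inv_mass(q3, q4) - M_W) ** 2 / SIGMA_W ** 2
            + dm_top ** 2 / SIGMA_DM ** 2)

def best_assignment(jets):
    """Brute-force scan over every ordered choice of six jets."""
    return min(permutations(range(len(jets)), 6),
               key=lambda p: chi2(*(jets[i] for i in p)))
```

The brute-force scan makes the cost of the approach explicit: every ordered choice of six jets is scored, including the symmetric duplicates that a production implementation would skip.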
Another approach is the use of boosted decision trees or neural networks as permutation classifiers, though these have mostly been employed in the leptonic channels where combinatoric explosion is reduced [12,13], or to reconstruct tops individually [14]. CMS has also used a hybrid $\chi^2$ and BDT method in the all-jet channel [15].
A more advanced minimization technique is implemented in KLFitter [16]. Transfer functions are used to represent the detector response in a likelihood function per permutation, which is minimized to find the assignments. This operates similarly to the $\chi^2$ technique, though it has greater CPU requirements due to the more complex model. KLFitter has to date been almost exclusively used in the single lepton channel, and with the $\chi^2$ already CPU limited in many cases, it is not discussed here.

III. INVARIANCE AND ATTENTION IN DEEP NEURAL NETWORKS
Equivariance and invariance properties can play an important role in the design of both feed-forward and recursive neural networks across different forms of learning [17-22]. For instance, classical convolutional neural networks can produce object recognition outputs that are invariant with respect to translations in their two-dimensional inputs, and this invariance property has been generalized to apply to other manifolds and groups [23,24]. In the problem considered here, the network output should be invariant under permutations of the input jet order. Such permutation invariance has been explored in set-based [25-27] and graph networks [28,29]. The output should further identify two distinct interchangeable triplets, $qqb$ and $q'q'b'$, each including an interchangeable pair, $qq$ or $q'q'$. This permutation invariance on the output is a unique property of our dataset which our architecture must account for.
Attention mechanisms allow the network to selectively propagate information (gating): the activities of a set of neurons are multiplied, component-wise, by the activities of another set of neurons. These gating mechanisms allow neural networks to dynamically modulate neuron activity as a function of the other neurons or inputs. Recursive and attention-based networks, which allow the network to infer relationships between different elements in a sequence, have achieved state-of-the-art performance in natural language processing in machine translation [30], language understanding [31], and text generation [32].
Attention architectures are permutation invariant because rearranging the order of the elements in the input sequence induces the same rearrangement in the attention weights. This feature can be leveraged to endow the network with permutation symmetry [25,26]. We leverage the permutation invariance present in attention-based methods to efficiently model the symmetries of the top quark pair system. We generalize the ideas present in dot-product attention to allow for a three-way symmetry-preserving attention mechanism which can perform jet assignments into $qqb$ and $q'q'b'$ triplets.
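The statement that reordering the inputs induces the same reordering of the attention weights can be checked numerically. The following NumPy sketch uses a single-head scaled dot-product self-attention with random weights; the function `self_attention` and all sizes are illustrative assumptions and are not taken from the SPA-NET implementation.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the rows of x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
n_jets, dim = 7, 4
x = rng.normal(size=(n_jets, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

perm = rng.permutation(n_jets)
out, weights = self_attention(x, w_q, w_k, w_v)
out_p, weights_p = self_attention(x[perm], w_q, w_k, w_v)

# Reordering the jets reorders the outputs and the attention weights identically.
print(np.allclose(out_p, out[perm]))
print(np.allclose(weights_p, weights[np.ix_(perm, perm)]))
```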

IV. DATASETS
A sample of 50M simulated $pp \to t\bar{t}$ events was generated at $\sqrt{s} = 13$ TeV using MADGRAPH_AMC@NLO [33] (v2.7.2), interfaced to PYTHIA8 [34] (v8.2) for showering and hadronization. Detector response was simulated using DELPHES [35] (v3.4.2) with the ATLAS parametrization. Events are generated at leading order in quantum chromodynamics (QCD), with the top mass $m_{\text{top}} = 173$ GeV, and the W-boson forced to decay hadronically. Jets, reconstructed using the anti-$k_T$ algorithm [36] as implemented in FASTJET [37] (v3.2.1) with radius $R = 0.4$, are required to have transverse momentum $p_T \geq 25$ GeV and absolute pseudorapidity $|\eta| < 2.5$, and are tagged as originating from b-quarks with $p_T$-dependent efficiency and mistag rates. 5,926,407 events meet the preselection requirements of ≥6 jets and ≥2 b-jets. This dataset is available at [38]. The supervised learning technique employed here requires a training sample in which the correct assignments are identified. We define the correct jet assignments by matching them to the simulated truth quarks within an angular distance of $\Delta R = \sqrt{\Delta\eta^2 + \Delta\phi^2} < 0.4$. Requiring all six quarks to be unambiguously matched leaves 2,967,955 total events for training, and 119,283 for performance evaluation and hyperparameter tuning. This matching requirement has an efficiency of 24% on 6-jet events, 32% on 7-jet events, and 40% on ≥8-jet events. We also evaluate the performance on 365,939 additional testing events which contain any number of reconstructable top quarks. Of these events, 191,597 contain only one top quark with all decay products matched, with filter efficiencies of 56%, 52%, and 48% in 6-jet, 7-jet, and ≥8-jet events respectively.
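The truth-matching step can be sketched as below. This is an assumption-laden illustration: the function names are hypothetical, quarks and jets are represented only by `(eta, phi)` pairs, and "unambiguous" is taken here to mean that each quark has exactly one jet within the cone and that no jet is claimed by two quarks. The exact criterion used to build the dataset may differ.

```python
from math import hypot, pi

def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance, with the azimuthal difference wrapped into [-pi, pi)."""
    dphi = (phi1 - phi2 + pi) % (2 * pi) - pi
    return hypot(eta1 - eta2, dphi)

def match_quarks(quarks, jets, max_dr=0.4):
    """Return {quark index: jet index}, or None if the matching is ambiguous."""
    matches = {}
    for qi, (q_eta, q_phi) in enumerate(quarks):
        close = [ji for ji, (j_eta, j_phi) in enumerate(jets)
                 if delta_r(q_eta, q_phi, j_eta, j_phi) < max_dr]
        # Reject the event if a quark has zero or several candidate jets,
        # or if its only candidate has already been taken by another quark.
        if len(close) != 1 or close[0] in matches.values():
            return None
        matches[qi] = close[0]
    return matches
```

Wrapping the azimuthal difference matters for jets near $\phi = \pm\pi$, which would otherwise appear to be almost $2\pi$ apart.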
We note that the matching of partons to jets via the parton-shower history in this way is not strictly physical in QCD, as these partons are not directly observable in a physical system and are modified by color correlations with the rest of the event which may not have been fully taken into account at the level of the matching procedure [39]. Nonetheless, this is a common paradigm, including in the $\chi^2$ benchmark process [1,10], which we adopt in order to define the learning task. The physical interpretation of any measurement which utilizes our technique is unaffected by this procedure, as the output of the network is used only to pick the jet triplets which are to be measured. This is exactly analogous to the use of machine-learning based boosted top taggers [40,41].

V. SPA-NET ARCHITECTURE
We perform the jet classification with SPA-NET: an attention-based neural network which encodes the symmetries of the problem. Inputs to the network are an unsorted list of jets, each represented by their 4-vector $(p_T, \eta, \phi, M)$ as well as a boolean b-tag. $M$ and $p_T$ are logarithmically scaled, and then each component is independently normalized to have zero mean and unit variance. The network, shown in Fig. 2, consists of six components: a jet-independent embedding which converts each jet into a $D$-dimensional latent space representation; a stack of transformer encoders which learn contextual relationships; two additional transformer encoders on each branch to extract top-quark information; and two tensor-attention layers to produce the top-quark distributions. The transformer encoders employ a variant of attention known as multi-head self-attention [30], though they may use any permutation-invariant architecture in general.
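The input preparation described above can be sketched with NumPy. The function name `preprocess`, the column order `(pT, eta, phi, M, b-tag)`, and the small constant added before the logarithm (to guard against massless jets) are assumptions of this sketch, not details stated in the text.

```python
import numpy as np

def preprocess(jets, stats=None):
    """jets: array of shape (n, 5) with columns (pT, eta, phi, M, b-tag).

    Returns the transformed array and the (mean, std) that were applied, so
    that statistics computed on a training sample can be reused elsewhere.
    """
    x = np.asarray(jets, dtype=float).copy()
    x[:, [0, 3]] = np.log(x[:, [0, 3]] + 1e-6)  # log-scale pT and M (epsilon is an assumption)
    if stats is None:
        stats = (x.mean(axis=0), x.std(axis=0))
    mean, std = stats
    return (x - mean) / np.where(std > 0, std, 1.0), stats

rng = np.random.default_rng(1)
jets = np.column_stack([
    rng.uniform(25, 300, 1000),        # pT [GeV]
    rng.uniform(-2.5, 2.5, 1000),      # eta
    rng.uniform(-np.pi, np.pi, 1000),  # phi
    rng.uniform(1, 40, 1000),          # M [GeV]
    rng.integers(0, 2, 1000),          # b-tag flag
])
x, stats = preprocess(jets)
```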

A. Symmetry preserving tensor attention
Jet assignment in the context of $t\bar{t}$ events presents several unique problems for typical classification networks, as a variety of symmetries complicate the output generation and training. Primarily, the physics quantities are invariant under permutation of the W decay products $qq$. The network must not differentiate between predictions which prefer one ordering to the other, and match each $qq$ pair to the appropriate $b$. This naturally creates a triplet relation $qqb$, with $qq$ obeying permutation symmetry. In order to encode this into the network, we develop a generalization of the attention mechanism which can learn n-way relationships with included symmetries. We call this technique tensor attention.
Each tensor attention layer contains a set of weights $\theta \in \mathbb{R}^{D \times D \times D}$. This tensor is not inherently symmetric: in order to produce an invariant attention weighting, we first transform it into an auxiliary weights tensor which conforms to the classification permutation group. We produce $S \in \mathbb{R}^{D \times D \times D}$ and use it to perform weighted dot-product attention on a list of embedded jets $X \in \mathbb{R}^{N \times D}$, where $N$ is the number of jets. Working in flat Euclidean space, we express the attention mechanism in Einstein notation, with

$$S^{ijk} = \frac{1}{2}\left(\theta^{ijk} + \theta^{jik}\right) \tag{2}$$

$$O^{ijk} = X^{i}_{n} X^{j}_{m} X^{k}_{l} S^{nml} \tag{3}$$

The summation in Eq. (2) guarantees that the first two dimensions of $S$ will be symmetric, ensuring that $O^{ijk} = O^{jik}$, and enforcing $qq$ invariance. Afterwards, we perform a 3-dimensional softmax on $O$ to generate the joint triplet probability distribution

$$P(i, j, k) = \frac{\exp(O^{ijk})}{\sum_{i', j', k'} \exp(O^{i'j'k'})} \tag{4}$$

We produce individual distributions for each of the two top quarks, $P_L$ and $P_R$, and we produce a single triplet from each by selecting the peak of these distributions. An important note is that the size of the weights tensor depends only on the hyperparameter $D$, and thus, it is possible to include any number of jets in each event. Furthermore, the network evaluation scales only as $O(N^3)$ with respect to the number of jets because we only need to produce a triplet distribution $P$. This removes a crippling limitation of the $\chi^2$ method, which grows as $O(N^6)$. In our dataset, the largest jet multiplicity in a single event is $N = 18$.
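A small NumPy sketch of this layer is given below, assuming the symmetrization takes the form of an average over the first two indices of $\theta$. The function name `tensor_attention` and the toy sizes are illustrative, random numbers stand in for trained weights and embedded jets, and no attempt is made to exclude bins in which the same jet appears twice.

```python
import numpy as np

def tensor_attention(x, theta):
    """x: (N, D) embedded jets; theta: (D, D, D) weights. Returns P of shape (N, N, N)."""
    s = 0.5 * (theta + theta.transpose(1, 0, 2))     # symmetrize the first two indices
    o = np.einsum("in,jm,kl,nml->ijk", x, x, x, s)    # three-way dot-product attention
    p = np.exp(o - o.max())                           # subtract the max for stability
    return p / p.sum()                                # softmax over all (i, j, k)

rng = np.random.default_rng(0)
n_jets, dim = 6, 4
x = rng.normal(size=(n_jets, dim))
theta = rng.normal(size=(dim, dim, dim))
p = tensor_attention(x, theta)

# The peak of the distribution gives the predicted (q, q, b) jet indices.
i, j, k = np.unravel_index(p.argmax(), p.shape)
print(np.allclose(p, p.transpose(1, 0, 2)), np.isclose(p.sum(), 1.0))
```

Because the parameter tensor has shape `(D, D, D)` regardless of `n_jets`, the same weights apply to events of any multiplicity, while the output grows as $N^3$.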

B. Training
We train these distributions via cross-entropy between the output probabilities and the true target distribution on the all-jet $t\bar{t}$ problem, naming the resulting network SPAttER (SPA-NET for $t\bar{t}$ reconstruction). This formulation contains another symmetry which can be exploited: the top quark pairs are invariant with respect to the labels $tt' \leftrightarrow t't$. We create a symmetric loss function based on cross-entropy, $H(X, Y) = \sum_{(x,y) \in (X,Y)} -x \log(y)$, which allows either of the network's two output distributions, $P_L$ and $P_R$, to match either one of the targets $T_1$ and $T_2$. The target distributions $T$ have two symmetric nonzero entries, one for each permutation of the $qq$ pair. The loss $L$ is expressed as

$$L = \min\left(L_1(P_L, T_1, P_R, T_2),\; L_1(P_L, T_2, P_R, T_1)\right) \tag{5}$$

where $L_1(P_1, T_1, P_2, T_2) = H(T_1, P_1) + H(T_2, P_2)$. The resulting distributions may classify the same jet to be part of both triplets. To enforce unique predictions, we select the assignment of the higher probability $P$ first, and set the colliding bins of the other $P$ to zero. We then evaluate the second $P$ to select the best noncontradictory classifications. We also note that SPAttER does not enforce that b-tagged jets are selected in the position of the b-quarks. This allows the network to correctly predict events in which there are mistagged jets, while still utilizing b-tagging information. In the $\chi^2$ method, allowing this means b-tagging information is completely lost, and greatly increases the number of permutations.
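The symmetric loss can be sketched in NumPy as follows. The helper names are hypothetical, and two details are assumptions of this sketch: each of the two nonzero target entries is set to 0.5 so that the target is normalized, and a small constant is added inside the logarithm for numerical safety.

```python
import numpy as np

def cross_entropy(target, pred, eps=1e-12):
    """H(T, P): sum over bins of -T log(P)."""
    return float(-(target * np.log(pred + eps)).sum())

def symmetric_loss(p_left, p_right, t1, t2):
    """Minimum over the two ways of pairing the outputs with the targets."""
    direct = cross_entropy(t1, p_left) + cross_entropy(t2, p_right)
    swapped = cross_entropy(t2, p_left) + cross_entropy(t1, p_right)
    return min(direct, swapped)

def make_target(n_jets, q1, q2, b):
    """Target with two symmetric nonzero entries, one per ordering of the qq pair."""
    t = np.zeros((n_jets, n_jets, n_jets))
    t[q1, q2, b] = t[q2, q1, b] = 0.5  # the value 0.5 is an assumption
    return t

n = 6
t1, t2 = make_target(n, 0, 1, 2), make_target(n, 3, 4, 5)
uniform = np.full((n, n, n), 1.0 / n**3)
p_left = 0.9 * t2 + 0.1 * uniform    # left branch happens to find the second top
p_right = 0.9 * t1 + 0.1 * uniform   # right branch finds the first top
loss = symmetric_loss(p_left, p_right, t1, t2)
```

In this toy example the branches are "crossed" relative to the target labels, yet the loss remains small because the swapped pairing is the one selected by the minimum.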
SPAttER contains 2.1M parameters in each tensor attention layer and 600k parameters in the central transformer encoder stack. SPAttER was trained using the ADAMW optimizer [42] and four Nvidia Titan-V GPUs, converging after approximately 4 hours.

VI. PERFORMANCE
To assess performance we define two metrics, which can only be evaluated on identifiable top quarks where the correct assignment has been identified as described previously. The first metric is $\epsilon^{\text{top}}$, the fraction of identifiable top quarks which have all three jets correctly assigned. This is reported in two event subsets, those where only one top quark is identifiable ($\epsilon^{\text{top}}_1$), and those where both top quarks are identifiable ($\epsilon^{\text{top}}_2$). We further define $\epsilon^{\text{event}}$, the fraction of events with two identifiable top quarks in which both top quarks have all jets correctly assigned. Table I shows these metrics for both the $\chi^2$ and SPAttER methods.
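As a concrete reading of these definitions, the sketch below computes both efficiencies for events with two identifiable tops. The data layout (pairs of `(q1, q2, b)` jet-index triplets) and the function names are assumptions made for illustration. Triplets are compared with the $qq$ ordering and the top labels ignored, reflecting the symmetries of the task.

```python
def triplet_key(q1, q2, b):
    """Order-independent key for a (q, q, b) triplet of jet indices."""
    return (frozenset((q1, q2)), b)

def event_efficiency(events):
    """events: list of (predicted, truth), each a pair of (q1, q2, b) triplets.

    An event counts as correct when both predicted triplets match the truth,
    in either order of the two tops.
    """
    correct = 0
    for predicted, truth in events:
        pred = {triplet_key(*t) for t in predicted}
        true = {triplet_key(*t) for t in truth}
        correct += pred == true
    return correct / len(events)

def top_efficiency(events):
    """Fraction of true top triplets that appear among the predicted triplets."""
    hits = total = 0
    for predicted, truth in events:
        pred = {triplet_key(*t) for t in predicted}
        for t in truth:
            hits += triplet_key(*t) in pred
            total += 1
    return hits / total

events = [
    (((1, 0, 2), (3, 4, 5)), ((0, 1, 2), (4, 3, 5))),  # both tops correct
    (((0, 1, 2), (3, 4, 6)), ((0, 1, 2), (3, 4, 5))),  # one top correct
]
print(event_efficiency(events), top_efficiency(events))
```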
The $\chi^2$ has an $\epsilon^{\text{event}}$ of 41.2%, while SPAttER achieves an $\epsilon^{\text{event}}$ of 63.7%. The $\chi^2$ suffers from a large reduction in performance as jet multiplicity increases, peaking at 55.2% for events with exactly 6 jets and dropping to 20.5% in events with at least 8 jets. This drop is less pronounced for SPAttER, at 80.7% in 6-jet events and 52.3% in ≥8-jet events. We observe a similar trend in the per-top efficiencies; for $\epsilon^{\text{top}}_2$, SPAttER achieves 73.5% inclusively, compared to just 49.7% for the $\chi^2$. These numbers drop to 66.2% and 33.6% respectively in ≥8-jet events. The performance is lower in events in which only one top is identifiable, though SPAttER still strongly outperforms the $\chi^2$, with an $\epsilon^{\text{top}}_1$ of 55.2% and 28.6% respectively. We also note that in our evaluation dataset, 8.1% of events in which both tops are identifiable have at least one b-quark matched to non-b-tagged jets. These quarks, which are impossible for the $\chi^2$ to correctly reconstruct, are reconstructed by SPAttER with an efficiency of 29.4%.
We inspect the reconstructed W mass using the assignments generated by both methods in Fig. 3, broken down into three categories: "correct," "incorrect," and "unmatched," corresponding to the cases in which: all three top decay products are correctly assigned, all three top decay products are present in the event but at least one is incorrectly assigned, and one or more of the top decay products is not identifiable, respectively. The $\chi^2$ has a narrower peak around $m_W$ than SPAttER, though much of this shape comes from the incorrect and unmatched events. This is explained by the presence of $m_W$ in Eq. (1), and demonstrates that SPAttER is utilizing more information than just the invariant mass peak. Figure 4 shows the $m_{\text{top}}$ distributions for the same three categories. SPAttER has the more peaked distribution, with the $\chi^2$ showing a much larger tail to high masses in the incorrect and unmatched events. The correctly reconstructed events are centred at the expected masses, and the fraction of events in this category is larger with SPA-NET than the $\chi^2$, as expected. Additional distributions used for cross-checks are available in Appendix A (Figs. 5-12).
In 11.2% of fully matched events, the network predicted the same jet to be part of both top quarks. SPAttER correctly predicts 25.0% of these events after the reselection described above, compared to only 12.7% for the $\chi^2$, indicating that these were generally difficult events to classify correctly. We also note that the average softmax value of the predicted jets in these events is only 36%, compared to 74% for all fully reconstructable events.
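The reselection step for such colliding predictions can be sketched as follows. This is an illustrative NumPy implementation with a hypothetical function name. It assumes that the branch with the higher peak probability is selected first and that every bin of the other distribution involving one of the selected jets is zeroed.

```python
import numpy as np

def select_triplets(p_left, p_right):
    """Pick one (q, q, b) triplet per branch without reusing any jet."""
    left_first = p_left.max() >= p_right.max()
    first, second = (p_left, p_right) if left_first else (p_right, p_left)
    best_first = np.unravel_index(first.argmax(), first.shape)
    masked = second.copy()
    for jet in best_first:
        # Zero every bin of the other distribution that uses this jet.
        masked[jet, :, :] = 0.0
        masked[:, jet, :] = 0.0
        masked[:, :, jet] = 0.0
    best_second = np.unravel_index(masked.argmax(), masked.shape)
    # Return in (left, right) order regardless of which was selected first.
    return (best_first, best_second) if left_first else (best_second, best_first)

p_left = np.zeros((6, 6, 6))
p_left[0, 1, 2], p_left[3, 4, 5] = 0.6, 0.4
p_right = np.zeros((6, 6, 6))
p_right[1, 3, 4], p_right[3, 4, 5], p_right[0, 0, 0] = 0.5, 0.3, 0.2
left, right = select_triplets(p_left, p_right)
```

In this toy example the right branch initially prefers a triplet containing jet 1, which the left branch has already claimed, so it falls back to its best noncontradictory triplet.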
A final important performance metric is the computation time per event. The time required to evaluate the $\chi^2$ scales approximately as $P(N, 6) = O(N^6)$ with the number of jets in the event, and this often leads to analyses setting a maximum number of jets to consider, degrading the performance for purely CPU-time reasons. On the author's laptop, a 2019 Dell XPS13 with an Intel Core i7-1065G7 1.30 GHz CPU, SPAttER took an average of 4.4 ms to evaluate per event, with no dependence on jet multiplicity. A further order of magnitude speed improvement is found when evaluating on a GPU. In contrast, the $\chi^2$ took an average of 20 ms in 6-jet events, 48 ms in 7-jet events, and 369 ms in ≥8-jet events.

VII. CONCLUSIONS
SPA-NETs have inherent permutation symmetries which make them very well suited to the task of jet assignment, where the permutations otherwise lead to an increase in computation time and dilute the scientific value of the data. Our network SPAttER demonstrates superior performance on this task. The adoption of our technique by the experimental collaborations ATLAS and CMS will lead to significantly improved precision of analyses in the all-jet $t\bar{t}$ final-state by improving the fraction of events that are well reconstructed from 41.2% using existing methods to 63.7% using our new technique. This paradigm shift will allow greatly enhanced sensitivity in high jet multiplicity events, making many new analyses viable in this final-state.
This paper describes just one of many possible applications of SPA-NETs to event reconstruction in HEP. Future work may include extending these techniques to alternative all-jet final-states, to other $t\bar{t}$ decay modes, or many other classes of problem. Though not studied here, the output of SPAttER may also be used in additional ways, such as setting a minimum reconstruction quality requirement that will act to suppress backgrounds, analogously to how the $\chi^2$ is used in [1]. Additional input information, such as jet substructure [43] or (pseudo-)continuous b-tagging [44], may also improve performance.
This letter contributes to a family of work which helps endow machine learning methods with problem-specific invariances. We have presented an efficient polynomial-time approach for tackling classification tasks where the targets must obey a set of permutation symmetries. Such symmetries underlie the mathematical foundation of the Standard Model, but they may be found in many other common classification tasks, such as graph matching [45] and hierarchical clustering algorithms. Well-trained deep neural networks can replace permutation-based algorithms, and avoid combinatorial explosion, by effectively estimating symmetry-aware pair-wise similarities. Symmetries can also be used to create smaller models that reduce the amount of data necessary for training [46]. Understanding and exploiting the invariances present in any modeling task is vital for effective learning.
Our code is available at [38].

APPENDIX B: SPAttER MATHEMATICAL FORMULATION
In this section, we provide a more detailed description of SPAttER's architecture which is necessary to recreate the model. Figure 2 provides a high-level graphical overview of the network. Here we describe the mathematics necessary to implement each of the sections in the figure. SPAttER has a very recursive architecture, so we define several terms to simplify this description. SPAttER consists of several large stacks, which are compositions of several identically structured blocks, each of which contains one or more layers, and each of which contains one or more parameters.

Independent embedding
The embedding stack consists of several embedding blocks, each of which progressively increases the latent dimensionality of the input jets up to the final dimensionality $D$. The embedding blocks are feed-forward, fully connected neural networks which are applied independently to each jet via weight sharing. Each block consists of a fully connected matrix multiplication layer $L$ with parameters $W \in \mathbb{R}^{D_o \times D_i}$ and $b \in \mathbb{R}^{D_o}$; a PReLU [47] nonlinearity with a parameter $a \in \mathbb{R}^{D_o}$; and a one-dimensional BatchNorm [48] layer with parameters $\mu, \sigma \in \mathbb{R}^{D_o}$. We denote a single embedding block with an output dimensionality $D_o$ by $E_{D_o}$. We stack several of these embedding blocks, with each block increasing the latent space dimensionality, starting from the original 5 jet features. SPAttER has a target latent space dimensionality of $D = 128$, so the $D_o$ values follow the sequence: 8 → 16 → 32 → 64 → 128. The full embedding stack can be described as the following composition:

$$E = E_{128} \circ E_{64} \circ E_{32} \circ E_{16} \circ E_{8} \tag{B2}$$
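A minimal NumPy sketch of such a stack follows. Several choices here are assumptions for illustration only: the three layers are applied in the order in which they are listed above (linear, PReLU, normalization), the normalization uses fixed stored statistics as it would at inference time, the parameters are randomly initialized, and all function names are hypothetical.

```python
import numpy as np

def embedding_block(x, w, b, a, mean, std):
    """Linear layer, PReLU, then normalization, applied to each jet (row) independently."""
    h = x @ w.T + b
    h = np.where(h > 0, h, a * h)   # PReLU with per-feature slope a
    return (h - mean) / std         # normalization with stored statistics

def init_stack(dims, rng):
    """Random parameters for consecutive blocks, e.g. dims = [5, 8, 16, ...]."""
    return [dict(w=rng.normal(scale=d_in ** -0.5, size=(d_out, d_in)),
                 b=np.zeros(d_out), a=np.full(d_out, 0.25),
                 mean=np.zeros(d_out), std=np.ones(d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def embed(x, stack):
    """Apply the blocks in sequence: 5 -> 8 -> 16 -> 32 -> 64 -> 128."""
    for params in stack:
        x = embedding_block(x, **params)
    return x

rng = np.random.default_rng(0)
stack = init_stack([5, 8, 16, 32, 64, 128], rng)
jets = rng.normal(size=(7, 5))      # 7 jets with 5 features each
print(embed(jets, stack).shape)
```

Because the same weights act on every row, reordering the jets simply reorders the embedded vectors, which is what allows the later attention layers to remain permutation symmetric.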

Transformer encoder
The encoder stack consists of a sequence of transformer encoder blocks as described by Vaswani et al. [30]. A single encoder block contains a multi-head attention layer $\mathrm{Attention}(Q, K, V)$ [30]; two LayerNorm operations [49]; two feed-forward layers $L_1$ and $L_2$; and a PReLU nonlinearity. In this paper, we provide a general overview of the encoder structure, but we omit a detailed description of the multi-head attention and layer normalization operations. Within a single transformer encoder block $T_i$, the feed-forward sublayer can be expressed as

$$A(x) = \mathrm{LayerNorm}(L_2(\mathrm{PReLU}(L_1(x))) + x) \tag{B3}$$

The complete transformer encoder stack, $T$, is simply a composition of $k$ encoder blocks, one after the other. We use $k = 6$ in SPAttER.

Branch encoders
After passing through the shared, central embedding and encoder stacks, SPAttER's signal path splits into two branches, one for each of the target top quarks. Figure 2 demonstrates this branch splitting. We name these two paths the left and right branch, although this distinction is arbitrary. Each of these branches contains an independent embedding and encoder stack. The branch embedding stacks share a near-identical structure to the initial independent embedding stack, except they preserve the latent dimensionality: $D_i = D_o$. The embedding blocks $E^L_i$ and $E^R_i$ are applied independently to each jet. The branch encoder stacks also share an identical structure to the central encoder stack. The left branch, $T^L$, and right branch, $T^R$, can be described as compositions of $j$ embedding blocks and $l$ transformer encoder blocks. We use $j = l = 4$ in SPAttER.

Tensor attention
We provide a detailed description of the tensor attention output layers in the main text with Eqs. (2)-(4). Here, we replicate the description in a concise manner.
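Each tensor attention layer contains a single parameter $\theta \in \mathbb{R}^{D \times D \times D}$. We produce an intermediate symmetric tensor $S \in \mathbb{R}^{D \times D \times D}$ whose indices obey the symmetry group of the top quark triplet. This is combined with the list of input vectors $X \in \mathbb{R}^{N \times D}$ into an output tensor $O \in \mathbb{R}^{N \times N \times N}$. Finally, this output tensor is passed through a softmax nonlinearity in order to produce valid three-way joint distributions $P_L$ and $P_R$. The complete equations for this layer are

$$S^{ijk} = \frac{1}{2}\left(\theta^{ijk} + \theta^{jik}\right) \tag{B7}$$

$$O^{ijk} = X^{i}_{n} X^{j}_{m} X^{k}_{l} S^{nml} \tag{B8}$$

$$P(i, j, k) = \frac{\exp(O^{ijk})}{\sum_{i', j', k'} \exp(O^{i'j'k'})} \tag{B9}$$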

Efficient tensor attention
Equation (B8) can be expressed as a cubic tensor form, which naively requires $O(N^3 D^3)$ operations. When actually computing this expression, we have to construct several very large intermediate tensors. Since we use optimized GPU linear algebra libraries in order to perform tensor operations, we have to split the evaluation into several, more fundamental, operations. We use opt_einsum [50] in order to generate an optimal set of operations for Eq. (B8). The summation can be expressed using the following intermediate operations, each of which is a generalized matrix multiplication (GEMM):

$$A^{mli} = X^{i}_{n} S^{nml}, \qquad B^{lij} = X^{j}_{m} A^{mli}, \qquad O^{ijk} = X^{k}_{l} B^{lij}$$

The intermediate tensors $A^{mli}$ and $B^{lij}$ have dimensionalities $(D \times D \times N)$ and $(D \times N \times N)$ respectively. Since $D > N$ in our situation, this operation requires a large amount of memory to store all of these intermediate components.
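The equivalence between the one-shot cubic contraction and a chain of pairwise contractions through intermediates of the shapes quoted above can be verified numerically. The sketch below uses `numpy.einsum` instead of opt_einsum, with random stand-ins for $X$ and $S$ and toy sizes chosen for illustration. The particular contraction order shown is an assumption consistent with the stated dimensionalities.

```python
import numpy as np

rng = np.random.default_rng(0)
n_jets, dim = 6, 8
x = rng.normal(size=(n_jets, dim))
s = rng.normal(size=(dim, dim, dim))

# One-shot contraction of the cubic form.
o_direct = np.einsum("in,jm,kl,nml->ijk", x, x, x, s)

# The same result as three successive pairwise contractions.
a = np.einsum("in,nml->mli", x, s)        # shape (D, D, N)
b = np.einsum("jm,mli->lij", x, a)        # shape (D, N, N)
o_chain = np.einsum("kl,lij->ijk", x, b)  # shape (N, N, N)

print(a.shape, b.shape, np.allclose(o_direct, o_chain))
```

Each pairwise step contracts a single index, so it can be dispatched as one matrix multiplication after reshaping, at the price of holding `a` and `b` in memory.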
Instead of storing the complete cubic weights tensor $\theta$ and explicitly finding the summation, we limit the possible $\theta$ weights that we can learn in order to greatly reduce the intermediate tensors. Instead of learning $\theta \in \mathbb{R}^{D \times D \times D}$, we decompose $\theta$ into three matrices $\theta_1, \theta_2, \theta_3 \in \mathbb{R}^{D \times D}$. Then we compute three intermediate vectors from $X$ by simply performing regular matrix multiplication: $X\theta_1$, $X\theta_2$, and $X\theta_3$. These can be easily implemented with fully connected neural network layers and computed in parallel. Finally, we can estimate our original output by performing a trivial three-form with these tensors,

$$O^{ijk} = (X\theta_1)^{i}_{n}\, (X\theta_2)^{j}_{m}\, (X\theta_3)^{k}_{l}\, 1^{nml}$$

where $1^{nml} = 1$ for all possible indices. The space complexity of this decomposition when including the intermediate tensors reduces to $O(N^3 + ND + D^2) \approx O(D^2)$, an improvement over the naive approach which has space complexity of $O(N^3 + ND^2 + D^3) \approx O(D^3)$. Additionally, when using opt_einsum to evaluate the three-form, the runtime is reduced from $O(N^3 D^3)$ to $O(N^3 D^2)$. We can replicate the symmetry constraint we impose in Eq. (B7) by simply requiring the first two decomposed matrices to be equal to each other.
This decomposition is not one-to-one, so not every θ can be represented in this form. However, in our experiments, we found no drop in predictive performance when using this decomposition.

Batching
Since most deep neural network implementations use efficient batch matrix multiplication routines in order to speed up computation, it is beneficial to feed batches of several events into the network simultaneously. However, events can contain a varying number of jets: a single event can have a length of anywhere from 6 to 20 momentum vectors. In order to process all of these events together in batches, we pad all events to the maximum size of 20 jets by appending 0-vectors, creating a batched input tensor $X \in \mathbb{R}^{B \times 20 \times 5}$ where $B$ is the batch size. We also construct a secondary masking input $M \in \{0, 1\}^{B \times 20}$. This vector indicates if the given jet in a given event is a real jet or a padding jet. This vector is employed during multi-head attention in order to prevent the attention weights from including the masked jets [30]. The masking vector is also employed during softmax calculation to prevent the masked vectors from skewing the output distributions. When applied to a batch output tensor $O \in \mathbb{R}^{B \times 20 \times 20 \times 20}$, the softmax calculation can be described as:

$$P_b(i, j, k) = \frac{M_{bi} M_{bj} M_{bk} \exp(O_b^{ijk})}{\sum_{i', j', k'} M_{bi'} M_{bj'} M_{bk'} \exp(O_b^{i'j'k'})}$$
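The padding and masking steps can be sketched in NumPy as follows. The function names are hypothetical, and the masked softmax is implemented by the common device of setting padded bins to $-\infty$ before exponentiation, which gives them exactly zero probability. The actual implementation may realize the masking differently.

```python
import numpy as np

MAX_JETS = 20

def pad_events(events, n_features=5):
    """events: list of (n_i, n_features) arrays. Returns padded X and boolean mask M."""
    x = np.zeros((len(events), MAX_JETS, n_features))
    mask = np.zeros((len(events), MAX_JETS), dtype=bool)
    for b, jets in enumerate(events):
        x[b, :len(jets)] = jets
        mask[b, :len(jets)] = True
    return x, mask

def masked_softmax(o, mask):
    """o: (B, N, N, N) scores; mask: (B, N). Bins touching a padding jet get zero."""
    m3 = mask[:, :, None, None] & mask[:, None, :, None] & mask[:, None, None, :]
    o = np.where(m3, o, -np.inf)
    o = o - o.max(axis=(1, 2, 3), keepdims=True)   # numerical stability
    e = np.exp(o)
    return e / e.sum(axis=(1, 2, 3), keepdims=True)

rng = np.random.default_rng(0)
events = [rng.normal(size=(6, 5)), rng.normal(size=(9, 5))]
x, mask = pad_events(events)
p = masked_softmax(rng.normal(size=(2, MAX_JETS, MAX_JETS, MAX_JETS)), mask)
```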

Hyperparameters
We select optimal hyperparameters for SPAttER by running a parallel Gaussian process hyperparameter search using the SHERPA hyperparameter optimization library [51]. We evaluate 500 different sets of hyperparameters using a subset of the original training dataset, training on only the first 50% of events. We also sample the last 5% of the training dataset to act as the validation dataset for the hyperparameter search. We evaluate each model by computing the event purity, $\epsilon^{\text{top}}_2$, on this validation dataset. The first 100 sets of hyperparameters are randomly sampled from a uniform distribution. Afterwards, we use a Gaussian process optimizer in order to suggest future hyperparameters which may improve the purity. A complete listing of the final hyperparameters for this network can be found in Table II.