The Metric Space of Collider Events

When are two collider events similar? Despite the simplicity and generality of this question, there is no established notion of the distance between two events. To address this question, we develop a metric for the space of collider events based on the earth mover's distance: the "work" required to rearrange the radiation pattern of one event into another. We expose interesting connections between this metric and the structure of infrared- and collinear-safe observables, providing a novel technique to quantify event modifications due to hadronization, pileup, and detector effects. We showcase how this metrization unlocks powerful new tools for analyzing and visualizing collider data without relying upon a choice of observables. More broadly, this framework paves the way for data-driven collider phenomenology without specialized observables or machine learning models.

High-energy particle collisions produce a tremendous number of intricately correlated particles, especially when energetic quarks and gluons are involved. Behind this apparent complexity, however, the overall flow of energy in an event is a robust memory of its simpler partonic origins [1][2][3][4][5][6][7][8]. Surprisingly, no definition of the similarity between events presently exists that sharply captures this correspondence. In the absence of a metric, efforts typically fall back upon ad hoc methods such as comparing specific observables [9][10][11][12][13] or matching the pixels of calorimeter images [13][14][15][16][17]. These approaches suffer from significant pathologies: disparate event topologies can give rise to identical observable values, while pixels lack stability under small perturbations. A theoretically and experimentally robust definition of the "distance" between events would profoundly expand our ability to explore the structure of collider data and unlock entirely new ways to probe events.
In this letter, we advocate for the earth (or energy) mover's distance (EMD) [18][19][20][21][22] as a metric for the space of collider events. We propose a variant of the EMD, inspired by Refs. [21,22], that allows events with different total energies to be sensibly compared. The EMD is the minimum "work" required to rearrange one event $\mathcal{E}$ into the other $\mathcal{E}'$ by movements of energy $f_{ij}$ from particle $i$ in one event to particle $j$ in the other:

\[
\mathrm{EMD}(\mathcal{E}, \mathcal{E}') = \min_{\{f_{ij}\}} \sum_{ij} f_{ij}\, \frac{\theta_{ij}}{R} + \Big| \sum_i E_i - \sum_j E'_j \Big|, \tag{1}
\]
\[
f_{ij} \ge 0, \qquad \sum_j f_{ij} \le E_i, \qquad \sum_i f_{ij} \le E'_j, \qquad \sum_{ij} f_{ij} = E_{\min},
\]

where $i$ and $j$ index particles in events $\mathcal{E}$ and $\mathcal{E}'$, respectively, $E_i$ is the particle energy, $\theta_{ij}$ is an angular distance between particles, and $E_{\min} = \min(\sum_i E_i, \sum_j E'_j)$ is the smaller of the two total energies. $R$ is a parameter that controls the relative importance of the two terms. While energies and angles are used here for clarity, we will use transverse momenta $p_T$ and rapidity-azimuth $(y, \phi)$ distances for our applications relevant for the Large Hadron Collider (LHC).

FIG. 1. The optimal movement to rearrange one top jet (red) into another (blue). Particles are shown as points in the rapidity-azimuth plane with areas proportional to their transverse momenta. Darker lines indicate more transverse momentum movement. The earth mover's distance in Eq. (1) is the total "work" required to perform this rearrangement.
The EMD that we propose in Eq. (1) has dimensions of energy, where the first term quantifies the difference between the two radiation patterns and the second term accounts for the creation or destruction of energy. It is a true metric (satisfying the triangle inequality) as long as $\theta_{ij}$ is a metric and $R \ge \frac{1}{2}\theta_{\max}$, where $\theta_{\max}$ is the maximum attainable angular distance between particles. For instance, $R$ must be at least the jet radius for conical jets. Formally, the EMD metrizes the energy flow, as it treats events differing only by soft particles or collinear splittings identically. This hints at a deep connection to infrared and collinear (IRC) safety of observables [23][24][25][26], which we explore further below.

A metric for comparing events is particularly relevant for probing the substructure of jets [27][28][29][30][31][32][33][34][35][36][37], collimated sprays of particles resulting from the fragmentation and hadronization of high-energy quarks and gluons via quantum chromodynamics (QCD). Here, we will consider three classes of jets which have different intrinsic topologies: three-pronged boosted top quark jets, two-pronged boosted $W$ boson jets, and single-pronged QCD (quark or gluon) jets.

We generate proton-proton collision events at the LHC with Pythia 8.235 [38] at $\sqrt{s} = 14$ TeV including hadronization and multiple particle interactions. Anti-$k_T$ jets [39] with a jet radius of 1.0 are clustered using FastJet 3.3.1 [40], and up to two jets with $p_T \in [500, 550]$ GeV and $|y| < 1.7$ are kept. Jets are boosted and rotated to center the jet four-momentum and vertically align the principal component of the constituent transverse momentum flow in the rapidity-azimuth plane. We record the final-state hadrons, as well as the partons (before hadronization) and the hard $W$/top decay products, that are within a jet radius of the jet four-momentum. We use the Python Optimal Transport [41] library to compute EMDs with the minimal choice of $R = 1.0$, the jet radius.
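To make the definition concrete, the following is a minimal sketch of the optimization in Eq. (1): minimize the transport cost over non-negative flows that move exactly the smaller total energy and never exceed any particle's energy, then add the total energy difference. It is not the computation used in this letter, which relies on the Python Optimal Transport library; the sketch swaps in SciPy's general-purpose `linprog` solver so that the constraints are visible. The function name `emd`, its arguments, and the use of a plain Euclidean distance in the rapidity-azimuth plane (no azimuthal wrap-around, which presumes centered jets) are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linprog


def emd(pts_a, e_a, pts_b, e_b, R=1.0):
    """Sketch of Eq. (1) for two events given as (y, phi) points with weights."""
    pts_a, pts_b = np.atleast_2d(pts_a), np.atleast_2d(pts_b)
    e_a, e_b = np.asarray(e_a, float), np.asarray(e_b, float)
    n, m = len(e_a), len(e_b)

    # Pairwise ground distances theta_ij (assumption: Euclidean in (y, phi)).
    theta = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=-1)

    # Flows f_ij are flattened row-major: variable index = i * m + j.
    cost = (theta / R).ravel()

    # Inequalities: sum_j f_ij <= E_i and sum_i f_ij <= E'_j.
    row_sums = np.kron(np.eye(n), np.ones((1, m)))
    col_sums = np.kron(np.ones((1, n)), np.eye(m))
    a_ub = np.vstack([row_sums, col_sums])
    b_ub = np.concatenate([e_a, e_b])

    # Equality: the total moved energy is the smaller total energy E_min.
    a_eq = np.ones((1, n * m))
    b_eq = [min(e_a.sum(), e_b.sum())]

    res = linprog(cost, A_ub=a_ub, b_ub=b_ub, A_eq=a_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)

    # Second term of Eq. (1): penalty for creating or destroying energy.
    return res.fun + abs(e_a.sum() - e_b.sum())
```

For two single-particle events, one with 100 GeV at the origin and one with 80 GeV at $(0.3, 0.4)$, the only allowed flow moves 80 GeV over a distance 0.5, so the sketch should return $80 \times 0.5 + 20 = 60$ GeV. A dense linear program scales poorly with particle multiplicity, which is one reason a dedicated optimal transport solver is the more practical choice.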
The energy difference penalty in Eq. (1) is implemented using a fictitious particle at a distance $R$ from all other particles. Fig. 1 shows the optimal energy movement between two example top jets.

We begin by highlighting a remarkable mathematical property of the EMD which provides a quantitative understanding of an observable's sensitivity to the radiation pattern. Specifically, we relate the EMD to additive IRC-safe observables via the Kantorovich-Rubinstein [42] duality theorem. Applying this theorem to our variant of the EMD, we derive the following mathematical bound between two events $\mathcal{E}$ and $\mathcal{E}'$:

\[
\frac{1}{RL} \Big| \sum_i E_i\, \Phi(\hat{p}_i) - \sum_j E'_j\, \Phi(\hat{p}'_j) \Big| \le \mathrm{EMD}(\mathcal{E}, \mathcal{E}'), \tag{2}
\]

where $i$, $j$ index $\mathcal{E}$, $\mathcal{E}'$, respectively, $\hat{p}_i$ is the particle angular position, and $\Phi$ is any $L$-Lipschitz function (essentially, with gradient size bounded by $L$) which vanishes at the center of the space (e.g. the jet axis). The implications of Eq. (2) are simple yet profound: the similarity of events according to the EMD metric guarantees the closeness of their observable values in a precise way that depends on $\Phi$. By formulating IRC-safe observables in the language of additive energy-weighted structures [43,44], Eq. (2) can be applied to provide a robust bound.
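The bound can be checked numerically in a simplified setting. The sketch below is an illustration under stated assumptions, not part of the original analysis: each toy event has the same number of particles with identical energies, so the two total energies agree, the energy-difference term vanishes, and the optimal transport reduces to an assignment problem solved with SciPy's `linear_sum_assignment`. The observable is $\Phi(\hat{p}) = |\hat{p}|^\beta$, which vanishes at the center and is Lipschitz with $L = \beta R^{\beta-1}$ on a disk of radius $R$ for $\beta \ge 1$. All names (`toy_event`, `emd_equal_weights`, `angularity`) are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
R, beta, n = 1.0, 2.0, 50


def toy_event():
    """n unit-energy particles spread uniformly over a disk of radius R."""
    r = R * np.sqrt(rng.uniform(size=n))
    ang = rng.uniform(0.0, 2.0 * np.pi, size=n)
    return np.column_stack([r * np.cos(ang), r * np.sin(ang)])


def emd_equal_weights(a, b):
    """EMD when both events have n unit-energy particles (assignment problem)."""
    theta = np.linalg.norm(a[:, None] - b[None, :], axis=-1)
    i, j = linear_sum_assignment(theta)
    return theta[i, j].sum() / R


def angularity(pts):
    """Additive observable: sum_i E_i * |p_i|^beta with E_i = 1."""
    return (np.linalg.norm(pts, axis=1) ** beta).sum()


L = beta * R ** (beta - 1)  # Lipschitz constant of |p|^beta on the disk
checks = []
for _ in range(200):
    a, b = toy_event(), toy_event()
    lhs = abs(angularity(a) - angularity(b)) / (R * L)
    checks.append(lhs <= emd_equal_weights(a, b) + 1e-9)
bound_holds = all(checks)
```

In this setting the inequality follows pair by pair from the Lipschitz condition along the optimal assignment, so `bound_holds` is expected to be true for every sampled pair of events.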
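As a concrete example, we demonstrate how the EMD bounds hadronization modifications of jet angularities [45] (see also Refs. [46][47][48][49]), $\lambda^{(\beta)} = \sum_i p_{T,i}\, \theta_i^\beta$, where $\theta_i$ is the rapidity-azimuth distance to the jet axis. These angularities are evidently of the form in Eq. (2) with $\Phi(\theta) = \theta^\beta$, which is $\beta$-Lipschitz for $\beta \ge 1$ and angles up to $R = 1$, so that

\[
\big| \lambda^{(\beta)}(\mathcal{E}) - \lambda^{(\beta)}(\mathcal{E}') \big| \le \beta\, \mathrm{EMD}(\mathcal{E}, \mathcal{E}'). \tag{3}
\]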
The EMD between two events yields a robust upper bound on the difference in their $\beta \ge 1$ angularity values. This bound is borne out in Fig. 2, where the angularity differences and EMDs are computed for the same QCD jets before and after hadronization. For this jet $p_T$ range, hadronization modifies events by EMD $\lesssim 30$ GeV and correspondingly modifies $\lambda^{(\beta=1)}$ by no more than this amount. The intuitive picture of parton-hadron duality [5], that the energy flow in an event is robust to nonperturbative effects, is quantified by considering the EMD that these nonperturbative effects can induce.

A metric space is also useful for classification without requiring specially-designed observables or parametrized machine learning algorithms. One of the simplest examples of a non-parametric classifier is the k-nearest neighbor (kNN) algorithm [50], whereby a given event's closest $k$ neighbors in a reference set are used to determine class membership. We build a kNN classifier applied to the problem of discriminating $W$ jets from QCD jets using a balanced training sample of 100k total jets. The classifier output is the number of $W$ jets among the $k = 32$ nearest neighbors by EMD. This method should approach the optimal IRC-safe classifier with a sufficiently large dataset. The performance of the resulting EMD kNN classifier is shown in Fig. 3 alongside that of $N$-subjettiness observables [51,52] designed to identify two-prong substructure.
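The classifier output described above needs only a table of distances from each query event to each reference event. The following sketch shows that counting step with NumPy; it is a generic illustration rather than the code used for Fig. 3. The function name `knn_scores` is hypothetical, and the toy distance matrix is built from one-dimensional stand-in points instead of EMDs between jets.

```python
import numpy as np


def knn_scores(dists, ref_labels, k=32):
    """Count label-1 references among the k nearest references of each query.

    dists[q, r] is the distance from query q to reference r
    (in the letter's setting this would be an EMD).
    """
    ref_labels = np.asarray(ref_labels)
    # Indices of the k smallest distances per row (order within the k is arbitrary).
    nearest = np.argpartition(dists, k - 1, axis=1)[:, :k]
    return ref_labels[nearest].sum(axis=1)


# Toy stand-in data (assumption): two classes of 1D points, distance = |x - y|.
rng = np.random.default_rng(0)
ref_x = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(4.0, 1.0, 500)])
ref_y = np.concatenate([np.zeros(500, int), np.ones(500, int)])
qry_x = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(4.0, 1.0, 100)])
qry_y = np.concatenate([np.zeros(100, int), np.ones(100, int)])

toy_dists = np.abs(qry_x[:, None] - ref_x[None, :])
scores = knn_scores(toy_dists, ref_y, k=32)
accuracy = np.mean((scores > 16) == qry_y)
```

Scanning a threshold on `scores` from 0 to $k$ traces out a receiver operating characteristic curve, which is how a count-valued output of this kind can be compared with a continuous observable.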
It is worth noting that while searching through a large reference set of events to find neighbors naively requires every possible pairwise comparison, in a metric space the triangle inequality can provide a great deal of simplification. Specialized data structures known as metric trees [53][54][55][56] have been developed to achieve query times that are approximately logarithmic in the size of the dataset. While we use direct searches throughout this letter, this is not a fundamental limitation and we leave metric tree query optimizations to future work.
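As a rough illustration of how the triangle inequality reduces the number of distance evaluations, the sketch below uses a single pivot rather than a full metric tree. For any pivot $p$, the inequality gives $d(q, x) \ge |d(q, p) - d(x, p)|$, so candidates whose lower bound already exceeds the best distance found can be skipped without computing their distance to the query. This is a simplified stand-in for the data structures of Refs. [53][54][55][56]; the function name is hypothetical and a Euclidean metric replaces the EMD in the toy data.

```python
import numpy as np


def nearest_with_pivot(query, refs, pivot_dists, pivot, metric):
    """Exact nearest neighbor search that prunes with one pivot.

    pivot_dists[r] must hold metric(refs[r], pivot), computed once in advance.
    Returns (index, distance, number of metric evaluations against refs).
    """
    d_qp = metric(query, pivot)
    lower = np.abs(d_qp - pivot_dists)      # lower bounds on metric(query, refs[r])
    best_idx, best_d, n_evals = -1, np.inf, 0
    for idx in np.argsort(lower):           # most promising candidates first
        if lower[idx] >= best_d:            # every remaining bound is at least this large
            break
        d = metric(query, refs[idx])
        n_evals += 1
        if d < best_d:
            best_idx, best_d = idx, d
    return best_idx, best_d, n_evals


# Toy stand-in (assumption): Euclidean distance between 2D points.
def euclid(a, b):
    return float(np.linalg.norm(a - b))


rng = np.random.default_rng(0)
refs = rng.normal(size=(2000, 2))
pivot = refs[0]
pivot_dists = np.linalg.norm(refs - pivot, axis=1)
query = rng.normal(size=2)

idx, dist, n_evals = nearest_with_pivot(query, refs, pivot_dists, pivot, euclid)
brute = np.linalg.norm(refs - query, axis=1).min()
```

The result agrees with a brute-force scan while `n_evals` counts how many expensive distance computations were actually needed. Metric trees apply the same pruning recursively with many pivots.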
Once a space has been equipped with a metric, it is natural to ask about the structure of the induced manifold. The most basic aspect of the manifold underlying the data is its dimension, and several notions of its intrinsic dimension exist [57]. The correlation dimension [58,59], a type of fractal dimension, is suitable for our purposes and is defined using only pairwise distances:

\[
\dim(Q) = Q\, \frac{\partial}{\partial Q} \ln \sum_{k=1}^{N} \sum_{\ell=1}^{N} \Theta\big[\mathrm{EMD}(\mathcal{E}_k, \mathcal{E}_\ell) < Q\big], \tag{4}
\]

where $N$ is the total number of events and the summand indicates whether event $k$ is within EMD $Q$ of event $\ell$.
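Since the correlation dimension depends only on pairwise distances, it can be estimated from a distance matrix by counting pairs closer than $Q$ and differentiating the logarithm of that count with respect to $\ln Q$. The sketch below is one possible finite-difference estimate, not the procedure behind Fig. 4. It excludes self-pairs ($k = \ell$), which is a choice made here to avoid a constant offset at small $Q$, and it uses uniformly distributed points in a unit square with Euclidean distances as a stand-in for jets and EMDs, so the expected answer is slightly below 2 because of edge effects.

```python
import numpy as np


def correlation_dimension(dists, q_values):
    """Estimate d ln(count) / d ln(Q) from a symmetric (N, N) distance matrix.

    count(Q) is the number of ordered pairs k != l with distance below Q.
    q_values must be increasing and large enough that count(Q) > 0.
    """
    upper = np.triu_indices_from(dists, k=1)
    pair_d = np.sort(dists[upper])
    # searchsorted with side="left" counts entries strictly below each Q.
    counts = 2 * np.searchsorted(pair_d, q_values, side="left")
    return np.gradient(np.log(counts), np.log(q_values))


# Toy stand-in (assumption): 1000 uniform points in the unit square.
rng = np.random.default_rng(1)
pts = rng.uniform(size=(1000, 2))
toy_dists = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
q_values = np.geomspace(0.05, 0.2, 10)
dims = correlation_dimension(toy_dists, q_values)
```

Evaluating the estimate on a logarithmic grid of scales mirrors the scale dependence of the definition: each entry of `dims` is the local slope of the pair count at the corresponding $Q$.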
The correlation dimension is an intrinsically scale-dependent quantity, which is particularly useful as we anticipate different physical effects to dominate jets at different scales. Shown in Fig. 4 is the intrinsic dimension of our top, $W$, and QCD samples over energy scales $Q$ ranging from 10 GeV to 1000 GeV obtained from Eq. (4) with 25k jets. At high energy scales $Q$, the EMD is governed by the hard decay kinematics, resulting in a relatively simple manifold with low intrinsic dimension. At energy scales $Q$ approaching the fragmentation and hadronization scales, the structure of the events becomes increasingly complex and the dimension correspondingly increases. It is satisfying that the dimension is relatively low for a wide range of relevant energies, which is critical for a variety of metric-based techniques such as classification and low-dimensional visualization to work effectively with a realistic amount of data.

Beyond probing its dimension, the entire space of jets can be visualized using techniques such as t-Distributed Stochastic Neighbor Embedding (t-SNE) [60][61][62][63], which finds a low-dimensional embedding of the data that attempts to respect the distances between points. Fig. 5 shows a t-SNE embedding of 5k $W$ jets with $p_T \in [500, 510]$ GeV into a two-dimensional manifold using scikit-learn [64]. The narrower $p_T$ range focuses the EMD on the jet substructure and was found to yield sharper visualizations, with other choices also yielding sensible results. The $W$ jets populate a circular subspace roughly corresponding to the energy sharing of the two prongs. As the $W$ jet originates from a resonant decay, the two decay quarks (after rotation) are solely described by their energy sharing, which satisfyingly emerges from the manifold of $W$ jets. Moreover, the center of the ring, distant from the annulus, tends to contain the most complex jet topologies, resulting in a type of automatic anomaly detection.
Finally, we illustrate the use of EMD for a new kind of visualization strategy that clusters events to better understand observable distributions. To describe a given set of events, such as those in a histogram bin, we find the k events (called medoids) which best describe the set in that the sum of distances of each event to its closest medoid is minimized. This procedure works for any observable and provides an immediate glimpse of the types of event topologies that correspond to a given observable value. We use an iterative approximation of k-medoids from the pyclustering Python package [65].
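The medoid search can be sketched directly on a precomputed distance matrix. The code below is a generic alternating k-medoids iteration written for illustration; it is not the pyclustering implementation used in this letter and makes no claim about that package's interface. It alternates between assigning each event to its closest medoid and replacing each medoid by the cluster member with the smallest summed distance to the rest of its cluster, which finds a local rather than global minimum. The function name `k_medoids` and the toy data (2D points with Euclidean distances in place of jets and EMDs) are assumptions.

```python
import numpy as np


def k_medoids(dists, k, n_iter=100, seed=0):
    """Alternating k-medoids on a precomputed (N, N) distance matrix.

    Returns (indices of the k medoids, cluster index of every point).
    """
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(dists), size=k, replace=False)
    for _ in range(n_iter):
        # Step 1: assign every point to its closest current medoid.
        assign = np.argmin(dists[:, medoids], axis=1)
        # Step 2: within each cluster, pick the member minimizing summed distance.
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(assign == c)
            if len(members) == 0:
                continue
            within = dists[np.ix_(members, members)].sum(axis=0)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break  # converged: medoids no longer change
        medoids = new_medoids
    assign = np.argmin(dists[:, medoids], axis=1)
    return medoids, assign


# Toy stand-in (assumption): three Gaussian blobs in 2D, Euclidean distances.
rng = np.random.default_rng(2)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
pts = np.concatenate([c + 0.5 * rng.normal(size=(60, 2)) for c in centers])
toy_dists = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)

medoids, assign = k_medoids(toy_dists, k=3)
total_cost = toy_dists[np.arange(len(pts)), medoids[assign]].sum()
```

Because the medoids are actual members of the dataset, each one can be displayed as a representative event, which is what makes the approach suited to annotating histogram bins. Running the iteration from several seeds and keeping the lowest `total_cost` is a common way to reduce sensitivity to the random start.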
As an illustration, Fig. 6 shows the jet mass for QCD jets with k = 3 medoids per bin, providing a snapshot of the different event topologies at different masses.
In conclusion, we have equipped the space of events with a metric, thereby allowing a powerful suite of new tools and techniques to be directly applied to collider physics. There are many potential applications of the EMD at colliders beyond those presented here. Pileup mitigation or detector reconstruction could use the EMD to benchmark performance and thus benefit from the quantitative bounds on IRC-safe observable modifications. Further, machine learning models could be trained to optimize the EMD, related to recent efforts in generative modeling [66][67][68][69]. By counting neighbors, one could also perform density estimation in the space of events [70]. While we have focused on jet substructure, analogous studies could be carried out at the event level, which may require working with composite objects such as jets for realistic computation times. It would be interesting to explore an EMD strategy for unfolding by matching detector-level and simulated events. One might consider alternatives to the EMD, such as symmetry-projected metrics [22] or $p$-Wasserstein metrics [71,72] beyond our $p = 1$ case, though our conclusions should hold for any physically-sensible metric. Further, using the EMD for model-independent anomaly detection [73][74][75][76][77][78][79] by finding isolated or clustered event topologies could empower searches for physics beyond the Standard Model at the LHC.